Interpretable Machine Learning

Powerful machine learning algorithms such as deep neural networks (DNNs) can now harvest extremely large amounts of training data and thus achieve record performances in many research fields. At the same time, DNNs are generally regarded as black-box methods, because it is difficult to understand, intuitively and quantitatively, the result of their inference, i.e., for an individual novel input data point, what made the trained DNN model arrive at a particular response. This is a major drawback for applications where interpretability of the results is an essential prerequisite (e.g., the medical or security domain).

We developed a principled approach to decompose a classification decision of a DNN (or any other complex machine learning method) into pixel-wise relevance scores indicating the contribution of each pixel to the overall classification score. The approach is derived from layer-wise conservation principles and leverages the structure of the neural network. Due to its generality, our approach can be applied to a variety of tasks and network architectures.
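The idea of a layer-wise conservative decomposition can be sketched in a few lines of NumPy. The snippet below is a minimal illustration on a hypothetical two-layer ReLU network with random weights, not the implementation used in the papers: the total relevance placed on the chosen class score is redistributed backwards, layer by layer, in proportion to each input's contribution to the pre-activation, so that (up to a small stabilizing term) the sum of relevance is conserved from the output down to the input pixels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy network: 4 input "pixels" -> 6 hidden ReLU units -> 3 classes.
W1 = rng.normal(size=(4, 6))
W2 = rng.normal(size=(6, 3))

x = rng.normal(size=4)          # one input data point
a1 = np.maximum(0, x @ W1)      # hidden ReLU activations
z2 = a1 @ W2                    # class scores
c = int(np.argmax(z2))          # explain the winning class

def relevance_backward(a, W, R, eps=1e-6):
    """Redistribute relevance R from a layer's outputs to its inputs,
    proportionally to each input's contribution a_i * W_ij to the
    pre-activation z_j (an epsilon-stabilized conservative rule)."""
    z = a @ W
    z = z + eps * np.where(z >= 0, 1.0, -1.0)  # avoid division by ~0
    s = R / z                                  # per-output scaling factor
    return a * (s @ W.T)                       # relevance of each input

# All relevance starts on the selected class score ...
R2 = np.zeros_like(z2)
R2[c] = z2[c]

# ... and is propagated back layer by layer to the input pixels.
R1 = relevance_backward(a1, W2, R2)
R0 = relevance_backward(x, W1, R1)

# Up to the epsilon stabilizer, total relevance is conserved per layer.
print(R0.sum(), R1.sum(), R2.sum())
```

The printed sums agree to within the stabilizer's tolerance, which is the layer-wise conservation property the text refers to: the resulting vector `R0` is a heatmap over input pixels whose entries sum to the classification score being explained.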


  1. S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek, "On Pixel-wise Explanations for Non-Linear Classifier Decisions by Layer-wise Relevance Propagation", PLOS ONE, vol. 10, no. 7, pp. e0130140, July 2015.
  2. W. Samek, A. Binder, G. Montavon, S. Bach, and K.-R. Müller, "Evaluating the visualization of what a Deep Neural Network has learned", arXiv:1509.06321, September 2015.
  3. G. Montavon, S. Bach, A. Binder, W. Samek, and K.-R. Müller, "Explaining Nonlinear Classification Decisions with Deep Taylor Decomposition", arXiv:1512.02479, December 2015.
  4. S. Bach, A. Binder, G. Montavon, K.-R. Müller, and W. Samek, "Analyzing Classifiers: Fisher Vectors and Deep Neural Networks", arXiv:1512.00172, December 2015.