AC295: Advanced Practical Data Science

Lecture 8: Overview of Visualization for Deep Learning

Harvard University
Spring 2020
Instructors: Pavlos Protopapas
TF: Michael Emanuel, Andrea Porelli and Giulia Zerbini
Authors: Weiwei Pan and Pavlos Protopapas


Outline

  1. What is this lecture about?
  2. A Motivating Real-world Example
  3. Taxonomy of Deep Learning Visualization Literature
  4. Hands-on Exercise: Saliency Maps for Model Diagnostics

What is This Lecture About?

  • What: Today we focus on visualizations that probe the properties of machine learning models built from data.

  • Who: This introductory lecture will give a brief guided tour of the literature. We will define what a neural network model is and give motivations for model visualization.


  • How: Notes for this lecture are available online and are best used as a conceptual foundation, as well as a starting point for exploring this body of literature.

A Motivating Real-world Example

Predicting Positive Outcomes for IVF Patients

A Simple Model: Logistic Regression

We build a simple predictive model: logistic regression. We model the probability of a patient $\mathbf{x} \in \mathbb{R}^{\text{input}}$ having a positive outcome (encoded by $y=1$) as a function of its distance from a hyperplane, parametrized by $\mathbf{w}$, that separates the outcome groups in the input space.

That is, we model $p(y=1 | \mathbf{w}, \mathbf{x}) = \mathrm{sigmoid}(\mathbf{w}^\top \mathbf{x})$, where $g(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} = 0$ is the equation of the decision boundary.

Visualizing the Top Predictors

We sort the weights of the logistic regression model

$$p(y=1 | \mathbf{w}, \mathbf{x}) = \mathrm{sigmoid}(\mathbf{w}^\top \mathbf{x})$$

and visualize the weights alongside their corresponding data attributes. Why is this a good idea?
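For concreteness, here is a minimal sketch of this visualization, assuming a fitted sklearn LogisticRegression named `logistic` (the model used in the notebook cells below) and a hypothetical list `feature_names` of column names:

import numpy as np
import matplotlib.pyplot as plt

weights = logistic.coef_.flatten()          # one weight per input attribute
order = np.argsort(np.abs(weights))[::-1]   # largest-magnitude weights first

top = order[:10]
plt.barh([feature_names[i] for i in top], weights[top])
plt.xlabel('Logistic regression weight')
plt.title('Top 10 Predictors')
plt.gca().invert_yaxis()                    # largest weight on top
plt.show()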

Interpreting the Top Predictors

We visualize the distribution of the predictor E2_max for both the positive and negative outcome groups. What can we conclude about this predictor?

Interpreting the Top Predictors

We visualize the distribution of the predictor Trigger_Med_nan for both the positive and negative outcome groups. What can we conclude about this predictor?
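A minimal sketch of such a plot, assuming the cohort is in a pandas DataFrame `df` with an outcome column `y` (both names are hypothetical; swap in Trigger_Med_nan or any other predictor for E2_max):

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 4))
# overlay normalized histograms of the predictor for each outcome group
ax.hist(df.loc[df['y'] == 1, 'E2_max'], bins=30, alpha=0.5, density=True, label='Positive outcome')
ax.hist(df.loc[df['y'] == 0, 'E2_max'], bins=30, alpha=0.5, density=True, label='Negative outcome')
ax.set_xlabel('E2_max')
ax.set_ylabel('Density')
ax.legend()
plt.show()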

Lessons for Visualization

Choosing/designing machine learning visualizations requires that we think about:

  1. Why and for whom to visualize: for example

    • are we visualizing to diagnose problems with our models?
    • are we visualizing to interpret our models in a clinically meaningful way?

  2. What and how to visualize: for example

    • do we visualize decision boundaries, the weights of our model, and/or distributional differences in the data?

    Note: what is possible to visualize depends very much on the internal mechanics of our model and the data!

What If the Decision Boundary is Not Linear?

How would you parametrize an elliptical decision boundary?

We can say that the decision boundary is given by a quadratic function of the input:

$$ g(\mathbf{x}) = w_1x^2_1 + w_2x^2_2 + w_3 = 0 $$

We can fit such a decision boundary using logistic regression with degree 2 polynomial features.
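A minimal sketch with sklearn (assuming the toy data `X_train`, `Y_train` used in the cells below):

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# degree-2 features add the quadratic terms x1^2, x2^2, x1*x2, so a linear
# decision boundary in feature space is an ellipse in the input space
quadratic_logistic = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression())
quadratic_logistic.fit(X_train, Y_train)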

How would you parametrize an arbitrarily complex decision boundary?

It's not easy to think of a function $g(\mathbf{x})$ that can capture this decision boundary. Moreover, assuming an exact form for $g$ is restrictive.

GOAL: Build increasingly good approximations, $\widehat{g}$, of the "true decision boundary" $g$ by composing simple functions. This $\widehat{g}$ is essentially a neural network.

What is a Neural Network?

Goal: build a complex function $\widehat{g}$ by composing simple functions.

For example, let the following picture represent the approximation $\widehat{g}(\mathbf{x}) = f\left(\sum_{i}w_ix_i\right)$, where $f$ is a non-linear transform:

Neural Networks as Function Approximators

Then we can define the function $\widehat{g}$ with a graphical schema representing a complex series of compositions and sums of the form $f\left(\sum_{i}w_ix_i\right)$.

This is a neural network. We denote the weights of the neural network collectively by $\mathbf{W}$. The non-linear function $f$ is called the activation function. Nodes in this diagram that do not represent input or output are called hidden nodes. This diagram, together with the choice of activation function $f$, defines $\widehat{g}$ and is called the architecture.
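For example, a minimal Keras sketch of such an architecture (layer sizes are illustrative, not necessarily the ones used later in this lecture):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# each hidden node computes f(sum_i w_i x_i) with f = ReLU;
# stacking layers composes these simple functions into ĝ
model = Sequential([
    Dense(10, activation='relu', input_shape=(2,)),  # hidden layer 1
    Dense(2, activation='relu'),                     # hidden layer 2 (2 units, as visualized later)
    Dense(1, activation='sigmoid')                   # output: p(y=1 | W, x)
])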

Training Neural Networks

We train a neural network classifier,

$$p(y=1 | \mathbf{w}, \mathbf{x}) = \mathrm{sigmoid}\left(\widehat{g}_{\mathbf{W}}(\mathbf{x})\right)$$

by finding weights $\mathbf{W}$ that parametrize the function $\widehat{g}_{\mathbf{W}}$ that "best fits the data", e.g. best separates the classes.

Typically, we optimize the fit of $\widehat{g}_{\mathbf{W}}$ to the data incrementally and greedily in a process called gradient descent.
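A minimal sketch of this training process in Keras, continuing the model above (hyperparameters are illustrative):

from tensorflow.keras.optimizers import SGD

# gradient descent: incrementally adjust W to decrease the classification loss
model.compile(optimizer=SGD(learning_rate=0.01), loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, Y_train, epochs=100, batch_size=32, verbose=0)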

What to Visualize for Neural Network Models?

For logistic regression, $p(y=1 | \mathbf{w}, \mathbf{x}) = \mathrm{sigmoid}(\mathbf{w}^\top \mathbf{x})$, we were able to interrogate the model by printing out the weights of the model.

For a neural network classifier, $p(y=1 | \mathbf{w}, \mathbf{x}) = \mathrm{sigmoid}\left(\widehat{g}_{\mathbf{W}}(\mathbf{x})\right)$, would it be helpful to print out all the weights?

Weight Space Versus Function Space

While it's convenient to build up a complex function by composing simple ones, understanding the impact of each weight on the outcome is now much more difficult.

In fact, the relationship between the space of weights of a neural network and the function the network represents is extremely complicated:

  1. the same function may be represented by two very different sets of weights for the same architecture

  2. the architecture may be overly expressive - it can express the function $\widehat{g}$ using a subset of the weights and hidden nodes (i.e. the trained model can have weights that are zero or nodes that contribute little to the computation).

Question: are there more global/heuristic properties of these models that we can visualize?

A Non-linear Classification Problem

In [12]:
# plot the training data (left) and the decision boundary learned by logistic
# regression (right); scatter_plot_data and plot_decision_boundary are helper
# functions defined earlier in the notebook
degree_of_polynomial = 1
fig, ax = plt.subplots(1, 2, figsize=(15, 5))
scatter_plot_data(X_train, Y_train, ax[0])
ax[0].set_title('Training Data')

plot_decision_boundary(X_train, Y_train, logistic, ax[1], degree_of_polynomial)
ax[1].set_title('Decision Boundary')
plt.show()

A Neural Network Model for Classification

We train a neural network classifier (with ReLU activations): $p(y=1 | \mathbf{w}, \mathbf{x}) = \mathrm{sigmoid}\left(\widehat{g}_{\mathbf{W}}(\mathbf{x})\right)$

Generated with: http://alexlenail.me/NN-SVG/index.html

How Well Does Our Neural Network Classifier Do?

In [88]:
# Evaluate the training and testing performance of your model
# Note: you should check both the loss function and your evaluation metric;
# model.evaluate returns [loss, accuracy]
score = model.evaluate(X_train, Y_train, verbose=0)
print('Train accuracy:', score[1], '\n\n')
Train accuracy: 0.9971428513526917 


Why is a Neural Network Classifier So Effective?

Visualizing the decision boundary:

In [95]:
degree_of_polynomial = 1

fig, ax = plt.subplots(1, 2, figsize=(15, 5))
scatter_plot_data(X_train, Y_train, ax[0])
ax[0].set_title('Training Data')

plot_decision_boundary(X_train, Y_train, model, ax[1], degree_of_polynomial)
ax[1].set_title('Decision Boundary')
plt.show()

Why is a Neural Network Classifier So Effective?

Visualizing the output of the last hidden layer.

In [97]:
# plot the latent representation of our training data at the last hidden layer
fig, ax = plt.subplots(1, 1, figsize=(10, 5))
ax.scatter(activations[Y_train_pred >= 0.5, 0], activations[Y_train_pred >= 0.5, 1], color='red', label='Class 1')
ax.scatter(activations[Y_train_pred < 0.5, 0], activations[Y_train_pred < 0.5, 1], color='blue', label='Class 0')
ax.set_title('Toy Classification Dataset Transformed by the NN')
ax.legend()
plt.show()
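For reference, a minimal sketch of how `activations` and `Y_train_pred` above could be computed, assuming the Keras `model` defined earlier (with a 2-unit last hidden layer):

from tensorflow.keras.models import Model

# sub-model that outputs the last hidden layer instead of the prediction
last_hidden = Model(inputs=model.input, outputs=model.layers[-2].output)
activations = last_hidden.predict(X_train)        # latent representation of the data
Y_train_pred = model.predict(X_train).flatten()   # predicted probabilities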

Two Interpretations of a Neural Network Classifier:

A Complex Decision Boundary $g$ versus a Transformation $g_0$ Followed by a Linear Model $g_1$

Is the Model Right for the Right Reasons?

We now see that the boosted performance of neural network models comes from the fact that they can express complex functions (either capturing decision boundaries or transformations of the data).

Do these visualizations help us answer the same questions we asked of the logistic regression model?

  • which attributes are most predictive of a positive outcome?

  • is the relationship between these attributes and the outcome clinically meaningful?

Lessons for Visualization

Choosing/designing machine learning visualizations requires that we think about:

  1. Why and for whom to visualize: for example

    • are we visualizing to diagnose problems with our models?
    • are we visualizing to interpret our models in a clinically meaningful way?

  2. What and how to visualize: for example

    • do we visualize decision boundaries, the weights of our model, and/or distributional differences in the data?

    Note: what is possible to visualize depends very much on the internal mechanics of our model and the data!

  3. Deep models present unique challenges for visualization: we can answer the same questions about the model, but our method of interrogation must change!

Taxonomy of Deep Learning Visualization Literature

Why and for Whom

  1. Interpretability & Explainability: understand how deep learning models make decisions and what representations they have learned.

  2. Debugging & Improving Models: help model developers build and debug their models, with the hope of expediting the iterative experimentation process to ultimately improve performance.

  3. Teaching Deep Learning Concepts: educate non-expert users about AI.

From: Visual Analytics in Deep Learning: An Interrogative Survey for the Next Frontiers

TensorBoard

TensorBoard provides a wide array of visualization tools for model diagnostics.

From: Tensorboard

Explaining Classifier Decisions in Medical Imaging

Visualization of the evidence for the correct classification of an MRI as positive for the presence of disease.

From: Visualizing Deep Neural Network Decisions: Prediction Difference Analysis

How and What

What technical components of neural networks could be visualized?

  • Computational Graph & Network Architecture
  • Learned Model Parameters: weights, filters
  • Individual Computational Units: activations, gradients
  • Aggregate information: performance metrics

How can they be insightfully visualized?

How we visualize depends on the type of data and model, as well as on our specific investigative goal.

All Data Types

Activations and Weights

By visualizing the network weights and activations (the output of hidden nodes) as we train, we can diagnose issues that ultimately impact model performance.

The following visualizes the distribution of activations in two hidden layers over the course of training. What problems do we see?

From: Tensorboard
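A minimal sketch of how such histograms can be logged with Keras's TensorBoard callback, assuming the `model` from earlier (`histogram_freq=1` records weight distributions every epoch):

from tensorflow.keras.callbacks import TensorBoard

tb = TensorBoard(log_dir='./logs', histogram_freq=1)  # log weight histograms each epoch
model.fit(X_train, Y_train, epochs=100, verbose=0, callbacks=[tb])
# then inspect the distributions with: tensorboard --logdir ./logs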

Visualizing Top Predictors by Input Gradient

Since the input gradient of an objective function for a trained model indicates which input dimensions have the greatest effect on the model's decision at an input $\mathbf{x}$, we can visualize the "top predictors" of the outcome for a particular input $\mathbf{x}$.

We can think of this as approximating our neural network model with a linear model locally at an input $\mathbf{x}$ and then interpreting the weights of this linear approximation.

From: How to Explain Individual Classification Decisions
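A minimal TensorFlow 2 sketch of computing the input gradient at a single point, assuming the Keras `model` from earlier:

import tensorflow as tf

x = tf.convert_to_tensor(X_train[:1], dtype=tf.float32)  # one input point
with tf.GradientTape() as tape:
    tape.watch(x)                       # track gradients with respect to the input
    p = model(x)                        # p(y=1 | W, x)
input_gradient = tape.gradient(p, x).numpy().flatten()
# sorting |input_gradient| reads off the "top predictors" at this particular x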

Image Data

Saliency Maps

When each input dimension is a pixel, we can visualize the input gradients for a particular image as a heat-map, called a saliency map. Regions with higher gradients represent the portions of the image that are most impactful for the model's decision.

From: Top-down Visual Saliency Guided by Captions

Class Maximization

We can also visualize the image exemplar of a class, according to a trained model. That is, we find image $\mathbf{x}^*$ that maximizes the "chances" of classifying that image as class $c$: $ \mathbf{x}^* = \mathrm{argmax}_\mathbf{x}\; \mathrm{score}_c(\mathbf{x}) $

From: Deep Inside Convolutional Networks, Multifaceted Feature Visualization
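A minimal sketch of class maximization by gradient ascent, assuming a hypothetical Keras image classifier `image_model` whose output column $c$ is the score for class $c$ (the image shape, step size, and iteration count are illustrative):

import tensorflow as tf

c = 0                                                 # class to visualize
x = tf.Variable(tf.random.normal((1, 28, 28, 1)))     # start from a random image
for _ in range(200):
    with tf.GradientTape() as tape:
        score = image_model(x)[0, c]                  # score_c(x)
    x.assign_add(0.1 * tape.gradient(score, x))       # gradient ascent on the input
exemplar = x.numpy().squeeze()                        # the class exemplar x*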

Working with Image Data is Challenging

In applications involving images, the first task is often to parse an image into a set of 'features' that are relevant for the task at hand. That is, we prefer not to work with images as a set of pixels.

Formally, we want to learn a function $h$ mapping an image $X$ to a set of $K$ features $[F_1, F_2, \ldots, F_K]$, where each $F_k$ is an image represented as an array. We want to learn a neural network, called a convolutional neural network, to represent such a function $h$.

Convolutional Layers

A convolutional neural network typically consists of feature extracting layers and condensing layers.

The feature extracting layers are called convolutional layers. Each node in these layers uses a small, fixed set of weights to transform the image; this set of fixed weights is often called a filter.

Connections to Classical Image Processing

The term "filter" comes from image processing where one has standard ways to transforms raw images:

Feature Extraction for Classification

Rather than processing image data with a pre-determined set of filters, we want to learn the filters of a CNN for feature extraction. Our goal is to extract features that best help us perform our downstream task (e.g. classification).

Idea: We train a CNN for feature extraction and a model (e.g. MLP, decision tree, logistic regression) for classification, simultaneously and end-to-end.
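A minimal Keras sketch of this idea (shapes and layer sizes are illustrative): the convolutional layers learn the filters, the pooling layers condense, and the dense layers classify, all trained together end-to-end.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

cnn = Sequential([
    Conv2D(16, (3, 3), activation='relu', input_shape=(28, 28, 1)),  # feature extraction
    MaxPooling2D((2, 2)),                                            # condensing layer
    Conv2D(32, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(64, activation='relu'),   # classification head (an MLP)
    Dense(1, activation='sigmoid')
])
cnn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])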

What to Visualize for CNNs?

The first things to try are:

  1. visualize the result of applying a learned filter to an image
  2. visualize the filters themselves:

Unfortunately, these simple visualizations don't shed much light on what the model has learned.
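For reference, a minimal sketch of visualization 2 above, assuming the `cnn` sketched earlier:

import matplotlib.pyplot as plt

filters, biases = cnn.layers[0].get_weights()    # first conv layer: shape (3, 3, 1, 16)
fig, axes = plt.subplots(2, 8, figsize=(12, 3))
for i, ax in enumerate(axes.flatten()):
    ax.imshow(filters[:, :, 0, i], cmap='gray')  # each 3x3 filter as a tiny image
    ax.axis('off')
plt.show()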

Activation Maximization: Generating Exemplars

Rather than visualizing a particular filter or the representation of a particular image at a hidden layer, we can visualize the image that maximizes the output, and hence the impact, of that filter or layer. Such an image is an exemplar of the filter or feature that the model has learned.

That is, we find image $\mathbf{x}^*$ that maximizes activation of a filter or a hidden layer (representation) while holding the network weights fixed:

$$ \mathbf{x}^* = \mathrm{argmax}_\mathbf{x}\; \mathrm{activation}_{\text{$f$ or $l$}}(\mathbf{x}). $$
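A minimal sketch of activation maximization for a single filter, assuming the `cnn` sketched earlier (layer index, filter index, step size, and iteration count are illustrative):

import tensorflow as tf
from tensorflow.keras.models import Model

layer_model = Model(inputs=cnn.input, outputs=cnn.layers[2].output)  # a conv layer
f = 0                                               # filter (channel) to visualize
x = tf.Variable(tf.random.uniform((1, 28, 28, 1)))
for _ in range(200):
    with tf.GradientTape() as tape:
        activation = tf.reduce_mean(layer_model(x)[..., f])  # mean activation of filter f
    x.assign_add(0.1 * tape.gradient(activation, x))         # ascent: weights W stay fixed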

Visualizing Convolutional Features By Activation Maximization

From: Feature Visualization

Visualizing Convolutional Filters By Activation Maximization

From: DeepViz

Interpretation with a Grain of Salt

Example: Saliency Maps for Model Diagnostics

Here is a guided example of using saliency maps to diagnose problems with a neural network classifier (this has not yet been converted to keras!).

https://tinyurl.com/w5h54vg