Harvard University
Spring 2020
Instructors: Pavlos Protopapas
TFs: Michael Emanuel, Andrea Porelli, and Giulia Zerbini
Authors: Weiwei Pan and Pavlos Protopapas
What: Today we focus on visualizations that probe the properties of machine learning models built for data sets.
Who: This introductory lecture will give a brief guided tour of the literature. We will define what a neural network model is and give motivations for model visualization.
How: Notes for this lecture are online and are best used as a conceptual foundation, as well as a starting point for exploring this body of literature.
We build a simple predictive model: logistic regression. We model the probability of a patient $\mathbf{x} \in \mathbb{R}^{\text{input}}$ having a positive outcome (encoded by $y=1$) as a function of its distance from a hyperplane parametrized by $\mathbf{w}$ that separates the outcome groups in the input space.
That is, we model $p(y=1 | \mathbf{w}, \mathbf{x}) = \mathrm{sigmoid}(\mathbf{w}^\top \mathbf{x})$, where $g(\mathbf{x}) = \mathbf{w}^\top \mathbf{x}=0$ is the equation of the decision boundary.
We sort the weights of the logistic regression model
$$p(y=1 | \mathbf{w}, \mathbf{x}) = \mathrm{sigmoid}(\mathbf{w}^\top \mathbf{x})$$ and visualize the weights alongside their corresponding data attributes. Why is this a good idea?
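For instance, here is a minimal sketch of this visualization, assuming a fitted scikit-learn LogisticRegression called `logistic` and a list of column names `feature_names` (both names are assumptions for illustration):

import numpy as np
import matplotlib.pyplot as plt

# Sort the learned weights and plot them against their attribute names.
# `logistic` and `feature_names` are assumed to be defined earlier in the notebook.
weights = logistic.coef_.flatten()
order = np.argsort(weights)

fig, ax = plt.subplots(1, 1, figsize=(10, 6))
ax.barh(np.array(feature_names)[order], weights[order])
ax.set_xlabel('weight')
ax.set_title('Logistic Regression Weights (sorted)')
plt.show()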
We visualize the distribution of the predictor E2_max
for both positive and negative outcome groups. What can we conclude about this predictor?
We visualize the distribution of the predictor Trigger_Med_nan
for both positive and negative outcome groups. What can we conclude about this predictor?
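A minimal sketch of this kind of plot, assuming the data live in a pandas DataFrame `df` with a binary outcome column `'y'` (both hypothetical names):

import matplotlib.pyplot as plt

def plot_predictor_distribution(df, predictor, outcome='y'):
    # overlay the distribution of one predictor for the two outcome groups
    fig, ax = plt.subplots(1, 1, figsize=(10, 5))
    ax.hist(df.loc[df[outcome] == 1, predictor], bins=30, alpha=0.5,
            density=True, label='positive outcome')
    ax.hist(df.loc[df[outcome] == 0, predictor], bins=30, alpha=0.5,
            density=True, label='negative outcome')
    ax.set_xlabel(predictor)
    ax.set_title('Distribution of {} by outcome group'.format(predictor))
    ax.legend()
    plt.show()

plot_predictor_distribution(df, 'E2_max')
plot_predictor_distribution(df, 'Trigger_Med_nan')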
Choosing/designing machine learning visualization requires that we think about:
Why and for whom to visualize: for example
What and how to visualize: for example
Note: what is possible to visualize depends very much on the internal mechanics of our model and the data!
How would you parametrize an elliptical decision boundary?
We can say that the decision boundary is given by a quadratic function of the input:
$$ g(\mathbf{x}) = w_1x^2_1 + w_2x^2_2 + w_3 = 0 $$ We can fit such a decision boundary using logistic regression with degree-2 polynomial features.
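For example, a minimal scikit-learn sketch, assuming the 2-D training data X_train, Y_train from this notebook:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Degree-2 polynomial features expand (x1, x2) into (x1, x2, x1^2, x1*x2, x2^2),
# so a linear boundary in the expanded feature space is a conic section
# (e.g. an ellipse) in the original input space.
quadratic_logistic = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LogisticRegression()
)
quadratic_logistic.fit(X_train, Y_train)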
It's not easy to think of a function $g(\mathbf{x})$ that can capture this decision boundary. Moreover, assuming an exact form for $g$ is restrictive.
GOAL: Build increasingly good approximations, $\widehat{g}$, of the "true decision boundary" $g$ by composing simple functions. This $\widehat{g}$ is essentially a neural network.
Goal: build a complex function $\widehat{g}$ by composing simple functions.
For example, let the following picture represent the approximation $\widehat{g}(\mathbf{x}) = f\left(\sum_{i}w_ix_i\right)$, where $f$ is a non-linear transform:
Then we can define the function $\widehat{g}$ with a graphical schema representing a complex series of compositions and sums of the form $f\left(\sum_{i}w_ix_i\right)$.
This is a neural network. We denote the weights of the neural network collectively by $\mathbf{W}$. The non-linear function $f$ is called the activation function. Nodes in this diagram that do not represent input or output are called hidden nodes. This diagram, together with the choice of activation function $f$, defines $\widehat{g}$ and is called the architecture.
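As an illustration (a toy numpy sketch, not the lecture's code), $\widehat{g}$ can be computed by repeatedly applying layers of the form $f\left(\sum_{i}w_ix_i\right)$:

import numpy as np

def f(z):
    # a non-linear activation function (tanh here, purely as an example)
    return np.tanh(z)

def g_hat(x, weight_matrices, biases):
    # compose simple functions: each layer computes f(W h + b)
    h = x
    for W, b in zip(weight_matrices, biases):
        h = f(W @ h + b)
    return h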
We train a neural network classifier,
$$p(y=1 | \mathbf{W}, \mathbf{x}) = \mathrm{sigmoid}\left(\widehat{g}_{\mathbf{W}}(\mathbf{x})\right)$$ by finding weights $\mathbf{W}$ that parametrize the function $\widehat{g}_{\mathbf{W}}$ that "best fits the data", e.g. best separates the classes.
Typically, we optimize the fit of $\widehat{g}_{\mathbf{W}}$ to the data incrementally and greedily in a process called gradient descent.
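As a rough numpy sketch of the idea, here is gradient descent for the logistic regression case; for a neural network the gradient of the loss is computed by backpropagation, but the update loop is the same:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_by_gradient_descent(X, y, lr=0.1, n_steps=1000):
    # incrementally and greedily improve the weights by stepping
    # against the gradient of the negative log-likelihood
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        p = sigmoid(X @ w)               # current predicted probabilities
        grad = X.T @ (p - y) / len(y)    # gradient of the loss w.r.t. w
        w -= lr * grad
    return w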
For logistic regression, $p(y=1 | \mathbf{w}, \mathbf{x}) = \mathrm{sigmoid}(\mathbf{w}^\top \mathbf{x})$, we were able to interrogate the model by printing out the weights of the model.
For a neural network classifier, $p(y=1 | \mathbf{W}, \mathbf{x}) = \mathrm{sigmoid}\left(\widehat{g}_{\mathbf{W}}(\mathbf{x})\right)$, would it be helpful to print out all the weights?
While it's convenient to build up a complex function by composing simple ones, understanding the impact of each weight on the outcome is now much more difficult.
In fact, the relationship between the space of weights of a neural network and the function the network represents is extremely complicated:
Question: are there more global/heuristic properties of these models that we can visualize?
# Visualize the training data and the decision boundary learned by the logistic regression model
degree_of_polynomial = 1
fig, ax = plt.subplots(1, 2, figsize=(15, 5))
scatter_plot_data(X_train, Y_train, ax[0])
ax[0].set_title('Training Data')
plot_decision_boundary(X_train, Y_train, logistic, ax[1], degree_of_polynomial)
ax[1].set_title('Decision Boundary')
plt.show()
We train a neural network classifier (with ReLU activation): $p(y=1 | \mathbf{W}, \mathbf{x}) = \mathrm{sigmoid}\left(\widehat{g}_{\mathbf{W}}(\mathbf{x})\right)$
Generated with: http://alexlenail.me/NN-SVG/index.html
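Below is a minimal sketch of how such a `model` might be defined and trained with Keras on the toy data; the exact number of layers and nodes used in lecture may differ (here the last hidden layer has two nodes so its output can be plotted directly later):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Illustrative architecture: two hidden ReLU layers, sigmoid output for p(y=1|x)
model = Sequential([
    Dense(10, activation='relu', input_shape=(2,)),
    Dense(2, activation='relu'),
    Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, Y_train, epochs=500, batch_size=32, verbose=0)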
# Evaluate the training and testing performance of your model
# Note: you should check both the loss function and your evaluation metric
score = model.evaluate(X_train, Y_train, verbose=0)
print('Train accuracy:', score[1], '\n\n')
Train accuracy: 0.9971428513526917
Visualizing the decision boundary:
degree_of_polynomial = 1
fig, ax = plt.subplots(1, 2, figsize=(15, 5))
scatter_plot_data(X_train, Y_train, ax[0])
ax[0].set_title('Training Data')
plot_decision_boundary(X_train, Y_train, model, ax[1], degree_of_polynomial)
ax[1].set_title('Decision Boundary')
plt.show()
Visualizing the output of the last hidden layer.
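A sketch of how the quantities used below (`activations`, `Y_train_pred`) might be computed, assuming the Keras `model` from above:

from tensorflow.keras.models import Model

# build a sub-model whose output is the last hidden layer of `model`
hidden_layer_model = Model(inputs=model.input, outputs=model.layers[-2].output)
activations = hidden_layer_model.predict(X_train)     # 2-D latent representation of the training data
Y_train_pred = model.predict(X_train).flatten()       # predicted p(y=1 | x)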
# plot the latent representation of our training data at the last hidden layer
fig, ax = plt.subplots(1, 1, figsize=(10, 5))
ax.scatter(activations[Y_train_pred >= 0.5, 0], activations[Y_train_pred >= 0.5, 1], color='red', label='Class 1')
ax.scatter(activations[Y_train_pred < 0.5, 0], activations[Y_train_pred < 0.5, 1], color='blue', label='Class 0')
ax.set_title('Toy Classification Dataset Transformed by the NN')
ax.legend()
plt.show()
A Complex Decision Boundary $g$ | A Transformation $g_0$ and a Linear Model $g_1$
We now see that the improved performance of neural network models comes from their ability to express complex functions (either capturing decision boundaries or transformations of the data).
Do these visualizations help us answer the same questions we asked of the logistic regression model?
Choosing/designing machine learning visualization requires that we think about:
Why and for whom to visualize: for example
What and how to visualize: for example
Note: what is possible to visualize depends very much on the internal mechanics of our model and the data!
Deep models present unique challenges for visualization: we can answer the same questions about the model, but our method of interrogation must change!
From: Visual Analytics in Deep Learning: An Interrogative Survey for the Next Frontiers
TensorBoard provides a wide array of visualization tools for model diagnostics.
From: Tensorboard
Visualization of the evidence for the correct classification of an MRI as positive for the presence of disease.
From: Visualizing Deep Neural Network Decisions: Prediction Difference Analysis
What technical components of neural networks could be visualized?
How can they be insightfully visualized?
How depends on the type of data and model as well as our specific investigative goal.
By visualizing the network weights and activations (the output of hidden nodes) as we train, we can diagnose issues that ultimately impact model performance.
The following visualizes the distribution of activations in two hidden layers over the course of training. What problems do we see?
From: Tensorboard
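As a sketch, weight and activation histograms like these can be logged during Keras training with the TensorBoard callback (the log directory and epoch count are illustrative; `model` is the Keras model from earlier):

import tensorflow as tf

# record histogram summaries every epoch for inspection in TensorBoard
tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir='./logs', histogram_freq=1)
model.fit(X_train, Y_train, epochs=500, verbose=0, callbacks=[tensorboard_cb])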
Since the input gradient of an objective function for a trained model indicates which input dimensions have the greatest effect on the model's decision at an input $\mathbf{x}$, we can visualize the "top predictors" of the outcome for a particular input $\mathbf{x}$.
We can think of this as approximating our neural network model with a linear model locally at an input $\mathbf{x}$ and then interpreting the weights of this linear approximation.
When each input dimension is a pixel, we can visualize the input gradients for a particular image as a heat-map, called a saliency map. Higher-gradient regions represent the portions of the image that are most impactful for the model's decision.
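A minimal sketch of such a vanilla saliency map with TensorFlow, assuming a trained Keras image classifier `image_model` with a single sigmoid output and a single input image array `x_image` (all of these names are assumptions):

import numpy as np
import tensorflow as tf

def saliency_map(image_model, x_image):
    # gradient of the model's output score with respect to the input pixels
    x = tf.convert_to_tensor(x_image[None, ...], dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        score = image_model(x)[0, 0]
    grad = tape.gradient(score, x)[0]
    # collapse color channels and take magnitudes to get a heat-map
    return np.max(np.abs(grad.numpy()), axis=-1)

# plt.imshow(saliency_map(image_model, x_image), cmap='hot')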
We can also visualize the image exemplar of a class, according to a trained model. That is, we find the image $\mathbf{x}^*$ that maximizes the "chances" of classifying that image as class $c$: $ \mathbf{x}^* = \mathrm{argmax}_\mathbf{x}\; \mathrm{score}_c(\mathbf{x}) $
From: Deep Inside Convolutional Networks, Multifaceted Feature Visualization
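A sketch of this idea by gradient ascent on the class score, assuming a trained image classifier `image_model` that outputs per-class scores and an ImageNet-style input size (all illustrative assumptions):

import tensorflow as tf

def class_exemplar(image_model, c, n_steps=200, lr=1.0):
    # start from random noise and repeatedly move the image in the
    # direction that increases score_c(x)
    x = tf.Variable(tf.random.normal([1, 224, 224, 3]))
    for _ in range(n_steps):
        with tf.GradientTape() as tape:
            score = image_model(x)[0, c]
        grad = tape.gradient(score, x)
        x.assign_add(lr * grad)
    return x.numpy()[0]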
In applications involving images, the first task is often to parse an image into a set of 'features' that are relevant for the task at hand. That is, we prefer not to work with images as a set of pixels.
Formally, we want to learn a function $h$ mapping an image $X$ to a set of $K$ features $[F_1, F_2, \ldots, F_K]$, where each $F_k$ is an image represented as an array. We want to learn a neural network, called a convolutional neural network, to represent such a function $h$.
A convolutional neural network typically consists of feature extracting layers and condensing layers.
The feature-extracting layers are called convolutional layers; each node in these layers uses a small, fixed set of weights to transform the image in the following way. This set of fixed weights for each node in a convolutional layer is often called a filter.
The term "filter" comes from image processing where one has standard ways to transforms raw images:
Rather than processing image data with a pre-determined set of filters, we want to learn the filters of a CNN for feature extraction. Our goal is to extract the features that best help us perform our downstream task (e.g. classification).
Idea: We train a CNN for feature extraction and a model (e.g. MLP, decision tree, logistic regression) for classification, simultaneously and end-to-end.
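A minimal Keras sketch of this idea, pairing convolutional feature-extraction layers with a small dense classifier (the input shape and layer sizes are purely illustrative):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

cnn = Sequential([
    Conv2D(16, (3, 3), activation='relu', input_shape=(28, 28, 1)),  # learned filters
    MaxPooling2D((2, 2)),                                            # condensing layer
    Conv2D(32, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(64, activation='relu'),            # classifier on the extracted features
    Dense(10, activation='softmax')
])
cnn.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# trained end-to-end, e.g.: cnn.fit(X_images, y_labels, epochs=10)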
The first things to try are:
Unfortunately, these simple visualizations don't shed much light on what the model has learned.
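For instance, here is a sketch of one of these simple visualizations: plotting the learned filters of the first convolutional layer of the (hypothetical) `cnn` above as small grayscale images.

import matplotlib.pyplot as plt

# weights of the first Conv2D layer have shape (3, 3, in_channels, n_filters)
filters = cnn.layers[0].get_weights()[0]
fig, axes = plt.subplots(2, 8, figsize=(16, 4))
for k, ax in enumerate(axes.flat):
    ax.imshow(filters[:, :, 0, k], cmap='gray')   # k-th learned filter
    ax.axis('off')
plt.show()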
Rather than visualizing a particular filter or the representation of a particular image at a hidden layer, we can visualize the image that maximizes the output and hence impact of that filter or layer. Such an image is an exemplar of the filter or feature that the model has learned.
That is, we find image $\mathbf{x}^*$ that maximizes activation of a filter or a hidden layer (representation) while holding the network weights fixed:
$$ \mathbf{x}^* = \mathrm{argmax}_\mathbf{x}\; \mathrm{activation}_{\text{$f$ or $l$}}(\mathbf{x}). $$
From: Feature Visualization
Some widely deployed saliency methods are independent of both the data the model was trained on, and the model parameters!
A transformation that has no effect on the model can cause numerous saliency methods to produce incorrect attributions!
From: Sanity Checks for Saliency Maps, The (Un)reliability of Saliency Methods
Here is a guided example of using saliency maps to diagnose problems with a neural network classifier (this has not yet been converted to keras!).