CS109A Introduction to Data Science

Lab 7: Numpy for Building a Neural Network and Dealing with Missing Values

Harvard University
Fall 2018
Instructors: Pavlos Protopapas and Kevin Rader
Lab Instructor: Eleni Kaxiras
Authors: David Sondak, Pavlos Protopapas, and Eleni Kaxiras


In [1]:
## RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
Out[1]:

In [5]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import pandas as pd
%matplotlib inline

Learning Goals

In this lab, we'll review matrix operations, talk about tensors, and discuss how to handle missing data.

By the end of this lab, you should:

  • Understand how a simple neural network works and code its functions from scratch.
  • Know how to use simple pandas and scikit-learn functions to handle missing values.
  • Be able to think of vectors and arrays as tensors.

Part 1: Neural Networks 101: Starting with a Single Node

The simplest way to describe a neural network is that we have some inputs $x$, which get combined into an auxiliary variable $z$. The auxiliary variable is passed through the activation function $\sigma\left(z\right)$ and the result is the output.

Here is another image showing each step.

Notice that the inputs are linearly combined according to some weights $w$ and a bias $b$. This transformation is also sometimes called an affine transformation. The perceptron transforms the weighted inputs according to the rule of the activation function. For a single perceptron, the output $y$ is just the output from the perceptron. The linear transformation and activation of the neuron occur within a single layer of the network (shown in the dotted box).

Let's see what the single-layer, single-neuron network gives us. We have a couple of choices to make:

  1. We must choose some weights and some biases
  2. We must choose an activation function

For now, we will manually specify the weights and biases.

We choose a sigmoid activation function $$\sigma\left(z\right) = \dfrac{1}{1 + e^{-z}}.$$ What are the limits $\displaystyle\lim_{z\to\infty}\sigma\left(z\right)$ and $\displaystyle\lim_{z\to-\infty}\sigma\left(z\right)$? Actually, the sigmoid we have here is called the logistic function. Sigmoids are really a family of functions and the logistic function is just one member in that family.
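For reference, here is one way to see those limits: as $z\to\infty$ we have $e^{-z}\to 0$, and as $z\to-\infty$ we have $e^{-z}\to\infty$, so $$\lim_{z\to\infty}\sigma\left(z\right) = \dfrac{1}{1 + 0} = 1 \quad \textrm{and} \quad \lim_{z\to-\infty}\sigma\left(z\right) = 0.$$ In other words, the sigmoid squashes any real input into the interval $\left(0, 1\right)$.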

In class exercise: Plot the sigmoid


In [2]:
# your code here
In [3]:
# %load solutions/sigmoid.py
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

Generate a list of 500 $x$ points from -5 to 5 and plot both the sigmoid and the tanh (for tanh you may use np.tanh)

What do you observe?

In [ ]:
# %load solutions/plot_sig.py
x = np.linspace(-5.0, 5.0, 500) # input points
print(x[:5])
plt.plot(x, sigmoid(x), label='sigmoid')
plt.plot(x, np.tanh(x), label='tanh')
plt.legend()
Comments
  • We say that the activation occurs when $\sigma = \dfrac{1}{2}$. We can show that this corresponds to $x = -\dfrac{b}{w}$ (see the quick check after this list).
  • The "steepness" of the sigmoid is controlled by $w$.
In class exercise: Approximate a Gaussian function using a node

The task is to approximate (or learn) a function $f\left(x\right)$ given some input $x$. For demonstration purposes, the function we will try to learn is a Gaussian function $$f\left(x\right) = e^{-x^{2}}.$$

Even though we represent the input $x$ as a vector on the computer, you should think of it as a single input.

Start by plotting the above function using the $x$ array you created earlier.

In [ ]:
f = np.exp(-x*x) # The real function, x = np.linspace(-5.0, 5.0, 500) input points

plt.plot(x, f, label='gaussian')
plt.legend()

Now, let's code the single node as per the image above. Write a function named affine that does the linear transformation.

In [ ]:
# your code here
def affine(x, w, b):
    """Return affine transformation of x
    
    INPUTS
    ======
    x: A numpy array of points in x
    w: A float representing the weight of the perceptron
    b: A float representing the bias of the perceptron
    
    RETURN
    ======
    z: A numpy array of points after the affine transformation
       z = wx + b
    """
    
    # Code goes here
    return z
In [ ]:
# your code here
In [ ]:
# %load solutions/affine.py
In [ ]:
# your code here
In [ ]:
# %load solutions/sigmoid.py
In [ ]:
# your code here
w = -4.5
b = 4.

h = sigmoid(affine(x,w,b))
In [ ]:
# %load solutions/perceptron.py
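For reference, here is a minimal sketch of what the loaded solution could look like (the actual solutions/affine.py and solutions/perceptron.py may differ in detail):

def affine(x, w, b):
    """Return the affine transformation z = w*x + b."""
    return w * x + b

w = -4.5
b = 4.0
h = sigmoid(affine(x, w, b))  # the single-neuron output used in the plot below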

And now we plot the activation function and the true function.

In [ ]:
fig, ax = plt.subplots(1,1, figsize=(11,7)) # create axes object

SIZE = 16
# Plot
ax.plot(x, f, ls='-.', lw=4, label=r'True function')
ax.plot(x, h, lw=4, label=r'Single Neuron')

# Create labels (very important!)
ax.set_xlabel('$x$', fontsize=SIZE) # Notice we make the labels big enough to read
ax.set_ylabel('$y$', fontsize=SIZE)

ax.tick_params(labelsize=SIZE) # Make the tick labels big enough to read

ax.legend(fontsize=SIZE, loc=1) # Create a legend and make it big enough to read

The single perceptron simply turns the output on and off at some point, but that's about it. We see that the neuron is on until about $x \approx 1$, at which point it abruptly turns off. It manages to get "close" to the true function in places, but otherwise it has little in common with it.

What do you think will happen if you change $w$ and $b$?

Important Observation

Notice that we wrote the output as sigmoid(affine(x)). This was not a coincidence. It looks like a composition of functions. In fact, that is what a neural network is doing. It's building up an approximation to a function by creating a composition of functions. For example, a composition of three functions would be written as $$\varphi_{3}\left(\varphi_{2}\left(\varphi_{1}\left(x\right)\right)\right).$$
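To make the composition idea concrete, here is a small illustrative sketch (the particular functions and constants are made up; they are not the network we built above):

phi1 = lambda t: affine(t, -4.5, 4.0)   # an affine transformation
phi2 = sigmoid                          # an activation
phi3 = lambda t: 2.0 * t - 1.0          # another affine map, e.g. at the output

y_composed = phi3(phi2(phi1(x)))        # exactly the pattern phi3(phi2(phi1(x)))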

What happens if we play with the weights and biases?

In [ ]:
w = [-5.0, 0.1, 5.0] # Create a list of weights
b = [0.0, -1.0, 1.0] # Create a list of biases

fig, ax = plt.subplots(1,1, figsize=(15,10))
ax.plot(x, f, lw=4, ls='-.', label='True function')
for wi, bi in zip(w, b):
    h = sigmoid(affine(x, wi, bi))
    ax.plot(x, h, lw=4, label=r'$w = {0}$, $b = {1}$'.format(wi,bi))
ax.set_title('Single neuron network', fontsize=SIZE)

# Create labels (very important!)
ax.set_xlabel('$x$', fontsize=SIZE) # Notice we make the labels big enough to read
ax.set_ylabel('$y$', fontsize=SIZE)

ax.tick_params(labelsize=SIZE) # Make the tick labels big enough to read

ax.legend(fontsize=SIZE, loc='best') # Create a legend and make it big enough to read

We didn't do an exhaustive search of the weights and biases, but it sure looks like this single perceptron is never going to match the actual function. Again, we shouldn't be surprised about this. The output of the network is simply the logistic function, which can only have so much flexibility.

Let's try to make our network more flexible by using more nodes!

Multiple Perceptrons in a Single Layer

It appears that a single neuron is somewhat limited in what it can accomplish. What if we expand the number of nodes/neurons in our network? We have two obvious choices here. One option is to add depth to the network by putting layers next to each other. The other option is to stack neurons on top of each other in the same layer. Now the network has some width, but is still only one layer deep.

The following figure shows a single-layer network with two nodes in one layer.

Some observations

  1. We still have a single input in this case. Note that this is not necessary in general. We're just keeping things simple with a single input for now. If we have more inputs we will have a matrix for $X$.
  2. Each node (or neuron) has a weight and bias associated with it. An affine transformation is performed for each node.
  3. Both nodes use the same activation function form $\sigma\left(\cdot\right)$ on their respective inputs.
  4. The outputs of the nodes must be combined to give the overall output of the network. There are a variety of ways of accomplishing this. In the current example, we just take a linear combination of the node outputs to produce the actual prediction. Notice that now we have weights and biases at the output too.

Let's see what happens in this case. First, we just compute the outputs of each neuron.

In [ ]:
x = np.linspace(-5.0, 5.0, 500) # input points
f = np.exp(-x*x) # data

w = np.array([3.5, -3.5])
b = np.array([3.5, 3.5])

# Affine transformations
z1 = w[0] * x + b[0]
z2 = w[1] * x + b[1]

# Node outputs
h1 = 1.0 / (1.0 + np.exp(-z1))
h2 = 1.0 / (1.0 + np.exp(-z2))

Now let's plot things and see what they look like.

In [ ]:
fig, ax = plt.subplots(1,1, figsize=(14,10))

ax.plot(x, f, lw=4, ls = '-.', label='True function')
ax.plot(x, h1, lw=4, label='First neuron')
ax.plot(x, h2, lw=4, label='Second neuron')

# Set title
ax.set_title('Comparison of Neuron Outputs', fontsize=SIZE)

# Create labels (very important!)
ax.set_xlabel('$x$', fontsize=SIZE) # Notice we make the labels big enough to read
ax.set_ylabel('$y$', fontsize=SIZE)

ax.tick_params(labelsize=SIZE) # Make the tick labels big enough to read

ax.legend(fontsize=SIZE, loc='best') # Create a legend and make it big enough to read

Just as we expected: some sigmoids. Of course, to get the network prediction we must combine these two sigmoid curves somehow. First we'll just add $h_{1}$ and $h_{2}$ with unit weights (plus a constant shift) to see what happens.

Note

We are not doing classification here. We are trying to predict an actual function. The sigmoid activation is convenient when doing classification because you need to go from $0$ to $1$. However, when learning a function, we don't have as good of a reason to choose a sigmoid.

In [ ]:
# Network output
wout = np.ones(2) # Set the output weights to unity to begin
bout = -1 # Shift the sum down so the output ranges from roughly 0 to 1
yout = wout[0] * h1 + wout[1] * h2 + bout

And plot.

In [ ]:
fig, ax = plt.subplots(1,1, figsize=(14,10))

ax.plot(x, f, ls='-.', lw=4, label=r'True function')
ax.plot(x, yout, lw=4, label=r'$y_{out} = h_{1} + h_{2} - 1$')

# Create labels (very important!)
ax.set_xlabel('$x$', fontsize=SIZE) # Notice we make the labels big enough to read
ax.set_ylabel('$y$', fontsize=SIZE)

ax.tick_params(labelsize=SIZE) # Make the tick labels big enough to read

ax.legend(fontsize=SIZE, loc='best') # Create a legend and make it big enough to read

Observations

  • The network prediction is still not good, but it is already fairly sophisticated: with just two neurons we get some pretty interesting behavior. We didn't do anything with the output weights yet, and those are probably important. Now let's see what happens when we change the weights on the output.
In [ ]:
# Network output
wout = np.array([-1.5, -1.5])
bout = np.array(1.5)

yout = wout[0] * h1 + wout[1] * h2 + bout
In [ ]:
fig, ax = plt.subplots(1,1, figsize=(14,10))

ax.plot(x, f, lw=4, ls = '-.', label='True function')
ax.plot(x, yout, lw=4, label=r'$y_{out} = w_{1}^{o}h_{1} + w_{2}^{o}h_{2} + b^{o}$')

# Create labels (very important!)
ax.set_xlabel('$x$', fontsize=SIZE) # Notice we make the labels big enough to read
ax.set_ylabel('$y$', fontsize=SIZE)

ax.tick_params(labelsize=SIZE) # Make the tick labels big enough to read

ax.legend(fontsize=SIZE, loc=1) # Create a legend and make it big enough to read

Very cool! The two nodes interact with each other to produce a pretty complicated-looking function. It still doesn't match the true function, but now we have some hope. In fact, it's starting to look a little bit like a Gaussian!

We can do better. There are three obvious options at this point:

  1. Change the number of nodes
  2. Change the activation functions
  3. Change the weights

Some Mathematical Notation

Before proceeding, let's learn a more succinct way of doing the calculations. If you have a network with a lot of nodes, then you probably don't want to manually determine the output of each node. It would take forever. Instead, you can package the computations up using a more compact notation. We'll illustrate the ideas with the two-node network.

Suppose we have a single input $x$ to a single-layer two-node network. We can store the weights from each node in a vector $\mathbf{w} \in \mathbb{R}^{2}$. Similarly, we can store the biases from each node in a vector $\mathbf{b} \in \mathbb{R}^{2}$. The affine transformation is then written as $$\mathbf{z} = \mathbf{w}x + \mathbf{b}$$ where the usual laws of vector addition and multiplication by a scalar apply. Of course, we have that $\mathbf{z} \in \mathbb{R}^{2}$ as well. Next we evaluate the output from each node. Formally, we write $$\mathbf{h} = \sigma\left(\mathbf{z}\right)$$ where, once again, $\mathbf{h}\in\mathbb{R}^{2}$. Moreover, it is understood that $\sigma$ operates on each individual element of $\mathbf{z}$ separately. If we denote each component of $\mathbf{z}$ by $z_{j}, \quad j = 1, 2$ then we can write $$h_{j} = \sigma\left(z_{j}\right), \quad j = 1, 2.$$

Lastly, we must do something about the output layer. Mathematically we write $$y_{out} = \mathbf{w}_{out} \cdot \mathbf{h} + b_{out}$$ where $\mathbf{w}_{out} \in \mathbb{R}^{2}$ and $b_{out} \in \mathbb{R}$.
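As a sketch, the whole two-node computation above can be written in this vectorized form with NumPy (reusing x, sigmoid, and the weights from the earlier cells; the result should match the yout we computed before):

w = np.array([3.5, -3.5])     # node weights
b = np.array([3.5, 3.5])      # node biases
wout = np.array([-1.5, -1.5]) # output weights
bout = 1.5                    # output bias

z = np.outer(x, w) + b        # affine transformation, shape (500, 2)
h = sigmoid(z)                # elementwise activation, shape (500, 2)
yout = h @ wout + bout        # output layer w_out . h + b_out, shape (500,)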

Part 2. Handling Missing Values

Imputation in pandas

pandas uses the floating-point NaN internally to denote missing data, largely for simplicity and performance reasons. The library will also recognize the Python None as "missing", "not available", or "NA". When computing descriptive statistics, pandas excludes missing values by default.
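A tiny example of that default behavior (the values are arbitrary):

tmp = pd.Series([1.0, np.nan, 3.0])
print(tmp.mean())               # 2.0 -- the NaN is skipped
print(tmp.mean(skipna=False))   # nan -- unless we ask otherwise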

In [ ]:
s = pd.Series([0, 1, [], np.nan, 4, 6, 8, 15, 23, 25, np.nan])
s
In [ ]:
# look for missing values
s.isna() # or s.isnull()

pandas can interpolate according to different methods.

In [ ]:
# how many?
s.isna().sum()
In [ ]:
s.isnull().sum()
In [ ]:
# fill with zeros
s.fillna(0)
# for a DataFrame column you could do e.g. df['col1'].fillna('missing')
In [ ]:
# You may wish to simply exclude labels from a data set which refer to missing data. To do this, use dropna()
s = pd.Series([0, 1, np.nan, 4, 6, 8, 15, 23, 25, np.nan])
s = s.dropna()
s
In [ ]:
s = pd.Series([0, 1, np.nan, 4, 6, 8, 15, 23, 25, np.nan, 27])
s.interpolate()
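Besides the default linear interpolation shown above, we can ask for other methods (a sketch; these particular methods delegate to SciPy, so SciPy must be installed):

s.interpolate(method='nearest')              # fill with the nearest non-missing value
s.interpolate(method='polynomial', order=2)  # fit a 2nd-order polynomial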

Imputation in scikit-learn

In [ ]:
from sklearn.impute import SimpleImputer

imp = SimpleImputer(copy=True, missing_values=np.nan, strategy='mean')
s1 = pd.Series([0, 1, np.nan, 4, 6, 8, 15, 23, 25, np.nan])
s2 = pd.Series([0, 1, np.nan, 4, 6, 8, 15, 23, 25, np.nan])
d = {'col1': s1, 'col2': s2}
df = pd.DataFrame(data=d)
df
In [ ]:
imp.fit(df) 
df = imp.transform(df)
df
In [ ]:
print(imp.transform(df))           
In [ ]:
imp2 = SimpleImputer(strategy="most_frequent")
s1 = pd.Series(['eleni', 'john', np.nan, 'john', 'mary', 'john', 'george', 'eleni', 'john', np.nan])
d = {'name': s1}
df = pd.DataFrame(data=d)
df
In [ ]:
imp2.fit(df) 
df = imp2.transform(df)
df
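Note that transform returns a plain NumPy array, so the column names are lost. A common pattern (a sketch, using the hypothetical names df_raw and df_imputed) is to wrap the result back into a DataFrame:

df_raw = pd.DataFrame({'name': s1})   # the original frame with missing names
df_imputed = pd.DataFrame(imp2.transform(df_raw), columns=df_raw.columns)
df_imputed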

Part 3. Reese Witherspoon as a rank 3 Tensor

We can think of tensors as multidimensional arrays of numerical values; their job is to generalize matrices to multiple dimensions. While tensors first emerged in the 20th century, they have since been applied to numerous other disciplines, including machine learning. Tensor decomposition/factorization can solve, among other things, problems in unsupervised learning and in settings with temporal and multirelational data. When we get to handling images for Convolutional Neural Networks, it's a good idea to have an understanding of rank 3 tensors.

We will use the following naming conventions:

  • scalar = just a number = rank 0 tensor ( $a \in F$ )

  • vector = 1D array = rank 1 tensor ( $\mathbf{x} = (x_1, \ldots, x_n)^\top \in F^n$ )

  • matrix = 2D array = rank 2 tensor ( $\textbf{X} = [a_{ij}] \in F^{m \times n}$ )

  • 3D array = rank 3 tensor ( $\mathscr{X} = [t_{ijk}] \in F^{m \times n \times l}$ )

  • $N$D array = rank $N$ tensor ( $\mathscr{T} = [t_{i_1, \ldots, i_N}] \in F^{n_1 \times \cdots \times n_N}$ )

Tensor indexing

We can create subarrays by fixing some of the given tensor's indices. We create a vector by fixing all but one index, and a 2D matrix by fixing all but two indices. For example, for a third order tensor the vectors are $\mathscr{X}[:,j,k]$ (column), $\mathscr{X}[i,:,k]$ (row), and $\mathscr{X}[i,j,:]$ (tube), and the matrices are $\mathscr{X}[:,:,k]$ (frontal), $\mathscr{X}[:,j,:]$ (lateral), and $\mathscr{X}[i,:,:]$ (horizontal).
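Here is a small NumPy illustration of those vectors and matrices, using a made-up rank 3 tensor:

X = np.arange(24).reshape(3, 4, 2)   # a small rank 3 tensor, shape (3, 4, 2)

print(X[:, 1, 0].shape)   # (3,)   -- a column (fix j and k)
print(X[0, :, 1].shape)   # (4,)   -- a row    (fix i and k)
print(X[0, 1, :].shape)   # (2,)   -- a tube   (fix i and j)
print(X[:, :, 0].shape)   # (3, 4) -- a frontal matrix
print(X[:, 1, :].shape)   # (3, 2) -- a lateral matrix
print(X[0, :, :].shape)   # (4, 2) -- a horizontal matrix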

Tensor multiplication

We can multiply one matrix by another as long as the sizes are compatible ( (n × m) × (m × p) = n × p ), and we can also multiply an entire matrix by a constant. NumPy's numpy.dot performs a matrix multiplication, which is straightforward when we have 1D or 2D arrays. But what about arrays with 3 or more dimensions? numpy.dot follows a fixed convention for which axes it contracts; if we want to choose the axes ourselves we should use numpy.tensordot. Again, we do not need tensordot for this class.
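A small sketch of the difference (the arrays here are made up for illustration):

A = np.arange(6).reshape(2, 3)      # a (2 x 3) matrix
B = np.arange(12).reshape(3, 4)     # a (3 x 4) matrix
print(np.dot(A, B).shape)           # (2, 4) -- ordinary matrix product

T = np.arange(24).reshape(2, 3, 4)  # a rank 3 tensor
print(np.dot(T, np.ones(4)).shape)  # (2, 3) -- dot contracts T's last axis
print(np.tensordot(T, B, axes=([1], [0])).shape)  # (2, 4, 4) -- we chose axis 1 of T and axis 0 of B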

In [8]:
# load and show the image
img = mpimg.imread('../figs/Reese_Witherspoon.jpg')
imgplot = plt.imshow(img)

What type is this image? We have a suspicion that it's an array so let's find out its shape.

In [ ]:
type(img), img.shape
In [ ]:
img

It’s a 24-bit RGB JPEG image of shape (height, width, channels), with 8 bits for each of R, G, B. Explore and print the array.

In [ ]:
img[:,:,:3].shape # selects all rows, all columns, and the first 3 channels
In [ ]:
img[130]
In [ ]:
img[130][100].shape
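Tying this back to the tensor slices above, each color channel is a frontal slice of the image tensor (a quick sketch):

red_channel = img[:, :, 0]    # all rows, all columns, channel 0
print(red_channel.shape)      # (height, width)
plt.imshow(red_channel, cmap='gray')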