Title

Exercise: Back-propagation by chain rule

Description

The aim of this exercise is to understand how the chain rule works. We will continue using the simple neural network from the previous exercise.

Instructions:

  • Get the response and predictor variables from the backprop.csv file.
  • Visualise the data generated.
  • For the given simple neural network, write a function that computes the gradient of the loss function with respect to the weights.
  • To do this, compute the partial derivatives using individual functions. Refer to the instructions in the scaffold.

Hints:

The partial derivatives of the loss function $L$ with respect to $w_2$ and $w_1$ can be expressed as:

$$\frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial y}\,\frac{\partial y}{\partial a_2}\,\frac{\partial a_2}{\partial w_2}$$

$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial y}\,\frac{\partial y}{\partial a_2}\,\frac{\partial a_2}{\partial h_1}\,\frac{\partial h_1}{\partial a_1}\,\frac{\partial a_1}{\partial w_1}$$
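As a check on your paper-and-pencil work, and assuming the sin() activation and MSE loss carried over from the previous exercise (as the comments and docstrings in the scaffold below indicate), each factor in these chains works out to:

$$\frac{\partial L}{\partial y} = 2\,(y_{pred} - y_{true}),\qquad \frac{\partial y}{\partial a_2} = \cos(a_2),\qquad \frac{\partial a_2}{\partial w_2} = h_1 = \sin(a_1)$$

$$\frac{\partial a_2}{\partial h_1} = w_2,\qquad \frac{\partial h_1}{\partial a_1} = \cos(a_1),\qquad \frac{\partial a_1}{\partial w_1} = x$$

The scaffold averages these per-point products at the end, which takes care of the mean in the MSE.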

np.cos() : Returns cosine element-wise

np.sin() : Returns sine element-wise

np.exp() : Calculates the exponential of all elements in the input array.

NOTE - In this exercise, we expect you to take out a piece of paper and do the back-propagation using the chain rule by hand.

In [1]:
# Import the necessary libraries

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
In [48]:
# Read the file 'backprop.csv'

df = pd.read_csv('backprop.csv')
In [49]:
# Get the predictor and response variables from the dataframe
x = df.x.values.reshape(-1,1)
y = df.y.values
In [50]:
# Initialize the weights and use the same random seed as the previous exercise i.e. 310
np.random.seed(310)
W = [np.random.randn(1, 1), np.random.randn(1, 1)]
In [25]:
# Function to compute the activation function
def A(x):
    return ___

# Function to compute the derivative of the activation function
def der_A(x):
    return ___
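If you want to check your work after filling in the blanks, a minimal sketch, assuming the sin() activation referred to in the network comments below, would be:

# Sketch (assumes the sin() activation from the previous exercise)
def A(x):
    return np.sin(x)

def der_A(x):
    # The derivative of sin is cos
    return np.cos(x)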
In [47]:
# Define the simple neural network used in the previous exercise

def neural_network(W, x):

    # Compute the first affine transformation
    a1 = np.dot(x, W[0])

    # Apply the sin() activation function
    fa1 = A(a1)

    # Compute the second affine transformation
    a2 = np.dot(fa1, W[1])

    # Apply the sin() activation function to get the output
    y = A(a2)

    return a1, a2, y
In [51]:
# Use the helper code below to plot the true data and the predictions of your neural network

fig,ax = plt.subplots(1,1,figsize=(8,6))
ax.plot(x,y,label = 'True Function',color='darkblue',linewidth=2)
ax.plot(x,neural_network(W,x)[2],label = 'Neural Network Predictions',color='#9FC131FF',linewidth=2)
ax.set_xlabel('$x$',fontsize=14)
ax.set_ylabel('$y$',fontsize=14)
ax.legend(fontsize=14, loc='best');
In [26]:
# Function to compute the partial derivative of 'a' (for a particular neuron) with respect to the corresponding weight w

def dadw(x,firstweight=0):
    '''
    The derivative of 'a' wrt the preceding weight is just the activation of the previous neuron.
    Note: account for the case where the input layer has no activation associated with it, i.e., return x if it is the first weight.
    '''
    if firstweight == 1:
        return ___
    return ___
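One way the body of dadw could look, following the docstring's reasoning (a sketch, not the only valid arrangement):

# Sketch: the input layer has no activation, so the derivative wrt the first weight is x itself;
# otherwise it is the activation of the incoming affine value
def dadw(x, firstweight=0):
    if firstweight == 1:
        return x
    return A(x)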
In [27]:
# Function to compute the partial derivative of h with respect to a

def dhda(a):
    '''
    This is the derivative of the output of the activation function wrt the affine transformation.
    Return the derivative of the activation of the affine transformation
    '''
    
    return ___
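Since h is the activation applied to the affine a, a sketch of this blank is simply the derivative of the activation:

# Sketch: derivative of the activation, evaluated at the affine value
def dhda(a):
    return der_A(a)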
In [ ]:
# Function to compute the partial derivative of y with respect to a

def dyda(a):
    '''
    This is the derivative of the output of the neural network wrt the affine transformation.
    Return the derivative of the activation of the affine transformation
    '''
    
    return ___
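The output y is produced by the same activation, so this sketch mirrors dhda:

# Sketch: the output activation is the same, so its derivative has the same form
def dyda(a):
    return der_A(a)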
In [ ]:
# Function to compute the partial derivative of a with respect to h
def dadh(w):
    
    return ___
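Because the second affine is a = h * w, its derivative with respect to h is just the weight, so a sketch is:

# Sketch: a = h * w, so the derivative wrt h is the weight itself
def dadh(w):
    return w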
In [28]:
# Function to compute the partial derivative of the loss with respect to y
def dldy(y_pred,y):
    '''
    Since our loss function is the MSE,
    the partial derivative of L wrt y is 2*(y_pred - y), for all predictions and responses
    '''
    
    return ___
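Following the docstring directly, a sketch of the MSE derivative is:

# Sketch: derivative of the squared error wrt the prediction
def dldy(y_pred, y):
    return 2 * (y_pred - y)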
In [46]:
# Function to compute the partial derivative of the loss with respect to the weights

def dldw(W,x):

    '''
    Now combine the functions from above to find the derivatives wrt the weights.
    These are computed for all the points, so take the mean of the values for each partial derivative and return them as a list of 2 values
    '''
    dldw2 = ____
    dldw1 = ____
    
    return [np.mean(dldw2),np.mean(dldw1)]
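Putting the pieces together, a possible sketch of dldw. It runs a forward pass internally and uses the response y loaded above; the true response is reshaped so it lines up with the (n, 1) predictions rather than broadcasting unintentionally:

# Sketch: combine the chain-rule factors and average over all points
def dldw(W, x):
    # Forward pass to get the affines and the prediction
    a1, a2, y_pred = neural_network(W, x)
    # Reshape the true response so it matches the (n, 1) predictions
    err = dldy(y_pred, y.reshape(-1, 1))
    # dL/dw2 = dL/dy * dy/da2 * da2/dw2
    dldw2 = err * dyda(a2) * dadw(a1)
    # dL/dw1 = dL/dy * dy/da2 * da2/dh1 * dh1/da1 * da1/dw1
    dldw1 = err * dyda(a2) * dadh(W[1]) * dhda(a1) * dadw(x, firstweight=1)
    return [np.mean(dldw2), np.mean(dldw1)]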
In [52]:
### edTest(test_gradient) ###

# Get the predicted response, and the two activations of the network
a1, a2, y_pred = neural_network(W,x)

# Compute the gradient of the loss function with respect to the weights using the function defined above
gradW = dldw(W,x)
In [ ]:
# Print the list of your gradients below
print(f'The derivatives of L with respect to w2 and w1 are {gradW}')

Mindchow 🍲

  1. Compare your computed partial derivatives with those from the previous exercise. Are they the same?

  2. This example was just a simple case with one neuron in one hidden layer. How could we generalize this idea to compute the partial derivatives of all the weights?