Title

Exercise 1: Bias-Variance Tradeoff

Description

The aim of this exercise is to understand the bias-variance tradeoff. To do this, you will fit polynomial regression models of different degrees to the same data and plot the resulting fits as shown below.

Instructions:

  • Read the file noisypopulation.csv as a pandas dataframe.
  • Assign the response and predictor variables appropriately.
  • Perform the bootstrap operation on the dataset.
  • For each bootstrap:
    • Select a set of random points from the data.
    • Compute the polynomial features of the chosen degree for the selected points.
    • Fit the model on the selected points.
    • Predict on the entire dataset and store the predicted values as a list.
  • Plot the predicted values along with the random data points and the true function as given above.

Hints:

sklearn.preprocessing.PolynomialFeatures() : Generates polynomial and interaction features

sklearn.linear_model.LinearRegression() : Fits a linear regression model

LinearRegression.fit() : Fits the linear model to the training data

LinearRegression.predict() : Predicts using the fitted linear model
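
For instance, a minimal sketch of how these pieces fit together (the toy arrays here are made up purely for illustration):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

x_demo = np.array([0.0, 0.5, 1.0])   # toy predictor values
y_demo = np.array([0.1, 0.4, 0.9])   # toy response values

# Columns of X_demo are 1, x, x^2
X_demo = PolynomialFeatures(degree=2).fit_transform(x_demo.reshape(-1, 1))
model = LinearRegression(fit_intercept=False)
model.fit(X_demo, y_demo)
preds = model.predict(X_demo)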

Note: This exercise is auto-graded and you can try multiple attempts.

In [1]:
#Importing libraries
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
In [2]:
#Helper function to set plot characteristics

def make_plot():
    fig, axes = plt.subplots(figsize=(20,8), nrows=1, ncols=2)
    axes[0].set_ylabel("$p_R$", fontsize=18)
    axes[0].set_xlabel("$x$", fontsize=18)
    axes[1].set_xlabel("$x$", fontsize=18)
    axes[1].set_yticklabels([])
    axes[0].set_ylim([0,1])
    axes[1].set_ylim([0,1])
    axes[0].set_xlim([0,1])
    axes[1].set_xlim([0,1])
    plt.tight_layout()
    return axes
In [3]:
# Reading the file into a dataframe

df = pd.read_csv("noisypopulation.csv")
df.head()
Out[3]:
          f     x         y
0  0.047790  0.00  0.011307
1  0.051199  0.01  0.010000
2  0.054799  0.02  0.007237
3  0.058596  0.03  0.000056
4  0.062597  0.04  0.010000
In [34]:
### edTest(test_data) ###

# Set the predictor and response variable
# Column x is the predictor and column y is the response variable.
# Column f holds the true function values for the given data
# Select the values of the columns

x = df.___
y = df.___
f = df.___
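
These blanks are yours to fill; one plausible completion, assuming the later cells expect NumPy arrays (they use x.shape[0] and x.reshape):

x = df.x.values
y = df.y.values
f = df.f.values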
In [36]:
### edTest(test_poly) ###
# Function to compute the Polynomial Features for the data x for the given degree d
def polyshape(d, x):
    return PolynomialFeatures(___).fit_transform(___.reshape(-1,1))
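
A possible completion of the blanks (pass the degree d and reshape x into a column vector, as the calls later in the notebook suggest):

def polyshape(d, x):
    return PolynomialFeatures(d).fit_transform(x.reshape(-1,1))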
In [37]:
### edTest(test_linear) ###
#Function to fit a Linear Regression model 
def make_predict_with_model(x, y, x_pred):
    
    #Create a Linear Regression model with fit_intercept as False
    lreg = ___
    
    #Fit the model to the data x and y
    lreg.fit(___, ___)
    
    #Predict on the x_pred data
    y_pred = lreg.predict(___)
    return lreg, y_pred
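
Following the comments in the cell, the blanks would plausibly be filled as:

lreg = LinearRegression(fit_intercept=False)
lreg.fit(x, y)
y_pred = lreg.predict(x_pred)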
In [38]:
#Function to perform bootstrap and fit the data

# degree is the maximum degree of the model
# num_boot is the number of bootstraps
# size is the number of random points selected from the data for each bootstrap
# x is the predictor variable
# y is the response variable

def gen(degree, num_boot, size, x, y):
    
    #Create 2 lists to store the prediction and model
    predicted_values, linear_models =[], []
    
    # Run the loop once for each bootstrap
    for i in range(num_boot):
        
        # Helper code to select `size` random points from the data (sorted for plotting)
        indexes=np.sort(np.random.choice(x.shape[0], size=size, replace=False))
        
        # Fit on the sampled points and predict over the full data;
        # lreg and y_pred hold the model and predicted values of the current bootstrap
        lreg, y_pred = make_predict_with_model(polyshape(degree, x[indexes]), y[indexes], polyshape(degree, x))
        
        #Append the model and predicted values into the appropriate lists
        predicted_values.append(___)
        linear_models.append(___)
    
    #Return the 2 lists
    return predicted_values, linear_models
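
One way to fill the two append blanks, matching the comments and the return order:

predicted_values.append(y_pred)
linear_models.append(lreg)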
In [39]:
### edTest(test_gen) ###
# Call the function gen twice with x and y as the predictor and response variable respectively
# The number of bootstraps should be 200 and the number of samples from the dataset should be 30
# Store the return values in appropriate variables

# Get results for degree 1
predicted_1, model_1 = gen(___);

# Get results for degree 100
predicted_100, model_100 = gen(___);
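
Given the signature gen(degree, num_boot, size, x, y), the two calls would plausibly read:

predicted_1, model_1 = gen(1, 200, 30, x, y)
predicted_100, model_100 = gen(100, 200, 30, x, y)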
In [ ]:
#Helper code to plot the data

indexes=np.sort(np.random.choice(x.shape[0], size=30, replace=False))

# make_plot() creates its own figure, so a separate plt.figure call is not needed
axes = make_plot()

#Plot for Degree 1
axes[0].plot(x,f,label="f", color='darkblue',linewidth=4)
axes[0].plot(x, y, '.', label="Population y", color='#009193',markersize=8)
axes[0].plot(x[indexes], y[indexes], 's', color='black', label="Data y")

for p in predicted_1[:-1]:
    axes[0].plot(x, p, alpha=0.03, color='#FF9300')
axes[0].plot(x, predicted_1[-1], alpha=0.3,color='#FF9300',label="Degree 1 from different samples")


#Plot for Degree 100
axes[1].plot(x,f,label="f", color='darkblue',linewidth=4)
axes[1].plot(x, y, '.', label="Population y", color='#009193',markersize=8)
axes[1].plot(x[indexes], y[indexes], 's', color='black', label="Data y")


for p in predicted_100[:-1]:
    axes[1].plot(x, p, alpha=0.03, color='#FF9300')
axes[1].plot(x,predicted_100[-1],alpha=0.2,color='#FF9300',label="Degree 100 from different samples")

axes[0].legend(loc='best')
axes[1].legend(loc='best')

plt.show()

After you mark the exercise, run the code again, but this time with degree 10 instead of 100. Do you see a decrease in variance? Why are the edges still so erratic?
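
For the rerun, a one-line change with the same bootstrap settings should suffice:

predicted_10, model_10 = gen(10, 200, 30, x, y)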

Your answer here

Also check the values of the coefficients for each of your runs. Do you see a pattern?
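
The fitted models are returned in model_1 and model_100; each scikit-learn model exposes its fitted coefficients through the coef_ attribute, e.g.:

print(model_1[0].coef_)
print(model_100[0].coef_)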

Your answer here