Title

Exercise: A.1 - Linear and Polynomial Regression with Residual Analysis

Description

The goal of this exercise is to fit a linear regression model and a polynomial regression model to the given data, plot the fitted curves of both models along with the data, and observe what the residuals tell us about the two fits.

Instructions

  • Read the poly.csv file into a dataframe
  • Fit a linear regression model on the entire data, using the LinearRegression() object from the sklearn library
  • Guesstimate the degree of the polynomial which would best fit the data
  • Fit a polynomial regression model on the features computed with PolynomialFeatures, using the LinearRegression() object from the sklearn library
  • Plot the linear and polynomial model predictions along with the data
  • Compute the polynomial and linear model residuals using the formula $\epsilon_i = y_i - \hat{y}_i$
  • Plot the histogram of the residuals and comment on your choice of the polynomial degree.
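
The residual formula can be checked by hand on a few numbers. The sketch below uses made-up values (the names `y_true` and `y_hat` are illustrative and are not taken from poly.csv) and simply subtracts each prediction from the matching observation.

```python
import numpy as np

# Hypothetical observed values and model predictions (illustrative only)
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_hat = np.array([2.5, 5.5, 7.0, 11.0])

# Residual for each observation: observed minus predicted
residuals = y_true - y_hat

print(residuals)
print(residuals.mean())  # -0.125
```

A positive residual means the model under-predicted that point and a negative one means it over-predicted.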

Hints:

df.head() : Returns the first n rows of the dataframe (5 by default), which is useful for a quick look at the data read from the file.

plt.subplots() : Create a figure and a set of subplots.

sklearn.preprocessing.PolynomialFeatures() : Generates a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree.

PolynomialFeatures.fit_transform() : Fits the transformer to X and y with optional parameters fit_params and returns a transformed version of X.

sklearn.linear_model.LinearRegression() : Creates an ordinary least squares linear regression model.

LinearRegression.fit() : Fits the linear model to the training data.

LinearRegression.predict() : Predicts using the fitted linear model.

plt.plot() : Plots x versus y as lines and/or markers.

plt.axvline() : Add a vertical line across the axes.

ax.hist() : Plots a histogram.
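
As a rough illustration of how the sklearn hints above fit together, here is a minimal sketch on toy data. The arrays `x_demo` and `y_demo` are assumptions made up for this example (they follow y = 1 + 2x + 3x^2 exactly) and have nothing to do with poly.csv.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Toy data that follows y = 1 + 2x + 3x^2 exactly (illustrative, not poly.csv)
x_demo = np.array([[0.0], [1.0], [2.0], [3.0]])  # shape (4, 1): one feature column
y_demo = 1 + 2 * x_demo[:, 0] + 3 * x_demo[:, 0] ** 2

# Expand the single column x into the columns [1, x, x^2]
x_demo_poly = PolynomialFeatures(degree=2).fit_transform(x_demo)
print(x_demo_poly.shape)  # (4, 3)

# Ordinary least squares on the expanded features
demo_model = LinearRegression()
demo_model.fit(x_demo_poly, y_demo)
y_demo_pred = demo_model.predict(x_demo_poly)
```

Note that PolynomialFeatures adds a constant column of ones by default while LinearRegression also fits its own intercept. The redundancy does not change the predictions.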

Note: This exercise is auto-graded and you can make multiple attempts.

In [1]:
#import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
%matplotlib inline
In [2]:
# Read the data from 'poly.csv' to a dataframe
df = pd.read_csv('poly.csv')
# Get the column values for x & y in numpy arrays
x = df[['x']].___
y = df['y'].___
In [ ]:
# Take a quick look at the dataframe
df.head()
In [ ]:
# Plot x & y to visually inspect the data

fig, ax = plt.subplots()
ax.plot(x,y,'x')
ax.set_xlabel('$x$ values')
ax.set_ylabel('$y$ values')
ax.set_title('$y$ vs $x$');
In [16]:
# Fit a linear model on the data
model = ____
model.___(___)

# Get the predictions on the entire data using the .predict() function
y_lin_pred = model.predict(___)
In [18]:
### edTest(test_deg) ###
# Now, we try polynomial regression
# GUESS the correct polynomial degree based on the above graph

guess_degree = ___

# Generate polynomial features on the entire data
x_poly= PolynomialFeatures(degree=guess_degree).fit_transform(___)
In [19]:
#Fit a polynomial model on the data, using x_poly as features
polymodel = LinearRegression()
polymodel.fit(_,_)

y_poly_pred = polymodel.predict(___)
In [20]:
# To visualise the results, sort the x values using the helper code below
# It is worth examining and understanding this code

idx = np.argsort(x[:,0])
x = x[idx]

# Use the above index to get the appropriate predicted values for y
# y values corresponding to sorted x
y = y[idx]

#Linear predicted values  
y_lin_pred = y_lin_pred[idx]

#Non-linear predicted values
y_poly_pred= y_poly_pred[idx]
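
To see why the same index array is applied to every array in the helper code, here is a small sketch with made-up values (`x_demo`, `y_demo` and `order` are hypothetical names for this example). np.argsort returns the positions that would sort its input, so indexing each array with those positions reorders them consistently.

```python
import numpy as np

# Hypothetical unsorted inputs: x is a column vector, y holds the matching targets
x_demo = np.array([[3.0], [1.0], [2.0]])
y_demo = np.array([30.0, 10.0, 20.0])

# argsort returns the positions that would sort the first column of x
order = np.argsort(x_demo[:, 0])

# Indexing both arrays with the same positions keeps each (x, y) pair together
x_sorted = x_demo[order]
y_sorted = y_demo[order]

print(order)  # [1 2 0]
```

Sorting matters for the line plots: plt.plot connects points in the order given, so unsorted x values would produce a zig-zag instead of a smooth curve.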
In [ ]:
# First plot x & y values using plt.scatter
plt.scatter(_, _, s=10, label="Data")

# Now, plot the linear regression fit curve
plt.plot(_,_,label="Linear fit")

# Also plot the polynomial regression fit curve
plt.plot(_, _, label="Polynomial fit")

#Assigning labels to the axes
plt.xlabel("x values")
plt.ylabel("y values")
plt.legend()
plt.show()
In [10]:
### edTest(test_poly_predictions) ###
#Calculate the residual values for the polynomial model
poly_residuals = ___
In [ ]:
### edTest(test_linear_predictions) ###
#Calculate the residual values for the linear model
lin_residuals = ___
In [ ]:
#Use the below helper code to plot residual values
#Plot the histograms of the residuals for the two cases

#Distribution of residuals
fig, ax = plt.subplots(1,2, figsize = (10,4))
bins = np.linspace(-20,20,20)
ax[0].set_xlabel('Residuals')
ax[0].set_ylabel('Frequency')

#Plot the histograms for the polynomial regression
ax[0].hist(___, bins,label = ___)

#Plot the histograms for the linear regression
ax[0].hist(___, bins, label = ___)

ax[0].legend(loc = 'upper left')

# Distribution of predicted values with the residuals
ax[1].scatter(y_poly_pred, poly_residuals, s=10)
ax[1].scatter(y_lin_pred, lin_residuals, s = 10 )
ax[1].set_xlim(-75,75)
ax[1].set_xlabel('Predicted values')
ax[1].set_ylabel('Residuals')

fig.suptitle('Residual Analysis (Linear vs Polynomial)');
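
One detail of the helper code worth checking is the binning. np.linspace(-20,20,20) produces 20 bin edges, which means 19 bins, and values outside the outermost edges are not counted. The sketch below shows this with np.histogram on made-up residuals (`demo_residuals` and `demo_bins` are illustrative names), assuming ax.hist bins the data the same way.

```python
import numpy as np

# Same bin edges as the helper code: 20 evenly spaced edges from -20 to 20
demo_bins = np.linspace(-20, 20, 20)
n_bins = len(demo_bins) - 1  # 20 edges give 19 bins

# Made-up residuals; 25.0 lies outside the range [-20, 20]
demo_residuals = np.array([-3.2, 0.4, 1.1, 7.9, 25.0])
counts, edges = np.histogram(demo_residuals, bins=demo_bins)

print(n_bins)        # 19
print(counts.sum())  # 4
```

If a model leaves residuals larger than 20 in absolute value, they will be missing from the histogram unless the bin range is widened.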

Question:

Do you think that polynomial degree is appropriate? Experiment with a polynomial of degree 2 and comment on what you observe in the residuals.
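
One possible way to run this kind of experiment is to loop over candidate degrees and compare the spread of the residuals. The sketch below does so on synthetic data. The cubic signal, the noise level and all variable names are assumptions chosen for illustration, so the numbers it prints say nothing about poly.csv.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic cubic data with Gaussian noise (an assumption, not poly.csv)
rng = np.random.default_rng(0)
x_syn = np.linspace(-3, 3, 100).reshape(-1, 1)
y_syn = x_syn[:, 0] ** 3 - 2 * x_syn[:, 0] + rng.normal(scale=1.0, size=100)

# Fit one model per degree and record the standard deviation of its residuals
residual_std = {}
for degree in (1, 2, 3):
    features = PolynomialFeatures(degree=degree).fit_transform(x_syn)
    fitted = LinearRegression().fit(features, y_syn)
    residuals = y_syn - fitted.predict(features)
    residual_std[degree] = residuals.std()

print(residual_std)
```

A degree whose residuals are tightly centred around zero with no visible trend against the predicted values is usually a reasonable choice, while a wide or skewed residual histogram suggests the model is missing structure in the data.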