Title:
Exercise: Multi-collinearity vs Model Predictions
Description:
The goal of this exercise is to see how multi-collinearity can affect a model's predictions.
To do this, perform a multi-linear regression on the given dataset and compare its coefficients with those from simple linear regressions on the individual predictors.
Data Description:
The dataset colinearity.csv contains four predictor variables, x1 through x4, and a single response variable.
Instructions:
- Read the dataset colinearity.csv as a dataframe.
- For each predictor variable, create a simple linear regression model with the same response variable.
- Compute the coefficient of each model and store them in a list.
- Fit all predictors together using a separate multi-linear regression object.
- Calculate the coefficients of this model.
- Compare the coefficients of the multi-linear regression model with those of the simple linear regression models (the synthetic sketch below shows the effect you should expect).
DISCUSSION: Why do you think the coefficients change, and what does that mean?
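Before starting, it may help to see the effect in isolation. Below is a minimal, ungraded sketch on synthetic data (all names and numbers here are made up for illustration and are not taken from colinearity.csv): two nearly identical predictors are generated, and the simple and multi-linear coefficients are compared.

# Illustrative sketch on synthetic data -- not part of the graded exercise
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)      # x2 is almost a copy of x1
y = 2 * x1 + 3 * x2 + rng.normal(size=n)

# Simple regressions: each predictor soaks up the other's effect (~5)
for col in (x1, x2):
    print(LinearRegression().fit(col.reshape(-1, 1), y).coef_)

# Multi-linear regression: the effect is split between the two (~2 and ~3)
print(LinearRegression().fit(np.column_stack([x1, x2]), y).coef_)

Because x1 and x2 carry almost the same information, each simple regression attributes their combined effect to a single variable, while the joint fit divides it between them; the exercise below reproduces this with the given dataset.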
Hints:
pd.read_csv(filename) Returns a pandas dataframe containing the data and labels from the file.
pd.DataFrame.drop() Drops specified labels from rows or columns.
sklearn.linear_model.LinearRegression Returns a linear regression object from the sklearn library.
sklearn.linear_model.LinearRegression.coef_ This attribute returns the coefficient(s) of the linear regression object.
sklearn.linear_model.LinearRegression.fit() Fits the linear model to the data.
np.ndarray.reshape() Returns an np.ndarray with the values in the specified shape.
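On the reshape hint: sklearn's fit() expects the predictors as a 2-D array. Selecting a column with double brackets already yields a 2-D dataframe, so reshaping is just one of two equivalent options (the column name 'x1' below is only for illustration):

# Two equivalent ways to build a 2-D predictor matrix for sklearn
x_2d = df[['x1']]                      # one-column dataframe, shape (n, 1)
x_2d = df['x1'].values.reshape(-1, 1)  # 1-D series -> (n, 1) ndarray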
Note: This exercise is auto-graded and you can try multiple attempts.
# Import necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
from pprint import pprint
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
%matplotlib inline
# Read the file named "colinearity.csv" into a Pandas dataframe
df = pd.read_csv(___)
# Take a quick look at the dataset
df.head()
Creation of Linear Regression Objects
# Choose all the predictors as the variable 'X' (note capitalization of X for multiple features)
X = df.drop([___],axis=1)
# Choose the response variable 'y'
y = df.___
### edTest(test_coeff) ###
# Initialize a list to store the beta values for each linear regression model
linear_coef = []
# Loop over all the predictors
# In each loop "i" holds the name of the predictor
for i in X:
    # Set the current predictor as the variable x
    x = df[[___]]

    # Create a linear regression object
    linreg = ___

    # Fit the model with the training data
    # Remember to choose only one column at a time, i.e. given by x
    linreg.fit(___,___)

    # Add the coefficient value of the model to the list
    linear_coef.append(linreg.coef_)
Multi-Linear Regression using all variables
# Perform multi-linear regression with all predictors
multi_linear = LinearRegression()
# Fit the multi-linear regression on all features of the entire data
multi_linear.fit(___,___)
# Get the coefficients (plural) of the model
multi_coef = multi_linear.coef_
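An optional, ungraded sanity check once the blanks above are filled in: coef_ returns one value per feature, in the column order of X, so the coefficients can be paired with their predictor names explicitly.

# Pair each coefficient with its predictor (coef_ follows the column order of X)
for name, beta in zip(X.columns, multi_linear.coef_):
    print(f'{name}: beta = {beta:.2f}')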
Printing the individual $\beta$ values
# Helper code to see the beta values of the simple linear regression models
print('By simple (one variable) linear regression for each variable:', sep = '\n')
for i in range(4):
    pprint(f'Value of beta{i+1} = {linear_coef[i][0]:.2f}')
### edTest(test_multi_coeff) ###
# Helper code to compare with the values from the multi-linear regression
print('By multi-linear regression on all variables:')
for i in range(4):
    pprint(f'Value of beta{i+1} = {round(multi_coef[i],2)}')
⏸ Why do you think the $\beta$ values are different in the two cases?
A. Because the random seed selected is not as random as we would imagine.
B. Because of collinearity between $x_1$ and $x_4$
C. Because multi-linear regression is not a stable model
D. Because of the measurement error in the data
### edTest(test_chow1) ###
# Submit an answer choice as a string below
# (Eg. if you choose option C, put 'C')
answer1 = '___'
# Helper code to visualize the heatmap of the correlation matrix
corrMatrix = df[['x1','x2','x3','x4']].corr()
sns.heatmap(corrMatrix, annot=True)
plt.show()
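If the heatmap points to a strongly correlated pair, you can also read the exact value off the correlation matrix; for example, the sketch below checks the pair singled out in option B (using the predictor names from the heatmap cell):

# Exact correlation between a specific pair of predictors
print(corrMatrix.loc['x1', 'x4'])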