Key Word(s): multi-collinearity, linear regression
Title¶
Exercise: A.2 - Multi-collinearity vs Model Predictions
Description¶
The goal of this exercise is to see how multi-collinearity can affect the predictions of a model.
For this, perform a multi-linear regression on the given dataset and compare its coefficients with those from simple linear regressions on the individual predictors.
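Concretely, each simple regression fits $y = \beta_0 + \beta_1 x_i$ for one predictor $x_i$ at a time, while the multi-linear regression fits all four predictors jointly, $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4$, so each $\beta_i$ is estimated while holding the other predictors fixed.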
Roadmap¶
- Read the dataset 'colinearity.csv' as a dataframe
- For each predictor variable, create a simple linear regression model with the same response variable
- Compute the coefficient of each model and store the values in a list.
- Fit all predictors using a separate multi-linear regression object
- Compute the coefficients of the multi-linear regression model
- Compare the coefficients of the multi-linear regression model with those of the simple linear regression models.
DISCUSSION: Why do you think the coefficients change and what does it mean?
Hints¶
LinearRegression() : Returns a linear regression object from the sklearn library.
LinearRegression().coef_ : This attribute returns the coefficient(s) of the linear regression object
LinearRegression().fit() : Fits the linear model to the given training data
np.ndarray.reshape() : Returns an ndarray containing the same values in the specified shape (handy for turning a single column into the 2-D array that fit() expects)
Note: This exercise is auto-graded and you can try multiple attempts.
# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from pprint import pprint
%matplotlib inline
# Read the file named "colinearity.csv"
df = pd.read_csv("colinearity.csv")
#Take a quick look at the dataset
df.head()
Creation of Linear Regression Objects¶
# Choose all the predictors as the variable 'X' (note capitalization of X for multiple features)
X = df.drop([___],axis=1)
# Choose the response variable 'y' for y values
y = df.___
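A possible completion, as a sketch only: it assumes the response column in colinearity.csv is literally named 'y', which the blanks above don't confirm.
# Sketch (assumption: the response column is named 'y')
X = df.drop(['y'], axis=1)   # keep only the predictor columns
y = df['y']                  # response values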
### edTest(test_coeff) ###
# Here we create a list that will store the beta values of each simple linear regression model
linear_coef = []

for i in X:
    x = df[[___]]

    # Create a linear regression object
    linreg = ____

    # Fit it with the training values.
    # Remember to choose only one column at a time as the predictor variable
    linreg.fit(___,___)

    # Add the coefficient value of the model to the list
    linear_coef.append(linreg.coef_)
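One way the completed loop could look, as a minimal sketch under the same assumption about column names:
# Sketch: fit one simple regression per predictor
linear_coef = []
for i in X:
    x = df[[i]]                       # double brackets keep x 2-D, as fit() expects
    linreg = LinearRegression()       # a fresh model for each predictor
    linreg.fit(x, y)                  # one predictor at a time
    linear_coef.append(linreg.coef_)  # coef_ has a single entry here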
Multi-Linear Regression using all variables¶
# Here you must do a multi-linear regression with all predictors
# use sklearn library to define a new model 'multi_linear'
multi_linear = ____
# Fit the multi-linear regression on all features and the response
multi_linear.fit(___,___)
# Store the coefficients (plural) of the model in the variable multi_coef
multi_coef = multi_linear.coef_
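And a matching sketch for the joint fit:
# Sketch: a single model fit on all predictors at once
multi_linear = LinearRegression()
multi_linear.fit(X, y)           # all predictor columns together
multi_coef = multi_linear.coef_  # one coefficient per predictor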
Printing the individual $\beta$ values¶
# Run this command to see the beta values of the linear regression models
print('By simple (one-variable) linear regression for each variable:')
for i in range(4):
    pprint(f'Value of beta{i+1} = {linear_coef[i][0]:.2f}')
### edTest(test_multi_coeff) ###
#Now let's compare with the values from the multi-linear regression
print('By multi-linear regression on all variables:')
for i in range(4):
    pprint(f'Value of beta{i+1} = {round(multi_coef[i],2)}')
Why do you think the $\beta$ values are different in the two cases?¶
corrMatrix = df[['x1','x2','x3','x4']].corr()
sns.heatmap(corrMatrix, annot=True)
plt.show()
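When predictors are strongly correlated, they carry overlapping information about the response, so the multi-linear regression has to split that shared explanatory power among them; the resulting coefficients can shrink, grow, or even flip sign relative to the simple one-variable fits. For a numerical companion to the heatmap, the variance inflation factor (VIF) is a standard collinearity diagnostic; a quick sketch, assuming statsmodels is available (it is not required by the exercise or the autograder):
# Sketch: VIF per predictor; values well above ~5-10 suggest strong collinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor
for idx, col in enumerate(X.columns):
    print(f'VIF({col}) = {variance_inflation_factor(X.values, idx):.2f}')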