{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Title\n", "\n", "**Exercise: A.2 - Multi-collinearity vs Model Predictions**\n", "\n", "# Description\n", "\n", "The goal of this exercise is to see how multi-collinearity can affect the predictions of a model.\n", "\n", "To do this, fit a multi-linear regression on the given dataset and compare its coefficients with those from simple linear regressions on the individual predictors.\n", "\n", "# Roadmap\n", "- Read the dataset 'colinearity.csv' as a dataframe.\n", "- For each predictor variable, fit a simple linear regression model against the same response variable.\n", "- Compute the coefficient of each model and store it in a list.\n", "- Fit all predictors together using a separate multi-linear regression object.\n", "- Retrieve the coefficients of the multi-linear regression model.\n", "- Compare the coefficients of the multi-linear regression model with those of the simple linear regression models.\n", "\n", "**DISCUSSION:** Why do you think the coefficients change, and what does it mean? 
\n", "\n", "# Hints\n", "\n", "LinearRegression() : Returns a linear regression object from the sklearn library.\n", "\n", "LinearRegression().coef_ : This attribute holds the coefficient(s) of the fitted linear regression object\n", "\n", "LinearRegression().fit() : Fits the linear model to the given predictors and response\n", "\n", "df.values.reshape() : Returns an ndarray with the values in the specified shape\n", "\n", "Note: This exercise is **auto-graded, and you may make multiple attempts.**" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Import libraries\n", "import pandas as pd\n", "import numpy as np\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "from sklearn.linear_model import LinearRegression\n", "from pprint import pprint\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Read the file named \"colinearity.csv\"\n", "\n", "df = pd.read_csv(\"colinearity.csv\")" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
 "<table border=\"1\" class=\"dataframe\">\n",
 " <thead>\n",
 " <tr style=\"text-align: right;\"><th></th><th>x1</th><th>x2</th><th>x3</th><th>x4</th><th>y</th></tr>\n",
 " </thead>\n",
 " <tbody>\n",
 " <tr><th>0</th><td>-1.109823</td><td>-1.172554</td><td>-0.897949</td><td>-6.572526</td><td>-158.193913</td></tr>\n",
 " <tr><th>1</th><td>0.288381</td><td>0.360526</td><td>2.298690</td><td>3.884887</td><td>198.312926</td></tr>\n",
 " <tr><th>2</th><td>-1.059194</td><td>0.833067</td><td>0.285517</td><td>-1.225931</td><td>12.152087</td></tr>\n",
 " <tr><th>3</th><td>0.226017</td><td>1.979367</td><td>0.744038</td><td>5.380823</td><td>190.281938</td></tr>\n",
 " <tr><th>4</th><td>0.664165</td><td>-1.373739</td><td>0.317570</td><td>-0.437413</td><td>-72.681681</td></tr>\n",
 " </tbody>\n",
 "</table>\n",
 "</div>" ], "text/plain": [ " x1 x2 x3 x4 y\n", "0 -1.109823 -1.172554 -0.897949 -6.572526 -158.193913\n", "1 0.288381 0.360526 2.298690 3.884887 198.312926\n", "2 -1.059194 0.833067 0.285517 -1.225931 12.152087\n", "3 0.226017 1.979367 0.744038 5.380823 190.281938\n", "4 0.664165 -1.373739 0.317570 -0.437413 -72.681681" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Take a quick look at the dataset\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Creation of Linear Regression Objects" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# Choose all the predictors as the variable 'X' (note the capitalization of X for multiple features)\n", "\n", "X = df.drop([___],axis=1)\n", "\n", "# Choose the response variable 'y' for the y values\n", "\n", "y = df.___" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "### edTest(test_coeff) ###\n", "\n", "# Here we create a list that will store the beta values of each simple linear regression model\n", "linear_coef = []\n", "\n", "for i in X:\n", " \n", " x = df[[___]]\n", "\n", " # Create a linear regression object\n", " linreg = ____\n", "\n", " # Fit it with training values. 
\n", " # Remember to choose only one column at a time as the predictor variable\n", " linreg.fit(___,___)\n", " \n", " # Add the coefficient value of the model to the list\n", " linear_coef.append(linreg.coef_)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multi-Linear Regression using all variables" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# Here you must perform a multi-linear regression with all predictors\n", "\n", "# Use the sklearn library to define a new model 'multi_linear'\n", "multi_linear = ____\n", "\n", "# Fit the multi-linear regression on all features and the response\n", "\n", "multi_linear.fit(___,___)\n", "\n", "# Store the coefficients (plural) of the model in the variable multi_coef\n", "\n", "multi_coef = multi_linear.coef_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Printing the individual $\\beta$ values" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "By simple(one variable) linear regression for each variable:\n", "'Value of beta1 = 34.73'\n", "'Value of beta2 = 68.63'\n", "'Value of beta3 = 59.40'\n", "'Value of beta4 = 20.92'\n" ] } ], "source": [ "# Run this cell to see the beta values of the simple linear regression models\n", "\n", "print('By simple(one variable) linear regression for each variable:', sep = '\\n')\n", "\n", "for i in range(4):\n", " \n", " pprint(f'Value of beta{i+1} = {linear_coef[i][0]:.2f}')" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "By multi-Linear regression on all variables\n", "'Value of beta1 = -24.61'\n", "'Value of beta2 = 27.72'\n", "'Value of beta3 = 37.67'\n", "'Value of beta4 = 19.27'\n" ] } ], "source": [ "### edTest(test_multi_coeff) ###\n", "\n", "# Now let's compare with the values from the multi-linear regression\n", "print('By multi-Linear 
regression on all variables')\n", "for i in range(4):\n", " pprint(f'Value of beta{i+1} = {round(multi_coef[i],2)}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Why do you think the $\\beta$ values are different in the two cases?" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "corrMatrix = df[['x1','x2','x3','x4']].corr() \n", "sns.heatmap(corrMatrix, annot=True) \n", "plt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }