CS109A Introduction to Data Science
Standard Section 3: Multiple Linear Regression and Polynomial Regression¶
Harvard University
Fall 2019
Instructors: Pavlos Protopapas, Kevin Rader, and Chris Tanner
Section Leaders: Marios Mattheakis, Abhimanyu (Abhi) Vasishth, Robbert (Rob) Struyven
#RUN THIS CELL
import requests
from IPython.core.display import HTML
styles = requests.get("http://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
For this section, our goal is to familiarize you with Multiple Linear Regression. We have learned how to model data with kNN Regression and Simple Linear Regression, and our goal now is to dive deeper into Linear Regression.
Specifically, we will:
- Load in the titanic dataset from seaborn
- Learn a few ways to plot distributions of variables using seaborn
- Learn about different kinds of variables including continuous, categorical and ordinal
- Perform single and multiple linear regression
- Learn about interaction terms
- Understand how to interpret coefficients in linear regression
- Look at polynomial regression
- Understand the assumptions being made in a linear regression model
- (Extra): look at some cool plots to raise your EDA game
# Data and Stats packages
import numpy as np
import pandas as pd
# Visualization packages
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
Extending Linear Regression¶
Working with the Titanic Dataset from Seaborn¶
For our dataset, we'll be using the passenger list from the Titanic, which famously sank in 1912. Let's have a look at the data. Some descriptions of the data are at https://www.kaggle.com/c/titanic/data, and here's how seaborn preprocessed it.
The task is to build a regression model to predict the fare based on the other attributes.
Let's keep a subset of the data, which includes the following variables:
- age
- sex
- class
- embark_town
- alone
- fare (the response variable)
# Load the dataset from seaborn
titanic = sns.load_dataset("titanic")
titanic.head()
# keep only the chosen columns and check for null values
chosen_vars = ['age', 'sex', 'class', 'embark_town', 'alone', 'fare']
titanic = titanic[chosen_vars]
titanic.info()
Exercise: check the datatypes of each column and display the statistics (min, max, mean and any others) for all the numerical columns of the dataset.
## your code here
# %load 'solutions/sol1.py'
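One possible sketch (the loaded solution file may differ):
# datatype of each column
titanic.dtypes
# summary statistics (count, mean, std, min, quartiles, max) for the numerical columns
titanic.describe()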
Exercise: drop all rows with null values from the dataset. Is this always a good idea?
## your code here
# %load 'solutions/sol2.py'
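One possible sketch (the loaded solution may differ); note that dropping rows discards information and can bias the sample if the missingness is not random:
# drop every row that contains at least one null value
titanic = titanic.dropna()
titanic.info()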
Now let us visualize the response variable. A good visualization of the distribution of a variable will enable us to answer three kinds of questions:
- What values are central or typical? (e.g., mean, median, modes)
- What is the typical spread of values around those central values? (e.g., variance/stdev, skewness)
- What are unusual or exceptional values? (e.g., outliers)
fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(24, 6))
ax = ax.ravel()
sns.distplot(titanic['fare'], ax=ax[0])
ax[0].set_title('Seaborn distplot')
ax[0].set_ylabel('Normalized frequencies')
sns.violinplot(x='fare', data=titanic, ax=ax[1])
ax[1].set_title('Seaborn violin plot')
ax[1].set_ylabel('Frequencies')
sns.boxplot(x='fare', data=titanic, ax=ax[2])
ax[2].set_title('Seaborn box plot')
ax[2].set_ylabel('Frequencies')
fig.suptitle('Distribution of fare');
How do we interpret these plots?
Train-Test Split¶
from sklearn.model_selection import train_test_split
titanic_train, titanic_test = train_test_split(titanic, train_size=0.7, random_state=99)
titanic_train = titanic_train.copy()
titanic_test = titanic_test.copy()
print(titanic_train.shape, titanic_test.shape)
Simple one-variable OLS¶
Exercise: You've done this before: make a simple model using the OLS package from the statsmodels library, predicting fare from age using the training data. Name your model model_1 and display its summary.
from statsmodels.api import OLS
import statsmodels.api as sm
# Your code here
# %load 'solutions/sol3.py'
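A possible sketch (sol3.py may differ), regressing fare on age with an intercept:
# simple OLS: fare ~ age on the training set
model_1 = sm.OLS(
    titanic_train['fare'],
    sm.add_constant(titanic_train['age'])
).fit()
model_1.summary()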
Dealing with different kinds of variables¶
In general, you should be able to distinguish between three kinds of variables:
- Continuous variables: such as fare or age.
- Categorical variables: such as sex or alone. There is no inherent ordering between the different values that these variables can take on. These are sometimes called nominal variables. Read more here.
- Ordinal variables: such as class (first > second > third). There is some inherent ordering of the values in the variables, but the values are not continuous either.
Note: While there is some inherent ordering in class, we will be treating it like a categorical variable.
titanic_orig = titanic_train.copy()
Let us now examine the sex column and see the value counts.
titanic_train['sex'].value_counts()
Exercise: Create a column sex_male that is 1 if the passenger is male, 0 if female. The value counts indicate that these are the two options in this particular dataset. Ensure that the datatype is int.
# your code here
# %load 'solutions/sol4.py'
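A possible sketch (sol4.py may differ):
# 1 if the passenger is male, 0 if female, stored as an integer
titanic_train['sex_male'] = (titanic_train['sex'] == 'male').astype(int)
titanic_train['sex_male'].value_counts()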
Do we need a sex_female column, or a sex_others column? Why or why not?
Now, let us look at class in greater detail.
titanic_train['class_Second'] = (titanic_train['class'] == 'Second').astype(int)
titanic_train['class_Third'] = 1 * (titanic_train['class'] == 'Third') # just another way to do it
titanic_train.info()
# This function automates the above:
titanic_train_copy = pd.get_dummies(titanic_train, columns=['sex', 'class'], drop_first=True)
titanic_train_copy.head()
Linear Regression with More Variables¶
Exercise: Fit a linear regression including the new sex and class variables. Name this model model_2. Don't forget the constant!
# your code here
# %load 'solutions/sol5.py'
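A possible sketch (sol5.py may differ), using the sex_male, class_Second, and class_Third columns created above:
# OLS with the continuous and dummy-encoded predictors, plus a constant
model_2 = sm.OLS(
    titanic_train['fare'],
    sm.add_constant(titanic_train[['age', 'sex_male', 'class_Second', 'class_Third']])
).fit()
model_2.summary()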
Interpreting These Results¶
- Which of the predictors do you think are important? Why?
- All else equal, what does being male do to the fare?
Going back to the example from class¶
- What is the interpretation of $\beta_0$ and $\beta_1$?
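Assuming the example from class is the simple model behind model_1, $\text{fare} = \beta_0 + \beta_1 \cdot \text{age} + \epsilon$: $\beta_0$ is the expected fare for a passenger with age 0 (the intercept), and $\beta_1$ is the expected change in fare for each additional year of age.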
Exploring Interactions¶
sns.lmplot(x="age", y="fare", hue="sex", data=titanic_train, size=6)
The slopes seem to be different for male and female. What does that indicate?
Let us now try to add an interaction effect into our model.
# It seemed like gender interacted with age and class. Can we put that in our model?
titanic_train['sex_male_X_age'] = titanic_train['age'] * titanic_train['sex_male']
model_3 = sm.OLS(
titanic_train['fare'],
sm.add_constant(titanic_train[['age', 'sex_male', 'class_Second', 'class_Third', 'sex_male_X_age']])
).fit()
model_3.summary()
What happened to the age and sex_male terms?
# It seemed like gender interacted with age and class. Can we put that in our model?
titanic_train['sex_male_X_class_Second'] = titanic_train['sex_male'] * titanic_train['class_Second']
titanic_train['sex_male_X_class_Third'] = titanic_train['sex_male'] * titanic_train['class_Third']
model_4 = sm.OLS(
titanic_train['fare'],
sm.add_constant(titanic_train[['age', 'sex_male', 'class_Second', 'class_Third', 'sex_male_X_age',
'sex_male_X_class_Second', 'sex_male_X_class_Third']])
).fit()
model_4.summary()
Polynomial Regression¶
Perhaps we now believe that the fare also depends on the square of age. How would we include this term in our model?
fig, ax = plt.subplots(figsize=(12,6))
ax.plot(titanic_train['age'], titanic_train['fare'], 'o')
x = np.linspace(0,80,100)
ax.plot(x, x, '-', label=r'$y=x$')
ax.plot(x, 0.04*x**2, '-', label=r'$y=c x^2$')
ax.set_title('Plotting Age (x) vs Fare (y)')
ax.set_xlabel('Age (x)')
ax.set_ylabel('Fare (y)')
ax.legend();
Exercise: Create a model that predicts fare from all the predictors in model_4 plus the square of age. Show the summary of this model. Call it model_5. Remember to use the training data, titanic_train.
# your code here
# %load 'solutions/sol6.py'
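A possible sketch (sol6.py may differ); the squared term is stored in a new column, assumed here to be named age^2 to match the test-set feature created later:
# add the squared-age term (name chosen to match titanic_test['age^2'] below) and refit with all of model_4's predictors
titanic_train['age^2'] = titanic_train['age'] ** 2
model_5 = sm.OLS(
    titanic_train['fare'],
    sm.add_constant(titanic_train[['age', 'sex_male', 'class_Second', 'class_Third', 'sex_male_X_age',
                                   'sex_male_X_class_Second', 'sex_male_X_class_Third', 'age^2']])
).fit()
model_5.summary()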
Looking at All Our Models: Model Selection¶
What has happened to the $R^2$ as we added more features? Does this mean that the model is better? What if we kept adding more predictors and interaction terms? In general, how should we choose a model? We will spend a lot more time on model selection and learn about ways to do so as the course progresses.
models = [model_1, model_2, model_3, model_4, model_5]
fig, ax = plt.subplots(figsize=(12,6))
ax.plot([model.df_model for model in models], [model.rsquared for model in models], 'x-')
ax.set_xlabel("Model degrees of freedom")
ax.set_title('Model degrees of freedom vs training $R^2$')
ax.set_ylabel("$R^2$");
What about the test data?
We added a lot of columns to our training data and must add the same to our test data in order to calculate $R^2$ scores.
# Added features for model 1
# Nothing new to be added
# Added features for model 2
titanic_test = pd.get_dummies(titanic_test, columns=['sex', 'class'], drop_first=True)
# Added features for model 3
titanic_test['sex_male_X_age'] = titanic_test['age'] * titanic_test['sex_male']
# Added features for model 4
titanic_test['sex_male_X_class_Second'] = titanic_test['sex_male'] * titanic_test['class_Second']
titanic_test['sex_male_X_class_Third'] = titanic_test['sex_male'] * titanic_test['class_Third']
# Added features for model 5
titanic_test['age^2'] = titanic_test['age'] **2
Calculating $R^2$ scores
from sklearn.metrics import r2_score
r2_scores = []
y_preds = []
y_true = titanic_test['fare']
# model 1
y_preds.append(model_1.predict(sm.add_constant(titanic_test['age'])))
# model 2
y_preds.append(model_2.predict(sm.add_constant(titanic_test[['age', 'sex_male', 'class_Second', 'class_Third']])))
# model 3
y_preds.append(model_3.predict(sm.add_constant(titanic_test[['age', 'sex_male', 'class_Second', 'class_Third',
'sex_male_X_age']])))
# model 4
y_preds.append(model_4.predict(sm.add_constant(titanic_test[['age', 'sex_male', 'class_Second', 'class_Third',
'sex_male_X_age', 'sex_male_X_class_Second',
'sex_male_X_class_Third']])))
# model 5
y_preds.append(model_5.predict(sm.add_constant(titanic_test[['age', 'sex_male', 'class_Second',
'class_Third', 'sex_male_X_age',
'sex_male_X_class_Second',
'sex_male_X_class_Third', 'age^2']])))
for y_pred in y_preds:
r2_scores.append(r2_score(y_true, y_pred))
models = [model_1, model_2, model_3, model_4, model_5]
fig, ax = plt.subplots(figsize=(12,6))
ax.plot([model.df_model for model in models], r2_scores, 'x-')
ax.set_xlabel("Model degrees of freedom")
ax.set_title('Model degrees of freedom vs test $R^2$')
ax.set_ylabel("$R^2$");
Regression Assumptions. Should We Even Regress Linearly?¶
Question: What are the assumptions of a linear regression model?
The answer lies in a closer examination of $\epsilon$ in the model $Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \epsilon$. What is $\epsilon$? It is assumed that $\epsilon$ is normally distributed with a mean of 0 and variance $\sigma^2$. But what does this tell us?
- Assumption 1: Constant variance of $\epsilon$ errors. This means that if we plot our residuals, which are the differences between the true $Y$ and our predicted $\hat{Y}$, they should look like they have constant variance and a mean of 0. We will show this in our plots.
- Assumption 2: Independence of $\epsilon$ errors. This again comes from the distribution of $\epsilon$ that we decide beforehand.
- Assumption 3: Linearity. This is an implicit assumption as we claim that Y can be modeled through a linear combination of the predictors. Important Note: Even though our predictors, for instance $X_2$, can be created by squaring or cubing another variable, we still use them in a linear equation as shown above, which is why polynomial regression is still a linear model.
- Assumption 4: Normality. We assume that the $\epsilon$ is normally distributed, and we can show this in a histogram of the residuals.
Exercise: Calculate the residuals for model 5, our most recent model. Optionally, plot these residuals and a histogram of them, and check the assumptions of the model.
# your code here
# %load 'solutions/sol7.py'
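A possible sketch (sol7.py may differ), using the fitted values stored on the statsmodels results object:
# residuals = observed fare minus fitted fare on the training data (also available as model_5.resid)
residuals = titanic_train['fare'] - model_5.fittedvalues
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(16, 5))
# residuals vs fitted values: look for a mean of 0 and roughly constant spread
ax[0].plot(model_5.fittedvalues, residuals, 'o', alpha=0.5)
ax[0].axhline(0, color='k', linestyle='--')
ax[0].set_xlabel('Fitted fare')
ax[0].set_ylabel('Residual')
ax[0].set_title('Residuals vs fitted values')
# histogram of residuals: look for an approximately normal shape centered at 0
ax[1].hist(residuals, bins=30)
ax[1].set_xlabel('Residual')
ax[1].set_title('Histogram of residuals');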
What can you say about the assumptions of the model?
Extra: Visual exploration of predictors' correlations¶
The dataset for this problem contains 10 simulated predictors and a response variable.
# read in the data
data = pd.read_csv('../data/dataset3.txt')
data.head()
# this effect can be replicated using the scatter_matrix function in pandas plotting
sns.pairplot(data);
Predictors x1, x2, x3 seem to be perfectly correlated while predictors x4, x5, x6, x7 seem correlated.
data.corr()
sns.heatmap(data.corr())
Extra: A Handy Matplotlib Guide¶
source: http://matplotlib.org/faq/usage_faq.html
See also this matplotlib tutorial.
See also this violin plot tutorial.