CS109A Introduction to Data Science

Standard Section 3: Multiple Linear Regression and Polynomial Regression

Harvard University
Fall 2020
Instructors: Pavlos Protopapas, Kevin Rader, and Chris Tanner
Section Leaders: Marios Mattheakis, Sean Murphy, Yinyu Ji


In [ ]:
#RUN THIS CELL 
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)

The goal of this section is to familiarize you with Multiple Linear Regression. We have already modeled data with kNN Regression and Simple Linear Regression; now we dive deeper into Linear Regression.

Specifically, we will:

  • Load in the titanic dataset from seaborn
  • Learn a few ways to visualize distributions of variables using seaborn
  • Practice single variable OLS and how to interpret coefficients in linear regression
  • Practice multiple linear regression with interaction terms and polynomial regression terms
  • Learn about bootstrapping to generate confidence intervals
  • Understand the assumptions being made in a linear regression model
  • (Bonus 1): look at some cool plots to raise your exploratory data analysis (EDA) game
  • (Bonus 2): look at some example stats models code that produces equivalent results


In [ ]:
# Data and Stats packages
import numpy as np
import pandas as pd

# Visualization packages
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# Modeling
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

Extending Linear Regression

Working with the Titanic Dataset from Seaborn

For our dataset, we'll be using the passenger list from the Titanic, which famously sank in 1912. Let's have a look at the data. Some descriptions of the data are at https://www.kaggle.com/c/titanic/data, and here's how seaborn preprocessed it.

The task is to build a regression model to predict the fare, based on different attributes.

Let's keep a subset of the data, which includes the following variables:

  • age
  • sex
  • class
  • embark_town
  • alone
  • fare (the response variable)
In [ ]:
# Load the dataset from seaborn 
titanic = sns.load_dataset("titanic")
titanic.head()
In [ ]:
# checking for null values
chosen_vars = ['age', 'sex', 'class', 'embark_town', 'alone', 'fare']
titanic = titanic[chosen_vars]
titanic.info()

Check the datatypes of each column and display the statistics (min, max, mean and any others) for all the numerical columns of the dataset.

In [ ]:
print(titanic.dtypes)
titanic.describe()

Drop all rows with null values from the dataset. Is this always a good idea?

In [ ]:
titanic = titanic.dropna(axis=0)
titanic.info()

Now let us visualize the response variable. A good visualization of the distribution of a variable will enable us to answer three kinds of questions:

  • What values are central or typical? (e.g., mean, median, modes)
  • What is the typical spread of values around those central values? (e.g., variance/stdev, skewness)
  • What are unusual or exceptional values? (e.g., outliers)
In [ ]:
fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(24, 6))
ax = ax.ravel()

sns.distplot(titanic['fare'], ax=ax[0])
ax[0].set_title('Seaborn distplot')
ax[0].set_ylabel('Normalized frequencies')

sns.violinplot(x='fare', data=titanic, ax=ax[1])
ax[1].set_title('Seaborn violin plot')
ax[1].set_ylabel('Frequencies')

sns.boxplot(x='fare', data=titanic, ax=ax[2])
ax[2].set_title('Seaborn box plot')
ax[2].set_ylabel('Frequencies')
fig.suptitle('Distribution of fare');

How do we interpret these plots?

Train-Test Split

Let's split the data into train and test sets.

In [ ]:
titanic_train, titanic_test = train_test_split(titanic, train_size=0.7, random_state=42)
# important for avoiding the infamous SettingWithCopyWarning
titanic_train = titanic_train.copy()
titanic_test = titanic_test.copy()
print(titanic_train.shape, titanic_test.shape)

Simple One-Variable Linear Regression

Now, let's fit a simple model on the training data using LinearRegression from the sklearn library, predicting fare using age.

We'll call this model_1 and take a look at its coefficient and R-squared.

In [ ]:
# assign predictor and response variables
X_train = titanic_train[['age']]
y_train = titanic_train['fare']
X_test = titanic_test[['age']]
y_test = titanic_test['fare']

model_1 = LinearRegression().fit(X_train, y_train)
In [ ]:
# check the slope/coefficient and intercept values
print("Coefficient of the model: ",model_1.coef_)
print("Intercept of the model: ",model_1.intercept_)

# predict on test set
y_test_pred = model_1.predict(X_test)

# get R-squared
print("R-squared of the model:", r2_score(y_test, y_test_pred))

Variable Types

In general, you should be able to distinguish between three kinds of variables:

  1. Continuous variables: such as fare or age
  2. Categorical variables: such as sex or alone. There is no inherent ordering between the different values that these variables can take on. These are sometimes called nominal variables. Read more here.
  3. Ordinal variables: such as class (first > second > third). There is some inherent ordering of the values in the variables, but the values are not continuous either.

Note: While there is some inherent ordering in class, we will be treating it like a categorical variable.
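Although we will treat class as categorical in the models below, pandas can also record its ordering explicitly in case you ever want to exploit it. A minimal sketch (an aside, not used later; the ordering Third < Second < First is assumed):

In [ ]:
# Minimal sketch: mark 'class' as an ordered categorical so comparisons respect the ordering.
class_order = pd.CategoricalDtype(categories=['Third', 'Second', 'First'], ordered=True)
titanic_train['class'].astype(class_order).head()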

Let us now examine the sex column and see the value counts.

In [ ]:
titanic_train['sex'].value_counts()

Create a column sex_male that is 1 if the passenger is male, 0 if female. The value counts indicate that these are the two options in this particular dataset. Ensure that the datatype is int.

In [ ]:
titanic_train['sex_male'] = (titanic_train.sex == 'male').astype(int)
titanic_train['sex_male'].value_counts()

Do we need a sex_female column, or a sex_others column? Why or why not?

Now, let us look at class in greater detail. Let's one hot encode.

One Hot Encoding

In [ ]:
titanic_train['class_Second'] = (titanic_train['class'] == 'Second').astype(int)
titanic_train['class_Third'] = 1 * (titanic_train['class'] == 'Third') # just another way to do it
In [ ]:
# do the same for the test set
titanic_test_orig = titanic_test.copy()

titanic_test = pd.get_dummies(titanic_test, columns=['sex', 'class'], drop_first=True)
titanic_test.head()
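One caveat worth flagging (an addition, not part of the original workflow): when pd.get_dummies is applied separately to two splits, a category that happens to be missing from one split silently changes the resulting columns. A minimal sketch of one defensive pattern, reindexing the test dummies against the training columns:

In [ ]:
# Hypothetical sketch: encode both splits with get_dummies and force the test columns
# to match the training columns, filling any dummy column missing from the test split with 0.
train_enc = pd.get_dummies(titanic_train[chosen_vars], columns=['sex', 'class'], drop_first=True)
test_enc = pd.get_dummies(titanic_test_orig, columns=['sex', 'class'], drop_first=True)
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
test_enc.head()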

Working with binary predictors

With a binary predictor such as sex_male, the fitted model corresponds to two parallel lines: the sex_male coefficient simply shifts the intercept for male passengers relative to female passengers.

Linear Regression with More Variables

Now let's fit a linear regression including the new sex and class variables, calling this model model_2.


In [ ]:
# assign predictor and response variables
X_train = titanic_train[['age', 'sex_male', 'class_Second', 'class_Third']]
y_train = titanic_train['fare']
X_test = titanic_test[['age', 'sex_male', 'class_Second', 'class_Third']]
y_test = titanic_test['fare']

model_2 = LinearRegression().fit(X_train, y_train)
In [ ]:
# check the coefficients and intercept values
print("Coefficients of the model: ",model_2.coef_)
print("Intercept of the model: ",model_2.intercept_)

# predict on test set
y_test_pred = model_2.predict(X_test)

# get R-squared
print("R-squared of the model:", r2_score(y_test, y_test_pred))

Interpreting These Results

  1. Which of the predictors do you think are important? Why?
  2. All else equal, what does being male do to the fare?

Engineering Variables: Exploring Interactions

In [ ]:
sns.lmplot(y = "fare", x= "age", hue = "sex_male", data = titanic_train, height=7.27, aspect=11.7/8.27)
ax = plt.gca()
ax.set_title("Predicting Fare from Age + Age*Sex")
plt.show()

The slopes seem to be different for male and female. What does that indicate?

Let us now try to add an interaction effect into our model.

In [ ]:
# It seemed like sex interacted with age. Can we put that in our model?
titanic_train['sex_male_X_age'] = titanic_train['age'] * titanic_train['sex_male']
titanic_test['sex_male_X_age'] = titanic_test['age'] * titanic_test['sex_male']

X_train = titanic_train[['age', 'sex_male', 'class_Second', 'class_Third', 'sex_male_X_age']]
y_train = titanic_train['fare']
X_test = titanic_test[['age', 'sex_male', 'class_Second', 'class_Third', 'sex_male_X_age']]
y_test = titanic_test['fare']

model_3 = LinearRegression().fit(X_train, y_train)
In [ ]:
# check the coefficients and intercept values
print("Coefficients of the model: ",model_3.coef_)
print("Intercept of the model: ",model_3.intercept_)

# predict on test set
y_test_pred = model_3.predict(X_test)

# get R-squared
print("R-squared of the model:", r2_score(y_test, y_test_pred))
In [ ]:
# It seemed like sex interacted with class. Can we put that in our model?
titanic_train['sex_male_X_class_Second'] = titanic_train['sex_male'] * titanic_train['class_Second']
titanic_train['sex_male_X_class_Third'] = titanic_train['sex_male'] * titanic_train['class_Third']
titanic_test['sex_male_X_class_Second'] = titanic_test['sex_male'] * titanic_test['class_Second']
titanic_test['sex_male_X_class_Third'] = titanic_test['sex_male'] * titanic_test['class_Third']

X_train = titanic_train[['age', 'sex_male', 'class_Second', 'class_Third', 'sex_male_X_age', 
                         'sex_male_X_class_Second', 'sex_male_X_class_Third']]
y_train = titanic_train['fare']
X_test = titanic_test[['age', 'sex_male', 'class_Second', 'class_Third', 'sex_male_X_age',
                      'sex_male_X_class_Second', 'sex_male_X_class_Third']]
y_test = titanic_test['fare']

model_4 = LinearRegression().fit(X_train, y_train)
In [ ]:
# check the coefficients and intercept values
print("Coefficients of the model: ",model_4.coef_)
print("Intercept of the model: ",model_4.intercept_)

# predict on test set
y_test_pred = model_4.predict(X_test)

# get R-squared
print("R-squared of the model:", r2_score(y_test, y_test_pred))

What happened to the age and male terms?

Prepare for Breakout Room 1

Boston Housing Prices Dataset

Data Description:

  • RM: Average number of rooms per dwelling
  • LSTAT: Percent of housing population classified as "lower status"
  • MEDV: Median value of owner-occupied homes in $1000's

MEDV will be the response variable

(For the curious: https://www.kaggle.com/c/boston-housing)

Load and inspect the data

In [ ]:
boston = pd.read_csv('../data/boston_housing.csv')
boston.head()

Inspect data by visualization (we can combine matplotlib with seaborn)

In [ ]:
plt.figure(figsize=[16,4])
plt.subplot(1,3,1)
sns.scatterplot(x="RM", y="MEDV", data=boston)
plt.subplot(1,3,2)
sns.scatterplot(x="LSTAT", y="MEDV", data=boston)
plt.subplot(1,3,3)
sns.scatterplot(x="RM", y="LSTAT", data=boston);

Split the data and perform single variable linear regression with the predictors RM and LSTAT

In [ ]:
boston_train, boston_test = train_test_split(boston, train_size=0.7, random_state=109)
boston_train = boston_train.copy()
boston_test = boston_test.copy()

# Single variable linear regression with RM
model_boston_0a = LinearRegression().fit(boston_train[["RM"]], boston_train.MEDV)
print("R^2 on training set of the model with RM:",  r2_score(boston_train.MEDV, model_boston_0a.predict(boston_train[["RM"]])) )
print("R^2 on testing  set of the model with RM:",  r2_score(boston_test.MEDV, model_boston_0a.predict(boston_test[["RM"]])) )
print('')

# Single variable linear regression with LSTAT
model_boston_0b = LinearRegression().fit(boston_train[["LSTAT"]], boston_train.MEDV)
print("R^2 on training set of the model with LSTAT:",  r2_score(boston_train.MEDV, model_boston_0b.predict(boston_train[["LSTAT"]])) )
print("R^2 on testing set  of the model with LSTAT:",  r2_score(boston_test.MEDV, model_boston_0b.predict(boston_test[["LSTAT"]])) )

# Store R2 on testing for later
R2_0a = r2_score(boston_test.MEDV, model_boston_0a.predict(boston_test[["RM"]])) 
R2_0b = r2_score(boston_test.MEDV, model_boston_0b.predict(boston_test[["LSTAT"]])) 

Time for Breakout Room 1

Goal: Learn about how to include interaction terms in your linear regression model using the sklearn library.

Directions:

1. Build a linear regression model using `RM` and `LSTAT` as predictors.
    - As before, fit the model on the training data.
    - Print the coefficients.
    - Report the R^2 score on the test data. Does the multiple regression model perform better than the two single-predictor models?
2. Build a model with `LSTAT`, `RM`, and an interaction term between `LSTAT` and `RM`
    - Print the coefficients.
    - Does the interaction term improve R^2?

Store the test-set R^2 values for later investigation.
In [ ]:
# %load ../solutions/breakout_1_sol.py
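Here is a sketch of one possible solution (the loaded file is the authoritative version). The names R2_1, R2_1_inter, and model_boston_1_inter are assumptions chosen to match the cells that reference them later in this section.

In [ ]:
# One possible solution sketch (variable names chosen to match later cells).

# 1. Multiple linear regression with LSTAT and RM
model_boston_1 = LinearRegression().fit(boston_train[["LSTAT", "RM"]], boston_train.MEDV)
print("Coefficients:", model_boston_1.coef_)
R2_1 = r2_score(boston_test.MEDV, model_boston_1.predict(boston_test[["LSTAT", "RM"]]))
print("R^2 on testing set with LSTAT and RM:", R2_1)

# 2. Add the interaction term LSTAT*RM
boston_train["LSTAT*RM"] = boston_train["LSTAT"] * boston_train["RM"]
boston_test["LSTAT*RM"] = boston_test["LSTAT"] * boston_test["RM"]
model_boston_1_inter = LinearRegression().fit(boston_train[["LSTAT", "RM", "LSTAT*RM"]], boston_train.MEDV)
print("Coefficients:", model_boston_1_inter.coef_)
R2_1_inter = r2_score(boston_test.MEDV, model_boston_1_inter.predict(boston_test[["LSTAT", "RM", "LSTAT*RM"]]))
print("R^2 on testing set with LSTAT, RM, and LSTAT*RM:", R2_1_inter)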

Engineering Variables: Exploring Polynomial Regression


Perhaps we now believe that the fare also depends on the square of age. How would we include this term in our model?

In [ ]:
fig, ax = plt.subplots(figsize=(12,6))
ax.plot(titanic_train['age'], titanic_train['fare'], 'o')
x = np.linspace(0,80,100)
ax.plot(x, x, '-', label=r'$y=x$')
ax.plot(x, 0.04*x**2, '-', label=r'$y=c x^2$')
ax.set_title('Plotting Age (x) vs Fare (y)')
ax.set_xlabel('Age (x)')
ax.set_ylabel('Fare (y)')
ax.legend();

Create a model that predicts fare from all the predictors in model_4 plus the square of age. Call it model_5, and report its coefficients and R-squared. Remember to use the training data, titanic_train.

In [ ]:
titanic_train['age^2'] = titanic_train['age']**2
titanic_test['age^2'] = titanic_test['age']**2

X_train = titanic_train[['age', 'sex_male', 'class_Second', 'class_Third', 'sex_male_X_age', 
                             'sex_male_X_class_Second', 'sex_male_X_class_Third', 'age^2']]
y_train = titanic_train['fare']
X_test = titanic_test[['age', 'sex_male', 'class_Second', 'class_Third', 'sex_male_X_age', 
                             'sex_male_X_class_Second', 'sex_male_X_class_Third', 'age^2']]
y_test = titanic_test['fare']

model_5 = LinearRegression().fit(X_train, y_train)
In [ ]:
# check the coefficients and intercept values
print("Coefficients of the model: ",model_5.coef_)
print("Intercept of the model: ",model_5.intercept_)

# predict on test set
y_test_pred = model_5.predict(X_test)

# get R-squared
print("R-squared of the model:", r2_score(y_test, y_test_pred))

Time for Breakout Room 2

Goal: Learn about how to include polynomial terms in your model.

Directions:

In each case, print the coefficients and report the R^2 on the test set:

1. Build a polynomial regression model including the 2nd degree terms of both `LSTAT` and `RM` 
2. Next, include the interaction term between `LSTAT` and `RM` in your model
3. Finally, include the 3rd degree terms of both `LSTAT` and `RM` in your model

Can you see any improvement? Reviewing the models you have built thus far, which one would you choose and why?

In [ ]:
# %load ../solutions/breakout_2_sol.py
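Again, a sketch of one possible solution (the loaded file is the authoritative version). The names R2_2, R2_2_inter, and R2_3_inter are assumptions chosen to match the model-selection cell below, and the LSTAT*RM column is assumed to exist from Breakout Room 1.

In [ ]:
# One possible solution sketch (variable names chosen to match the model-selection cell below).
for df in (boston_train, boston_test):
    df["LSTAT^2"] = df["LSTAT"] ** 2
    df["RM^2"] = df["RM"] ** 2
    df["LSTAT^3"] = df["LSTAT"] ** 3
    df["RM^3"] = df["RM"] ** 3

# 1. Second-degree terms for LSTAT and RM
preds_2 = ["LSTAT", "RM", "LSTAT^2", "RM^2"]
model_boston_2 = LinearRegression().fit(boston_train[preds_2], boston_train.MEDV)
print("Coefficients:", model_boston_2.coef_)
R2_2 = r2_score(boston_test.MEDV, model_boston_2.predict(boston_test[preds_2]))
print("R^2 with 2nd-degree terms:", R2_2)

# 2. Add the LSTAT*RM interaction term (created in Breakout Room 1)
preds_2_inter = preds_2 + ["LSTAT*RM"]
model_boston_2_inter = LinearRegression().fit(boston_train[preds_2_inter], boston_train.MEDV)
print("Coefficients:", model_boston_2_inter.coef_)
R2_2_inter = r2_score(boston_test.MEDV, model_boston_2_inter.predict(boston_test[preds_2_inter]))
print("R^2 with 2nd-degree terms and interaction:", R2_2_inter)

# 3. Add the 3rd-degree terms as well
preds_3_inter = preds_2_inter + ["LSTAT^3", "RM^3"]
model_boston_3_inter = LinearRegression().fit(boston_train[preds_3_inter], boston_train.MEDV)
print("Coefficients:", model_boston_3_inter.coef_)
R2_3_inter = r2_score(boston_test.MEDV, model_boston_3_inter.predict(boston_test[preds_3_inter]))
print("R^2 with 3rd-degree terms and interaction:", R2_3_inter)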
In [ ]:
# Performance of previous models:
print('R^2: ', R2_1, R2_1_inter)

Model Selection

A good model gives good performance on the test set without using too many predictors (features). Fewer features generally make a model more stable and easier to interpret.

In [ ]:
R2 = [R2_0b, R2_0a,  R2_1, R2_1_inter, R2_2, R2_2_inter, R2_3_inter]
In [ ]:
plt.figure(figsize=[8,6])
x = np.arange(0, 7, 1)
plt.plot(x, R2, '-ob')

xTick_label = ['R2_0b', 'R2_0a', 'R2_1', 'R2_1_inter', 'R2_2', 'R2_2_inter', 'R2_3_inter']
plt.xticks(x, xTick_label, rotation='vertical')
plt.ylabel('$R^2$')
plt.xlabel('Model')
plt.title('Model Performance');

We observe that including the interaction term boosts the test-set performance from roughly 60% to roughly 72%. Including higher-order polynomial terms does not yield a significant further improvement. Hence, the best model here is the first-order polynomial with the interaction term.

Here we are only introducing the concept rather than prescribing a method for selecting the best model. We will cover cross-validation in depth next section!

Regression Assumptions

Recall the linear regression model: $Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \epsilon$. What assumptions are we making when we fit such a model?

The answer to this question can be found on closer examination of $\epsilon$. What is $\epsilon$? It is assumed that $\epsilon$ is normally distributed with a mean of 0 and variance $\sigma^2$. But what does this tell us?

  1. Assumption 1: Linearity This is an implicit assumption as we claim that Y can be modeled through a linear combination of the predictors.
  2. Assumption 2: Independence of observations This comes from the randomness of $\epsilon$.

  3. Assumption 3: Constant variance of $\epsilon$ errors This means that if we plot our residuals, which are the differences between the true $Y$ and our predicted $\hat{Y}$, they should look like they have constant variance and a mean of 0.

  4. Assumption 4: Normality We assume that $\epsilon$ is normally distributed, and we can check this with a histogram of the residuals.
In [ ]:
y_hat = model_5.predict(X_train)
residuals = titanic_train['fare'] - y_hat

# plotting
fig, ax = plt.subplots(ncols=2, figsize=(16,5))
ax = ax.ravel()
ax[0].set_title('Plot of Residuals')
ax[0].scatter(y_hat, residuals, alpha=0.2)
ax[0].set_xlabel(r'$\hat{y}$')
ax[0].set_ylabel('residuals')

ax[1].set_title('Histogram of Residuals')
ax[1].hist(residuals, alpha=0.7)
ax[1].set_xlabel('residuals')
ax[1].set_ylabel('frequency');

# Mean of residuals
print('Mean of residuals: {}'.format(np.mean(residuals)))

What can you say about the assumptions of the model?
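To probe the normality assumption a little further (an addition to the original section), a quantile-quantile plot compares the residual quantiles with those of a normal distribution. A minimal sketch using scipy:

In [ ]:
from scipy import stats

# Minimal sketch: if the residuals were normal, the points would fall close to the straight line.
fig, ax = plt.subplots(figsize=(6, 6))
stats.probplot(residuals, dist="norm", plot=ax)
ax.set_title('Q-Q plot of residuals');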

Real data often violate these assumptions. So why use linear regression? Because linear regression has an analytical solution: we are guaranteed to find the optimal coefficients, and they are computationally cheap to obtain.
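To make the "analytical solution" point concrete, ordinary least squares minimizes $\|y - X\beta\|^2$ and has the closed form $\hat{\beta} = (X^T X)^{-1} X^T y$. A minimal numpy sketch (an illustration, not part of the original section) checks that solving the least-squares problem directly reproduces model_1's intercept and slope; np.linalg.lstsq is used instead of an explicit inverse for numerical stability:

In [ ]:
# Minimal sketch: closed-form least squares for fare ~ age, compared with sklearn's fit.
X_design = np.column_stack([np.ones(len(titanic_train)), titanic_train['age'].values])
beta_hat, *_ = np.linalg.lstsq(X_design, titanic_train['fare'].values, rcond=None)
print("Closed-form [intercept, slope]:", beta_hat)
print("sklearn     [intercept, slope]:", model_1.intercept_, model_1.coef_)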

Bootstrapping

What is a confidence interval?

A confidence interval is a range of values that is likely to include a parameter of interest with some degree of certainty or "confidence."

How do we interpret 95% confidence intervals?

If we were to compute a 95% confidence interval from each of K repeated samples, we would expect about 0.95*K of those confidence intervals to contain the true parameter of interest.

What is bootstrapping?

Bootstrapping is a procedure for resampling a dataset with replacement to produce a distribution of the value of interest.

Using the model we selected from above, let's do some bootstrapping to generate the confidence intervals for our coefficients!

In [ ]:
print("Let's check out the coefficients of R2_1_inter")
print("Beta0",":",model_boston_1_inter.intercept_)
for i in range(3):
    print("Beta"+str(i+1),":",model_boston_1_inter.coef_[i])
In [ ]:
# number of bootstrap samples
numboot = 200
bootstrap = []

for i in range(numboot):
    boston_sampled = boston.sample(frac=1, replace=True)
    boston_sampled["LSTAT*RM"] = boston_sampled["LSTAT"]*boston_sampled["RM"]
    model_boston_1_inter = LinearRegression().fit(boston_sampled[["LSTAT", "RM", "LSTAT*RM"]], boston_sampled.MEDV)
    bootstrap.append(model_boston_1_inter.coef_)

bootstrap = np.array(bootstrap)

fig, ax = plt.subplots(2,2, figsize = (18,10))
ax = ax.ravel()
for i in range(3):
    betavals = bootstrap[:,i]
    betavals.sort()
    x1 = np.percentile(betavals,2.5)
    x2 = np.percentile(betavals,97.5)
    x = np.linspace(x1,x2,500)
    counts, bins = np.histogram(betavals)
    y = counts.max()
    ax[i].hist(bootstrap[:,i], bins = 10, color="gold",alpha=0.5,edgecolor='black', linewidth=1)
    ax[i].fill_between(x,y, color = 'cornflowerblue',alpha=0.3)
    ax[i].set_ylabel(f'Distribution of beta {i+1}', fontsize=18)
    ax[i].set_xlabel(f'Value of beta {i+1}', fontsize=18)
    ax[i].axvline(x = np.mean(betavals), color='r')

fig.delaxes(ax[3])
fig.suptitle('95% confidence intervals of the model_boston_1_inter coefficients', fontsize=24)
sns.despine()
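For reference (an addition), the numeric 95% confidence intervals behind the shaded regions above can be printed directly from the bootstrap samples:

In [ ]:
# Minimal sketch: 2.5th and 97.5th percentiles of each bootstrapped coefficient.
coef_names = ["LSTAT", "RM", "LSTAT*RM"]
for i, name in enumerate(coef_names):
    lower, upper = np.percentile(bootstrap[:, i], [2.5, 97.5])
    print(f"95% CI for the {name} coefficient: [{lower:.3f}, {upper:.3f}]")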

End of Standard Section


Extra: Visual exploration of predictors' correlations

The dataset for this problem contains 10 simulated predictors and a response variable.

In [ ]:
# read in the data 
data = pd.read_csv('../data/dataset3.txt')
data.head()
In [ ]:
# this effect can be replicated using the scatter_matrix function in pandas plotting
sns.pairplot(data);

Predictors x1, x2, and x3 appear to be perfectly correlated, while predictors x4, x5, x6, and x7 appear strongly correlated.

In [ ]:
data.corr()
In [ ]:
sns.heatmap(data.corr())

Extra: A Handy Matplotlib Guide

source: http://matplotlib.org/faq/usage_faq.html

See also this matplotlib tutorial.


See also this violin plot tutorial.

Using statsmodel OLS

Exercise: You've done this before: using OLS from the statsmodels library, fit a simple model on the training data that predicts fare from age. Name your model model_1 and display the summary.

In [ ]:
from statsmodels.api import OLS
import statsmodels.api as sm
In [ ]:
age_ca = sm.add_constant(titanic_train['age'])
model_1 = OLS(titanic_train['fare'], age_ca).fit()
model_1.summary()
In [ ]:
model_2 = sm.OLS(titanic_train['fare'], 
                 sm.add_constant(titanic_train[['age', 'sex_male', 'class_Second', 'class_Third']])).fit()
model_2.summary()
In [ ]:
titanic_train['sex_male_X_age'] = titanic_train['age'] * titanic_train['sex_male']

model_3 = sm.OLS(
    titanic_train['fare'],
    sm.add_constant(titanic_train[['age', 'sex_male', 'class_Second', 'class_Third', 'sex_male_X_age']])
).fit()
model_3.summary()
In [ ]:
# It seemed like gender interacted with age and class. Can we put that in our model?
titanic_train['sex_male_X_class_Second'] = titanic_train['sex_male'] * titanic_train['class_Second']
titanic_train['sex_male_X_class_Third'] = titanic_train['sex_male'] * titanic_train['class_Third']

model_4 = sm.OLS(
    titanic_train['fare'],
    sm.add_constant(titanic_train[['age', 'sex_male', 'class_Second', 'class_Third', 'sex_male_X_age', 
                             'sex_male_X_class_Second', 'sex_male_X_class_Third']])
).fit()
model_4.summary()
In [ ]:
titanic_train['age^2'] = titanic_train['age']**2

model_5 = sm.OLS(
    titanic_train['fare'],
    sm.add_constant(titanic_train[['age', 'sex_male', 'class_Second', 'class_Third', 'sex_male_X_age', 
                             'sex_male_X_class_Second', 'sex_male_X_class_Third', 'age^2']])
).fit()
model_5.summary()
In [ ]:
predictors = sm.add_constant(titanic_train[['age', 'sex_male', 'class_Second', 'class_Third', 'sex_male_X_age', 
                                      'sex_male_X_class_Second', 'sex_male_X_class_Third', 'age^2']])
y_hat = model_5.predict(predictors)
residuals = titanic_train['fare'] - y_hat

# plotting
fig, ax = plt.subplots(ncols=2, figsize=(16,5))
ax = ax.ravel()
ax[0].set_title('Plot of Residuals')
ax[0].scatter(y_hat, residuals, alpha=0.2)
ax[0].set_xlabel(r'$\hat{y}$')
ax[0].set_ylabel('residuals')

ax[1].set_title('Histogram of Residuals')
ax[1].hist(residuals, alpha=0.7)
ax[1].set_xlabel('residuals')
ax[1].set_ylabel('frequency');

# Mean of residuals
print('Mean of residuals: {}'.format(np.mean(residuals)))