Key Word(s): Linear Regression, Interaction Terms, Model Selection
CS109A Introduction to Data Science
Standard Section 3: Predictor types and Feature Selection¶
Harvard University
Fall 2018
Instructors: Pavlos Protopapas, Kevin Rader
Section Leaders: Mehul Smriti Raje, Ken Arnold, Karan Motwani, Cecilia Garraffo
#RUN THIS CELL
import requests
from IPython.core.display import HTML
styles = requests.get("http://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
For this section, our goal is to discuss the complexities around different types of data features, and to think carefully about how different data types and collinearity issues can affect our models, whether our true goal is inference or prediction.
Specifically, we will:
1. Discuss different variable types, and techniques of “one-hot-encoding” our factor variables
2. Build a variable selection function that performs an exhaustive feature search over all possible combinations of predictors
For this section we will be using the following packages:
#Check Python Version
import sys
assert sys.version_info.major == 3, "Python 3 is required; found {}".format(sys.version)
# Data and Stats packages
import numpy as np
import pandas as pd
from sklearn import metrics, datasets
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
# Visualization packages
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
matplotlib.rcParams['figure.figsize'] = (13.0, 6.0)
# Other helpful functions
import itertools
import warnings
warnings.filterwarnings("ignore")
#Aesthetic settings
from IPython.display import display
pd.set_option('display.max_rows', 999)
pd.set_option('display.width', 500)
pd.set_option('display.notebook_repr_html', True)
1. Extending Linear Regression by Transforming Predictors¶
Linear regression works great when our features are all continuous and all linearly affect the output. But often real data have more interesting characteristics. Here, we'll look at how we can extend linear regression to handle:
- Categorical predictors, like gender, which have one of a few discrete values
- Interactions between predictors, which let us model how one variable changes the effect of another.
For our dataset, we'll be using the passenger list from the Titanic, which famously sank in 1912. Let's have a look at the data. Some descriptions of the data are at https://www.kaggle.com/c/titanic/data, and here's how seaborn preprocessed it.
# Load the dataset from seaborn
titanic = sns.load_dataset("titanic")
titanic.head()
# Keep only a subset of the predictors; some are redundant, others (like deck) have too many missing values.
titanic = titanic[['age', 'sex', 'class', 'embark_town', 'alone', 'fare']]
titanic.info()
# Drop missing data (is this a good idea?)
titanic = titanic.dropna()
titanic.info()
Let's explore this data. Most people look at differences in survival, which is important but requires knowing how to deal with categorical responses, which we'll learn how to do next week. For this week, let's see if there are systematic differences in what fare people paid.
First let's look at the distribution of fares.
# Your code here
titanic.fare.describe()
# Your code here
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(16, 8))
sns.distplot(titanic.fare, ax=axes[0,0])
sns.violinplot(x='fare', data=titanic, ax=axes[0,1])
sns.boxplot(x='fare', data=titanic, ax=axes[1,0])
sns.barplot(x='fare', data=titanic, ax=axes[1,1])
# Of course you should label your axes, etc.
What do we learn from each visualization? Which is most helpful?
What can we say about the fares that passengers paid?
Your answer here: A good visualization of the distribution of a variable will enable us to answer three kinds of questions:
- What values are central or typical? (e.g., mean, median, modes)
- What is the typical spread of values around those central values? (e.g., variance/stdev, skewness)
- What are unusual or exceptional values? (e.g., outliers)
The barplot clearly shows the mean (with an error bar), but that's about it. The other visualizations give complementary information; e.g., the boxplot shows outliers clearly, but the histogram most clearly suggests that there might be secondary modes in addition to the primary mode.
Overall, we can see that most passengers paid below $50, but there are many outliers. If we were carefully analyzing this data, we might consider removing some outliers entirely and transforming the remaining data to make its distribution more Normal. But for now we'll just use the data as-is.
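If we did want to tame those outliers, one possibility (just a sketch; the 99th-percentile cutoff and the log transform below are illustrative choices, not something we use in the rest of this section) is to trim the most extreme fares and work on a log scale:
# Illustrative only: trim fares above the 99th percentile and look at log-fares.
# The cutoff and the log1p transform are arbitrary choices for this sketch.
fare_cap = titanic.fare.quantile(0.99)
titanic_trimmed = titanic[titanic.fare < fare_cap].copy()
titanic_trimmed['log_fare'] = np.log1p(titanic_trimmed.fare)
sns.distplot(titanic_trimmed.log_fare)
plt.xlabel("log(1 + fare)");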
Exploring predictors¶
Cabin class is probably going to matter for fare, but we might wonder if age and gender also matter. Let's explore them.
sns.distplot(titanic.age, rug=True, rug_kws={'alpha': .1, 'color': 'k'})
sns.lmplot(x="age", y="fare", hue="sex", data=titanic, size=8)
Hmm... the slopes seem to be different for males and females. Remember this for later.
How about sex and class?
titanic.sex.value_counts()
titanic['class'].value_counts()
# Why couldn't we write `titanic.class.value_counts()` here?
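# (Answer: `class` is a reserved keyword in Python, so the attribute access
# `titanic.class` is a syntax error; bracket indexing works for any column name.)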
sns.barplot(x="class", hue="sex", y="fare", data=titanic)
sns.violinplot(x="class", hue="sex", y="fare", data=titanic)
Aside: can we replicate that violin plot without Seaborn?¶
# plt.violinplot takes an array of data arrays, and plots each one as a separate violin.
# The data arrays don't have to be the same length.
# It returns Matplotlib objects corresponding to each of the pieces of the visualization.
plt.violinplot([
[0,1,2],
[10, 14, 9, 10, 10, 10],
[6]
])
# We can use Pandas queries to make datasets for each class.
# We need to use `.values` to get plain Numpy arrays.
classes = 'First Second Third'.split()
plt.violinplot([
titanic['fare'][titanic['class'] == cls].values
for cls in classes
])
plt.xticks([1,2,3], classes)
plt.xlabel("Ticket Class")
plt.ylabel("Fare");
# It gets a little more tricky when we want to plot the two sexes side-by-side.
# Fortunately we can tell violinplot to place the violins at different positions.
position_array = [0, 4, 5] # just to show the effect of the position array.
plt.violinplot([
titanic['fare'][titanic['class'] == cls].values
for cls in classes
], positions=position_array)
#plt.xticks(position_array, classes)
plt.xlabel("Ticket Class")
plt.ylabel("Fare");
# We can use the position array to shift the male and female violins.
sexes = 'male female'.split()
positions_array = np.arange(len(classes)) # this will be [0, 1, 2]
for i, sex in enumerate(sexes):
offset = .15 * (-1 if i == 0 else 1)
violin = plt.violinplot([
titanic['fare'][(titanic['sex'] == sex) & (titanic['class'] == cls)].values
for cls in classes
], positions=positions_array + offset, widths=.25, showmedians=True, showextrema=True)
plt.xticks(positions_array, classes)
plt.xlabel("Ticket Class")
plt.ylabel("Fare");
# But what about a legend? Unfortunately plt.violinplot doesn't support legends, so we
# have to do it by hand.
positions_array = np.arange(len(classes))
fake_handles = []
for i, sex in enumerate(sexes):
offset = .15 * (-1 if i == 0 else 1)
violin = plt.violinplot([
titanic['fare'][(titanic['sex'] == sex) & (titanic['class'] == cls)].values
for cls in classes
], positions=positions_array + offset, widths=.25, showmedians=True, showextrema=True)
fake_handles.append(mpatches.Patch())
plt.legend(fake_handles, sexes)
plt.xticks(positions_array, classes)
plt.xlabel("Ticket Class")
plt.ylabel("Fare");
# Dang, the colors are wrong. We'll have to use some hacky code to pull the color out of the violin...
positions_array = np.arange(len(classes))
fake_handles = []
for i, sex in enumerate(sexes):
offset = .15 * (-1 if i == 0 else 1)
violin = plt.violinplot([
titanic['fare'][(titanic['sex'] == sex) & (titanic['class'] == cls)].values
for cls in classes
], positions=positions_array + offset, widths=.25, showmedians=True, showextrema=True)
cur_violin_color = violin['cbars'].get_color()[0]
fake_handles.append(mpatches.Patch(color=cur_violin_color))
plt.legend(fake_handles, sexes)
plt.xticks(positions_array, classes)
plt.xlabel("Ticket Class")
plt.ylabel("Fare");
# Or explicitly control the colors.
colors = sns.color_palette("Set1", n_colors=len(sexes), desat=.5)
sns.palplot(colors)
positions_array = np.arange(len(classes))
fake_handles = []
for i, sex in enumerate(sexes):
offset = .15 * (-1 if i == 0 else 1)
violin = plt.violinplot([
titanic['fare'][(titanic['sex'] == sex) & (titanic['class'] == cls)].values
for cls in classes
], positions=positions_array + offset, widths=.25, showmedians=True, showextrema=True)
# Set the color
color = colors[i]
for part_name, part in violin.items():
if part_name == 'bodies':
for body in violin['bodies']:
body.set_color(color)
else:
part.set_color(color)
fake_handles.append(mpatches.Patch(color=color))
plt.legend(fake_handles, sexes)
plt.xticks(positions_array, classes)
plt.xlabel("Ticket Class")
plt.ylabel("Fare");
No more digression, back to regressions.¶
So it looks like fare varies with class, age, and maybe gender, and the way that fare depends on class and age may be different for male vs female.
Let's first do a simple linear regression on age.
model1 = sm.OLS(
titanic.fare,
sm.add_constant(titanic['age'])
).fit()
model1.summary()
How do we interpret this?
- How good is this model?
- Does age affect fare? How can we tell? And if so, how?
Handling categorical variables¶
Statistical packages generally distinguish between three types of variables:
- continuous variables, like age and fare.
- nominal variables, like gender and whether the person survived
- ordinal variables like cabin class (first > second > third)
(Ordinal variables can often be treated like nominal variables; that's what we'll do for now.)
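As an aside, if we did want to respect the ordering of cabin class, one option (just a sketch; the particular integer mapping below is our own choice and isn't used later) is to encode it as ordered integers:
# Illustrative only: encode the ordinal `class` column as ordered integers.
# We keep the result in a separate Series so the dataframe used below is unchanged.
class_order = {'First': 1, 'Second': 2, 'Third': 3}
class_as_ordinal = titanic['class'].map(class_order)
class_as_ordinal.value_counts()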
How do we deal with gender, or class? They're categorical variables. We'll need to use dummy variables to encode them.
titanic_orig = titanic.copy()
# Your code here
titanic['sex_male'] = (titanic.sex == 'male').astype(int)
Your answer here: No, because it would be redundant: the only values for `sex` in this dataset are `male` and `female`, so if a passenger is not male, we know they're female.
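To see the redundancy concretely, here's a quick sketch (the `sex_female` column below is only for illustration and isn't added to our dataframe): with an intercept in the model, dummies for both sexes are linearly dependent, so the design matrix loses rank and the individual coefficients aren't uniquely identified.
# Illustrative only: dummies for both sexes plus a constant are linearly dependent,
# because sex_male + sex_female equals the constant column of ones.
both_dummies = pd.DataFrame({
    'sex_male': (titanic.sex == 'male').astype(int),
    'sex_female': (titanic.sex == 'female').astype(int),
})
X_redundant = sm.add_constant(both_dummies)
np.linalg.matrix_rank(X_redundant.values)  # 2, not 3: one column is redundant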
# Your code here
titanic['class_Second'] = (titanic['class'] == 'Second').astype(int)
titanic['class_Third'] = 1 * (titanic['class'] == 'Third') # just another way to do it
titanic.info()
# Your code here
# This function automates the above:
titanic = pd.get_dummies(titanic_orig, columns=['sex', 'class'], drop_first=True)
titanic.head()
# Your code here
model2 = sm.OLS(
titanic.fare,
sm.add_constant(titanic[['age', 'sex_male', 'class_Second', 'class_Third']])
).fit()
model2.summary()
- All else being equal, what does being male do to the fare?
- What can we say about being male and first-class?
Your answer here
Interactions¶
# It seemed like gender interacted with age and class. Can we put that in our model?
titanic['sex_male_X_age'] = titanic['age'] * titanic['sex_male']
model3 = sm.OLS(
titanic.fare,
sm.add_constant(titanic[['age', 'sex_male', 'class_Second', 'class_Third', 'sex_male_X_age']])
).fit()
model3.summary()
What happened to the `age` and `sex_male` terms?
# Now let's also add the interactions between gender and class.
titanic['sex_male_X_class_Second'] = titanic['sex_male'] * titanic['class_Second']
titanic['sex_male_X_class_Third'] = titanic['sex_male'] * titanic['class_Third']
model4 = sm.OLS(
titanic.fare,
sm.add_constant(titanic[['age', 'sex_male', 'class_Second', 'class_Third', 'sex_male_X_age', 'sex_male_X_class_Second', 'sex_male_X_class_Third']])
).fit()
model4.summary()
What has happened to the $R^2$ as we added more features? Does this mean that the model is better? (What if we kept adding more predictors and interaction terms? We'll explore this in the next homework.)
models = [model1, model2, model3, model4]
plt.plot([model.df_model for model in models], [model.rsquared for model in models], 'x-')
plt.xlabel("Model df")
plt.ylabel("$R^2$");
2. Model Selection via Exhaustive Search¶
The dataset for this problem contains 10 simulated predictors and a response variable.
# Import the dataset
data = pd.read_csv('data/dataset3.txt')
data.head()
By visually inspecting the data set, do we find that some of the predictors are correlated amongst themselves?
sns.pairplot(data)
Predictors x1, x2, x3 seem to be perfectly correlated while predictors x4, x5, x6, x7 show very high degrees of correlation.
data.corr()
Let us compute the coefficient of correlation between each pair of predictors, and visualize the matrix of correlation coefficients using a heat map.
# generating heat map
sns.heatmap(data.corr())
Do the predictors fall naturally into groups based on the correlation values?
We can see that highly correlated predictors fall into dark red groups with correlation values close to 1. Other correlation values also form distinctly coloured groups.
Your answer here
We may choose one predictor from the x1, x2, x3 group; let's pick x1.
Similarly, let's pick x6 out of x4, x5, x6, x7.
The other predictors are not strongly correlated, so we pick them all, i.e. x8, x9, x10.
Thus, we have picked 5 predictors x1, x6, x8, x9, x10 for our regression model.
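If we wanted to find these correlated groups programmatically rather than by eye, one option (just a sketch; the 0.9 threshold is an arbitrary choice) is to list the predictor pairs whose absolute correlation exceeds a cutoff:
# Illustrative only: list predictor pairs with |correlation| above an arbitrary cutoff.
# We assume the response is the last column, as in the cells below.
corr = data.iloc[:, :-1].corr()
threshold = 0.9
high_corr_pairs = [
    (corr.columns[i], corr.columns[j], corr.iloc[i, j])
    for i in range(len(corr.columns))
    for j in range(i + 1, len(corr.columns))
    if abs(corr.iloc[i, j]) > threshold
]
high_corr_pairs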
Model Selection Criteria: Bayesian Information Criterion (BIC)¶
Generally, BIC = -2 x log-likelihood + K x log(n).
For least-squares regression specifically,
$$BIC = n \log \Big(\frac{RSS}{n}\Big) + K\log(n)$$
where:
- RSS = the Residual Sum of Squares
- n = the number of observations
- K = the number of features in our model
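As a sanity check (just a sketch; `X_all`, `y_all`, and `m` below are throwaway names, and we count the intercept as one of the K model terms), we can compute this RSS-based BIC by hand for one model and compare it with statsmodels' `model.bic`. The two differ only by n(1 + log 2π), a constant that does not depend on which predictors are in the model, so both versions rank subsets identically.
# Illustrative only: RSS-based BIC by hand vs. statsmodels' bic for the full model.
X_all = sm.add_constant(data.iloc[:, :-1])   # all predictors; response assumed to be the last column
y_all = data.iloc[:, -1]
m = sm.OLS(y_all, X_all).fit()
n, K = len(y_all), X_all.shape[1]            # K counts the intercept
rss = (m.resid ** 2).sum()
manual_bic = n * np.log(rss / n) + K * np.log(n)
manual_bic, m.bic, m.bic - manual_bic, n * (1 + np.log(2 * np.pi))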
Selecting a minimal subset of predictors¶
Let us apply the exhaustive search variable selection methods discussed in class to choose a minimal subset of predictors that yield high prediction accuracy. We will use the Bayesian Information Criterion (BIC) to choose the subset size.
x = data.iloc[:,:-1]
y = data.iloc[:,-1]
x.shape, y.shape
def find_best_subset_of_size(x, y, num_predictors):
predictors = x.columns
best_r_squared = -np.inf
best_model_data = None
# Enumerate subsets of the given size
subsets_of_size_k = itertools.combinations(predictors, num_predictors)
# Inner loop: iterate through subsets_k
for subset in subsets_of_size_k:
# Fit regression model using ‘subset’ and calculate R^2
# Keep track of subset with highest R^2
features = list(subset)
x_subset = sm.add_constant(x[features])
model = sm.OLS(y, x_subset).fit()
r_squared = model.rsquared
# Check if we get a higher R^2 value than the current max R^2.
# If so, update our best subset
if r_squared > best_r_squared:
best_r_squared = r_squared
best_model_data = {
'r_squared': r_squared,
'subset': features,
'model': model
}
return best_model_data
find_best_subset_of_size(x, y, 8)
def exhaustive_search_selection(x, y):
"""Exhaustively search predictor combinations
Parameters:
-----------
x : DataFrame of predictors/features
y : response variable
Returns:
-----------
DataFrame of model comparisons (one row per subset size) and
a dict mapping each size k to its best (highest R^2) OLS model
"""
predictors = x.columns
stats = []
models = dict()
# Outer loop: iterate over subset sizes 1, 2, ..., d
for k in range(1, len(predictors) + 1):
best_size_k_model = find_best_subset_of_size(
x, y, num_predictors=k)
best_subset = best_size_k_model['subset']
best_model = best_size_k_model['model']
stats.append({
'k': k,
'formula': "y ~ {}".format(' + '.join(best_subset)),
'bic': best_model.bic,
'r_squared': best_model.rsquared
})
models[k] = best_model
return pd.DataFrame(stats), models
stats, models = exhaustive_search_selection(x, y)
stats
stats.plot(x='k', y='bic', marker='*')
stats.info()
best_stat = stats.iloc[stats.bic.idxmin()]
best_stat
best_k = best_stat['k']
best_bic = best_stat['bic']
best_formula = best_stat['formula']
best_r2 = best_stat['r_squared']
print("The best overall model is `{formula}` with bic={bic:.2f} and R^2={r_squared:.3f}".format(
formula=best_formula, bic=best_bic, r_squared=best_r2))
models[best_k].summary()
Do the chosen subsets match the ones you picked using the correlation matrix you had visualized?
Yes! The predictor subset chosen by exhaustive search matches the one we picked before.
Plotting multiple axes in a single figure¶
source: http://matplotlib.org/faq/usage_faq.html
See also this matplotlib tutorial.
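As a minimal sketch of that pattern (reusing the titanic fare and age columns from earlier), we create a grid of axes with plt.subplots and draw on each Axes object explicitly:
# A minimal subplots sketch: one Figure, a 1x2 grid of Axes, each drawn on explicitly.
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
axes[0].hist(titanic.fare, bins=30)
axes[0].set_xlabel("Fare")
axes[1].hist(titanic.age, bins=30)
axes[1].set_xlabel("Age")
fig.tight_layout()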