CS109A Introduction to Data Science

Standard Section 3: Predictor types and Feature Selection

Harvard University
Fall 2018
Instructors: Pavlos Protopapas, Kevin Rader
Section Leaders: Mehul Smriti Raje, Ken Arnold, Karan Motwani, Cecilia Garraffo


In [1]:
#RUN THIS CELL 
import requests
from IPython.core.display import HTML
styles = requests.get("http://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
Out[1]:

For this section, our goal is to discuss the complexities around different types of data features and to think carefully about how different data types and collinearity issues can affect our models, whether our true goal is inference or prediction.

Specifically, we will:

1. Discuss different variable types, and techniques for "one-hot-encoding" our factor variables 
2. Build a variable selection function that performs an exhaustive feature search over all possible combinations of predictors 

For this section we will be using the following packages:

In [2]:
#Check Python Version
import sys
assert(sys.version_info.major==3), print(sys.version)

# Data and Stats packages
import numpy as np
import pandas as pd
from sklearn import metrics, datasets
from sklearn.model_selection import train_test_split
import statsmodels.api as sm

# Visualization packages
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
matplotlib.rcParams['figure.figsize'] = (13.0, 6.0)

# Other helpful functions
import itertools
import warnings
warnings.filterwarnings("ignore")

#Aesthetic settings
from IPython.display import display
pd.set_option('display.max_rows', 999)
pd.set_option('display.width', 500)
pd.set_option('display.notebook_repr_html', True)

1. Extending Linear Regression by Transforming Predictors

Linear regression works great when our features are all continuous and all linearly affect the output. But often real data have more interesting characteristics. Here, we'll look at how we can extend linear regression to handle:

  • Categorical predictors, like gender, which have one of a few discrete values
  • Interactions between predictors, which let us model how one variable changes the effect of another.

For our dataset, we'll be using the passenger list from the Titanic, which famously sank in 1912. Let's have a look at the data. Some descriptions of the data are at https://www.kaggle.com/c/titanic/data, and here's how seaborn preprocessed it.

In [3]:
# Load the dataset from seaborn 
titanic = sns.load_dataset("titanic")
titanic.head()
In [4]:
# Keep only a subset of the predictors; some are redundant, others (like deck) have too many missing values.
titanic = titanic[['age', 'sex', 'class', 'embark_town', 'alone', 'fare']]
In [5]:
titanic.info()
In [6]:
# Drop missing data (is this a good idea?)
titanic = titanic.dropna()
titanic.info()

Let's explore this data. Most people look at differences in survival, which is important but requires knowing how to deal with a categorical response; we'll learn how to do that next week. For this week, let's see if there are systematic differences in what fare people paid.

First let's look at the distribution of fares.

**Exercise 1**: Show the distribution of fares in at least two ways. You may use functions from matplotlib, Pandas, or Seaborn.
In [7]:
# Your code here
In [8]:
# Your code here
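One possible approach (just a sketch, not the only reasonable answer): a histogram shows the overall shape of the fare distribution, and a boxplot highlights the median, quartiles, and the long right tail.

plt.hist(titanic['fare'], bins=30)
plt.xlabel("Fare")
plt.ylabel("Number of passengers")
plt.show()

plt.boxplot(titanic['fare'].values, vert=False)
plt.xlabel("Fare");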

What do we learn from each visualization? Which is most helpful?

What can we say about the fares that passengers paid?

Your answer here

Exploring predictors

Cabin class is probably going to matter for fare, but we might wonder if age and gender also matter. Let's explore them.

In [9]:
sns.distplot(titanic.age, rug=True, rug_kws={'alpha': .1, 'color': 'k'})
In [10]:
sns.lmplot(x="age", y="fare", hue="sex", data=titanic, size=8)

Hmm... the slopes seem to be different for males and females. Remember this for later.

How about sex and class?

In [11]:
titanic.sex.value_counts()
In [12]:
titanic['class'].value_counts()
# Why couldn't we write `titanic.class.value_counts()` here?
In [13]:
sns.barplot(x="class", hue="sex", y="fare", data=titanic)
In [14]:
sns.violinplot(x="class", hue="sex", y="fare", data=titanic)

Aside: can we replicate that violin plot without Seaborn?

In [15]:
# plt.violinplot takes an array of data arrays, and plots each one as a separate violin.
# The data arrays don't have to be the same length.
# It returns Matplotlib objects corresponding to each of the pieces of the visualization.
plt.violinplot([
    [0,1,2], 
    [10, 14, 9, 10, 10, 10],
    [6]
])
In [16]:
# We can use Pandas queries to make datasets for each class.
# We need to use `.values` to get plain Numpy arrays.
classes = 'First Second Third'.split()
plt.violinplot([
    titanic['fare'][titanic['class'] == cls].values
    for cls in classes
])
plt.xticks([1,2,3], classes)
plt.xlabel("Ticket Class")
plt.ylabel("Fare");
In [17]:
# It gets a little more tricky when we want to plot the two sexes side-by-side.
# Fortunately we can tell violinplot to place the violins at different positions.
position_array = [0, 4, 5] # just to show the effect of the position array.
plt.violinplot([
    titanic['fare'][titanic['class'] == cls].values
    for cls in classes
], positions=position_array)
#plt.xticks(position_array, classes)
plt.xlabel("Ticket Class")
plt.ylabel("Fare");
In [18]:
# We can use the position array to shift the male and female violins.

sexes = 'male female'.split()
positions_array = np.arange(len(classes)) # this will be [0, 1, 2]
for i, sex in enumerate(sexes):
    offset = .15 * (-1 if i == 0 else 1)
    violin = plt.violinplot([
        titanic['fare'][(titanic['sex'] == sex) & (titanic['class'] == cls)].values
        for cls in classes
    ], positions=positions_array + offset, widths=.25, showmedians=True, showextrema=True)
plt.xticks(positions_array, classes)
plt.xlabel("Ticket Class")
plt.ylabel("Fare");
In [19]:
# But what about a legend? Unfortunately plt.violinplot doesn't support legends, so we
# have to do it by hand.
positions_array = np.arange(len(classes))
fake_handles = []
for i, sex in enumerate(sexes):
    offset = .15 * (-1 if i == 0 else 1)
    violin = plt.violinplot([
        titanic['fare'][(titanic['sex'] == sex) & (titanic['class'] == cls)].values
        for cls in classes
    ], positions=positions_array + offset, widths=.25, showmedians=True, showextrema=True)
    fake_handles.append(mpatches.Patch())
plt.legend(fake_handles, sexes)
plt.xticks(positions_array, classes)
plt.xlabel("Ticket Class")
plt.ylabel("Fare");
In [20]:
# Dang, the colors are wrong. We'll have to use some hacky code to pull the color out of the violin...
positions_array = np.arange(len(classes))
fake_handles = []
for i, sex in enumerate(sexes):
    offset = .15 * (-1 if i == 0 else 1)
    violin = plt.violinplot([
        titanic['fare'][(titanic['sex'] == sex) & (titanic['class'] == cls)].values
        for cls in classes
    ], positions=positions_array + offset, widths=.25, showmedians=True, showextrema=True)
    cur_violin_color = violin['cbars'].get_color()[0]
    fake_handles.append(mpatches.Patch(color=cur_violin_color))
plt.legend(fake_handles, sexes)
plt.xticks(positions_array, classes)
plt.xlabel("Ticket Class")
plt.ylabel("Fare");
In [21]:
# Or explicitly control the colors.
colors = sns.color_palette("Set1", n_colors=len(sexes), desat=.5)
sns.palplot(colors)
In [22]:
positions_array = np.arange(len(classes))
fake_handles = []
for i, sex in enumerate(sexes):
    offset = .15 * (-1 if i == 0 else 1)
    violin = plt.violinplot([
        titanic['fare'][(titanic['sex'] == sex) & (titanic['class'] == cls)].values
        for cls in classes
    ], positions=positions_array + offset, widths=.25, showmedians=True, showextrema=True)

    # Set the color
    color = colors[i]
    for part_name, part in violin.items():
        if part_name == 'bodies':
            for body in violin['bodies']:
                body.set_color(color)
        else:
            part.set_color(color)
    fake_handles.append(mpatches.Patch(color=color))

plt.legend(fake_handles, sexes)
plt.xticks(positions_array, classes)
plt.xlabel("Ticket Class")
plt.ylabel("Fare");

No more digression, back to regressions.

So it looks like fare varies with class, age, and maybe gender, and the way that fare depends on class and age may be different for male vs female.

Let's first do a simple linear regression on age.

In [23]:
model1 = sm.OLS(
    titanic.fare,
    sm.add_constant(titanic['age'])
).fit()
model1.summary()

How do we interpret this?

  • How good is this model?
  • Does age affect fare? How can we tell? And if so, how?
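One way to start answering these questions (a sketch using standard statsmodels result attributes) is to look directly at the fitted coefficients, their p-values and confidence intervals, and the overall fit:

print(model1.params)      # fitted intercept and age coefficient
print(model1.pvalues)     # p-values for each coefficient
print(model1.conf_int())  # 95% confidence intervals
print(model1.rsquared)    # proportion of variance in fare explained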

Handling categorical variables

Statistical packages generally distinguish between three types of variables:

  • continuous variables, like age and fare.
  • nominal variables, like gender and whether the person survived
  • ordinal variables like cabin class (first > second > third)

(Ordinal variables can often be treated like nominal variables; that's what we'll do for now.)

How do we deal with gender, or class? They're categorical variables. We'll need to use dummy variables to encode them.

In [24]:
titanic_orig = titanic.copy()
**Exercise 2**: Create a column `sex_male` that is 1 if the passenger is male, 0 otherwise.
In [25]:
# Your code here
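One possible solution (a sketch; the interaction cells further below assume a column with this exact name, `sex_male`):

titanic['sex_male'] = (titanic['sex'] == 'male').astype(int)
titanic[['sex', 'sex_male']].head()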
**Exercise 3**: Do we need a `sex_female` column, or any others? Why or why not?

Your answer here

**Exercise 4**: Create dummy columns for `class` (the model cells below use `class_Second` and `class_Third`, with first class as the baseline).
In [26]:
# Your code here
In [27]:
titanic.info()
In [28]:
# Your code here
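One possible approach (a sketch): dummy-encode `class` with `pd.get_dummies`, keeping only `class_Second` and `class_Third` so that first class acts as the baseline; the later model cells assume these column names.

class_dummies = pd.get_dummies(titanic['class'], prefix='class')
titanic['class_Second'] = class_dummies['class_Second']
titanic['class_Third'] = class_dummies['class_Third']
titanic[['class', 'class_Second', 'class_Third']].head()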
**Exercise 5**: Fit a linear regression including the new sex and class variables.
In [29]:
# Your code here
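One possible solution (a sketch); note that the plotting cell near the end of this part assumes the fitted result is stored as `model2`.

model2 = sm.OLS(
    titanic.fare,
    sm.add_constant(titanic[['age', 'sex_male', 'class_Second', 'class_Third']])
).fit()
model2.summary()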

**Exercise 6** How do we interpret these results?

  • All else being equal, what does being male do to the fare?
  • What can we say about being male and first-class?

Your answer here

Interactions

In [30]:
# It seemed like gender interacted with age and class. Can we put that in our model?
titanic['sex_male_X_age'] = titanic['age'] * titanic['sex_male']
model3 = sm.OLS(
    titanic.fare,
    sm.add_constant(titanic[['age', 'sex_male', 'class_Second', 'class_Third', 'sex_male_X_age']])
).fit()
model3.summary()

What happened to the age and male terms?

In [31]:
# Now add the gender-by-class interactions as well.
titanic['sex_male_X_class_Second'] = titanic['sex_male'] * titanic['class_Second']
titanic['sex_male_X_class_Third'] = titanic['sex_male'] * titanic['class_Third']
model4 = sm.OLS(
    titanic.fare,
    sm.add_constant(titanic[['age', 'sex_male', 'class_Second', 'class_Third', 'sex_male_X_age', 'sex_male_X_class_Second', 'sex_male_X_class_Third']])
).fit()
model4.summary()

What has happened to the $R^2$ as we added more features? Does this mean that the model is better? (What if we kept adding more predictors and interaction terms? We'll explore this in the next homework.)

In [32]:
models = [model1, model2, model3, model4]
plt.plot([model.df_model for model in models], [model.rsquared for model in models], 'x-')
plt.xlabel("Model df")
plt.ylabel("$R^2$");
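Since R^2 can only increase as predictors are added, a useful sanity check (a sketch, assuming `model2` from Exercise 5 exists) is to also compare a criterion that penalizes model size, such as the BIC used in the next part:

pd.DataFrame({
    'df_model': [m.df_model for m in models],
    'r_squared': [m.rsquared for m in models],
    'bic': [m.bic for m in models],
}, index=['model1', 'model2', 'model3', 'model4'])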

2. Model Selection via Exhaustive Search

The dataset for this problem contains 10 simulated predictors and a response variable.

In [33]:
# Import the dataset
data = pd.read_csv('data/dataset3.txt')
data.head()

By visually inspecting the dataset, can we tell whether some of the predictors are correlated with one another?

In [34]:
sns.pairplot(data)

Predictors x1, x2, x3 seem to be perfectly correlated while predictors x4, x5, x6, x7 show very high degrees of correlation.

In [35]:
data.corr()

Let us compute the coefficient of correlation between each pair of predictors, and visualize the matrix of correlation coefficients using a heat map.

In [36]:
# generating heat map
sns.heatmap(data.corr())

Do the predictors fall naturally into groups based on the correlation values?

We can see that highly correlated predictors form dark red blocks in the heat map, corresponding to correlation values close to 1. Other groups of predictors also stand out as distinctly coloured blocks.

**If you were asked to select a minimal subset of predictors based on the correlation information in order to build a good regression model, how many predictors would you pick, and which ones would you choose?**

Your answer here: We may choose one predictor from the x1, x2, x3 group; let's pick x1.
Similarly, let's pick x6 out of x4, x5, x6, x7.
The other predictors are not strongly correlated, so we pick them all, i.e. x8, x9, x10.
Thus, we have picked 5 predictors for our regression model: x1, x6, x8, x9, x10.

Model Selection Criteria: Bayesian Information Criterion (BIC)

Generally, BIC = -2 x log-likelihood + log(n) x K

For least-squares regression specifically,

$$BIC = n \log \Big(\frac{RSS}{n}\Big) + \log(n)*K$$

where:
RSS = Residual Sum of Squares,
n = the number of observations,
K = the number of features in our model.
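To make the formula concrete, here is a small sketch that computes this RSS-based BIC for an already-fitted statsmodels OLS result (for example `model1` from Part 1). Note that statsmodels' own `model.bic` is computed from the log-likelihood and differs from this value by an additive constant that depends only on n, so both versions rank a fixed set of candidate models identically.

def bic_from_rss(model):
    # BIC = n*log(RSS/n) + log(n)*K, here counting the intercept in K
    n = model.nobs
    rss = model.ssr                # residual sum of squares
    k = model.df_model + 1         # number of fitted parameters, incl. intercept
    return n * np.log(rss / n) + np.log(n) * k

# e.g. bic_from_rss(model1) and model1.bic differ only by a constant in n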

Selecting a minimal subset of predictors

Let us apply the exhaustive search variable selection method discussed in class to choose a minimal subset of predictors that yields high prediction accuracy. We will use the Bayesian Information Criterion (BIC) to choose the subset size.

In [37]:
x = data.iloc[:,:-1]
y = data.iloc[:,-1]
x.shape, y.shape
In [38]:
def find_best_subset_of_size(x, y, num_predictors):
    predictors = x.columns
    
    best_r_squared = -np.inf
    best_model_data = None

    # Enumerate subsets of the given size
    subsets_of_size_k = itertools.combinations(predictors, num_predictors)

    # Inner loop: iterate through subsets_k
    for subset in subsets_of_size_k:

        # Fit regression model using ‘subset’ and calculate R^2 
        # Keep track of subset with highest R^2

        features = list(subset)
        x_subset = sm.add_constant(x[features])

        model = sm.OLS(y, x_subset).fit()
        r_squared = model.rsquared

        # Check if we get a higher R^2 value than the current max R^2.
        # If so, update our best subset 
        if r_squared > best_r_squared:
            best_r_squared = r_squared
            best_model_data = {
                'r_squared': r_squared,
                'subset': features,
                'model': model
            }
    return best_model_data
In [39]:
find_best_subset_of_size(x, y, 8)
In [40]:
def exhaustive_search_selection(x, y):
    """Exhaustively search predictor combinations

    Parameters:
    -----------
    x : DataFrame of predictors/features
    y : response variable
    
    
    Returns:
    -----------
    
    DataFrame comparing the best subset of each size (with its
    R^2 and BIC), and a dict mapping each subset size k to the
    corresponding fitted OLS model
    
    """
    
    predictors = x.columns
    
    stats = []
    models = dict()
    
    # Outer loop: iterate over sizes 1, 2, ..., d
    for k in range(1, len(predictors) + 1):
        
        best_size_k_model = find_best_subset_of_size(
            x, y, num_predictors=k)
        best_subset = best_size_k_model['subset']
        best_model = best_size_k_model['model']
        
        stats.append({
            'k': k,
            'formula': "y ~ {}".format(' + '.join(best_subset)),
            'bic': best_model.bic,
            'r_squared': best_model.rsquared
        })
        models[k] = best_model
        
    return pd.DataFrame(stats), models
In [41]:
stats, models = exhaustive_search_selection(x, y)
stats
In [42]:
stats.plot(x='k', y='bic', marker='*')
In [43]:
stats.info()
In [44]:
best_stat = stats.iloc[stats.bic.idxmin()]
best_stat
In [45]:
best_k = best_stat['k']
best_bic = best_stat['bic']
best_formula = best_stat['formula']
best_r2 = best_stat['r_squared']
In [46]:
print("The best overall model is `{formula}` with bic={bic:.2f} and R^2={r_squared:.3f}".format(
    formula=best_formula, bic=best_bic, r_squared=best_r2))
In [47]:
models[best_k].summary()

Do the chosen subsets match the ones you picked using the correlation matrix you had visualized?

Yes! The predictor subset matches with the ones we picked before.


Plotting multiple axes in a single figure

source: http://matplotlib.org/faq/usage_faq.html

See also this matplotlib tutorial.
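As a minimal sketch of the idea (using only the packages already imported above), `plt.subplots` creates one figure with a grid of axes objects that can each be drawn on independently:

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(13, 5))
axes[0].hist(titanic['fare'], bins=30)
axes[0].set_title("Fare distribution")
axes[1].scatter(titanic['age'], titanic['fare'], alpha=0.3)
axes[1].set_xlabel("Age")
axes[1].set_ylabel("Fare")
fig.suptitle("Two axes in one figure");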
