CS109A Introduction to Data Science
Standard Section 4: Model Selection¶
Harvard University
Fall 2018
Instructors: Pavlos Protopapas, Kevin Rader
Section Leaders: Mehul Smriti Raje, Ken Arnold, Karan Motwani, Cecilia Garraffo
#RUN THIS CELL
from IPython.core.display import HTML
HTML('')
# Data and Stats packages
import numpy as np
import pandas as pd
from sklearn import metrics, datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression, Lasso, LassoCV, Ridge, RidgeCV
import statsmodels.api as sm
# Visualization packages
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
matplotlib.rcParams['figure.figsize'] = (13.0, 6.0)
# Other
import itertools
import tqdm
# Aesthetic settings
from IPython.display import display
pd.set_option('display.max_columns', 999)
pd.set_option('display.width', 500)
sns.set_style('whitegrid')
sns.set_context('talk')
NYC Car Hire Dataset¶
The dataset can be downloaded from https://drive.google.com/open?id=0B28c493CP9GtRHFVM0U0SVI2Yms
Our goal, as in lecture, will be to predict taxi fares based on various other features of the ride.
For descriptions of what the columns mean, see http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml and data dictionaries there.
# Instead of converting to datetime after loading, you can tell Pandas the datatypes you want while loading.
nyc_cab_df = pd.read_csv(
'nyc_car_hire_data.csv',
dtype={'Store_and_fwd_flag': str, 'Base': str},
parse_dates=['lpep_pickup_datetime', 'Lpep_dropoff_datetime'])
nyc_cab_df['pickup_hour'] = nyc_cab_df.lpep_pickup_datetime.dt.hour
nyc_cab_df['dropoff_hour'] = nyc_cab_df.Lpep_dropoff_datetime.dt.hour
nyc_cab_df.head()
Let's have a quick look at the data before we get started. (We really should do more than this, but skipping that for time.)
plt.boxplot(nyc_cab_df.Fare_amount);
# We notice there are some big outliers; let's remove them for now.
q1, q3 = nyc_cab_df.Fare_amount.quantile([.25, .75])
iqr = q3 - q1
outlier = (nyc_cab_df.Fare_amount == 0) | (nyc_cab_df.Fare_amount > q3 + 2.5 * iqr) # A generous criterion.
nyc_cab_df_clean = nyc_cab_df[~outlier]
print(f"Removed {outlier.sum()} outliers ({outlier.mean():.1%})")
# Just plotting the distribution to see if there are any more blatant issues. Skipping for presentation.
sns.distplot(nyc_cab_df_clean.Fare_amount)
Let's work with a subset of this data for now.¶
# TODO: some useful predictors may be missing from this list, perhaps because they have missing data.
# nyc_cab_df.columns.difference(all_predictors)
all_predictors = ['Trip Length (min)', 'Type', 'Trip_distance', 'TMAX', 'TMIN',
'pickup_hour', 'dropoff_hour', 'Pickup_longitude',
'Pickup_latitude', 'SNOW', 'SNWD', 'PRCP']
sample_frac = .01
shuffled = nyc_cab_df_clean.sample(frac=sample_frac, random_state=0)[all_predictors + ['Fare_amount']]
shuffled.shape
n_total = len(shuffled)
n_test = int(n_total * .2)
n_valid = int(n_total * .2)
n_train = n_total - n_valid - n_test
train = shuffled.iloc[:n_train].copy()
valid = shuffled.iloc[n_train:n_train + n_valid].copy()
test = shuffled.iloc[n_train + n_valid:].copy()
assert len(train) == n_train
assert len(valid) == n_valid
assert len(test) == n_test
print(f"{n_train} train, {n_valid} validation, {n_test} test.")
train.info()
Scaling / Normalization¶
Warm-up exercise: for which of the following do the units of the predictors matter (e.g., trip length in minutes vs seconds; temperature in F or C)?
- k-NN (Nearest Neighbors regression)
- Linear regression
- Lasso regression
- Ridge regression
- kNN: yes. Scaling changes the distance metric, which determines what "neighbor" means.
- Linear regression: no. Multiplying a predictor by $c$ just divides its coefficient by $c$; the fit and the predictions are unchanged.
- Lasso: yes. If the coefficient is divided by $c$, its penalty term $|\beta|$ is also divided by $c$, so the penalized solution changes (see the quick check sketched below).
- Ridge: yes. Same as Lasso, except the penalty term $\beta^2$ is divided by $c^2$.
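Here is a minimal sketch of a numerical check (the alpha value is arbitrary, purely for illustration): multiply one predictor by a constant and refit. The OLS predictions should be unchanged, while the Lasso fit generally changes.
# Sketch: rescale one predictor by c (e.g., trip length in seconds instead of minutes).
c = 60
X_orig = train[all_predictors]
X_rescaled = X_orig.copy()
X_rescaled['Trip Length (min)'] *= c
# OLS: the coefficient is simply divided by c, so predictions are unchanged.
ols_a = LinearRegression().fit(X_orig, train.Fare_amount)
ols_b = LinearRegression().fit(X_rescaled, train.Fare_amount)
print("OLS predictions unchanged:", np.allclose(ols_a.predict(X_orig), ols_b.predict(X_rescaled)))
# Lasso: the penalty on that coefficient is rescaled too, so the solution can change.
lasso_a = Lasso(alpha=1.0).fit(X_orig, train.Fare_amount)
lasso_b = Lasso(alpha=1.0).fit(X_rescaled, train.Fare_amount)
print("Lasso predictions unchanged:", np.allclose(lasso_a.predict(X_orig), lasso_b.predict(X_rescaled)))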
Remember that the mean and variance used to scale data are parameters that need to be learned from our training data.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(train[all_predictors])
pd.DataFrame({
'mean': scaler.mean_,
'variance': scaler.var_
}, index=all_predictors).T
# Scaling in place... not my favorite approach
for df in [train, valid, test]:
df[all_predictors] = scaler.transform(df[all_predictors])
train[all_predictors].describe()
train.Fare_amount.describe()
Nearest-Neighbors Regression: How many neighbors should we use?¶
k_vals = [1, 5, 10, 15, 20, 25, 30, 35, 40, 50, 60]
knns = {
k: KNeighborsRegressor(n_neighbors=k).fit(
train[all_predictors], train.Fare_amount)
for k in k_vals}
train_r2s = [
metrics.r2_score(train.Fare_amount, model.predict(train[all_predictors]))
for k, model in knns.items()]
plt.plot(k_vals, train_r2s, '-+', label="Train")
plt.xlabel('n_neighbors')
plt.ylabel("$R^2$")
plt.legend();
Which n_neighbors should we choose?
What's our goal?¶
Do best on unseen data.
Validation set is currently unseen.¶
So let's see how well our models do on it.
val_r2s = [
metrics.r2_score(valid.Fare_amount, model.predict(valid[all_predictors]))
for k, model in knns.items()]
plt.plot(k_vals, train_r2s, '-+', label="Train")
plt.plot(k_vals, val_r2s, '-*', label="Validation")
plt.xlabel('n_neighbors')
plt.ylabel("$R^2$")
plt.legend();
Now, which n_neighbors should we choose?
best_r2_idx = np.argmax(val_r2s)
best_r2 = val_r2s[best_r2_idx]
best_n_neighbors = k_vals[best_r2_idx]
print(f"Best n_neighbors is {best_n_neighbors}, which gives a validation R^2 of {best_r2:.3f}")
How confident are we about that choice? What if there was something unusual about our particular validation set?
How could we gain more confidence?
Cross-Validation¶
If we repeat an experiment and get the same result, we become more confident in that result. How could we repeat this experiment here?
Let's choose different subsets of our data as our training and validation sets.
from sklearn.model_selection import KFold
n_splits = 10
folds = np.zeros((n_splits, len(train)))
fake_dataset = np.arange(len(train))
for fold_idx, (train_indices, valid_indices) in enumerate(KFold(n_splits=n_splits).split(fake_dataset)):
folds[fold_idx, train_indices] = 2
folds[fold_idx, valid_indices] = 1
fig, ax = plt.subplots(figsize=(10, 5))
sns.heatmap(folds[::-1], ax=ax) # TODO: label this better!
ax.set(xlabel="Datum idx", ylabel="Fold idx");
The core cross-validation loop, for a single setting of n_neighbors:
X = train[all_predictors]
y = train.Fare_amount
n_neighbors = 10 # for example
cv = KFold(n_splits=10)
cv_r2 = []
for fold_idx, (train_indices, valid_indices) in enumerate(cv.split(X)):
# Train on training set.
model = KNeighborsRegressor(n_neighbors=n_neighbors)
model.fit(X.iloc[train_indices], y.iloc[train_indices])
# Evaluate on validation set.
val_predictions = model.predict(X.iloc[valid_indices])
val_r2 = metrics.r2_score(y.iloc[valid_indices], val_predictions)
cv_r2.append(val_r2)
sns.boxplot(cv_r2); plt.xlabel("Validation $R^2$");
Now do this for each candidate n_neighbors!
def cross_validate_knn(X, y, n_splits, k_vals):
cv_results = []
cv = KFold(n_splits=n_splits)
# Outer loop: pick n_neighbors
for n_neighbors in k_vals:
# Inner loop: pick train and validation sets.
for fold_idx, (train_indices, valid_indices) in enumerate(cv.split(X)):
# Train on training set.
model = KNeighborsRegressor(n_neighbors=n_neighbors)
model.fit(X.iloc[train_indices], y.iloc[train_indices])
# Evaluate on validation set. I include MSE here just to show how.
val_predictions = model.predict(X.iloc[valid_indices])
val_r2 = metrics.r2_score(y.iloc[valid_indices], val_predictions)
val_mse = metrics.mean_squared_error(y.iloc[valid_indices], val_predictions)
cv_results.append(dict(
fold_idx=fold_idx,
n_neighbors=n_neighbors,
val_r2=val_r2,
val_mse=val_mse
))
return pd.DataFrame(cv_results)
cv_results = cross_validate_knn(X, y, n_splits=10, k_vals=k_vals)
cv_results.head(11)
# Make a table of the validation R^2 data, where each row is a different fold.
val_r2_table = cv_results.pivot(index='fold_idx', columns='n_neighbors', values='val_r2')
# alternative: cv_results.set_index(['fold_idx', 'n_neighbors']).val_r2.unstack(level=1)
val_r2_table
Let's visualize that data, 3 different ways.
fig, ax = plt.subplots(figsize=(10, 5))
for fold_idx, row in val_r2_table.iterrows():
    ax.plot(row.index, row.values, '-x', c='black', alpha=.3, label="single fold" if fold_idx == 0 else None)
# plt.plot(val_r2_table.columns, val_r2_table.mean(axis=0), c='white', linewidth=5)
ax.plot(val_r2_table.columns, val_r2_table.mean(axis=0), label='Mean')
ax.legend()
ax.set(xlabel="n_neighbors", ylabel="$R^2$ on validation fold", title="Performance vs hyperparameter over different CV folds");
fig, ax = plt.subplots(figsize=(10, 5))
ax.boxplot(val_r2_table.values, positions=k_vals, widths=4);
ax.set(xlabel="n_neighbors", ylabel="$R^2$ on validation fold");
cv_r2_mean = cv_results.groupby('n_neighbors').val_r2.mean()
cv_r2_std = cv_results.groupby('n_neighbors').val_r2.std()
plt.errorbar(k_vals, cv_r2_mean, yerr=cv_r2_std, capsize=5)
plt.gca().set(xlabel="n_neighbors", ylabel=r"Validation $R^2$ $\pm$ stdev");
Now, which n_neighbors should you choose?
Make a final model with our optimal hyperparameter value¶
For now, let's use the model with the best mean CV performance.
best_n_neighbors = cv_r2_mean.idxmax()
estimated_r2_best = cv_r2_mean[best_n_neighbors]
print("The best n_neighbors is {}, with estimated R^2 = {:.2f} +- {:.2f}".format(
best_n_neighbors, estimated_r2_best, cv_r2_std[best_n_neighbors]))
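For reference, another common selection rule is the "one-standard-error" rule: prefer the simplest model (here, the largest n_neighbors) whose mean CV score is within one standard error of the best. A minimal sketch using the quantities computed above:
# One-standard-error rule (sketch): candidates within one SE of the best mean CV R^2.
one_se = cv_r2_std[cv_r2_mean.idxmax()] / np.sqrt(len(val_r2_table))
within_one_se = cv_r2_mean[cv_r2_mean >= cv_r2_mean.max() - one_se]
print("n_neighbors values within one SE of the best:", list(within_one_se.index))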
all_my_data = pd.concat([train, valid])
my_best_model = (
KNeighborsRegressor(n_neighbors=best_n_neighbors)
.fit(all_my_data[all_predictors], all_my_data.Fare_amount))
assert len(train) + len(valid) == len(all_my_data)
print(f"Trained a new model on {len(train)} + {len(valid)} = {len(all_my_data)} samples.")
How well does this model perform on unseen data?¶
We've already looked at train and validation. Only test remains completely unseen. The point of no return...
print(f"We expect our best model to get an R^2 of {estimated_r2_best:.3f}, or even better because we have more data now.")
print("Drumroll please...")
test_predictions = my_best_model.predict(test[all_predictors])
test_r2 = metrics.r2_score(test.Fare_amount, test_predictions)
print(f"Our best model got an R^2 of {test_r2:.3f}.")
By the way, when I first ran this, the test R^2 was much worse, around 0.5. But I figured out why. Can you?
It turns out that the test set happened to have some enormous outliers.
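One quick diagnostic for this kind of surprise is to compare the response distribution across the splits: a few huge fares in the test split inflate the squared errors and drag down R^2. A minimal check (on the cleaned data, the splits should now look similar):
# Compare the Fare_amount distribution across the three splits;
# outliers would show up in the max and upper quantiles.
pd.DataFrame({
    'train': train.Fare_amount.describe(),
    'valid': valid.Fare_amount.describe(),
    'test': test.Fare_amount.describe(),
})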
Same approach works for penalized regression!¶
def cross_validate_general(train_model, X, y, n_splits, params):
cv_results = []
cv = KFold(n_splits=n_splits)
# Outer loop: pick hyperparameter setting.
for param_setting in params:
# Inner loop: pick train and validation sets.
for fold_idx, (train_indices, valid_indices) in enumerate(cv.split(X, y)):
# Train on training set.
model = train_model(X[train_indices], y[train_indices], param_setting)
# Evaluate on validation set.
val_predictions = model.predict(X[valid_indices])
val_r2 = metrics.r2_score(y[valid_indices], val_predictions)
val_mse = metrics.mean_squared_error(y[valid_indices], val_predictions)
cv_results.append(dict(
param_setting,
fold_idx=fold_idx,
val_r2=val_r2,
val_mse=val_mse
))
return pd.DataFrame(cv_results)
def train_lasso(X, y, params):
return Lasso(alpha=params['alpha']).fit(X, y)
def train_KNN(X, y, params):
return KNeighborsRegressor(n_neighbors=params['n_neighbors']).fit(X, y)
X = train[all_predictors].values
y = train.Fare_amount.values
params = [{'alpha': alpha} for alpha in [.001, .01, .1, 1., 10.]]
cv_results = cross_validate_general(
train_model=train_lasso,
# train_model=train_KNN,
X=X, y=y,
n_splits=10, params=params)
agg_cv_results = cv_results.groupby('alpha').val_r2.agg(['mean', 'std'])
agg_cv_results
best_alpha = agg_cv_results['mean'].idxmax()
estimated_r2 = agg_cv_results['mean'][best_alpha]
best_model = Lasso(alpha=best_alpha).fit(X, y)
test_predictions = best_model.predict(test[all_predictors].values)
test_r2 = metrics.r2_score(test.Fare_amount, test_predictions)
print(f"Best alpha was {best_alpha}, with estimated R^2 of {estimated_r2:.3f}.")
print(f"Training this on all data gives a test R^2 of {test_r2:.3f}")
# We could also use a meta-estimator from sklearn.
# But make sure you understand all of what's going on in the above first!
from sklearn.model_selection import GridSearchCV
gridsearch_model = GridSearchCV(
estimator=Lasso(),
param_grid=[{'alpha': [.1, 1., 10.]}],
cv=KFold(n_splits=10),
scoring='r2',
return_train_score=False)
gridsearch_model.fit(X, y)
pd.DataFrame(gridsearch_model.cv_results_).set_index('param_alpha')
# It also refits the best model on the full training set, so we get the same test R^2.
gridsearch_model.score(test[all_predictors], test.Fare_amount)
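For penalized regression specifically, scikit-learn also offers estimators like LassoCV (imported above but not used yet) that search over a path of alpha values via cross-validation and then refit on the full training set. A minimal sketch:
# LassoCV cross-validates over a grid of alphas internally, then refits with the best one.
lasso_cv = LassoCV(cv=KFold(n_splits=10)).fit(X, y)
print(f"LassoCV chose alpha = {lasso_cv.alpha_:.4g}")
print(f"Test R^2: {metrics.r2_score(test.Fare_amount, lasso_cv.predict(test[all_predictors].values)):.3f}")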
Homework Critique¶
- Good visuals "don't make me think".
- Good written analysis interprets (doesn't just describe) and connects the data with the real world.
- Good code is easy to change without making mistakes, which lets you iterate faster.