CS109A Introduction to Data Science

Standard Section 4: Model Selection

Harvard University
Fall 2018
Instructors: Pavlos Protopapas, Kevin Rader
Section Leaders: Mehul Smriti Raje, Ken Arnold, Karan Motwani, Cecilia Garraffo


In [2]:
#RUN THIS CELL 
from IPython.core.display import HTML
HTML('')
Out[2]:
In [2]:
# Data and Stats packages
import numpy as np
import pandas as pd
from sklearn import metrics, datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression, Lasso, LassoCV, Ridge, RidgeCV
import statsmodels.api as sm

# Visualization packages
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
matplotlib.rcParams['figure.figsize'] = (13.0, 6.0)

# Other
import itertools
import tqdm

# Aesthetic settings
from IPython.display import display
pd.set_option('display.max_columns', 999)
pd.set_option('display.width', 500)
sns.set_style('whitegrid')
sns.set_context('talk')
/Users/kcarnold/anaconda3/envs/py36/lib/python3.6/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
  from pandas.core import datetools

NYC Car Hire Dataset

The dataset can be downloaded from https://drive.google.com/open?id=0B28c493CP9GtRHFVM0U0SVI2Yms

Our goal, as in lecture, will be to predict taxi fares based on various other features of the ride.

For descriptions of what the columns mean, see http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml and data dictionaries there.

In [3]:
# Instead of converting to datetime after loading, you can tell Pandas the datatypes you want while loading.
nyc_cab_df = pd.read_csv(
    'nyc_car_hire_data.csv',
    dtype={'Store_and_fwd_flag': str, 'Base': str},
    parse_dates=['lpep_pickup_datetime', 'Lpep_dropoff_datetime'])

nyc_cab_df['pickup_hour'] = nyc_cab_df.lpep_pickup_datetime.dt.hour
nyc_cab_df['dropoff_hour'] = nyc_cab_df.Lpep_dropoff_datetime.dt.hour

nyc_cab_df.head()
Out[3]:
AWND Base Day Dropoff_latitude Dropoff_longitude Ehail_fee Extra Fare_amount Lpep_dropoff_datetime MTA_tax PRCP Passenger_count Payment_type Pickup_latitude Pickup_longitude RateCodeID SNOW SNWD Store_and_fwd_flag TMAX TMIN Tip_amount Tolls_amount Total_amount Trip_distance Trip_type Type VendorID lpep_pickup_datetime Trip Length (min) pickup_hour dropoff_hour
0 4.7 B02512 1 NaN NaN NaN NaN 33.863498 2014-04-01 00:24:00 NaN 0.0 NaN NaN 40.7690 -73.9549 NaN 0.0 0.0 NaN 60 39 NaN NaN NaN 4.083561 NaN 1 NaN 2014-04-01 00:11:00 13.0 0 0
1 4.7 B02512 1 NaN NaN NaN NaN 19.022892 2014-04-01 00:29:00 NaN 0.0 NaN NaN 40.7267 -74.0345 NaN 0.0 0.0 NaN 60 39 NaN NaN NaN 3.605694 NaN 1 NaN 2014-04-01 00:17:00 12.0 0 0
2 4.7 B02512 1 NaN NaN NaN NaN 25.498981 2014-04-01 00:34:00 NaN 0.0 NaN NaN 40.7316 -73.9873 NaN 0.0 0.0 NaN 60 39 NaN NaN NaN 4.221763 NaN 1 NaN 2014-04-01 00:21:00 13.0 0 0
3 4.7 B02512 1 NaN NaN NaN NaN 28.024628 2014-04-01 00:39:00 NaN 0.0 NaN NaN 40.7588 -73.9776 NaN 0.0 0.0 NaN 60 39 NaN NaN NaN 2.955510 NaN 1 NaN 2014-04-01 00:28:00 11.0 0 0
4 4.7 B02512 1 NaN NaN NaN NaN 12.083589 2014-04-01 00:40:00 NaN 0.0 NaN NaN 40.7594 -73.9722 NaN 0.0 0.0 NaN 60 39 NaN NaN NaN 1.922292 NaN 1 NaN 2014-04-01 00:33:00 7.0 0 0

Let's have a quick look at the data before we get started. (We really should do more than this, but we're skipping it in the interest of time.)

In [4]:
plt.boxplot(nyc_cab_df.Fare_amount);
In [5]:
# We notice there are some big outliers; let's remove them for now.
q1, q3 = nyc_cab_df.Fare_amount.quantile([.25, .75])
iqr = q3 - q1
outlier = (nyc_cab_df.Fare_amount == 0) | (nyc_cab_df.Fare_amount > q3 + 2.5 * iqr) # A generous criterion.
nyc_cab_df_clean = nyc_cab_df[~outlier]
print(f"Removed {outlier.sum()} outliers ({outlier.mean():.1%})")
Removed 10019 outliers (0.5%)
In [6]:
# Just plotting distributions to see if there are any more blatant things to notice. Skipping for presentation.
sns.distplot(nyc_cab_df_clean.Fare_amount)
/Users/kcarnold/anaconda3/envs/py36/lib/python3.6/site-packages/scipy/stats/stats.py:1706: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
Out[6]:

Let's work with a subset of this data for now.

In [7]:
# TODO: some useful predictors may be missing from this list, but perhaps only because they contain missing values.
# nyc_cab_df.columns.difference(all_predictors)
all_predictors = ['Trip Length (min)', 'Type', 'Trip_distance', 'TMAX', 'TMIN', 
                  'pickup_hour', 'dropoff_hour', 'Pickup_longitude', 
                  'Pickup_latitude', 'SNOW', 'SNWD', 'PRCP']

sample_frac = .01
shuffled = nyc_cab_df_clean.sample(frac=sample_frac, random_state=0)[all_predictors + ['Fare_amount']]
shuffled.shape

n_total = len(shuffled)
n_test = int(n_total * .2)
n_valid = int(n_total * .2)
n_train = n_total - n_valid - n_test

train = shuffled.iloc[:n_train].copy()
valid = shuffled.iloc[n_train:n_train + n_valid].copy()
test = shuffled.iloc[n_train + n_valid:].copy()
assert len(train) == n_train
assert len(valid) == n_valid
assert len(test) == n_test
print(f"{n_train} train, {n_valid} validation, {n_test} test.")
11183 train, 3727 validation, 3727 test.
In [8]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11183 entries, 959269 to 686097
Data columns (total 13 columns):
Trip Length (min)    11183 non-null float64
Type                 11183 non-null int64
Trip_distance        11183 non-null float64
TMAX                 11183 non-null int64
TMIN                 11183 non-null int64
pickup_hour          11183 non-null int64
dropoff_hour         11183 non-null int64
Pickup_longitude     11183 non-null float64
Pickup_latitude      11183 non-null float64
SNOW                 11183 non-null float64
SNWD                 11183 non-null float64
PRCP                 11183 non-null float64
Fare_amount          11183 non-null float64
dtypes: float64(8), int64(5)
memory usage: 1.2 MB

Scaling / Normalization

Warm-up exercise: for which of the following do the units of the predictors matter (e.g., trip length in minutes vs. seconds; temperature in F or C)?

  • k-NN (nearest-neighbors regression)
  • Linear regression
  • Lasso regression
  • Ridge regression

Answers (a quick numerical check follows this list):

  • k-NN: yes. Scaling changes the distance metric, which determines what "neighbor" means.
  • Linear regression: no. Multiplying a predictor by $c$ just divides its coefficient by $c$; the predictions are unchanged.
  • Lasso: yes. If a coefficient is divided by $c$, its penalty term is divided by $|c|$, so the penalty depends on the units.
  • Ridge: yes. Same as Lasso, except the penalty term is divided by $c^2$.
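To convince yourself, here's a minimal sketch on synthetic data (not the taxi data, and using only the imports from the top of the notebook): rescaling one predictor, e.g. converting minutes to seconds, leaves the linear-regression predictions unchanged but changes the k-NN and Lasso fits.

# Sketch on synthetic data: which models are invariant to rescaling a predictor?
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

X_rescaled = X_demo.copy()
X_rescaled[:, 0] *= 60  # e.g., minutes -> seconds

for name, model in [('Linear regression', LinearRegression()),
                    ('k-NN', KNeighborsRegressor(n_neighbors=5)),
                    ('Lasso', Lasso(alpha=0.1))]:
    pred_orig = model.fit(X_demo, y_demo).predict(X_demo)
    pred_rescaled = model.fit(X_rescaled, y_demo).predict(X_rescaled)
    print(name, 'predictions unchanged after rescaling?', np.allclose(pred_orig, pred_rescaled))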

Remember that the mean and variance used to scale data are parameters that need to be learned from our training data.

In [9]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(train[all_predictors])

pd.DataFrame({
    'mean': scaler.mean_,
    'variance': scaler.var_
}, index=all_predictors).T
Out[9]:
Trip Length (min) Type Trip_distance TMAX TMIN pickup_hour dropoff_hour Pickup_longitude Pickup_latitude SNOW SNWD PRCP
mean 11.330680 0.300009 2.875462 60.886167 43.269874 13.980148 13.978897 -73.879007 40.718952 0.0 0.0 0.336174
variance 61.431918 0.210004 5.458100 70.157211 36.335824 42.114960 43.055980 4.887413 1.486695 0.0 0.0 1.105074
In [10]:
# Scaling in place... not my favorite approach (see the alternative sketch after this cell).
for df in [train, valid, test]:
    df[all_predictors] = scaler.transform(df[all_predictors])
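If you'd rather not mutate the original DataFrames, an alternative (a sketch meant to replace the cell above, not to run after it, since the splits would otherwise be scaled twice) is to keep scaled copies:

# Alternative sketch: build scaled copies instead of overwriting the splits in place.
def scale_copy(df):
    out = df.copy()
    out[all_predictors] = scaler.transform(out[all_predictors])
    return out

train_scaled, valid_scaled, test_scaled = [scale_copy(df) for df in (train, valid, test)]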
In [11]:
train[all_predictors].describe()
Out[11]:
Trip Length (min) Type Trip_distance TMAX TMIN pickup_hour dropoff_hour Pickup_longitude Pickup_latitude SNOW SNWD PRCP
count 1.118300e+04 1.118300e+04 1.118300e+04 1.118300e+04 1.118300e+04 1.118300e+04 1.118300e+04 1.118300e+04 1.118300e+04 11183.0 11183.0 1.118300e+04
mean -8.895286e-17 -8.895286e-18 1.105557e-16 -2.265121e-16 -5.534139e-16 -5.511900e-17 1.906133e-17 -4.778039e-16 3.429768e-15 0.0 0.0 -9.530664e-18
std 1.000045e+00 1.000045e+00 1.000045e+00 1.000045e+00 1.000045e+00 1.000045e+00 1.000045e+00 1.000045e+00 1.000045e+00 0.0 0.0 1.000045e+00
min -1.445636e+00 -6.546676e-01 -1.230798e+00 -1.657854e+00 -2.035507e+00 -2.154238e+00 -2.130375e+00 -2.332211e-01 -3.339532e+01 0.0 0.0 -3.197923e-01
25% -6.801201e-01 -6.546676e-01 -6.615118e-01 -8.221316e-01 -5.424547e-01 -7.674041e-01 -7.587809e-01 -4.550192e-02 -3.110732e-04 0.0 0.0 -3.197923e-01
50% -1.697762e-01 -6.546676e-01 -2.420375e-01 1.359045e-02 -4.477065e-02 1.571516e-01 3.080150e-01 -3.345387e-02 2.595593e-02 0.0 0.0 -3.197923e-01
75% 3.405678e-01 1.527493e+00 3.186884e-01 7.299237e-01 4.529134e-01 7.735220e-01 7.652132e-01 -1.930459e-02 6.536592e-02 0.0 0.0 -2.532033e-01
max 5.954351e+00 1.527493e+00 8.113206e+00 1.923812e+00 2.609544e+00 1.389892e+00 1.374811e+00 3.341808e+01 3.855061e-01 0.0 0.0 4.408025e+00
In [12]:
train.Fare_amount.describe()
Out[12]:
count    11183.000000
mean        14.650476
std          8.598972
min          0.010000
25%          7.500000
50%         13.000000
75%         20.014209
max         52.500000
Name: Fare_amount, dtype: float64

Nearest-Neighbors Regression: How many neighbors should we use?

In [13]:
k_vals = [1, 5, 10, 15, 20, 25, 30, 35, 40, 50, 60]
knns = {
    k: KNeighborsRegressor(n_neighbors=k).fit(
        train[all_predictors], train.Fare_amount)
    for k in k_vals}

train_r2s = [
    metrics.r2_score(train.Fare_amount, model.predict(train[all_predictors]))
    for k, model in knns.items()]
In [14]:
plt.plot(k_vals, train_r2s, '-+', label="Train")
plt.xlabel('n_neighbors')  
plt.ylabel("$R^2$")
plt.legend();

Which n_neighbors should we choose?

What's our goal?

To do as well as possible on unseen data.

The validation set is still unseen by our models.

So let's see how well they do on it.

In [15]:
val_r2s = [
    metrics.r2_score(valid.Fare_amount, model.predict(valid[all_predictors]))
    for k, model in knns.items()]
In [16]:
plt.plot(k_vals, train_r2s, '-+', label="Train")
plt.plot(k_vals, val_r2s, '-*', label="Validation")
plt.xlabel('n_neighbors')
plt.ylabel("$R^2$")
plt.legend();

Now, which n_neighbors should we choose?

In [17]:
best_r2_idx = np.argmax(val_r2s)
best_r2 = val_r2s[best_r2_idx]
best_n_neighbors = k_vals[best_r2_idx]
print(f"Best n_neighbors is {best_n_neighbors}, which gives a validation R^2 of {best_r2:.3f}")
Best n_neighbors is 10, which gives a validation R^2 of 0.855

How confident are we about that choice? What if there was something unusual about our particular validation set?

How could we gain more confidence?

Cross-Validation

If we repeat an experiment and the results say the same thing, we get more confident in that result. How could we repeat this experiment?

Let's choose different subsets of our data as our training and validation sets.

In [18]:
from sklearn.model_selection import KFold
n_splits = 10
folds = np.zeros((n_splits, len(train)))
fake_dataset = np.arange(len(train))
for fold_idx, (train_indices, valid_indices) in enumerate(KFold(n_splits=n_splits).split(fake_dataset)):
    folds[fold_idx, train_indices] = 2  # rows used for training in this fold
    folds[fold_idx, valid_indices] = 1  # rows held out for validation in this fold
fig, ax = plt.subplots(figsize=(10, 5))
sns.heatmap(folds[::-1], ax=ax) # TODO: label this better!
ax.set(xlabel="Datum idx", ylabel="Fold idx");

Cross-Validation core

In [19]:
X = train[all_predictors]
y = train.Fare_amount
In [20]:
n_neighbors = 10 # for example
cv = KFold(n_splits=10)
cv_r2 = []
for fold_idx, (train_indices, valid_indices) in enumerate(cv.split(X)):
    # Train on training set.
    model = KNeighborsRegressor(n_neighbors=n_neighbors)
    model.fit(X.iloc[train_indices], y.iloc[train_indices])

    # Evaluate on validation set.
    val_predictions = model.predict(X.iloc[valid_indices])
    val_r2 = metrics.r2_score(y.iloc[valid_indices], val_predictions)

    cv_r2.append(val_r2)
In [21]:
sns.boxplot(cv_r2); plt.xlabel("Validation $R^2$");

Now do this for each candidate n_neighbors!

In [22]:
def cross_validate_knn(X, y, n_splits, k_vals):
    cv_results = []
    cv = KFold(n_splits=n_splits)
    
    # Outer loop: pick n_neighbors
    for n_neighbors in k_vals:
        
        # Inner loop: pick train and validation sets.
        for fold_idx, (train_indices, valid_indices) in enumerate(cv.split(X)):
            # Train on training set.
            model = KNeighborsRegressor(n_neighbors=n_neighbors)
            model.fit(X.iloc[train_indices], y.iloc[train_indices])

            # Evaluate on validation set. I include MSE here just to show how.
            val_predictions = model.predict(X.iloc[valid_indices])
            val_r2 = metrics.r2_score(y.iloc[valid_indices], val_predictions)
            val_mse = metrics.mean_squared_error(y.iloc[valid_indices], val_predictions)

            cv_results.append(dict(
                fold_idx=fold_idx,
                n_neighbors=n_neighbors,
                val_r2=val_r2,
                val_mse=val_mse
            ))
    return pd.DataFrame(cv_results)
cv_results = cross_validate_knn(X, y, n_splits=10, k_vals=k_vals)
cv_results.head(11)
Out[22]:
fold_idx n_neighbors val_mse val_r2
0 0 1 19.582487 0.751775
1 1 1 17.710063 0.743087
2 2 1 17.890290 0.739744
3 3 1 15.108882 0.790357
4 4 1 21.316752 0.713888
5 5 1 15.927632 0.773481
6 6 1 16.262528 0.774167
7 7 1 16.395153 0.789799
8 8 1 19.922931 0.728344
9 9 1 16.710313 0.795838
10 0 5 12.751734 0.838361
In [23]:
# Make a table of the validation R^2 data, where each row is a different fold.
val_r2_table = cv_results.pivot(index='fold_idx', columns='n_neighbors', values='val_r2')
# alternative: cv_results.set_index(['fold_idx', 'n_neighbors']).val_r2.unstack(level=1)
val_r2_table
Out[23]:
n_neighbors 1 5 10 15 20 25 30 35 40 50 60
fold_idx
0 0.751775 0.838361 0.846247 0.848543 0.846561 0.845755 0.843973 0.844033 0.840830 0.837155 0.834562
1 0.743087 0.829843 0.853299 0.858193 0.857334 0.856293 0.854176 0.853082 0.851862 0.849573 0.847336
2 0.739744 0.851182 0.859116 0.861349 0.863230 0.860448 0.860142 0.857826 0.855546 0.852625 0.849647
3 0.790357 0.855258 0.860757 0.863444 0.862979 0.862656 0.860542 0.858673 0.859154 0.855150 0.851734
4 0.713888 0.785308 0.791115 0.797459 0.797057 0.794462 0.792262 0.789779 0.788168 0.783295 0.780932
5 0.773481 0.842808 0.852990 0.859311 0.862505 0.858309 0.855681 0.854264 0.853136 0.849946 0.846897
6 0.774167 0.832477 0.844946 0.843760 0.844797 0.843010 0.841472 0.839870 0.838088 0.835430 0.832205
7 0.789799 0.855152 0.860794 0.860559 0.861657 0.859818 0.856437 0.854543 0.852385 0.850082 0.846750
8 0.728344 0.788996 0.800917 0.801670 0.800536 0.799309 0.798490 0.796204 0.794484 0.790631 0.789120
9 0.795838 0.871517 0.875833 0.874913 0.875388 0.874800 0.874535 0.873019 0.871999 0.869174 0.865932

Let's visualize that data, 3 different ways.

In [24]:
fig, ax = plt.subplots(figsize=(10, 5))
for fold_idx, row in val_r2_table.iterrows():
    ax.plot(row.index, row.values, '-x', c='black', alpha=.3, label="single fold" if fold_idx == 0 else None)
# plt.plot(row.index, val_r2_table.mean(axis=0), c='white', linewidth=5)
ax.plot(val_r2_table.columns, val_r2_table.mean(axis=0), label='Mean')
ax.legend()
ax.set(xlabel="n_neighbors", ylabel="$R^2$ on validation fold", title="Performance vs hyperparameter over different CV folds");
In [25]:
fig, ax = plt.subplots(figsize=(10, 5))
ax.boxplot(val_r2_table.values, positions=k_vals, widths=4);
ax.set(xlabel="n_neighbors", ylabel="$R^2$ on validation fold");
In [26]:
cv_r2_mean = cv_results.groupby('n_neighbors').val_r2.mean()
cv_r2_std = cv_results.groupby('n_neighbors').val_r2.std()
plt.errorbar(k_vals, cv_r2_mean, yerr=cv_r2_std, capsize=5)
plt.gca().set(xlabel="n_neighbors", ylabel=r"Validation $R^2$ $\pm$ stdev");

Now, which n_neighbors should you choose?

Make a final model with our optimal hyperparameter value

For now, let's use the model with the best mean CV performance.

In [27]:
best_n_neighbors = cv_r2_mean.idxmax()
estimated_r2_best = cv_r2_mean[best_n_neighbors]
print("The best n_neighbors is {}, with estimated R^2 = {:.2f} +- {:.2f}".format(
      best_n_neighbors, estimated_r2_best, cv_r2_std[best_n_neighbors]))
The best n_neighbors is 20, with estimated R^2 = 0.85 +- 0.03
In [28]:
all_my_data = pd.concat([train, valid])
my_best_model = (
    KNeighborsRegressor(n_neighbors=best_n_neighbors)
    .fit(all_my_data[all_predictors], all_my_data.Fare_amount))
assert len(train) + len(valid) == len(all_my_data)
print(f"Trained a new model on {len(train)} + {len(valid)} = {len(all_my_data)} samples.")
Trained a new model on 11183 + 3727 = 14910 samples.

How well does this model perform on unseen data?

We've already looked at train and validation. Only test remains completely unseen. The point of no return...

In [29]:
print(f"We expect our best model to get an R^2 of {estimated_r2_best:.3f}, or even better because we have more data now.")
print("Drumroll please...")
We expect our best model to get an R^2 of 0.847, or even better because we have more data now.
Drumroll please...
In [30]:
test_predictions = my_best_model.predict(test[all_predictors])
test_r2 = metrics.r2_score(test.Fare_amount, test_predictions)
print(f"Our best model got an R^2 of {test_r2:.3f}.")
Our best model got an R^2 of 0.817.

By the way, when I first ran this, the test R^2 was much worse, around 0.5. But I figured out why. Can you?

It turns out that the test set happened to contain some enormous outliers.
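One quick way to catch that kind of surprise (a sketch using the splits defined above) is to compare the distribution of the target across splits:

# Sketch: compare the target's distribution in train vs. test to spot stray outliers.
print(train.Fare_amount.describe())
print(test.Fare_amount.describe())

sns.distplot(train.Fare_amount, label='train')
sns.distplot(test.Fare_amount, label='test')
plt.legend();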

Same approach works for penalized regression!

In [36]:
def cross_validate_general(train_model, X, y, n_splits, params):
    cv_results = []
    cv = KFold(n_splits=n_splits)
    
    # Outer loop: pick hyperparameter setting.
    for param_setting in params:
        
        # Inner loop: pick train and validation sets.
        for fold_idx, (train_indices, valid_indices) in enumerate(cv.split(X, y)):
            # Train on training set.
            model = train_model(X[train_indices], y[train_indices], param_setting)

            # Evaluate on validation set.
            val_predictions = model.predict(X[valid_indices])
            val_r2 = metrics.r2_score(y[valid_indices], val_predictions)
            val_mse = metrics.mean_squared_error(y[valid_indices], val_predictions)

            cv_results.append(dict(
                param_setting,
                fold_idx=fold_idx,
                val_r2=val_r2,
                val_mse=val_mse
            ))
    return pd.DataFrame(cv_results)

def train_lasso(X, y, params):
    return Lasso(alpha=params['alpha']).fit(X, y)
    
def train_KNN(X, y, params):
    return KNeighborsRegressor(n_neighbors=params['n_neighbors']).fit(X, y)

X = train[all_predictors].values
y = train.Fare_amount.values

params = [{'alpha': alpha} for alpha in [.001, .01, .1, 1., 10.]]
cv_results = cross_validate_general(
    train_model=train_lasso,
#     train_model=train_KNN,
    X=X, y=y,
    n_splits=10, params=params)

agg_cv_results = cv_results.groupby('alpha').val_r2.agg(['mean', 'std'])
agg_cv_results
/Users/kcarnold/anaconda3/envs/py36/lib/python3.6/site-packages/sklearn/linear_model/coordinate_descent.py:491: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.
  ConvergenceWarning)
Out[36]:
mean std
alpha
0.001 0.844636 0.031926
0.010 0.844776 0.031824
0.100 0.844775 0.031353
1.000 0.813970 0.026487
10.000 -0.001185 0.001649
In [37]:
best_alpha = agg_cv_results['mean'].idxmax()
estimated_r2 = agg_cv_results['mean'][best_alpha]
best_model = Lasso(alpha=best_alpha).fit(X, y)
test_predictions = best_model.predict(test[all_predictors].values)
test_r2 = metrics.r2_score(test.Fare_amount, test_predictions)
print(f"Best alpha was {best_alpha}, with estimated R^2 of {estimated_r2:.3f}.")
print(f"Training this on all data gives a test R^2 of {test_r2:.3f}")
Best alpha was 0.01, with estimated R^2 of 0.845.
Training this on the full training set gives a test R^2 of 0.803
In [33]:
# We could also use a meta-estimator from sklearn.
# But make sure you understand all of what's going on in the above first!
from sklearn.model_selection import GridSearchCV

gridsearch_model = GridSearchCV(
    estimator=Lasso(),
    param_grid=[{'alpha': [.1, 1., 10.]}],
    cv=KFold(n_splits=10),
    scoring='r2',
    return_train_score=False)
gridsearch_model.fit(X, y)
pd.DataFrame(gridsearch_model.cv_results_).set_index('param_alpha')
Out[33]:
mean_fit_time std_fit_time mean_score_time std_score_time params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score split5_test_score split6_test_score split7_test_score split8_test_score split9_test_score mean_test_score std_test_score rank_test_score
param_alpha
0.1 0.002758 0.000181 0.000329 0.000071 {'alpha': 0.1} 0.848971 0.852551 0.857965 0.864766 0.780191 0.863877 0.843819 0.856511 0.797919 0.881180 0.844777 0.029740 1
1.0 0.002498 0.000339 0.000279 0.000051 {'alpha': 1.0} 0.815950 0.819903 0.823129 0.838299 0.761552 0.825254 0.812359 0.825485 0.773266 0.844504 0.813972 0.025125 2
10.0 0.001751 0.000142 0.000266 0.000034 {'alpha': 10.0} -0.000373 -0.000410 -0.001153 -0.000108 -0.000075 -0.005123 -0.000034 -0.002568 -0.000001 -0.002008 -0.001185 0.001564 3
In [34]:
# With refit=True (the default), GridSearchCV also refits the best model on the full training set, so we get the same test R^2.
gridsearch_model.score(test[all_predictors], test.Fare_amount)
Out[34]:
0.803262478581998
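For Lasso specifically, scikit-learn also provides LassoCV (imported at the top of the notebook but not used above), which runs the cross-validated alpha search internally. A minimal sketch using the same X, y, and test split as the cells above:

# Sketch: LassoCV picks alpha by cross-validation, then refits on all of X, y.
lasso_cv = LassoCV(alphas=[.001, .01, .1, 1., 10.], cv=KFold(n_splits=10)).fit(X, y)
print("Chosen alpha:", lasso_cv.alpha_)
print("Test R^2:", metrics.r2_score(test.Fare_amount,
                                    lasso_cv.predict(test[all_predictors].values)))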

Homework Critique

  • Good visuals "don't make me think".
  • Good written analysis interprets (doesn't just describe) and connects the data with the real world.
  • Good code is easy to change without making mistakes, which lets you iterate faster.