Key Word(s): Regularization, Bootstrap, Confidence Intervals, Ridge, Polynomial Regression, Cross-Validation (CV), Lasso, Scikit-learn

Download Notebook

CS-109A Introduction to Data Science

Lab 5: Regularization and Cross-Validation¶

Harvard University
Fall 2018
Instructors: Pavlos Protopapas and Kevin Rader
Lab Instructor: Rahul Dave
Authors: Kevin Rader, Rahul Dave, David Sondak, Will Claybaugh, Pavlos Protopapas

Table of Contents¶

Learning Goals
Review of regularized regression
Ridge regression with one predictor on a grid
Ridge regression with polynomial features on a grid
Cross-validation --- Multiple Estimates
Cross-validation --- Finding the best regularization parameter

Learning Goals¶

In this lab, you will work with some noisy data. You will use simple linear and ridge regressions to fit linear, high-order polynomial features to the dataset. You will attempt to figure out what degree polynomial fits the dataset the best and ultimately use cross validation to determine the best polynomial order. Finally, you will automate the cross validation process using sklearn in order to determine the best regularization paramter for the ridge regression analysis on your dataset.

By the end of this lab, you should:

Really understand regularized regression principles.
Have a good grasp of working with ridge regression through the sklearn API
Understand the effects of the regularization (a.k.a penalization) parameter on fits from ridge regression
Understand the ideas behind cross-validation
- Why is it necessary?
- Why is it important?
- Basic implementation details.
Be able to use sklearn objects to automate the cross validation process.

This lab corresponds to lectures 4 and 5 and maps on to homework 4 (and beyond).

In [1]:

%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn.apionly as sns
sns.set_style("whitegrid")
sns.set_context("poster")

Part 1: Review of regularized regression¶

We briefly review the idea of regularization as introduced in lecture. Recall that in the ordinary least squares problem we find the regression coefficients $\boldsymbol{\beta}\in\mathbb{R}^{m}$ that minimize the loss function \begin{align*} L(\boldsymbol{\beta}) = \frac{1}{n} \sum_{i=1}^n \|y_i - \boldsymbol{\beta}^T \mathbf{x}_i\|^2. \end{align*} Recall that we have $n$ observations. Here $y_i$ is the response variable for observation $i$ and $\mathbf{x}_i\in\mathbb{R}^{m}$ is a vector from the predictor matrix corresponding to observation $i$.

The general idea behind regularization is to penalize the loss function to account for possibly very large values of the coefficients $\boldsymbol{\beta}$. Instead of minimizing $L(\boldsymbol{\beta})$, we minimize the regularized loss function \begin{align*} L_{\text{reg}}(\boldsymbol{\beta}) = L(\boldsymbol{\beta}) + \lambda R(\boldsymbol{\beta}) \end{align*} where $R(\boldsymbol{\beta})$ is a penalty function and $\lambda$ is a scalar that weighs the relative importance of this penalty. In this lab we will explore one regularized regression model: ridge regression. In ridge regression, the penalty function is the sum of the squares of the parameters, which is written as \begin{align*} L_{\text{ridge}}(\boldsymbol{\beta}) = \frac{1}{n} \sum_{i=1}^n \|y_i - \boldsymbol{\beta}^T \mathbf{x}_i\|^2 + \lambda \sum_{j=1}^m \beta_{j}^{2}. \end{align*}

In lecture, you also learned about LASSO regression in which the penalty function is the sum of the absolute values of the parameters. This is written as, \begin{align*} L_{\text{LASSO}}(\boldsymbol{\beta}) = \frac{1}{n} \sum_{i=1}^n \|y_i - \boldsymbol{\beta}^T \mathbf{x}_i\|^2 + \lambda \sum_{j=1}^m |\beta_j|. \end{align*}

In this lab, we will show how these optimization problems can be solved with sklearn to determine the model parameters $\boldsymbol{\beta}$. We will also show how to choose $\lambda$ appropriately via cross-validation.

Dataset¶

You will work with a synthetic dataset contained in data/noisypopulation.csv. The data were generated from a specific function $f\left(x\right)$ (the actual form will not be revealed to you in this lab). Noise was added to the function to generate synthetic, noisy observations via $y = f\left(x\right) + \epsilon$ where $\epsilon$ was drawn from a random distribution. The idea here is that in real life the data you are working with often comes with noise. Even if you could make observations at every single value of $x$, the true function may still be obscured. Of course, the samples you actually take are usually a subset of all the possible observations. In this lab, we will refer to observations at every single value of $x$ as the population and the subset of observations as in-sample y or simply the observations.

The dataset contains three columns:

f is the true function value
x is the predictor
y is the measured response.

In [2]:

df=pd.read_csv("data/noisypopulation.csv")
df.head()

Out[2]:

	f	x	y
0	0.047790	0.00	0.011307
1	0.051199	0.01	0.010000
2	0.054799	0.02	0.007237
3	0.058596	0.03	0.000056
4	0.062597	0.04	0.010000

In this lab, we will try out some regression methods to fit the data and see how well our model matches the true function f.

In [3]:

# Convert f, x, y to numpy array
f = df.f.values
x = df.x.values
y = df.y.values

df.shape

Out[3]:

(200, 3)

Let's take a quick look at the dataset. We will plot the true function value and the population.

In [4]:

fig, ax = plt.subplots(1,1, figsize=(10,6))

ax.plot(x, y, '.', alpha=0.8, label=r'Population')
ax.plot(x, f, lw=4, label='True Response')

ax.legend(loc='upper left')

ax.set_xlabel(r'$x$')
ax.set_ylabel(r'$y$')

fig.tight_layout()

It is often the case that you just can't make observations at every single value of $x$. We will simulate this situation by making a random choice of $60$ points from the full $200$ points. We do it by choosing the indices randomly and then using these indices as a way of getting the appropriate samples.

In [5]:

indexes=np.sort(np.random.choice(x.shape[0], size=60, replace=False)) # Using sort to make plotting easier later
indexes

Out[5]:

array([  1,   2,   6,   9,  11,  16,  18,  24,  30,  41,  42,  44,  47,
        49,  52,  53,  55,  59,  62,  65,  69,  76,  79,  85,  86,  95,
        96,  99, 102, 103, 106, 107, 109, 111, 113, 122, 123, 127, 135,
       139, 146, 149, 151, 152, 157, 160, 161, 162, 164, 165, 166, 168,
       169, 171, 176, 177, 178, 182, 185, 199])

Note: If you are not familiar with the numpy sort method or the numpy random.choice() method, then please take a moment to look them up in the numpy documentation.

Moving on, let's get the $60$ random samples from our dataset.

In [6]:

# Create a new dataframe from the random points
sample_df = pd.DataFrame(dict(x=x[indexes],f=f[indexes],y=y[indexes])) # New dataframe
sample_df.head()

Out[6]:

	f	x	y
0	0.051199	0.01	0.010000
1	0.054799	0.02	0.007237
2	0.071233	0.06	0.048360
3	0.085865	0.09	0.050510
4	0.096800	0.11	0.183821

Let's take one more look at our data to see which points we've selected.

In [7]:

fig, ax = plt.subplots(1,1, figsize=(10,6))
ax.plot(sample_df['x'], sample_df['y'], 's', alpha=0.4, ms=10, label="in-sample y (observed)")
ax.plot(x,y, '.', label=r'Population $y$')
ax.plot(x,f, lw=4, label='True Response')

ax.legend(loc='upper left')

ax.set_xlabel(r'$x$')
ax.set_ylabel(r'$y$')

fig.tight_layout()

Now we do our favorite thing and split the sample data into training and testing sets.

Note that here we are actually getting indices instead of the actual training and test set. This is okay and is another way of generating train-test splits.

In [8]:

from sklearn.model_selection import train_test_split

datasize=sample_df.shape[0]

#split dataset using the index, as we have x, f, and y that we want to split.
itrain, itest = train_test_split(np.arange(60), train_size=0.8)

xtrain = sample_df.x[itrain].values
ftrain = sample_df.f[itrain].values
ytrain = sample_df.y[itrain].values

xtest= sample_df.x[itest].values
ftest = sample_df.f[itest].values
ytest = sample_df.y[itest].values

Great! At this point we've explored our data a little bit, selected a sample of the dataset, and done a train-test split on the sample dataset.

Let's move on to the data analysis. We'll begin with ridge regression. In particular we'll do ridge regression on a single predictor and compare it with simple linear regression.

To start, let's fit the old classic, linear regression.

In [9]:

from sklearn.linear_model import LinearRegression

# fit the model to training data
simp_reg = LinearRegression().fit(xtrain.reshape(-1,1), ytrain)

# save the beta coefficients
beta0_sreg = simp_reg.intercept_
beta1_sreg = simp_reg.coef_[0]

print("(beta0, beta1) = ({0:8.6f}, {1:8.6f})".format(beta0_sreg, beta1_sreg))

(beta0, beta1) = (-0.011628, 1.083937)

Part 2: Bootstrapping¶

But wait! Unlike statsmodels, we don't get confidence intervals for the betas. Fortunately, we can bootstrap to build the confidence intervals

Exercise 1

In the code below, two key steps of bootstrapping are missing. Fill in the code to draw sample indices with replacement and to fit the model to the bootstrap sample. You'll need numpy's np.random.choice. Here's the function documentation in case you need it.
Visualize the results, and use numpy's np.percentile: function documentation.

In [10]:

N = 1000
bootstrap_beta1s = np.zeros(N)
for cur_bootstrap_rep in range(N):
    # select indices that are in the resample (easiest way to be sure we grab y values that match the x values)
    inds_to_sample = np.random.choice(xtrain.shape[0], size=xtrain.shape[0], replace=True)
    
    # take the sample
    x_train_resample = xtrain[inds_to_sample]
    y_train_resample = ytrain[inds_to_sample]
    
    # fit the model
    bootstrap_model = LinearRegression().fit(x_train_resample.reshape(-1,1), y_train_resample)
    
    # extract the beta1 and append
    bootstrap_beta1s[cur_bootstrap_rep] = bootstrap_model.coef_[0]

## display the results

# calculate 5th and 95th percentiles
lower_limit, upper_limit = np.percentile(bootstrap_beta1s,[5,95])

# plot histogram and bounds
fig, ax = plt.subplots(1,1, figsize=(20,10))
ax.hist(bootstrap_beta1s, 20, alpha=0.6, label=r"Bootstrapped $\beta_{1}$ values")
ax.axvline(lower_limit, color='red', label=r"$5$th Percentile ({:.2f})".format(lower_limit))
ax.axvline(upper_limit, color='black', label=r"$95$th Percentile ({:.2f})".format(upper_limit))

# good plots have labels
ax.set_xlabel(r"$\beta_{1}$ Values")
ax.set_ylabel("Count (out of 1000 Bootstrap Replications)")
plt.title(r"Bootstrapped Values of $\beta_{1}$")
plt.legend();

From the above, we find that the bootstrap $90\%$ confidence interval is well away from $0$. We can confidently say that $\beta_{1}$ is not secretly $0$ and we're being fooled by randomness.

Next we'll dive into ridge regression!

Part 3: Ridge regression for Simple Linear Regression¶

To begin, we'll use sklearn to do simple linear regression on the sampled training data. We'll then do ridge regression with the same data, setting the penalty parameter $\lambda$ to zero. Setting $\lambda = 0$ reduces the ridge problem to the simple ordinary least squares problem, so we expect the results of these models to be identical.

We will store the regression coefficients in a dataframe for easy comparison. The cell below provides some code to set up the dataframe ahead of time. Notice that we don't know the actual values in the pandas series, so we just set them to NaN. We will overwrite these later.

In [36]:

regression_coeffs = dict() # Store regression coefficients from each model in a dictionary

regression_coeffs['OLS'] = [np.nan]*2 # Initialize to NaN
regression_coeffs[r'Ridge $\lambda = 0$'] = [np.nan]*2

dfResults = pd.DataFrame(regression_coeffs) # Create dataframe

dfResults.rename({0: r'$\beta_{0}$', 1: r'$\beta_{1}$'}, inplace=True) # Rename rows
dfResults

Out[36]:

	OLS	Ridge $\lambda = 0$
$\beta_{0}$	NaN	NaN
$\beta_{1}$	NaN	NaN

We start with simple linear regression to get the ball rolling.

In [52]:

simp_reg = LinearRegression() # build the the ordinary least squares model

simp_reg.fit(xtrain.reshape(-1,1), ytrain) # fit the model to training data

# save the beta coefficients
beta0_sreg = simp_reg.intercept_
beta1_sreg = simp_reg.coef_[0]

dfResults['OLS'][:] = [beta0_sreg, beta1_sreg]
dfResults

Out[52]:

	OLS	Ridge $\lambda = 0$
$\beta_{0}$	-0.011628	-0.011628
$\beta_{1}$	1.083937	1.083937

In [45]:

#y_predict = lambda x : beta0_sreg + beta1_sreg*x # make predictions
ypredict_ols = simp_reg.predict(x.reshape(-1,1))
ypredict_ols.shape

Out[45]:

(200,)

We will use the above $\boldsymbol\beta$ coefficients as a benchmark for comparision to the ridge method. The same coefficients can be obtained with ridge regression, which we demonstrate now.

For reference, here is the ridge regression documentation: sklearn.linear_model.Ridge.

In [14]:

from sklearn.linear_model import Ridge

The snippet of code below implements the ridge regression with $\lambda = 0$.

Note: The weight $\lambda$ is referred to as alpha in the documentation.

Remark: $\lambda$ goes by many names including, but not limited to: regularization parameter, penalization parameter, shrinking parameter, and weight. Regardless of these names, it is a hyperparameter. That is, you set it before you begin the training process. An algorithm can be very sensitive to its hyperparameters and we will discuss how a method for selecting the "correct" hyperparameter values later in this lab.

In [51]:

ridge_reg = Ridge(alpha = 0) # build the ridge regression model with specified lambda, i.e. alpha

ridge_reg.fit(xtrain.reshape(-1,1), ytrain) # fit the model to training data

# save the beta coefficients
beta0_ridge = ridge_reg.intercept_
beta1_ridge = ridge_reg.coef_[0]

ypredict_ridge = ridge_reg.predict(x.reshape(-1,1)) # make predictions everywhere

dfResults[r'Ridge $\lambda = 0$'][:] = [beta0_ridge, beta1_ridge]
dfResults

Out[51]:

	OLS	Ridge $\lambda = 0$
$\beta_{0}$	NaN	-0.011628
$\beta_{1}$	NaN	1.083937

The beta coefficients for linear and ridge regressions coincide for $\lambda = 0$, as expected. We plot the data and fits.

In [47]:

fig, ax = plt.subplots(1,1, figsize=(10,6))

ax.plot(xtrain, ytrain, 's', alpha=0.3, ms=10, label="in-sample y (observed)") # plot in-sample training data
ax.plot(x, y, '.', alpha=0.4, label="population y") # plot population data
ax.plot(x, f, ls='-',  alpha=0.4, lw=4, label="True function")
ax.plot(x, ypredict_ols, ls='--', lw=4, label="OLS") # plot simple linear regression fit
ax.plot(x, ypredict_ridge, ls='-.', lw = 4, label="Ridge") # plot ridge regression fit

ax.set_xlabel('$x$')
ax.set_ylabel('$y$')

ax.legend(loc=4);

fig.tight_layout()

Exercise 2

Explore the effect of $\lambda$ on ridge regression.

Make a plot with of the ridge regression predictions with $\lambda = 0, 5, 10, 100$. Be sure to include a legend.

What happens for very large $\lambda$ (e.g. $\lambda \to \infty$)?

Your plot should look something like the following plot (doesn't have to be exact):

In [17]:

# Your code here
fig, ax = plt.subplots(1,1, figsize=(20,10))

pen_params = [0, 1, 5, 10, 100]

ax.plot(x, f, ls='-', lw=6, alpha=0.5, label="True function")
ax.plot(xtrain, ytrain, 's', alpha=0.5, ms=10, label="in-sample y (observed)") # plot in-sample training data

for alpha in pen_params:
    ridge_reg = Ridge(alpha = alpha) # build the ridge regression model with specified lambda, i.e. alpha
    ridge_reg.fit(xtrain.reshape(-1,1), ytrain) # fit the model to training data
    ypredict_ridge = ridge_reg.predict(x.reshape(-1,1))
    ax.plot(x, ypredict_ridge, ls='-.', lw = 4, label=r"$\lambda = {}$".format(alpha)) # plot ridge regression fit

ax.set_xlabel('$x$')
ax.set_ylabel('$y$')

ax.legend(loc=4, fontsize=24);

fig.tight_layout()

fig.savefig('ridge_lambda.png')

Part 3 Recap¶

That was nice, but we were just doing simple linear regression. We really want to do more interesting regression problems like multilinear regression. We will do so in the next section.

Part 4: Ridge regression with polynomial features on a grid¶

Now we'll make a more complex model by adding polynomial features. Instead of building the linear model $y = \beta_0 + \beta_1 x$, we build a polynomial model $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \ldots \beta_d x^d$ for some $d$ to be determined. This regression will be linear though, since we'll be treating $x^2, \ldots, x^d$ themselves as predictors in the linear model.

The design matrix $\mathbf{X}$ contains columns corresponding to $1, x, x^2, \ldots, x^d$. To build it, we use sklearn. (The particular design matrix is also known as the Vandermonde matrix). For example, if we have three observations

\begin{align*} \left\{\left(x_{1}, y_{1}\right), \left(x_{2}, y_{2}\right), \left(x_{3}, y_{3}\right)\right\} \end{align*}
and we want polynomial features up to and including degree $4$, the design matrix looks like

\begin{align*} X = \begin{bmatrix} x_1^0 & x_1^1 & x_1^2 & x_1^3 & x_1^4\\ x_2^0 & x_2^1 & x_2^2 & x_2^3 & x_2^4\\ x_3^0 & x_3^1 & x_3^2 & x_3^3 & x_3^4\\ \end{bmatrix} = \begin{bmatrix} 1& x_1^1 & x_1^2 & x_1^3 & x_1^4\\ 1 & x_2^1 & x_2^2 & x_2^3 & x_2^4\\ 1 & x_3^1 & x_3^2 & x_3^3 & x_3^4\\ \end{bmatrix}. \end{align*}

Exercise 3

Make a toy vector called toy, where
\begin{align*} \mathrm{toy} = \begin{bmatrix} 0 \\ 2 \\ 5 \\ \end{bmatrix}. \end{align*}
Build the feature matrix up to (and including) degree $4$. Confirm that the entries in the matrix are what you'd expect based on the above discussion.

Note: You may use sklearn to build the matrix using PolynomialFeatures().

In [18]:

from sklearn.preprocessing import PolynomialFeatures

In [19]:

# your code here
toy = np.array([0, 2, 5])
PolynomialFeatures(4).fit_transform(toy.reshape(-1,1))

Out[19]:

array([[   1.,    0.,    0.,    0.,    0.],
       [   1.,    2.,    4.,    8.,   16.],
       [   1.,    5.,   25.,  125.,  625.]])

We now continue working with our data. We write a function to make polynomial features of given degrees and we store the features in a dictionary.

Exercise 4

The code provided below is missing a few lines and it's missing many comments. Do the following:

Comment every line of the code
- Normally, you won't do such excessive commenting. In this case, we want to make sure you understand every single line since you didn't actually write this code.
Fill in the missing lines
- Create a ridge regression object at each $\lambda$ value in the list
- Perform the ridge regression using the fit method from the newly created ridge regression object
- Make a prediction on the grid and store the results in ypredict_ridge.

Note: We're not giving you an example figure here since we gave you most of the code.

Warning! Make sure you understand the entire code! There are many nice things in there.

In [48]:

d = 20 # Maximum polynomial degree
# You will create a grid of plots of this size (7 x 2)
rows = 7
cols = 2
lambdas = [0., 1e-6, 1e-3, 1e-2, 1e-1, 1, 10] # Various penalization parameters to try
grid_to_predict = np.arange(0, 1, .01) # Predictions will be made on this grid

# Create training set and test set
Xtrain = PolynomialFeatures(d).fit_transform(xtrain.reshape(-1,1))
test_set = PolynomialFeatures(d).fit_transform(grid_to_predict.reshape(-1,1))

fig, axs = plt.subplots(rows, cols, sharex='col', figsize=(12, 24)) # Set up plotting objects

for i, lam in enumerate(lambdas):
    # your code here
    ridge_reg = Ridge(alpha = lam) # Create regression object
    ridge_reg.fit(Xtrain, ytrain) # Fit on regression object
    ypredict_ridge = ridge_reg.predict(test_set) # Do a prediction on the test set
    
    ### Provided code
    axs[i,0].plot(xtrain, ytrain, 's', alpha=0.4, ms=10, label="in-sample y") # Plot sample observations
    axs[i,0].plot(grid_to_predict, ypredict_ridge, 'k-', label=r"$\lambda =  {0}$".format(lam)) # Ridge regression prediction
    axs[i,0].set_ylabel('$y$') # y axis label
    axs[i,0].set_ylim((0, 1)) # y axis limits
    axs[i,0].set_xlim((0, 1)) # x axis limits
    axs[i,0].legend(loc='best') # legend
    
    coef = ridge_reg.coef_.ravel() # Unpack the coefficients from the regression
    
    axs[i,1].semilogy(np.abs(coef), ls=' ', marker='o', label=r"$\lambda =  {0}$".format(lam)) # plot coefficients
    axs[i,1].set_ylim((1e-04, 1e+15)) # Set y axis limits
    axs[i,1].set_xlim(1, 20) # Set y axis limits
    axs[i,1].yaxis.set_label_position("right") # Move y-axis label to right
    axs[i,1].set_ylabel(r'$\left|\beta_{j}\right|$') # Label y-axis
    axs[i,1].legend(loc='best') # Legend

# Label x axes
axs[-1, 0].set_xlabel("x")
axs[-1, 1].set_xlabel(r"$j$");

As you can see, as we increase $\lambda$ from 0 to 1, we start out overfitting, then doing well, and then our fits develop a mind of their own irrespective of data, as the penalty term dominates.

Exercise 5

What would you expect if you compared a performance metric between these models on a grid? What performance metric should you use?

YOUR DISCUSSION HERE

Part 4 Recap¶

We did a ridge regression on our dataset where the features were polynomial features. We also assessed the impact of the regularization parameter on the solution.

Part 5: Cross-validation --- Finding the best penalization parameter¶

Let's use cross-validation to determine the critical value of $\lambda$, which we'll refer to as $\lambda^*$. To do this we use the concept of a meta-estimator from scikit-learn.

Model selection is supported by two distinct meta-estimators:

GridSearchCV
RandomizedSearchCV The input to these meta-estimators is an estimator, which has some hyperparameters (e.g. $\lambda$) that need to be optimized, and a set of hyperparameter settings to search through.

The concept of a meta-estimator allows us to wrap, for example, cross-validation, or methods that build and combine simpler models or schemes. For example:

est = Ridge()
    parameters = {"alpha": [1e-8, 1e-6, 1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 1e-2, 1e-1, 1.0]}
    gridclassifier = GridSearchCV(est, param_grid=parameters, cv=4, scoring="neg_mean_squared_error")

The GridSearchCV replaces the manual iteration over the folds using KFolds and the averaging we just did, doing it all for us. It takes a hyperparameter grid in the shape of a dictionary as input, and sets $\lambda$ to the values you want to try, one by one. It then trains the model using cross-validation, and gets the error for each value of the hyperparameter $\lambda$. Finally it compares the errors for the different $\lambda$'s, and picks the best choice model.

Here is a helper function that we will use to get the best Ridge regression.

In [21]:

from sklearn.model_selection import GridSearchCV
def cv_optimize_ridge(x: np.ndarray, y: np.ndarray, list_of_lambdas: list, n_folds: int =4):
    est = Ridge()
    parameters = {'alpha': list_of_lambdas}
    # the scoring parameter below is the default one in ridge, but you can use a different one
    # in the cross-validation phase if you want.
    gs = GridSearchCV(est, param_grid=parameters, cv=n_folds, scoring="neg_mean_squared_error")
    gs.fit(x, y)
    return gs

Exercise 6

Use the function above to fit the model on the training set with $4$-fold cross validation. Save the fit as the variable fitmodel.

In [22]:

lambs = [1e-8, 1e-6, 1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0]
# your code here
fitmodel = cv_optimize_ridge(Xtrain, ytrain, lambs, n_folds=4)

In [23]:

print(fitmodel.best_estimator_, "\n")
print(fitmodel.best_params_, "\n")
print(fitmodel.best_score_, "\n")

Ridge(alpha=0.01, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001) 

{'alpha': 0.01} 

-0.00428219133737

We also output the mean cross-validation error at different $\lambda$ (with a negative sign, as scikit-learn likes to maximize negative error which is equivalent to minimizing error).

In [24]:

fitmodel.cv_results_

Out[24]:

{'mean_fit_time': array([ 0.00076169,  0.0005492 ,  0.00093275,  0.00092494,  0.00065297,
         0.00775427,  0.00097436,  0.0007785 ,  0.000678  ,  0.00085598,
         0.00073111]),
 'mean_score_time': array([ 0.00022036,  0.00016773,  0.00028032,  0.00021076,  0.00020701,
         0.00030631,  0.00032443,  0.00027955,  0.00023168,  0.00035763,
         0.0002166 ]),
 'mean_test_score': array([ -1.04422066e+02,  -2.03879402e+00,  -2.12533115e-02,
         -2.17426118e-02,  -3.83217458e-02,  -5.34829454e-02,
         -4.16644096e-02,  -4.28219134e-03,  -1.10864000e-02,
         -1.17298708e-02,  -3.53276125e-02]),
 'mean_train_score': array([-0.002438  , -0.00281841, -0.00288214, -0.00292148, -0.00295017,
        -0.00303469, -0.00307988, -0.00331017, -0.00390718, -0.00959672,
        -0.03183083]),
 'param_alpha': masked_array(data = [1e-08 1e-06 1e-05 5e-05 0.0001 0.0005 0.001 0.01 0.1 1.0 10.0],
              mask = [False False False False False False False False False False False],
        fill_value = ?),
 'params': ({'alpha': 1e-08},
  {'alpha': 1e-06},
  {'alpha': 1e-05},
  {'alpha': 5e-05},
  {'alpha': 0.0001},
  {'alpha': 0.0005},
  {'alpha': 0.001},
  {'alpha': 0.01},
  {'alpha': 0.1},
  {'alpha': 1.0},
  {'alpha': 10.0}),
 'rank_test_score': array([11, 10,  4,  5,  7,  9,  8,  1,  2,  3,  6], dtype=int32),
 'split0_test_score': array([-0.00210461, -0.00153636, -0.00135964, -0.00133985, -0.00135094,
        -0.00136229, -0.00137233, -0.0015771 , -0.00260748, -0.0128935 ,
        -0.0498873 ]),
 'split0_train_score': array([-0.00318839, -0.00369436, -0.00372455, -0.0037415 , -0.00374972,
        -0.00379193, -0.00383477, -0.00403662, -0.00462982, -0.01021204,
        -0.02983906]),
 'split1_test_score': array([-0.00610722, -0.00633563, -0.00549197, -0.00473789, -0.00448339,
        -0.00432257, -0.00439472, -0.0048511 , -0.00712679, -0.01946326,
        -0.05587429]),
 'split1_train_score': array([-0.00263924, -0.00279383, -0.00284313, -0.00291326, -0.00296615,
        -0.00308399, -0.00311842, -0.00320436, -0.00373486, -0.00851105,
        -0.02806184]),
 'split2_test_score': array([-0.00823306, -0.00862512, -0.00842519, -0.00802683, -0.00779534,
        -0.00737807, -0.00726763, -0.00702939, -0.00706486, -0.0094168 ,
        -0.0235004 ]),
 'split2_train_score': array([-0.00136817, -0.001625  , -0.00164389, -0.00168626, -0.00173432,
        -0.00189061, -0.0019627 , -0.00224233, -0.00290861, -0.00891591,
        -0.03392184]),
 'split3_test_score': array([ -4.17671818e+02,  -8.13867896e+00,  -6.97364430e-02,
         -7.28658796e-02,  -1.39657311e-01,  -2.00868856e-01,
         -1.53622963e-01,  -3.67117515e-03,  -2.75464611e-02,
         -5.14591715e-03,  -1.20484595e-02]),
 'split3_train_score': array([-0.00255619, -0.00316047, -0.003317  , -0.00334489, -0.0033505 ,
        -0.00337222, -0.00340363, -0.00375736, -0.00435544, -0.01074789,
        -0.0355006 ]),
 'std_fit_time': array([  1.41258368e-04,   8.99215039e-05,   2.09890586e-04,
          1.97631494e-04,   7.88685227e-05,   6.55120547e-03,
          8.90124478e-05,   1.66172394e-04,   4.72181285e-05,
          1.70332795e-04,   6.98077019e-05]),
 'std_score_time': array([  4.28389077e-05,   4.75195203e-06,   4.87682202e-05,
          1.77032819e-05,   7.94934703e-06,   8.44413135e-05,
          3.23465461e-05,   1.17804536e-04,   2.55688642e-05,
          2.12778731e-04,   6.24114996e-06]),
 'std_test_score': array([  1.80854829e+02,   3.52177114e+00,   2.81040598e-02,
          2.96105746e-02,   5.85504759e-02,   8.51198741e-02,
          6.46729039e-02,   1.97240858e-03,   9.67828661e-03,
          5.24066122e-03,   1.81380300e-02]),
 'std_train_score': array([ 0.00066372,  0.00075979,  0.00077999,  0.00077096,  0.00075467,
         0.00070687,  0.00069359,  0.00068542,  0.00066144,  0.0009146 ,
         0.0030005 ])}

In [25]:

fit_lambdas = [d['alpha'] for d in fitmodel.cv_results_['params']]
fit_scores = fitmodel.cv_results_['mean_test_score']

Now we make a log-log plot of -fit_scores versus fit_lambdas.

In [26]:

fig, ax = plt.subplots(1,1, figsize=(10,6))
ax.plot(fit_lambdas, -fit_scores, ls='-', marker='o')
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('$\lambda$')
ax.set_ylabel('scores');

SK-learn's `cross_val_score`: Easier Cross Validation¶

GridSearchCV is an important tool when you are searching over many hyperparameters (and believe us, you will be), but when you only need to get CV scores for a particular model, some students find cross_val_score more intuitive.

In [27]:

from sklearn.model_selection import cross_val_score

lr_object =  Ridge(alpha=0)
cross_val_score(lr_object, Xtrain, ytrain, cv=5)

Out[27]:

array([ -9.18613926e+06,  -2.48009195e+01,  -7.33851988e+01,
         7.96497457e-01,  -2.43993256e+09])

We can loop over particular models and get scores for each (equivalent to GridSearchCV over the given parameter settings).

In [28]:

for cur_alpha in [1e-8, 1e-4, 1e-2, 1.0, 10.0]:
    lr_object =  Ridge(alpha=cur_alpha)
    scores = cross_val_score(lr_object, Xtrain, ytrain, cv=5)
    print("lambda {0}\t R^2 scores: {1}\t Mean R^2: {2}".format(cur_alpha,scores,np.mean(scores)))

lambda 1e-08	 R^2 scores: [  9.77939901e-01   9.32466926e-01   9.25087958e-01   7.70596878e-01
  -1.01789793e+04]	 Mean R^2: -2035.0746438893307
lambda 0.0001	 R^2 scores: [ 0.9856221   0.94641854  0.93300439  0.81490007 -3.77073188]	 Mean R^2: -0.018157356572625627
lambda 0.01	 R^2 scores: [ 0.98186179  0.94562163  0.93554936  0.82937296  0.87111868]	 Mean R^2: 0.9127048848528474
lambda 1.0	 R^2 scores: [ 0.85221091  0.82714737  0.80963734  0.88260439  0.89951329]	 Mean R^2: 0.8542226608760588
lambda 10.0	 R^2 scores: [ 0.43557116  0.504026    0.48162533  0.64633357  0.7117304 ]	 Mean R^2: 0.5558572925115726

Built-in Cross Validation: `RidgeCV` and `LassoCV`¶

Some sklearn models have built-in, automated cross validation to tune their hyper parameters.

In [29]:

from sklearn.linear_model import RidgeCV

ridgeCV_object = RidgeCV(alphas=(1e-8, 1e-4, 1e-2, 1.0, 10.0), cv=5)
ridgeCV_object.fit(Xtrain, ytrain)
print("Best model searched:\nalpha = {}\nintercept = {}\nbetas = {}, ".format(ridgeCV_object.alpha_,
                                                                            ridgeCV_object.intercept_,
                                                                            ridgeCV_object.coef_
                                                                            )
     )

Best model searched:
alpha = 0.01
intercept = -0.004754717116992713
betas = [ 0.          0.69341921  0.7422217   0.27075073 -0.09947963 -0.28256995
 -0.32780739 -0.29382772 -0.22355048 -0.14377357 -0.06935955 -0.00731663
  0.04012127  0.07338884  0.09423722  0.10491372  0.10769605  0.10465274
  0.09754117  0.0877851   0.07649451],

Important note:¶

For any tool more automated than literally using k_fold, just setting cv=5 will NOT shuffle your data by default. This can be a problem with time-series data!
To force shuffling, explicitly pass a KFold object (with shuffling turned on) to the cv argument
You may prefer a strategy where you shuffle the rows of your data at the outset of analysis

In [30]:

# declare and pass a KFold object to properly shuffle the training data, and/or set the random state
from sklearn.model_selection import KFold
splitter = KFold(5, random_state=42, shuffle=True)

ridgeCV_object = RidgeCV(alphas=(1e-8, 1e-4, 1e-2, 1.0, 10.0), cv=splitter)
ridgeCV_object.fit(Xtrain, ytrain)
print("Best model searched:\nalpha = {}\nintercept = {}\nbetas = {}, ".format(ridgeCV_object.alpha_,
                                                                            ridgeCV_object.intercept_,
                                                                            ridgeCV_object.coef_
                                                                            )
     )

Best model searched:
alpha = 0.01
intercept = -0.004754717116992713
betas = [ 0.          0.69341921  0.7422217   0.27075073 -0.09947963 -0.28256995
 -0.32780739 -0.29382772 -0.22355048 -0.14377357 -0.06935955 -0.00731663
  0.04012127  0.07338884  0.09423722  0.10491372  0.10769605  0.10465274
  0.09754117  0.0877851   0.07649451],

Part 5b: Refitting on full training set¶

At this point, we have determined the best penalization parameter for the ridge regression on our current dataset using cross validation. Let's refit the estimator on the training set and calculate and plot the test set error and the polynomial coefficients. Notice how many of these coefficients have been pushed to lower values or 0.

Exercise 7

Assign to variable est the classifier obtained by fitting the entire training set using the best $\lambda$ found above. Assign the predictions to the variable ypredict_ridge_best.

In [31]:

# your code here
best_lambda = fitmodel.best_params_['alpha']
est = Ridge(alpha=best_lambda).fit(Xtrain,ytrain)
ypredict_ridge_best = est.predict(test_set)
est.coef_

Out[31]:

array([ 0.        ,  0.69341921,  0.7422217 ,  0.27075073, -0.09947963,
       -0.28256995, -0.32780739, -0.29382772, -0.22355048, -0.14377357,
       -0.06935955, -0.00731663,  0.04012127,  0.07338884,  0.09423722,
        0.10491372,  0.10769605,  0.10465274,  0.09754117,  0.0877851 ,
        0.07649451])

In [32]:

# code provided from here on
fig, axs = plt.subplots(1, 2, figsize=(12, 4))
left = 0
right = 1

axs[left].plot(x,f, lw=4, label='True Response')
axs[left].plot(xtrain, ytrain, 's', alpha=0.3, ms=10, label="in-sample y (observed)")
axs[left].plot(x, y, '.', alpha=0.8, label="population y")
axs[left].plot(grid_to_predict, ypredict_ridge_best, 'k--', label=r"$\lambda =  {{{0:1.4f}}}$".format(best_lambda))
axs[left].set_ylabel('$y$')
axs[left].set_ylim((0, 1))
axs[left].set_xlim((0, 1))
axs[left].legend(loc=2)
coef = est.coef_.ravel()
axs[right].semilogy(np.abs(coef), marker='o', label=r"$\lambda =  {0}$".format(best_lambda))
axs[right].set_ylim((1e-04, 1.0e+11))
axs[right].set_xlim(1, 20)
axs[right].yaxis.set_label_position("right")
axs[right].set_ylabel(r'$\left|\beta_{j}\right|$')
axs[right].legend(loc='best')
axs[left].set_xlabel("x")
axs[right].set_xlabel(r'$j$');

In [49]:

#One more nice plot:
from sklearn.linear_model import Lasso 

ridge_coef = []
lasso_coef = []
for lamb in lambs:
    ridge_coef.append(Ridge(alpha=lamb).fit(Xtrain,ytrain).coef_)
    lasso_coef.append(Lasso(alpha=lamb).fit(Xtrain,ytrain).coef_)
ridge_coef[0:2]

/Users/krader/anaconda/lib/python3.6/site-packages/sklearn/linear_model/coordinate_descent.py:484: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.
  ConvergenceWarning)

Out[49]:

[array([  0.        ,   1.01959051,  -5.53929429,  29.44150715,
        -53.93355665,  20.47365746,  41.04441284,   1.75859441,
        -34.3729342 , -37.24889075, -13.5862036 ,  16.23323089,
         35.26279047,  36.26794064,  21.08634855,  -2.72753881,
        -25.44040864, -37.78836324, -32.52401814,  -5.0392544 ,  46.64706122]),
 array([ 0.        ,  0.69137493, -0.83581145,  4.9179027 , -4.15650457,
         1.50908912,  2.03157442, -2.29221713, -5.17735279, -4.15884132,
        -0.49733387,  3.3806434 ,  5.63291862,  5.57230418,  3.52299763,
         0.41783499, -2.60029392, -4.46734047, -4.36167693, -1.76571866,
         3.54100409])]

In [50]:

fig, axs = plt.subplots(1, 2, figsize=(16, 6))
left = 0
right = 1
axs[left].plot(lambs,ridge_coef)
axs[left].set_xscale("log")
axs[left].set_title("Ridge polynomial coefficients as a function of lambda")
#axs[left].legend(loc='best')
axs[right].plot(lambs,lasso_coef)
axs[right].set_xscale("log")
axs[right].set_title("Lasso polynomial coefficients as a function of lambda")

Out[50]:

There are Kevin's favorite plots depicting how the $\hat{\beta}$s are affected with an increasing penalty factor. They need better labels, but there are too many to be labeled nicely. Presumably the orange line in the Lasso plot is for the linear term (it becomes important as the other terms "drop out").