Key Word(s): Regularization, Bootstrap, Confidence Intervals, Ridge, Polynomial Regression, Cross-Validation (CV), Lasso, Scikit-learn
CS-109A Introduction to Data Science
Lab 5: Regularization and Cross-Validation¶
Harvard University
Fall 2018
Instructors: Pavlos Protopapas and Kevin Rader
Lab Instructor: Kevin Rader (today at least)
Authors: Kevin Rader, Rahul Dave, David Sondak, Will Claybaugh, Pavlos Protopapas
Table of Contents¶
- Learning Goals
- Review of regularized regression
- Bootstrapped Sampling Distributions and Confidence Intervals
- Ridge regression for Simple Regression
- Ridge regression with polynomial features on a grid
- Cross-validation --- Multiple Estimates
- Cross-validation --- Finding the best regularization parameter
Learning Goals¶
In this lab, you will work with some noisy data. You will use simple linear and ridge regression to fit linear and high-order polynomial features to the dataset. You will attempt to figure out which degree polynomial fits the dataset best, and ultimately use cross-validation to determine the best polynomial order. Finally, you will automate the cross-validation process using sklearn in order to determine the best regularization parameter for the ridge regression analysis on your dataset.
By the end of this lab, you should:
- Really understand regularized regression principles.
- Have a good grasp of working with ridge regression through the sklearn API.
- Understand the effects of the regularization (a.k.a. penalization or tuning) parameter on fits from ridge regression.
- Understand the ideas behind cross-validation:
  - Why is it necessary?
  - Why is it important?
  - Basic implementation details.
- Be able to use sklearn objects to automate the cross-validation process.
This lab corresponds to lectures 5, 6, and 7 and maps to homework 4 (and beyond).
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns  # seaborn.apionly is deprecated in recent seaborn versions; plain seaborn works here
sns.set_style("whitegrid")
sns.set_context("poster")
Part 1: Review of regularized regression¶
We briefly review the idea of regularization as introduced in lecture. Recall that in the ordinary least squares problem we find the regression coefficients $\boldsymbol{\beta}\in\mathbb{R}^{m}$ that minimize the loss function \begin{align*} L(\boldsymbol{\beta}) = \frac{1}{n} \sum_{i=1}^n \|y_i - \boldsymbol{\beta}^T \mathbf{x}_i\|^2. \end{align*} Recall that we have $n$ observations. Here $y_i$ is the response variable for observation $i$ and $\mathbf{x}_i\in\mathbb{R}^{m}$ is a vector from the predictor matrix corresponding to observation $i$.
The general idea behind regularization is to penalize the loss function to account for possibly very large values of the coefficients $\boldsymbol{\beta}$. Instead of minimizing $L(\boldsymbol{\beta})$, we minimize the regularized loss function
\begin{align*}
L_{\text{reg}}(\boldsymbol{\beta}) = L(\boldsymbol{\beta}) + \lambda R(\boldsymbol{\beta})
\end{align*}
where $R(\boldsymbol{\beta})$ is a penalty function and $\lambda$ is a scalar that weighs the relative importance of this penalty. In this lab we will explore one regularized regression model: ridge
regression. In ridge regression, the penalty function is the sum of the squares of the parameters, which is written as
\begin{align*}
L_{\text{ridge}}(\boldsymbol{\beta}) = \frac{1}{n} \sum_{i=1}^n \|y_i - \boldsymbol{\beta}^T \mathbf{x}_i\|^2 + \lambda \sum_{j=1}^m \beta_{j}^{2}.
\end{align*}
In lecture, you also learned about LASSO
regression in which the penalty function is the sum of the absolute values of the parameters. This is written as,
\begin{align*}
L_{\text{LASSO}}(\boldsymbol{\beta}) = \frac{1}{n} \sum_{i=1}^n \|y_i - \boldsymbol{\beta}^T \mathbf{x}_i\|^2 + \lambda \sum_{j=1}^m |\beta_j|.
\end{align*}
In this lab, we will show how these optimization problems can be solved with sklearn
to determine the model parameters $\boldsymbol{\beta}$. We will also show how to choose $\lambda$ appropriately via cross-validation.
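To make the two penalty terms concrete, here is a minimal numpy sketch (our own illustration, not part of the lab's code) that evaluates both regularized losses for a candidate coefficient vector beta, using the np imported above. The sums above run over $j = 1, \ldots, m$, so in practice the intercept is usually left unpenalized; for simplicity this sketch penalizes every entry of beta.
def ridge_loss(beta, X, y, lam):
    # (1/n) * sum of squared residuals + lambda * sum of squared coefficients
    n = len(y)
    return np.sum((y - X @ beta) ** 2) / n + lam * np.sum(beta ** 2)

def lasso_loss(beta, X, y, lam):
    # (1/n) * sum of squared residuals + lambda * sum of absolute coefficients
    n = len(y)
    return np.sum((y - X @ beta) ** 2) / n + lam * np.sum(np.abs(beta))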
Dataset¶
We will work with a synthetic dataset contained in data/noisypopulation.csv
. The data were generated from a specific function $f\left(x\right)$ (the actual mathematical form will remain a mystery). Noise was added to the function to generate synthetic, noisy (aka, real) observations via $y = f\left(x\right) + \epsilon$ where $\epsilon$ was drawn from a random distribution. Even if you could make observations at every single value of $x$, the true function may still be obscured. Of course, the data you actually observe are a subset of all the possible observations in the population. In this lab, we will refer to observations at every single value of $x$ as the population and the subset of observations as in-sample y or simply the observations.
The dataset contains three columns:
- f is the true function value
- x is the predictor
- y is the measured response.
df=pd.read_csv("data/noisypopulation.csv")
df.head()
In this lab, we will try out some regression methods to fit the data and see how well our model matches the true function f.
# Convert f, x, y to numpy array
f = df.f.values
x = df.x.values
y = df.y.values
df.shape
Let's take a quick look at the dataset. We will plot the true function value and the population.
fig, ax = plt.subplots(1,1, figsize=(10,6))
ax.plot(x, y, '.', alpha=0.8, label=r'Population')
ax.plot(x, f, lw=4, label='True Response')
ax.legend(loc='upper left')
ax.set_xlabel(r'$x$')
ax.set_ylabel(r'$y$')
fig.tight_layout()
It is often the case that you just can't make observations at every single value of $x$. We will simulate this situation by making a random choice of $60$ points from the full $200$ points. We do it by choosing the indices randomly and then using these indices as a way of getting the appropriate samples.
#np.random.seed(12345)
indexes=np.sort(np.random.choice(x.shape[0], size=60, replace=False)) # Note: using sort to make plotting easier later
indexes
Note: If you are not familiar with the numpy sort method or the numpy random.choice() method, please take a moment to look them up in the numpy documentation.
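For a quick illustration (a toy example, separate from the lab data): the call below draws 4 distinct integers from 0 to 9 and returns them in increasing order. The exact values will vary from run to run.
np.sort(np.random.choice(10, size=4, replace=False))  # e.g. array([1, 4, 6, 9]); 4 distinct indices from 0..9, sorted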
Moving on, let's get the $60$ random samples from our dataset.
# Create a new dataframe from the random points
sample_df = pd.DataFrame(dict(x=x[indexes],f=f[indexes],y=y[indexes])) # New dataframe
sample_df.head()
Let's take one more look at our data to see which points we've selected.
fig, ax = plt.subplots(1,1, figsize=(10,6))
ax.plot(sample_df['x'], sample_df['y'], 's', alpha=0.4, ms=10, label="in-sample y (observed)")
ax.plot(x,y, '.', label=r'Population $y$')
ax.plot(x,f, lw=4, label='True Response')
ax.legend(loc='upper left')
ax.set_xlabel(r'$x$')
ax.set_ylabel(r'$y$')
fig.tight_layout()
Now we do our favorite thing and split the sample data into training and testing sets.
Note that here we are actually getting indices instead of the actual training and test set. This is okay and is another way of generating train-test splits.
from sklearn.model_selection import train_test_split
datasize=sample_df.shape[0]
#split dataset using the index, as we have x, f, and y that we want to split.
itrain, itest = train_test_split(np.arange(datasize), train_size=0.8)
xtrain = sample_df.x[itrain].values
ftrain = sample_df.f[itrain].values
ytrain = sample_df.y[itrain].values
xtest= sample_df.x[itest].values
ftest = sample_df.f[itest].values
ytest = sample_df.y[itest].values
Great! At this point we've explored our data a little bit, selected a sample of the dataset, and done a train-test split on the sample dataset.
Let's move on to the data analysis. We'll begin with ridge regression. In particular we'll do ridge regression on a single predictor and compare it with simple linear regression.
To start, let's fit the old classic, linear regression.
from sklearn.linear_model import LinearRegression
# fit the model to training data
simp_reg = LinearRegression().fit(xtrain.reshape(-1,1), ytrain)
# save the beta coefficients
beta0_sreg = simp_reg.intercept_
beta1_sreg = simp_reg.coef_[0]
print("(beta0, beta1) = ({0:8.6f}, {1:8.6f})".format(beta0_sreg, beta1_sreg))
Part 2: Bootstrapping¶
But wait! Unlike statsmodels, we don't get confidence intervals for the betas. Fortunately, we can bootstrap to build the confidence intervals.
- In the code below, two key steps of bootstrapping are missing. Fill in the code to draw sample indices with replacement and to fit the model to the bootstrap sample. You'll need numpy's np.random.choice; see the numpy documentation if you need it.
- Visualize the results, and use numpy's np.percentile to compute the interval endpoints (again, see the numpy documentation if needed). One possible solution sketch follows the code cell below.
N = 1000
bootstrap_beta1s = np.zeros(N)
for cur_bootstrap_rep in range(N):
    # use np.random.choice to select 48 indices ranging from 0 to 47, with replacement,
    # and store them in inds_to_sample (48 is the size of our training set)
    ###########
    # your code here
    ###########
    # take the sample
    x_train_resample = xtrain[inds_to_sample]
    y_train_resample = ytrain[inds_to_sample]
    # refit the model
    ###########
    # your code here
    ###########
    # extract the beta1 and append
    bootstrap_beta1s[cur_bootstrap_rep] = bootstrap_model.coef_[0]
## calculate 5th and 95th percentiles & display the results
###########
# your code here
###########
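If you get stuck, here is one possible way to fill in the missing pieces (a sketch; it uses len(xtrain) rather than hard-coding 48, so it adapts to the train-test split above):
bootstrap_beta1s = np.zeros(N)
for cur_bootstrap_rep in range(N):
    # draw training-set indices with replacement
    inds_to_sample = np.random.choice(len(xtrain), size=len(xtrain), replace=True)
    # take the sample
    x_train_resample = xtrain[inds_to_sample]
    y_train_resample = ytrain[inds_to_sample]
    # refit a simple linear regression on the bootstrap sample
    bootstrap_model = LinearRegression().fit(x_train_resample.reshape(-1, 1), y_train_resample)
    # extract the slope estimate
    bootstrap_beta1s[cur_bootstrap_rep] = bootstrap_model.coef_[0]

# visualize the bootstrap sampling distribution of beta1
plt.hist(bootstrap_beta1s, bins=30)
plt.xlabel(r'$\hat{\beta}_1$')
plt.ylabel('count');

# 90% bootstrap confidence interval for beta1
lower, upper = np.percentile(bootstrap_beta1s, [5, 95])
print("90% bootstrap CI for beta1: ({0:.4f}, {1:.4f})".format(lower, upper))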
From the above, we find that the bootstrap $90\%$ confidence interval is well away from $0$. We can confidently say that $\beta_{1}$ is not actually $0$, and that we are not simply being fooled by randomness.
Part 3: Ridge regression for Simple Regression¶
To begin, we'll use sklearn to do simple linear regression on the sampled training data. We'll then do ridge regression with the same data, setting the penalty parameter $\lambda$ to zero. What do you expect to happen when we set $\lambda = 0$ in the ridge penalty?
We will store the regression coefficients in a dataframe for easy comparison. The cell below provides some code to set up the dataframe ahead of time. Notice that we don't know the actual values in the pandas series yet, so we just set them to NaN. We will overwrite these later.
regression_coeffs = dict() # Store regression coefficients from each model in a dictionary
regression_coeffs['OLS'] = [np.nan]*2 # Initialize to NaN; filled in later
regression_coeffs[r'Ridge $\lambda = 0$'] = [np.nan]*2
dfResults = pd.DataFrame(regression_coeffs) # Create dataframe
dfResults.rename({0: r'$\hat{\beta}_{0}$', 1: r'$\hat{\beta}_{1}$'}, inplace=True) # Rename rows
dfResults
We start with simple linear regression to get the ball rolling.
simp_reg = LinearRegression() # build the the ordinary least squares model
simp_reg.fit(xtrain.reshape(-1,1), ytrain) # fit the model to training data
# save the beta coefficients
beta0_sreg = simp_reg.intercept_
beta1_sreg = simp_reg.coef_[0]
dfResults['OLS'] = [beta0_sreg, beta1_sreg]
dfResults
#y_predict = lambda x : beta0_sreg + beta1_sreg*x # a user function to make predictions
ypredict_ols = simp_reg.predict(x.reshape(-1,1))
We will use the above $\boldsymbol\beta$ coefficients as a benchmark for comparison with the ridge method. The same coefficients can be obtained with ridge regression, which we demonstrate now.
For reference, here is the ridge regression documentation: sklearn.linear_model.Ridge.
from sklearn.linear_model import Ridge
The snippet of code below implements the ridge regression with $\lambda = 0$.
Note: The weight $\lambda$ is referred to as alpha in the documentation.
Remark: $\lambda$ goes by many names including, but not limited to: regularization parameter, penalization parameter, penalty factor, tuning parameter, shrinkage parameter, and weight. Regardless of the name, it is a hyperparameter. That is, you set it before you begin the training process. An algorithm can be very sensitive to its hyperparameters, and we will discuss a method for selecting the "correct" hyperparameter values later in this lab.
ridge_reg = Ridge(alpha = 0) # build the ridge regression model with specified lambda, i.e. alpha
ridge_reg.fit(xtrain.reshape(-1,1), ytrain) # fit the model to training data
# save the beta coefficients
beta0_ridge = ridge_reg.intercept_
beta1_ridge = ridge_reg.coef_[0]
ypredict_ridge = ridge_reg.predict(x.reshape(-1,1)) # make predictions everywhere
dfResults[r'Ridge $\lambda = 0$'] = [beta0_ridge, beta1_ridge]
dfResults
The beta coefficients for linear and ridge regressions coincide for $\lambda = 0$, as expected. We plot the data and fits.
fig, ax = plt.subplots(1,1, figsize=(10,6))
ax.plot(xtrain, ytrain, 's', alpha=0.3, ms=10, label="in-sample y (observed)") # plot in-sample training data
ax.plot(x, y, '.', alpha=0.4, label="population y") # plot population data
ax.plot(x, f, ls='-', alpha=0.4, lw=4, label="True function")
ax.plot(x, ypredict_ols, ls='--', lw=4, label="OLS") # plot simple linear regression fit (predictions computed above)
ax.plot(x, ypredict_ridge, ls='-.', lw = 4, label="Ridge") # plot ridge regression fit
ax.set_xlabel('$x$')
ax.set_ylabel('$y$')
ax.legend(loc=4);
fig.tight_layout()
Make a plot of the ridge regression predictions for $\lambda = 0, 5, 10, 100$. Be sure to include a legend.
What happens for very large $\lambda$ (e.g. $\lambda \to \infty$)?
Your plot should look something like the following plot (doesn't have to be exact):
# Your code here
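One possible solution sketch (our own; the exact styling does not matter):
fig, ax = plt.subplots(1, 1, figsize=(10, 6))
ax.plot(xtrain, ytrain, 's', alpha=0.3, ms=10, label="in-sample y (observed)")
ax.plot(x, f, lw=4, alpha=0.4, label="True function")
for lam in [0, 5, 10, 100]:
    ridge_reg = Ridge(alpha=lam).fit(xtrain.reshape(-1, 1), ytrain)  # refit at each penalty value
    ax.plot(x, ridge_reg.predict(x.reshape(-1, 1)), lw=3, label=r"Ridge, $\lambda = {}$".format(lam))
ax.set_xlabel('$x$')
ax.set_ylabel('$y$')
ax.legend(loc='best');
As $\lambda$ grows, the slope is shrunk toward $0$ (sklearn's Ridge does not penalize the intercept), so for very large $\lambda$ the fit approaches a horizontal line at the mean of the training $y$.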
Part 3 Recap¶
That was nice, but we were just doing simple linear regression. We really want to do more interesting regression problems like multilinear regression. We will do so in the next section.
Part 4: Ridge regression with polynomial features on a grid¶
Now we'll make a more complex model by adding polynomial features. Instead of building the linear model $y = \beta_0 + \beta_1 x$, we build a polynomial model $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \ldots + \beta_d x^d$ for some $d$ to be determined. This regression is still linear, though, since we treat $x^2, \ldots, x^d$ themselves as predictors in a linear model.
The design matrix $\mathbf{X}$ contains columns corresponding to $1, x, x^2, \ldots, x^d$. To build it, we use sklearn. (This particular design matrix is also known as the Vandermonde matrix.) For example, if we have three observations
\begin{align*}
\left\{\left(x_{1}, y_{1}\right), \left(x_{2}, y_{2}\right), \left(x_{3}, y_{3}\right)\right\}
\end{align*}
and we want polynomial features up to and including degree $4$, the design matrix looks like
\begin{align*}
X = \begin{bmatrix}
x_1^0 & x_1^1 & x_1^2 & x_1^3 & x_1^4\\
x_2^0 & x_2^1 & x_2^2 & x_2^3 & x_2^4\\
x_3^0 & x_3^1 & x_3^2 & x_3^3 & x_3^4\\
\end{bmatrix} =
\begin{bmatrix}
1& x_1^1 & x_1^2 & x_1^3 & x_1^4\\
1 & x_2^1 & x_2^2 & x_2^3 & x_2^4\\
1 & x_3^1 & x_3^2 & x_3^3 & x_3^4\\
\end{bmatrix}.
\end{align*}
- Make a toy vector called toy, where
\begin{align*} \mathrm{toy} = \begin{bmatrix} 0 \\ 2 \\ 5 \\ \end{bmatrix}. \end{align*}
- Build the feature matrix up to (and including) degree $4$. Confirm that the entries in the matrix are what you'd expect based on the above discussion.

Note: You may use sklearn to build the matrix using PolynomialFeatures().
from sklearn.preprocessing import PolynomialFeatures
# your code here
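One possible solution sketch:
toy = np.array([0, 2, 5])
PolynomialFeatures(4).fit_transform(toy.reshape(-1, 1))
# each row should read [1, x, x^2, x^3, x^4] for x = 0, 2, and 5, matching the matrix above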
We now continue working with our data.
The code provided below is missing a few lines and many comments. Do the following:
- Comment every line of the code.
  - Normally, you won't do such excessive commenting. In this case, we want to make sure you understand every single line, since you didn't actually write this code.
- Fill in the missing lines:
  - Create a ridge regression object at each $\lambda$ value in the list.
  - Perform the ridge regression using the fit method of the newly created ridge regression object.
  - Make a prediction on the grid and store the results in ypredict_ridge.

(If you get stuck, one possible way to fill in the missing lines is sketched after the code cell below.)
Note: We're not giving you an example figure here since we gave you most of the code.
Warning! Make sure you understand the entire code! There are many nice things in there.
d = 20
# You will create a grid of plots of this size (7 x 2)
rows = 7
cols = 2
lambdas = [0., 1e-6, 1e-3, 1e-2, 1e-1, 1, 10]
grid_to_predict = np.arange(0, 1, .01)
Xtrain = PolynomialFeatures(d).fit_transform(xtrain.reshape(-1,1))
test_set = PolynomialFeatures(d).fit_transform(grid_to_predict.reshape(-1,1))
fig, axs = plt.subplots(rows, cols, sharex='col', figsize=(12, 24)) # Set up plotting objects
for i, lam in enumerate(lambdas):
    # your code here
    # Create regression object
    # Fit on regression object
    # Do a prediction on the test set
    ### Provided code
    axs[i,0].plot(xtrain, ytrain, 's', alpha=0.4, ms=10, label="in-sample y")
    axs[i,0].plot(grid_to_predict, ypredict_ridge, 'k-', label=r"$\lambda = {0}$".format(lam))
    axs[i,0].set_ylabel('$y$')
    axs[i,0].set_ylim((0, 1))
    axs[i,0].set_xlim((0, 1))
    axs[i,0].legend(loc='best')
    coef = ridge_reg.coef_.ravel()
    axs[i,1].semilogy(np.abs(coef), ls=' ', marker='o', label=r"$\lambda = {0}$".format(lam))
    axs[i,1].set_ylim((1e-04, 1e+15))
    axs[i,1].set_xlim(1, 20)
    axs[i,1].yaxis.set_label_position("right")
    axs[i,1].set_ylabel(r'$\left|\beta_{j}\right|$')
    axs[i,1].legend(loc='best')
axs[-1, 0].set_xlabel("x")
axs[-1, 1].set_xlabel(r"$j$");
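If you get stuck, here is one possible way to fill in the three missing lines inside the loop (a sketch; the variable names match the provided plotting code):
    ridge_reg = Ridge(alpha=lam)                  # create the ridge regression object for this lambda
    ridge_reg.fit(Xtrain, ytrain)                 # fit it to the polynomial training features
    ypredict_ridge = ridge_reg.predict(test_set)  # predict on the grid of x values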
As you can see, as we increase $\lambda$ from 0 up to 10, we start out overfitting, then do well, and then our fits develop a mind of their own irrespective of the data, as the penalty term dominates.
YOUR DISCUSSION HERE
Part 4 Recap¶
We did a ridge regression on our dataset where the features were the polynomial terms. We also assessed the impact of the regularization parameter on the solution.
Part 5: Cross-validation --- Finding the best penalization parameter¶
To determine the value of $\lambda$ that gives us our best predictive model, which we'll refer to as $\lambda^*$, we use cross-validation. To do this we rely on the concept of a meta-estimator from scikit-learn.
Model selection is supported by two distinct meta-estimators:
- GridSearchCV
- RandomizedSearchCV

The input to these meta-estimators is an estimator, which has some hyperparameters (e.g. $\lambda$) that need to be optimized, and a set of hyperparameter settings to search through.
The concept of a meta-estimator allows us to wrap, for example, cross-validation, or methods that build and combine simpler models or schemes. For example:
est = Ridge()
parameters = {"alpha": [1e-8, 1e-6, 1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 1e-2, 1e-1, 1.0]}
gridclassifier = GridSearchCV(est, param_grid=parameters, cv=4, scoring="neg_mean_squared_error")
GridSearchCV replaces the manual iteration over folds (e.g. with KFold) and the averaging of scores, doing it all for us. It takes a hyperparameter grid in the shape of a dictionary as input, sets $\lambda$ to each of the values you want to try in turn, trains the model using cross-validation, and records the error for each value of the hyperparameter $\lambda$. Finally, it compares the errors for the different $\lambda$'s and picks the best model.
Here is a helper function that we will use to get the best Ridge regression.
from sklearn.model_selection import GridSearchCV
def cv_optimize_ridge(x: np.ndarray, y: np.ndarray, list_of_lambdas: list, n_folds: int = 4):
    est = Ridge()
    parameters = {'alpha': list_of_lambdas}
    # we score with negative mean squared error here; Ridge's own .score() uses R^2,
    # and you can pass a different scoring rule to the cross-validation if you want.
    gs = GridSearchCV(est, param_grid=parameters, cv=n_folds, scoring="neg_mean_squared_error")
    gs.fit(x, y)
    return gs
Use cv_optimize_ridge on the polynomial training features Xtrain and ytrain with the list of $\lambda$ values below, and store the fitted GridSearchCV object in fitmodel.
lambs = [1e-7, 1e-6, 1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0]
# your code here
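# One possible solution (a sketch): run the grid-search helper above on the polynomial training features from Part 4
fitmodel = cv_optimize_ridge(Xtrain, ytrain, lambs, n_folds=4)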
print(fitmodel.best_estimator_, "\n")
print(fitmodel.best_params_, "\n")
print(fitmodel.best_score_, "\n")
We also output the mean cross-validation error at different $\lambda$ (with a negative sign, as scikit-learn likes to maximize negative error which is equivalent to minimizing error).
fitmodel.cv_results_
fit_lambdas = [d['alpha'] for d in fitmodel.cv_results_['params']]
fit_scores = fitmodel.cv_results_['mean_test_score']
Now we make a log-log plot of -fit_scores versus fit_lambdas.
fig, ax = plt.subplots(1,1, figsize=(10,6))
ax.plot(fit_lambdas, -fit_scores, ls='-', marker='o')
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel(r'$\lambda$')
ax.set_ylabel('scores');
SK-learn's cross_val_score: Easier Cross Validation¶
GridSearchCV is an important tool when you are searching over many hyperparameters (and believe us, you will be), but when you only need CV scores for a particular model, some students find cross_val_score more intuitive.
from sklearn.model_selection import cross_val_score
lr_object = Ridge(alpha=0)
cross_val_score(lr_object, Xtrain, ytrain, cv=5)
We can loop over particular models and get scores for each (equivalent to GridSearchCV over the given parameter settings).
for cur_alpha in [1e-8, 1e-4, 1e-2, 1.0, 10.0]:
    lr_object = Ridge(alpha=cur_alpha)
    scores = cross_val_score(lr_object, Xtrain, ytrain, cv=5)
    print("lambda {0}\t R^2 scores: {1}\t Mean R^2: {2}".format(cur_alpha, scores, np.mean(scores)))
Built-in Cross Validation: RidgeCV and LassoCV¶
Some sklearn models have built-in, automated cross-validation to tune their hyperparameters.
from sklearn.linear_model import RidgeCV
ridgeCV_object = RidgeCV(alphas=lambs, cv=5)
ridgeCV_object.fit(Xtrain, ytrain)
print("Best model searched:\nalpha = {}\nintercept = {}\nbetas = {}, ".format(ridgeCV_object.alpha_,
ridgeCV_object.intercept_,
ridgeCV_object.coef_
)
)
Important note:¶
- For any tool more automated than literally using KFold yourself, just setting cv=5 will NOT shuffle your data by default.
- To force shuffling, explicitly pass a KFold object (with shuffling turned on) to the cv argument.
- You may prefer a strategy where you shuffle the rows of your data at the outset of the analysis.
# declare and pass a KFold object to properly shuffle the training data, and/or set the random state
from sklearn.model_selection import KFold
splitter = KFold(5, random_state=42, shuffle=True)
ridgeCV_object = RidgeCV(alphas=(1e-8, 1e-4, 1e-2, 1.0, 10.0), cv=splitter)
ridgeCV_object.fit(Xtrain, ytrain)
print("Best model searched:\nalpha = {}\nintercept = {}\nbetas = {}, ".format(ridgeCV_object.alpha_,
ridgeCV_object.intercept_,
ridgeCV_object.coef_
)
)
Part 5b: Refitting on full training set¶
At this point, we have determined the best penalization parameter for the ridge regression on our current dataset using cross validation. Let's refit the estimator on the training set and calculate and plot the test set error and the polynomial coefficients. Notice how many of these coefficients have been pushed to lower values or 0.
Store in est the estimator obtained by refitting the entire training set using the best $\lambda$ found above (store that value in best_lambda). Assign the predictions on grid_to_predict to the variable ypredict_ridge_best.
# your code here
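# One possible solution (a sketch; reuses Xtrain and test_set from Part 4 and fitmodel from Part 5):
best_lambda = fitmodel.best_params_['alpha']   # best penalty found by the grid search
est = Ridge(alpha=best_lambda)                 # ridge model at the best lambda
est.fit(Xtrain, ytrain)                        # refit on the full polynomial training set
ypredict_ridge_best = est.predict(test_set)    # predictions on the grid_to_predict features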
# code provided from here on
fig, axs = plt.subplots(1, 2, figsize=(12, 4))
left = 0
right = 1
axs[left].plot(x,f, lw=4, label='True Response')
axs[left].plot(xtrain, ytrain, 's', alpha=0.3, ms=10, label="in-sample y (observed)")
axs[left].plot(x, y, '.', alpha=0.8, label="population y")
axs[left].plot(grid_to_predict, ypredict_ridge_best, 'k--', label=r"$\lambda = {{{0:1.4f}}}$".format(best_lambda))
axs[left].set_ylabel('$y$')
axs[left].set_ylim((0, 1))
axs[left].set_xlim((0, 1))
axs[left].legend(loc=2)
coef = est.coef_.ravel()
axs[right].semilogy(np.abs(coef), marker='o', label=r"$\lambda = {0}$".format(best_lambda))
axs[right].set_ylim((1e-04, 1.0e+11))
axs[right].set_xlim(1, 20)
axs[right].yaxis.set_label_position("right")
axs[right].set_ylabel(r'$\left|\beta_{j}\right|$')
axs[right].legend(loc='best')
axs[left].set_xlabel("x")
axs[right].set_xlabel(r'$j$');