Key Word(s): Scikit-learn, Linear Regression, k-Nearest Neighbors (kNN) Regression

# CS109A Introduction to Data Science

## Lab 3: Scikit-learn for Regression¶

**Harvard University**

**Fall 2018**

**Instructors:** Pavlos Protopapas and Kevin Rader

**Authors:** David Sondak, Will Claybaugh, Eleni Kaxiras

```
# RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
```

# Table of Contents¶

- Array creation and reshape
- Some plotting
- Simple linear regression
- $k$-nearest neighbors

## Learning Goals¶

Overall description and goal for the lab.

By the end of this lab, you should be able to:

- Understand array reshaping
- Review how to make plots
- Feel comfortable with simple linear regression
- Feel comfortable with $k$ nearest neighbors

**This lab corresponds to lecture 4 and maps on to homework 2 (and beyond).**

```
# import the necessary libraries
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
import time
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
```

## Simple Linear Regression¶

Linear regression and its many extensions are a workhorse of the statistics and data science community, both in application and as a reference point for other models. Most of the major concepts in machine learning can be and often are discussed in terms of various linear regression models. Thus, this section will introduce you to building and fitting linear regression models and some of the process behind it, so that you can 1) fit models to data you encounter 2) experiment with different kinds of linear regression and observe their effects 3) see some of the technology that makes regression models work.

### Linear regression with a toy dataset¶

We first examine a toy problem, focusing our efforts on fitting a linear model to a small dataset with three observations. Each observation consists of one predictor $x_i$ and one response $y_i$ for $i = 1, 2, 3$,

\begin{align*} (x , y) = \{(x_1, y_1), (x_2, y_2), (x_3, y_3)\}. \end{align*}To be very concrete, let's set the values of the predictors and responses.

\begin{equation*} (x , y) = \{(1, 2), (2, 2), (3, 4)\} \end{equation*}There is no line of the form $\beta_0 + \beta_1 x = y$ that passes through all three observations, since the data are not collinear. Thus our aim is to find the line that best fits these observations in the *least-squares sense*, as discussed in lecture.

**Exercise (10 min)**

- Make two numpy arrays out of this data, x_train and y_train
- Check the dimentions of these arrays
- Try to reshape them into a different shape
- Make points into a very simple scatterplot
- Make a better scatterplot

```
# your code here
x_train = np.array([1,2,3])
y_train = np.array([2,3,6])
type(x_train)
```

```
x_train.shape
```

```
x_train = x_train.reshape(3,1)
x_train.shape
```

```
xx = np.array([[1,3,5],[6,2,1]])
xx.shape
```

```
xx = xx.reshape(3,-1)
xx
```

```
# make a simple scatterplot
x_train = np.array([1,2,3])
y_train = np.array([2,2,4])
plt.scatter(x_train,y_train)
# check dimensions
print(x_train.shape,y_train.shape)
```

```
def nice_scatterplot(x, y, title):
# font size
f_size = 18
# make the figure
fig, ax = plt.subplots(1,1, figsize=(8,5)) # Create figure object
# set axes limits to make the scale nice
ax.set_xlim(np.min(x)-1, np.max(x) + 1)
ax.set_ylim(np.min(y)-1, np.max(y) + 1)
# adjust size of tickmarks in axes
ax.tick_params(labelsize = f_size)
# remove tick labels
ax.tick_params(labelbottom=False, bottom=False)
# adjust size of axis label
ax.set_xlabel(r'$x$', fontsize = f_size)
ax.set_ylabel(r'$y$', fontsize = f_size)
# set figure title label
ax.set_title(title, fontsize = f_size)
# you may set up grid with this
ax.grid(True, lw=1.75, ls='--', alpha=0.15)
# make actual plot (Notice the label argument!)
#ax.scatter(x, y, label=r'$my points$')
#ax.scatter(x, y, label='$my points$')
ax.scatter(x, y, label=r'$my\,points$')
ax.legend(loc='best', fontsize = f_size);
return ax
nice_scatterplot(x_train, y_train, 'hello nice plot')
```

#### Formulae¶

Linear regression is special among the models we study beuase it can be solved explicitly. While most other models (and even some advanced versions of linear regression) must be solved itteratively, linear regression has a formula where you can simply plug in the data.

For the single predictor case it is: \begin{align} \beta_1 &= \frac{\sum_{i=1}^n{(x_i-\bar{x})(y_i-\bar{y})}}{\sum_{i=1}^n{(x_i-\bar{x})^2}}\\ \beta_0 &= \bar{y} - \beta_1\bar{x}\ \end{align}

Where $\bar{y}$ and $\bar{x}$ are the mean of the y values and the mean of the x values, respectively.

From the re-aranged second equation we can see that the best-fit line passes through $(\bar{x},\bar{y})$, the center of mass of the data

From any of the first equations, we can see that the slope of the line has to do with whether or not an x value that is above/below the center of mass is typically paired with a y value that is likewise above/below, or typically paired with one that is opposite.

### Building a model from scratch¶

In this part, we will solve the equations for simple linear regression and find the best fit solution to our toy problem.

The snippets of code below implement the linear regression equations on the observed predictors and responses, which we'll call the training data set. Let's walk through the code.

We have to reshape our arrrays to 2D. We will see later why.

**Exercise (5 min)**

- make an array with shape (2,3)
- reshape it to a size that you want

```
# your code here
xx = np.array([[1,2,3],[4,6,8]])
xxx = xx.reshape(-1,2)
xxx.shape
```

```
# Reshape to be a proper 2D array
x_train = x_train.reshape(x_train.shape[0], 1)
y_train = y_train.reshape(y_train.shape[0], 1)
print(x_train.shape)
```

```
# first, compute means
y_bar = np.mean(y_train)
x_bar = np.mean(x_train)
# build the two terms
numerator = np.sum( (x_train - x_bar)*(y_train - y_bar) )
denominator = np.sum((x_train - x_bar)**2)
print(numerator.shape, denominator.shape) #check shapes
```

- Why the empty brackets? (The numerator and denominator are scalars, as expected.)

```
#slope beta1
beta_1 = numerator/denominator
#intercept beta0
beta_0 = y_bar - beta_1*x_bar
print("The best-fit line is {0:3.2f} + {1:3.2f} * x".format(beta_0, beta_1))
print(f'The best fit is {beta_0}')
```

**Exercise (5 min)**

Turn the code from the above cells into a function called `simple_linear_regression_fit`

, that inputs the training data and returns `beta0`

and `beta1`

.

To do this, copy and paste the code from the above cells below and adjust the code as needed, so that the training data becomes the input and the betas become the output.

```
def simple_linear_regression_fit(x_train: np.ndarray, y_train: np.ndarray) -> np.ndarray:
return
```

Check your function by calling it with the training data from above and printing out the beta values.

```
def simple_linear_regression_fit(x_train: np.ndarray, y_train: np.ndarray) -> np.ndarray:
"""
Inputs:
x_train: a (num observations by 1) array holding the values of the predictor variable
y_train: a (num observations by 1) array holding the values of the response variable
Returns:
beta_vals: a (num_features by 1) array holding the intercept and slope coeficients
"""
# Check input array sizes
if len(x_train.shape) < 2:
print("Reshaping features array.")
x_train = x_train.reshape(x_train.shape[0], 1)
if len(y_train.shape) < 2:
print("Reshaping observations array.")
y_train = y_train.reshape(y_train.shape[0], 1)
# first, compute means
y_bar = np.mean(y_train)
x_bar = np.mean(x_train)
# build the two terms
numerator = np.sum( (x_train - x_bar)*(y_train - y_bar) )
denominator = np.sum((x_train - x_bar)**2)
#slope beta1
beta_1 = numerator/denominator
#intercept beta0
beta_0 = y_bar - beta_1*x_bar
return np.array([beta_0,beta_1])
```

- Let's run this function and see the coefficients

```
x_train = np.array([1 ,2, 3])
y_train = np.array([2, 2, 4])
betas = simple_linear_regression_fit(x_train, y_train)
beta_0 = betas[0]
beta_1 = betas[1]
print("The best-fit line is {0:8.6f} + {1:8.6f} * x".format(beta_0, beta_1))
```

**Exercise (5 min)**

- Do the values of
`beta0`

and`beta1`

seem reasonable? - Plot the training data using a scatter plot.
- Plot the best fit line with
`beta0`

and`beta1`

together with the training data.

```
fig_scat, ax_scat = plt.subplots(1,1, figsize=(10,6))
# Plot best-fit line
x_train = np.array([[1, 2, 3]]).T
best_fit = beta_0 + beta_1 * x_train
ax_scat.scatter(x_train, y_train, s=300, label='Training Data')
ax_scat.plot(x_train, best_fit, ls='--', label='Best Fit Line')
ax_scat.set_xlabel(r'$x_{train}$')
ax_scat.set_ylabel(r'$y$');
```

The values of `beta0`

and `beta1`

seem roughly reasonable. They capture the positive correlation. The line does appear to be trying to get as close as possible to all the points.

### Building a model with `statsmodels`

and `sklearn`

¶

Now that we can concretely fit the training data from scratch, let's learn two `python`

packages to do it all for us:

Our goal is to show how to implement simple linear regression with these packages. For an important sanity check, we compare the $\beta$ values from `statsmodels`

and `sklearn`

to the $\beta$ values that we found from above with our own implementation.

For the purposes of this lab, `statsmodels`

and `sklearn`

do the same thing. More generally though, `statsmodels`

tends to be easier for inference [finding the values of the slope and intercept and dicussing uncertainty in those values], whereas `sklearn`

has machine-learning algorithms and is better for prediction [guessing y values for a given x value]. (Note that both packages make the same guesses, it's just a question of which activity they provide more support for.

**Note:** `statsmodels`

and `sklearn`

are different packages! Unless we specify otherwise, you can use either one.

Below is the code for `statsmodels`

. `Statsmodels`

does not by default include the column of ones in the $X$ matrix, so we include it manually with `sm.add_constant`

.

```
import statsmodels.api as sm
```

```
# create the X matrix by appending a column of ones to x_train
X = sm.add_constant(x_train)
# this is the same matrix as in our scratch problem!
print(X)
# build the OLS model (ordinary least squares) from the training data
toyregr_sm = sm.OLS(y_train, X)
# do the fit and save regression info (parameters, etc) in results_sm
results_sm = toyregr_sm.fit()
# pull the beta parameters out from results_sm
beta0_sm = results_sm.params[0]
beta1_sm = results_sm.params[1]
print("The regression coefficients from the statsmodels package are: beta_0 = {0:8.6f} and beta_1 = {1:8.6f}".format(beta0_sm, beta1_sm))
```

Besides the beta parameters, `results_sm`

contains a ton of other potentially useful information.

```
import warnings
warnings.filterwarnings('ignore')
print(results_sm.summary())
```

Now let's turn our attention to the `sklearn`

library.

```
from sklearn import linear_model
```

```
# build the least squares model
toyregr = linear_model.LinearRegression()
# save regression info (parameters, etc) in results_skl
results = toyregr.fit(x_train, y_train)
# pull the beta parameters out from results_skl
beta0_skl = toyregr.intercept_
beta1_skl = toyregr.coef_[0]
print("The regression coefficients from the sklearn package are: beta_0 = {0:8.6f} and beta_1 = {1:8.6f}".format(beta0_skl, beta1_skl))
```

We should feel pretty good about ourselves now, and we're ready to move on to a real problem!

### The shape of things in `scikit-learn`

¶

Before diving right in to a "real" problem, we really ought to discuss more of the details of `sklearn`

. We do this now. Along the way, we'll import the real-world dataset.

`Scikit-learn`

is the main `python`

machine learning library. It consists of many learners which can learn models from data, as well as a lot of utility functions such as `train_test_split`

. It can be used in `python`

by the incantation `import sklearn`

.

In scikit-learn, an **estimator** is a Python object that implements the methods fit(X, y) and predict(T)

Let's see the structure of `scikit-learn`

needed to make these fits. `.fit`

always takes two arguments:

```
estimator.fit(Xtrain, ytrain)
```

We will consider two estimators in this lab: `LinearRegression`

and `KNeighborsRegressor`

.

Critically, `Xtrain`

must be in the form of an *array of arrays* (or a 2x2 array) with the inner arrays each corresponding to one sample, and whose elements correspond to the feature values for that sample (visuals coming in a moment).

`ytrain`

on the other hand is a simple array of responses. These are continuous for regression problems.

### Practice with `sklearn`

¶

We begin by loading up the `mtcars`

dataset and cleaning it up a little bit.

```
import pandas as pd
#load mtcars
dfcars = pd.read_csv("data/mtcars.csv")
dfcars = dfcars.rename(columns={"Unnamed: 0":"car name"})
dfcars.head()
```

Next, let's split the dataset into a training set and test set.

```
# split into training set and testing set
from sklearn.model_selection import train_test_split
#set random_state to get the same split every time
traindf, testdf = train_test_split(dfcars, test_size=0.2, random_state=42)
```

```
# testing set is around 20% of the total data; training set is around 80%
print("Shape of full dataset is: {0}".format(dfcars.shape))
print("Shape of training dataset is: {0}".format(traindf.shape))
print("Shape of test dataset is: {0}".format(testdf.shape))
```

Now we have training and test data. We still need to select a predictor and a response from this dataset. Keep in mind that we need to choose the predictor and response from both the training and test set. You will do this in the exercises below. However, we provide some starter code for you to get things going.

```
# Extract the response variable that we're interested in
y_train = traindf.mpg
```

Notice the shape of `y_train`

.

```
np.shape(y_train)
```

Another way to see the shape is to use the shape method.

```
y_train.shape
```

This is *not* an "array of arrays". That's okay! Remember, `sklearn`

requires an array of arrays only for the predictor array! You will have to pay close attention to this in the exercises later.

For now, let's discuss two ways out of this debacle. All we'll do is get `y_train`

to be an array of arrays. This doesn't hurt anything because `sklearn`

doesn't care too much about the shape of `y_train`

.

First, let's reshape `y_train`

to be an array of arrays using the `reshape`

method. We want the first dimension of `y_train`

to be size $25$ and the second dimension to be size $1$.

```
y_train_reshape = y_train.values.reshape(y_train.shape[0], 1)
```

```
y_train_reshape.shape
```

Notice that `y_train.shape[0]`

gives the size of the first dimension.

There's an even easier way to get the correct shape right from the beginning.

```
y_train_reshape = traindf[['mpg']]
```

```
y_train_reshape.shape
```

Finally, there is a nice shortcut to reshaping an array. `numpy`

can infer a dimension based on the other dimensions specified.

```
y_train_reshape = y_train.values.reshape(-1,1)
y_train_reshape.shape
```

In this case, we said the second dimension should be size $1$. Since the requirement of the `reshape()`

method is that the requested dimensions be compatible, `numpy`

decides the the first dimension must be size $25$.

What would the `.shape`

return if we did `y_train.values.reshape(-1,5)`

?

Okay, enough of that. The whole reason we went through that whole process was to show you how to reshape your data into the correct format.

**IMPORTANT:** Remember that your response variable `ytrain`

can be a vector but your predictor variable `xtrain`

** must** be an array!

### Simple linear regression with automobile data¶

We will now use `sklearn`

to predict automobile mileage per gallon (mpg) and evaluate these predictions. We already loaded the data and split them into a training set and a test set.

We need to choose the variables that we think will be good predictors for the dependent variable `mpg`

.

**Exercise (10 min)**

- Pick one variable to use as a predictor for simple linear regression. Create a markdown cell below and discuss your reasons.
- Justify your choice with some visualizations.
- Is there a second variable you'd like to use? For example, we're not doing multiple linear regression here, but if we were, is there another variable you'd like to include if we were using two predictors?

```
# Your code here
```

```
y_mpg = dfcars.mpg
x_wt = dfcars.wt
x_hp = dfcars.hp
fig_wt, ax_wt = plt.subplots(1,1, figsize=(10,6))
ax_wt.scatter(x_wt, y_mpg)
ax_wt.set_xlabel(r'Car Weight')
ax_wt.set_ylabel(r'Car MPG')
fig_hp, ax_hp = plt.subplots(1,1, figsize=(10,6))
ax_hp.scatter(x_hp, y_mpg)
ax_hp.set_xlabel(r'Car HP')
ax_hp.set_ylabel(r'Car MPG')
```

**Exercise**

- Use
`sklearn`

to fit the training data using simple linear regression. - Use the model to make mpg predictions on the test set.
- Plot the data and the prediction.
- Print out the mean squared error for the training set and the test set and compare.

**Hints:**

- Use the following to perform the analysis:
from sklearn.linear_model import LinearRegression from sklearn.model_selection import train_test_split from sklearn.metrics import mean_squared_error

```
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
dfcars = pd.read_csv("data/mtcars.csv")
dfcars = dfcars.rename(columns={"Unnamed: 0":"name"})
dfcars.head()
```

```
traindf, testdf = train_test_split(dfcars, test_size=0.2, random_state=42)
y_train = np.array(traindf.mpg)
X_train = np.array(traindf.wt)
X_train = X_train.reshape(X_train.shape[0], 1)
```

```
y_test = np.array(testdf.mpg)
X_test = np.array(testdf.wt)
X_test = X_test.reshape(X_test.shape[0], 1)
#create linear model
regression = LinearRegression()
#fit linear model
regression.fit(X_train, y_train)
predicted_y = regression.predict(X_test)
r2 = regression.score(X_test, y_test)
print(r2)
```

```
print(regression.score(X_train, y_train))
print(mean_squared_error(predicted_y, y_test))
print(mean_squared_error(y_train, regression.predict(X_train)))
print('Coefficients: \n', regression.coef_[0], regression.intercept_)
```

```
fig, ax = plt.subplots(1,1, figsize=(10,6))
ax.plot(y_test, predicted_y, 'o')
grid = np.linspace(np.min(dfcars.mpg), np.max(dfcars.mpg), 100)
ax.plot(grid, grid, color="black") # 45 degree line
ax.set_xlabel("actual y")
ax.set_ylabel("predicted y")
fig1, ax1 = plt.subplots(1,1, figsize=(10,6))
ax1.plot(dfcars.wt, dfcars.mpg, 'o')
xgrid = np.linspace(np.min(dfcars.wt), np.max(dfcars.wt), 100)
ax1.plot(xgrid, regression.predict(xgrid.reshape(100, 1)))
```

## $k$-nearest neighbors¶

Great, so we did a simple linear regression on the car data.

Now that you're familiar with `sklearn`

, you're ready to do a KNN regression. Let's use $5$ nearest neighbors.

```
from sklearn.neighbors import KNeighborsRegressor
knnreg = KNeighborsRegressor(n_neighbors=5)
```

```
knnreg.fit(X_train, y_train)
r2 = knnreg.score(X_test, y_test)
r2
```

**Exercise**

```
# Your code here
```

```
# solution
knnreg.score(X_train, y_train)
```

Lets vary the number of neighbors and see what we get.

```
regdict = {}
# Do a bunch of KNN regressions
for k in [1, 2, 4, 6, 8, 10, 15]:
knnreg = KNeighborsRegressor(n_neighbors=k)
knnreg.fit(X_train, y_train)
regdict[k] = knnreg # Store the regressors in a dictionary
```

```
# Now let's plot it all
fig, ax = plt.subplots(1,1, figsize=(10,6))
ax.plot(dfcars.wt, dfcars.mpg, 'o', label="data")
xgrid = np.linspace(np.min(dfcars.wt), np.max(dfcars.wt), 100)
for k in [1, 2, 6, 10, 15]:
predictions = regdict[k].predict(xgrid.reshape(100,1))
if k in [1, 6, 15]:
ax.plot(xgrid, predictions, label="{}-NN".format(k))
ax.legend();
```

Notice how the $1$-NN goes through every point on the training set but utterly fails elsewhere. Lets look at the scores on the training set.

```
ks = range(1, 15) # Grid of k's
scores_train = [] # R2 scores
for k in ks:
knnreg = KNeighborsRegressor(n_neighbors=k) # Create KNN model
knnreg.fit(X_train, y_train) # Fit the model to training data
score_train = knnreg.score(X_train, y_train) # Calculate R^2 score
scores_train.append(score_train)
# Plot
fig, ax = plt.subplots(1,1, figsize=(12,8))
ax.plot(ks, scores_train,'o-')
ax.set_xlabel(r'$k$')
ax.set_ylabel(r'$R^{2}$')
```

Why do we get a perfect $R^2$ at k=1?

**Exercise (5 min)**

- Make the same plot as above on the
*test*set. - What is the best $k$?

```
ks = range(1, 15) # Grid of k's
scores_test = [] # R2 scores
for k in ks:
knnreg = KNeighborsRegressor(n_neighbors=k) # Create KNN model
knnreg.fit(X_train, y_train) # Fit the model to training data
score_test = knnreg.score(X_test, y_test) # Calculate R^2 score
scores_test.append(score_test)
# Plot
fig, ax = plt.subplots(1,1, figsize=(12,8))
ax.plot(ks, scores_test,'o-', ms=12)
ax.set_xlabel(r'$k$')
ax.set_ylabel(r'$R^{2}$')
```