Key Word(s): pandas
CS109A Introduction to Data Science
Lab 4: Bonus material: Polynomial Regression
Harvard University
Fall 2021
Instructors: Pavlos Protopapas and Natesh Pillai
Lab Team: Marios Mattheakis, Hayden Joy, Chris Gumb, and Eleni Kaxiras
Authors: Eleni Kaxiras, Rahul Dave, David Sondak, Will Claybaugh, and Pavlos Protopapas
## RUN THIS CELL TO GET THE RIGHT FORMATTING
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
Learning Objectives
- Practice Polynomial Regression.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# from sklearn import preprocessing
from sklearn.preprocessing import PolynomialFeatures, MinMaxScaler
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from pandas.api.types import CategoricalDtype
from sklearn.compose import make_column_transformer, TransformedTargetRegressor
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.linear_model import Ridge
from sklearn.impute import SimpleImputer
from pandas.plotting import scatter_matrix
import seaborn as sns
%matplotlib inline
Polynomial Regression, and Revisiting the Cab Data
# read in the data, break into train and test
cab_df = pd.read_csv("../data/dataset_1.txt")
train_data, test_data = train_test_split(cab_df, test_size=.2, random_state=42)
cab_df.head()
| | TimeMin | PickupCount |
|---|---|---|
| 0 | 860.0 | 33.0 |
| 1 | 17.0 | 75.0 |
| 2 | 486.0 | 13.0 |
| 3 | 300.0 | 5.0 |
| 4 | 385.0 | 10.0 |
cab_df.shape
(1250, 2)
# prepare the data: convert pickup time from minutes to hours, reshape for sklearn
X_train = train_data['TimeMin'].values.reshape(-1,1)/60
y_train = train_data['PickupCount'].values
X_test = test_data['TimeMin'].values.reshape(-1,1)/60
y_test = test_data['PickupCount'].values
def plot_cabs(cur_model, poly_transformer=None):
# build the x values for the prediction line
x_vals = np.arange(0,24,.1).reshape(-1,1)
# if needed, build the design matrix
if poly_transformer:
design_mat = poly_transformer.fit_transform(x_vals)
else:
design_mat = x_vals
# make the prediction at each x value
prediction = cur_model.predict(design_mat)
# plot the prediction line, and the test data
plt.plot(x_vals,prediction, color='k', label="Prediction")
plt.scatter(X_test, y_test, label="Test Data")
# label your plots
plt.ylabel("Number of Taxi Pickups")
plt.xlabel("Time of Day (Hours Past Midnight)")
plt.legend()
plt.show()
from sklearn.linear_model import LinearRegression
fitted_cab_model0 = LinearRegression().fit(X_train, y_train)
plot_cabs(fitted_cab_model0)
fitted_cab_model0.score(X_test, y_test)
0.240661535615741
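For context, `.score` on a regressor returns the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$. Here is a minimal sanity-check sketch (not part of the original lab) reproducing that number with the already-imported `r2_score` and with the formula directly:

# reproduce the .score value: R^2 = 1 - SS_res / SS_tot
y_pred = fitted_cab_model0.predict(X_test)
print(r2_score(y_test, y_pred))
print(1 - np.sum((y_test - y_pred)**2) / np.sum((y_test - np.mean(y_test))**2))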
We can see that there's still a lot of variation in cab pickups that's not captured by a linear fit. Worse, the linear fit predicts massively more pickups at 11:59pm than at 12:00am, even though the two times are adjacent. However, we can add columns to our design matrix for $\text{TimeMin}^2$, $\text{TimeMin}^3$, and so on, allowing a wigglier polynomial that will better fit the data.
We'll be using sklearn's `PolynomialFeatures` to take some of the tedium out of building the new design matrix. In fact, if all we want is a formula like $y \approx \beta_0 + \beta_1 x + \beta_2 x^2 + \dots$, it will directly return the new design matrix.
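To see concretely what that design matrix contains, here is a minimal hand-rolled sketch (an illustration, not the lab's code) built with plain numpy, matching the column order that `PolynomialFeatures(include_bias=False)` produces:

# hand-rolled cubic design matrix: columns are x, x^2, x^3
manual_design = np.concatenate([X_train, X_train**2, X_train**3], axis=1)
manual_design[:3]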
degree = 3
transformer_3 = PolynomialFeatures(degree, include_bias=False)
new_features = transformer_3.fit_transform(X_train)
new_features
array([[6.73333333e+00, 4.53377778e+01, 3.05274370e+02],
       [2.18333333e+00, 4.76694444e+00, 1.04078287e+01],
       [1.41666667e+00, 2.00694444e+00, 2.84317130e+00],
       ...,
       [1.96666667e+01, 3.86777778e+02, 7.60662963e+03],
       [1.17333333e+01, 1.37671111e+02, 1.61534104e+03],
       [1.42000000e+01, 2.01640000e+02, 2.86328800e+03]])
A few notes on `PolynomialFeatures`:

- The interface is a bit strange. `PolynomialFeatures` is a 'transformer' in sklearn. We'll be using several transformers that learn a transformation on the training data and then apply that transformation to future data. On these (more typical) transformers it makes sense to have a `.fit()` and a separate `.transform()`. With `PolynomialFeatures`, the `.fit()` is pretty trivial, and we often fit and transform in one command, as seen above.
- You rarely want `include_bias` (a column of all 1s), since sklearn will add the intercept automatically.
- If you want polynomial features for several different variables, you should call `.fit_transform()` separately on each column and append all the results to the design matrix (unless you also want interaction terms between the newly created features). See `np.concatenate` for joining arrays; a sketch follows this list.
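For instance, a minimal sketch (with made-up arrays `x1` and `x2`, not columns from the cab data) of expanding two variables separately and joining the results:

# hypothetical: separate degree-2 expansions for two variables, no cross terms
x1 = np.random.rand(5, 1)
x2 = np.random.rand(5, 1)
poly = PolynomialFeatures(degree=2, include_bias=False)
design = np.concatenate([poly.fit_transform(x1), poly.fit_transform(x2)], axis=1)
design.shape  # (5, 4): columns are x1, x1^2, x2, x2^2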
fitted_cab_model3 = LinearRegression().fit(new_features, y_train)
plot_cabs(fitted_cab_model3, transformer_3)
Questions:
- Calculate the polynomial model's $R^2$ performance on the test set.
- Does the polynomial model improve on the purely linear model?
- Make a residual plot for the polynomial model. What does this plot tell us about the model?
your answer here
- See code below
- Yes: the test set $R^2$ is higher, and the visual fit to both the training and test data is much better. It even looks like the predicted numbers of pickups at 11:59 pm and 12:00 am are nearly equal.
- See the code below. The residuals are much more evenly spread than with the linear model [not shown], but they still don't look like an even spread of Gaussian noise. This makes it unlikely that the standard OLS assumptions are valid, so we should be careful about trusting confidence intervals and the like, and we may want to search for other models entirely.
# your code here
# test r-squared (transformer_3 was already fit on the training data, so just transform here)
print("Test R-squared:", fitted_cab_model3.score(transformer_3.transform(X_test), y_test))
Test R-squared: 0.33412512570778774
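As an aside, the transform-then-fit steps can be bundled together with the already-imported `make_pipeline`; this sketch (not part of the original solution) assumes the same degree-3 setup and should reproduce the score above, while applying the transformation to the test set automatically:

# equivalent model as a pipeline: transform and fit in one object
poly_pipeline = make_pipeline(PolynomialFeatures(3, include_bias=False),
                              LinearRegression())
poly_pipeline.fit(X_train, y_train)
print("Pipeline test R-squared:", poly_pipeline.score(X_test, y_test))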
# your code here
design_mat = transformer_3.fit_transform(X_train)
prediction = fitted_cab_model3.predict(design_mat)
residual = y_train - prediction
plt.scatter(X_train, residual, label="Residual")
plt.axhline(0, color='k')
plt.title("Residuals for the Cubic Model")
plt.ylabel("Residual Number of Taxi Pickups")
plt.xlabel("Time of Day (Hours Past Midnight)")
plt.legend()
plt.show()
# your code here
design_mat = X_train
prediction = fitted_cab_model0.predict(design_mat)
residual = y_train - prediction
plt.scatter(X_train, residual, label="Residual")
plt.axhline(0, color='k')
plt.title("Residuals for the Linear Model")
plt.ylabel("Residual Number of Taxi Pickups")
plt.xlabel("Time of Day (Hours Past Midnight)")
plt.legend()
plt.show()
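As a follow-up to the "other models" point above, here is a quick sketch (not part of the original solution) that sweeps a few polynomial degrees and compares test-set $R^2$:

# compare test R^2 across polynomial degrees using a pipeline
for degree in range(1, 8):
    model = make_pipeline(PolynomialFeatures(degree, include_bias=False),
                          LinearRegression())
    model.fit(X_train, y_train)
    print(f"degree {degree}: test R-squared = {model.score(X_test, y_test):.3f}")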