Key Word(s): class repo, conda, FAS onDemand, Setup, statsmodels
CS109B Data Science 2: Advanced Topics in Data Science
Lecture 3 - Coding Environment
Setup and review of statsmodels
¶
Notebook B¶
Harvard University
Spring 2021
Instructors: Mark Glickman, Pavlos Protopapas, and Chris Tanner
Additional Instructor: Eleni Kaxiras
Content: Eleni Kaxiras and Will Claybaugh
## RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2019-CS109B/master/content/styles/cs109.css").text
HTML(styles)
import numpy as np
from scipy.interpolate import interp1d
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
Learning Goals¶
By the end of this lab, you should be able to:
- use np.vander
- use the (weird) R-style formulas in statsmodels
- practice least-squares regression in statsmodels
Basis Functions¶
In our models we can use various types of functions as basis functions. Strictly speaking, in linear algebra a basis for a subspace S of $\mathbb{R}^n$ is a set of vectors that spans S and is linearly independent. As a reminder, a set of vectors $\textbf{v}_1, \textbf{v}_2, ..., \textbf{v}_k$ is linearly independent if no vector in the set can be written as a linear combination of the others; equivalently, if $c_1\textbf{v}_1+c_2\textbf{v}_2+ ...+ c_k\textbf{v}_k = \textbf{0}$ then $c_1,c_2,...,c_k$ must all be zero.
In data science we have lots of imperfect data (with errors) as well as imperfect computers (with round-off errors), so when we substitute our values into the matrices we almost always get column degeneracy, meaning some of our columns become (nearly) linear combinations of each other. This happens especially often when we use the monomial basis and go beyond degree 5 or 6 of the polynomial.
Examples are:
- Monomials such as $x,x^2,x^4,x^5$
- Sigmoid/ReLU functions (neural networks)
- Fourier functions
- Wavelets
- Splines
The matrix produced when we substitute the values of our data into the basis functions is called the design matrix.
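As a small illustration (a sketch, not part of the original lab; the variable names here are made up), this is how evaluating a handful of basis functions at some toy feature values produces a design matrix, one column per basis function:
# Sketch: build a design matrix by evaluating a few basis functions at each data point.
x_toy = np.linspace(0, 10, 8)                     # toy feature values
basis_fns = [lambda t: np.ones_like(t),           # constant
             lambda t: t,                         # linear
             np.sin,                              # a Fourier-style function
             lambda t: np.maximum(t - 5, 0)]      # a ReLU-style function
X_design = np.column_stack([f(x_toy) for f in basis_fns])
X_design.shape, np.linalg.matrix_rank(X_design)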
Linear/Polynomial Regression¶
We will use the diabetes
dataset.
Variables are:
- subject: subject ID number
- age: age diagnosed with diabetes
- acidity: a measure of acidity called base deficit

Response:
- y: natural log of serum C-peptide concentration
Original source is Sockett et al. (1987) mentioned in Hastie and Tibshirani's book "Generalized Additive Models".
Reading the data in Pandas:
diab = pd.read_csv("data/diabetes.csv")
diab.head()
Create the design matrix for a fictitious dataset¶
Let's keep just the age feature and create some columns of our own. Before we build the real design matrix, let's see how well-behaved this matrix is.
diab_age = diab[['age']].copy()
diab_age['age2'] = diab_age['age']**2                       # square of age
diab_age['random'] = np.random.normal(0, 1, len(diab_age))  # pure noise column
diab_age['same'] = diab_age['age']                          # exact duplicate of age
diab_age.head()
A = diab_age.to_numpy(copy=True)
A[:5]
Let's check if the columns of A are linearly independent using some linear algebra methods from numpy.linalg and sympy.
from numpy.linalg import matrix_rank
matrix_rank(A)
# check out which rows are linearly independent
import sympy
_, inds = sympy.Matrix(A).T.rref()
inds
np.linalg.cond(A)
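As a quick sanity check (a sketch, not part of the original lab), dropping the duplicated same column should restore full column rank and give a much smaller condition number:
# Sketch: drop the duplicated column and re-check rank and condition number.
A_indep = diab_age.drop(columns=['same']).to_numpy(copy=True)
matrix_rank(A_indep), np.linalg.cond(A_indep)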
Create the design matrix for age using a polynomial basis¶
Let's keep just the age feature again and create the design matrix using a polynomial of degree n. First we will use the basic numpy function np.vander().
vand = np.vander(diab_age.age, 2, increasing=True)
vand[:3], vand.shape
## To illustrate our point about why the Vandermonde matrix is usually ill-conditioned,
## find the condition number of this matrix
np.linalg.cond(vand), matrix_rank(vand)
Change the degree of the polynomial and comment on what happens to the condition number and rank of the matrix.
vand = np.vander(diab_age.age, 8, increasing=True)
vand[:3], vand.shape
## To illustrate our point about why the Vandermonde matrix is usually ill-conditioned,
## find the condition number of this matrix
np.linalg.cond(vand), matrix_rank(vand)
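One way to see the trend (a sketch, not part of the original lab) is to loop over a few sizes and print the condition number and rank as the number of polynomial columns grows:
# Sketch: condition number and rank of the Vandermonde matrix vs. number of columns.
for n in range(2, 12, 2):
    V = np.vander(diab_age.age, n, increasing=True)
    print(f"columns={n:2d}  cond={np.linalg.cond(V):.3e}  rank={matrix_rank(V)}")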
Linear/Polynomial regression with statsmodels.¶
As you remember from 109a, we have two tools for linear regression:
- statsmodels: https://www.statsmodels.org/stable/regression.html
- sklearn: https://scikit-learn.org/stable/index.html
Previously in this notebook we worked from a vector of target values and a design matrix we built ourselves. In 109a we used e.g. sklearn's PolynomialFeatures to build the matrix. Now we will look at statsmodels, which lets users fit statistical models using R-style formulas; it builds the target vector and design matrix for you.
Note: Categorical features are wrapped in C() (e.g. if we had a categorical feature called Region, it would be entered as C(Region)); polynomial features (e.g. of age) are entered as np.power(age, n), where n is the degree of the polynomial, OR as np.vander(age, n, increasing=True).
# Example: the target variable is 'Lottery', 'Region' is a categorical predictor, and the others are numerical.
# Load the Guerry dataset (the source of these columns in the statsmodels docs):
import statsmodels.api as sm
dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=True)
df = dta.data[['Lottery', 'Literacy', 'Wealth', 'Region']]
formula = 'Lottery ~ Literacy + Wealth + C(Region) + Literacy * Wealth'
For more on these formulas, see the statsmodels documentation on fitting models using R-style formulas.
import statsmodels.formula.api as smf
model1 = smf.ols('y ~ age', data=diab)
fit1_lm = model1.fit()
Let's build a dataframe to predict values on (sometimes this is just the test or validation set). This is very useful for making pretty plots of the model predictions: predict for TONS of values, not just whatever's in the training set.
x_pred = np.linspace(0.5,20,100)
predict_df = pd.DataFrame(data={"age":x_pred})
predict_df.head()
Use get_prediction().summary_frame() to get the model's predictions (and error bars!).
prediction_output = fit1_lm.get_prediction(predict_df).summary_frame()
prediction_output.head()
Plot the data, the fitted model, the confidence intervals, and the prediction
intervals. For more on how statsmodels
calculates these intervals see: https://www.statsmodels.org/stable/_modules/statsmodels/regression/_prediction.html
ax1 = diab.plot.scatter(x='age',y='y',c='brown',title="Diabetes data with least-squares linear fit")
ax1.set_xlabel("Age at Diagnosis")
ax1.set_ylabel("Log C-Peptide Concentration")
ax1.plot(predict_df.age, prediction_output['mean'],color="green")
ax1.plot(predict_df.age, prediction_output['mean_ci_lower'], color="blue",linestyle="dashed")
ax1.plot(predict_df.age, prediction_output['mean_ci_upper'], color="blue",linestyle="dashed")
ax1.plot(predict_df.age, prediction_output['obs_ci_lower'], color="green",linestyle="dashdot")
ax1.plot(predict_df.age, prediction_output['obs_ci_upper'], color="green",linestyle="dashdot");
- Fit a 3rd degree polynomial model to predict y using only age.
- Plot the model and its confidence intervals.
- Change the degree of your polynomial and see what happens to the fitted curve.
- Does our model have an intercept? Note: we can check whether the model has an intercept by inspecting model_name.params.
# your answer here
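One possible approach (a sketch, not the official solution; formula3 and fit_cubic are names made up here) is to fit the cubic with np.power terms and inspect .params: an Intercept entry means statsmodels added an intercept, which it does by default unless you append -1 to the formula.
# Sketch of one possible answer: a cubic fit via np.power terms in the formula.
formula3 = "y ~ age + np.power(age, 2) + np.power(age, 3)"
fit_cubic = smf.ols(formula=formula3, data=diab).fit()
fit_cubic.params   # an 'Intercept' entry means the model has an intercept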
# solution
It's easier to build higher-order polynomials using np.vander().
formula = "y ~ np.vander(age, 6, increasing=True) -1"
fit3_lm = smf.ols(formula=formula, data=diab).fit()
fit3_lm.params
## To illustrate our point about why the Vandermonde matrix is usually ill-conditioned,
## find the condition number of this matrix
np.linalg.cond(np.vander(predict_df.age, 6, increasing=True))
# solution
poly_predictions = fit3_lm.get_prediction(predict_df).summary_frame()
poly_predictions.head()
# solution
x_pred = np.linspace(0.5,15,100)
predict_df = pd.DataFrame(data={"age":x_pred})
ax2 = diab.plot.scatter(x='age',y='y',c='Red',title="Diabetes data with least-squares polynomial fit")
ax2.set_xlabel("Age at Diagnosis")
ax2.set_ylabel("Log C-Peptide Concentration")
ax2.plot(predict_df.age, poly_predictions['mean'],color="green")
ax2.plot(predict_df.age, poly_predictions['mean_ci_lower'], color="blue",linestyle="dashed", label='confidence interval')
ax2.plot(predict_df.age, poly_predictions['mean_ci_upper'], color="blue",linestyle="dashed");
ax2.legend();
QR decomposition (Beyond the scope of this class)¶
As you know, to find the parameters of our model, we may try to solve the so-called normal
equations, which, written in matrix form, are:
\begin{equation}
(\boldsymbol{A}^T\cdot \boldsymbol{A}) \cdot \boldsymbol{b} = \boldsymbol{A}^T \cdot \boldsymbol{y}
\end{equation}
The direct solution is $\hat{\boldsymbol{b}}=(\boldsymbol{A}^T\cdot \boldsymbol{A})^{-1}\cdot \boldsymbol{A}^T \cdot \boldsymbol{y}$
Solving the least-squares problem directly via the normal equations is susceptible to roundoff error when the condition number of the matrix $\boldsymbol{A}$ is large. An alternative technique involves QR decomposition (details in any good linear algebra book). statsmodels lets you use this technique via a parameter to .fit():
.fit(method='qr')
Let's try with QR now¶
formula = "y ~ np.vander(age, 6, increasing=True) -1"
fit3_lm = smf.ols(formula=formula, data=diab).fit(method='qr')
fit3_lm.params
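For intuition (a sketch, not part of the original lab; the names A_design and b_qr are made up here), the same least-squares solution can be computed by hand with numpy's QR factorization: with $\boldsymbol{A} = \boldsymbol{Q}\boldsymbol{R}$ and $\boldsymbol{Q}$ having orthonormal columns, we solve the triangular system $\boldsymbol{R}\boldsymbol{b} = \boldsymbol{Q}^T\boldsymbol{y}$ instead of forming $\boldsymbol{A}^T\boldsymbol{A}$. The coefficients should match fit3_lm.params up to numerical noise.
# Sketch: least squares via QR instead of the normal equations.
A_design = np.vander(diab.age, 6, increasing=True)
Q, R = np.linalg.qr(A_design)              # A = Q R, Q has orthonormal columns
b_qr = np.linalg.solve(R, Q.T @ diab.y)    # solve R b = Q^T y (R is upper triangular)
b_qr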
# solution
poly_predictions = fit3_lm.get_prediction(predict_df).summary_frame()
poly_predictions.head()
# solution
ax2 = diab.plot.scatter(x='age',y='y',c='Red',title="Diabetes data with least-squares polynomial fit (QR)")
ax2.set_xlabel("Age at Diagnosis")
ax2.set_ylabel("Log C-Peptide Concentration")
ax2.plot(predict_df.age, poly_predictions['mean'],color="green")
ax2.plot(predict_df.age, poly_predictions['mean_ci_lower'], color="blue",linestyle="dashed", label='confidence interval')
ax2.plot(predict_df.age, poly_predictions['mean_ci_upper'], color="blue",linestyle="dashed");
ax2.legend();