CS109B Data Science 2: Advanced Topics in Data Science

Lecture 3 - Coding Environment Setup and review of statsmodels

Notebook B

Harvard University
Spring 2021
Instructors: Mark Glickman, Pavlos Protopapas, and Chris Tanner
Additional Instructor: Eleni Kaxiras

Content: Eleni Kaxiras and Will Claybaugh


In [5]:
## RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2019-CS109B/master/content/styles/cs109.css").text
HTML(styles)
In [9]:
import numpy as np
from scipy.interpolate import interp1d
import matplotlib.pyplot as plt
import pandas as pd

%matplotlib inline 

Learning Goals

By the end of this lab, you should be able to:

  • use np.vander
  • use the weird R-style formulas in statsmodels
  • practice least-squares regression in statsmodels

Basis Functions

In our models we can use various types of functions as basis functions. Strictly speaking, in linear algebra a basis for a subspace S of $\mathbb{R}^n$ is a set of vectors that spans S and is linearly independent. As a reminder, a set of vectors $\textbf{v}_1, \textbf{v}_2, ..., \textbf{v}_k$ is linearly independent if none of them can be written as a linear combination of the others; equivalently, if $c_1\textbf{v}_1+c_2\textbf{v}_2+ ...+ c_k\textbf{v}_k = \textbf{0}$ implies that $c_1,c_2,...,c_k$ are all zero.

In data science we have lots of imperfect data (with errors) as well as imperfect computers (with round-off errors), so when we substitute our values into these matrices we almost always get column degeneracy, meaning some of the columns become (nearly) linear combinations of the others. This is especially true if we use the monomial basis and go beyond a polynomial degree of about 5 or 6.

Examples of basis functions are:

  • Monomials such as $x,x^2,x^4,x^5$
  • Sigmoid/ReLU functions (neural networks)
  • Fourier functions
  • Wavelets
  • Splines

The matrix produced when we substitute the values of our data into the basis functions is called the design matrix.
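
To make this concrete, here is a minimal sketch (with made-up x values, not the diabetes data used below) of evaluating a small set of basis functions at each data point to form a design matrix, and of using numpy to check how well-behaved its columns are:

In [ ]:
import numpy as np

# a small, hypothetical sample of x values
x = np.linspace(0, 2, 10)

# evaluate three basis functions at each x: 1, x, and sin(x)
basis_fns = [lambda x: np.ones_like(x), lambda x: x, np.sin]
X = np.column_stack([f(x) for f in basis_fns])   # the design matrix, shape (10, 3)

# rank 3 means the columns are linearly independent;
# the condition number tells us how close they are to being dependent
np.linalg.matrix_rank(X), np.linalg.cond(X)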

Linear/Polynomial Regression

We will use the diabetes dataset.

Variables are:

  • subject: subject ID number
  • age: age diagnosed with diabetes
  • acidity: a measure of acidity called base deficit

Response:

  • y: natural log of serum C-peptide concentration

Original source is Sockett et al. (1987) mentioned in Hastie and Tibshirani's book "Generalized Additive Models".

Reading the data in Pandas:

In [10]:
diab = pd.read_csv("data/diabetes.csv")
diab.head()

Create the design matrix for a fictitious dataset

Let's keep just the age feature and create some columns of our own. Let's see how well-behaved this matrix is before we create the actual design matrix.

In [11]:
diab_age = diab[['age']].copy()
diab_age['age2'] = diab_age.apply(lambda row: row['age']**2, axis=1)
diab_age['random'] = np.random.normal(0,1,len(diab_age))
diab_age['same'] = diab_age['age']
diab_age.head()
In [12]:
A = diab_age.to_numpy(copy=True)
A[:5]

Let's check if the columns of A are linearly independent by using some linear algebra methods from numpy.linalg and sympy.

In [161]:
from numpy.linalg import matrix_rank
matrix_rank(A)
In [162]:
# find which columns of A are linearly independent:
# rref() returns the indices of the pivot (linearly independent) columns
import sympy
_, inds = sympy.Matrix(A).rref()
inds
In [163]:
np.linalg.cond(A)

Create the design matrix for age using a polynomial basis

Let's keep just the age feature again and create the design matrix using a polynomial of degree n. First we will use the basic numpy function np.vander().

In [185]:
vand = np.vander(diab_age.age, 2, increasing=True)
vand[:3], vand.shape
In [186]:
## To illustrate why the Vandermonde matrix is usually ill-conditioned,
## find the condition number and rank of this matrix
np.linalg.cond(vand), matrix_rank(vand)

Exercise: Vandermonde matrix

Change the degree of the polynomial and comment on what happens to the condition and rank of the matrix.

In [187]:
vand = np.vander(diab_age.age, 8, increasing=True)
vand[:3], vand.shape
In [188]:
## To illustrate why the Vandermonde matrix is usually ill-conditioned,
## find the condition number and rank of this matrix
np.linalg.cond(vand), matrix_rank(vand)

Linear/Polynomial regression with statsmodels

As you remember from 109a, we have two tools for linear regression: sklearn and statsmodels.

Previously in this notebook we worked from a vector of target values and a design matrix we built ourselves. In 109a we used, e.g., sklearn's PolynomialFeatures to build that matrix. Now we will look at statsmodels, which lets users fit statistical models using R-style formulas; it builds the target vector and the design matrix for you.

Note: categorical features (say we had a categorical feature called Region) are designated by C(Region); polynomial terms of a numerical feature such as age are entered as np.power(age, n), where n is the degree of the polynomial, OR as np.vander(age, n, increasing=True).

# Example: if our target variable is 'Lottery', while 'Region' is a categorical predictor and all the others are numerical:
df = dta.data[['Lottery', 'Literacy', 'Wealth', 'Region']]

formula='Lottery ~ Literacy + Wealth + C(Region) + Literacy * Wealth'
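
As a concrete end-to-end sketch of this example (hedged: the load below assumes dta refers to the Guerry dataset used in the statsmodels documentation), we could fetch the data with statsmodels' get_rdataset and fit the formula with smf.ols:

In [ ]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

# assumed source of `dta` above: the Guerry dataset from the R HistData package
dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=True)
df = dta.data[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()

# C(Region) treats Region as categorical; Literacy * Wealth also adds the interaction term
mod = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region) + Literacy * Wealth', data=df)
res = mod.fit()
res.params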

For more on these formulas, see the statsmodels documentation on formulas and the patsy library that statsmodels uses to parse them.

In [191]:
import statsmodels.formula.api as smf

model1 = smf.ols('y ~ age', data=diab)
fit1_lm = model1.fit()

Let's build a dataframe to predict values on (sometimes this is just the test or validation set). Very useful for making pretty plots of the model predictions - predict for TONS of values, not just whatever's in the training set.

In [194]:
x_pred = np.linspace(0.5,20,100)

predict_df = pd.DataFrame(data={"age":x_pred})
predict_df.head()

Use .get_prediction().summary_frame() to get the model's prediction (and error bars!)

In [195]:
prediction_output = fit1_lm.get_prediction(predict_df).summary_frame()
prediction_output.head()

Plot the data, the fitted model, the confidence intervals, and the prediction intervals. For more on how statsmodels calculates these intervals see: https://www.statsmodels.org/stable/_modules/statsmodels/regression/_prediction.html

In [196]:
ax1 = diab.plot.scatter(x='age',y='y',c='brown',title="Diabetes data with least-squares linear fit")
ax1.set_xlabel("Age at Diagnosis")
ax1.set_ylabel("Log C-Peptide Concentration")

ax1.plot(predict_df.age, prediction_output['mean'],color="green")
ax1.plot(predict_df.age, prediction_output['mean_ci_lower'], color="blue",linestyle="dashed")
ax1.plot(predict_df.age, prediction_output['mean_ci_upper'], color="blue",linestyle="dashed")
ax1.plot(predict_df.age, prediction_output['obs_ci_lower'], color="green",linestyle="dashdot")
ax1.plot(predict_df.age, prediction_output['obs_ci_upper'], color="green",linestyle="dashdot");
Breakout Room Exercise
  • Fit a 3rd degree polynomial model to predict y using only age.
  • Plot the model and its confidence intervals.
  • Change the degree of your polynomial and see what happens to the fitted curve.
  • Does our model have an intercept? Note: we can check whether or not our model has an intercept by running:
    model_name.params
In [67]:
# your answer here
In [201]:
# solution 
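# One possible approach (a sketch, not necessarily the official solution):
# fit a cubic polynomial in age using np.power terms, as in the note above.
fit_cubic = smf.ols('y ~ age + np.power(age, 2) + np.power(age, 3)', data=diab).fit()

# the formula API adds an intercept by default, so params should list
# 'Intercept' alongside the three polynomial terms
# (plotting follows the same get_prediction pattern used in the cells below)
fit_cubic.params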
Vandermonde matrix in formulas

It's easier to build higher-order polynomials using np.vander().

In [204]:
formula = "y ~ np.vander(age, 6, increasing=True) -1"
fit3_lm = smf.ols(formula=formula, data=diab).fit()
In [205]:
fit3_lm.params
In [207]:
## To illustrate why the Vandermonde matrix is usually ill-conditioned,
## find the condition number of this matrix
np.linalg.cond(np.vander(predict_df.age, 6, increasing=True))
In [208]:
# solution 
poly_predictions = fit3_lm.get_prediction(predict_df).summary_frame()
poly_predictions.head()
In [209]:
# solution
x_pred = np.linspace(0.5,15,100)
predict_df = pd.DataFrame(data={"age":x_pred})
# recompute the predictions on the new age grid so x-values and predictions stay aligned
poly_predictions = fit3_lm.get_prediction(predict_df).summary_frame()

ax2 = diab.plot.scatter(x='age',y='y',c='Red',title="Diabetes data with least-squares cubic fit")
ax2.set_xlabel("Age at Diagnosis")
ax2.set_ylabel("Log C-Peptide Concentration")

ax2.plot(predict_df.age, poly_predictions['mean'],color="green")
ax2.plot(predict_df.age, poly_predictions['mean_ci_lower'], color="blue",linestyle="dashed", label='confidence interval')
ax2.plot(predict_df.age, poly_predictions['mean_ci_upper'], color="blue",linestyle="dashed");
ax2.legend();
Discussion

QR decomposition (Beyond the scope of this class)

As you know, to find the parameters of our model, we may try to solve the so-called normal equations, which, written in matrix form, are:
\begin{equation} (\boldsymbol{A}^T\cdot \boldsymbol{A}) \cdot \boldsymbol{b} = \boldsymbol{A}^T \cdot \boldsymbol{y} \end{equation}

The direct solution is $\hat{\boldsymbol{b}}=(\boldsymbol{A}^T\cdot \boldsymbol{A})^{-1}\cdot \boldsymbol{A}^T \cdot \boldsymbol{y}$

Solving the least-squares problem directly via the normal equations is susceptible to round-off error when the condition number of the matrix $\boldsymbol{A}$ is large. An alternative technique involves QR decomposition (details in any good linear algebra book). statsmodels lets you use this technique via a parameter in .fit():

   .fit(method='qr')
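
For intuition, here is a minimal sketch of solving the same least-squares problem by hand with a QR decomposition (an illustration of the idea, not a claim about statsmodels' exact internals): factor the design matrix as $A = QR$ and back-substitute $R\,\boldsymbol{b} = Q^T \boldsymbol{y}$, which avoids forming $\boldsymbol{A}^T\boldsymbol{A}$ and squaring the condition number.

In [ ]:
from scipy.linalg import solve_triangular

# same degree-5 polynomial basis in age as the formula above
A = np.vander(diab.age, 6, increasing=True)
y = diab.y.values

# reduced QR factorization: A = Q @ R, with R upper triangular
Q, R = np.linalg.qr(A)

# solve R b = Q^T y by back-substitution
b_qr = solve_triangular(R, Q.T @ y)
b_qr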

Let's try with QR now

In [210]:
formula = "y ~ np.vander(age, 6, increasing=True) -1"
fit3_lm = smf.ols(formula=formula, data=diab).fit(method='qr')
In [211]:
fit3_lm.params
In [212]:
# solution 
poly_predictions = fit3_lm.get_prediction(predict_df).summary_frame()
poly_predictions.head()
In [213]:
# solution
ax2 = diab.plot.scatter(x='age',y='y',c='Red',title="Diabetes data with least-squares cubic fit")
ax2.set_xlabel("Age at Diagnosis")
ax2.set_ylabel("Log C-Peptide Concentration")

ax2.plot(predict_df.age, poly_predictions['mean'],color="green")
ax2.plot(predict_df.age, poly_predictions['mean_ci_lower'], color="blue",linestyle="dashed", label='confidence interval')
ax2.plot(predict_df.age, poly_predictions['mean_ci_upper'], color="blue",linestyle="dashed");
ax2.legend();