CS109B Data Science 2: Advanced Topics in Data Science

Lecture 3 - Coding Environment Setup and review of statsmodels

Notebook B

Harvard University
Spring 2021
Instructors: Mark Glickman, Pavlos Protopapas, and Chris Tanner
Additional Instructor: Eleni Kaxiras

Content: Eleni Kaxiras and Will Claybaugh


In [5]:
## RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2019-CS109B/master/content/styles/cs109.css").text
HTML(styles)
In [9]:
import numpy as np
from scipy.interpolate import interp1d
import matplotlib.pyplot as plt
import pandas as pd

%matplotlib inline 

Learning Goals

By the end of this lab, you should be able to:

  • use np.vander
  • use the weird R-style formulas in statsmodels
  • practice least-squares regression in statsmodels

Basis Functions

In our models we can use various types of functions as basis functions. Strictly speaking, in linear algebra a basis for a subspace S of $\mathbb{R}^n$ is a set of vectors that spans S and is linearly independent. As a reminder, a set of vectors $\textbf{v}_1, \textbf{v}_2, ..., \textbf{v}_k$ is linearly independent if none of them can be written as a linear combination of the others; equivalently, if $c_1\textbf{v}_1+c_2\textbf{v}_2+ ...+ c_k\textbf{v}_k = \textbf{0}$ implies that $c_1,c_2,...,c_k$ are all zero.

In data science we have lots of imperfect data (with errors) as well as imperfect computers (with round-off errors), so when we substitute our values into these matrices we almost always get column degeneracy, meaning some of the columns become (nearly) linear combinations of the others. This is especially true if we use the monomial basis and go beyond a polynomial degree of about 5 or 6.

Examples of basis functions are:

  • Monomials such as $x,x^2,x^4,x^5$
  • Sigmoid/ReLU functions (neural networks)
  • Fourier functions
  • Wavelets
  • Splines

The matrix produced when we substitute the values of our data into the basis functions is called the design matrix.
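
To make this concrete, here is a minimal sketch (with made-up x values, not the diabetes data used below) of evaluating a small set of basis functions at each data point to form a design matrix, and of using numpy to check how well-behaved its columns are:

In [ ]:
import numpy as np

# a small, hypothetical sample of x values
x = np.linspace(0, 2, 10)

# evaluate three basis functions at each x: 1, x, and sin(x)
basis_fns = [lambda x: np.ones_like(x), lambda x: x, np.sin]
X = np.column_stack([f(x) for f in basis_fns])   # the design matrix, shape (10, 3)

# rank 3 means the columns are linearly independent;
# the condition number tells us how close they are to being dependent
np.linalg.matrix_rank(X), np.linalg.cond(X)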

Linear/Polynomial Regression

We will use the diabetes dataset.

Variables are:

  • subject: subject ID number
  • age: age diagnosed with diabetes
  • acidity: a measure of acidity called base deficit

Response:

  • y: natural log of serum C-peptide concentration

Original source is Sockett et al. (1987) mentioned in Hastie and Tibshirani's book "Generalized Additive Models".

Reading the data in Pandas:

In [10]:
diab = pd.read_csv("data/diabetes.csv")
diab.head()

Create the design matrix for a fictitious dataset

Let's keep just the age feature and create some columns of our own. Let's see how well-behaved this matrix is before we create the actual design matrix.

In [11]:
diab_age = diab[['age']].copy()
diab_age['age2'] = diab_age.apply(lambda row: row['age']**2, axis=1)
diab_age['random'] = np.random.normal(0,1,len(diab_age))
diab_age['same'] = diab_age['age']
diab_age.head()
In [12]:
A = diab_age.to_numpy(copy=True)
A[:5]

Let's check if the columns of A are linearly independent by using some linear algebra methods from numpy.linalg and sympy.

In [161]:
from numpy.linalg import matrix_rank
matrix_rank(A)
In [162]:
# find which columns of A are linearly independent:
# rref() returns the indices of the pivot (linearly independent) columns
import sympy
_, inds = sympy.Matrix(A).rref()
inds
In [163]:
np.linalg.cond(A)

Create the design matrix for age using a polynomial basis

Let's keep just the age feature again and create the design matrix using a polynomial of degree n. First we will use the basic numpy function np.vander().

In [185]:
vand = np.vander(diab_age.age, 2, increasing=True)
vand[:3], vand.shape
In [186]:
## To illustrate why the Vandermonde matrix is usually ill-conditioned,
## find the condition number and rank of this matrix
np.linalg.cond(vand), matrix_rank(vand)

Exercise: Vandermonde matrix

Change the degree of the polynomial and comment on what happens to the condition and rank of the matrix.

In [187]:
vand = np.vander(diab_age.age, 8, increasing=True)
vand[:3], vand.shape
In [188]:
## To illustrate why the Vandermonde matrix is usually ill-conditioned,
## find the condition number and rank of this matrix
np.linalg.cond(vand), matrix_rank(vand)

Linear/Polynomial regression with statsmodels

As you remember from 109a, we have two tools for linear regression: sklearn and statsmodels.

Previously in this notebook we worked from a vector of target values and a design matrix we built ourselves. In 109a we used, e.g., sklearn's PolynomialFeatures to build that matrix. Now we will look at statsmodels, which lets users fit statistical models using R-style formulas; it builds the target vector and the design matrix for you.

Note: categorical features (say we had a categorical feature called Region) are designated by C(Region); polynomial terms of a numerical feature such as age are entered as np.power(age, n), where n is the degree of the polynomial, OR as np.vander(age, n, increasing=True).

# Example: if our target variable is 'Lottery', while 'Region' is a categorical predictor and all the others are numerical:
df = dta.data[['Lottery', 'Literacy', 'Wealth', 'Region']]

formula='Lottery ~ Literacy + Wealth + C(Region) + Literacy * Wealth'
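
As a concrete end-to-end sketch of this example (hedged: the load below assumes dta refers to the Guerry dataset used in the statsmodels documentation), we could fetch the data with statsmodels' get_rdataset and fit the formula with smf.ols:

In [ ]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

# assumed source of `dta` above: the Guerry dataset from the R HistData package
dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=True)
df = dta.data[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()

# C(Region) treats Region as categorical; Literacy * Wealth also adds the interaction term
mod = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region) + Literacy * Wealth', data=df)
res = mod.fit()
res.params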

For more on these formulas, see the statsmodels documentation on formulas and the patsy library that statsmodels uses to parse them.

In [191]:
import statsmodels.formula.api as smf

model1 = smf.ols('y ~ age', data=diab)
fit1_lm = model1.fit()

Let's build a dataframe to predict values on (sometimes this is just the test or validation set). Very useful for making pretty plots of the model predictions - predict for TONS of values, not just whatever's in the training set.

In [194]:
x_pred = np.linspace(0.5,20,100)

predict_df = pd.DataFrame(data={"age":x_pred})
predict_df.head()

Use .get_prediction().summary_frame() to get the model's prediction (and error bars!)

In [195]:
prediction_output = fit1_lm.get_prediction(predict_df).summary_frame()
prediction_output.head()

Plot the data, the fitted model, the confidence intervals, and the prediction intervals. For more on how statsmodels calculates these intervals see: https://www.statsmodels.org/stable/_modules/statsmodels/regression/_prediction.html

In [196]:
ax1 = diab.plot.scatter(x='age',y='y',c='brown',title="Diabetes data with least-squares linear fit")
ax1.set_xlabel("Age at Diagnosis")
ax1.set_ylabel("Log C-Peptide Concentration")

ax1.plot(predict_df.age, prediction_output['mean'],color="green")
ax1.plot(predict_df.age, prediction_output['mean_ci_lower'], color="blue",linestyle="dashed")
ax1.plot(predict_df.age, prediction_output['mean_ci_upper'], color="blue",linestyle="dashed")
ax1.plot(predict_df.age, prediction_output['obs_ci_lower'], color="green",linestyle="dashdot")
ax1.plot(predict_df.age, prediction_output['obs_ci_upper'], color="green",linestyle="dashdot");
Breakout Room Exercise
  • Fit a 3rd degree polynomial model to predict y using only age.
  • Plot the model and its confidence intervals.
  • Change the degree of your polynomial and see what happens to the fitted curve.
  • Does our model have an intercept? Note: we can check whether or not our model has an intercept by running:
    model_name.params
In [67]:
# your answer here
In [201]:
# solution 
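# One possible approach (a sketch, not necessarily the official solution):
# fit a cubic polynomial in age using np.power terms, as in the note above.
fit_cubic = smf.ols('y ~ age + np.power(age, 2) + np.power(age, 3)', data=diab).fit()

# the formula API adds an intercept by default, so params should list
# 'Intercept' alongside the three polynomial terms
# (plotting follows the same get_prediction pattern used in the cells below)
fit_cubic.params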
Vandermonde matrix in formulas

It's easier to build higher-order polynomials using np.vander().

In [204]:
formula = "y ~ np.vander(age, 6, increasing=True) -1"
fit3_lm = smf.ols(formula=formula, data=diab).fit()
In [205]:
fit3_lm.params
In [207]:
## To illustrate why the Vandermonde matrix is usually ill-conditioned,
## find the condition number of this matrix
np.linalg.cond(np.vander(predict_df.age, 6, increasing=True))
In [208]:
# solution 
poly_predictions = fit3_lm.get_prediction(predict_df).summary_frame()
poly_predictions.head()
In [209]:
# solution
x_pred = np.linspace(0.5,15,100)
predict_df = pd.DataFrame(data={"age":x_pred})
# recompute the predictions on the new age grid so x-values and predictions stay aligned
poly_predictions = fit3_lm.get_prediction(predict_df).summary_frame()

ax2 = diab.plot.scatter(x='age',y='y',c='Red',title="Diabetes data with least-squares cubic fit")
ax2.set_xlabel("Age at Diagnosis")
ax2.set_ylabel("Log C-Peptide Concentration")

ax2.plot(predict_df.age, poly_predictions['mean'],color="green")
ax2.plot(predict_df.age, poly_predictions['mean_ci_lower'], color="blue",linestyle="dashed", label='confidence interval')
ax2.plot(predict_df.age, poly_predictions['mean_ci_upper'], color="blue",linestyle="dashed");
ax2.legend();
Discussion

QR decomposition (Beyond the scope of this class)

As you know, to find the parameters of our model, we may try to solve the so-called normal equations, which, written in matrix form, are:
\begin{equation} (\boldsymbol{A}^T\cdot \boldsymbol{A}) \cdot \boldsymbol{b} = \boldsymbol{A}^T \cdot \boldsymbol{y} \end{equation}

The direct solution is $\hat{\boldsymbol{b}}=(\boldsymbol{A}^T\cdot \boldsymbol{A})^{-1}\cdot \boldsymbol{A}^T \cdot \boldsymbol{y}$

Solving the least-squares problem directly via the normal equations is susceptible to round-off error when the condition number of the matrix $\boldsymbol{A}$ is large. An alternative technique involves QR decomposition (details in any good linear algebra book). statsmodels lets you use this technique via a parameter in .fit():

   .fit(method='qr')
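
For intuition, here is a minimal sketch of solving the same least-squares problem by hand with a QR decomposition (an illustration of the idea, not a claim about statsmodels' exact internals): factor the design matrix as $A = QR$ and back-substitute $R\,\boldsymbol{b} = Q^T \boldsymbol{y}$, which avoids forming $\boldsymbol{A}^T\boldsymbol{A}$ and squaring the condition number.

In [ ]:
from scipy.linalg import solve_triangular

# same degree-5 polynomial basis in age as the formula above
A = np.vander(diab.age, 6, increasing=True)
y = diab.y.values

# reduced QR factorization: A = Q @ R, with R upper triangular
Q, R = np.linalg.qr(A)

# solve R b = Q^T y by back-substitution
b_qr = solve_triangular(R, Q.T @ y)
b_qr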

Let's try with QR now

In [210]:
formula = "y ~ np.vander(age, 6, increasing=True) -1"
fit3_lm = smf.ols(formula=formula, data=diab).fit(method='qr')
In [211]:
fit3_lm.params
In [212]:
# solution 
poly_predictions = fit3_lm.get_prediction(predict_df).summary_frame()
poly_predictions.head()
In [213]:
# solution
ax2 = diab.plot.scatter(x='age',y='y',c='Red',title="Diabetes data with least-squares cubic fit")
ax2.set_xlabel("Age at Diagnosis")
ax2.set_ylabel("Log C-Peptide Concentration")

ax2.plot(predict_df.age, poly_predictions['mean'],color="green")
ax2.plot(predict_df.age, poly_predictions['mean_ci_lower'], color="blue",linestyle="dashed", label='confidence interval')
ax2.plot(predict_df.age, poly_predictions['mean_ci_upper'], color="blue",linestyle="dashed");
ax2.legend();