Key Word(s): Smoothers, GAMs

# CS109B Data Science 2: Advanced Topics in Data Science

## Lab 2 - Smoothers and Generalized Additive Models - Model Fitting¶

Spring 2020

Harvard University
Spring 2020
Instructors: Mark Glickman, Pavlos Protopapas, and Chris Tanner
Lab Instructors: Chris Tanner and Eleni Kaxiras
Content: Eleni Kaxiras and Will Claybaugh

In [1]:
## RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2019-CS109B/master/content/styles/cs109.css").text
HTML(styles)

Out[1]:
In [2]:
import numpy as np
from scipy.interpolate import interp1d
import matplotlib.pyplot as plt
import pandas as pd

%matplotlib inline


## Learning Goals¶

By the end of this lab, you should be able to:

• Understand how to implement GAMs with the Python package pyGAM
• Learn about the practical aspects of Splines and how to use them.

This lab corresponds to lectures 1, 2, and 3 and maps to homework 1.

• 1 - Overview - A Top View of LMs, GLMs, and GAMs to set the stage
• 2 - A review of Linear Regression with statsmodels. What are those weird formulas?
• 3 - Splines
• 4 - Generative Additive Models with pyGAM
• 5 - Smooting Splines using pyGAM

## Overview¶

Linear Models (LM), Generalized Linear Models (GLMs), Generalized Additive Models (GAMs), Splines, Natural Splines, Smoothing Splines! So many definitions. Let's try and work through an example for each of them so we can better understand them.

image source: Dani Servén Marín (one of the developers of pyGAM)

### A - Linear Models¶

First we have the Linear Models which you know from 109a. These models are linear in the coefficients. Very interpretable but suffer from high bias because let's face it, few relationships in life are linear. Simple Linear Regression (defined as a model with one predictor) as well as Multiple Linear Regression (more than one predictors) are examples of LMs. Polynomial Regression extends the linear model by adding terms that are still linear for the coefficients but non-linear when it somes to the predictiors which are now raised in a power or multiplied between them.

\begin{aligned} y = \beta{_0} + \beta{_1}{x_1} & \mbox{(simple linear regression)}\\ y = \beta{_0} + \beta{_1}{x_1} + \beta{_2}{x_2} + \beta{_3}{x_3} & \mbox{(multiple linear regression)}\\ y = \beta{_0} + \beta{_1}{x_1} + \beta{_2}{x_1^2} + \beta{_3}{x_3^3} & \mbox{(polynomial regression)}\\ \end{aligned}
Discussion
• What does it mean for a model to be interpretable?
• Are linear regression models interpretable? Are random forests? What about Neural Networks such as FFNs and CNNs?
• Do we always want interpretability? Describe cases where we do and cases where we do not care.

### B - Generalized Linear Models (GLMs)¶

\begin{aligned} y = \beta{_0} + \beta{_1}{x_1} + \beta{_2}{x_2} + \beta{_3}{x_3} \end{aligned}

Generalized Linear Models is a term coined in the early 1970s by Nelder and Wedderburn for a class of models that includes both Linear Regression and Logistic Regression. A GLM fits one coefficient per feature (predictor).

### C - Generalized Additive Models (GAMs)¶

Hastie and Tidshirani coined the term Generalized Additive Models in 1986 for a class of non-linear extensions to Generalized Linear Models.

\begin{aligned} y = \beta{_0} + f_1\left(x_1\right) + f_2\left(x_2\right) + f_3\left(x_3\right) \\ y = \beta{_0} + f_1\left(x_1\right) + f_2\left(x_2, x_3\right) + f_3\left(x_3\right) & \mbox{(with interaction terms)} \end{aligned}

In practice we add splines and regularization via smoothing penalties to our GLMs. Decision Trees also fit in this category.

image source: Dani Servén Marín

### D - Basis Functions¶

In our models we can use various types of functions as "basis".

• Monomials such as $x^2$, $x^4$ (Polynomial Regression)
• Sigmoid functions (neural networks)
• Fourier functions
• Wavelets
• Regression splines which we will look at shortly.
Discussion
• Where does polynomial regression fit in all this?

Answer: GLMs include Polynomial Regression so the graphic above should really include curved lines, not just straight...

## Implementation¶

### 1 - Linear/Polynomial Regression¶

We will use the diabetes dataset.

Variables are:

• subject: subject ID number
• age: age diagnosed with diabetes
• acidity: a measure of acidity called base deficit Response:
• y: natural log of serum C-peptide concentration

Original source is Sockett et al. (1987) mentioned in Hastie and Tibshirani's book "Generalized Additive Models".

Reading data and (some) exploring in Pandas:

In [3]:
diab = pd.read_csv("../data/diabetes.csv")

Out[3]:
subject age acidity y
0 1 5.2 -8.1 4.8
1 2 8.8 -16.1 4.1
2 3 10.5 -0.9 5.2
3 4 10.6 -7.8 5.5
4 5 10.4 -29.0 5.0
In [4]:
diab.dtypes

Out[4]:
subject      int64
age        float64
acidity    float64
y          float64
dtype: object
In [5]:
diab.describe()

Out[5]:
subject age acidity y
count 43.000000 43.000000 43.000000 43.000000
mean 22.000000 9.032558 -8.148837 4.746512
std 12.556539 4.022539 7.123080 0.720565
min 1.000000 0.900000 -29.000000 3.000000
25% 11.500000 5.500000 -12.700000 4.450000
50% 22.000000 10.400000 -7.800000 4.900000
75% 32.500000 11.850000 -2.000000 5.100000
max 43.000000 15.600000 -0.200000 6.600000

Plotting with matplotlib:

In [6]:
ax0 = diab.plot.scatter(x='age',y='y',c='Red',title="Diabetes data") #plotting direclty from pandas!
ax0.set_xlabel("Age at Diagnosis")
ax0.set_ylabel("Log C-Peptide Concentration");


### Linear/Polynomial regression with statsmodels.¶

As you remember from 109a, we have two tools for Linear Regression:

Previously, we worked from a vector of target values and a design matrix we built ourself (e.g. using sklearn's PolynomialFeatures). statsmodels allows users to fit statistical models using R-style formulas. They build the target value and design matrix for you.

# our target variable is 'Lottery', while 'Region' is a categorical predictor
df = dta.data[['Lottery', 'Literacy', 'Wealth', 'Region']]

formula='Lottery ~ Literacy + Wealth + C(Region) + Literacy * Wealth'

For more on these formulas see:

In [7]:
import statsmodels.formula.api as sm

model1 = sm.ols('y ~ age',data=diab)
fit1_lm = model1.fit()


Let's build a dataframe to predict values on (sometimes this is just the test or validation set). Very useful for making pretty plots of the model predictions - predict for TONS of values, not just whatever's in the training set.

In [8]:
x_pred = np.linspace(0,16,100)

predict_df = pd.DataFrame(data={"age":x_pred})

Out[8]:
age
0 0.000000
1 0.161616
2 0.323232
3 0.484848
4 0.646465

Use get_prediction().summary_frame() to get the model's prediction (and error bars!)

In [9]:
prediction_output = fit1_lm.get_prediction(predict_df).summary_frame()

Out[9]:
mean mean_se mean_ci_lower mean_ci_upper obs_ci_lower obs_ci_upper
0 3.996031 0.244590 3.502071 4.489991 2.600828 5.391235
1 4.009459 0.240929 3.522892 4.496026 2.616856 5.402063
2 4.022887 0.237280 3.543691 4.502084 2.632842 5.412932
3 4.036315 0.233642 3.564466 4.508165 2.648786 5.423845
4 4.049743 0.230016 3.585216 4.514270 2.664687 5.434800

Plot the model and error bars

In [10]:
ax1 = diab.plot.scatter(x='age',y='y',c='Red',title="Diabetes data with least-squares linear fit")
ax1.set_xlabel("Age at Diagnosis")
ax1.set_ylabel("Log C-Peptide Concentration")

ax1.plot(predict_df.age, prediction_output['mean'],color="green")
ax1.plot(predict_df.age, prediction_output['mean_ci_lower'], color="blue",linestyle="dashed")
ax1.plot(predict_df.age, prediction_output['mean_ci_upper'], color="blue",linestyle="dashed");

Exercise 1
• Fit a 3rd degree polynomial model and
• plot the model+error bars.

You can either take

• Route1: Build a design df with a column for each of age, age**2, age**3, or
• Route2: Just edit the formula
In [11]:
# your answer here

In [12]:
# %load ../solutions/exercise1-1.py
fit2_lm = sm.ols(formula="y ~ age + np.power(age, 2) + np.power(age, 3)",data=diab).fit()

poly_predictions = fit2_lm.get_prediction(predict_df).summary_frame()

Out[12]:
mean mean_se mean_ci_lower mean_ci_upper obs_ci_lower obs_ci_upper
0 2.740481 0.508197 1.712556 3.768406 1.156238 4.324724
1 2.846265 0.472858 1.889819 3.802710 1.307439 4.385090
2 2.948751 0.439558 2.059661 3.837841 1.450860 4.446641
3 3.047990 0.408303 2.222119 3.873860 1.586737 4.509242
4 3.144031 0.379104 2.377221 3.910841 1.715328 4.572735
In [13]:
# %load ../solutions/exercise1-2.py
ax2 = diab.plot.scatter(x='age',y='y',c='Red',title="Diabetes data with least-squares cubic fit")
ax2.set_xlabel("Age at Diagnosis")
ax2.set_ylabel("Log C-Peptide Concentration")

ax2.plot(predict_df.age, poly_predictions['mean'],color="green")
ax2.plot(predict_df.age, poly_predictions['mean_ci_lower'], color="blue",linestyle="dashed")
ax2.plot(predict_df.age, poly_predictions['mean_ci_upper'], color="blue",linestyle="dashed");

Ed exercise

This example was similar with the Ed exercise. Open it in Ed and let's go though it.

### 2 - Piecewise Polynomials a.k.a. Splines¶

Splines are a type of piecewise polynomial interpolant. A spline of degree k is a piecewise polynomial that is continuously differentiable k − 1 times.

Splines are the basis of CAD software and vector graphics including a lot of the fonts used in your computer. The name “spline” comes from a tool used by ship designers to draw smooth curves. Here is the letter $epsilon$ written with splines:

font idea inspired by David Knezevic (AM205)

If the degree is 1 then we have a Linear Spline. If it is 3 then we have a Cubic spline. It turns out that cubic splines because they have a continous 2nd derivative at the knots are very smoothly looking to the eye. We do not need higher order than that. The Cubic Splines are usually Natural Cubic Splines which means they have the added constrain of the end points' second derivative = 0.

We will use the CubicSpline and the B-Spline as well as the Linear Spline.

#### scipy.interpolate¶

See all the different splines that scipy.interpolate has to offer: https://docs.scipy.org/doc/scipy/reference/interpolate.html

Let's use the simplest form which is interpolate on a set of points and then find the points between them.

In [14]:
from scipy.interpolate import splrep, splev
from scipy.interpolate import BSpline, CubicSpline
from scipy.interpolate import interp1d

# define the range of the function
a = -1
b = 1

# define the number of knots
num_knots = 10
x = np.linspace(a,b,num_knots)

# define the function we want to approximate
y = 1/(1+25*(x**2))

# make a linear spline
linspline = interp1d(x, y)

# sample at these points to plot
xx = np.linspace(a,b,1000)
yy = 1/(1+25*(xx**2))
plt.plot(x,y,'*')
plt.plot(xx, yy, label='true function')
plt.plot(xx, linspline(xx), label='linear spline');
plt.legend();

Exercise 2

The Linear interpolation does not look very good. Fit a Cubic Spline and plot along the Linear to compare.

In [15]:
# your answer here

In [16]:
# %load ../solutions/exercise2.py
# define the range of the function
a = -1
b = 1

# define the knots
num_knots = 10
x = np.linspace(a,b,num_knots)

# define the function we want to approximate
y = 1/(1+25*(x**2))

# make the Cubic spline
cubspline = CubicSpline(x, y)

# OR make a linear spline
linspline = interp1d(x, y)

# plot
xx = np.linspace(a,b,1000)
yy = 1/(1+25*(xx**2))
plt.plot(xx, yy, label='true function')
plt.plot(x,y,'*')
plt.plot(xx, linspline(xx), label='linear');
plt.plot(xx, cubspline(xx), label='cubic');
plt.legend();

Discussion
• Change the number of knots to 100 and see what happens. What would happen if we run a polynomial model of degree equal to the number of knots (a global one as in polynomial regression, not a spline)?
• What makes a spline 'Natural'?

#### B-Splines¶

A B-splines (Basis Splines) is defined by a set of control points and a set of basis functions that intepolate (fit) the function between these points. By choosing to have no smoothing factor we forces the final B-spline to pass though all the points. If, on the other hand, we set a smothing factor, our function is more of an approximation with the control points as "guidance". The latter produced a smoother curve which is prefferable for drawing software. For more on Splines see: https://en.wikipedia.org/wiki/B-spline)

We will use scipy.splrep to calulate the coefficients for the B-Spline and draw it.

#### B-Spline with no smooting¶

In [17]:
from scipy.interpolate import splev, splrep
x = np.linspace(0, 10, 10)
y = np.sin(x)

t,c,k = splrep(x, y) # (tck) is a tuple containing the vector of knots, coefficients, degree of the spline
print(t,c,k)
# define the points to plot on (x2)
x2 = np.linspace(0, 10, 200)
y2 = BSpline(t, c, k)
plt.plot(x, y, 'o', x2, y2(x2))
plt.show()

[ 0.          0.          0.          0.          2.22222222  3.33333333
4.44444444  5.55555556  6.66666667  7.77777778 10.         10.
10.         10.        ] [-4.94881722e-18  8.96543619e-01  1.39407154e+00 -2.36640266e-01
-1.18324030e+00 -8.16301228e-01  4.57836125e-01  1.48720677e+00
1.64338775e-01 -5.44021111e-01  0.00000000e+00  0.00000000e+00
0.00000000e+00  0.00000000e+00] 3


#### B-Spline with smooting factor s¶

In [18]:
from scipy.interpolate import splev, splrep
x = np.linspace(0, 10, 10)
y = np.sin(x)

s = 0.5 # add smooting factor
task = 0 # task needs to be set to 0, which represents:
# we are specifying a smoothing factor and thus only want
# splrep() to find the optimal t and c

# define the points to plot on (x2)
x2 = np.linspace(0, 10, 200)
y2 = BSpline(t, c, k)
plt.plot(x, y, 'o', x2, y2(x2))
plt.show()


#### B-Spline with given knots¶

In [19]:
x = np.linspace(0, 10, 100)
y = np.sin(x)
knots = np.quantile(x, [0.25, 0.5, 0.75])
print(knots)

[2.5 5.  7.5]

In [20]:
# calculate the B-Spline
t,c,k = splrep(x, y, t=knots)

In [21]:
curve = BSpline(t,c,k)
curve

Out[21]:
In [22]:
plt.scatter(x=x,y=y,c='grey', alpha=0.4)
yknots = np.sin(knots)
plt.scatter(knots, yknots, c='r')
plt.plot(x,curve(x))
plt.show()

Ed exercise

This example was similar with the Ed exercise. Open it in Ed and let's go though it.

### 3 - GAMs¶

#### A - Classification in pyGAM¶

Let's get our (multivariate!) data, the kyphosis dataset, and the LogisticGAM model from pyGAM to do binary classification.

• kyphosis - wherther a particular deformation was present post-operation
• age - patient's age in months
• number - the number of vertebrae involved in the operation
• start - the number of the topmost vertebrae operated on
In [23]:
kyphosis = pd.read_csv("../data/kyphosis.csv")

display(kyphosis.describe(include='all'))
display(kyphosis.dtypes)

Kyphosis Age Number Start
0 absent 71 3 5
1 absent 158 3 14
2 present 128 4 5
3 absent 2 5 1
4 absent 1 4 15
Kyphosis Age Number Start
count 81 81.000000 81.000000 81.000000
unique 2 NaN NaN NaN
top absent NaN NaN NaN
freq 64 NaN NaN NaN
mean NaN 83.654321 4.049383 11.493827
std NaN 58.104251 1.619423 4.883962
min NaN 1.000000 2.000000 1.000000
25% NaN 26.000000 3.000000 9.000000
50% NaN 87.000000 4.000000 13.000000
75% NaN 130.000000 5.000000 16.000000
max NaN 206.000000 10.000000 18.000000
Kyphosis    object
Age          int64
Number       int64
Start        int64
dtype: object
In [24]:
# convert the outcome in a binary form, 1 or 0
kyphosis["outcome"] = 1*(kyphosis["Kyphosis"] == "present")
kyphosis.describe()

Out[24]:
Age Number Start outcome
count 81.000000 81.000000 81.000000 81.000000
mean 83.654321 4.049383 11.493827 0.209877
std 58.104251 1.619423 4.883962 0.409758
min 1.000000 2.000000 1.000000 0.000000
25% 26.000000 3.000000 9.000000 0.000000
50% 87.000000 4.000000 13.000000 0.000000
75% 130.000000 5.000000 16.000000 0.000000
max 206.000000 10.000000 18.000000 1.000000
In [25]:
from pygam import LogisticGAM, s, f, l

X = kyphosis[["Age","Number","Start"]]
y = kyphosis["outcome"]
kyph_gam = LogisticGAM().fit(X,y)


#### Outcome dependence on features¶

To help us see how the outcome depends on each feature, pyGAM has the partial_dependence() function.

 pdep, confi = kyph_gam.partial_dependence(term=i, X=XX, width=0.95)

For more on this see the : https://pygam.readthedocs.io/en/latest/api/logisticgam.html

In [26]:
res = kyph_gam.deviance_residuals(X,y)
for i, term in enumerate(kyph_gam.terms):
if term.isintercept:
continue

XX = kyph_gam.generate_X_grid(term=i)
pdep, confi = kyph_gam.partial_dependence(term=i, X=XX, width=0.95)
pdep2, _ = kyph_gam.partial_dependence(term=i, X=X, width=0.95)
plt.figure()
plt.scatter(X.iloc[:,term.feature], pdep2 + res)
plt.plot(XX[:, term.feature], pdep)
plt.plot(XX[:, term.feature], confi, c='r', ls='--')
plt.title(X.columns.values[term.feature])
plt.show()


Notice that we did not specify the basis functions in the .fit(). Cool. pyGAM figures them out for us by using $s()$ (splines) for numerical variables and $f()$ for categorical features. If this is not what we want we can manually specify the basis functions, as follows:

In [27]:
kyph_gam = LogisticGAM(s(0)+s(1)+s(2)).fit(X,y)

In [28]:
res = kyph_gam.deviance_residuals(X,y)
for i, term in enumerate(kyph_gam.terms):
if term.isintercept:
continue

XX = kyph_gam.generate_X_grid(term=i)
pdep, confi = kyph_gam.partial_dependence(term=i, X=XX, width=0.95)
pdep2, _ = kyph_gam.partial_dependence(term=i, X=X, width=0.95)
plt.figure()
plt.scatter(X.iloc[:,term.feature], pdep2 + res)
plt.plot(XX[:, term.feature], pdep)
plt.plot(XX[:, term.feature], confi, c='r', ls='--')
plt.title(X.columns.values[term.feature])
plt.show()


#### B - Regression in pyGAM¶

For regression problems, we can use a linearGAM model. For this part we will use the wages dataset.

#### The wages dataset¶

Let's inspect another dataset that is included in pyGAM that notes the wages of people based on their age, year of employment and education.

In [29]:
# from the pyGAM documentation
from pygam import LinearGAM, s, f
from pygam.datasets import wage

X, y = wage(return_X_y=True)

## model
gam = LinearGAM(s(0) + s(1) + f(2))
gam.gridsearch(X, y)

## plotting
plt.figure();
fig, axs = plt.subplots(1,3);

titles = ['year', 'age', 'education']
for i, ax in enumerate(axs):
XX = gam.generate_X_grid(term=i)
ax.plot(XX[:, i], gam.partial_dependence(term=i, X=XX))
ax.plot(XX[:, i], gam.partial_dependence(term=i, X=XX, width=.95)[1], c='r', ls='--')
if i == 0:
ax.set_ylim(-30,30)
ax.set_title(titles[i]);

100% (11 of 11) |########################| Elapsed Time: 0:00:00 Time:  0:00:00

Discussion

What are your observations from the plots above?

### 4 - Smoothing Splines using pyGAM¶

For clarity: this is the fancy spline model that minimizes $MSE - \lambda\cdot\text{wiggle penalty}$ $=$ $\sum_{i=1}^N \left(y_i - f(x_i)\right)^2 - \lambda \int \left(f''(x)\right)^2$, across all possible functions $f$. The winner will always be a continuous, cubic polynomial with a knot at each data point.

Let's see how this smoothing works in pyGAM. We start by creating some arbitrary data and fitting them with a GAM.

In [30]:
X = np.linspace(0,10,500)
y = np.sin(X*2*np.pi)*X + np.random.randn(len(X))

plt.scatter(X,y);

In [31]:
# let's try a large lambda first and lots of splines
gam = LinearGAM(lam=1e6, n_splines=50). fit(X,y)
XX = gam.generate_X_grid(term=0)
plt.scatter(X,y,alpha=0.3);
plt.plot(XX, gam.predict(XX));


We see that the large $\lambda$ forces a straight line, no flexibility. Let's see now what happens if we make it smaller.

In [32]:
# let's try a smaller lambda
gam = LinearGAM(lam=1e2, n_splines=50). fit(X,y)
XX = gam.generate_X_grid(term=0)
plt.scatter(X,y,alpha=0.3);
plt.plot(XX, gam.predict(XX));


There is some curvature there but still not a good fit. Let's try no penalty. That should have the line fit exactly.

In [33]:
# no penalty, let's try a 0 lambda
gam = LinearGAM(lam=0, n_splines=50). fit(X,y)
XX = gam.generate_X_grid(term=0)
plt.scatter(X,y,alpha=0.3)
plt.plot(XX, gam.predict(XX))

Out[33]:
[]

Yes, that is good. Now let's see what happens if we lessen the number of splines. The fit should not be as good.

In [34]:
# no penalty, let's try a 0 lambda
gam = LinearGAM(lam=0, n_splines=10). fit(X,y)
XX = gam.generate_X_grid(term=0)
plt.scatter(X,y,alpha=0.3);
plt.plot(XX, gam.predict(XX));


Indeed.