


Title

Notebook A: pyGAM

Description:

You will work on the notebook outside of Ed.

Download the notebook (using the >> in the Challenge) and image files (if any) to your local environment. DO NOT RUN IN Ed.

Run it by choosing one of the options presented inside the notebook.

CS109B Data Science 2: Advanced Topics in Data Science

Lecture 5.5 - Smoothers and Generalized Additive Models - Model Fitting

JUST A NOTEBOOK

Harvard University
Spring 2021
Instructors: Mark Glickman, Pavlos Protopapas, and Chris Tanner
Lab Instructor: Eleni Kaxiras

Content: Eleni Kaxiras and Will Claybaugh


In [1]:
## RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2019-CS109B/master/content/styles/cs109.css").text
HTML(styles)
Out[1]:
In [2]:
import numpy as np
from scipy.interpolate import interp1d
import matplotlib.pyplot as plt
import pandas as pd

%matplotlib inline

Table of Contents

  • Overview - A Top View of LMs, GLMs, and GAMs to set the stage
  • 1 - Piecewise Polynomials a.k.a. Splines
  • 2 - GAMs
  • 3 - Smoothing Splines using csaps
  • 4 - Data fitting using pyGAM and Penalized B-Splines

Overview

image source: Dani Servén Marín (one of the developers of pyGAM)

A - Linear Models

First we have the Linear Models, which you know from 109a. These models are linear in the coefficients. They are very interpretable but suffer from high bias because, let's face it, few relationships in life are linear. Simple Linear Regression (a model with one predictor) and Multiple Linear Regression (more than one predictor) are examples of LMs. Polynomial Regression extends the linear model by adding terms that are still linear in the coefficients but non-linear in the predictors, which are now raised to a power or multiplied with each other.

$$ \begin{aligned} y &= \beta_0 + \beta_1 x_1 & \mbox{(simple linear regression)}\\ y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 & \mbox{(multiple linear regression)}\\ y &= \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_3^3 & \mbox{(polynomial multiple regression)} \end{aligned} $$
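For instance, here is a minimal sketch (not part of the original notebook) showing that polynomial regression is still linear in the coefficients: we simply regress on powers of $x$, here using numpy.polyfit as the solver.

import numpy as np

# toy data with a cubic relationship plus noise
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 50)
y = 1.0 + 0.5*x - 2.0*x**2 + 0.3*x**3 + rng.normal(scale=0.2, size=x.shape)

# "polynomial regression": ordinary least squares on [1, x, x^2, x^3]
coefs = np.polyfit(x, y, deg=3)   # highest-degree coefficient first
print(coefs)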
Questions to think about
  • What does it mean for a model to be interpretable?
  • Are linear regression models interpretable? Are random forests? What about Neural Networks such as Feed Forward?
  • Do we always want interpretability? Describe cases where we do and cases where we do not care.

B - Generalized Linear Models (GLMs)

Generalized Linear Models is a term coined in the early 1970s by Nelder and Wedderburn for a class of models that includes both Linear Regression and Logistic Regression. A GLM still fits one coefficient per feature (predictor), but relates that linear combination to the response through a link function (the identity for linear regression, the logit for logistic regression).
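For example, logistic regression is a GLM with a Binomial family and a logit link. A minimal sketch (assuming statsmodels is installed; not part of the original notebook):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=200)
p = 1/(1 + np.exp(-(0.5 + 2.0*x)))        # true probabilities
y = rng.binomial(1, p)                    # binary outcome

X = sm.add_constant(x)                    # one coefficient per feature, plus intercept
glm = sm.GLM(y, X, family=sm.families.Binomial()).fit()  # logit link by default
print(glm.params)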

C - Generalized Additive Models (GAMs)

Hastie and Tibshirani coined the term Generalized Additive Models in 1986 for a class of non-linear extensions to Generalized Linear Models.

$$ \begin{aligned} y &= \beta_0 + f_1\left(x_1\right) + f_2\left(x_2\right) + f_3\left(x_3\right) & \\ y &= \beta_0 + f_1\left(x_1\right) + f_2\left(x_2, x_3\right) + f_3\left(x_3\right) & \mbox{(with interaction terms)} \end{aligned} $$

In practice we add splines and regularization via smoothing penalties to our GLMs.

image source: Dani Servén Marín

D - Basis Functions

In our models we can use various types of functions as "basis".

  • Monomials such as $x^2$, $x^4$ (Polynomial Regression)
  • Sigmoid functions (neural networks)
  • Fourier functions
  • Wavelets
  • Regression splines

1 - Piecewise Polynomials a.k.a. Splines

Splines are a type of piecewise polynomial interpolant. A spline of degree k is a piecewise polynomial that is continuously differentiable k − 1 times.

Splines are the basis of CAD software and vector graphics, including many of the fonts on your computer. The name “spline” comes from a tool used by ship designers to draw smooth curves. Here is the letter $\epsilon$ written with splines:

font idea inspired by Chris Rycroft (AM205)

If the degree is 1, we have a Linear Spline. If it is 3, we have a Cubic Spline. Cubic splines look very smooth to the eye because they have a continuous 2nd derivative (curvature) at the knots; we rarely need a higher order than that. The cubic splines used in practice are usually Natural Cubic Splines, which add the constraint that the second derivative is zero at the endpoints.
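As a quick numerical check (a small sketch, not part of the original notebook), scipy's CubicSpline can build a natural cubic spline via bc_type='natural', and evaluating its second derivative at the endpoints gives (numerically) zero:

import numpy as np
from scipy.interpolate import CubicSpline

x = np.linspace(-1, 1, 11)
y = 1/(1 + 25*x**2)

cs = CubicSpline(x, y, bc_type='natural')

# for a natural cubic spline the second derivative vanishes at the endpoints
print(cs(x[0], 2), cs(x[-1], 2))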

We will use the CubicSpline and the B-Spline as well as the Linear Spline.

scipy.interpolate

See all the different splines that scipy.interpolate has to offer: https://docs.scipy.org/doc/scipy/reference/interpolate.html

Let's use the simplest form: interpolate on a set of points and then evaluate the interpolant at points between them.

In [3]:
from scipy.interpolate import splrep, splev
from scipy.interpolate import BSpline, CubicSpline
from scipy.interpolate import interp1d

# define the range of the function
a = -1
b = 1

# define the number of knots 
num_knots = 11
knots = np.linspace(a,b,num_knots)

# define the function we want to approximate
y = 1/(1+25*(knots**2))

# make a linear spline
linspline = interp1d(knots, y)

# sample at these points to plot
xx = np.linspace(a,b,1000)
yy = 1/(1+25*(xx**2))
plt.plot(knots,y,'*')
plt.plot(xx, yy, label='true function')
plt.plot(xx, linspline(xx), label='linear spline');
plt.legend();
Out[3]:
Exercise

The linear interpolation does not look very good. Fit a Cubic Spline and plot it alongside the linear spline to compare. Feel free to solve it yourself and then look at the solution.

In [4]:
# your answer here
In [5]:
# solution
# define the range of the function
a = -1
b = 1

# define the knots 
num_knots = 10
x = np.linspace(a,b,num_knots)

# define the function we want to approximate
y = 1/(1+25*(x**2))

# make the Cubic spline
cubspline = CubicSpline(x, y)

print(f'Num knots in cubic spline: {num_knots}')

# OR make a linear spline
linspline = interp1d(x, y)

# plot
xx = np.linspace(a,b,1000)
yy = 1/(1+25*(xx**2))

plt.plot(xx, yy, label='true function')
plt.plot(x,y,'*', label='knots')
plt.plot(xx, linspline(xx), label='linear');
plt.plot(xx, cubspline(xx), label='cubic');
plt.legend();
Num knots in cubic spline: 10
Out[5]:
Questions to think about
  • Change the number of knots to 100 and see what happens. What would happen if we ran a polynomial model of degree equal to the number of knots (a global one as in polynomial regression, not a spline)? A small sketch to experiment with follows this list.
  • What makes a spline 'Natural'?
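If you want to experiment with the first question, here is one possible sketch (not part of the original notebook) comparing a global interpolating polynomial with the cubic spline on the same knots; the oscillations near the endpoints are the classic Runge phenomenon.

import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import CubicSpline

a, b, num_knots = -1, 1, 11
x = np.linspace(a, b, num_knots)
y = 1/(1 + 25*x**2)

# a single global polynomial of degree num_knots - 1 interpolates all the knots
poly = np.poly1d(np.polyfit(x, y, deg=num_knots - 1))
cub = CubicSpline(x, y)

xx = np.linspace(a, b, 1000)
plt.plot(xx, 1/(1 + 25*xx**2), label='true function')
plt.plot(xx, poly(xx), label=f'global polynomial, degree {num_knots - 1}')
plt.plot(xx, cub(xx), label='cubic spline')
plt.plot(x, y, '*', label='knots')
plt.legend();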
In [6]:
# Optional and Outside of the scope of this class: create the `epsilon` in the figure above
x = np.array([1.,0.,-1.5,0.,-1.5,0.])
y = np.array([1.5,1.,2.5,3,4,5])
t = np.linspace(0,5,6)
f = interp1d(t,x,kind='cubic')
g = interp1d(t,y,kind='cubic')
tplot = np.linspace(0,5,200)
plt.plot(x,y, '*', f(tplot), g(tplot));
Out[6]:

B-Splines (de Boor, 1978)

One way to construct a curve given a set of points is to interpolate the points, that is, to force the curve to pass through the points.

A B-spline (Basis Spline) is defined by a set of control points and a set of basis functions that fit the function between these points. By choosing to have no smoothing factor we force the final B-spline to pass through all the points. If, on the other hand, we set a smoothing factor, our function is more of an approximation, with the control points acting as "guidance". The latter produces a smoother curve, which is preferable for drawing software. For more on splines see: https://en.wikipedia.org/wiki/B-spline

We will use scipy.interpolate.splrep to calculate the coefficients for the B-Spline and draw it.

B-Spline with no smoothing

In [7]:
from scipy.interpolate import splev, splrep

x = np.linspace(0, 10, 10)
y = np.sin(x)
# (t,c,k) is a tuple containing the vector of knots, coefficients, degree of the spline
t,c,k = splrep(x, y)
x2 = np.linspace(0, 10, 200)
y2 = BSpline(t,c,k)
plt.plot(x, y, 'o', x2, y2(x2))
plt.show()
In [8]:
from scipy.interpolate import splrep
x = np.linspace(0, 10, 10)
y = np.sin(x)

t,c,k = splrep(x, y, k=3) # (tck) is a tuple containing the vector of knots, coefficients, degree of the spline
# define the points to plot on (x2)
print(f'Knots ({len(t)} of them): {t}\n')
print(f'B-Spline coefficients ({len(c)} of them): {c}\n')
print(f'B-Spline degree {k}')
x2 = np.linspace(0, 10, 100)
y2 = BSpline(t, c, k)
plt.figure(figsize=(10,5))
plt.plot(x, y, 'o', label='true points')
plt.plot(x2, y2(x2), label='B-Spline')
tt = np.zeros(len(t))
plt.plot(t, tt,'g*', label='knots eval by the function')
plt.legend()
plt.show()
Knots (14 of them): [ 0.          0.          0.          0.          2.22222222  3.33333333
  4.44444444  5.55555556  6.66666667  7.77777778 10.         10.
 10.         10.        ]

B-Spline coefficients (14 of them): [-4.94881722e-18  8.96543619e-01  1.39407154e+00 -2.36640266e-01
 -1.18324030e+00 -8.16301228e-01  4.57836125e-01  1.48720677e+00
  1.64338775e-01 -5.44021111e-01  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00]

B-Spline degree 3

What do the tuple values returned by scipy.splrep mean?

  • The t variable is the array that contains the knots' position in the x axis. The length of this array is, of course, the number of knots.
  • The c variable is the array that holds the coefficients for the B-Spline. Its length should be the same as t.

The spline constructed via this method has len(t) - k - 1 B-spline basis elements (here 14 - 3 - 1 = 10), defined recursively as follows (see also the quick check just after the list below):

$$ \begin{aligned} B_{i, 0}(x) &= 1, \textrm{ if $t_i \le x < t_{i+1}$, otherwise $0$,}\\ B_{i, k}(x) &= \frac{x - t_i}{t_{i+k} - t_i} B_{i, k-1}(x) + \frac{t_{i+k+1} - x}{t_{i+k+1} - t_{i+1}} B_{i+1, k-1}(x) \end{aligned} $$

  • t $= [t_1, t_2, \ldots, t_n]$ is the knot vector
  • c : are the spline coefficients
  • k : is the spline degree
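A quick check on the numbers above (a small sketch, assuming t, c, k from the splrep cell are still in scope): the spline has len(t) - k - 1 basis elements, and the trailing entries of c are padding that BSpline ignores.

# sanity check: number of basis elements for the (t, c, k) computed above
n_basis = len(t) - k - 1
print(f'len(t) = {len(t)}, degree k = {k}, basis elements = {n_basis}')
print(f'coefficients actually used: {c[:n_basis]}')
print(f'trailing padding: {c[n_basis:]}')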

B-Spline with smoothing factor s

In [9]:
from scipy.interpolate import splev, splrep
x = np.linspace(0, 10, 5)
y = np.sin(x)

s = 0.5 # add smoothing factor
task = 0 # task needs to be set to 0, which represents:
# we are specifying a smoothing factor and thus only want
# splrep() to find the optimal t and c

t,c,k = splrep(x, y, task=task, s=s)

# draw the line segments
linspline = interp1d(x, y)
# define the points to plot on (x2)
x2 = np.linspace(0, 10, 200)
y2 = BSpline(t, c, k)
plt.plot(x, y, 'o', x2, y2(x2))
plt.plot(x2, linspline(x2))
plt.show()

B-Spline with given knots

In [10]:
x = np.linspace(0, 10, 100)
y = np.sin(x)
knots = np.quantile(x, [0.25, 0.5, 0.75])
print(knots)
[2.5 5.  7.5]
In [11]:
# calculate the B-Spline
t,c,k = splrep(x, y, t=knots)
In [12]:
curve = BSpline(t,c,k)
curve
Out[12]:
In [13]:
plt.scatter(x=x,y=y,c='grey', alpha=0.4)
yknots = np.sin(knots)
plt.scatter(knots, yknots, c='r')
plt.plot(x,curve(x))
plt.show()

2 - GAMs

https://readthedocs.org/projects/pygam/downloads/pdf/latest/

Classification in pyGAM

Let's get our (multivariate!) data, the kyphosis dataset, and the LogisticGAM model from pyGAM to do binary classification.

  • kyphosis - whether a particular deformation was present post-operation
  • age - patient's age in months
  • number - the number of vertebrae involved in the operation
  • start - the number of the topmost vertebra operated on
In [14]:
kyphosis = pd.read_csv("data/kyphosis.csv")

display(kyphosis.head())
display(kyphosis.describe(include='all'))
display(kyphosis.dtypes)
Kyphosis Age Number Start
0 absent 71 3 5
1 absent 158 3 14
2 present 128 4 5
3 absent 2 5 1
4 absent 1 4 15
Kyphosis Age Number Start
count 81 81.000000 81.000000 81.000000
unique 2 NaN NaN NaN
top absent NaN NaN NaN
freq 64 NaN NaN NaN
mean NaN 83.654321 4.049383 11.493827
std NaN 58.104251 1.619423 4.883962
min NaN 1.000000 2.000000 1.000000
25% NaN 26.000000 3.000000 9.000000
50% NaN 87.000000 4.000000 13.000000
75% NaN 130.000000 5.000000 16.000000
max NaN 206.000000 10.000000 18.000000
Kyphosis    object
Age          int64
Number       int64
Start        int64
dtype: object
In [15]:
# convert the outcome in a binary form, 1 or 0
kyphosis = pd.read_csv("data/kyphosis.csv")
kyphosis["outcome"] = 1*(kyphosis["Kyphosis"] == "present")
kyphosis.describe()
Out[15]:
Age Number Start outcome
count 81.000000 81.000000 81.000000 81.000000
mean 83.654321 4.049383 11.493827 0.209877
std 58.104251 1.619423 4.883962 0.409758
min 1.000000 2.000000 1.000000 0.000000
25% 26.000000 3.000000 9.000000 0.000000
50% 87.000000 4.000000 13.000000 0.000000
75% 130.000000 5.000000 16.000000 0.000000
max 206.000000 10.000000 18.000000 1.000000
In [0]:
from pygam import LogisticGAM, s, f, l

X = kyphosis[["Age","Number","Start"]]
y = kyphosis["outcome"]
kyph_gam = LogisticGAM().fit(X,y)

Outcome dependence on features

To help us see how the outcome depends on each feature, pyGAM has the partial_dependence() function.

 pdep, confi = kyph_gam.partial_dependence(term=i, X=XX, width=0.95)

For more on this see: https://pygam.readthedocs.io/en/latest/api/logisticgam.html

In [21]:
res = kyph_gam.deviance_residuals(X,y)
for i, term in enumerate(kyph_gam.terms):
    if term.isintercept:
        continue

    XX = kyph_gam.generate_X_grid(term=i)
    pdep, confi = kyph_gam.partial_dependence(term=i, X=XX, width=0.95)
    pdep2, _ = kyph_gam.partial_dependence(term=i, X=X, width=0.95)
    plt.figure()
    plt.scatter(X.iloc[:,term.feature], pdep2 + res)
    plt.plot(XX[:, term.feature], pdep)
    plt.plot(XX[:, term.feature], confi, c='r', ls='--')
    plt.title(X.columns.values[term.feature])
    plt.show()

Notice that we did not specify the model terms when constructing LogisticGAM(). pyGAM figures them out for us, using $s()$ (splines) for numerical features and $f()$ for categorical ones. If this is not what we want, we can specify the terms manually, as follows:

In [22]:
kyph_gam = LogisticGAM(s(0)+s(1)+s(2)).fit(X,y)
In [23]:
res = kyph_gam.deviance_residuals(X,y)
for i, term in enumerate(kyph_gam.terms):
    if term.isintercept:
        continue

    XX = kyph_gam.generate_X_grid(term=i)
    pdep, confi = kyph_gam.partial_dependence(term=i, X=XX, width=0.95)
    pdep2, _ = kyph_gam.partial_dependence(term=i, X=X, width=0.95)
    plt.figure()
    plt.scatter(X.iloc[:,term.feature], pdep2 + res)
    plt.plot(XX[:, term.feature], pdep)
    plt.plot(XX[:, term.feature], confi, c='r', ls='--')
    plt.title(X.columns.values[term.feature])
    plt.show()

Regression in pyGAM

For regression problems, we can use a LinearGAM model. For this part we will use the wages dataset.

https://pygam.readthedocs.io/en/latest/api/lineargam.html

The wages dataset

Let's inspect another dataset that is included in pyGAM. It records people's wages along with their age, year of employment, and education.

In [24]:
# from the pyGAM documentation
from pygam import LinearGAM, s, f
from pygam.datasets import wage

X, y = wage(return_X_y=True)

## model
gam = LinearGAM(s(0) + s(1) + f(2))
gam.gridsearch(X, y)


## plotting
plt.figure();
fig, axs = plt.subplots(1,3);

titles = ['year', 'age', 'education']
for i, ax in enumerate(axs):
    XX = gam.generate_X_grid(term=i)
    ax.plot(XX[:, i], gam.partial_dependence(term=i, X=XX))
    ax.plot(XX[:, i], gam.partial_dependence(term=i, X=XX, width=.95)[1], c='r', ls='--')
    if i == 0:
        ax.set_ylim(-30,30)
    ax.set_title(titles[i]);
100% (11 of 11) |########################| Elapsed Time: 0:00:00 Time:  0:00:00

3 - Smoothing Splines using csaps

Note: this is the spline model that minimizes
$\text{MSE} + \lambda\cdot\text{wiggle penalty}$ $=$ $\sum_{i=1}^N \left(y_i - f(x_i)\right)^2 + \lambda \int \left(f''(t)\right)^2 dt$,
across all possible functions $f$.

For csaps: a) the input x values must be strictly increasing with no duplicates (see https://csaps.readthedocs.io/en/latest/api.html), so sort the values in ascending order and drop duplicates first; b) the smoothing parameter p (passed as smooth) works in the opposite direction from the $\lambda$ above, behaving like $1-\lambda$ in the formula: for p=0 we get a least-squares straight line, for p=1 there is no smoothing (the spline interpolates the data).
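To see those two extremes (a small sketch, not part of the original notebook), compare smooth=0, which gives a least-squares straight line, with smooth=1, which interpolates the data:

import numpy as np
import matplotlib.pyplot as plt
from csaps import csaps

np.random.seed(1234)
x = np.linspace(0, 10, 30)
y = np.sin(x) + 0.3*np.random.randn(len(x))
xs = np.linspace(0, 10, 300)

plt.plot(x, y, 'o', label='data')
plt.plot(xs, csaps(x, y, xs, smooth=0.0), label='smooth=0 (straight line)')
plt.plot(xs, csaps(x, y, xs, smooth=1.0), label='smooth=1 (no smoothing)')
plt.legend()
plt.show()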

In [26]:
from csaps import csaps

np.random.seed(1234)

x = np.linspace(0,10,300000)
y = np.sin(x*2*np.pi)*x + np.random.randn(len(x))

xs = np.linspace(x[0], x[-1], 1000)

ys = csaps(x, y, xs, smooth=0.99)
print(ys.shape)

#plt.plot(x, y, 'o', xs, ys, '-')
plt.plot(x, y, 'o', xs, ys, '-')
plt.show()
(1000,)

4 - Data fitting using pyGAM and Penalized B-Splines

When we use a spline in pyGAM we are effectively using a penalized B-Spline with a regularization parameter $\lambda$. E.g.

LogisticGAM(s(0)+s(1, lam=0.5)+s(2)).fit(X,y)

Let's see how this smoothing works in pyGAM. We start by creating some arbitrary data and fitting them with a GAM.

In [27]:
X = np.linspace(0,10,500)
y = np.sin(X*2*np.pi)*X + np.random.randn(len(X))

plt.scatter(X,y);
In [28]:
# let's try a large lambda first and lots of splines
gam = LinearGAM(lam=1e6, n_splines=50).fit(X,y)
XX = gam.generate_X_grid(term=0)
plt.scatter(X,y,alpha=0.3);
plt.plot(XX, gam.predict(XX));

We see that the large $\lambda$ forces a straight line, no flexibility. Let's see now what happens if we make it smaller.

In [29]:
# let's try a smaller lambda 
gam = LinearGAM(lam=1e2, n_splines=50).fit(X,y)
XX = gam.generate_X_grid(term=0)
plt.scatter(X,y,alpha=0.3);
plt.plot(XX, gam.predict(XX));

There is some curvature there but still not a good fit. Let's try no penalty; the spline should then follow the data very closely.

In [30]:
# no penalty, let's try a 0 lambda 
gam = LinearGAM(lam=0, n_splines=50).fit(X,y)
XX = gam.generate_X_grid(term=0)
plt.scatter(X,y,alpha=0.3)
plt.plot(XX, gam.predict(XX))
Out[30]:

Yes, that is good. Now let's see what happens if we reduce the number of splines. The fit should not be as good.

In [31]:
# no penalty again (lambda = 0), but fewer splines this time
gam = LinearGAM(lam=0, n_splines=10).fit(X,y)
XX = gam.generate_X_grid(term=0)
plt.scatter(X,y,alpha=0.3);
plt.plot(XX, gam.predict(XX));