CS109B Data Science 2: Advanced Topics in Data Science
Lecture 5.5 - Smoothers and Generalized Additive Models - Model Fitting¶
Harvard University
Spring 2021
Instructors: Mark Glickman, Pavlos Protopapas, and Chris Tanner
Lab Instructor: Eleni Kaxiras
Content: Eleni Kaxiras and Will Claybaugh
## RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2019-CS109B/master/content/styles/cs109.css").text
HTML(styles)
import numpy as np
from scipy.interpolate import interp1d
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
Table of Contents¶
- 1 - Overview - A Top View of LMs, GLMs, and GAMs to set the stage
- 2 - A review of Linear Regression with statsmodels. Formulas.
- 3 - Splines
- 4 - Generalized Additive Models with pyGAM
- 5 - Smoothing Splines using csaps
Overview¶
image source: Dani Servén Marín (one of the developers of pyGAM)
A - Linear Models¶
First we have the Linear Models, which you know from 109a. These models are linear in the coefficients. They are very interpretable but suffer from high bias because, let's face it, few relationships in life are linear. Simple Linear Regression (defined as a model with one predictor) as well as Multiple Linear Regression (more than one predictor) are examples of LMs. Polynomial Regression extends the linear model by adding terms that are still linear in the coefficients but non-linear in the predictors, which are now raised to a power or multiplied with each other.
$$ \begin{aligned} y = \beta_0 + \beta_1 x_1 & \quad \mbox{(simple linear regression)}\\ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 & \quad \mbox{(multiple linear regression)}\\ y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_3^3 & \quad \mbox{(polynomial multiple regression)}\\ \end{aligned} $$
- What does it mean for a model to be interpretable?
- Are linear regression models interpretable? Are random forests? What about Neural Networks such as Feed Forward?
- Do we always want interpretability? Describe cases where we do and cases where we do not care.
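To make the "linear in the coefficients" point concrete, here is a minimal sketch (not part of the original lab) that fits a quadratic polynomial regression by ordinary least squares; the toy data and variable names are made up for illustration.
# minimal sketch: polynomial regression is still an ordinary least-squares problem
# because the model is linear in the coefficients (the basis here is [1, x, x^2])
import numpy as np
rng = np.random.default_rng(109)
x1 = np.linspace(-2, 2, 50)
y_toy = 1.0 + 2.0*x1 - 0.5*x1**2 + rng.normal(scale=0.3, size=len(x1))
X_design = np.column_stack([np.ones_like(x1), x1, x1**2])   # monomial basis columns
beta_hat, *_ = np.linalg.lstsq(X_design, y_toy, rcond=None)
print(beta_hat)   # estimates of (beta_0, beta_1, beta_2)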
B - Generalized Linear Models (GLMs)¶
Generalized Linear Models is a term coined in the early 1970s by Nelder and Wedderburn for a class of models that includes both Linear Regression and Logistic Regression. A GLM fits one coefficient per feature (predictor) and relates the linear combination of the predictors to the response through a link function.
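For instance, logistic regression is a GLM with a Binomial family and a logit link. Here is a minimal sketch (not part of the original lab) using statsmodels; the toy data and variable names are made up for illustration.
# minimal sketch: a logistic-regression GLM on toy data
import numpy as np
import statsmodels.api as sm
rng = np.random.default_rng(0)
x = rng.normal(size=200)
p = 1/(1 + np.exp(-(0.5 + 2.0*x)))          # true success probabilities
y_bin = rng.binomial(1, p)                   # binary outcome
X_design = sm.add_constant(x)                # intercept plus one coefficient per feature
glm = sm.GLM(y_bin, X_design, family=sm.families.Binomial())
print(glm.fit().summary())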
C - Generalized Additive Models (GAMs)¶
Hastie and Tibshirani coined the term Generalized Additive Models in 1986 for a class of non-linear extensions to Generalized Linear Models.
$$ \begin{aligned} y = \beta_0 + f_1\left(x_1\right) + f_2\left(x_2\right) + f_3\left(x_3\right) \\ y = \beta_0 + f_1\left(x_1\right) + f_2\left(x_2, x_3\right) + f_3\left(x_3\right) & \quad \mbox{(with interaction terms)} \end{aligned} $$
In practice we add splines and regularization via smoothing penalties to our GLMs.
image source: Dani Servén Marín
D - Basis Functions¶
In our models we can use various types of functions as a "basis".
- Monomials such as $x^2$, $x^4$ (Polynomial Regression)
- Sigmoid functions (neural networks)
- Fourier functions
- Wavelets
- Regression splines
1 - Piecewise Polynomials a.k.a. Splines¶
Splines are a type of piecewise polynomial interpolant. A spline of degree k is a piecewise polynomial that is continuously differentiable k − 1 times.
Splines are the basis of CAD software and vector graphics, including many of the fonts used on your computer. The name “spline” comes from a tool used by ship designers to draw smooth curves. Here is the letter $\epsilon$ written with splines:
font idea inspired by Chris Rycroft (AM205)
If the degree is 1 then we have a Linear Spline. If it is 3 then we have a Cubic Spline. It turns out that cubic splines, because they have a continuous 2nd derivative (curvature) at the knots, look very smooth to the eye; we do not need a higher order than that. The cubic splines used are usually Natural Cubic Splines, which have the added constraint that the second derivative is zero at the endpoints.
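For reference, scipy's CubicSpline lets you impose the natural boundary condition explicitly; a minimal sketch (not part of the original lab):
# minimal sketch: a natural cubic spline has zero second derivative at the endpoints
from scipy.interpolate import CubicSpline
import numpy as np
x = np.linspace(-1, 1, 11)
y = 1/(1 + 25*(x**2))
natural = CubicSpline(x, y, bc_type='natural')   # default bc_type is 'not-a-knot'
print(natural(-1.0, 2), natural(1.0, 2))         # second derivatives at the endpoints (~0)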
We will use the CubicSpline and the B-Spline as well as the Linear Spline.
scipy.interpolate¶
See all the different splines that scipy.interpolate has to offer: https://docs.scipy.org/doc/scipy/reference/interpolate.html
Let's use the simplest form, which is to interpolate on a set of points and then evaluate the spline between them.
from scipy.interpolate import splrep, splev
from scipy.interpolate import BSpline, CubicSpline
from scipy.interpolate import interp1d
# define the range of the function
a = -1
b = 1
# define the number of knots
num_knots = 11
knots = np.linspace(a,b,num_knots)
# define the function we want to approximate
y = 1/(1+25*(knots**2))
# make a linear spline
linspline = interp1d(knots, y)
# sample at these points to plot
xx = np.linspace(a,b,1000)
yy = 1/(1+25*(xx**2))
plt.plot(knots,y,'*')
plt.plot(xx, yy, label='true function')
plt.plot(xx, linspline(xx), label='linear spline');
plt.legend();
The linear interpolation does not look very good. Fit a Cubic Spline and plot it alongside the linear spline to compare. Feel free to solve it yourself and then look at the solution.
# your answer here
# solution
# define the range of the function
a = -1
b = 1
# define the knots
num_knots = 10
x = np.linspace(a,b,num_knots)
# define the function we want to approximate
y = 1/(1+25*(x**2))
# make the Cubic spline
cubspline = CubicSpline(x, y)
print(f'Num knots in cubic spline: {num_knots}')
# OR make a linear spline
linspline = interp1d(x, y)
# plot
xx = np.linspace(a,b,1000)
yy = 1/(1+25*(xx**2))
plt.plot(xx, yy, label='true function')
plt.plot(x,y,'*', label='knots')
plt.plot(xx, linspline(xx), label='linear');
plt.plot(xx, cubspline(xx), label='cubic');
plt.legend();
- Change the number of knots to 100 and see what happens. What would happen if we ran a polynomial model of degree equal to the number of knots (a global one as in polynomial regression, not a spline)? An optional sketch follows the questions below.
- What makes a spline 'Natural'?
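Related to the first question above, here is an optional sketch (not part of the original lab): a single global polynomial whose degree grows with the number of knots oscillates wildly near the interval ends (the Runge phenomenon), which piecewise low-degree splines avoid.
# optional sketch: one global high-degree polynomial through the same knots
a, b = -1, 1
num_knots = 15                                 # try larger values and watch the interval ends
x = np.linspace(a, b, num_knots)
y = 1/(1+25*(x**2))
coeffs = np.polyfit(x, y, deg=num_knots - 1)   # may warn about conditioning; that is part of the point
poly = np.poly1d(coeffs)
xx = np.linspace(a, b, 1000)
plt.plot(xx, 1/(1+25*(xx**2)), label='true function')
plt.plot(x, y, '*', label='knots')
plt.plot(xx, poly(xx), label=f'global polynomial, degree {num_knots - 1}')
plt.legend();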
# Optional and Outside of the scope of this class: create the `epsilon` in the figure above
x = np.array([1.,0.,-1.5,0.,-1.5,0.])
y = np.array([1.5,1.,2.5,3,4,5])
t = np.linspace(0,5,6)
f = interp1d(t,x,kind='cubic')
g = interp1d(t,y,kind='cubic')
tplot = np.linspace(0,5,200)
plt.plot(x,y, '*', f(tplot), g(tplot));
B-Splines (de Boor, 1978)¶
One way to construct a curve given a set of points is to interpolate the points, that is, to force the curve to pass through the points.
A B-spline (Basis Spline) is defined by a set of control points and a set of basis functions that fit the function between these points. By choosing to have no smoothing factor we force the final B-spline to pass through all the points. If, on the other hand, we set a smoothing factor, our function is more of an approximation with the control points as "guidance". The latter produces a smoother curve, which is preferable for drawing software. For more on splines see: https://en.wikipedia.org/wiki/B-spline
We will use scipy.interpolate.splrep to calculate the coefficients for the B-Spline and draw it.
B-Spline with no smoothing¶
from scipy.interpolate import splev, splrep
x = np.linspace(0, 10, 10)
y = np.sin(x)
# (t,c,k) is a tuple containing the vector of knots, coefficients, degree of the spline
t,c,k = splrep(x, y)
x2 = np.linspace(0, 10, 200)
y2 = BSpline(t,c,k)
plt.plot(x, y, 'o', x2, y2(x2))
plt.show()
from scipy.interpolate import splrep
x = np.linspace(0, 10, 10)
y = np.sin(x)
t,c,k = splrep(x, y, k=3) # (tck) is a tuple containing the vector of knots, coefficients, degree of the spline
# define the points to plot on (x2)
print(f'Knots ({len(t)} of them): {t}\n')
print(f'B-Spline coefficients ({len(c)} of them): {c}\n')
print(f'B-Spline degree {k}')
x2 = np.linspace(0, 10, 100)
y2 = BSpline(t, c, k)
plt.figure(figsize=(10,5))
plt.plot(x, y, 'o', label='true points')
plt.plot(x2, y2(x2), label='B-Spline')
tt = np.zeros(len(t))
plt.plot(t, tt,'g*', label='knots eval by the function')
plt.legend()
plt.show()
What do the tuple values returned by scipy.interpolate.splrep mean?¶
- The t variable is the array that contains the knots' positions on the x axis. The length of this array is, of course, the number of knots.
- The c variable is the array that holds the coefficients for the B-Spline. Its length should be the same as that of t.
There are len(t) - k - 1 B-spline basis elements in the spline constructed via this method, and they are defined recursively as follows:
$$
\begin{aligned}
B_{i, 0}(x) &= 1, \textrm{ if $t_i \le x < t_{i+1}$, otherwise $0$,} \\
B_{i, k}(x) &= \frac{x - t_i}{t_{i+k} - t_i} B_{i, k-1}(x)
+ \frac{t_{i+k+1} - x}{t_{i+k+1} - t_{i+1}} B_{i+1, k-1}(x)
\end{aligned} $$
- t $= [t_1, t_2, ..., t_n]$ is the knot vector
- c : are the spline coefficients
- k : is the spline degree
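To make the recursion above concrete, scipy can evaluate a single basis element directly; a minimal sketch (not part of the original lab):
# minimal sketch: one cubic (k=3) basis element is defined by k+2 = 5 knots
from scipy.interpolate import BSpline
import numpy as np
import matplotlib.pyplot as plt
knots_elem = np.array([0., 1., 2., 3., 4.])
b_elem = BSpline.basis_element(knots_elem)
xx = np.linspace(0, 4, 200)
plt.plot(xx, b_elem(xx))
plt.title('One cubic B-spline basis element');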
B-Spline with smoothing factor s¶
from scipy.interpolate import splev, splrep
x = np.linspace(0, 10, 5)
y = np.sin(x)
s = 0.5 # add smoothing factor
task = 0 # task needs to be set to 0, which represents:
# we are specifying a smoothing factor and thus only want
# splrep() to find the optimal t and c
t,c,k = splrep(x, y, task=task, s=s)
# draw the line segments
linspline = interp1d(x, y)
# define the points to plot on (x2)
x2 = np.linspace(0, 10, 200)
y2 = BSpline(t, c, k)
plt.plot(x, y, 'o', x2, y2(x2))
plt.plot(x2, linspline(x2))
plt.show()
B-Spline with given knots¶
x = np.linspace(0, 10, 100)
y = np.sin(x)
knots = np.quantile(x, [0.25, 0.5, 0.75])
print(knots)
# calculate the B-Spline
t,c,k = splrep(x, y, t=knots)
curve = BSpline(t,c,k)
curve
plt.scatter(x=x,y=y,c='grey', alpha=0.4)
yknots = np.sin(knots)
plt.scatter(knots, yknots, c='r')
plt.plot(x,curve(x))
plt.show()
2 - GAMs¶
https://readthedocs.org/projects/pygam/downloads/pdf/latest/
Classification in pyGAM¶
Let's get our (multivariate!) data, the kyphosis dataset, and the LogisticGAM model from pyGAM to do binary classification.
- kyphosis - whether a particular deformation was present post-operation
- age - patient's age in months
- number - the number of vertebrae involved in the operation
- start - the number of the topmost vertebra operated on
kyphosis = pd.read_csv("data/kyphosis.csv")
display(kyphosis.head())
display(kyphosis.describe(include='all'))
display(kyphosis.dtypes)
# convert the outcome in a binary form, 1 or 0
kyphosis = pd.read_csv("data/kyphosis.csv")
kyphosis["outcome"] = 1*(kyphosis["Kyphosis"] == "present")
kyphosis.describe()
from pygam import LogisticGAM, s, f, l
X = kyphosis[["Age","Number","Start"]]
y = kyphosis["outcome"]
kyph_gam = LogisticGAM().fit(X,y)
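As a quick sanity check (a sketch; pyGAM classification models expose an accuracy() helper and a summary() method):
# training-set accuracy and a summary of the fitted terms
print(f'Training accuracy: {kyph_gam.accuracy(X, y):.3f}')
kyph_gam.summary()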
Outcome dependence on features¶
To help us see how the outcome depends on each feature, pyGAM has the partial_dependence() function.
pdep, confi = kyph_gam.partial_dependence(term=i, X=XX, width=0.95)
For more on this see the documentation: https://pygam.readthedocs.io/en/latest/api/logisticgam.html
res = kyph_gam.deviance_residuals(X,y)
for i, term in enumerate(kyph_gam.terms):
    if term.isintercept:
        continue
    XX = kyph_gam.generate_X_grid(term=i)
    pdep, confi = kyph_gam.partial_dependence(term=i, X=XX, width=0.95)
    pdep2, _ = kyph_gam.partial_dependence(term=i, X=X, width=0.95)
    plt.figure()
    plt.scatter(X.iloc[:,term.feature], pdep2 + res)
    plt.plot(XX[:, term.feature], pdep)
    plt.plot(XX[:, term.feature], confi, c='r', ls='--')
    plt.title(X.columns.values[term.feature])
    plt.show()
Notice that we did not specify the basis functions in the .fit(). pyGAM figures them out for us by using $s()$ (splines) for numerical variables and $f()$ for categorical features. If this is not what we want, we can manually specify the basis functions, as follows:
kyph_gam = LogisticGAM(s(0)+s(1)+s(2)).fit(X,y)
res = kyph_gam.deviance_residuals(X,y)
for i, term in enumerate(kyph_gam.terms):
    if term.isintercept:
        continue
    XX = kyph_gam.generate_X_grid(term=i)
    pdep, confi = kyph_gam.partial_dependence(term=i, X=XX, width=0.95)
    pdep2, _ = kyph_gam.partial_dependence(term=i, X=X, width=0.95)
    plt.figure()
    plt.scatter(X.iloc[:,term.feature], pdep2 + res)
    plt.plot(XX[:, term.feature], pdep)
    plt.plot(XX[:, term.feature], confi, c='r', ls='--')
    plt.title(X.columns.values[term.feature])
    plt.show()
Regression in pyGAM¶
For regression problems, we can use a LinearGAM model. For this part we will use the wages dataset.
The wages dataset¶
Let's inspect another dataset that is included in pyGAM that notes the wages of people based on their age, year of employment and education.
# from the pyGAM documentation
from pygam import LinearGAM, s, f
from pygam.datasets import wage
X, y = wage(return_X_y=True)
## model
gam = LinearGAM(s(0) + s(1) + f(2))
gam.gridsearch(X, y)
## plotting
plt.figure();
fig, axs = plt.subplots(1,3);
titles = ['year', 'age', 'education']
for i, ax in enumerate(axs):
    XX = gam.generate_X_grid(term=i)
    ax.plot(XX[:, i], gam.partial_dependence(term=i, X=XX))
    ax.plot(XX[:, i], gam.partial_dependence(term=i, X=XX, width=.95)[1], c='r', ls='--')
    if i == 0:
        ax.set_ylim(-30,30)
    ax.set_title(titles[i]);
3 - Smoothing Splines using csaps¶
Note: this is the spline model that minimizes
$$\text{MSE} + \lambda\cdot\text{wiggle penalty} \;=\; \sum_{i=1}^N \left(y_i - f(x_i)\right)^2 + \lambda \int \left(f''(t)\right)^2 dt,$$
across all possible functions $f$.
For csaps:
a) the input data should be strictly increasing, so no duplicate values (see https://csaps.readthedocs.io/en/latest/api.html). You need to sort the values in ascending order and remove duplicates.
b) the smoothing parameter is entered as $p$ (what we called $\lambda$ above), but it behaves like $1-\lambda$: for p=0 we get a straight line (maximum smoothing), and for p=1 there is no smoothing at all (the spline interpolates the data and overfits). A short sketch of the two extremes follows.
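Here is a minimal sketch (not part of the original lab) comparing the two extremes of the smoothing parameter on a small toy dataset:
# minimal sketch: p=0 (straight line) vs p=1 (interpolating spline)
from csaps import csaps
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 2*np.pi, 25)
y = np.sin(x) + 0.3*np.random.randn(len(x))
xs = np.linspace(x[0], x[-1], 200)
plt.plot(x, y, 'o', alpha=0.5)
plt.plot(xs, csaps(x, y, xs, smooth=0.0), label='p=0: straight line')
plt.plot(xs, csaps(x, y, xs, smooth=1.0), label='p=1: interpolating spline (no smoothing)')
plt.legend();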
from csaps import csaps
np.random.seed(1234)
x = np.linspace(0,10,300000)
y = np.sin(x*2*np.pi)*x + np.random.randn(len(x))
xs = np.linspace(x[0], x[-1], 1000)
ys = csaps(x, y, xs, smooth=0.99)
print(ys.shape)
#plt.plot(x, y, 'o', xs, ys, '-')
plt.plot(x, y, 'o', xs, ys, '-')
plt.show()
4 - Data fitting using pyGAM and Penalized B-Splines¶
When we use a spline in pyGAM we are effectively using a penalized B-Spline with a regularization parameter $\lambda$. E.g.
LogisticGAM(s(0)+s(1, lam=0.5)+s(2)).fit(X,y)
Let's see how this smoothing works in pyGAM. We start by creating some arbitrary data and fitting them with a GAM.
X = np.linspace(0,10,500)
y = np.sin(X*2*np.pi)*X + np.random.randn(len(X))
plt.scatter(X,y);
# let's try a large lambda first and lots of splines
gam = LinearGAM(lam=1e6, n_splines=50).fit(X,y)
XX = gam.generate_X_grid(term=0)
plt.scatter(X,y,alpha=0.3);
plt.plot(XX, gam.predict(XX));
We see that the large $\lambda$ forces a straight line, no flexibility. Let's see now what happens if we make it smaller.
# let's try a smaller lambda
gam = LinearGAM(lam=1e2, n_splines=50).fit(X,y)
XX = gam.generate_X_grid(term=0)
plt.scatter(X,y,alpha=0.3);
plt.plot(XX, gam.predict(XX));
There is some curvature there but still not a good fit. Let's try no penalty at all; with this many splines the curve should follow the data very closely.
# no penalty, let's try a 0 lambda
gam = LinearGAM(lam=0, n_splines=50).fit(X,y)
XX = gam.generate_X_grid(term=0)
plt.scatter(X,y,alpha=0.3)
plt.plot(XX, gam.predict(XX))
Yes, that is good. Now let's see what happens if we reduce the number of splines. The fit should not be as good.
# no penalty (lambda = 0), but fewer splines this time
gam = LinearGAM(lam=0, n_splines=10).fit(X,y)
XX = gam.generate_X_grid(term=0)
plt.scatter(X,y,alpha=0.3);
plt.plot(XX, gam.predict(XX));