Description¶
Smoothing Example Notebook
This notebook provides example code based on the lecture material.
If you wish to run or edit the notebook, we recommend downloading it and running it either on your local machine or on JupyterHub.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.formula.api as sm
%matplotlib inline
diab = pd.read_csv("data/diabetes.csv")
print("""
# Variables are
# subject: subject ID number
# age: age diagnosed with diabetes
# acidity: a measure of acidity called base deficit
# y: natural log of serum C-peptide concentration
# Original source is Sockett et al. (1987)
# mentioned in Hastie and Tibshirani's book
# "Generalized Additive Models".
"""
)
diab.head()
diab.plot.scatter(x='age',y='y',c='Red',title="Diabetes data")
plt.xlabel("Age at Diagnosis")
plt.ylabel("Log C-Peptide Concentration")
plt.show()
Linear Regression¶
After a lot of time with SKLearn, we're moving back to Statsmodels because of its fuller implementation of statistical tools, like confidence intervals.
We'll also be using statsmodels' powerful formula interface. It lets one write complex models succinctly, without building design matrices by hand. Below, we write 'y~age' to mean "the y column is approximately $\beta_1$ times the age column, plus a constant $\beta_0$". We could include more columns, or transformations of the age column, e.g. 'y ~ age + np.power(age, 2) + acidity'.
fit1_lm = sm.ols('y~age',data=diab).fit()
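Statsmodels' statistical tooling is visible right away on the fitted object; as a small sketch, one can inspect the estimated coefficients and their 95% confidence intervals.
print(fit1_lm.params)     # beta_0 (Intercept) and beta_1 (age)
print(fit1_lm.conf_int()) # 95% confidence intervals for those coefficients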
We create a very dense set of values to predict on. Remember: the point of a model is to provide outputs for a wide variety of inputs. There's no need to predict only on the training or test values; the model can predict on anything you ask it to!
Further, when working with statsmodels it's important that we build a named data frame: if we only pass a numpy array, statsmodels won't know it's the 'age' variable.
xpred = pd.DataFrame({"age":np.arange(0,16.1,0.1)})
Plot the prediction line and confidence intervals
pred1 = fit1_lm.predict(xpred)
prediction_output = fit1_lm.get_prediction(xpred).summary_frame()
prediction_output.head()
Above, we use fitted_model.get_prediction().summary_frame() to get predictions and confidence intervals.
ax1 = diab.plot.scatter(x='age',y='y',c='Red',title="Diabetes data with least-squares linear fit")
ax1.set_xlabel("Age at Diagnosis")
ax1.set_ylabel("Log C-Peptide Concentration")
ax1.plot(xpred.age, prediction_output['mean'],color="green")
# CI for the prediction at each x value, i.e. the line itself
ax1.plot(xpred.age, prediction_output['mean_ci_lower'], color="blue",linestyle="dashed")
ax1.plot(xpred.age, prediction_output['mean_ci_upper'], color="blue",linestyle="dashed");
# CIs for where future data will fall
#ax1.plot(xpred.age, prediction_output['obs_ci_lower'], color="skyblue",linestyle="dashed")
#ax1.plot(xpred.age, prediction_output['obs_ci_upper'], color="skyblue",linestyle="dashed");
Polynomial Regression¶
In sklearn we would use PolynomialFeatures. Here we use np.vander to build the matrix of transformed inputs. Because vander is a numpy function, it is usable directly in the formula; it saves us from writing out y ~ age + np.power(age, 2) + np.power(age, 3) and so on.
vander is short for Vandermonde: a matrix whose first column is $x^0$, second column is $x^1$, third is $x^2$, and so on. For example:
np.vander([6,3,5], 4, increasing=True) =
array([[ 1, 6, 36, 216],
[ 1, 3, 9, 27],
[ 1, 5, 25, 125]])
Since vander already gives us a constant column, we put a -1 in the formula to drop the additional constant term statsmodels would otherwise insert.
Note that this is not an orthogonal polynomial basis, so our estimated coefficients will be more sensitive to the data than they need to be.
# *cubic* polynomial (vander's second argument is one more than the polynomial degree)
fit2_lm = sm.ols(formula="y ~ np.vander(age, 4, increasing=True) -1",data=diab).fit()
# the same model written out explicitly
fit2_lm_long = sm.ols(formula="y ~ age + np.power(age, 2) + np.power(age, 3)",data=diab).fit()
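As a quick sanity check (a sketch, not part of the original analysis): the two parameterizations span the same cubic-polynomial space, so their fitted values should agree even though the coefficients are organized differently.
# should print True (up to floating-point tolerance)
print(np.allclose(fit2_lm.fittedvalues, fit2_lm_long.fittedvalues))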
poly_predictions = fit2_lm.get_prediction(xpred).summary_frame()
poly_predictions
ax2 = diab.plot.scatter(x='age',y='y',c='Red',title="Diabetes data with least-squares cubic fit")
ax2.set_xlabel("Age at Diagnosis")
ax2.set_ylabel("Log C-Peptide Concentration")
# CI for the prediction at each x value, i.e. the curve itself
ax2.plot(xpred.age, poly_predictions['mean'],color="green")
ax2.plot(xpred.age, poly_predictions['mean_ci_lower'], color="blue",linestyle="dashed")
ax2.plot(xpred.age, poly_predictions['mean_ci_upper'], color="blue",linestyle="dashed");
# CIs for where future data will fall
#ax2.plot(xpred.age, poly_predictions['obs_ci_lower'], color="skyblue",linestyle="dashed")
#ax2.plot(xpred.age, poly_predictions['obs_ci_upper'], color="skyblue",linestyle="dashed");
Logistic Regression¶
Statsmodels provides logistic regression via the same formula-based interface.
For the sake of doing logistic regression, suppose we have the binary outcome "Was y above 4?"
diab['y_bin'] = 1*(diab['y'] > 4) # multiply by 1 because statsmodels wants 1s and 0s instead of true and false
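A quick sanity check (sketch) on the new binary outcome: how many observations fall in each class?
diab['y_bin'].value_counts()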
logit_model = sm.logit("y_bin ~ age", data = diab).fit()
logit_prediction = logit_model.predict(xpred)
There is no built-in get_prediction for logistic regression. The code provided below can be used as a replacement. Because it was custom-written by your loving TFs, it will handle polynomial models that use vander, but little else.
from scipy.special import expit
import re
def get_logit_prediction_intervals(model, new_data_df):
    if type(new_data_df) != pd.DataFrame:
        raise TypeError('new_data_df must be a DataFrame')
    # transform the raw data according to the formula
    new_data_dict = {}
    for x in model.params.index:
        # presently supports only Intercept, a named column, and polynomials created via np.vander;
        # the trick is finding the correct base column in the raw data
        if x == "Intercept":
            new_data_dict[x] = np.ones(new_data_df.shape[0])
        elif x.startswith("np.vander("):
            match = re.match(r"np.vander\((.*), ?(.*)\)\[(.*)\]", x)
            if match is None:
                raise ValueError("Couldn't parse formula-derived feature {}".format(x))
            column, power, index = match.groups()
            new_data_dict[x] = np.vander(new_data_df.loc[:, column], int(power))[:, int(index)]
        else:
            new_data_dict[x] = new_data_df.loc[:, x]
    new_data = pd.DataFrame(new_data_dict)
    # standard deviation of the linear predictor at each new point: sqrt(x^T Sigma x)
    variance_mat = model.cov_params()
    standard_devs = np.sqrt(np.sum(new_data.dot(variance_mat) * new_data, axis=1))
    linear_predictions = new_data.dot(model.params)
    # map the Wald interval endpoints through the logistic (expit) function
    output = pd.DataFrame({"lower": expit(linear_predictions - 1.96*standard_devs),
                           "predicted": expit(linear_predictions),
                           "upper": expit(linear_predictions + 1.96*standard_devs)
                           })
    return output
logit_prediction_intervals = get_logit_prediction_intervals(logit_model, xpred)
logit_prediction_intervals
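Under the hood, the function above builds the usual Wald interval on the linear-predictor scale, $x^\top\hat\beta \pm 1.96\sqrt{x^\top\hat\Sigma x}$, and then maps the endpoints through the logistic function. As a sketch of a sanity check, the center of each interval should match the model's own predictions:
print(np.allclose(logit_prediction_intervals["predicted"], logit_model.predict(xpred)))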
A pure logistic model
ax = diab.plot.scatter(x='age',y='y_bin',c='Red',title="Diabetes data with logit-linear fit")
ax.set_xlabel("Age at Diagnosis")
ax.set_ylabel("P(Log C-Peptide Concentration > 4)")
ax.plot(xpred.age, logit_prediction_intervals["predicted"],color="green")
ax.plot(xpred.age, logit_prediction_intervals["lower"], color="blue",linestyle="dashed")
ax.plot(xpred.age, logit_prediction_intervals["upper"], color="blue",linestyle="dashed");
plt.show()
A logistic model wherein the log-odds is a cubic function of Age
logit_poly_model = sm.logit("y_bin ~ np.vander(age, 4) - 1", data = diab).fit()
logit_poly_prediction = logit_poly_model.predict(xpred)
ax = diab.plot.scatter(x='age',y='y_bin',c='Red',title="Diabetes data with logit-cubic fit")
ax.set_xlabel("Age at Diagnosis")
ax.set_ylabel("P(Log C-Peptide Concentration > 4)")
logit_poly_prediction_intervals = get_logit_prediction_intervals(logit_poly_model, xpred)
ax.plot(xpred.age, logit_poly_prediction_intervals["predicted"],color="green")
ax.plot(xpred.age, logit_poly_prediction_intervals["lower"], color="blue",linestyle="dashed")
ax.plot(xpred.age, logit_poly_prediction_intervals["upper"], color="blue",linestyle="dashed");
plt.show()
Lo(w)ess¶
Lowess is available in Statsmodels. It takes a fraction of the data that should be used in smoothing each point. Please note that you are not responsible for mastering lowess - this is merely for your edification.
from statsmodels.nonparametric.smoothers_lowess import lowess as lowess
lowess_models = {}
for cur_frac in [.15, .25, .7, 1]:
    lowess_models[cur_frac] = lowess(diab['y'], diab['age'], frac=cur_frac)
Note that Python's lowess implementation does not have any tool to predict on new data; it only returns the fitted function's value at the training points. We make up for that by drawing a straight line between consecutive fitted values (scipy's interp1d). (There are other, more sophisticated interpolation techniques, but the ideal approach would be to predict on new points using lowess itself. This is a limitation of the Python implementation, not of lowess itself; R, for example, has a much fuller lowess toolkit.)
from scipy.interpolate import interp1d
for cur_frac, cur_model in lowess_models.items():
    ax = diab.plot.scatter(x='age', y='y', c='Red', title="Lowess Fit, Fraction = {}".format(cur_frac))
    ax.set_xlabel("Age at Diagnosis")
    ax.set_ylabel("Log C-Peptide Concentration")
    lowess_interpolation = interp1d(cur_model[:, 0], cur_model[:, 1], bounds_error=False)
    ax.plot(xpred, lowess_interpolation(xpred), color="Blue")
    plt.show()
As the fraction of data increases, Lowess moves from high-variance to high-bias
ax = diab.plot.scatter(x='age',y='y',c='Red',title="Large variance, low bias smoother")
ax.set_xlabel("Age at Diagnosis")
ax.set_ylabel("Log C-Peptide Concentration")
lowess_interpolation = interp1d(lowess_models[.15][:,0], lowess_models[.15][:,1], bounds_error=False)
ax.plot(xpred, lowess_interpolation(xpred), color="lightgreen")
plt.show()
ax = diab.plot.scatter(x='age',y='y',c='Red',title="Low variance, large bias smoother")
ax.set_xlabel("Age at Diagnosis")
ax.set_ylabel("Log C-Peptide Concentration")
lowess_interpolation = interp1d(lowess_models[1][:,0], lowess_models[1][:,1], bounds_error=False)
ax.plot(xpred, lowess_interpolation(xpred), color="lightgreen")
plt.show()
Splines (via knots)¶
Here, we flash back to OLS and logistic regression. So far, we've built "design" matrices by taking powers of the raw data's columns and asking the solver "how much of each transformed column is there?" That's fine if there's an a priori theoretical reason to think the relationship truly is polynomial, but in most cases we can use richer and better transformations of the raw data.
The function below is one such transformation. Depending on its parameters, it applies a ReLU (truncated linear) or truncated cubic function to the input data. Let's see what that looks like.
def h(x, knot, exponent):
    output = np.power(x - knot, exponent)
    output[x <= knot] = 0
    return output
Transforming the x values [0, 10] with a knot at 4 and power 1: everything below 4 is zeroed out, and values above 4 increase with slope 1. (If we had instead applied "square the data", we would see a parabola below.)
xvals = np.arange(0,10.1,0.1)
plt.plot(xvals, h(xvals,4,1), color="red")
plt.title("Truncated linear basis function with knot at x=4")
plt.xlabel("$x$")
plt.ylabel("$(x-4)_+$") #note the use of TeX in the label
plt.show()
Transforming the x values [0,10] with a knot at 4, power 3. Again, inputs below 4 are zeroed out, the rest grow cubically with their distance from 4.
plt.plot(xvals,h(xvals,4,3),color="red")
plt.title("Truncated cubic basis function with knot at x=4")
plt.xlabel("$x$")
plt.ylabel("$(x-4)_+^3$")
plt.show()
The sum of three RELUs with different knots and different coefficients can create a complicated shape, and thus fit complicated data. Below we see $3\cdot RELU(x,\text{knot=}2) - 4\cdot RELU(x,\text{knot=}5) + 0.5\cdot RELU(x,\text{knot=}8)$. Can you see how much the slope changes at each knot (i.e. at 2, 5, and 8)?
plt.plot(xvals, 3*h(xvals,2,1) - 4*h(xvals,5,1) + 0.5*h(xvals,8,1), color="red")
plt.title("Piecewise linear spline with knots at x=2, 5, and 8")
plt.xlabel("$x$")
plt.ylabel("$y$")
plt.show()
Above, but with a starting slope and intercept
(intercept=2, starting slope=1)
plt.plot(xvals, 2 + xvals + 3*h(xvals,2,1) - 4*h(xvals,5,1) + 0.5*h(xvals,8,1), color="red")
plt.title("Piecewise linear spline with knots at x=2, 5, and 8\n plus a starting slope and intercept")
plt.xlabel("$x$")
plt.ylabel("$y$")
plt.show()
Using OLS, we can find optimal coefficients for RELUs with pre-specified knots, just like we can find optimal coefficients for $x^2$ and $x^3$
# generate some fake data to fit
x = np.arange(0.1,10,9.9/100)
from scipy.stats import norm
y = norm.ppf(x/10) + np.random.normal(0,0.4,100)
Notice that we can apply the function h(x) directly in the formula, because we defined it earlier in the notebook.
fitted_spline_model = sm.ols('y~x+h(x,2,1)+h(x,5,1)+h(x,8,1)',data={'x':x,'y':y}).fit()
plt.scatter(x,y,facecolors='none', edgecolors='black')
plt.title("3 knots")
plt.xlabel("$x$")
plt.ylabel("$y$")
plt.plot(x, fitted_spline_model.predict(),color="darkblue", linewidth=2, label="Spline with knots at 2,5,8")
plt.plot(x, norm.ppf(x/10), color="red", label="Truth")
plt.legend()
plt.show()
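The "optimal coefficients" referred to above are just ordinary least-squares coefficients; as a sketch, one can inspect them directly on the fitted model.
print(fitted_spline_model.params)  # intercept, the slope on x, and one slope-change per knot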
More knots
fitted_spline_model = sm.ols('y~x+h(x,1,1)+h(x,2,1)+h(x,3.5,1)+h(x,5,1)+h(x,6.5,1)+h(x,8,1)',data={'x':x,'y':y}).fit()
plt.scatter(x,y,facecolors='none', edgecolors='black')
plt.title("6 knots")
plt.xlabel("$x$")
plt.ylabel("$y$")
plt.plot(x, fitted_spline_model.predict(),color="darkblue", label="Linear Spline with knots at\n1, 2, 3.5, 5, 6.5, 8")
plt.plot(x, norm.ppf(x/10), color="red", label="Truth")
plt.legend()
plt.show()
More knots (writing out all the knots in this formula is getting old...)
fitted_spline_model = sm.ols('y~x+h(x,1,1)+h(x,2,1)+h(x,3,1)+h(x,4,1)+h(x,5,1)+h(x,6,1)+h(x,7,1)+h(x,8,1)+h(x,9,1)',
data={'x':x,'y':y}).fit()
plt.scatter(x,y,facecolors='none', edgecolors='black')
plt.title("9 knots")
plt.xlabel("$x$")
plt.ylabel("$y$")
plt.plot(x, fitted_spline_model.predict(),color="darkblue", label="Linear Spline with 9 knots")
plt.plot(x, norm.ppf(x/10), color="red", label="Truth")
plt.legend()
plt.show()
Using code to write out the formula this time. Comments are provided for those who are new to Python.
n_knots = 25
# make a list of strings that each look like 'h(x,?,1)' where ? takes the values in np.linspace(0,10,n_knots)
components = ['h(x,{},1)'.format(x) for x in np.linspace(0,10,n_knots)]
# glue all the strings in 'components' together with " + " between each
formula = ' + '.join(components)
# paste a 'y ~ x + ' in front of the above. Now we've got a full formula!
final_formula = 'y ~ x + ' + formula
final_formula
fitted_spline_model = sm.ols(final_formula,data={'x':x,'y':y}).fit()
plt.scatter(x,y,facecolors='none', edgecolors='black')
plt.title("25 knots")
plt.xlabel("$x$")
plt.ylabel("$y$")
plt.plot(x, fitted_spline_model.predict(),color="darkblue", label="Linear Spline with 25 knots")
plt.plot(x, norm.ppf(x/10), color="red", label="Truth")
plt.legend()
plt.show()
Cubic splines, instead of linear. Still using code to write the formula. [Knots at 2,5,8, each one cubic; starting intercept, slope, acceleration, and jerk]
components = ['h(x,{},3)'.format(x) for x in [2,5,8]]
formula = ' + '.join(components)
final_formula = 'y~x + np.power(x,2) + np.power(x,3) + ' + formula
fitted_spline_model = sm.ols(final_formula,data={'x':x,'y':y}).fit()
plt.scatter(x,y,facecolors='none', edgecolors='black')
plt.title("3 knots")
plt.xlabel("$x$")
plt.ylabel("$y$")
plt.plot(x, fitted_spline_model.predict(),color="darkblue", label="Cubic Spline with 3 knots")
plt.plot(x, norm.ppf(x/10), color="red", label="Truth")
plt.legend()
plt.show()
As above, but with more knots (a more flexible fit)
components = ['h(x,{},3)'.format(x) for x in [1,2,3.5,5,6.5,8]]
formula = ' + '.join(components)
final_formula = 'y~x + np.power(x,2) + np.power(x,3) + ' + formula
fitted_spline_model = sm.ols(final_formula,data={'x':x,'y':y}).fit()
plt.scatter(x,y,facecolors='none', edgecolors='black')
plt.title("6 knots")
plt.xlabel("$x$")
plt.ylabel("$y$")
plt.plot(x, fitted_spline_model.predict(),color="darkblue", label="Cubic Spline with 6 knots")
plt.plot(x, norm.ppf(x/10), color="red", label="Truth")
plt.legend()
plt.show()
More knots
n_knots = 9
components = ['h(x,{},3)'.format(x) for x in np.linspace(0,10,n_knots)]
formula = ' + '.join(components)
final_formula = 'y~x + np.power(x,2) + np.power(x,3) + ' + formula
fitted_spline_model = sm.ols(final_formula,data={'x':x,'y':y}).fit()
plt.scatter(x,y,facecolors='none', edgecolors='black')
plt.title("9 knots")
plt.xlabel("$x$")
plt.ylabel("$y$")
plt.plot(x, fitted_spline_model.predict(),color="darkblue", label="Cubic Spline with 9 knots")
plt.plot(x, norm.ppf(x/10), color="red", label="Truth")
plt.legend()
plt.show()
Even more knots. As with the linear splines, this looks like it is overfitting a lot.
n_knots = 25
components = ['h(x,{},3)'.format(x) for x in np.linspace(0,10,n_knots)]
formula = ' + '.join(components)
final_formula = 'y~x + np.power(x,2) + np.power(x,3) + ' + formula
fitted_spline_model = sm.ols(final_formula,data={'x':x,'y':y}).fit()
plt.scatter(x,y,facecolors='none', edgecolors='black')
plt.title("25 knots")
plt.xlabel("$x$")
plt.ylabel("$y$")
plt.plot(x, fitted_spline_model.predict(),color="darkblue", label="Cubic Spline with 25 knots")
plt.plot(x, norm.ppf(x/10), color="red", label="Truth")
plt.legend()
plt.show()
It turns out that the number of knots matters far more than where they are placed. You can play with the code above to verify if you wish.
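A quick way to probe that claim (a sketch; the specific placements below are arbitrary choices of ours, reusing the fake x, y data and the h function from above): fit cubic splines with the same number of knots placed evenly versus at quantiles of x, and compare the in-sample fit.
# same number of knots, two different placements -- the fits are typically very similar
for name, knots in [("evenly spaced", np.linspace(1, 9, 6)),
                    ("at quantiles", np.quantile(x, np.linspace(0.1, 0.9, 6)))]:
    knot_terms = ' + '.join('h(x,{:.2f},3)'.format(k) for k in knots)
    fit = sm.ols('y ~ x + np.power(x,2) + np.power(x,3) + ' + knot_terms,
                 data={'x': x, 'y': y}).fit()
    print(name, ": R^2 = {:.3f}".format(fit.rsquared))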
Smoothing splines¶
Smoothing splines, as described in class, minimize the sum of squared errors plus a penalty that depends on the wiggliness of the function. The resulting solution is a cubic spline with knots at every data value, regularized according to a smoothing parameter.
We use the csaps library to implement smoothing splines. The smoothing parameter takes values between 0 and 1 and is the weight attached to the error sum of squares in the weighted average of the error sum of squares and the wiggliness penalty. A smoothing parameter of 0 corresponds to a least-squares fit, and a parameter of 1 corresponds to connecting the dots.
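In symbols (hedging on the exact weighting and normalization csaps uses internally), with smoothing parameter $p$ the fitted function $f$ roughly minimizes
$$p\sum_i \bigl(y_i - f(x_i)\bigr)^2 \;+\; (1-p)\int \bigl(f''(t)\bigr)^2\,dt,$$
so $p = 0$ puts all the weight on smoothness (a least-squares straight line) and $p = 1$ drops the penalty entirely (an interpolating cubic spline).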
import matplotlib.pyplot as plt
from csaps import csaps
Read in csaps to carry out smoothing splines.
# jitter ages very slightly so there are no ties, then sort by age (csaps needs strictly increasing x)
diab_sort = diab.copy()
diab_sort['age'] = diab_sort['age'] + np.random.normal(0, 0.001, len(diab_sort))
diab_sort = diab_sort.sort_values(['age'])
x = diab_sort['age']
y = diab_sort['y']
xs = np.linspace(min(x), max(x), len(x))
for cur_smoothing in [0, 0.2, 0.4, 0.6, 0.8, 0.95]:
    ys = csaps(x, y, xs, smooth=cur_smoothing)
    plt.plot(x, y, 'o', xs, ys, '-')
    plt.title("Smoothing spline, smoothing parameter = {}".format(cur_smoothing))
    plt.xlabel("Age")
    plt.ylabel("y")
    plt.show()
We can select the smoothing parameter by CV.
from sklearn.model_selection import KFold
candidate_smoothings = [0, 0.2, 0.4, 0.6, 0.8, 0.95]
kf = KFold(n_splits=5, random_state=47, shuffle=True)
scores = np.zeros((5,len(candidate_smoothings)))
for i, (train_index, test_index) in enumerate(kf.split(diab_sort)):
    train_df = diab_sort.iloc[train_index, :]
    test_df = diab_sort.iloc[test_index, :]
    for j, cur_smoothing in enumerate(candidate_smoothings):
        train_df_sort = train_df.sort_values(['age'])
        test_df_sort = test_df.sort_values(['age'])
        x = train_df_sort['age']
        y = train_df_sort['y']
        xs = test_df_sort['age']
        ys = csaps(x, y, xs, smooth=cur_smoothing)
        scores[i, j] = sum((test_df_sort['y'] - ys)**2)  # compare against the *sorted* test y values
np.mean(scores, axis=0)
best_s = candidate_smoothings[np.argmin(np.mean(scores, axis=0))]
x = diab_sort['age']
y = diab_sort['y']
xs = np.linspace(min(x), max(x), len(x))
ys = csaps(diab_sort['age'], diab_sort['y'], xs, smooth=best_s)
plt.plot(x, y, 'o', xs, ys, '-')
plt.title("Smoothing spline, optimal smoothing parameter = {}".format(best_s))
plt.xlabel("Age")
plt.ylabel("y")
plt.show()
B-Splines¶
Back to the diabetes data. First, let's see the data
diab_sort = diab.copy()
diab_sort['age'] = diab_sort['age'] + np.random.normal(0, 0.001, len(diab_sort))  # again, ties not allowed
diab_sort = diab_sort.sort_values(['age'])
ax = diab_sort.plot.scatter(x='age',y='y',c='Red',title="Diabetes data")
ax.set_xlabel("Age at Diagnosis")
ax.set_ylabel("Log C-Peptide Concentration")
plt.show()
Get quartiles
quarts = diab_sort['age'].quantile([0.25, 0.5, 0.75]).values.reshape(-1)
Build a B-spline model. Call splrep (spline representation) to find the knots and coefficients that smooth the given data, then call BSpline to build something that can predict on given values. Notice that when we're done, b_spline_model can be called like a function.
from scipy.interpolate import splrep
from scipy.interpolate import BSpline
t,c,k = splrep(diab_sort['age'].values, diab_sort['y'].values, t=quarts)
b_spline_model = BSpline(t,c,k)
b_spline_model(7)
LSQUnivariateSpline fits a least-squares spline to the data, using user-specified interior knots.
from scipy.interpolate import LSQUnivariateSpline
natural_spline_model = LSQUnivariateSpline(diab_sort['age'].values, diab_sort['y'].values, quarts)
ax = diab_sort.plot.scatter(x='age',y='y',c='grey',title="Diabetes data")
ax.plot(diab_sort['age'], b_spline_model(diab_sort['age']), label="B-spline, knots at quartiles")
plt.legend()
plt.show()
ax = diab_sort.plot.scatter(x='age',y='y',c='grey',title="Diabetes data")
ax.plot(diab_sort['age'], natural_spline_model(diab_sort['age']), label="Natural Spline, knots at quartiles")
plt.legend()
plt.show()
GAMs¶
Generalized Aditive Models essentially provide spline-like fits when there are multiple input variables. We use the PyGam library, which relies on B-splines (really penalized B-splines, a.k.a. P-splines) as the smoother of choice.
Here we work with Kyphosis data, on 81 children who received a corrective spinal surgery. Each row records the child's age, the number of vertebrae operated on, the first vertebrae involved in the operation, and whether the operation was a success or experienced complications.
from sklearn.model_selection import train_test_split
kyphosis = pd.read_csv("data/kyphosis.csv")
kyphosis["outcome"] = 1*(kyphosis["Kyphosis"] == "present")
kyph_train, kyph_test = train_test_split(kyphosis, test_size=.2, stratify=kyphosis['outcome'])
kyph_train.describe()
Using pygam is a lot like using the formula interface in statsmodels, but with raw code instead of strings. Instead of 's(Age)+s(Number)+s(Start)' we write s(0)+s(1)+s(2), using the corresponding column indices. s is for 'smooth'. Each smooth accepts a lam parameter specifying the degree of smoothing; as before, larger lambdas mean smoother curves.
from pygam import LogisticGAM, s
X = kyph_train[["Age","Number","Start"]]
y = kyph_train["outcome"]
kyph_gam = LogisticGAM(s(0)+s(1, lam=0.5)+s(2)).fit(X,y)
Correct classification rate is pretty good!
kyph_gam.accuracy(X,y)
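For context (a sketch), compare with the accuracy of always predicting the majority class on the training split:
# baseline accuracy: always predict the more common outcome
max(y.mean(), 1 - y.mean())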
Summary of GAM fit on training data:
kyph_gam.summary()
GAMs provide plots of the effect of increasing each variable (conditional on / adjusted for the other variables)
res = kyph_gam.deviance_residuals(X,y)
for i, term in enumerate(kyph_gam.terms):
    if term.isintercept:
        continue
    XX = kyph_gam.generate_X_grid(term=i)
    pdep, confi = kyph_gam.partial_dependence(term=i, X=XX, width=0.95)
    pdep2, _ = kyph_gam.partial_dependence(term=i, X=X, width=0.95)
    plt.figure()
    plt.plot(XX[:, term.feature], pdep)
    plt.plot(XX[:, term.feature], confi, c='r', ls='--')
    plt.title(X.columns.values[term.feature])
    plt.show()
Above, we see that as Age ranges from 1 to about 200, the fitted effect first increases and then decreases, though with wide error bars. Moreover, Number appears to have little or no effect until it reaches about 12, after which the fitted curve drops sharply.
Below, we try a model with only "Age" and "Start". In just a moment we'll compare the two models via cross-validation.
X = kyph_train[["Age","Number","Start"]]
y = kyph_train["outcome"]
small_kyph_gam = LogisticGAM(s(0)+s(2)).fit(X,y)
res = small_kyph_gam.deviance_residuals(X,y)
for i, term in enumerate(small_kyph_gam.terms):
    if term.isintercept:
        continue
    XX = small_kyph_gam.generate_X_grid(term=i)
    pdep, confi = small_kyph_gam.partial_dependence(term=i, X=XX, width=0.95)
    pdep2, _ = small_kyph_gam.partial_dependence(term=i, X=X, width=0.95)
    plt.figure()
    plt.plot(XX[:, term.feature], pdep)
    plt.plot(XX[:, term.feature], confi, c='r', ls='--')
    plt.title(X.columns.values[term.feature])
    plt.show()
from sklearn.metrics import accuracy_score
X_test = kyph_test[["Age","Number","Start"]]
y_test = kyph_test["outcome"]
acc = accuracy_score(y_test, kyph_gam.predict(X_test))
acc_small = accuracy_score(y_test, small_kyph_gam.predict(X_test))
print("Test Accuracy m1: {:0.2f}, m2: {:0.2f}".format(acc, acc_small))
We find that the richer model has a higher accuracy on the test set (it gets rid of one false positive present in the smaller model). Depending on taste, this may or may not be enough to declare the larger model the better one.
A cross-validation (below) would make better use of the small data than a simple train-test split. (We again stratify on the outcome to ensure all test sets have a representative number of kyphosis cases)
from sklearn.model_selection import StratifiedKFold
kf = StratifiedKFold(n_splits=5, random_state=47, shuffle=True)
scores = np.zeros((5,2))
for i, (train_index, test_index) in enumerate(kf.split(kyphosis, kyphosis['outcome'])):
    train_df = kyphosis.iloc[train_index, :]
    test_df = kyphosis.iloc[test_index, :]
    # with all three features (using lower smoothing on 'Number' to prevent errors while fitting)
    cur_model_all = LogisticGAM(s(0)+s(1, lam=.5)+s(2)).fit(train_df[['Age', 'Number', 'Start']], train_df['outcome'])
    scores[i, 0] = accuracy_score(test_df['outcome'], cur_model_all.predict(test_df[['Age', 'Number', 'Start']]))
    # dropping 'Number' (Age is column 0, Start is column 2)
    cur_model_some = LogisticGAM(s(0)+s(2)).fit(train_df[['Age', 'Number', 'Start']], train_df['outcome'])
    scores[i, 1] = accuracy_score(test_df['outcome'], cur_model_some.predict(test_df[['Age', 'Number', 'Start']]))
print("Average accuracy", np.mean(scores, axis=0))
plt.scatter([0]*5, scores[:,0])
plt.scatter([1]*5, scores[:,1])
plt.xlabel("Model")
plt.ylabel("CV Accuracy")
plt.show()
Taking a naive average, it appears that the second model (without 'Number') has the better accuracy. However, plotting the individual CV scores shows that the two models aren't particularly different. As the test is inconclusive, one might choose the model without 'Number' since it does not appear to be noticeably weaker, or one might stick with the richer model since it does not noticeably overfit.