Key Word(s): R, smooths, splines, GAMs



CS109B Data Science 2: Advanced Topics in Data Science

Lab 2 - Smoothers and Generalized Additive Models

Harvard University
Spring 2019
Instructors: Mark Glickman and Pavlos Protopapas
Lab Instructors: Will Claybaugh
Contributors: Paul Tyklin and Will Claybaugh


In [1]:
## RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2019-CS109B/master/content/styles/cs109.css").text
HTML(styles)
Out[1]:

Learning Goals

The main goal of this lab is to get familiar with calling R functions within Python. Along the way, we'll learn about the "formula" interface to statsmodels, which gives an intuitive way of specifying regression models, and we'll review the different approaches to fitting curves.

Key Skills:

  • Importing (base) R functions
  • Importing R library functions
  • Populating vectors R understands
  • Populating dataframes R understands
  • Populating formulas R understands
  • Running models in R
  • Getting results back to Python
  • Getting model predictions in R
  • Plotting in R
  • Reading R's documentation
In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline 

Linear/Polynomial Regression (Python, Review)

Hopefully, you remember working with statsmodels during 109a.

Reading data and (some) exploring in Pandas:

In [ ]:
diab = pd.read_csv("data/diabetes.csv")
print("""
# Variables are:
#   subject:   subject ID number
#   age:       age diagnosed with diabetes
#   acidity:   a measure of acidity called base deficit
#   y:         natural log of serum C-peptide concentration
#
# Original source is Sockett et al. (1987)
# mentioned in Hastie and Tibshirani's book 
# "Generalized Additive Models".
""")

display(diab.head())
display(diab.dtypes)
display(diab.describe())

Plotting with matplotlib:

In [ ]:
ax0 = diab.plot.scatter(x='age',y='y',c='Red',title="Diabetes data") #plotting directly from pandas!
ax0.set_xlabel("Age at Diagnosis")
ax0.set_ylabel("Log C-Peptide Concentration");

Linear regression with statsmodels.

  • Previously, we worked from a vector of target values and a design matrix we built ourselves (e.g. from PolynomialFeatures).
  • Now, statsmodels' formula interface can build the target vector and design matrix for you.
In [ ]:
#Using statsmodels
import statsmodels.formula.api as sm


model1 = sm.ols('y ~ age',data=diab)
fit1_lm = model1.fit()

Build a data frame to predict values on (sometimes this is just the test or validation set)

  • Very useful for making pretty plots of the model predictions -- predict for TONS of values, not just whatever's in the training set
In [ ]:
x_pred = np.linspace(0,16,100)

predict_df = pd.DataFrame(data={"age":x_pred})
predict_df.head()

Use get_prediction().summary_frame() to get the model's prediction (and error bars!)

In [ ]:
prediction_output = fit1_lm.get_prediction(predict_df).summary_frame()
prediction_output.head()

Plot the model and error bars

In [ ]:
ax1 = diab.plot.scatter(x='age',y='y',c='Red',title="Diabetes data with least-squares linear fit")
ax1.set_xlabel("Age at Diagnosis")
ax1.set_ylabel("Log C-Peptide Concentration")


ax1.plot(predict_df.age, prediction_output['mean'],color="green")
ax1.plot(predict_df.age, prediction_output['mean_ci_lower'], color="blue",linestyle="dashed")
ax1.plot(predict_df.age, prediction_output['mean_ci_upper'], color="blue",linestyle="dashed");

ax1.plot(predict_df.age, prediction_output['obs_ci_lower'], color="skyblue",linestyle="dashed")
ax1.plot(predict_df.age, prediction_output['obs_ci_upper'], color="skyblue",linestyle="dashed");
Discussion
  • What are the dark error bars?
  • What are the light error bars?
Exercise 1
  1. Fit a 3rd degree polynomial model and plot the model+error bars
    • Route1: Build a design df with a column for each of age, age**2, age**3
    • Route2: Just edit the formula

Answers:

1.

In [ ]:
# your code here

2.

In [ ]:
# your code here

Linear/Polynomial Regression, but make it R

This is the meat of the lab. After this section, we'll know everything we need in order to work with R models. The rest of the lab is just applying these concepts to run particular models. This section is therefore your 'cheat sheet' for working in R.

What we need to know:

  • Importing (base) R functions
  • Importing R Library functions
  • Populating vectors R understands
  • Populating DataFrames R understands
  • Populating Formulas R understands
  • Running models in R
  • Getting results back to Python
  • Getting model predictions in R
  • Plotting in R
  • Reading R's documentation

Importing R functions

In [ ]:
# if you're on JupyterHub you may need to specify the path to R

#import os
#os.environ['R_HOME'] = "/usr/share/anaconda3/lib/R"

import rpy2.robjects as robjects
In [ ]:
r_lm = robjects.r["lm"]
r_predict = robjects.r["predict"]
#r_plot = robjects.r["plot"] # more on plotting later

#lm() and predict() are two of the most common functions we'll use

Importing R libraries

In [ ]:
from rpy2.robjects.packages import importr
#r_cluster = importr('cluster')
#r_cluster.pam;
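
As a concrete example, here's a minimal sketch that imports R's built-in stats package (it ships with every R install, so nothing needs downloading) and calls one of its functions:

In [ ]:
# sketch: an importr'd package exposes its R functions as Python attributes
r_stats = importr('stats')
print(r_stats.rnorm(3)) # three standard-normal draws, generated in R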

Populating vectors R understands

In [ ]:
r_y = robjects.FloatVector(diab['y'])
r_age = robjects.FloatVector(diab['age'])
# What happens if we pass the wrong type?
# How does r_age display?
# How does r_age print?
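
For reference, rpy2 has sibling constructors for R's other atomic vector types (a quick illustrative sketch; all three classes live in rpy2.robjects):

In [ ]:
r_ints = robjects.IntVector([1, 2, 3])        # R integer vector
r_strs = robjects.StrVector(["a", "b", "c"])  # R character vector
r_bool = robjects.BoolVector([True, False])   # R logical vector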

Populating Data Frames R understands

In [ ]:
diab_r = robjects.DataFrame({"y":r_y, "age":r_age})
# How does diab_r display?
# How does diab_r print?

Populating formulas R understands

In [ ]:
simple_formula = robjects.Formula("y~age")
simple_formula.environment["y"] = r_y #populate the formula's .environment, so it knows what 'y' and 'age' refer to
simple_formula.environment["age"] = r_age

Running Models in R

In [ ]:
diab_lm = r_lm(formula=simple_formula) # the formula object is storing all the needed variables
In [ ]:
simple_formula = robjects.Formula("y~age") # reset the formula
diab_lm = r_lm(formula=simple_formula, data=diab_r) #can also use a 'dumb' formula and pass a dataframe

Getting results back to Python

In [ ]:
diab_lm #the result is already 'in' python, but it's a special object
In [ ]:
print(diab_lm.names) # view all names
In [ ]:
diab_lm[0] #grab the first element
In [ ]:
diab_lm.rx2("coefficients") #use rx2 to get elements by name!
In [ ]:
np.array(diab_lm.rx2("coefficients")) #r vectors can be converted to numpy (but rarely needed)
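
Since R returns a named vector of coefficients, we can also carry the names over into a labeled pandas Series (a small convenience sketch, not needed elsewhere in the lab):

In [ ]:
# sketch: R named vectors expose their names via the .names attribute
coef_vec = diab_lm.rx2("coefficients")
pd.Series(np.array(coef_vec), index=list(coef_vec.names))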

Getting Predictions

In [ ]:
# make a df to predict on (might just be the validation or test dataframe)
predict_df = robjects.DataFrame({"age": robjects.FloatVector(np.linspace(0,16,100))})

# call R's predict() function, passing the model and the data 
predictions = r_predict(diab_lm, predict_df)
In [ ]:
x_vals = predict_df.rx2("age")
In [ ]:
ax = diab.plot.scatter(x='age',y='y',c='Red',title="Diabetes data")
ax.set_xlabel("Age at Diagnosis")
ax.set_ylabel("Log C-Peptide Concentration");

ax.plot(x_vals,predictions); #plt still works with r vectors as input!

Plotting in R

In [ ]:
%load_ext rpy2.ipython
  • The above turns on the %R "magic"
  • R's plot() command responds differently based on what you hand to it; Different models get different plots!
    • For any specific model, search for plot.modelname; e.g. for a GAM model, search plot.gam for details on plotting it
  • The %R "magic" runs R code in 'notebook' mode, so figures display nicely
    • Ahead of the plot() code we pass in the variables R needs to know about (-i is for "input")
In [ ]:
%R -i diab_lm plot(diab_lm);

Reading R's documentation

The documentation for the lm() function is here, and a prettier version (same content) is here. When googling, prefer rdocumentation.org when possible. Sections:

  • Usage: gives the function signature, including all optional arguments
  • Arguments: what each function input controls
  • Details: additional info on what the function does and how arguments interact. Often the right place to start reading
  • Value: the structure of the object returned by the function
  • References: the relevant academic papers
  • See Also: other functions of interest
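
If you'd rather not leave the notebook, robjects.r() will evaluate any string as R code, so you can peek at a function's signature directly (a hedged sketch, not required for the lab):

In [ ]:
# sketch: R's args() returns the signature, deparse() renders it as text
print(robjects.r('paste(deparse(args(lm)), collapse=" ")')[0])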
Exercise 2
  1. Add confidence intervals calculated in R to the linear regression plot above. Use the interval= argument to r_predict() (documentation here). You will have to work with a matrix returned by R.
  2. Fit a 5th degree polynomial to the diabetes data in R. Search the web for an easier method than writing out a formula with all 5 polynomial terms.

Answers

1.

In [ ]:
# your code here

2.

In [ ]:
# your code here

Lowess Smoothing

Lowess Smoothing is implemented in both Python and R. We'll use it as another example as we transition languages.

Discussion
  • What is lowess smoothing? Which 109a models is it related to?
  • How explainable is lowess?
  • What are the tunable parameters?

In Python

In [ ]:
from statsmodels.nonparametric.smoothers_lowess import lowess as lowess

ss1 = lowess(diab['y'],diab['age'],frac=0.15)
ss2 = lowess(diab['y'],diab['age'],frac=0.25)
ss3 = lowess(diab['y'],diab['age'],frac=0.7)
ss4 = lowess(diab['y'],diab['age'],frac=1)
In [ ]:
ss1[:10,:] # we simply get back a smoothed y value for each x value in the data

Notice the clean code to plot different models. We'll see even cleaner code in a minute.

In [ ]:
for cur_model, cur_frac in zip([ss1,ss2,ss3,ss4],[0.15,0.25,0.7,1]):

    ax = diab.plot.scatter(x='age',y='y',c='Red',title="Lowess Fit, Fraction = {}".format(cur_frac))
    ax.set_xlabel("Age at Diagnosis")
    ax.set_ylabel("Log C-Peptide Concentration")
    ax.plot(cur_model[:,0],cur_model[:,1],color="blue")
    
    plt.show()
Discussion
  1. Which model has high variance, which has high bias?
  2. What makes a model high variance or high bias?

In R

We need to:

  • Import the loess function
  • Send data over to R
  • Call the function and get results
In [ ]:
r_loess = robjects.r['loess.smooth'] #extract R function
r_y = robjects.FloatVector(diab['y'])
r_age = robjects.FloatVector(diab['age'])

ss1_r = r_loess(r_age,r_y, span=0.15, degree=1)
In [ ]:
ss1_r #again, a smoothed y value for each x value in the data
Exercise 3

Predict the output of

  1. ss1_r[0]
  2. ss1_r.rx2("y")

1.

your answer here

2.

your answer here

Varying span
Next, some extremely clean code to fit and plot models with various parameter settings. (Though the zip() method seen earlier is great when e.g. the label and the parameter differ)

In [ ]:
for cur_frac in [0.15,0.25,0.7,1]:
    
    cur_smooth = r_loess(r_age,r_y, span=cur_frac)

    ax = diab.plot.scatter(x='age',y='y',c='Red',title="Lowess Fit, Fraction = {}".format(cur_frac))
    ax.set_xlabel("Age at Diagnosis")
    ax.set_ylabel("Log C-Peptide Concentration")
    ax.plot(cur_smooth[0], cur_smooth[1], color="blue")
    
    plt.show()
Discussion
  • Mark wasn't kidding; the Python and R results differ for frac=.15. Thoughts?
  • Why isn't the bottom plot a straight line? We're using 100% of the data in each window...

Smoothing Splines

From this point forward, we're working with R functions; these models aren't (well) supported in Python.

For clarity: this is the fancy spline model that minimizes $MSE + \lambda\cdot\text{wiggle penalty}$ $=$ $\sum_{i=1}^N \left(y_i - f(x_i)\right)^2 + \lambda \int \left(f''(x)\right)^2 dx$, across all possible functions $f$. The winner will always be a continuous piecewise-cubic polynomial with a knot at each data point.

Discussion
  • Any idea why the winner is cubic?
  • How interpretable is this model?
  • What are the tunable parameters?
In [ ]:
r_smooth_spline = robjects.r['smooth.spline'] #extract R function

# run smoothing function
spline1 = r_smooth_spline(r_age, r_y, spar=0)
Exercise 4
  1. We actually set the spar parameter, a scale-free value that translates to a $\lambda$ through a complex expression. Inspect the 'spline1' result and extract the implied value of $\lambda$
  2. Working from the fitting/plotting loop examples above, produce a plot like the one below for spar = [0,.5,.9,2], including axes labels and title.

1.

In [ ]:
# your answer here

2.

In [ ]:
# your answer here

CV
R's smooth.spline function has built-in CV to find a good $\lambda$. See the package docs.

In [ ]:
spline_cv = r_smooth_spline(r_age, r_y, cv=True) 

lambda_cv = spline_cv.rx2("lambda")[0]

ax19 = diab.plot.scatter(x='age',y='y',c='Red',title="smoothing spline with $\lambda=$"+str(np.round(lambda_cv,4))+", chosen by cross-validation")
ax19.set_xlabel("Age at Diagnosis")
ax19.set_ylabel("Log C-Peptide Concentration")
ax19.plot(spline_cv.rx2("x"),spline_cv.rx2("y"),color="darkgreen");
Discussion
  • Does the selected model look reasonable?
  • How would you describe the effect of age at diagnosis on C-peptide concentration?
  • What are the costs/benefits of the (fancy) spline model, relative to the linear regression we fit above?

Natural & Basis Splines

Here, we take a step backward on model complexity, but a step forward in coding complexity. We'll be working with R's formula interface again, so we will need to populate Formulas and DataFrames.

Discussion
  • In what way are Natural and Basis splines less complex than the splines we were just working with?
  • What makes a spline 'natural'?
  • What makes a spline 'basis'?
  • What are the tuning parameters?
In [ ]:
#We will now work with a new dataset, called GAGurine.
#The dataset description (from the R package MASS) is below:
#Data were collected on the concentration of a chemical GAG 
# in the urine of 314 children aged from zero to seventeen years. 
# The aim of the study was to produce a chart to help a paediatrician
# to assess if a child's GAG concentration is ‘normal’.

#The variables are:
# Age: age of child in years.
# GAG: concentration of GAG (the units have been lost).
In [ ]:
GAGurine = pd.read_csv("data/GAGurine.csv")
display(GAGurine.head())

ax31 = GAGurine.plot.scatter(x='Age',y='GAG',c='black',title="GAG in urine of children")
ax31.set_xlabel("Age");
ax31.set_ylabel("GAG");

Standard stuff: import function, convert variables to R format, call function

In [ ]:
from rpy2.robjects.packages import importr
r_splines = importr('splines')

# populate R variables
r_gag = robjects.FloatVector(GAGurine['GAG'].values)
r_age = robjects.FloatVector(GAGurine['Age'].values)
r_quarts = robjects.FloatVector(np.quantile(r_age,[.25,.5,.75])) #woah, numpy functions run on R objects!

What happens when we call the ns or bs functions from r_splines?

In [ ]:
ns_design = r_splines.ns(r_age, knots=r_quarts)
bs_design = r_splines.bs(r_age, knots=r_quarts)
In [ ]:
print(ns_design)

ns and bs return design matrices, not model objects! That's because they're meant to work with lm's formula interface. To get a model object, we populate a formula including ns() and fit it to data.

In [ ]:
r_lm = robjects.r['lm']
r_predict = robjects.r['predict']

# populate the formula
ns_formula = robjects.Formula("Gag ~ ns(Age, knots=r_quarts)")
ns_formula.environment['Gag'] = r_gag
ns_formula.environment['Age'] = r_age
ns_formula.environment['r_quarts'] = r_quarts
         
# fit the model
ns_model = r_lm(ns_formula)

Predict like usual: build a dataframe to predict on and call predict()

In [ ]:
# predict
predict_frame = robjects.DataFrame({"Age": robjects.FloatVector(np.linspace(0,20,100))})

ns_out = r_predict(ns_model, predict_frame)
In [ ]:
ax32 = GAGurine.plot.scatter(x='Age',y='GAG',c='grey',title="GAG in urine of children")
ax32.set_xlabel("Age")
ax32.set_ylabel("GAG")
ax32.plot(predict_frame.rx2("Age"),ns_out, color='red')
ax32.legend(["Natural spline, knots at quartiles"]);
Exercise 5
  1. Fit a basis spline model with the same knots, and add it to the plot above
  2. Fit a basis spline with 8 knots placed at [2,4,6...14,16] and add it to the plot above

Answers:
1.

In [ ]:
# your answer here

2.

In [ ]:
# your answer here
In [ ]:
#%R -i overfit_model plot(overfit_model)
# we'd get the same diagnostic plot we get from an lm model

GAMs

We come, at last, to our most advanced model. The coding here isn't any more complex than what we've done before, though the behind-the-scenes machinery is awesome.

First, let's get our (multivariate!) data

In [ ]:
kyphosis = pd.read_csv("data/kyphosis.csv")

print("""
# kyphosis - whether a particular deformation was present post-operation
# age - patient's age in months
# number - the number of vertebrae involved in the operation
# start - the number of the topmost vertebra operated on

""")
display(kyphosis.head())
display(kyphosis.describe(include='all'))
display(kyphosis.dtypes)
In [ ]:
#If there are errors about missing R packages, run the code below:

#r_utils = importr('utils')
#r_utils.install_packages('codetools')
#r_utils.install_packages('gam')

To fit a GAM, we

  • Import the gam library
  • Populate a formula including s() on variables we want to fit smooths for
  • Call gam(formula, family=) where family is a string naming a probability distribution, chosen based on how the response variable is thought to occur.
  • Rough family guidelines:
    • Response is binary or "N occurrences out of M tries", e.g. number of lab rats (out of 10) developing disease: choose "binomial"
    • Response is a count with no logical upper bound, e.g. number of ice creams sold: choose "poisson"
    • Response is real, with normally-distributed noise, e.g. person's height: choose "gaussian" (the default)
In [ ]:
#There is a Python library in development for using GAMs (https://github.com/dswah/pyGAM)
# but it is not yet as comprehensive as the R GAM library, which we will use here instead.

# R also has the mgcv library, which implements some more advanced/flexible fitting methods

r_gam_lib = importr('gam')
r_gam = r_gam_lib.gam

r_kyph = robjects.FactorVector(kyphosis[["Kyphosis"]].values)
r_Age = robjects.FloatVector(kyphosis[["Age"]].values)
r_Number = robjects.FloatVector(kyphosis[["Number"]].values)
r_Start = robjects.FloatVector(kyphosis[["Start"]].values)

kyph1_fmla = robjects.Formula("Kyphosis ~ s(Age) + s(Number) + s(Start)")
kyph1_fmla.environment['Kyphosis']=r_kyph
kyph1_fmla.environment['Age']=r_Age
kyph1_fmla.environment['Number']=r_Number
kyph1_fmla.environment['Start']=r_Start


kyph1_gam = r_gam(kyph1_fmla, family="binomial")

The fitted gam model has a lot of interesting data within it

In [ ]:
print(kyph1_gam.names)

Remember plotting? Calling R's plot() on a gam model is the easiest way to view the fitted splines

In [ ]:
%R -i kyph1_gam plot(kyph1_gam, residuals=TRUE,se=TRUE, scale=20);

Prediction works like normal (build a data frame to predict on, if you don't already have one, and call predict()). However, predict always reports the sum of the individual variable effects. If family is non-default this can be different from the actual prediction for that point.

For instance, we're doing a 'logistic regression', so the raw prediction is log odds, but we can get probabilities by using predict(..., type="response")

In [ ]:
kyph_new = robjects.DataFrame({'Age': robjects.IntVector((84,85,86)), 
                               'Start': robjects.IntVector((5,3,1)), 
                               'Number': robjects.IntVector((1,6,10))})

print("Raw response (so, Log odds):")
display(r_predict(kyph1_gam, kyph_new))
print("Scaled response (so, probabilty of kyphosis):")
display(r_predict(kyph1_gam, kyph_new, type="response"))
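
As a sanity check, the two outputs should agree through the logistic (inverse-logit) transform -- a quick illustrative snippet reusing the predictions above:

In [ ]:
# probability = 1 / (1 + exp(-log_odds)); should match the type="response" output
log_odds = np.array(r_predict(kyph1_gam, kyph_new))
print(1 / (1 + np.exp(-log_odds)))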
Exercise 6
  1. What lambda did we use?
  2. What is the model telling us about the effects of age, starting vertebra, and number of vertebrae operated on?
  3. If we fit a logistic regression instead, which variables might want quadratic terms? What are the costs and benefits of a logistic regression model versus a GAM?
  4. Critique the model:
    • What is it assuming? Are the assumptions reasonable?
    • Are we using the right data?
    • Does the model's story about the world make sense?

Appendix

GAMs and smoothing splines support hypothesis tests to compare models. (We can always compare models via out-of-sample prediction quality, i.e. performance on a validation set, but statistical tools like hypothesis tests and information criteria let us use all the data for training and still compare the quality of model A to model B.)

In [ ]:
r_anova = robjects.r["anova"]

kyph0_fmla = robjects.Formula("Kyphosis~1")
kyph0_fmla.environment['Kyphosis']=r_kyph

kyph0_gam = r_gam(kyph0_fmla, family="binomial")
print(r_anova(kyph0_gam, kyph1_gam, test="Chi"))

Explicitly joining spline functions

In [ ]:
def h(x, xi, pow_arg): #'pow' is a Python builtin, so we use pow_arg instead
    if (x > xi):
        return pow((x-xi),pow_arg)
    else:
        return 0
h = np.vectorize(h,otypes=[float]) #default behavior is to return ints, which gives incorrect answers (note: np.float is deprecated; plain float works)
#also, vectorize does not play nicely with default arguments, so better to set directly (e.g., pow_arg=1)
In [ ]:
xvals = np.arange(0,10.1,0.1)
ax20 = plt.plot(xvals,h(xvals,4,1),color="red")
_ = plt.title("Truncated linear basis function with knot at x=4")
_ = plt.xlabel("$x$")
_ = plt.ylabel("$(x-4)_+$") #note the use of TeX in the label
In [ ]:
ax21 = plt.plot(xvals,h(xvals,4,3),color="red")
_ = plt.title("Truncated cubic basis function with knot at x=4")
_ = plt.xlabel("$x$")
_ = plt.ylabel("$(x-4)_+^3$")
In [ ]:
ax22 = plt.plot(xvals,2+xvals+3*h(xvals,2,1)-4*h(xvals,5,1)+0.5*h(xvals,8,1),color="red")
_ = plt.title("Piecewise linear spline with knots at x=2, 5, and 8")
_ = plt.xlabel("$x$")
_ = plt.ylabel("$y$")

Comparing splines to the (noisy) model that generated them.

In [ ]:
x = np.arange(0.1,10,9.9/100) 
from scipy.stats import norm
#ppf (percent point function) is the rather unusual name for
#the quantile or inverse CDF function in SciPy
y = norm.ppf(x/10) + np.random.normal(0,0.4,100)
ax23 = plt.scatter(x,y,facecolors='none', edgecolors='black')
_ = plt.title("3 knots")
_ = plt.xlabel("$x$")
_ = plt.ylabel("$y$")
_ = plt.plot(x,sm.ols('y~x+h(x,2,1)+h(x,5,1)+h(x,8,1)',data={'x':x,'y':y}).fit().predict(),color="darkblue",linewidth=2)
_ = plt.plot(x,norm.ppf(x/10),color="red")
In [ ]:
ax24 = plt.scatter(x,y,facecolors='none', edgecolors='black')
_ = plt.title("6 knots")
_ = plt.xlabel("$x$")
_ = plt.ylabel("$y$")
_ = plt.plot(x,sm.ols('y~x+h(x,1,1)+h(x,2,1)+h(x,3.5,1)+h(x,5,1)+h(x,6.5,1)+h(x,8,1)',data={'x':x,'y':y}).fit().predict(),color="darkblue",linewidth=2)
_ = plt.plot(x,norm.ppf(x/10),color="red")
In [ ]:
ax25 = plt.scatter(x,y,facecolors='none', edgecolors='black')
_ = plt.title("9 knots")
_ = plt.xlabel("$x$")
_ = plt.ylabel("$y$")
_ = plt.plot(x,sm.ols('y~x+h(x,1,1)+h(x,2,1)+h(x,3,1)+h(x,4,1)+h(x,5,1)+h(x,6,1)+h(x,7,1)+h(x,8,1)+h(x,9,1)',data={'x':x,'y':y}).fit().predict(),color="darkblue",linewidth=2)
_ = plt.plot(x,norm.ppf(x/10),color="red")
In [ ]:
regstr = 'y~x+'
for i in range(1,26):
    regstr += 'h(x,'+str(i/26*10)+',1)+'
regstr = regstr[:-1] #drop last +
ax26 = plt.scatter(x,y,facecolors='none', edgecolors='black')
_ = plt.title("25 knots")
_ = plt.xlabel("$x$")
_ = plt.ylabel("$y$")
_ = plt.plot(x,sm.ols(regstr,data={'x':x,'y':y}).fit().predict(),color="darkblue",linewidth=2)
_ = plt.plot(x,norm.ppf(x/10),color="red")

Exercise:

Try generating random data from different distributions and fitting polynomials of different degrees to it. What do you observe?

In [ ]:
# try it here
In [ ]:
#So, we see that increasing the number of knots gives an increasingly flexible (and eventually overfitting) fit
In [ ]:
#Next, we look at cubic splines with increasing numbers of knots
ax27 = plt.scatter(x,y,facecolors='none', edgecolors='black')
_ = plt.title("3 knots")
_ = plt.xlabel("$x$")
_ = plt.ylabel("$y$")
_ = plt.plot(x,sm.ols('y~x+np.power(x,2)+np.power(x,3)+h(x,2,3)+h(x,5,3)+h(x,8,3)',data={'x':x,'y':y}).fit().predict(),color="darkblue",linewidth=2)
_ = plt.plot(x,norm.ppf(x/10),color="red")
In [ ]:
ax28 = plt.scatter(x,y,facecolors='none', edgecolors='black')
_ = plt.title("6 knots")
_ = plt.xlabel("$x$")
_ = plt.ylabel("$y$")
_ = plt.plot(x,sm.ols('y~x+np.power(x,2)+np.power(x,3)+h(x,1,3)+h(x,2,3)+h(x,3.5,3)+h(x,5,3)+h(x,6.5,3)+h(x,8,3)',data={'x':x,'y':y}).fit().predict(),color="darkblue",linewidth=2)
_ = plt.plot(x,norm.ppf(x/10),color="red")
In [ ]:
ax29 = plt.scatter(x,y,facecolors='none', edgecolors='black')
_ = plt.title("9 knots")
_ = plt.xlabel("$x$")
_ = plt.ylabel("$y$")
_ = plt.plot(x,sm.ols('y~x+np.power(x,2)+np.power(x,3)+h(x,1,3)+h(x,2,3)+h(x,3,3)+h(x,4,3)+h(x,5,3)+h(x,6,3)+h(x,7,3)+h(x,8,3)+h(x,9,3)',data={'x':x,'y':y}).fit().predict(),color="darkblue",linewidth=2)
_ = plt.plot(x,norm.ppf(x/10),color="red")
In [ ]:
regstr2 = 'y~x+np.power(x,2)+np.power(x,3)+'
for i in range(1,26):
    regstr2 += 'h(x,'+str(i/26*10)+',3)+'
regstr2 = regstr2[:-1] #drop last +
ax30 = plt.scatter(x,y,facecolors='none', edgecolors='black')
_ = plt.title("25 knots")
_ = plt.xlabel("$x$")
_ = plt.ylabel("$y$")
_ = plt.plot(x,sm.ols(regstr2,data={'x':x,'y':y}).fit().predict(),color="darkblue",linewidth=2)
_ = plt.plot(x,norm.ppf(x/10),color="red")