CS109B Data Science 2: Advanced Topics in Data Science
Lab 2 - Smoothers and Generalized Additive Models¶
Harvard University
Spring 2019
Instructors: Mark Glickman and Pavlos Protopapas
Lab Instructors: Will Claybaugh
Contributors: Paul Tyklin and Will Claybaugh
## RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2019-CS109B/master/content/styles/cs109.css").text
HTML(styles)
Learning Goals¶
The main goal of this lab is to get familiar with calling R functions within Python. Along the way, we'll learn about the "formula" interface to statsmodels, which gives an intuitive way of specifying regression models, and we'll review the different approaches to fitting curves.
Key Skills:
- Importing (base) R functions
- Importing R library functions
- Populating vectors R understands
- Populating dataframes R understands
- Populating formulas R understands
- Running models in R
- Getting results back to Python
- Getting model predictions in R
- Plotting in R
- Reading R's documentation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
Linear/Polynomial Regression (Python, Review)¶
Hopefully, you remember working with Statsmodels during 109a.
Reading data and (some) exploring in Pandas:
diab = pd.read_csv("data/diabetes.csv")
print("""
# Variables are:
# subject: subject ID number
# age: age diagnosed with diabetes
# acidity: a measure of acidity called base deficit
# y: natural log of serum C-peptide concentration
#
# Original source is Sockett et al. (1987)
# mentioned in Hastie and Tibshirani's book
# "Generalized Additive Models".
""")
display(diab.head())
display(diab.dtypes)
display(diab.describe())
Plotting with matplotlib:
ax0 = diab.plot.scatter(x='age',y='y',c='Red',title="Diabetes data") #plotting directly from pandas!
ax0.set_xlabel("Age at Diagnosis")
ax0.set_ylabel("Log C-Peptide Concentration");
Linear regression with statsmodels.
- Previously, we worked from a vector of target values and a design matrix we built ourselves (e.g. from PolynomialFeatures).
- Now, Statsmodels' formula interface builds the target vector and design matrix for you.
#Using statsmodels
import statsmodels.formula.api as sm
model1 = sm.ols('y ~ age',data=diab)
fit1_lm = model1.fit()
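If you want a quick look at the fitted coefficients, standard errors, and R-squared, the fit object exposes statsmodels' usual summary() method (an optional check, not needed for the rest of the lab):
# optional: inspect coefficients, standard errors, and R^2 of the fit
print(fit1_lm.summary())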
Build a data frame to predict values on (sometimes this is just the test or validation set)
- Very useful for making pretty plots of the model predictions -- predict for TONS of values, not just whatever's in the training set
x_pred = np.linspace(0,16,100)
predict_df = pd.DataFrame(data={"age":x_pred})
predict_df.head()
Use get_prediction().summary_frame()
to get the model's prediction (and error bars!)
prediction_output = fit1_lm.get_prediction(predict_df).summary_frame()
prediction_output.head()
Plot the model and error bars
ax1 = diab.plot.scatter(x='age',y='y',c='Red',title="Diabetes data with least-squares linear fit")
ax1.set_xlabel("Age at Diagnosis")
ax1.set_ylabel("Log C-Peptide Concentration")
ax1.plot(predict_df.age, prediction_output['mean'],color="green")
ax1.plot(predict_df.age, prediction_output['mean_ci_lower'], color="blue",linestyle="dashed")
ax1.plot(predict_df.age, prediction_output['mean_ci_upper'], color="blue",linestyle="dashed");
ax1.plot(predict_df.age, prediction_output['obs_ci_lower'], color="skyblue",linestyle="dashed")
ax1.plot(predict_df.age, prediction_output['obs_ci_upper'], color="skyblue",linestyle="dashed");
- What are the dark error bars?
- What are the light error bars?
- Fit a 3rd degree polynomial model and plot the model + error bars
  - Route 1: Build a design df with a column for each of age, age**2, and age**3
  - Route 2: Just edit the formula
Answers:
1.
# your code here
2.
# your code here
Linear/Polynomial Regression, but make it R¶
This is the meat of the lab. After this section we'll know everything we need to in order to work with R models. The rest of the lab is just applying these concepts to run particular models. This section therefore is your 'cheat sheet' for working in R.
What we need to know:
- Importing (base) R functions
- Importing R Library functions
- Populating vectors R understands
- Populating DataFrames R understands
- Populating Formulas R understands
- Running models in R
- Getting results back to Python
- Getting model predictions in R
- Plotting in R
- Reading R's documentation
Importing R functions
# if you're on JupyterHub you may need to specify the path to R
#import os
#os.environ['R_HOME'] = "/usr/share/anaconda3/lib/R"
import rpy2.robjects as robjects
r_lm = robjects.r["lm"]
r_predict = robjects.r["predict"]
#r_plot = robjects.r["plot"] # more on plotting later
#lm() and predict() are two of the most common functions we'll use
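Any R function pulled out this way is an ordinary Python callable; as a tiny optional sanity check (using only base R's mean()), you can call one on an R vector you build yourself:
# optional sanity check: call a base R function on an R vector from Python
r_mean = robjects.r["mean"]
print(r_mean(robjects.FloatVector([1.0, 2.0, 3.0]))[0])  # 2.0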
Importing R libraries
from rpy2.robjects.packages import importr
#r_cluster = importr('cluster')
#r_cluster.pam;
Populating vectors R understands
r_y = robjects.FloatVector(diab['y'])
r_age = robjects.FloatVector(diab['age'])
# What happens if we pass the wrong type?
# How does r_age display?
# How does r_age print?
Populating Data Frames R understands
diab_r = robjects.DataFrame({"y":r_y, "age":r_age})
# How does diab_r display?
# How does diab_r print?
Populating formulas R understands
simple_formula = robjects.Formula("y~age")
simple_formula.environment["y"] = r_y #populate the formula's .environment, so it knows what 'y' and 'age' refer to
simple_formula.environment["age"] = r_age
Running Models in R
diab_lm = r_lm(formula=simple_formula) # the formula object is storing all the needed variables
simple_formula = robjects.Formula("y~age") # reset the formula
diab_lm = r_lm(formula=simple_formula, data=diab_r) #can also use a 'dumb' formula and pass a dataframe
Getting results back to Python
diab_lm #the result is already 'in' python, but it's a special object
print(diab_lm.names) # view all names
diab_lm[0] #grab the first element
diab_lm.rx2("coefficients") #use rx2 to get elements by name!
np.array(diab_lm.rx2("coefficients")) #r vectors can be converted to numpy (but rarely needed)
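If you want the coefficients with their labels on the Python side, one convenient (optional) pattern is to wrap the values and the R names attribute in a pandas Series; this sketch assumes the coefficient vector carries R names, as lm's output does:
# optional: a labeled pandas Series of the R model's coefficients
r_coefs = diab_lm.rx2("coefficients")
coef_series = pd.Series(np.array(r_coefs), index=list(r_coefs.names))
print(coef_series)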
Getting Predictions
# make a df to predict on (might just be the validation or test dataframe)
predict_df = robjects.DataFrame({"age": robjects.FloatVector(np.linspace(0,16,100))})
# call R's predict() function, passing the model and the data
predictions = r_predict(diab_lm, predict_df)
x_vals = predict_df.rx2("age")
ax = diab.plot.scatter(x='age',y='y',c='Red',title="Diabetes data")
ax.set_xlabel("Age at Diagnosis")
ax.set_ylabel("Log C-Peptide Concentration");
ax.plot(x_vals,predictions); #plt still works with r vectors as input!
Plotting in R
%load_ext rpy2.ipython
- The above turns on the %R "magic"
- R's plot() command responds differently based on what you hand to it; different models get different plots!
- For details on plotting a specific model, search for plot.modelname, e.g. plot.gam for plotting a GAM model
- The %R "magic" runs R code in 'notebook' mode, so figures display nicely
- Ahead of the plot() code we pass in the variables R needs to know about (-i is for "input")
%R -i diab_lm plot(diab_lm);
Reading R's documentation
The documentation for the lm() function is here, and a prettier version (same content) is here. When googling, prefer rdocumentation.org when possible.
Sections:
- Usage: gives the function signature, including all optional arguments
- Arguments: What each function input controls
- Details: additional info on what the function does and how arguments interact. Often the right place to start reading
- Value: the structure of the object returned by the function
- References: The relevant academic papers
- See Also: other functions of interest
- Add confidence intervals calculated in R to the linear regression plot above. Use the interval= argument to r_predict() (documentation here). You will have to work with a matrix returned by R.
- Fit a 5th degree polynomial to the diabetes data in R. Search the web for an easier method than writing out a formula with all 5 polynomial terms.
Answers
1.
# your code here
2.
# your code here
Lowess Smoothing¶
Lowess Smoothing is implemented in both Python and R. We'll use it as another example as we transition languages.
- What is lowess smoothing? Which 109a models is it related to?
- How explainable is lowess?
- What are the tunable parameters?
In Python
from statsmodels.nonparametric.smoothers_lowess import lowess as lowess
ss1 = lowess(diab['y'],diab['age'],frac=0.15)
ss2 = lowess(diab['y'],diab['age'],frac=0.25)
ss3 = lowess(diab['y'],diab['age'],frac=0.7)
ss4 = lowess(diab['y'],diab['age'],frac=1)
ss1[:10,:] # we simply get back a smoothed y value for each x value in the data
Notice the clean code to plot different models. We'll see even cleaner code in a minute.
for cur_model, cur_frac in zip([ss1,ss2,ss3,ss4],[0.15,0.25,0.7,1]):
ax = diab.plot.scatter(x='age',y='y',c='Red',title="Lowess Fit, Fraction = {}".format(cur_frac))
ax.set_xlabel("Age at Diagnosis")
ax.set_ylabel("Log C-Peptide Concentration")
ax.plot(cur_model[:,0],cur_model[:,1],color="blue")
plt.show()
- Which model has high variance, which has high bias?
- What makes a model high variance or high bias?
In R
We need to:
- Import the loess function
- Send data over to R
- Call the function and get results
r_loess = robjects.r['loess.smooth'] #extract R function
r_y = robjects.FloatVector(diab['y'])
r_age = robjects.FloatVector(diab['age'])
ss1_r = r_loess(r_age,r_y, span=0.15, degree=1)
ss1_r #again, a smoothed y value for each x value in the data
Predict the output of:
1. ss1_r[0]
2. ss1_r.rx2("y")
1.
your answer here
2.
your answer here
Varying span
Next, some extremely clean code to fit and plot models with various parameter settings. (Though the zip() approach seen earlier is great when e.g. the label and the parameter differ.)
for cur_frac in [0.15,0.25,0.7,1]:
cur_smooth = r_loess(r_age,r_y, span=cur_frac)
ax = diab.plot.scatter(x='age',y='y',c='Red',title="Lowess Fit, Fraction = {}".format(cur_frac))
ax.set_xlabel("Age at Diagnosis")
ax.set_ylabel("Log C-Peptide Concentration")
ax.plot(cur_smooth[0], cur_smooth[1], color="blue")
plt.show()
- Mark wasn't kidding; the Python and R results differ for frac=.15. Thoughts?
- Why isn't the bottom plot a straight line? We're using 100% of the data in each window...
Smoothing Splines¶
From this point forward, we're working with R functions; these models aren't (well) supported in Python.
For clarity: this is the fancy spline model that minimizes $\text{MSE} + \lambda\cdot\text{wiggle penalty}$ $=$ $\sum_{i=1}^N \left(y_i - f(x_i)\right)^2 + \lambda \int \left(f''(x)\right)^2 dx$, across all possible functions $f$. The winner will always be a continuous piecewise-cubic polynomial with a knot at each data point (a natural cubic spline).
- Any idea why the winner is cubic?
- How interpretable is this model?
- What are the tunable parameters?
r_smooth_spline = robjects.r['smooth.spline'] #extract R function
# run smoothing function
spline1 = r_smooth_spline(r_age, r_y, spar=0)
- We actually set the spar parameter, a scale-free value that translates to a $\lambda$ through a complex expression. Inspect the 'spline1' result and extract the implied value of $\lambda$
- Working from the fitting/plotting loop examples above, produce a plot like the one below for spar = [0,.5,.9,2], including axes labels and title.
1.
# your answer here
2.
# your answer here
CV
R's smooth.spline
function has built-in CV to find a good $\lambda$. See the package docs.
spline_cv = r_smooth_spline(r_age, r_y, cv=True)
lambda_cv = spline_cv.rx2("lambda")[0]
ax19 = diab.plot.scatter(x='age',y='y',c='Red',title="smoothing spline with $\lambda=$"+str(np.round(lambda_cv,4))+", chosen by cross-validation")
ax19.set_xlabel("Age at Diagnosis")
ax19.set_ylabel("Log C-Peptide Concentration")
ax19.plot(spline_cv.rx2("x"),spline_cv.rx2("y"),color="darkgreen");
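The CV fit also records the smoothing parameter and effective degrees of freedom it settled on; printing them is an optional sanity check (the element names below come from R's smooth.spline return value):
# optional: other quantities reported by the CV fit
print("spar chosen by CV:", spline_cv.rx2("spar")[0])
print("effective degrees of freedom:", spline_cv.rx2("df")[0])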
- Does the selected model look reasonable?
- How would you describe the effect of age at diagnosis on C-peptide concentration?
- What are the costs/benefits of the (fancy) spline model, relative to the linear regression we fit above?
Natural & Basis Splines¶
Here, we take a step backward on model complexity, but a step forward in coding complexity. We'll be working with R's formula interface again, so we will need to populate Formulas and DataFrames.
- In what way are Natural and Basis splines less complex than the splines we were just working with?
- What makes a spline 'natural'?
- What makes a spline 'basis'?
- What are the tuning parameters?
#We will now work with a new dataset, called GAGurine.
#The dataset description (from the R package MASS) is below:
#Data were collected on the concentration of a chemical GAG
# in the urine of 314 children aged from zero to seventeen years.
# The aim of the study was to produce a chart to help a paediatrician
# to assess if a child's GAG concentration is ‘normal’.
#The variables are:
# Age: age of child in years.
# GAG: concentration of GAG (the units have been lost).
GAGurine = pd.read_csv("data/GAGurine.csv")
display(GAGurine.head())
ax31 = GAGurine.plot.scatter(x='Age',y='GAG',c='black',title="GAG in urine of children")
ax31.set_xlabel("Age");
ax31.set_ylabel("GAG");
Standard stuff: import function, convert variables to R format, call function
from rpy2.robjects.packages import importr
r_splines = importr('splines')
# populate R variables
r_gag = robjects.FloatVector(GAGurine['GAG'].values)
r_age = robjects.FloatVector(GAGurine['Age'].values)
r_quarts = robjects.FloatVector(np.quantile(r_age,[.25,.5,.75])) #woah, numpy functions run on R objects!
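If you're suspicious of mixing numpy and R objects like this, a quick optional check is to compare against R's own quantile() function, which should give the same three cut points (both default to the same linear-interpolation rule):
# optional check: R's quantile() vs numpy's on the same vector
r_quantile = robjects.r["quantile"]
print(r_quantile(r_age, robjects.FloatVector([0.25, 0.5, 0.75])))
print(np.quantile(r_age, [0.25, 0.5, 0.75]))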
What happens when we call the ns or bs functions from r_splines?
ns_design = r_splines.ns(r_age, knots=r_quarts)
bs_design = r_splines.bs(r_age, knots=r_quarts)
print(ns_design)
ns and bs return design matrices, not model objects! That's because they're meant to work with lm's formula interface. To get a model object, we populate a formula including ns(...) and fit it to data.
r_lm = robjects.r['lm']
r_predict = robjects.r['predict']
# populate the formula
ns_formula = robjects.Formula("Gag ~ ns(Age, knots=r_quarts)")
ns_formula.environment['Gag'] = r_gag
ns_formula.environment['Age'] = r_age
ns_formula.environment['r_quarts'] = r_quarts
# fit the model
ns_model = r_lm(ns_formula)
Predict like usual: build a dataframe to predict on and call predict()
# predict
predict_frame = robjects.DataFrame({"Age": robjects.FloatVector(np.linspace(0,20,100))})
ns_out = r_predict(ns_model, predict_frame)
ax32 = GAGurine.plot.scatter(x='Age',y='GAG',c='grey',title="GAG in urine of children")
ax32.set_xlabel("Age")
ax32.set_ylabel("GAG")
ax32.plot(predict_frame.rx2("Age"),ns_out, color='red')
ax32.legend(["Natural spline, knots at quartiles"]);
- Fit a basis spline model with the same knots, and add it to the plot above
- Fit a basis spline with 8 knots placed at [2,4,6...14,16] and add it to the plot above
Answers:
1.
# your answer here
2.
# your answer here
#%R -i overfit_model plot(overfit_model)
# we'd get the same diagnostic plot we get from an lm model
GAMs¶
We come, at last, to our most advanced model. The coding here isn't any more complex than we've done before, though the behind-the-scenes is awesome.
First, let's get our (multivariate!) data
kyphosis = pd.read_csv("data/kyphosis.csv")
print("""
# kyphosis - whether a particular deformation was present post-operation
# age - patient's age in months
# number - the number of vertebrae involved in the operation
# start - the number of the topmost vertebra operated on
""")
display(kyphosis.head())
display(kyphosis.describe(include='all'))
display(kyphosis.dtypes)
#If there are errors about missing R packages, run the code below:
#r_utils = importr('utils')
#r_utils.install_packages('codetools')
#r_utils.install_packages('gam')
To fit a GAM, we
- Import the gam library
- Populate a formula including s() on variables we want to fit smooths for
- Call gam(formula, family=...), where family is a string naming a probability distribution, chosen based on how the response variable is thought to occur
- Rough family guidelines:
  - Response is binary or "N occurrences out of M tries", e.g. number of lab rats (out of 10) developing disease: choose "binomial"
  - Response is a count with no logical upper bound, e.g. number of ice creams sold: choose "poisson"
  - Response is real, with normally-distributed noise, e.g. person's height: choose "gaussian" (the default)
#There is a Python library in development for using GAMs (https://github.com/dswah/pyGAM)
# but it is not yet as comprehensive as the R GAM library, which we will use here instead.
# R also has the mgcv library, which implements some more advanced/flexible fitting methods
r_gam_lib = importr('gam')
r_gam = r_gam_lib.gam
r_kyph = robjects.FactorVector(kyphosis[["Kyphosis"]].values)
r_Age = robjects.FloatVector(kyphosis[["Age"]].values)
r_Number = robjects.FloatVector(kyphosis[["Number"]].values)
r_Start = robjects.FloatVector(kyphosis[["Start"]].values)
kyph1_fmla = robjects.Formula("Kyphosis ~ s(Age) + s(Number) + s(Start)")
kyph1_fmla.environment['Kyphosis']=r_kyph
kyph1_fmla.environment['Age']=r_Age
kyph1_fmla.environment['Number']=r_Number
kyph1_fmla.environment['Start']=r_Start
kyph1_gam = r_gam(kyph1_fmla, family="binomial")
The fitted gam model has a lot of interesting data within it
print(kyph1_gam.names)
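For a fuller report (coefficients, deviance, and significance information for the smooth terms), you can also hand the fitted model to R's generic summary() function; a minimal optional sketch:
# optional: R's summary() gives a richer report than the raw names list
r_summary = robjects.r["summary"]
print(r_summary(kyph1_gam))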
Remember plotting? Calling R's plot()
on a gam model is the easiest way to view the fitted splines
%R -i kyph1_gam plot(kyph1_gam, residuals=TRUE,se=TRUE, scale=20);
Prediction works like normal (build a data frame to predict on, if you don't already have one, and call predict()
). However, predict always reports the sum of the individual variable effects. If family
is non-default this can be different from the actual prediction for that point.
For instance, we're doing a 'logistic regression', so the raw prediction is log odds, but we can get the probability by using predict(..., type="response")
kyph_new = robjects.DataFrame({'Age': robjects.IntVector((84,85,86)),
'Start': robjects.IntVector((5,3,1)),
'Number': robjects.IntVector((1,6,10))})
print("Raw response (so, Log odds):")
display(r_predict(kyph1_gam, kyph_new))
print("Scaled response (so, probabilty of kyphosis):")
display(r_predict(kyph1_gam, kyph_new, type="response"))
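To see that the two outputs really are linked by the logistic transform, you can (optionally) push the raw log odds through the inverse logit yourself and compare with the type="response" numbers:
# optional: manually convert log odds to probabilities and compare
raw_log_odds = np.array(r_predict(kyph1_gam, kyph_new))
print(1 / (1 + np.exp(-raw_log_odds)))  # should match the scaled response above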
- What lambda did we use?
- What is the model telling us about the effects of age, starting vertebra, and number of vertebrae operated on?
- If we fit a logistic regression instead, which variables might want quadratic terms? What is the cost and benefit of a logistic regression model versus a GAM?
- Critique the model:
- What is it assuming? Are the assumptions reasonable?
- Are we using the right data?
- Does the model's story about the world make sense?
Appendix¶
GAMs and smoothing splines support hypothesis tests to compare models. (We can always compare models via out-of-sample prediction quality, i.e. performance on a validation set, but statistical ideas like hypothesis tests and information criteria allow us to use all the data for training and still compare the quality of model A to model B.)
r_anova = robjects.r["anova"]
kyph0_fmla = robjects.Formula("Kyphosis~1")
kyph0_fmla.environment['Kyphosis']=r_kyph
kyph0_gam = r_gam(kyph0_fmla, family="binomial")
print(r_anova(kyph0_gam, kyph1_gam, test="Chi"))
Explicitly joining spline functions
def h(x, xi, pow_arg): #pow is a reserved keyword in Python
if (x > xi):
return pow((x-xi),pow_arg)
else:
return 0
h = np.vectorize(h, otypes=[float]) #default behavior is to return ints, which gives incorrect answers
#also, vectorize does not play nicely with default arguments, so better to set directly (e.g., pow_arg=1)
xvals = np.arange(0,10.1,0.1)
ax20 = plt.plot(xvals,h(xvals,4,1),color="red")
_ = plt.title("Truncated linear basis function with knot at x=4")
_ = plt.xlabel("$x$")
_ = plt.ylabel("$(x-4)_+$") #note the use of TeX in the label
ax21 = plt.plot(xvals,h(xvals,4,3),color="red")
_ = plt.title("Truncated cubic basis function with knot at x=4")
_ = plt.xlabel("$x$")
_ = plt.ylabel("$(x-4)_+^3$")
ax22 = plt.plot(xvals,2+xvals+3*h(xvals,2,1)-4*h(xvals,5,1)+0.5*h(xvals,8,1),color="red")
_ = plt.title("Piecewise linear spline with knots at x=2, 5, and 8")
_ = plt.xlabel("$x$")
_ = plt.ylabel("$y$")
Comparing splines to the (noisy) model that generated the data.
x = np.arange(0.1,10,9.9/100)
from scipy.stats import norm
#ppf (percent point function) is the rather unusual name for
#the quantile or inverse CDF function in SciPy
y = norm.ppf(x/10) + np.random.normal(0,0.4,100)
ax23 = plt.scatter(x,y,facecolors='none', edgecolors='black')
_ = plt.title("3 knots")
_ = plt.xlabel("$x$")
_ = plt.ylabel("$y$")
_ = plt.plot(x,sm.ols('y~x+h(x,2,1)+h(x,5,1)+h(x,8,1)',data={'x':x,'y':y}).fit().predict(),color="darkblue",linewidth=2)
_ = plt.plot(x,norm.ppf(x/10),color="red")
ax24 = plt.scatter(x,y,facecolors='none', edgecolors='black')
_ = plt.title("6 knots")
_ = plt.xlabel("$x$")
_ = plt.ylabel("$y$")
_ = plt.plot(x,sm.ols('y~x+h(x,1,1)+h(x,2,1)+h(x,3.5,1)+h(x,5,1)+h(x,6.5,1)+h(x,8,1)',data={'x':x,'y':y}).fit().predict(),color="darkblue",linewidth=2)
_ = plt.plot(x,norm.ppf(x/10),color="red")
ax25 = plt.scatter(x,y,facecolors='none', edgecolors='black')
_ = plt.title("9 knots")
_ = plt.xlabel("$x$")
_ = plt.ylabel("$y$")
_ = plt.plot(x,sm.ols('y~x+h(x,1,1)+h(x,2,1)+h(x,3,1)+h(x,4,1)+h(x,5,1)+h(x,6,1)+h(x,7,1)+h(x,8,1)+h(x,9,1)',data={'x':x,'y':y}).fit().predict(),color="darkblue",linewidth=2)
_ = plt.plot(x,norm.ppf(x/10),color="red")
regstr = 'y~x+'
for i in range(1,26):
regstr += 'h(x,'+str(i/26*10)+',1)+'
regstr = regstr[:-1] #drop last +
ax26 = plt.scatter(x,y,facecolors='none', edgecolors='black')
_ = plt.title("25 knots")
_ = plt.xlabel("$x$")
_ = plt.ylabel("$y$")
_ = plt.plot(x,sm.ols(regstr,data={'x':x,'y':y}).fit().predict(),color="darkblue",linewidth=2)
_ = plt.plot(x,norm.ppf(x/10),color="red")
Exercise:¶
Try generating random data from different distributions and fitting polynomials of different degrees to it. What do you observe?
# try it here
#So, we see that increasing the number of knots results in a more polynomial-like fit
#Next, we look at cubic splines with increasing numbers of knots
ax27 = plt.scatter(x,y,facecolors='none', edgecolors='black')
_ = plt.title("3 knots")
_ = plt.xlabel("$x$")
_ = plt.ylabel("$y$")
_ = plt.plot(x,sm.ols('y~x+np.power(x,2)+np.power(x,3)+h(x,2,3)+h(x,5,3)+h(x,8,3)',data={'x':x,'y':y}).fit().predict(),color="darkblue",linewidth=2)
_ = plt.plot(x,norm.ppf(x/10),color="red")
ax28 = plt.scatter(x,y,facecolors='none', edgecolors='black')
_ = plt.title("6 knots")
_ = plt.xlabel("$x$")
_ = plt.ylabel("$y$")
_ = plt.plot(x,sm.ols('y~x+np.power(x,2)+np.power(x,3)+h(x,1,3)+h(x,2,3)+h(x,3.5,3)+h(x,5,3)+h(x,6.5,3)+h(x,8,3)',data={'x':x,'y':y}).fit().predict(),color="darkblue",linewidth=2)
_ = plt.plot(x,norm.ppf(x/10),color="red")
ax29 = plt.scatter(x,y,facecolors='none', edgecolors='black')
_ = plt.title("9 knots")
_ = plt.xlabel("$x$")
_ = plt.ylabel("$y$")
_ = plt.plot(x,sm.ols('y~x+np.power(x,2)+np.power(x,3)+h(x,1,3)+h(x,2,3)+h(x,3,3)+h(x,4,3)+h(x,5,3)+h(x,6,3)+h(x,7,3)+h(x,8,3)+h(x,9,3)',data={'x':x,'y':y}).fit().predict(),color="darkblue",linewidth=2)
_ = plt.plot(x,norm.ppf(x/10),color="red")
regstr2 = 'y~x+np.power(x,2)+np.power(x,3)+'
for i in range(1,26):
regstr2 += 'h(x,'+str(i/26*10)+',3)+'
regstr2 = regstr2[:-1] #drop last +
ax30 = plt.scatter(x,y,facecolors='none', edgecolors='black')
_ = plt.title("25 knots")
_ = plt.xlabel("$x$")
_ = plt.ylabel("$y$")
_ = plt.plot(x,sm.ols(regstr2,data={'x':x,'y':y}).fit().predict(),color="darkblue",linewidth=2)
_ = plt.plot(x,norm.ppf(x/10),color="red")