Lab 2: Model Fitting
Key Word(s): Smoothing, Model Fitting, GAM
R provides functions for fitting linear models (`lm`), generalized linear models (`glm`), local polynomial regression models (`loess`), and nonlinear least squares (`nls`), among others. Functions for fitting many other kinds of models are available in various R packages.
Formula
What most of these model fitting functions have in common is a way of specifying relationships among the variables in the model in terms of a formula. A formula is usually written using the tilde operator (`~`), with the response on the left and the predictors on the right.
Thinking of the formula interface as a domain-specific language is helpful, because many things work differently inside a formula than they do in the rest of R. For example:

- outside a formula, `+` means "addition", but inside a formula `+` means "inclusion"
- outside a formula, `:` is the sequence operator, but inside a formula it means "interaction"
- outside a formula, `(a + b + c)^2` means "square the sum of a, b, and c"; inside a formula it means "include a, b, c, and all two-way interactions between them"

Note that you can escape to the usual meaning with the `I` function: `I((a + b + c)^2)` means "square the sum of a, b, and c", even inside a formula.
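If you want to see how R will interpret a particular formula, you can ask it to expand the terms with the `terms` function. Here is a quick sketch (the variables `y`, `a`, `b`, and `c` are just placeholder names; `terms` only looks at the symbols, so no data are needed):

```{r}
## Inside a formula, ^2 expands to main effects plus all two-way interactions...
attr(terms(y ~ (a + b + c)^2), "term.labels")
## ...while I() protects the expression so it is treated as ordinary arithmetic.
attr(terms(y ~ I((a + b + c)^2)), "term.labels")
```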
More details on the formula interface are described in its help page (`?formula`).
Model fitting examples
R comes with some built-in data sets that can be used for examples and demos. You can ask for the list of available data sets by calling `data()`. Here we will use the built-in `mtcars` data set.
data(mtcars)
str(mtcars)
?mtcars
Fit a simple linear model predicting miles per gallon (`mpg`) from horsepower (`hp`) and weight (`wt`).
lm(mpg ~ hp + wt, data = mtcars)
Add an interaction between `hp` and `wt`.
lm(mpg ~ hp * wt, data = mtcars)
Add interactions between `hp` and automatic transmission (`am`), and between `wt` and `am`.
lm(mpg ~ (hp + wt) * am, data = mtcars)
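If you are ever unsure how a formula will expand, `terms` will list the terms R actually fits. For example, the last model above includes both two-way interactions but no three-way term (a quick check, not part of the original lab code):

```{r}
## Show the terms generated by (hp + wt) * am: hp, wt, am, hp:am, wt:am
attr(terms(mpg ~ (hp + wt) * am), "term.labels")
```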
Types, Classes and methods (lm, predict, resid)
We've seen how to use formulas in R to fit linear models, but you may have been somewhat underwhelmed by the result: printing an `lm` fit shows only the call that produced it and the regression coefficients. To do anything useful with these models we will want to assign them to a name and then do some post-estimation calculations. In order to do that we need to know what `lm` returns, and what methods exist for it. Most of this is documented in `?lm`, but we can use R to inspect the result and learn for ourselves.
Objects produced by `lm` are named lists, and can be treated just like any other list in R. Let's simplify our last example model and take a closer look.
mod.mpg1 <- lm(mpg ~ wt * am, data = mtcars)
typeof(mod.mpg1)
length(mod.mpg1)
names(mod.mpg1)
mod.mpg1[["coefficients"]]
Since `mod.mpg1` is a list, we can do anything with it that we can do with other lists in R. For example, we can extract elements by position or name:
mod.mpg1[["coefficients"]]
hist(mod.mpg1[[2]])
Unlike other lists in R, lists produced by `lm` are of class `lm`. There are functions with methods that correspond to this class, and we can ask R to tell us what those functions are using the `methods` function.
class(mod.mpg1)
methods(class = class(mod.mpg1))
We now know that there are `summary`, `anova`, and `confint` methods (among others) for objects of class `lm`. That is, we know that we can do the following:
summary(mod.mpg1)
anova(mod.mpg1)
confint(mod.mpg1)
Note: In Python the equivalent of `summary(mod.mpg1)` would probably look like `mod.mpg1.summary()`. This object-oriented style does exist in R (though it uses `$` rather than `.` to access methods), but it is not commonly used.
This technique of identifying the class of an R object and then looking up which generic functions have corresponding methods is very useful. Remember, though, that it only shows you functions that have class-specific methods; for example, `methods(class = class(mod.mpg1))` did not show us a method for `[[`, but we know that we can use bracket extraction on an `lm` object. That just means that `[[` doesn't do anything special because of the object's class -- it simply treats the object like any other list.
We can also go the other way around and ask R what methods exist for a specific generic function. For example, the following shows us all the different object classes that have `summary` methods:
methods(summary)
Most modeling functions in R return objects with at least `summary`, `print`, and `predict` methods. The `predict` methods are useful for visualizing the quality of your model: once you've calculated predicted values you can use the `plot` or `ggplot` functions to construct a graph. Let's use `ggplot` to visualize the predictions from our last model.
## install.packages("ggplot2") # uncomment and run if not already installed.
library(ggplot2)
mtc <- transform(mtcars, pred.mpg = predict(mod.mpg1, interval = "confidence"))
mtc <- mtc[order(mtc$wt), ]
ggplot(mtc,
mapping = aes(x = wt, y = mpg, color = as.factor(am))) +
geom_point() +
geom_smooth(mapping = aes(y = pred.mpg.fit,
ymin = pred.mpg.lwr,
ymax = pred.mpg.upr),
stat = "identity")
Your turn: Get to know lm
1. Fit a linear model predicting miles per gallon (`mpg`) from number of cylinders (`cyl`), displacement (`disp`), horsepower (`hp`), and all two-way interactions among these three predictors. Do not include the three-way interaction.
### YOUR SOLUTION HERE
2. Number of cylinders and displacement are so highly correlated that it makes sense to think of them both as measures of engine size. Fit a model predicting `mpg` from the sum of `cyl` and `disp`.
### YOUR SOLUTION HERE
3. Fit a model predicting `mpg` from `disp`, assigning the result to the name `mod.mpg`. Get the `summary` of the model and assign the result to the name `mod.mpg.sum`.
### YOUR SOLUTION HERE
4. Extract the `r.squared` value from `mod.mpg.sum`.
### YOUR SOLUTION HERE
5. Graph the observed and predicted values of `mpg` as a function of `disp` based on your model.
```{r,error=T}
library(ggplot2)
mtc <- transform(mtcars, pred.mpg = predict(mod.mpg))
### YOUR SOLUTION HERE
```
6. (Advanced) Fit a logistic model predicting *automatic vs.
manual transmission* (`am`) from *displacement* (`disp`). (Hint: use `glm`)
```{r}
### YOUR SOLUTION HERE
```
7. (Advanced) Fit a logistic model predicting automatic vs. manual transmission (`am`) from displacement (`disp`) and number of gears (`gear`). Treat `gear` as an ordered factor and use polynomial contrasts.
### YOUR SOLUTION HERE
8. (Advanced) Inference from linear regression models is based on several assumptions, including the assumption that the residuals are normally distributed. Visually inspect your model to determine if this assumption has been violated.
### YOUR SOLUTION HERE
9. (Advanced) Add `am` and the interaction between `disp` and `am` to your model. What is the regression coefficient for `mpg` on `disp` when `am = 0`? When `am = 1`?
### YOUR SOLUTION HERE
Transformations (Polynomials and Splines)
The R formula provides a way to specify relationships among variables, and among transformations of those variables. We've already seen that `*` transforms the inputs by including both the specified variables and their product. Many other transformations are possible. Indeed, for categorical predictors transformation is mandatory and happens by default: R will transform categorical predictors into a matrix of indicators or dummy codes with the first level as the reference group.
In general, any transformation that returns a vector of length equal to the number of rows in the data or a matrix with the same row dimension as the input data is legal.
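For instance, we can peek at the default coding R uses for a categorical predictor by building the model matrix directly. This is just an illustrative check using `cyl` treated as a factor:

```{r}
## factor(cyl) is expanded into indicator columns for 6 and 8 cylinders,
## with 4 cylinders (the first level) as the reference group.
head(model.matrix(mpg ~ factor(cyl), data = mtcars))
```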
Standardized regression coefficients
An example of such a transformation is the `scale` function, which standardizes its first argument by subtracting the mean and dividing by the standard deviation of each column separately:
str(mtcars[c("hp", "wt")])
str(scale(mtcars[c("hp", "wt")]))
Thus if we want standardized regression coefficients we can fit our model as follows:
mod.mpg2 <- lm(mpg ~ scale(cbind(hp = hp, wt = wt)) * am, data = mtcars)
summary(mod.mpg2)
Polynomials
If we want polynomial terms we can generate them ourselves, or use the `poly` function to do it for us.
summary(lm(mpg ~ disp + I(disp^2) + I(disp^3), data = mtcars))
summary(lm(mpg ~ poly(disp, degree = 3, raw = TRUE), data = mtcars))
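One detail worth knowing: by default `poly` generates orthogonal polynomials (`raw = FALSE`) rather than raw powers. The coefficients are parameterized differently, but the fitted values should be the same, as this quick check suggests (the names `fit.raw` and `fit.orth` are used here just for the comparison):

```{r}
## Raw and orthogonal polynomial bases span the same space,
## so the two fits produce identical predictions.
fit.raw  <- lm(mpg ~ poly(disp, degree = 3, raw = TRUE), data = mtcars)
fit.orth <- lm(mpg ~ poly(disp, degree = 3), data = mtcars)
all.equal(fitted(fit.raw), fitted(fit.orth))
```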
Splines
Similarly, we can use the `bs` function from the splines package to generate piece-wise polynomials (B-splines). You can specify the `knots` directly, or specify `df`, in which case `bs` will choose `df - degree` knots (one fewer if an intercept is included) at suitable quantiles of the data.
library(splines)
mod.mpg4 <- lm(mpg ~ bs(disp, knots = quantile(disp, c(.25, .50, .75))),
data = mtcars)
summary(mod.mpg4)
Again, there is nothing magical about splines; they are basically piece-wise polynomials. We can calculate a quick-and-dirty version ourselves without too much trouble:
disp.qtl <- quantile(mtcars$disp, c(.25, .50, .75))
mod.bs.manual <- predict(lm(mpg ~ disp +
I(disp^2) +
I(disp^3) +
ifelse(disp > disp.qtl[1],
(disp - disp.qtl[1])^3,
0) +
ifelse(disp > disp.qtl[2],
(disp - disp.qtl[2])^3,
0) +
ifelse(disp > disp.qtl[3],
(disp - disp.qtl[3])^3,
0),
data = mtcars))
## our quick-and-dirty spline captures the same information as those
## generated by the bs function:
all.equal(mod.bs.manual, predict(mod.mpg4))
## although, it is not scaled or parameterized exactly the same.
Finally, we can construct natural cubic splines using the `ns` function. As with `bs`, you can specify either `df` or the `knots` directly.
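Here is a minimal sketch of an `ns` fit, analogous to the `bs` example above (the name `mod.mpg.ns` is just illustrative):

```{r}
## Natural cubic spline in disp with 3 degrees of freedom.
mod.mpg.ns <- lm(mpg ~ ns(disp, df = 3), data = mtcars)
summary(mod.mpg.ns)
```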
What all of these transformation functions (i.e., `scale`, `poly`, `bs`, `ns`) have in common is that they take a variable as input and return a matrix whose row dimension is equal to the length of the input variable. When called inside the formula of a modeling function (e.g., `lm`), the generated matrix is used on the right-hand side of the equation.
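You can verify this yourself by checking the dimensions of what these functions return; each has one row per observation in `mtcars`:

```{r}
nrow(mtcars)
dim(scale(mtcars[c("hp", "wt")]))
dim(poly(mtcars$disp, degree = 3))
dim(ns(mtcars$disp, df = 3))
```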
Your turn: Visualizing polynomial and spline regressions
1. Predict `mpg` from `hp` using a natural spline (`ns`) with `df = 3`.
```{r,error=T}
### YOUR SOLUTION HERE
mod.mpg.ns3 <-
summary(mod.mpg.ns3)
```
2. Predict `mpg` from `hp` using a *natural spline* with `df = 8`.
```{r,error=T}
### YOUR SOLUTION HERE
mod.mpg.ns8 <-
summary(mod.mpg.ns8)
```
3. Plot the predictions from these two models.
```{r,error=T}
### YOU FILL PREDICTIONS HERE
mtcars$mpg.ns3 <-
mtcars$mpg.ns8 <-
## Plot:
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  geom_line(aes(y = mpg.ns3, color = "df = 3")) +
  geom_line(aes(y = mpg.ns8, color = "df = 8"))
```
4. (Advanced) Refit the `df=3` model, adding an interaction with `am`.
Plot the predictions, coloring by `am`.
```{r}
### YOUR SOLUTION HERE
```
Selecting parameter values
Once we start fitting models with polynomials or splines, we will quickly realize that we need a way to choose the optimal parameters. What degree polynomial should we use? How many degrees of freedom in our basis-splines?
A simple answer is to choose a measure of model fit, test out a range of parameters, and see which one fits best. That makes sense, but be careful not to over-fit the model. In other words, you want your model to capture generalizable patterns, not noise in the particular sample you happen to have.
Basic cross-validation
There are several approaches to avoiding over-fitting, with cross-validation being a simple and common method. Here are the basic steps.
- Split the data into training and test sets.
- Choose a measure of model fit, e.g., $R^2$ or root-mean-square error ($rmse$).
- Fit each model to the training set.
- Calculate the measure of model fit on the test set.
Here is an example using the `mtcars` data.
rsq <- function(model, data, y) {
  ## R-squared of the model's predictions for column `y` of `data`.
  y <- data[[y]]
  pred <- predict(model, newdata = data)
  tss <- sum((y - mean(y))^2)
  rss <- sum((y - pred)^2)
  ## floor at zero, so models that fit worse than the mean get R-squared = 0
  max(0, 1 - rss / tss)
}

mse <- function(model, data, y) {
  ## Mean squared error of the model's predictions for column `y` of `data`.
  y <- data[[y]]
  pred <- predict(model, newdata = data)
  mean((y - pred)^2)
}
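As a quick sanity check, we can apply these helpers to the model we fit earlier on the full data set:

```{r}
## In-sample fit of mod.mpg1 (fit above); with newdata = the full data,
## rsq() should match the R-squared reported by summary().
rsq(mod.mpg1, mtcars, "mpg")
mse(mod.mpg1, mtcars, "mpg")
```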
Split the data into test and training sets.
id.train <- sample(1:nrow(mtcars), floor(nrow(mtcars) * 0.5))
mt.train <- mtcars[id.train, ]
mt.test <- mtcars[-id.train, ]
Next, define a function that fits a model with a given `df` and calculates the R-squared in the training and test sets.
model.performance <- function(df, train, test) {
mod <- lm(mpg ~ bs(disp, df = df), data = train)
c(train.r2 = rsq(mod, train, "mpg"),
test.r2 = rsq(mod, test, "mpg"))
}
Finally, fit the model with varying parameters and calculate the performance for each model.
dfs <- 2:6 ## parameters we want to iterate over
## fit models using sapply. We could have used a for-loop instead.
performance <- sapply(dfs, model.performance,
train = mt.train,
test = mt.test,
simplify = FALSE)
## arrange the result in a data.frame
performance <- as.data.frame(do.call(rbind, performance))
performance$df <- dfs
performance
Usually it's a good idea to visualize model performance as a function of the varying parameter(s).
library(ggplot2)
ggplot(performance, aes(x = df)) +
geom_point(aes(y = train.r2, color = "train")) +
geom_point(aes(y = test.r2, color = "test"))
k-fold cross-validation
There is a serious limitation to using a single test/training split to perform cross validation. Would you get the same result with a different split? Because we don't want our results to depend on the random partitioning into test and training sets, it is a good idea to use multiple splits. This leads us to k-fold cross-validation. The basic steps are:
- Split the data into `k` partitions.
- Choose a measure of model fit, e.g., $R^2$ or $rmse$.
- For each partition, fit the model to the data excluding that partition.
- Calculate the measure of model fit on the excluded partition.
- Average the `k` measures of model fit.
Here is an example, again using the `mtcars` data.
Create `k = 5` partitions.
k <- 5
mtcars$partition <- cut(sample(1:nrow(mtcars), nrow(mtcars)), k)
Fit the model with varying parameters and calculate performance for each model.
dfs <- 2:6 ## parameters we want to iterate over
## Use a for-loop to iterate over the partitions.
## We could have used sapply instead
performance <- vector(mode = "list", length = k)
names(performance) <- unique(mtcars$partition)
for(partition in names(performance)) {
## Fit models using sapply. We could have used a for-loop instead.
test <- mtcars$partition == partition
performance[[partition]] <- sapply(dfs,
model.performance,
train = mtcars[!test, ],
test = mtcars[test, ],
simplify = FALSE)
}
## arrange the result in a list of data.frames
performance <- sapply(performance, function(x) {
x <- as.data.frame(do.call(rbind, x))
x$df <- dfs
x},
simplify = FALSE)
## add partition column
for(partition in names(performance)) {
performance[[partition]] <- data.frame(performance[[partition]],
partition = partition)
}
## reduce list of data.frames to a single data.frame
performance <- do.call(rbind, performance)
performance
Usually it's a good idea to visualize model performance as a function of the varying parameter(s).
library(ggplot2)
ggplot(performance, aes(x = df)) +
geom_point(aes(y = train.r2, color = "train")) +
geom_point(aes(y = test.r2, color = "test"))
for(dd in dfs) {
  print(dd)
  print(c("Test R2", mean(performance$test.r2[performance$df == dd])))
  print(c("Train R2", mean(performance$train.r2[performance$df == dd])))
}
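The same fold-averaged summaries can be computed in one step with `aggregate`, as an alternative to the loop above:

```{r}
## Average train and test R-squared across the k folds, by df.
aggregate(cbind(train.r2, test.r2) ~ df, data = performance, FUN = mean)
```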
The caret package is a full-featured system for training regression and classification models. The modelr package is a light-weight alternative. Both take a little getting used to, but they can save you from writing this kind of cross-validation code by hand.
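Below is a minimal sketch of what cross-validation looks like with caret; it fits a plain linear model with 5-fold cross-validation (the particular model and the name `cv.fit` are just illustrative choices, not part of the lab):

```{r}
## install.packages("caret") # uncomment and run if not already installed.
library(caret)

## trainControl() sets up the resampling scheme and train() fits the model
## on each fold, reporting cross-validated RMSE and R-squared.
cv.fit <- train(mpg ~ hp + wt, data = mtcars,
                method = "lm",
                trControl = trainControl(method = "cv", number = 5))
cv.fit
```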
Your turn: Use cross validation to determine the optimal degrees of freedom
Note: You may use the convenience functions we wrote earlier, or write your own.
1. Split the `mtcars` data into a test and training set.
```{r,error=T}
### YOU FINISH CODE HERE
train <-
mt.train <- mtcars[train, ]
mt.test <- mtcars[-train, ]
```
2. Iterate over `df` from 2-6, predicting `mpg` from `hp` and `drat` using
*natural splines* (`ns`). For each `df` combination calculate the $R^2$ value in
the test and training set.
```{r,error=T}
model.performance2 <- function(df, train, test) {
  mod <- lm(mpg ~ ns(hp, df = df[1]) + ns(drat, df = df[2]), data = train)
  c(train.mse = mse(mod, train, "mpg"),
    test.mse = mse(mod, test, "mpg"))
}
### YOU FINISH CODE HERE
?expand.grid
dfs <-
names(dfs) <- c('df.hp', 'df.drat')
ns.performance <- apply()
```
3. Plot the $R^2$ values in the test and training set as a function of `df` for `hp` and `drat`. What is the best `df` pair?
```{r,error=T}
plot.performance <- data.frame(dfs, t(ns.performance))
### YOUR PLOT HERE
```
4. Make sure that your solution is NOT on the boundary of your `df` grid. If it is, adjust your bounds to ensure that the global minimum of the MSE is not on a boundary.
5. (Advanced) Repeat steps 1-3 several times. Do you get the same
answer with different train/test splits?
```{r}
### YOUR SOLUTION HERE
```
Smoothing models (loess, gam)
As flexible as the transformation mechanism described above is, it is limited. Other approaches to modeling non-linear functions include local polynomial regression (`loess`) and generalized additive models (`gam`). Rather than using standard (generalized) linear models with transformations on the right-hand side, these models are best fit in R using specialized functions.
Local polynomial regression works by fitting a polynomial to the observations near each point being predicted, weighting them by their distance from that point. The most important parameters are `span` and `degree`, which control the down-weighting of more distant points and the degree of the polynomial, respectively.
mod.mpg5 <- loess(mpg ~ disp, data = mtcars) # default span = 0.75
mod.mpg6 <- loess(mpg ~ disp, span = 0.25, data = mtcars) # less smooth
mod.mpg7 <- loess(mpg ~ disp, span = 2.0, data = mtcars) # more smooth
ggplot(data = transform(mtcars,
                        pred.mpg5 = predict(mod.mpg5),
                        pred.mpg6 = predict(mod.mpg6),
                        pred.mpg7 = predict(mod.mpg7)),
       mapping = aes(x = disp, y = mpg)) +
  geom_point() +
  geom_line(mapping = aes(y = pred.mpg5, color = "span = 0.75")) +
  geom_line(mapping = aes(y = pred.mpg6, color = "span = 0.25")) +
  geom_line(mapping = aes(y = pred.mpg7, color = "span = 2.0"))
There are two popular implementations of generalized additive models in R: the older gam package and the more recent mgcv package. We'll use gam because it is somewhat simpler and easier to understand.
Calls to the `gam` function look much like the calls to `lm` with transformations that we saw earlier:
library(gam)
mod.mpg8 <- gam(mpg ~ s(disp, spar = 0.1), data = mtcars)
mod.mpg9 <- gam(mpg ~ s(disp, spar = 0.7), data = mtcars)
mod.mpg10 <- gam(mpg ~ s(disp, spar = 1), data = mtcars)
ggplot(data = transform(mtcars,
                        pred.mpg8 = predict(mod.mpg8),
                        pred.mpg9 = predict(mod.mpg9),
                        pred.mpg10 = predict(mod.mpg10)),
       mapping = aes(x = disp, y = mpg)) +
  geom_point() +
  geom_line(mapping = aes(y = pred.mpg8, color = "spar = 0.1")) +
  geom_line(mapping = aes(y = pred.mpg9, color = "spar = 0.7")) +
  geom_line(mapping = aes(y = pred.mpg10, color = "spar = 1"))
Under the hood things are a bit different. The `s` function doesn't actually compute the smoothing splines -- it just adds some attributes so that the `gam` function knows how to generate the model matrix. In other words, the `s` function is specific to the gam package: `lm(mpg ~ s(disp, spar = 0.1), data = mtcars)` will not work as intended.
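You can see this for yourself by inspecting what `s` returns; it is essentially the input vector with some extra attributes attached (the exact attributes are an implementation detail of the gam package):

```{r}
str(s(mtcars$disp, spar = 0.1))
```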
Your turn: Use cross validation to determine the optimal smoothing parameter for a GAM
Note: You may modify and use the functions we wrote earlier, or write your own.
1. Split the `mtcars` data into a test and training set.
```{r,error=T}
### YOU FINISH CODE HERE
train <-
mt.train <- mtcars[train, ]
mt.test <- mtcars[-train, ]
```
2. Iterate over `spar` from 0.1 -- 0.8 (in increments of 0.1),
predicting `mpg` from `hp` using a *generalized additive model*
(`gam`). For each value of `spar` calculate the $R^2$ value in the
test and training set.
```{r,error=T}
library(gam)
### YOU FINISH CODE HERE
spars <-
###
## vectors to store results
train.r2 <- c()
test.r2 <- c()
## iterate using a for-loop
for(sp in spars) {
### YOUR SOLUTION HERE
mod <-
train.r2 <- c(train.r2,#HERE)
test.r2 <- c(test.r2, #HERE)
###
}
## arrange the data
gam.r2 <- data.frame(spar = c(spars, spars),
set = rep(c("train", "test"), each = length(spars)),
r2 = c(train.r2, test.r2))
```
3. Plot the $R^2$ values in the test and training set as a function of `spar`. What is the best `spar` setting?
### YOUR SOLUTION HERE
4. (Advanced) Repeat steps 1-3 several times. Do you get the same answer with different train/test splits?
### YOUR SOLUTION HERE
5. (Advanced) Use 5-fold cross-validation to select the optimal `spar` value.
### YOUR SOLUTION HERE