Lab 2: Model Fitting

Key Word(s): Smoothing, Model Fitting, GAM



Supplementary Material

R provides functions for fitting linear (lm) and generalized linear models (glm), local polynomial regression models (loess), and nonlinear least squares (nls) among others. Functions for fitting many other kinds of models are available in various R packages.

Formula

What most of these model fitting functions have in common is a way of specifying relationships among variables in the model in terms of a formula. A formula is usually defined using the tilde operator (~), with the response on the left and predictors on the right.

Thinking of the formula interface as a domain-specific language is helpful, because many things mean something different inside a formula than they do in the rest of R. For example, + adds a term to the model rather than summing values, : specifies an interaction, * expands to main effects plus their interaction, and ^ crosses terms up to a given degree.

Note that you can escape back to the usual arithmetic meaning with the I function: I((a + b + c)^2) means "square the sum of a, b, and c", even inside a formula.
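For instance, here is a minimal sketch using simulated data (the variables a, b, c, and y are made up purely for illustration) showing the difference between the formula meaning of ^ and the arithmetic meaning restored by I():

set.seed(1)
d <- data.frame(a = rnorm(20), b = rnorm(20), c = rnorm(20), y = rnorm(20))

## inside a formula: main effects of a, b, c plus all two-way interactions
coef(lm(y ~ (a + b + c)^2, data = d))

## with I(): a single predictor equal to the squared sum of a, b, and c
coef(lm(y ~ I((a + b + c)^2), data = d))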

More details on the formula interface are described in the help page (?formula).

Model fitting examples

R comes with some built-in data sets that can be used for examples and demos. You can ask for the list of available data sets by calling data(). Here we will use the built-in mtcars data set.

data(mtcars)
str(mtcars)
?mtcars

Fit a simple linear model predicting miles per gallon (mpg) from horsepower (hp) and weight (wt).

lm(mpg ~ hp + wt, data = mtcars)

Add interaction between hp and wt.

lm(mpg ~ hp * wt, data = mtcars)

Add interactions between hp and automatic transmission (am), and between wt and am.

lm(mpg ~ (hp + wt) * am, data = mtcars)

Types, Classes and methods (lm, predict, resid)

We've seen how to use formulas in R to fit linear models, but you may have been somewhat underwhelmed by the result: printed on its own, an lm fit shows only the call that produced it and the regression coefficients. To do anything useful with these models we will want to assign them to a name and then do some post-estimation calculations. In order to do that we need to know what lm actually returns, and what methods exist for it. Most of this is documented in ?lm, but we can use R to inspect the result and learn for ourselves.

Objects produced by lm are named lists, and can be treated just like any other list in R. Let's simplify our last example model and take a closer look.

mod.mpg1 <- lm(mpg ~ wt * am, data = mtcars)

typeof(mod.mpg1)
length(mod.mpg1)
names(mod.mpg1)

mod.mpg1[["coefficients"]]

Since mod.mpg1 is a list, we can do anything with it that we can do with other lists in R. For example, we can extract elements by position or name:

mod.mpg1[["coefficients"]]
hist(mod.mpg1[[2]])

Unlike plain lists, lists produced by lm are of class lm. There are functions with methods that correspond to this class, and we can ask R to tell us what those functions are using the methods function.

class(mod.mpg1)
methods(class = class(mod.mpg1))

We now know that there are summary, anova, and confint methods (among others) for objects of class lm. That is, we know that we can do the following:

summary(mod.mpg1)
anova(mod.mpg1)
confint(mod.mpg1)

Note: In Python the equivalent of summary(mod.mpg1) would probably look like mod.mpg1.summary(). This object-oriented style does exist in R (though it uses $ rather than . to access methods), but it is not commonly used.

This technique of identifying the class of an R object, and then looking up which generic functions have corresponding methods, is very useful. Remember, though, that it only shows you functions that have class-specific methods; for example methods(class = class(mod.mpg1)) did not show us a method for [[, but we know that we can use bracket extraction on an lm object. That just means that [[ doesn't do anything special because of the object's class -- it just treats it like any other list.
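For instance, extracting the residuals and coefficients with [[ gives the same results as the resid and coef accessor functions:

all.equal(mod.mpg1[["residuals"]], resid(mod.mpg1))
all.equal(mod.mpg1[["coefficients"]], coef(mod.mpg1))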

We can also go the other way around, and ask R what methods exist for a specific generic function. For example,

methods(summary)

shows us all the different object classes that have summary methods.

Most modeling functions in R return objects with at least print, summary, and predict methods. The predict methods are particularly useful for assessing the quality of your model visually. Once you've calculated predicted values you can use the plot or ggplot functions to construct a graph. Let's use ggplot to visualize the predictions from our last model.

## install.packages("ggplot2") # uncomment and run if not already installed.
library(ggplot2)

mtc <- transform(mtcars, pred.mpg = predict(mod.mpg1, interval = "confidence"))
mtc <- mtc[order(mtc$wt), ]

ggplot(mtc,
       mapping = aes(x = wt, y = mpg, color = as.factor(am))) +
    geom_point() +
    geom_smooth(mapping = aes(y = pred.mpg.fit,
                              ymin = pred.mpg.lwr,
                              ymax = pred.mpg.upr),
                stat = "identity")

Your turn: Get to know lm

  1. Fit a linear model predicting miles per gallon (mpg) from number of cylinders (cyl), displacement (disp), horsepower (hp), and all two-way interactions among these three predictors. Do not include the three-way interaction.
### YOUR SOLUTION HERE
  2. Number of cylinders and displacement are so highly correlated that it makes sense to think of them both as measures of engine size. Fit a model predicting mpg from the sum of cyl and disp.
### YOUR SOLUTION HERE
  3. Fit a model predicting mpg from disp, assigning the result to the name mod.mpg. Get the summary of the model and assign the result to the name mod.mpg.sum.
### YOUR SOLUTION HERE
  4. Extract the r.squared value from mod.mpg.sum.
### YOUR SOLUTION HERE
  5. Graph the observed and predicted values of mpg as a function of disp based on your model.

library(ggplot2)
mtc <- transform(mtcars, pred.mpg = predict(mod.mpg))
### YOUR SOLUTION HERE

  6. (Advanced) Fit a logistic model predicting automatic vs. manual transmission (am) from displacement (disp). (Hint: use glm)
### YOUR SOLUTION HERE
  7. (Advanced) Fit a logistic model predicting automatic vs. manual transmission (am) from displacement (disp) and number of gears (gear). Treat gear as an ordered factor and use polynomial contrasts.
### YOUR SOLUTION HERE
  8. (Advanced) Inference from linear regression models is based on several assumptions, including the assumption that the residuals are normally distributed. Visually inspect your model to determine whether this assumption has been violated.
### YOUR SOLUTION HERE
  9. (Advanced) Add am and the interaction between disp and am to your model. What is the regression coefficient for mpg on disp when am = 0? When am = 1?
### YOUR SOLUTION HERE

Transformations (Polynomials and Splines)

The R formula interface provides a way to specify relationships among variables, and among transformations of those variables. We've already seen that * transforms the inputs by including both the specified variables and their interaction. Many other transformations are possible. Indeed, for categorical predictors transformation is mandatory and happens by default: R converts a categorical predictor into a matrix of indicator (dummy) variables, with the first level as the reference group.

In general, any transformation that returns a vector of length equal to the number of rows in the data or a matrix with the same row dimension as the input data is legal.
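To see the dummy coding in action, here is a quick sketch using model.matrix (which builds the same design matrix lm would use) with cyl treated as a factor:

## the first level (4 cylinders) becomes the reference group
head(model.matrix(mpg ~ factor(cyl), data = mtcars))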

Standardized regression coefficients

An example of such a transformation is the scale function, which standardizes its first argument by subtracting the mean and dividing by the standard deviation of each column separately:

str(mtcars[c("hp", "wt")])
str(scale(mtcars[c("hp", "wt")]))

Thus if we want standardized regression coefficients we can fit our model as follows:

mod.mpg2 <- lm(mpg ~ scale(cbind(hp = hp, wt = wt)) * am, data = mtcars)
summary(mod.mpg2)

Polynomials

If we want polynomial terms we can generate them ourselves, or use the poly function to do it for us.

summary(lm(mpg ~ disp + I(disp^2) + I(disp^3), data = mtcars))
summary(lm(mpg ~ poly(disp, degree = 3, raw = TRUE), data = mtcars))
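As a quick check, the two parameterizations above give identical fitted values; so does the default raw = FALSE, which uses orthogonal polynomials spanning the same space:

all.equal(fitted(lm(mpg ~ disp + I(disp^2) + I(disp^3), data = mtcars)),
          fitted(lm(mpg ~ poly(disp, degree = 3), data = mtcars)))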

Splines

Similarly, we can use the bs function from the splines package to generate piece-wise polynomials (B-splines). You can specify either df or the knots directly; if you give df, bs will choose df - degree (minus one if an intercept is included) knots at suitable quantiles of the data.

library(splines)
mod.mpg4 <- lm(mpg ~ bs(disp, knots = quantile(disp, c(.25, .50, .75))),
               data = mtcars)
summary(mod.mpg4)

Again, there is nothing magical about splines; they are basically piece-wise polynomials, and we can calculate a quick-and-dirty version ourselves without too much trouble:

disp.qtl <- quantile(mtcars$disp, c(.25, .50, .75))
mod.bs.manual <- predict(lm(mpg ~ disp +
                              I(disp^2) +
                              I(disp^3) +
                              ifelse(disp > disp.qtl[1],
                                     (disp - disp.qtl[1])^3,
                                     0) +
                              ifelse(disp > disp.qtl[2],
                                     (disp - disp.qtl[2])^3,
                                     0) +
                              ifelse(disp > disp.qtl[3],
                                     (disp - disp.qtl[3])^3,
                                     0),
                            data = mtcars))

## our quick-and-dirty spline captures the same information as those
## generated by the bs function:

all.equal(mod.bs.manual, predict(mod.mpg4))

## although, it is not scaled or parameterized exactly the same.

Finally, we can construct natural cubic splines using the ns function. As with bs you can specify either df or you can specify knots directly.
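For example, a minimal sketch using the splines package loaded above (mod.mpg.ns is just an illustrative name):

## natural cubic spline in disp with 4 degrees of freedom; ns picks the knots
mod.mpg.ns <- lm(mpg ~ ns(disp, df = 4), data = mtcars)
summary(mod.mpg.ns)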


What all of these transformation functions (i.e., scale, poly, bs, ns) have in common is that they take a variable as input and return a matrix whose row dimension equals the length of the input variable. When called inside the formula of a modeling function (e.g., lm), the generated matrix is used on the right-hand side of the model.
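A quick way to see this (a sketch, using disp as the input variable):

dim(scale(mtcars$disp))              ## 32 x 1
dim(poly(mtcars$disp, degree = 3))   ## 32 x 3
dim(bs(mtcars$disp, df = 5))         ## 32 x 5
dim(ns(mtcars$disp, df = 5))         ## 32 x 5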

Your turn: Visualizing polynomial and spline regressions

  1. Predict mpg from hp using a natural spline (ns) with df = 3.

### YOUR SOLUTION HERE
mod.mpg.ns3 <-
summary(mod.mpg.ns3)

  2. Predict mpg from hp using a natural spline with df = 8.

### YOUR SOLUTION HERE
mod.mpg.ns8 <-
summary(mod.mpg.ns8)

  3. Plot the predictions from these two models.

### YOU FILL IN THE PREDICTIONS HERE
mtcars$mpg.ns3 <-
mtcars$mpg.ns8 <-

Plot:

ggplot(mtcars, aes(x = hp, y = mpg)) +
    geom_point() +
    geom_line(aes(y = mpg.ns3, color = "df = 3")) +
    geom_line(aes(y = mpg.ns8, color = "df = 8"))

  4. (Advanced) Refit the df = 3 model, adding an interaction with am. Plot the predictions, coloring by am.

### YOUR SOLUTION HERE

Selecting parameter values

Once we start fitting models with polynomials or splines, we will quickly realize that we need a way to choose the optimal parameters. What degree polynomial should we use? How many degrees of freedom in our basis-splines?

A simple answer is to choose a measure of model fit, test out a range of parameters, and see which one fits best. That makes sense, but be careful not to over-fit the model. In other words, you want your model to capture generalizable patterns, not the noise in the particular sample you happen to have.

Basic cross-validation

There are several approaches to avoiding over-fitting, with cross-validation being a simple and common method. Here are the basic steps.

  1. Split the data into training and test sets.
  2. Choose a measure of model fit, e.g., $R^2$ or root-mean-square error (RMSE).
  3. Fit each model to the training set.
  4. Calculate the measure of model fit on the test set.

Here is an example using the mtcars data.

## R-squared of `model` evaluated on `data`, where `y` names the response column.
## (Truncated at 0 so that very poor out-of-sample fits don't go negative.)
rsq <- function(model, data, y) {
  y <- data[[y]]
  pred <- predict(model, newdata = data)
  tss <- sum((y - mean(y))^2)
  rss <- sum((y - pred)^2)
  max(0, 1 - rss / tss)
}

## Mean squared error of `model` evaluated on `data`.
mse <- function(model, data, y) {
  y <- data[[y]]
  pred <- predict(model, newdata = data)
  mean((y - pred)^2)
}
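Before using these helpers for cross-validation, here is a quick usage sketch on the full data (mod.all is just an illustrative name); both take a fitted model, a data frame, and the name of the response column:

mod.all <- lm(mpg ~ disp, data = mtcars)
rsq(mod.all, mtcars, "mpg") ## in-sample R-squared
mse(mod.all, mtcars, "mpg") ## in-sample mean squared error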

Split the data into test and training sets.

id.train <- sample(1:nrow(mtcars), floor(nrow(mtcars) * 0.5))
mt.train <- mtcars[id.train, ]
mt.test <- mtcars[-id.train, ]

Next, define a function to fit a model with given df and calculate R-square.

model.performance <- function(df, train, test) {
    mod <- lm(mpg ~ bs(disp, df = df), data = train)
    c(train.r2 = rsq(mod, train, "mpg"),
      test.r2 = rsq(mod, test, "mpg"))
}

Finally, fit the model with varying parameters and calculate the performance of each model.

dfs <- 2:6 ## parameters we want to iterate over

## fit models using sapply. We could have used a for-loop instead.
performance <- sapply(dfs, model.performance,
                      train = mt.train,
                      test = mt.test,
                      simplify = FALSE)

## arrange the result in a data.frame
performance <- as.data.frame(do.call(rbind, performance))
performance$df <- dfs

performance

Usually it's a good idea to visualize model performance as a function of the varying parameter(s).

library(ggplot2)

ggplot(performance, aes(x = df)) +
    geom_point(aes(y = train.r2, color = "train")) +
    geom_point(aes(y = test.r2, color = "test"))

k-fold cross-validation

There is a serious limitation to using a single test/training split to perform cross validation. Would you get the same result with a different split? Because we don't want our results to depend on the random partitioning into test and training sets, it is a good idea to use multiple splits. This leads us to k-fold cross-validation. The basic steps are:

  1. Split the data into k partitions.
  2. Choose a measure of model fit, e.g., $R^2$ or RMSE.
  3. For each partition, fit the model to the data excluding that partition.
  4. Calculate the measure of model fit on the excluded partition.
  5. Average the k measures of model fit.

Here is an example, again using the mtcars data.

Create k=5 partitions.

k <- 5

mtcars$partition <- cut(sample(1:nrow(mtcars), nrow(mtcars)), k)
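As a quick optional check, the folds should contain roughly equal numbers of rows:

table(mtcars$partition)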

Fit the model with varying parameters and calculate performance for each model.

dfs <- 2:6 ## parameters we want to iterate over

## Use a for-loop to iterate over the partitions.
## We could have used sapply instead
performance <- vector(mode = "list", length = k)
names(performance) <- unique(mtcars$partition)
for(partition in names(performance)) {
    ## Fit models using sapply. We could have used a for-loop instead.
    test <- mtcars$partition == partition
    performance[[partition]] <- sapply(dfs,
                                       model.performance,
                                       train = mtcars[!test, ],
                                       test = mtcars[test, ],
                                       simplify = FALSE)
}

## arrange the result in a list of data.frames
performance <- sapply(performance, function(x) {
    x <- as.data.frame(do.call(rbind, x))
    x$df <- dfs
    x},
    simplify = FALSE)

## add partition column
for(partition in names(performance)) {
    performance[[partition]] <- data.frame(performance[[partition]],
                                           partition = partition)
}

## reduce list of data.frames to a single data.frame
performance <- do.call(rbind, performance)

performance

Usually it's a good idea to visualize model performance as a function of the varying parameter(s).

library(ggplot2)

ggplot(performance, aes(x = df)) +
    geom_point(aes(y = train.r2, color = "train")) +
    geom_point(aes(y = test.r2, color = "test"))

## average test and training R2 across the k folds, for each df
for(dd in dfs) {
  cat("df =", dd,
      " mean test R2 =", mean(performance$test.r2[performance$df == dd]),
      " mean train R2 =", mean(performance$train.r2[performance$df == dd]),
      "\n")
}

The caret package is a full-featured system for training regression and classification models, and the modelr package is a light-weight alternative. Both take a little effort to get the hang of, but are worth exploring for more involved cross-validation workflows.
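For example, a minimal sketch of what 5-fold cross-validation of a linear model might look like in caret (assuming the package is installed; cv.fit is just an illustrative name):

## install.packages("caret") # uncomment and run if not already installed.
library(caret)

cv.fit <- train(mpg ~ disp, data = mtcars,
                method = "lm",
                trControl = trainControl(method = "cv", number = 5))
cv.fit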

Your turn: Use cross validation to determine the optimal degrees of freedom

Note: You may use the convenience functions we wrote earlier, or write your own.

  1. Split the mtcars data into a test and training set.

### YOU FINISH CODE HERE
train <-
mt.train <- mtcars[train, ]
mt.test <- mtcars[-train, ]

  2. Iterate over df from 2-6 for each predictor, predicting mpg from hp and drat using natural splines (ns). For each df combination calculate the MSE in the test and training set.

model.performance2 <- function(df, train, test) {
    mod <- lm(mpg ~ ns(hp, df = df[1]) + ns(drat, df = df[2]), data = train)
    c(train.mse = mse(mod, train, "mpg"),
      test.mse = mse(mod, test, "mpg"))
}

### YOU FINISH CODE HERE
?expand.grid
dfs <-
names(dfs) <- c('df.hp', 'df.drat')

ns.performance <- apply()

  3. Plot the MSE values in the test and training set as a function of df for hp and drat. What is the best df pair?

plot.performance <- data.frame(dfs, t(ns.performance))

### YOUR PLOT HERE

  4. Make sure that your solution is NOT on the boundary of your df grid. If it is, adjust your bounds to ensure that the global minimum of the MSE is not on a boundary.

  5. (Advanced) Repeat steps 1-3 several times. Do you get the same answer with different train/test splits?

### YOUR SOLUTION HERE

Smoothing models (loess, gam)

As flexible as the transformation mechanism described above is, it is limited. Other approaches to modeling non-linear functions include local polynomial regression (loess) and generalized additive models (gam). Rather than using standard (generalized) linear models with transformations on the right-hand-side, these models are best fit in R using specialized functions.

Local polynomial regression fits, at each point where a prediction is needed, a low-degree polynomial to the nearby observations, weighted by their distance from that point. The most important parameters are span, which controls how large the local neighborhood is (larger values give smoother fits), and degree, the degree of the local polynomial.

mod.mpg5 <- loess(mpg ~ disp, data = mtcars) # default span = 0.75
mod.mpg6 <- loess(mpg ~ disp, span = 0.25, data = mtcars) # less smooth
mod.mpg7 <- loess(mpg ~ disp, span = 2.0, data = mtcars) # more smooth

ggplot(data = transform(mtcars,
                 pred.mpg5 = predict(mod.mpg5),
                 pred.mpg6 = predict(mod.mpg6),
                 pred.mpg7 = predict(mod.mpg7)),
       mapping = aes(x = disp, y = mpg)) +
  geom_point() +
  geom_line(mapping = aes(y = pred.mpg5, color = "span = 0.75")) +
  geom_line(mapping = aes(y = pred.mpg6, color = "span = 0.25")) +
  geom_line(mapping = aes(y = pred.mpg7, color = "span = 2.00")) +
  labs(color = "loess span")

There are two popular implementations of generalized additive models in R: the older gam package and the more recent mgcv package. We'll use gam because it is somewhat simpler and easier to understand.

Calls to the gam function look much like the calls to lm with transformations that we saw earlier:

library(gam)
mod.mpg8 <- gam(mpg ~ s(disp, spar = 0.1), data = mtcars)
mod.mpg9 <- gam(mpg ~ s(disp, spar = 0.7), data = mtcars)
mod.mpg10 <- gam(mpg ~ s(disp, spar = 1), data = mtcars)

ggplot(data = transform(mtcars,
                 pred.mpg8 = predict(mod.mpg8),
                 pred.mpg9 = predict(mod.mpg9),
                 pred.mpg10 = predict(mod.mpg10)),
       mapping = aes(x = disp, y = mpg)) +
  geom_point() +
  geom_line(mapping = aes(y = pred.mpg8, color = "spar = 0.1")) +
  geom_line(mapping = aes(y = pred.mpg9, color = "spar = 0.7")) +
  geom_line(mapping = aes(y = pred.mpg10, color = "spar = 1.0")) +
  labs(color = "smoothing")

Under the hood things are a bit different. The s function doesn't actually compute the smooth splines -- it just adds some attributes so that the gam function knows how to generate the model matrix. In other words the s function is specific to the gam package: lm(mpg ~ s(disp, spar = 0.1), data = mtcars) will not work.

Your turn: Use cross validation to determine the optimal smoothing parameter for a GAM

Note: You may modify and use the functions we wrote earlier, or write your own.

  1. Split the mtcars data into a test and training set.

### YOU FINISH CODE HERE
train <-
mt.train <- mtcars[train, ]
mt.test <- mtcars[-train, ]

  2. Iterate over spar from 0.1 to 0.8 (in increments of 0.1), predicting mpg from hp using a generalized additive model (gam). For each value of spar calculate the $R^2$ value in the test and training set.

library(gam)

### YOU FINISH CODE HERE
spars <-
###

## vectors to store results
train.r2 <- c()
test.r2 <- c()

## iterate using a for-loop
for(sp in spars) {

  ### YOUR SOLUTION HERE
  mod <-
  train.r2 <- c(train.r2,#HERE)
  test.r2 <- c(test.r2, #HERE)
  ###
}

## arrange the data
gam.r2 <- data.frame(spar = c(spars, spars),
                     set = rep(c("train", "test"), each = length(spars)),
                     r2 = c(train.r2, test.r2))
  3. Plot the $R^2$ values in the test and training set as a function of spar. What is the best spar setting?
### YOUR SOLUTION HERE
  4. (Advanced) Repeat steps 1-3 several times. Do you get the same answer with different train/test splits?
### YOUR SOLUTION HERE
  5. (Advanced) Use 5-fold cross-validation to select the optimal spar value.
### YOUR SOLUTION HERE