# Lab 2: Model Fitting

Key Word(s): Smoothing, Model Fitting, GAM

Supplementary Material

R provides functions for fitting linear (`lm`) and generalized linear models (`glm`), local polynomial regression models (`loess`), and nonlinear least squares (`nls`) among others. Functions for fitting many other kinds of models are available in various R packages.

# Formula

What most of these model fitting functions have in common is a way of specifying relationships among variables in the model in terms of a formula. A formula is usually defined using the tilde operator (`~`), with the response on the left and predictors on the right.

Thinking of the formula operator as a domain-specific language is helpful, because many things work differently inside a formula than they do in the rest of R. For example:

• outside of a formula, `+` means "addition", but inside a formula `+` means "inclusion"
• outside of a formula ":" is a sequence operator, but inside a formula it means "interaction"
• outside a formula `(a + b + c)^2` means "square the sum of a, b, and c"; inside a formula it means "include a, b, c, and all two-way interactions between them"

Note that you can escape to the usual meaning with the `I` function: `I((a + b + c)^2)` means "square the sum of a, b, and c, even if inside a formula.

More details on the formula interface are described in the help page (`?formula`).

## Model fitting examples

R comes with some built-in data sets that can be used for examples and demos. You can ask for the list of available data sets by calling `data()`. Here we will use the built-in mtcars data set.

```data(mtcars)
str(mtcars)
?mtcars
```

Fit a simple linear model predicting miles per gallon (`mpg`) from horsepower (`hp`) and weight (`wt`).

```lm(mpg ~ hp + wt, data = mtcars)
```

Add interaction between `hp` and `wt`.

```lm(mpg ~ hp * wt, data = mtcars)
```

Add interactions between `hp` and automatic transmission (`am`), and between `wt` and `am`.

```lm(mpg ~ (hp + wt) * am, data = mtcars)
```

# Types, Classes and methods (lm, predict, resid)

We've seen how to use formula in R to fit linear models, but you may have been somewhat underwhelmed by the result. `lm` returns simply the Call that produced it, and the regression coefficients. To do anything useful with these models we will want to assign them to a name and then do some post-estimation calculations. In order to do that we need to know what `lm` returns, and what methods exist for it. Most of this is documented in `?lm`, but we can use R to inspect the result and learn for ourselves.

Objects produced by `lm` are named lists, and can be treated just like any other list in R. Let's simplify our last example model and take a closer look.

```mod.mpg1 <- lm(mpg ~ wt * am, data = mtcars)

typeof(mod.mpg1)
length(mod.mpg1)
names(mod.mpg1)

mod.mpg1[["coefficients"]]
```

Since `mod.mpg1` is a list, we can do anything with it that we can do with other lists in R. For example, we can extract elements by position or name:

```mod.mpg1[["coefficients"]]
```
```hist(mod.mpg1[])
```

Unlike other lists in R, lists produced by `lm` are of class `lm`. There are functions with methods that correspond to this class, and we can ask R to tell us what those functions are using the `methods` function.

```class(mod.mpg1)
methods(class = class(mod.mpg1))
```

We now know that there are `summary`, `anova`, and `confint` methods (among others) for objects of class `lm`. That is, we know that we can do the following:

```summary(mod.mpg1)
anova(mod.mpg1)
confint(mod.mpg1)
```

Note In Python the equivalent of `summary(mod.mpg1)` would probably look like `mod.mpg1.summary()`. This object-oriented style does exist in R (though it uses `\$` instead of `.` to access methods), but it is not commonly used.

This technique of identifying the class of an R object, and then looking up which generic functions have corresponding methods is very useful. Remember thought that it only shows you functions that have specific methods; for example `methods(class = class(mod.mpg1))` did not show us a method for `[[`, but we know that we can use bracket extraction on a lm object. That just means that `[[` doesn't do anything special because of the object's class -- it just treats it like any other list.

We can also go the other way around, and ask R what methods exist for a specific generic function. For example,

```methods(summary)
```

shows us all the different object classes that have `summary` methods.

Most modeling functions in R usually return objects with at least `summary`, `print`, and `predict` methods. The `predict` methods are useful for visualizing the quality of your model. Once you've calculated predicted values you can use the `plot` or `ggplot` functions to construct a graph. Let's use `ggplot` to visualize the predictions from our last model.

```## install.packages("ggplot2") # uncomment and run if not already installed.
library(ggplot2)

mtc <- transform(mtcars, pred.mpg = predict(mod.mpg1, interval = "confidence"))
mtc <- mtc[order(mtc\$wt), ]

ggplot(mtc,
mapping = aes(x = wt, y = mpg, color = as.factor(am))) +
geom_point() +
geom_smooth(mapping = aes(y = pred.mpg.fit,
ymin = pred.mpg.lwr,
ymax = pred.mpg.upr),
stat = "identity")
```

# Your turn: Get to know lm

1. Fit a linear model predicting miles per gallon (`mpg`) from number of cylinders (`cyl`), displacement (`disp`), horsepower (`hp`), and all two-way interactions among these three predictors. Do not include the three way interaction.
```### YOUR SOLUTION HERE
```
1. Number of cylinders and displacement are so highly correlated that it makes sense to think of them both as measures of engine size. Fit a model predicting mpg from the sum of `cyl` and `disp`.
```### YOUR SOLUTION HERE
```
1. Fit a model predicting `mpg` from `disp`, assigning the result to the name `mod.mpg`. Get the `summary` of the model and assign the result to the name `mod.mpg.sum`.
```### YOUR SOLUTION HERE
```
1. Extract the `r.squared` value from `mod.mpg.sum`.
```### YOUR SOLUTION HERE
```
1. Graph the observed and predicted values of `mpg` as a function of `disp` based on your model.

```{r,error=T} library(ggplot2) mtc <- transform(mtcars, pred.mpg = predict(mod.mpg))

```6. (Advanced) Fit a logistic model predicting *automatic vs.
manual transmission* (`am`) from *displacement* (`disp`). (Hint: use `glm`)

```{r}
```
1. (Advanced) Fit a logistic model predicting automatic vs. manual transmission (`am`) from displacement (`disp`) and number of gears (`gear`). Treat gear as an ordered factor and use polynomial contrasts.
```### YOUR SOLUTION HERE
```
1. (Advanced) Inference from linear regression models is based on several assumptions, including the assumption that the residuals are normally distributed. Visually inspect your model to determine if this assumption has been violated.
```### YOUR SOLUTION HERE
```
1. (Advanced) Add `am` and the interaction between `disp` and `am` to your model. What is the regression coefficient for `mpg` on `disp` when `am = 0`? When `am = 1`?
```### YOUR SOLUTION HERE
```

# Transformations (Polynomials and Splines)

The R formula provides a way to specify relationships among variables, and among transformations of those variables. We've already seen that `*` transforms the inputs by including both the specified variables and their product. Many other transformations are possible. Indeed, for categorical predictors transformation is mandatory and happens by default; R will transform categorical predictors into a matrix of indicators or dummy codes with the first level as the reference group.

In general, any transformation that returns a vector of length equal to the number of rows in the data or a matrix with the same row dimension as the input data is legal.

## Standardized regression coefficients

A example of transformation is the `scale` function, which standardizes it's first argument by subtracting the mean and dividing by the standard deviation of each column seperately:

```str(mtcars[c("hp", "wt")])
str(scale(mtcars[c("hp", "wt")]))
```

Thus if we want standardized regression coefficients we can fit our model as follows:

```mod.mpg2 <- lm(mpg ~ scale(cbind(hp = hp, wt = wt)) * am, data = mtcars)
summary(mod.mpg2)
```

## Polynomials

If we want polynomial terms we can generate them ourselves, or use the `poly` function to do it for us.

```summary(lm(mpg ~ disp + I(disp^2) + I(disp^3), data = mtcars))
summary(lm(mpg ~ poly(disp, degree = 3, raw = TRUE), data = mtcars))
```

## Splines

Similarly, we can use the `bs` function from the `splines` package to generate piece-wise polynomials. You can specify either `df` or you can specify `knots` directly (in which case `ns` will choose ```df - 1 - intercept``` knots at suitable intervals).

```library(splines)
mod.mpg4 <- lm(mpg ~ bs(disp, knots = quantile(disp, c(.25, .50, .75))),
data = mtcars)
summary(mod.mpg4)
```

Again there is nothing magical about splines, they are basically piece-wise polynomials. We can calculate a quick-and-dirty version ourselves without too much trouble:

```disp.qtl <- quantile(mtcars\$disp, c(.25, .50, .75))
mod.bs.manual <- predict(lm(mpg ~ disp +
I(disp^2) +
I(disp^3) +
ifelse(disp > disp.qtl,
(disp - disp.qtl)^3,
0) +
ifelse(disp > disp.qtl,
(disp - disp.qtl)^3,
0) +
ifelse(disp > disp.qtl,
(disp - disp.qtl)^3,
0),
data = mtcars))

## our quick-and-dirty spline captures the same information as those
## generated by the bs function:

all.equal(mod.bs.manual, predict(mod.mpg4))

## although, it is not scaled or parameterized exactly the same.
```

Finally, we can construct natural cubic splines using the `ns` function. As with `bs` you can specify either `df` or you can specify `knots` directly.

What all of these transformation functions (i.e., `scale`, `poly`, `bs`, `ns`) have in common is that they take a variable as input and return a matrix where the row dimension is equal to the length of the input variable. When called inside a `formula` of a modeling function (e.g., `lm`) the generated matrix will be used in the right-hand-side of the equation.

# Your turn: Visualizing polynomial and spline regressions

1. Predict `mpg` from `hp` using a natural spline (`ns`) with ```df = 3```.

```{r,error=T}

mod.mpg.ns3 <- summary(mod.mpg.ns3)

```2. Predict `mpg` from `hp` using a *natural spline* with `df = 8`.

```{r,error=T}
mod.mpg.ns8 <-
summary(mod.mpg.ns8)
```
1. Plot the predictions from these two models.

```{r,error=T}

### YOU FILL PREDICTIONS HERE

mtcars\$mpg.ns3 <- mtcars\$mpg.ns8 <-

### Plot:

ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point() + geom_line(aes(y = mpg.ns3,color='red'))+ geom_line(aes(y = mpg.ns8,color='blue'))

```4. (Advanced) Refit the `df=3` model, adding an interaction with `am`.
Plot the predictions, coloring by `am`.

```{r}
```

# Selecting parameter values

Once we start fitting models with polynomials or splines, we will quickly realize that we need a way to choose the optimal parameters. What degree polynomial should we use? How many degrees of freedom in our basis-splines?

A simple answer is to choose a measure of model fit, test out a range of parameters, and see which one fits the best. That makes sense, but be careful not to over-fit the model. In other words, you want you model to capture generalizable patterns, and not to capture noise in the sample you happen to have.

## Basic cross-validation

There are several approaches to avoiding over-fitting, with cross-validation being a simple and common method. Here are the basic steps.

1. Split the data into training and test sets.
2. Choose a measure of model fit, e.g., \$R^2\$ or root-mean-square error (\$rmse\$).
3. Fit each model to the training set.
4. Calculate the measure of model fit on the test set.

Here is an example using the `mtcars` data.

```rsq <- function(model, data, y) {
y <- data[[y]]
predict <- predict(model, newdata = data)
tss = sum((y - mean(y))^2)
rsq_ = max(0, 1 - rss/tss)
return(rsq_)
}

mse<-function(model,data,y){
y<-data[[y]]
predict<-predict(model,newdata=data)
mse<-mean((y-predict)^2)
return(mse)
}
```

Split the data into test and training sets.

```id.train <- sample(1:nrow(mtcars), floor(nrow(mtcars) * 0.5))
mt.train <- mtcars[id.train, ]
mt.test <- mtcars[-id.train, ]
```

Next, define a function to fit a model with given df and calculate R-square.

```model.performance <- function(df, train, test) {
mod <- lm(mpg ~ bs(disp, df = df), data = train)
c(train.r2 = rsq(mod, train, "mpg"),
test.r2 = rsq(mod, test, "mpg"))
}
```

Finally, fit the model with varying parameters and calculate performance for each model

```dfs <- 2:6 ## parameters we want to iterate over

## fit models using sapply. We could have used a for-loop instead.
performance <- sapply(dfs, model.performance,
train = mt.train,
test = mt.test,
simplify = FALSE)

## arrange the result in a data.frame
performance <- as.data.frame(do.call(rbind, performance))
performance\$df <- dfs

performance
```

Usually it's a good idea to visualize model performance as a function of the varying parameter(s).

```library(ggplot2)

ggplot(performance, aes(x = df)) +
geom_point(aes(y = train.r2, color = "train")) +
geom_point(aes(y = test.r2, color = "test"))
```

## k-fold cross-validation

There is a serious limitation to using a single test/training split to perform cross validation. Would you get the same result with a different split? Because we don't want our results to depend on the random partitioning into test and training sets, it is a good idea to use multiple splits. This leads us to k-fold cross-validation. The basic steps are:

1. Split the data into `k` paritions.
2. Choose a measure of model fit, e.g., \$R^2\$ or \$rmse\$.
3. For each partition, fit the model to the data excluding that partition.
4. Calculate the measure of model fit on the excluded partition.
5. Average the `k` measures of model fit.

Here is an example, again using the `mtcars` data.

Create `k=5` partitions.

```k <- 5

mtcars\$partition <- cut(sample(1:nrow(mtcars), nrow(mtcars)), 5)
```

Fit the model with varying parameters and calculate performance for each model.

```dfs <- 2:6 ## parameters we want to iterate over

## Use a for-loop to iterate over the partitions.
## We could have used sapply instead
performance <- vector(mode = "list", length = k)
names(performance) <- unique(mtcars\$partition)
for(partition in names(performance)) {
## Fit models using sapply. We could have used a for-loop instead.
test <- mtcars\$partition == partition
performance[[partition]] <- sapply(dfs,
model.performance,
train = mtcars[!test, ],
test = mtcars[test, ],
simplify = FALSE)
}

## arrange the result in a list of data.frames
performance <- sapply(performance, function(x) {
x <- as.data.frame(do.call(rbind, x))
x\$df <- dfs
x},
simplify = FALSE)

for(partition in names(performance)) {
performance[[partition]] <- data.frame(performance[[partition]],
partition = partition)
}

## reduce list of data.frames to a single data.frame
performance <- do.call(rbind, performance)

performance
```

Usually it's a good idea to visualize model performance as a function of the varying parameter(s).

```library(ggplot2)

ggplot(performance, aes(x = df)) +
geom_point(aes(y = train.r2, color = "train")) +
geom_point(aes(y = test.r2, color = "test"))

for(dd in dfs){
print(dd)
print(c("Test R2",mean(performance\$test.r2[performance\$df==dd])))
print(c("Train R2",mean(performance\$train.r2[performance\$df==dd])))
}
```

The `caret` package is a full-featured system for training regression and classification models. The `modelr` package is a light-weight alternative. A little touch to get the hang of, but go a

# Your turn: Use cross validation to determine the optimal degrees of freedom

Note: You may use the convenience functions we wrote earlier, or write your own.

1. Split the `mtcars` data into a test and training set.

```{r,error=T}

### YOU FINISH CODE HERE

train <-

mt.train <- mtcars[train, ] mt.test <- mtcars[-train, ]

```2. Iterate over `df` from 2-6, predicting `mpg` from `hp` and 'drat' using
*natural splines* (`ns`). For each `df` combination calculate the \$R^2\$ value in
the test and training set.

```{r,error=T}
model.performance2 <- function(df, train, test) {
mod <- lm(mpg ~ ns(hp, df = df)+ns(drat,df=df), data = train)
c(train.mse = mse(mod, train, "mpg"),
test.mse = mse(mod, test, "mpg"))
}

### YOU FINISH CODE HERE
?expand.grid
dfs <-
names(dfs)<-c('df.hp','df.drat')

ns.performance<- apply()
```
1. Plot the \$R^2\$ values in the test and training set as a function of `df` for`hp` and `drat`. What is the best `df` pair? ```{r,error=T} plot.performance<-data.frame(dfs,t(ns.performance))

```4. Make sure that your solution is NOT on the boundary of your df grid. If it is, adjust your bounds to ensure that the global minimum of hte MSE is not on a boundary.

5. (Advanced) Repeat steps 1-3 several times. Do you get the same

```{r}
```

# Smoothing models (loess, gam)

As flexible as the transformation mechanism described above is, it is limited. Other approaches to modeling non-linear functions include local polynomial regression (`loess`) and generalized additive models (`gam`). Rather than using standard (generalized) linear models with transformations on the right-hand-side, these models are best fit in R using specialized functions.

Local polynomial regression works by fitting a polynomial weighted by distance from the point being predicted. The most important parameters are `span` and `degree`, which control the down-weighting of more distant points, and the degree of the polynomial, respectively.

```mod.mpg5 <- loess(mpg ~ disp, data = mtcars) # default span = 0.75
mod.mpg6 <- loess(mpg ~ disp, span = 0.25, data = mtcars) # less smooth
mod.mpg7 <- loess(mpg ~ disp, span = 2.0, data = mtcars) # more smooth

ggplot(data = transform(mtcars,
pred.mpg5 = predict(mod.mpg5),
pred.mpg6 = predict(mod.mpg6),
pred.mpg7 = predict(mod.mpg7)),
mapping = aes(x = disp, y = mpg)) +
geom_point() +
geom_line(mapping = aes(y = pred.mpg5,color=0.75))+
geom_line(mapping = aes(y = pred.mpg6,color=0.25))+
geom_line(mapping = aes(y = pred.mpg7,color=2))
```

There are two popular implementations of generalized additive models in R, the older gam package, and the more recent mgcv package. mgcv. We'll use gam because it is somewhat simpler and easier to understand.

Calls go the `gam` function look much like calls to `lm` with transformations that we saw earlier:

```library(gam)
mod.mpg8 <- gam(mpg ~ s(disp, spar = 0.1), data = mtcars)
mod.mpg9 <- gam(mpg ~ s(disp, spar = 0.7), data = mtcars)
mod.mpg10 <- gam(mpg ~ s(disp, spar = 1), data = mtcars)

ggplot(data = transform(mtcars,
pred.mpg8 = predict(mod.mpg8),
pred.mpg9 = predict(mod.mpg9),
pred.mpg10 = predict(mod.mpg10)),
mapping = aes(x = disp, y = mpg)) +
geom_point() +
geom_line(mapping = aes(y = pred.mpg8,color=0.1))+
geom_line(mapping = aes(y = pred.mpg9,color=0.7))+
geom_line(mapping = aes(y = pred.mpg10,color=1))
```

Under the hood things are a bit different. The `s` function doesn't actually compute the smooth splines -- it just adds some attributes so that the `gam` function knows how to generate the model matrix. In other words the `s` function is specific to the gam package: `lm(mpg ~ s(disp, spar = 0.1), data = mtcars)` will not work.

# Your turn: Use cross validation to determine the optimal smoothing parameter for a GAM

Note You may modify and use the functions we wrote earlier, or write your own.

1. Split the `mtcars` data into a test and training set.

```{r,error=T}

### YOU FINISH CODE HERE

train <- mt.train <- mtcars[train, ] mt.test <- mtcars[-train, ]

```2. Iterate over `spar` from 0.1 -- 0.8 (in increments of 0.1),
predicting `mpg` from `hp` using a *generalized additive model*
(`gam`). For each value of`spar` calculate the \$R^2\$ value in the
test and training set.

```{r,error=T}
library(gam)

### YOU FINISH CODE HERE
spars <-
###

## vectors to store results
train.r2 <- c()
test.r2 <- c()

## iterate using a for-loop
for(sp in spars) {

mod <-
train.r2 <- c(train.r2,#HERE)
test.r2 <- c(test.r2, #HERE)
###
}

## arrange the data
gam.r2 <- data.frame(spar = c(spars, spars),
set = rep(c("train", "test"), each = length(spars)),
r2 = c(train.r2, test.r2))
```
1. Plot the \$R^2\$ values in the test and training set as a function of `spar`. What is the best `spar` setting?
```### YOUR SOLUTION HERE
```
1. (Advanced) Repeat steps 1-3 several times. Do you get the same answer with different train/test splits?
```### YOUR SOLUTION HERE
```
1. (Advanced) Use 5-fold cross-validation to select the optimal `spar` value.
```### YOUR SOLUTION HERE
```