# Lab 2: Model Fitting

Key Word(s): Smoothing, Model Fitting, GAM

R provides functions for fitting linear (`lm`

) and generalized linear
models (`glm`

), local polynomial regression models (`loess`

), and
nonlinear least squares (`nls`

) among others. Functions for fitting
many other kinds of models are available in various R packages.

# Formula

What most of these model fitting functions have in common is a way of
specifying relationships among variables in the model in terms of a
*formula*. A formula is usually defined using the tilde operator
(`~`

), with the response on the left and predictors on the right.

Thinking of the formula operator as a domain-specific language is helpful, because many things work differently inside a formula than they do in the rest of R. For example:

- outside of a formula,
`+`

means "addition", but inside a formula`+`

means "inclusion" - outside of a formula ":" is a sequence operator, but inside a formula it means "interaction"
- outside a formula
`(a + b + c)^2`

means "square the sum of a, b, and c"; inside a formula it means "include a, b, c, and all two-way interactions between them"

Note that you can escape to the usual meaning with the `I`

function:
`I((a + b + c)^2)`

means "square the sum of a, b, and c, even if
inside a formula.

More details on the formula interface are described in the help page
(`?formula`

).

## Model fitting examples

R comes with some built-in data sets that can be used for examples and
demos. You can ask for the list of available data sets by calling
`data()`

. Here we will use the built-in *mtcars* data set.

```
data(mtcars)
str(mtcars)
?mtcars
```

Fit a simple linear model predicting *miles per gallon* (`mpg`

) from
*horsepower* (`hp`

) and *weight* (`wt`

).

```
lm(mpg ~ hp + wt, data = mtcars)
```

Add interaction between `hp`

and `wt`

.

```
lm(mpg ~ hp * wt, data = mtcars)
```

Add interactions between `hp`

and *automatic transmission* (`am`

), and
between `wt`

and `am`

.

```
lm(mpg ~ (hp + wt) * am, data = mtcars)
```

# Types, Classes and methods (lm, predict, resid)

We've seen how to use *formula* in R to fit linear models, but you may
have been somewhat underwhelmed by the result. `lm`

returns simply the
*Call* that produced it, and the regression coefficients. To do
anything useful with these models we will want to assign them to a
name and then do some post-estimation calculations. In order to do
that we need to know what `lm`

returns, and what methods exist for it.
Most of this is documented in `?lm`

, but we can use R to inspect the
result and learn for ourselves.

Objects produced by `lm`

are named *lists*, and can be treated just
like any other list in R. Let's simplify our last example model and
take a closer look.

```
mod.mpg1 <- lm(mpg ~ wt * am, data = mtcars)
typeof(mod.mpg1)
length(mod.mpg1)
names(mod.mpg1)
mod.mpg1[["coefficients"]]
```

Since `mod.mpg1`

is a list, we can do anything with it that we can do
with other lists in R. For example, we can extract elements by
position or name:

```
mod.mpg1[["coefficients"]]
```

```
hist(mod.mpg1[[2]])
```

Unlike other lists in R, lists produced by `lm`

are of *class* `lm`

.
There are functions with methods that correspond to this class, and we
can ask R to tell us what those functions are using the `methods`

function.

```
class(mod.mpg1)
methods(class = class(mod.mpg1))
```

We now know that there are `summary`

, `anova`

, and `confint`

methods
(among others) for objects of class `lm`

. That is, we know that we can
do the following:

```
summary(mod.mpg1)
anova(mod.mpg1)
confint(mod.mpg1)
```

**Note** In Python the equivalent of `summary(mod.mpg1)`

would
probably look like `mod.mpg1.summary()`

. This *object-oriented* style
does exist in R (though it uses `$`

instead of `.`

to access methods),
but it is not commonly used.

This technique of identifying the *class* of an R object, and then
looking up which generic functions have corresponding *methods* is
very useful. Remember thought that it only shows you functions that
have specific methods; for example `methods(class = class(mod.mpg1))`

did not show us a method for `[[`

, but we know that we can use bracket
extraction on a *lm* object. That just means that `[[`

doesn't do
anything special because of the object's class -- it just treats it
like any other list.

We can also go the other way around, and ask R what methods exist for a specific generic function. For example,

```
methods(summary)
```

shows us all the different object classes that have `summary`

methods.

Most modeling functions in R usually return objects with at least
`summary`

, `print`

, and `predict`

methods. The `predict`

methods are
useful for visualizing the quality of your model. Once you've
calculated predicted values you can use the `plot`

or `ggplot`

functions to construct a graph. Let's use `ggplot`

to visualize the
predictions from our last model.

```
## install.packages("ggplot2") # uncomment and run if not already installed.
library(ggplot2)
mtc <- transform(mtcars, pred.mpg = predict(mod.mpg1, interval = "confidence"))
mtc <- mtc[order(mtc$wt), ]
ggplot(mtc,
mapping = aes(x = wt, y = mpg, color = as.factor(am))) +
geom_point() +
geom_smooth(mapping = aes(y = pred.mpg.fit,
ymin = pred.mpg.lwr,
ymax = pred.mpg.upr),
stat = "identity")
```

# Your turn: Get to know *lm*

- Fit a linear model predicting
*miles per gallon*(`mpg`

) from number of cylinders (`cyl`

),*displacement*(`disp`

),*horsepower*(`hp`

), and all two-way interactions among these three predictors. Do*not*include the three way interaction.

```
### YOUR SOLUTION HERE
```

- Number of cylinders and displacement are so highly correlated that
it makes sense to think of them both as measures of
*engine size*. Fit a model predicting mpg from the sum of`cyl`

and`disp`

.

```
### YOUR SOLUTION HERE
```

- Fit a model predicting
`mpg`

from`disp`

, assigning the result to the name`mod.mpg`

. Get the`summary`

of the model and assign the result to the name`mod.mpg.sum`

.

```
### YOUR SOLUTION HERE
```

- Extract the
`r.squared`

value from`mod.mpg.sum`

.

```
### YOUR SOLUTION HERE
```

- Graph the observed and predicted values of
`mpg`

as a function of`disp`

based on your model.

```{r,error=T} library(ggplot2) mtc <- transform(mtcars, pred.mpg = predict(mod.mpg))

### YOUR SOLUTION HERE

```
6. (Advanced) Fit a logistic model predicting *automatic vs.
manual transmission* (`am`) from *displacement* (`disp`). (Hint: use `glm`)
```{r}
### YOUR SOLUTION HERE
```

- (Advanced) Fit a logistic model predicting
*automatic vs. manual transmission*(`am`

) from*displacement*(`disp`

) and*number of gears*(`gear`

). Treat gear as an ordered factor and use polynomial contrasts.

```
### YOUR SOLUTION HERE
```

- (Advanced) Inference from linear regression models is based on several assumptions, including the assumption that the residuals are normally distributed. Visually inspect your model to determine if this assumption has been violated.

```
### YOUR SOLUTION HERE
```

- (Advanced) Add
`am`

and the interaction between`disp`

and`am`

to your model. What is the regression coefficient for`mpg`

on`disp`

when`am = 0`

? When`am = 1`

?

```
### YOUR SOLUTION HERE
```

# Transformations (Polynomials and Splines)

The R *formula* provides a way to specify relationships among
variables, and among *transformations* of those variables. We've
already seen that `*`

transforms the inputs by including both the
specified variables and their product. Many other transformations are
possible. Indeed, for categorical predictors transformation is
mandatory and happens by default; R will transform categorical
predictors into a matrix of indicators or dummy codes with the first
level as the reference group.

In general, any transformation that returns a vector of length equal to the number of rows in the data or a matrix with the same row dimension as the input data is legal.

## Standardized regression coefficients

A example of transformation is the `scale`

function, which
standardizes it's first argument by subtracting the mean and dividing
by the standard deviation of each column seperately:

```
str(mtcars[c("hp", "wt")])
str(scale(mtcars[c("hp", "wt")]))
```

Thus if we want standardized regression coefficients we can fit our model as follows:

```
mod.mpg2 <- lm(mpg ~ scale(cbind(hp = hp, wt = wt)) * am, data = mtcars)
summary(mod.mpg2)
```

## Polynomials

If we want polynomial terms we can generate them ourselves, or use the
`poly`

function to do it for us.

```
summary(lm(mpg ~ disp + I(disp^2) + I(disp^3), data = mtcars))
summary(lm(mpg ~ poly(disp, degree = 3, raw = TRUE), data = mtcars))
```

## Splines

Similarly, we can use the `bs`

function from the `splines`

package to
generate piece-wise polynomials. You can specify either `df`

or you
can specify `knots`

directly (in which case `ns`

will choose ```
df - 1 -
intercept
```

knots at suitable intervals).

```
library(splines)
mod.mpg4 <- lm(mpg ~ bs(disp, knots = quantile(disp, c(.25, .50, .75))),
data = mtcars)
summary(mod.mpg4)
```

Again there is nothing magical about splines, they are basically piece-wise polynomials. We can calculate a quick-and-dirty version ourselves without too much trouble:

```
disp.qtl <- quantile(mtcars$disp, c(.25, .50, .75))
mod.bs.manual <- predict(lm(mpg ~ disp +
I(disp^2) +
I(disp^3) +
ifelse(disp > disp.qtl[1],
(disp - disp.qtl[1])^3,
0) +
ifelse(disp > disp.qtl[2],
(disp - disp.qtl[2])^3,
0) +
ifelse(disp > disp.qtl[3],
(disp - disp.qtl[3])^3,
0),
data = mtcars))
## our quick-and-dirty spline captures the same information as those
## generated by the bs function:
all.equal(mod.bs.manual, predict(mod.mpg4))
## although, it is not scaled or parameterized exactly the same.
```

Finally, we can construct natural cubic splines using the `ns`

function. As with `bs`

you can specify either `df`

or you can specify
`knots`

directly.

What all of these transformation functions (i.e., `scale`

, `poly`

,
`bs`

, `ns`

) have in common is that they take a variable as input and
return a matrix where the row dimension is equal to the length of the
input variable. When called inside a `formula`

of a modeling function
(e.g., `lm`

) the generated matrix will be used in the right-hand-side
of the equation.

# Your turn: Visualizing polynomial and spline regressions

- Predict
`mpg`

from`hp`

using a*natural spline*(`ns`

) with`df = 3`

.

```{r,error=T}

### YOUR SOLUTION HERE

mod.mpg.ns3 <- summary(mod.mpg.ns3)

```
2. Predict `mpg` from `hp` using a *natural spline* with `df = 8`.
```{r,error=T}
### YOUR SOLUTION HERE
mod.mpg.ns8 <-
summary(mod.mpg.ns8)
```

- Plot the predictions from these two models.

```{r,error=T}

### YOU FILL PREDICTIONS HERE

mtcars$mpg.ns3 <- mtcars$mpg.ns8 <-

### Plot:

ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point() + geom_line(aes(y = mpg.ns3,color='red'))+ geom_line(aes(y = mpg.ns8,color='blue'))

```
4. (Advanced) Refit the `df=3` model, adding an interaction with `am`.
Plot the predictions, coloring by `am`.
```{r}
### YOUR SOLUTION HERE
```

# Selecting parameter values

Once we start fitting models with polynomials or splines, we will quickly realize that we need a way to choose the optimal parameters. What degree polynomial should we use? How many degrees of freedom in our basis-splines?

A simple answer is to choose a measure of model fit, test out a range
of parameters, and see which one fits the best. That makes sense, but
be careful not to over-fit the model. In other words, you want you
model to capture generalizable patterns, and *not* to capture noise in
the sample you happen to have.

## Basic cross-validation

There are several approaches to avoiding over-fitting, with cross-validation being a simple and common method. Here are the basic steps.

- Split the data into training and test sets.
- Choose a measure of model fit, e.g., $R^2$ or root-mean-square error ($rmse$).
- Fit each model to the training set.
- Calculate the measure of model fit on the
*test*set.

Here is an example using the `mtcars`

data.

```
rsq <- function(model, data, y) {
y <- data[[y]]
predict <- predict(model, newdata = data)
tss = sum((y - mean(y))^2)
rss = sum((y-predict)^2)
rsq_ = max(0, 1 - rss/tss)
return(rsq_)
}
mse<-function(model,data,y){
y<-data[[y]]
predict<-predict(model,newdata=data)
mse<-mean((y-predict)^2)
return(mse)
}
```

Split the data into test and training sets.

```
id.train <- sample(1:nrow(mtcars), floor(nrow(mtcars) * 0.5))
mt.train <- mtcars[id.train, ]
mt.test <- mtcars[-id.train, ]
```

Next, define a function to fit a model with given df and calculate R-square.

```
model.performance <- function(df, train, test) {
mod <- lm(mpg ~ bs(disp, df = df), data = train)
c(train.r2 = rsq(mod, train, "mpg"),
test.r2 = rsq(mod, test, "mpg"))
}
```

Finally, fit the model with varying parameters and calculate performance for each model

```
dfs <- 2:6 ## parameters we want to iterate over
## fit models using sapply. We could have used a for-loop instead.
performance <- sapply(dfs, model.performance,
train = mt.train,
test = mt.test,
simplify = FALSE)
## arrange the result in a data.frame
performance <- as.data.frame(do.call(rbind, performance))
performance$df <- dfs
performance
```

Usually it's a good idea to visualize model performance as a function of the varying parameter(s).

```
library(ggplot2)
ggplot(performance, aes(x = df)) +
geom_point(aes(y = train.r2, color = "train")) +
geom_point(aes(y = test.r2, color = "test"))
```

## k-fold cross-validation

There is a serious limitation to using a single test/training split to perform cross validation. Would you get the same result with a different split? Because we don't want our results to depend on the random partitioning into test and training sets, it is a good idea to use multiple splits. This leads us to k-fold cross-validation. The basic steps are:

- Split the data into
`k`

paritions. - Choose a measure of model fit, e.g., $R^2$ or $rmse$.
- For each partition, fit the model to the data excluding that partition.
- Calculate the measure of model fit on the excluded partition.
- Average the
`k`

measures of model fit.

Here is an example, again using the `mtcars`

data.

Create `k=5`

partitions.

```
k <- 5
mtcars$partition <- cut(sample(1:nrow(mtcars), nrow(mtcars)), 5)
```

Fit the model with varying parameters and calculate performance for each model.

```
dfs <- 2:6 ## parameters we want to iterate over
## Use a for-loop to iterate over the partitions.
## We could have used sapply instead
performance <- vector(mode = "list", length = k)
names(performance) <- unique(mtcars$partition)
for(partition in names(performance)) {
## Fit models using sapply. We could have used a for-loop instead.
test <- mtcars$partition == partition
performance[[partition]] <- sapply(dfs,
model.performance,
train = mtcars[!test, ],
test = mtcars[test, ],
simplify = FALSE)
}
## arrange the result in a list of data.frames
performance <- sapply(performance, function(x) {
x <- as.data.frame(do.call(rbind, x))
x$df <- dfs
x},
simplify = FALSE)
## add partition column
for(partition in names(performance)) {
performance[[partition]] <- data.frame(performance[[partition]],
partition = partition)
}
## reduce list of data.frames to a single data.frame
performance <- do.call(rbind, performance)
performance
```

Usually it's a good idea to visualize model performance as a function of the varying parameter(s).

```
library(ggplot2)
ggplot(performance, aes(x = df)) +
geom_point(aes(y = train.r2, color = "train")) +
geom_point(aes(y = test.r2, color = "test"))
for(dd in dfs){
print(dd)
print(c("Test R2",mean(performance$test.r2[performance$df==dd])))
print(c("Train R2",mean(performance$train.r2[performance$df==dd])))
}
```

The `caret`

package is a full-featured system for training
regression and classification models. The `modelr`

package is a
light-weight alternative. A little touch to get the hang of, but go a

# Your turn: Use cross validation to determine the optimal degrees of freedom

*Note*: You may use the convenience functions we wrote earlier, or
write your own.

- Split the
`mtcars`

data into a test and training set.

```{r,error=T}

### YOU FINISH CODE HERE

train <-

mt.train <- mtcars[train, ] mt.test <- mtcars[-train, ]

```
2. Iterate over `df` from 2-6, predicting `mpg` from `hp` and 'drat' using
*natural splines* (`ns`). For each `df` combination calculate the $R^2$ value in
the test and training set.
```{r,error=T}
model.performance2 <- function(df, train, test) {
mod <- lm(mpg ~ ns(hp, df = df[1])+ns(drat,df=df[2]), data = train)
c(train.mse = mse(mod, train, "mpg"),
test.mse = mse(mod, test, "mpg"))
}
### YOU FINISH CODE HERE
?expand.grid
dfs <-
names(dfs)<-c('df.hp','df.drat')
ns.performance<- apply()
```

- Plot the $R^2$ values in the test and training set as a function of
`df`

for`hp`

and`drat`

. What is the best`df`

pair? ```{r,error=T} plot.performance<-data.frame(dfs,t(ns.performance))

### YOUR PLOT HERE

```
4. Make sure that your solution is NOT on the boundary of your df grid. If it is, adjust your bounds to ensure that the global minimum of hte MSE is not on a boundary.
5. (Advanced) Repeat steps 1-3 several times. Do you get the same
answer with different train/test splits?
```{r}
### YOUR SOLUTION HERE
```

# Smoothing models (loess, gam)

As flexible as the *transformation* mechanism described above is, it
is limited. Other approaches to modeling non-linear functions include
local polynomial regression (`loess`

) and generalized additive models
(`gam`

). Rather than using standard (generalized) linear models with
transformations on the right-hand-side, these models are best fit in R
using specialized functions.

Local polynomial regression works by fitting a polynomial weighted by
distance from the point being predicted. The most important parameters
are `span`

and `degree`

, which control the down-weighting of more
distant points, and the degree of the polynomial, respectively.

```
mod.mpg5 <- loess(mpg ~ disp, data = mtcars) # default span = 0.75
mod.mpg6 <- loess(mpg ~ disp, span = 0.25, data = mtcars) # less smooth
mod.mpg7 <- loess(mpg ~ disp, span = 2.0, data = mtcars) # more smooth
ggplot(data = transform(mtcars,
pred.mpg5 = predict(mod.mpg5),
pred.mpg6 = predict(mod.mpg6),
pred.mpg7 = predict(mod.mpg7)),
mapping = aes(x = disp, y = mpg)) +
geom_point() +
geom_line(mapping = aes(y = pred.mpg5,color=0.75))+
geom_line(mapping = aes(y = pred.mpg6,color=0.25))+
geom_line(mapping = aes(y = pred.mpg7,color=2))
```

There are two popular implementations of generalized additive models
in R, the older *gam* package, and the more recent *mgcv* package.
*mgcv*. We'll use *gam* because it is somewhat simpler and easier to
understand.

Calls go the `gam`

function look much like calls to `lm`

with
transformations that we saw earlier:

```
library(gam)
mod.mpg8 <- gam(mpg ~ s(disp, spar = 0.1), data = mtcars)
mod.mpg9 <- gam(mpg ~ s(disp, spar = 0.7), data = mtcars)
mod.mpg10 <- gam(mpg ~ s(disp, spar = 1), data = mtcars)
ggplot(data = transform(mtcars,
pred.mpg8 = predict(mod.mpg8),
pred.mpg9 = predict(mod.mpg9),
pred.mpg10 = predict(mod.mpg10)),
mapping = aes(x = disp, y = mpg)) +
geom_point() +
geom_line(mapping = aes(y = pred.mpg8,color=0.1))+
geom_line(mapping = aes(y = pred.mpg9,color=0.7))+
geom_line(mapping = aes(y = pred.mpg10,color=1))
```

Under the hood things are a bit different. The `s`

function doesn't
actually compute the smooth splines -- it just adds some attributes so
that the `gam`

function knows how to generate the model matrix. In
other words the `s`

function is *specific* to the *gam* package:
`lm(mpg ~ s(disp, spar = 0.1), data = mtcars)`

will not work.

# Your turn: Use cross validation to determine the optimal smoothing parameter for a GAM

*Note* You may modify and use the functions we wrote
earlier, or write your own.

- Split the
`mtcars`

data into a test and training set.

```{r,error=T}

### YOU FINISH CODE HERE

train <- mt.train <- mtcars[train, ] mt.test <- mtcars[-train, ]

```
2. Iterate over `spar` from 0.1 -- 0.8 (in increments of 0.1),
predicting `mpg` from `hp` using a *generalized additive model*
(`gam`). For each value of`spar` calculate the $R^2$ value in the
test and training set.
```{r,error=T}
library(gam)
### YOU FINISH CODE HERE
spars <-
###
## vectors to store results
train.r2 <- c()
test.r2 <- c()
## iterate using a for-loop
for(sp in spars) {
### YOUR SOLUTION HERE
mod <-
train.r2 <- c(train.r2,#HERE)
test.r2 <- c(test.r2, #HERE)
###
}
## arrange the data
gam.r2 <- data.frame(spar = c(spars, spars),
set = rep(c("train", "test"), each = length(spars)),
r2 = c(train.r2, test.r2))
```

- Plot the $R^2$ values in the test and training set as a function of
`spar`

. What is the best`spar`

setting?

```
### YOUR SOLUTION HERE
```

- (Advanced) Repeat steps 1-3 several times. Do you get the same answer with different train/test splits?

```
### YOUR SOLUTION HERE
```

- (Advanced) Use 5-fold cross-validation to select the optimal
`spar`

value.

```
### YOUR SOLUTION HERE
```