Key Word(s): decision trees, trees, regression trees, bagging, entropy, information gain

# CS-109A Introduction to Data Science

## Lab 9: Decision Trees (Part 1 of 2): Classification, Regression, Bagging, Random Forests¶

**Harvard University**

**Fall 2019**

**Instructors:** Pavlos Protopapas, Kevin Rader, and Chris Tanner

**Lab Instructors:** Chris Tanner and Eleni Kaxiras

**Authors:** Kevin Rader, Rahul Dave, Chris Tanner

```
## RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
```

## Learning Goals¶

The goal of this lab is for students to:

- Understand where Decision Trees fit into the larger picture of this class and other models
- Understand what Decision Trees are and why we would care to use them
- How decision trees work
- Feel comfortable running sklearn's implementation of a decision tree
- Understand the concepts of bagging and random forests

```
# imports
%matplotlib inline
import numpy as np
import scipy as sp
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.model_selection import cross_val_score
from sklearn.utils import resample
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn.apionly as sns
```

## Background¶

Let's do a high-level recap of what we've learned in this course so far:

Say we have input data $X = (X_1, X_2, ..., X_n)$ and corresponding class labels $Y = (Y_1, Y_2, ..., Y_n)$ where $n$ represents the number of observations/instances (i.e., unique samples). Much of statistical learning concerns trying to model this relationship between our data's $X$ and $Y$. In particular, we assert that the $Y$'s were produced/generated by some underlying function $f(X)$, and that there is inevitably some noise and systematic, implicit bias and error $\epsilon$ that cannot be captured by any $f(X)$. Thus, we have:

$Y = f(X) + \epsilon$

Statistical learning concerns either **prediction** or **inference**:

**Prediction:** concerns trying to learn a function $\hat{f}(X)$ that is as close as possible to the true function $f(X)$. This allows us to estimate $Y$ values for any new input data $X$.

**Inference:** concerns trying to understand/model the *relationship* between $X$ and $Y$, effectively learning how the data was generated.

Independent of this, if you have access to gold truth labels $Y$, and you make use of them for your modelling, then you are working on a **supervised** learning task. If you do not have or make use of $Y$ values, and you are only concerned with the input data $X$, you are working on an **unsupervised** learning task.

**Q1:**Using the above terms, what types of problems are linear regression, logistic regression, and PCA?

```
# %load solutions/q1.txt
```

**Q2:**What is a decision tree? Why do we care to make a decision tree?

```
# discussed in lab.
```

## Understanding Decision Trees¶

My goal is for none of the topics we learn in this class to seem like nebulus concepts or black-boxes of magic. In this course, it's important to understand the models that you can use to help you with your data, and this includes not only knowing how to invoke these as tools within Python libraries (e.g., `sklearn`

, `statsmodels`

), but to have an understanding of what each model is actually doing 'under the hood' -- how it actually works -- as this provides insights into why you should use one model vs another, and how you could adjust models and invent new ones!

### Entropy (aka Uncertainty)¶

Remember, in the last lab, we mentioned that in data science and machine learning, our models are often just finding patterns in the data. For example, for classification, it is best when our data is separable by their $Y$ class lables (e.g., cancerous or benign). That is, hopefully the $X$ values for one class label (e.g., cancerous) is disjoint and separated from the $X$ values that correspond to another class label (e.g., benign). If so, our model would be able to easily discern if a given, new piece of data corresponds to the cancerous label or benign label, based on its $X$ values. If the data is not easily separable (i.e., the $X$ values corresponding to cancer looks very similar to $X$ values corresponding to benign), then our task is difficult and perhaps impossible. Along these lines, we can measure this element in terms of how messy/confusable/*uncertain* a collection of data is.

In the 1870s, physicists introduced a term `Gibbs Entropy`

, which was useful in statistical thermodynamics, as it effectively measured uncertainty. By the late 1920s, the foundational work in Information Theory had begun; pioneers John von Neumann and Claude Shannon conducted phenomenal work which paved the way for computation at large -- they heavily influenced the creation of computer science, and their work is still seen in modern day computers. Information theory concerns entropy.) So let's look at an example to concretely address what entropy is (the information theoretic version of it).

Say that we have a fair coin $X$, and each coin fliip is an observation. The coin is equally likely to yield heads or tails. The uncertainty is very high. In fact, it's the highest possible, as it's truly a 50/50 chance of either. Let $H(X)$ represent the entropy of $X$. Per the graphic below, we see that entropy is in fact highest when the probabilities of a 2-class variable are a 50/50 chance.

If we had a cheating coin, whereby it was guaranteed to always be a head (or a tail), then our entropy would be 0, as there is no **uncertainty** about its outcome. Again, this term, entropy, predates decision trees and has vast applications. Alright, so we can see what entropy is measuring (the uncertainty), but how was it actually calculated?

#### Definition:¶

Entropy factors in *all* possible values/classes of a random variable (log base 2):

#### Fair-Coin Example¶

In our fair coin example, we only have 2 classes, both of which have a probability of 1/2. So, to calculate the overall entropy of the fair coin, we have Entropy(1+, 1-) =

### Worked Example¶

Let's say that we have a small, 14-observation dataset that concerns if we will play tennis on a given day or not (Play Tennis will be our output $Y$), based on 4 features of the current weather:

Completely independent of the features, we can calculate the overall entropy of playing tennis, Entropy for (9+, 5-) examples =

**Q3:**Using our entire 14-observation dataset, calculate the Information Gain for the other 3 remaining features (Outlook, Temperature, Humidity). What are their values and which ones gives us the most information gain?

```
# %load solutions/q3.txt
```

**Q4:**Now that we know which feature provides the most information gain, how should we use it to construct a decision tree? Let's start the construction of our tree and repeat the process of Q3 one more time.

```
# %load solutions/q4.txt
```

**Q5:**When should we stop this process?

```
# %load solutions/q5.txt
```

**Q6:**Should we standardize or normalize our features? Both? Neither?

```
# %load solutions/q6.txt
```

**Q7:**What if we have outliers? How sensitive is our Decision Tree to outliers? Why?

```
# %load solutions/q7.txt
```

#### Connection to Lecture¶

In Lecture 16, Pavlos started by presenting the tricky graph below which depicts a dataset with just 2 features: longitude and latitude.

By drawing a straight line to separate our data, we would be doing the same exact process that we are doing here with our Play Tennis dataset. In our Play Tennis example, we are trying to segment our data into bins according to the possible *categories* that a feature can be. In the lecture example (pictured above), we have continuous data, not discrete categories, so we have an infinite number of thresholds by which to segment our data.

**Q8:**How is it possible to segment continuous-valued data, since there are infinite number of possible splits? Do we try 1,000,000 possible values to split by? 100?

```
# %load solutions/q8.txt
```

### Summary:¶

To build a decision tree:

- Start with an empty tree and some data $X$
- Decide what your splitting criterion will be (e.g., Gini, Entropy, etc)
- Decide what your what your stopping criterion will be, or if you'll develop a large tree and prune (pruning is covered in Lecture 15, slides 41 - 54)
- Build the tree in a greedy manner, and if you have multiple hyperparameters, use cross-validation to determine the best values

## Sklearn's Implementation¶

Our beloved `sklearn`

library has implementations of DecisionTrees, so let's practice using it.

First, let's load our Play Tennis data `(../data/play_tennis.csv")`

:

```
tennis_df = pd.read_csv("../data/play_tennis.csv")
tennis_df = pd.get_dummies(tennis_df, columns=['outlook', 'temp', 'humidity', 'windy'])
tennis_df
```

Normally, in real situations, we'd perform EDA. However, for this tiny dataset, we see there are no missing values, and we do not care if there is collinearity or outliers, as Decision Trees are robust to such.

```
# separate our data into X and Y portions
x_train = tennis_df.iloc[:, tennis_df.columns != 'play'].values
y_train = tennis_df['play'].values
```

We can build a DecisionTree classifier as follows:

```
dt = DecisionTreeClassifier().fit(x_train, y_train)
```

```
tree_vis = tree.plot_tree(dt, filled=True)
```

**Q9:**Is this tree identical to what we constructed above? If not, what differs in sklearn's implementation?

```
# %load solutions/q8.txt
```

In the above example, we did not use the tree to do any classification. Our data was too small to consider such.

Let's turn to a different dataset:

## 2016 Election Data¶

We will be attempting to predict the presidential election results (at the county level) from 2016, measured as 'votergap' = (trump - clinton) in percentage points, based mostly on demographic features of those counties. Let's quick take a peak at the data:

```
elect_df = pd.read_csv("../data/county_level_election.csv")
elect_df.head()
```

```
# split 80/20 train-test
X = elect_df[['population','hispanic','minority','female','unemployed','income','nodegree','bachelor','inactivity','obesity','density','cancer']]
response = elect_df['votergap']
Xtrain, Xtest, ytrain, ytest = train_test_split(X,response,test_size=0.2)
```

```
plt.hist(ytrain)
Xtrain.hist(column=['minority', 'population','hispanic','female']);
```

```
print(elect_df.shape)
print(Xtrain.shape)
print(Xtest.shape)
```

## Regression Trees¶

We will start by using a simple Decision Tree Regressor to predict votergap. We'll run a few of these models without any cross-validation or 'regularization', just to illustrate what is going on.

This is what you ought to keep in mind about decision trees.

from the docs:

```
max_depth : int or None, optional (default=None)
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
min_samples_split : int, float, optional (default=2)
```

- The deeper the tree, the more prone you are to overfitting.
- The smaller
`min_samples_split`

, the more the overfitting. One may use`min_samples_leaf`

instead. More samples per leaf, the higher the bias.

```
from sklearn.tree import DecisionTreeRegressor
x = Xtrain['minority'].values
o = np.argsort(x)
x = x[o]
y = ytrain.values[o]
plt.plot(x,y, '.');
```

```
plt.plot(np.log(x),y, '.'); # log scale
```

**Q10:**Which of the two versions of 'minority' would be a better choice to use as a predictor for prediction?

```
# %load solutions/q10.txt
```

```
plt.plot(np.log(x),y,'.')
xx = np.log(x).reshape(-1,1)
for i in [1,2]:
dtree = DecisionTreeRegressor(max_depth=i)
dtree.fit(xx, y)
plt.plot(np.log(x), dtree.predict(xx), label=str(i), alpha=1-i/10, lw=4)
plt.legend();
```

```
plt.plot(np.log(x),y,'.')
xx = np.log(x).reshape(-1,1)
for i in [500,200,100,20]:
dtree = DecisionTreeRegressor(min_samples_split=i)
dtree.fit(xx, y)
plt.plot(np.log(x), dtree.predict(xx), label=str(i), alpha=0.8, lw=4)
plt.legend();
```

```
plt.plot(np.log(x),y,'.')
xx = np.log(x).reshape(-1,1)
for i in [500,200,100,20]:
dtree = DecisionTreeRegressor(max_depth=6, min_samples_split=i)
dtree.fit(xx, y)
plt.plot(np.log(x), dtree.predict(xx), label=str(i), alpha=0.8, lw=4)
plt.legend();
```

```
#let's also include logminority as a predictor going forward
xtemp = np.log(Xtrain['minority'].values)
Xtrain = Xtrain.assign(logminority = xtemp)
Xtest = Xtest.assign(logminority = np.log(Xtest['minority'].values))
Xtrain.head()
```

Ok with this discussion in mind, lets improve this model by Bagging.

## Bootstrap-Aggregating (called Bagging)¶

**Q11:**Class poll: When did the movie Titanic come out?

```
# %load solutions/q11.txt
```

The basic idea:

- A Single Decision tree is likely to overfit.
- So lets introduce replication through Bootstrap sampling.
**Bagging**uses bootstrap resampling to create different training datasets. This way each training will give us a different tree.- Added bonus: the left off points can be used to as a natural "validation" set, so no need to
- Since we have many trees that we will
**average over for prediction**, we can choose a large`max_depth`

and we are ok as we will rely on the law of large numbers to shrink this large variance, low bias approach for each individual tree.

```
from sklearn.utils import resample
ntrees = 500
estimators = []
R2s = []
yhats_test = np.zeros((Xtest.shape[0], ntrees))
plt.plot(np.log(x),y,'.')
for i in range(ntrees):
simpletree = DecisionTreeRegressor(max_depth=3)
boot_xx, boot_y = resample(Xtrain[['logminority']], ytrain)
estimators = np.append(estimators,simpletree.fit(boot_xx, boot_y))
R2s = np.append(R2s,simpletree.score(Xtest[['logminority']], ytest))
yhats_test[:,i] = simpletree.predict(Xtest[['logminority']])
plt.plot(np.log(x), simpletree.predict(np.log(x).reshape(-1,1)), 'red', alpha=0.05)
```

```
yhats_test.shape
```

- Edit the code below (which is just copied from above) to refit many bagged trees on the entire xtrain feature set (without the plot...lots of predictors now so difficult to plot).
- Summarize how each of the separate trees performed (both numerically and visually) using $R^2$ as the metric. How do they perform on average?
- Combine the trees into one prediction and evaluate it using $R^2$.
- Briefly discuss the results. How will the results above change if 'max_depth=4' is increased? What if it is decreased?

```
from sklearn.metrics import r2_score
ntrees = 500
estimators = []
R2s = []
yhats_test = np.zeros((Xtest.shape[0], ntrees))
for i in range(ntrees):
dtree = DecisionTreeRegressor(max_depth=3)
boot_xx, boot_y = resample(Xtrain[['logminority']], ytrain)
estimators = np.append(estimators,dtree.fit(boot_xx, boot_y))
R2s = np.append(R2s,dtree.score(Xtest[['logminority']], ytest))
yhats_test[:,i] = dtree.predict(Xtest[['logminority']])
# your code here
```

#### Your answer here¶

## Random Forests¶

What's the basic idea?

Bagging alone is not enough randomization, because even after bootstrapping, we are mainly training on the same data points using the same variablesn, and will retain much of the overfitting.

So we will build each tree by splitting on "random" subset of predictors at each split (hence, each is a 'random tree'). This can't be done in with just one predcitor, but with more predictors we can choose what predictors to split on randomly and how many to do this on. Then we combine many 'random trees' together by averaging their predictions, and this gets us a forest of random trees: a **random forest**.

Below we create a hyper-param Grid. We are preparing to use the bootstrap points not used in training for validation.

```
max_features : int, float, string or None, optional (default=”auto”)
- The number of features to consider when looking for the best split.
```

`max_features`

: Default splits on all the features and is probably prone to overfitting. You'll want to validate on this.- You can "validate" on the trees
`n_estimators`

as well but many a times you will just look for the plateau in the trees as seen below. - From decision trees you get the
`max_depth`

,`min_samples_split`

, and`min_samples_leaf`

as well but you might as well leave those at defaults to get a maximally expanded tree.

```
from sklearn.ensemble import RandomForestRegressor
```

```
# code from
# Adventures in scikit-learn's Random Forest by Gregory Saunders
from itertools import product
from collections import OrderedDict
param_dict = OrderedDict(
n_estimators = [400, 600, 800],
max_features = [0.2, 0.4, 0.6, 0.8]
)
param_dict.values()
```

### Using the OOB score.¶

We have been putting "validate" in quotes. This is because the bootstrap gives us left-over points! So we'll now engage in our very own version of a grid-search, done over the out-of-bag scores that `sklearn`

gives us for free

```
from itertools import product
```

```
#make sure ytrain is the correct data type...in case you have warnings
#print(yytrain.shape,ytrain.shape,Xtrain.shape)
#ytrain = np.ravel(ytrain)
#Let's Cross-val. on the two 'hyperparameters' we based our grid on earlier
results = {}
estimators= {}
for ntrees, maxf in product(*param_dict.values()):
params = (ntrees, maxf)
est = RandomForestRegressor(oob_score=True,
n_estimators=ntrees, max_features=maxf, max_depth=50, n_jobs=-1)
est.fit(Xtrain, ytrain)
results[params] = est.oob_score_
estimators[params] = est
outparams = max(results, key = results.get)
outparams
```

```
rf1 = estimators[outparams]
```

```
results
```

```
rf1.score(Xtest, ytest)
```

Finally you can find the **feature importance** of each predictor in this random forest model. Whenever a feature is used in a tree in the forest, the algorithm will log the decrease in the splitting criterion (such as gini). This is accumulated over all trees and reported in `est.feature_importances_`

```
pd.Series(rf1.feature_importances_,index=list(Xtrain)).sort_values().plot(kind="barh")
```

Since our response isn't very symmetric, we may want to suppress outliers by using the `mean_absolute_error`

instead.

```
from sklearn.metrics import mean_absolute_error
mean_absolute_error(ytest, rf1.predict(Xtest))
```