# CS-109A Introduction to Data Science

## Lab 9: Decision Trees (Part 1 of 2): Classification, Regression, Bagging, Random Forests¶

Harvard University
Fall 2019
Instructors: Pavlos Protopapas, Kevin Rader, and Chris Tanner
Lab Instructors: Chris Tanner and Eleni Kaxiras
Authors: Kevin Rader, Rahul Dave, Chris Tanner

In [1]:
## RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)

Out[1]:

## Learning Goals¶

The goal of this lab is for students to:

• Understand where Decision Trees fit into the larger picture of this class and other models
• Understand what Decision Trees are and why we would care to use them
• How decision trees work
• Feel comfortable running sklearn's implementation of a decision tree
• Understand the concepts of bagging and random forests
In [3]:
# imports
%matplotlib inline
import numpy as np
import scipy as sp
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.model_selection import cross_val_score
from sklearn.utils import resample
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn.apionly as sns


## Background¶

Let's do a high-level recap of what we've learned in this course so far:

Say we have input data $X = (X_1, X_2, ..., X_n)$ and corresponding class labels $Y = (Y_1, Y_2, ..., Y_n)$ where $n$ represents the number of observations/instances (i.e., unique samples). Much of statistical learning concerns trying to model this relationship between our data's $X$ and $Y$. In particular, we assert that the $Y$'s were produced/generated by some underlying function $f(X)$, and that there is inevitably some noise and systematic, implicit bias and error $\epsilon$ that cannot be captured by any $f(X)$. Thus, we have:

$Y = f(X) + \epsilon$

Statistical learning concerns either prediction or inference:

Prediction: concerns trying to learn a function $\hat{f}(X)$ that is as close as possible to the true function $f(X)$. This allows us to estimate $Y$ values for any new input data $X$.

Inference: concerns trying to understand/model the relationship between $X$ and $Y$, effectively learning how the data was generated.

Independent of this, if you have access to gold truth labels $Y$, and you make use of them for your modelling, then you are working on a supervised learning task. If you do not have or make use of $Y$ values, and you are only concerned with the input data $X$, you are working on an unsupervised learning task.

Q1: Using the above terms, what types of problems are linear regression, logistic regression, and PCA?
In [4]:
# %load solutions/q1.txt


Q2: What is a decision tree? Why do we care to make a decision tree?
In [5]:
# discussed in lab.


## Understanding Decision Trees¶

My goal is for none of the topics we learn in this class to seem like nebulus concepts or black-boxes of magic. In this course, it's important to understand the models that you can use to help you with your data, and this includes not only knowing how to invoke these as tools within Python libraries (e.g., sklearn, statsmodels), but to have an understanding of what each model is actually doing 'under the hood' -- how it actually works -- as this provides insights into why you should use one model vs another, and how you could adjust models and invent new ones!

### Entropy (aka Uncertainty)¶

Remember, in the last lab, we mentioned that in data science and machine learning, our models are often just finding patterns in the data. For example, for classification, it is best when our data is separable by their $Y$ class lables (e.g., cancerous or benign). That is, hopefully the $X$ values for one class label (e.g., cancerous) is disjoint and separated from the $X$ values that correspond to another class label (e.g., benign). If so, our model would be able to easily discern if a given, new piece of data corresponds to the cancerous label or benign label, based on its $X$ values. If the data is not easily separable (i.e., the $X$ values corresponding to cancer looks very similar to $X$ values corresponding to benign), then our task is difficult and perhaps impossible. Along these lines, we can measure this element in terms of how messy/confusable/uncertain a collection of data is.

In the 1870s, physicists introduced a term Gibbs Entropy, which was useful in statistical thermodynamics, as it effectively measured uncertainty. By the late 1920s, the foundational work in Information Theory had begun; pioneers John von Neumann and Claude Shannon conducted phenomenal work which paved the way for computation at large -- they heavily influenced the creation of computer science, and their work is still seen in modern day computers. Information theory concerns entropy.) So let's look at an example to concretely address what entropy is (the information theoretic version of it).

Say that we have a fair coin $X$, and each coin fliip is an observation. The coin is equally likely to yield heads or tails. The uncertainty is very high. In fact, it's the highest possible, as it's truly a 50/50 chance of either. Let $H(X)$ represent the entropy of $X$. Per the graphic below, we see that entropy is in fact highest when the probabilities of a 2-class variable are a 50/50 chance.

If we had a cheating coin, whereby it was guaranteed to always be a head (or a tail), then our entropy would be 0, as there is no uncertainty about its outcome. Again, this term, entropy, predates decision trees and has vast applications. Alright, so we can see what entropy is measuring (the uncertainty), but how was it actually calculated?

#### Definition:¶

Entropy factors in all possible values/classes of a random variable (log base 2):

#### Fair-Coin Example¶

In our fair coin example, we only have 2 classes, both of which have a probability of 1/2. So, to calculate the overall entropy of the fair coin, we have Entropy(1+, 1-) =

$H(X)$ = -1 * (P(coin=heads)*log(P(coin=heads)) + P(coin=tails)*log(P(coin=tails)))

$= -1 * (\frac{1}{2}log(\frac{1}{2}) + \frac{1}{2}log(\frac{1}{2}))$

$= -1 * (\frac{1}{2}*-1) + \frac{1}{2}*-1)$

$= -1 * (-\frac{1}{2} + -\frac{1}{2})$

$= -1*-1 = 1$

### Worked Example¶

Let's say that we have a small, 14-observation dataset that concerns if we will play tennis on a given day or not (Play Tennis will be our output $Y$), based on 4 features of the current weather:

Completely independent of the features, we can calculate the overall entropy of playing tennis, Entropy for (9+, 5-) examples =

$H(X) = -1 * (P$(play_tennis=yes)*log(P(play_tennis=yes)) + P(play_tennis=no)*log(P(play_tennis=no)))

$= -\frac{9}{14}log(\frac{9}{14}) - \frac{5}{14}log(\frac{5}{14}) = 0.94$
Okay, **0.94** is pretty horrible, as it's close to 1, which is the worst possible value. This means that a priori, if we use no features, it's hard to predict if we will play tennis or not. There's a lot of uncertainty (aka entropy). To improve this, could we segment our data in such a way that it's more clear if we will play tennis or not (i.e., by more clear, I mean we will have lower uncertainty... lower entropy). Let's start with looking at the Wind feature. There are 2 possible values for the Wind attribute, **weak** or **strong.** If we were to look at the subset of data that has weak wind, we see that there are 8 data samples (6 are 'Yes' for Play Tennis, 2 have 'No' for Play Tennis). Hmm, so if we know that the Wind is weak, it helps inform us that there's a 6/8 (75%) chance that we will Play Tennis. Let's put this in terms of entropy: When we look at ONLY the Wind is Weak subset of data, we have a Play Tennis entropy for (6+, 2-) examples, which calculates to:

$H(X) = -1 * (P($play_tennis=yes$)*log(P$(play_tennis=yes$)) + P($play_tennis$=$no$)*log(P($play_tennis$=no)))$

$= -\frac{6}{8}log(\frac{6}{8}) - \frac{2}{8}log(\frac{2}{8}) = 0.811$
A value of 0.811 may seem sadly high, still, but our calculation was correct. If you reference the figure above that shows the entropy of a fair coin, we see that having a probability of 75% does in fact yield an entropy of 0.811. We're only looking at a subset of our data though (the subset for Wind is Weak). We now need to look at the rest of our data (the subset for Wind is Strong). When the Wind is Strong, we have 6 data points: 3 have Play Tennis is Yes, and 3 are No). In short-hand notation, we have (3+, 3-), which is a 0.5 probability, and we know already that this yields an Entropy of 1. When looking at this possible division of separating our data according to the value of Wind, the hope was that we'd have very low entropy in each subset of data. Imagine if the Wind attribute perfectly aligned with Playing Tennis or not (the values were identical). In that case, we would have an Entropy of 0 (no uncertainty), and thus, it would be INCREDIBLY useful to predict playing tennis or not based on the Wind attribute (it would tell us the exact answer). We saw that the Wind attribute didn't yield an entropy of 0; its two classes (weak and strong) had an entropy of 0.811 and 1, respectively. Is Wind a useful feature for us then? In order quantitatively measure its usefulness, we can use the entropy to calculate Information Gain, which we saw in Lecture 15 on Slide 40:

$Gain(S) = H(S) - \sum_{i}\frac{|S_{i}|}{|S|}*H(S_{i})$
Let $S$ represent our current data, and each $S_{i}$ is a subset of the data split according to each of the possible values. So, when considering splitting on Wind, our Information Gain is:

$Gain($alldata)$= H($alldata$) - \frac{|S_{windweak}|}{|S|}H(S_{windweak}) - \frac{|S_{windstrong}|}{|S|}H(S_{windstrong})$

$= 0.94 - \frac{8}{14}0.811 - \frac{6}{14}1.00 = 0.048$
Okay, using Wind as a feature to split our data yields an Information Gain of 0.048. That looks like a low value. We want a high value because gain is good (we want to separate our data in a way that the increases our information). Is 0.048 bad? It all depends on the dataset.

Q3: Using our entire 14-observation dataset, calculate the Information Gain for the other 3 remaining features (Outlook, Temperature, Humidity). What are their values and which ones gives us the most information gain?

In [13]:
# %load solutions/q3.txt

Q4: Now that we know which feature provides the most information gain, how should we use it to construct a decision tree? Let's start the construction of our tree and repeat the process of Q3 one more time.
In [14]:
# %load solutions/q4.txt

Q5: When should we stop this process?
In [15]:
# %load solutions/q5.txt

Q6: Should we standardize or normalize our features? Both? Neither?
In [16]:
# %load solutions/q6.txt

Q7: What if we have outliers? How sensitive is our Decision Tree to outliers? Why?
In [17]:
# %load solutions/q7.txt


#### Connection to Lecture¶

In Lecture 16, Pavlos started by presenting the tricky graph below which depicts a dataset with just 2 features: longitude and latitude.

By drawing a straight line to separate our data, we would be doing the same exact process that we are doing here with our Play Tennis dataset. In our Play Tennis example, we are trying to segment our data into bins according to the possible categories that a feature can be. In the lecture example (pictured above), we have continuous data, not discrete categories, so we have an infinite number of thresholds by which to segment our data.

Q8: How is it possible to segment continuous-valued data, since there are infinite number of possible splits? Do we try 1,000,000 possible values to split by? 100?

In [18]:
# %load solutions/q8.txt


### Summary:¶

To build a decision tree:

• Start with an empty tree and some data $X$
• Decide what your splitting criterion will be (e.g., Gini, Entropy, etc)
• Decide what your what your stopping criterion will be, or if you'll develop a large tree and prune (pruning is covered in Lecture 15, slides 41 - 54)
• Build the tree in a greedy manner, and if you have multiple hyperparameters, use cross-validation to determine the best values

## Sklearn's Implementation¶

Our beloved sklearn library has implementations of DecisionTrees, so let's practice using it.

First, let's load our Play Tennis data (../data/play_tennis.csv"):

In [19]:
tennis_df = pd.read_csv("../data/play_tennis.csv")
tennis_df = pd.get_dummies(tennis_df, columns=['outlook', 'temp', 'humidity', 'windy'])
tennis_df

Out[19]:
play outlook_overcast outlook_rainy outlook_sunny temp_cool temp_hot temp_mild humidity_high humidity_normal windy_False windy_True
0 no 0 0 1 0 1 0 1 0 1 0
1 no 0 0 1 0 1 0 1 0 0 1
2 yes 1 0 0 0 1 0 1 0 1 0
3 yes 0 1 0 0 0 1 1 0 1 0
4 yes 0 1 0 1 0 0 0 1 1 0
5 no 0 1 0 1 0 0 0 1 0 1
6 yes 1 0 0 1 0 0 0 1 0 1
7 no 0 0 1 0 0 1 1 0 1 0
8 yes 0 0 1 1 0 0 0 1 1 0
9 yes 0 1 0 0 0 1 0 1 1 0
10 yes 0 0 1 0 0 1 0 1 0 1
11 yes 1 0 0 0 0 1 1 0 0 1
12 yes 1 0 0 0 1 0 0 1 1 0
13 no 0 1 0 0 0 1 1 0 0 1

Normally, in real situations, we'd perform EDA. However, for this tiny dataset, we see there are no missing values, and we do not care if there is collinearity or outliers, as Decision Trees are robust to such.

In [20]:
# separate our data into X and Y portions
x_train = tennis_df.iloc[:, tennis_df.columns != 'play'].values
y_train = tennis_df['play'].values


We can build a DecisionTree classifier as follows:

In [21]:
dt = DecisionTreeClassifier().fit(x_train, y_train)

In [22]:
tree_vis = tree.plot_tree(dt, filled=True)


Q9: Is this tree identical to what we constructed above? If not, what differs in sklearn's implementation?

In [23]:
# %load solutions/q8.txt


In the above example, we did not use the tree to do any classification. Our data was too small to consider such.

Let's turn to a different dataset:

## 2016 Election Data¶

We will be attempting to predict the presidential election results (at the county level) from 2016, measured as 'votergap' = (trump - clinton) in percentage points, based mostly on demographic features of those counties. Let's quick take a peak at the data:

In [27]:
elect_df = pd.read_csv("../data/county_level_election.csv")

Out[27]:
state fipscode county population hispanic minority female unemployed income nodegree bachelor inactivity obesity density cancer votergap trump clinton
0 Colorado 8117 Summit County 27239 15.173 4.918 45.996 2.5 68352 5.4 48.1 8.1 13.1 46.0 46.2 -27.632 31.530 59.162
1 Colorado 8037 Eagle County 53653 30.040 5.169 47.231 3.1 76661 10.1 47.3 9.4 11.8 31.0 47.1 -19.897 36.058 55.955
2 Idaho 16067 Minidoka County 19226 34.070 5.611 49.318 3.7 46332 24.1 11.8 18.3 34.2 80.0 61.8 54.148 71.135 16.987
3 Colorado 8113 San Miguel County 7558 10.154 4.747 46.808 3.7 59603 4.7 54.4 12.4 16.7 5.7 62.6 -44.769 23.892 68.662
4 Utah 49051 Wasatch County 21600 13.244 4.125 48.812 3.4 65207 9.5 34.4 13.9 23.0 257.8 68.3 25.357 50.471 25.114
In [28]:
# split 80/20 train-test
X = elect_df[['population','hispanic','minority','female','unemployed','income','nodegree','bachelor','inactivity','obesity','density','cancer']]
response = elect_df['votergap']
Xtrain, Xtest, ytrain, ytest = train_test_split(X,response,test_size=0.2)

In [29]:
plt.hist(ytrain)
Xtrain.hist(column=['minority', 'population','hispanic','female']);

In [30]:
print(elect_df.shape)
print(Xtrain.shape)
print(Xtest.shape)

(3066, 18)
(2452, 12)
(614, 12)


## Regression Trees¶

We will start by using a simple Decision Tree Regressor to predict votergap. We'll run a few of these models without any cross-validation or 'regularization', just to illustrate what is going on.

This is what you ought to keep in mind about decision trees.

from the docs:

max_depth : int or None, optional (default=None)
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
min_samples_split : int, float, optional (default=2)
• The deeper the tree, the more prone you are to overfitting.
• The smaller min_samples_split, the more the overfitting. One may use min_samples_leaf instead. More samples per leaf, the higher the bias.
In [33]:
from sklearn.tree import DecisionTreeRegressor
x = Xtrain['minority'].values
o = np.argsort(x)
x = x[o]
y = ytrain.values[o]
plt.plot(x,y, '.');

In [34]:
plt.plot(np.log(x),y, '.'); # log scale


Q10: Which of the two versions of 'minority' would be a better choice to use as a predictor for prediction?

In [36]:
# %load solutions/q10.txt

In [37]:
plt.plot(np.log(x),y,'.')
xx = np.log(x).reshape(-1,1)
for i in [1,2]:
dtree = DecisionTreeRegressor(max_depth=i)
dtree.fit(xx, y)
plt.plot(np.log(x), dtree.predict(xx), label=str(i), alpha=1-i/10, lw=4)
plt.legend();

In [38]:
plt.plot(np.log(x),y,'.')
xx = np.log(x).reshape(-1,1)
for i in [500,200,100,20]:
dtree = DecisionTreeRegressor(min_samples_split=i)
dtree.fit(xx, y)
plt.plot(np.log(x), dtree.predict(xx), label=str(i), alpha=0.8, lw=4)
plt.legend();

In [39]:
plt.plot(np.log(x),y,'.')
xx = np.log(x).reshape(-1,1)
for i in [500,200,100,20]:
dtree = DecisionTreeRegressor(max_depth=6, min_samples_split=i)
dtree.fit(xx, y)
plt.plot(np.log(x), dtree.predict(xx), label=str(i), alpha=0.8, lw=4)
plt.legend();

In [40]:
#let's also include logminority as a predictor going forward
xtemp = np.log(Xtrain['minority'].values)
Xtrain = Xtrain.assign(logminority = xtemp)
Xtest = Xtest.assign(logminority = np.log(Xtest['minority'].values))

Out[40]:
population hispanic minority female unemployed income nodegree bachelor inactivity obesity density cancer logminority
1066 129518 34.669 12.183 49.101 7.4 45025 13.7 23.1 20.9 26.0 21.3 209.3 2.500042
1267 6319 10.229 13.193 49.522 5.5 32645 17.4 11.4 32.7 36.0 32.9 218.0 2.579686
2318 24070 1.905 4.029 50.341 6.9 44851 24.6 12.4 26.3 28.3 77.3 265.7 1.393518
1247 23530 3.467 2.352 50.192 5.9 43740 17.5 11.2 30.6 36.1 52.8 217.1 0.855266
1018 3460 2.587 1.892 50.125 2.5 48535 11.7 14.4 22.8 29.8 3.5 206.9 0.637634

Ok with this discussion in mind, lets improve this model by Bagging.

## Bootstrap-Aggregating (called Bagging)¶

Q11: Class poll: When did the movie Titanic come out?

In [41]:
# %load solutions/q11.txt


The basic idea:

• A Single Decision tree is likely to overfit.
• So lets introduce replication through Bootstrap sampling.
• Bagging uses bootstrap resampling to create different training datasets. This way each training will give us a different tree.
• Added bonus: the left off points can be used to as a natural "validation" set, so no need to
• Since we have many trees that we will average over for prediction, we can choose a large max_depth and we are ok as we will rely on the law of large numbers to shrink this large variance, low bias approach for each individual tree.
In [42]:
from sklearn.utils import resample

ntrees = 500
estimators = []
R2s = []
yhats_test = np.zeros((Xtest.shape[0], ntrees))

plt.plot(np.log(x),y,'.')
for i in range(ntrees):
simpletree = DecisionTreeRegressor(max_depth=3)
boot_xx, boot_y = resample(Xtrain[['logminority']], ytrain)
estimators = np.append(estimators,simpletree.fit(boot_xx, boot_y))
R2s = np.append(R2s,simpletree.score(Xtest[['logminority']], ytest))
yhats_test[:,i] = simpletree.predict(Xtest[['logminority']])
plt.plot(np.log(x), simpletree.predict(np.log(x).reshape(-1,1)), 'red', alpha=0.05)

In [37]:
yhats_test.shape

Out[37]:
(614, 500)

**Exercise 2**

1. Edit the code below (which is just copied from above) to refit many bagged trees on the entire xtrain feature set (without the plot...lots of predictors now so difficult to plot).
2. Summarize how each of the separate trees performed (both numerically and visually) using $R^2$ as the metric. How do they perform on average?
3. Combine the trees into one prediction and evaluate it using $R^2$.
4. Briefly discuss the results. How will the results above change if 'max_depth=4' is increased? What if it is decreased?
In [38]:
from sklearn.metrics import r2_score

ntrees = 500
estimators = []
R2s = []
yhats_test = np.zeros((Xtest.shape[0], ntrees))

for i in range(ntrees):
dtree = DecisionTreeRegressor(max_depth=3)
boot_xx, boot_y = resample(Xtrain[['logminority']], ytrain)
estimators = np.append(estimators,dtree.fit(boot_xx, boot_y))
R2s = np.append(R2s,dtree.score(Xtest[['logminority']], ytest))
yhats_test[:,i] = dtree.predict(Xtest[['logminority']])



## Random Forests¶

What's the basic idea?

Bagging alone is not enough randomization, because even after bootstrapping, we are mainly training on the same data points using the same variablesn, and will retain much of the overfitting.

So we will build each tree by splitting on "random" subset of predictors at each split (hence, each is a 'random tree'). This can't be done in with just one predcitor, but with more predictors we can choose what predictors to split on randomly and how many to do this on. Then we combine many 'random trees' together by averaging their predictions, and this gets us a forest of random trees: a random forest.

Below we create a hyper-param Grid. We are preparing to use the bootstrap points not used in training for validation.

max_features : int, float, string or None, optional (default=”auto”)
- The number of features to consider when looking for the best split.
• max_features: Default splits on all the features and is probably prone to overfitting. You'll want to validate on this.
• You can "validate" on the trees n_estimators as well but many a times you will just look for the plateau in the trees as seen below.
• From decision trees you get the max_depth, min_samples_split, and min_samples_leaf as well but you might as well leave those at defaults to get a maximally expanded tree.
In [39]:
from sklearn.ensemble import RandomForestRegressor

In [40]:
# code from
# Adventures in scikit-learn's Random Forest by Gregory Saunders
from itertools import product
from collections import OrderedDict
param_dict = OrderedDict(
n_estimators = [400, 600, 800],
max_features = [0.2, 0.4, 0.6, 0.8]
)

param_dict.values()

Out[40]:
odict_values([[400, 600, 800], [0.2, 0.4, 0.6, 0.8]])

### Using the OOB score.¶

We have been putting "validate" in quotes. This is because the bootstrap gives us left-over points! So we'll now engage in our very own version of a grid-search, done over the out-of-bag scores that sklearn gives us for free

In [41]:
from itertools import product

In [42]:
#make sure ytrain is the correct data type...in case you have warnings
#print(yytrain.shape,ytrain.shape,Xtrain.shape)
#ytrain = np.ravel(ytrain)

#Let's Cross-val. on the two 'hyperparameters' we based our grid on earlier
results = {}
estimators= {}
for ntrees, maxf in product(*param_dict.values()):
params = (ntrees, maxf)
est = RandomForestRegressor(oob_score=True,
n_estimators=ntrees, max_features=maxf, max_depth=50, n_jobs=-1)
est.fit(Xtrain, ytrain)
results[params] = est.oob_score_
estimators[params] = est
outparams = max(results, key = results.get)
outparams

Out[42]:
(600, 0.6)
In [43]:
rf1 = estimators[outparams]

In [44]:
results

Out[44]:
{(400, 0.2): 0.7539639690584528,
(400, 0.4): 0.7677805365253206,
(400, 0.6): 0.7699751622917039,
(400, 0.8): 0.7675383680731207,
(600, 0.2): 0.7547137231855866,
(600, 0.4): 0.7683114062482946,
(600, 0.6): 0.7707024796128402,
(600, 0.8): 0.7695227029080896,
(800, 0.2): 0.7543389426401441,
(800, 0.4): 0.7675397515655356,
(800, 0.6): 0.7704787398435397,
(800, 0.8): 0.7691255683564636}
In [45]:
rf1.score(Xtest, ytest)

Out[45]:
0.8073817254966813

Finally you can find the feature importance of each predictor in this random forest model. Whenever a feature is used in a tree in the forest, the algorithm will log the decrease in the splitting criterion (such as gini). This is accumulated over all trees and reported in est.feature_importances_

In [46]:
pd.Series(rf1.feature_importances_,index=list(Xtrain)).sort_values().plot(kind="barh")

Out[46]:

Since our response isn't very symmetric, we may want to suppress outliers by using the mean_absolute_error instead.

In [47]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(ytest, rf1.predict(Xtest))

Out[47]:
10.679048688925084