CS-109A Introduction to Data Science

Lab 10: Decision Trees (Part 2 of 2): Boosting!¶

Harvard University
Fall 2019
Instructors: Pavlos Protopapas, Kevin Rader, and Chris Tanner
Lab Instructors: Chris Tanner and Eleni Kaxiras
Authors: Kevin Rader, Rahul Dave, Chris Tanner

In [54]:

## RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)

Out[54]:

Learning Goals¶

The goal of this lab is for students to:

Understand the core concepts of Bagging and RandomForest (and the difference between the two)
Understand why Boosting can be important
Understand how Boosting works (conceptually)
Feel comfortable writing code in `sklearn` to run Adaboost

In [55]:

# imports
%matplotlib inline
import numpy as np
import scipy as sp
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.model_selection import cross_val_score
from sklearn.utils import resample
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostRegressor
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
# import seaborn.apionly as sns

Background¶

Before we move on to boosting, let's first recap what we learned in last lab and lecture about decision trees, along with a refresher on bias vs variance.

Q1: For a given dataset, say we trained a DecisionTree of depth 3 (a root node with children, then terminal leaves). What can you speculate about its variance and bias? What if our tree had a depth of 12? Let's discuss.

In [66]:

# SOLUTION: discussed in lab
# %load solutions/q1.txt

We know that Decision Trees are useful models because they offer:

the ability to model non-linear relationships of data
interpretable results (if we don't aggregate trees)
insensitivity to outliers and unscaled data

Q2: What is Bagging and why is it useful?

In [21]:

# SOLUTION: discussed in lab

Q3: What is RandomForest and why is it useful?

In [22]:

# SOLUTION: discussed in lab

Q4: What are potential weaknesses with Bagging and RandomForest?

In [23]:

# SOLUTION: discussed in lab

Understanding Boosting¶

My goal is for none of the topics we learn in this class to seem like nebulus concepts or black-boxes of magic. In this course, it's important to understand the models that you can use to help you with your data, and this includes not only knowing how to invoke these as tools within Python libraries (e.g., sklearn, statsmodels), but to have an understanding of what each model is actually doing 'under the hood' -- how it actually works -- as this provides insights into why you should use one model vs another, and how you could adjust models and invent new ones!

Gradient Boosting¶

In Lecture 17, Pavlos introduced Boosting by first talking about Gradient Boosting.

Q5: What is the motivation for Boosting in general?

In [24]:

# SOLUTION: discussed in lab. in short, we should learn from our mistakes and weight the predictions of our past models (based on their accuracy)

Q6: What is Gradient Boosting?

In [25]:

# SOLUTION: discussed in lab

Q7: What is AdaBoost?

In [26]:

# SOLUTION: discussed in lab

Let's walk through how AdaBoost works, via slides (AdaBoost.pptx and AdaBoost.pdf).

Summary:¶

In summary, none of the models we've previously discussed have the ability to learn from their final predictions, which is what boosting allows us to do. Sure, we train a model iteratively over data (possibly multiple epochs), but its final predictions/errors are never used to train new models. Boosting allows us to do this, and it tends to use a bunch of 'weak' models. The reason for using weak models is that if we used very expressive models, our combined models would likely overfit pretty quickly. In short, we should learn from our models' mistakes. Gradient Boosting is typically used for regression tasks, as it models the residuals. AdaBoost is typically used for classification tasks. AdaBoost explicitly weights the dataset for each model based on the previous mdoels' misclassifications. That is, our dataset is constantly being changed after each model, based on the misclassification from our accumulative model. Moreover, the model's accuracy is used for determining how much weight (aka, how much stock or faith) we should place on that given model.

My slides, AdaBoost.pptx/pdf, go into more detail.

If you are looking for further details, I highly recommend CMU's slides from their Machine Learning Algorithms course.

Sklearn's Implementation¶

Our beloved sklearn library has an implementation of AdaBoost, so let's practice using it.

Let's load in the same dataset that we used in the last lab -- county_level_election.csv

In [27]:

elect_df = pd.read_csv("../data/county_level_election.csv")
elect_df.head()

Out[27]:

	state	fipscode	county	population	hispanic	minority	female	unemployed	income	nodegree	bachelor	inactivity	obesity	density	cancer	votergap	trump	clinton
0	Colorado	8117	Summit County	27239	15.173	4.918	45.996	2.5	68352	5.4	48.1	8.1	13.1	46.0	46.2	-27.632	31.530	59.162
1	Colorado	8037	Eagle County	53653	30.040	5.169	47.231	3.1	76661	10.1	47.3	9.4	11.8	31.0	47.1	-19.897	36.058	55.955
2	Idaho	16067	Minidoka County	19226	34.070	5.611	49.318	3.7	46332	24.1	11.8	18.3	34.2	80.0	61.8	54.148	71.135	16.987
3	Colorado	8113	San Miguel County	7558	10.154	4.747	46.808	3.7	59603	4.7	54.4	12.4	16.7	5.7	62.6	-44.769	23.892	68.662
4	Utah	49051	Wasatch County	21600	13.244	4.125	48.812	3.4	65207	9.5	34.4	13.9	23.0	257.8	68.3	25.357	50.471	25.114

In [29]:

# NOTICE that a negative votergap means more people voted for clinton than trump; a positive votergap means more people voted for trump

# split 80/20 train-test
X = elect_df[['population','hispanic','minority','female','unemployed','income','nodegree','bachelor','inactivity','obesity','density','cancer']]
response = elect_df['votergap']
Xtrain, Xtest, ytrain, ytest = train_test_split(X,response,test_size=0.2)

In [30]:

# let's plot a few of our features
plt.hist(ytrain)
Xtrain.hist(column=['minority', 'population','hispanic','female']);

We can build a DecisionTree classifier as follows:

In [58]:

# let's plot our `minority` values
from sklearn.tree import DecisionTreeRegressor
x = Xtrain['minority'].values
o = np.argsort(x)
x = x[o]
y = ytrain.values
y = y[o]
plt.plot(x,y, '.');

In [59]:

plt.plot(np.log(x),y, '.'); # log scale

In [60]:

# as we saw in lab 9, this is what happens when we fit 2 different standard Decision Trees to our data
plt.plot(np.log(x),y,'.')
xx = np.log(x).reshape(-1,1)
for i in [1,2]:
    dtree = DecisionTreeRegressor(max_depth=i)
    dtree.fit(xx, y)
    plt.plot(np.log(x), dtree.predict(xx), label=str(i), alpha=1-i/10, lw=4)
plt.legend();

In [64]:

estab = AdaBoostRegressor(base_estimator=DecisionTreeRegressor(max_depth=1), n_estimators=200, learning_rate=1.0)
estab.fit(xx, y)
staged_predict_generator = estab.staged_predict(xx)

In [65]:

import time
from IPython import display
plt.plot(xx, y, '.');
counter = 0
for stagepred in staged_predict_generator:
    counter = counter + 1
    if counter in [1, 2, 4, 8, 10, 50, 100, 200]:
        plt.plot(xx, stagepred, alpha=0.7, label=str(counter), lw=4)
        plt.legend();
        display.display(plt.gcf())
        display.clear_output(wait=True)
        time.sleep(1)

Ok, so this demonstration helps us understand some things about boosting.

n_estimators is the number of trees, and thus the stage in the fitting. It also controls the complexity for us. The more trees we have the more we fit to the tiny details.
staged_predict gives us the prediction at each step
once again max_depth from the underlying decision tree tells us the depth of the tree. But here it tells us the amount of features interactions we have, not just the scale of our fit. But clearly it increases the variance again.

Ideas from decision trees remain. For example, increase min_samples_leaf or decrease max_depth to reduce variance and increase the bias.

Group Exercise 1:

1: What do you expect to happen if you increase max_depth to 5? Edit the code above to explore the result.

2: What do you expect to happen if you put max_depth back to 1 and decrease the learning_rate to 0.1? Edit the code above to explore the result.

3: Do a little work to find some sort of 'best' values of max_depth and learning_rate. Does this result make sense?

Your answer here¶

AdaBoost Classification¶

Let's start with the same voter data from above:

In [52]:

X = elect_df[['population','hispanic','minority','female','unemployed','income','nodegree','bachelor','inactivity','obesity','density','cancer']]
response = elect_df['votergap']
Xtrain, Xtest, ytrain, ytest = train_test_split(X,response,test_size=0.2)

Group Exercise 2:

Convert the y-values to being binary, whereby any negative values become a 0 (more people voted for Clinton) and any positive values become a 1 (more people voted for Trump).

Group Exercise 3:

Create an AdaBoostClassifier to predict our binary classification task. Below we provide a code snippet. Are there any issues along the way? Are our features okay to use as is? Be sure to output your prediction accuracy.

In [ ]:

from sklearn.ensemble import AdaBoostClassifier
ada_model = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2), n_estimators=10, learning_rate=.01).fit(Xtrain, ytrain)

Group Exercise 4:

Experiment with the different parameters, which are listed in the documentation. Be sure to pay attention to the algorithm parameter. What is the difference between SAMME and SAMME.R?

What's the best performance that you can achieve? And how does this compare with a more simple baseline (use either Bagging, RandomForest, or a single Decision Tree).