CS109A Introduction to Data Science
Standard Section 8: Bagging and Random Forest
Harvard University
Fall 2017
Section Leaders: Mehul Smriti Raje, Ken Arnold, Karan R. Motwani, Cecilia Garraffo
Instructors: Pavlos Protopapas, Kevin Rader
#RUN THIS CELL
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
This section will work with a spam email dataset. Our ultimate goal is to build models that predict whether an email is spam or not spam based on word characteristics within each email. We will cover the Bagging, Random Forest, and AdaBoost methods and allow you to apply them to the homework.
Specifically, we will:
1. Load in the spam dataset and split the data into train and test.
2. Find the optimal depth for the Decision Tree model and evaluate performance.
3. Fit the Bagging model using multiple bootstrapped datasets and ensemble.
4. Fit the Random Forest Model and compare with Bagging.
5. Use the Adaboost method to visualize Bias-Variance tradeoff.
6. Example to better understand Bias vs Variance tradeoff.
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import sklearn.metrics as metrics
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix
%matplotlib inline
from tqdm import tqdm
pd.set_option('display.width', 1500)
pd.set_option('display.max_columns', 100)
from sklearn.model_selection import learning_curve
If you do not have tqdm installed, run ! pip install tqdm to install the tqdm package.
Part 0 : Introduction to the Spam Dataset
We will be working with a spam email dataset. The dataset has 57 predictors with a response variable called Spam that indicates whether an email is spam or not spam. The goal is to be able to create a classifier or method that acts as a spam filter.
#Import Dataframe and Set Column Names
spam_df = pd.read_csv('data/spam.csv', header=None)
columns = ["Column_"+str(i+1) for i in range(spam_df.shape[1]-1)] + ['Spam']
spam_df.columns = columns
display(spam_df.head())
The predictor variables are all continuous. They represent features such as the frequency of the word "discount". The exact specification and description of each predictor can be found online. We are not so much interested in interpreting individual predictors, so we will omit their exact names; we care more about the predictive performance of the algorithm, so we will treat each column simply as a predictor without going into detail on each one.
Link to description : https://archive.ics.uci.edu/ml/datasets/spambase
Let us split the dataset into a 70-30 train-test split using the following:
Note: While you will use train_test_split in your homeworks (a quick sketch of the equivalent call follows for reference), the masking code after it should help you visualize splitting/masking of a dataframe, which is helpful in general.
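A minimal sketch of the train_test_split equivalent; the random_state and the optional stratify argument are illustrative choices, not what we use below:
from sklearn.model_selection import train_test_split

#Sketch: equivalent 70-30 split with sklearn (illustrative only; we use the mask below instead)
data_train_alt, data_test_alt = train_test_split(
    spam_df, test_size=0.3, random_state=42, stratify=spam_df['Spam'])
print("Shape of Training Set :", data_train_alt.shape)
print("Shape of Testing Set :", data_test_alt.shape)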
#Split data into train and test
np.random.seed(42)
msk = np.random.rand(len(spam_df)) < 0.7
data_train = spam_df[msk]
data_test = spam_df[~msk]
#Split predictor and response columns
x_train, y_train = data_train.drop(['Spam'], axis=1), data_train['Spam']
x_test, y_test = data_test.drop(['Spam'], axis=1), data_test['Spam']
print("Shape of Training Set :",data_train.shape)
print("Shape of Testing Set :", data_test.shape)
#Display the first 10 rows of the dataset
spam_df.iloc[np.arange(10)]
We can check that the number of spam cases is roughly evenly represented in both the training and test set.
#Check Percentage of Spam in Train and Test Set
print("Percentage of Spam in Training Set :", str(100*y_train.sum()/len(y_train))+'%')
print("Percentage of Spam in Testing Set :",str(100*y_test.sum()/len(y_test))+'%')
Part 1 : Fitting an Optimal Single Decision Tree (by Depth)
Here we fit a single decision tree to our spam dataset and perform 5-fold cross-validation on the training set. For EACH depth of the tree, we fit a tree and compute the 5-fold CV scores; these scores are then averaged and compared across depths.
#Find optimal depth of trees
depth, tree_start, tree_end = {}, 3, 20
for i in range(tree_start, tree_end):
    model = DecisionTreeClassifier(max_depth=i)
    scores = cross_val_score(estimator=model, X=x_train, y=y_train, cv=5, n_jobs=-1)
    depth[i] = scores.mean()
#Plot
lists = sorted(depth.items())
x, y = zip(*lists)
plt.ylabel("Cross Validation Accuracy")
plt.xlabel("Maximum Depth")
plt.title('Variation of Accuracy with Depth - Simple Decision Tree')
plt.plot(x, y, 'b-', marker='o')
plt.show()
As we can see, the optimal depth is found to be 7. That said, does it make sense to choose 6 instead?
Let us set best_depth as a new parameter. Can you see why we coded the best depth parameter as we did below? (Hint: think about reproducibility.)
Also, if we wanted to get confidence bands around these results, how would we? It is as simple as combining the standard deviation from scores.std() with plt.fill_between(), as in the sketch below.
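A minimal sketch of those confidence bands, re-running the cross-validation above while also keeping each depth's standard deviation (the +/- 2 standard-deviation band is an illustrative choice):
#Sketch: cross-validation accuracy with +/- 2 standard deviation bands
depths = list(range(tree_start, tree_end))
means, stds = [], []
for i in depths:
    scores = cross_val_score(DecisionTreeClassifier(max_depth=i), x_train, y_train, cv=5, n_jobs=-1)
    means.append(scores.mean())
    stds.append(scores.std())
means, stds = np.array(means), np.array(stds)
plt.plot(depths, means, 'b-', marker='o')
plt.fill_between(depths, means - 2*stds, means + 2*stds, alpha=0.3)
plt.ylabel("Cross Validation Accuracy")
plt.xlabel("Maximum Depth")
plt.title('Variation of Accuracy with Depth - with Confidence Bands')
plt.show()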
#Make best depth a variable
best_depth = sorted(depth, key=depth.get, reverse=True)[0]
print("The best depth was found to be:", best_depth)
#Evaluate the performance at the best depth
model = DecisionTreeClassifier(max_depth=best_depth)
model.fit(x_train, y_train)
#Check Accuracy of Spam Detection in Train and Test Set
print("Accuracy, Training Set: {:.2%}".format(accuracy_score(y_train, model.predict(x_train))))
print("Accuracy, Testing Set: {:.2%}".format(accuracy_score(y_test, model.predict(x_test))))
#Get Performance by Class (Lookup Confusion Matrix)
pd.crosstab(y_test, model.predict(x_test), margins=True, rownames=['Actual'], colnames=['Predicted'])
Part 2: Bagging and Voting
Let's bootstrap our training dataset to create multiple bootstrapped datasets and fit a Decision Tree model to each.
(Resampling: we showed live that different samples give different results for statistics like sums, and that these vary more when the quantities we sum over have high variance themselves.)
# Column sums over the full training data
data_train.sum(axis=0).to_frame('sum').T
# Column sums over one bootstrap resample of the training data
data_train.sample(frac=1., replace=True).sum(axis=0).to_frame('sum').T
Now let's actually fit a tree to each bootstrap sample.
#Creating model
np.random.seed(0)
model = DecisionTreeClassifier(max_depth=5) # we tried a variety of depths here
#Initializing variables
n_trees = 100 # we tried a variety of numbers here
predictions_train = np.zeros((data_train.shape[0], n_trees))
predictions_test = np.zeros((data_test.shape[0], n_trees))
#Conduct bootstraping iterations
for i in range(n_trees):
    temp = data_train.sample(frac=1, replace=True)
    response_variable = temp['Spam']
    temp = temp.drop(['Spam'], axis=1)
    model.fit(temp, response_variable)
    predictions_train[:,i] = model.predict(x_train)
    predictions_test[:,i] = model.predict(x_test)
#Make Predictions Dataframe
columns = ["Bootstrap-Model_"+str(i+1) for i in range(n_trees)]
predictions_train = pd.DataFrame(predictions_train, columns=columns)
predictions_test = pd.DataFrame(predictions_test, columns=columns)
y_train = data_train['Spam'].values
y_test = data_test['Spam'].values
num_to_avg = 100 # we varied this line, from 1, 2, 3, 100, n_trees
fig, axs = plt.subplots(1, 2, figsize=(16, 7))
for (ax, label, predictions, y) in [
        (axs[0], 'train', predictions_train, y_train),
        (axs[1], 'test', predictions_test, y_test)
]:
    mean_predictions = predictions.iloc[:,:num_to_avg].mean(axis=1)
    mean_predictions[y == 1].hist(density=True, histtype='step', range=[0,1], label='Spam', lw=2, ax=ax)
    mean_predictions[y == 0].hist(density=True, histtype='step', range=[0,1], label='Not-Spam', lw=2, ax=ax)
    ax.legend(loc='upper center')
    ax.set_xlabel("Mean of ensemble predictions")
    ax.set_title(label)
(Try 100 trees of depth 1. Why do the plots look similar to when we did 2 trees of higher depth?)
And now vote!
#Function to ensemble the prediction of each bagged decision tree model
def get_prediction(df, count=-1):
    count = df.shape[1] if count == -1 else count
    temp = df.iloc[:,0:count]
    return np.mean(temp, axis=1) > 0.5
#Check Accuracy of Spam Detection in Train and Test Set
print("Accuracy, Training Set :", str(100*accuracy_score(y_train, get_prediction(predictions_train, count=-1)))+'%')
print("Accuracy, Testing Set :", str(100*accuracy_score(y_test, get_prediction(predictions_test, count=-1)))+'%')
The count argument in the code above can be used to set how many of the bagged models the vote is based on.
#Get Performance by Class for the ensemble vote (Lookup Confusion Matrix)
pd.crosstab(np.array(y_test), np.array(get_prediction(predictions_test, count=-1), dtype=int), margins=True, rownames=['Actual'], colnames=['Predicted'])
Does the accuracy vary with the number of bagged models we include in the vote?
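As a quick sketch, we can vote with only the first few bagged models by passing different values of count to the get_prediction helper defined above (the counts below are arbitrary):
#Sketch: test accuracy as a function of how many bagged models we vote over
for c in [1, 2, 5, 10, 25, 50, 100]:
    acc = accuracy_score(y_test, get_prediction(predictions_test, count=c))
    print("Models used in vote:", c, "| Test Accuracy: {:.2%}".format(acc))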
Food for Thought : Are these bagging models independent of each other? Can they be trained in parallel?
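Since each bootstrapped tree is fit without reference to the others, they can in principle be trained in parallel. As an aside, here is a minimal sketch using sklearn's BaggingClassifier, which parallelizes the bootstrapped trees via n_jobs (parameter values are illustrative; in newer sklearn versions the keyword is estimator rather than base_estimator):
from sklearn.ensemble import BaggingClassifier

#Sketch: bagged trees are fit independently, so sklearn can train them in parallel (n_jobs=-1)
bag_model = BaggingClassifier(
    base_estimator=DecisionTreeClassifier(max_depth=best_depth),
    n_estimators=100, n_jobs=-1, random_state=0)
bag_model.fit(x_train, y_train)
print("Accuracy, Testing Set: {:.2%}".format(accuracy_score(y_test, bag_model.predict(x_test))))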
Part 3 : Random Forest vs Bagging
Now we will fit an ensemble method, the Random Forest, which differs from a single decision tree. Refer to the lecture slides for a full treatment of how they differ. Let's use n_estimators = predictor_count/2 and max_depth = best_depth.
#Fit a Random Forest Model
#Training
model = RandomForestClassifier(n_estimators=int(x_train.shape[1]/2), max_depth=best_depth)
model.fit(x_train, y_train)
#Predict
y_pred_train = model.predict(x_train)
y_pred_test = model.predict(x_test)
#Performance Evaluation
train_score = accuracy_score(y_train, y_pred_train)*100
test_score = accuracy_score(y_test, y_pred_test)*100
print("Accuracy, Training Set :",str(train_score)+'%')
print("Accuracy, Testing Set :",str(test_score)+'%')
#Top Features
feature_importance = model.feature_importances_
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
#Plot
plt.figure(figsize=(10,12))
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, x_train.columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
Random Forest gives the above values via feature_importances_, which weight each predictor's impact by how often (and how much) it proves useful across the trees, giving an overall measure of significance for free. Explore the attributes of the fitted Random Forest object for more detail.
As we see above, the performance of both Bagging and Random Forest was similar, so what is the difference? Do both overfit the data just as much?
Hints :
What is the only extra parameter we declared when defining a Random Forest Model vs Bagging? Does it have an impact on overfitting?
When we ensembled trees using bagging, we used the class labels to get the majority vote. Can you think of an advantage to using prediction probabilities? Does Random Forest use labels or probabilities?
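To explore both hints, here is a rough sketch (the parameter choices are illustrative): setting max_features=None makes every split consider all predictors, which is essentially bagging, while restricting each split to a random subset (here 'sqrt') is what makes it a Random Forest; and predict_proba exposes the averaged class probabilities rather than a hard vote.
#Sketch: bagging vs random forest differ in how many features each split may consider
bagged_like = RandomForestClassifier(n_estimators=100, max_depth=best_depth,
                                     max_features=None, random_state=0)
forest = RandomForestClassifier(n_estimators=100, max_depth=best_depth,
                                max_features='sqrt', random_state=0)
for name, m in [('Bagging-like (all features per split)', bagged_like),
                ('Random Forest (sqrt features per split)', forest)]:
    m.fit(x_train, y_train)
    print(name, "| Test Accuracy: {:.2%}".format(accuracy_score(y_test, m.predict(x_test))))

#Averaged class probabilities instead of hard labels
print("First 5 predicted spam probabilities:", forest.predict_proba(x_test)[:5, 1])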
Part 5 : Fitting an Adaboost Model
Now let's try Boosting!
#Fit an Adaboost Model
#Training
model = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=5), n_estimators=100, learning_rate=0.05)
model.fit(x_train, y_train)
#Predict
y_pred_train = model.predict(x_train)
y_pred_test = model.predict(x_test)
#Performance Evaluation
train_score = accuracy_score(y_train, y_pred_train)*100
test_score = accuracy_score(y_test, y_pred_test)*100
print("Accuracy, Training Set :",str(train_score)+'%')
print("Accuracy, Testing Set :",str(test_score)+'%')
#Plot Iteration based score
train_scores = list(model.staged_score(x_train,y_train))
test_scores = list(model.staged_score(x_test, y_test))
plt.plot(train_scores,label='train')
plt.plot(test_scores,label='test')
plt.xlabel('Iteration')
plt.ylabel('Accuracy')
plt.title("Variation of Accuracy with Iterations")
plt.legend();
AdaBoost seems to perform better than a single Decision Tree and the Random Forest in terms of Test Set Accuracy. What is different about the approach it takes?
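The key difference is that boosting fits trees sequentially, reweighting the training points so that later trees focus on the examples earlier trees got wrong. A very rough sketch of that reweighting idea (a simplified SAMME-style loop for illustration, not sklearn's exact implementation):
#Sketch: simplified AdaBoost-style reweighting (illustrative, not sklearn's exact algorithm)
n_samples = x_train.shape[0]
w = np.ones(n_samples) / n_samples          # start with uniform sample weights
for m in range(5):                          # a few boosting rounds
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(x_train, y_train, sample_weight=w)
    miss = (stump.predict(x_train) != y_train)
    err = np.sum(w[miss]) / np.sum(w)       # weighted error of this round
    alpha = np.log((1 - err) / err)         # weight given to this weak learner
    w[miss] *= np.exp(alpha)                # upweight the misclassified points
    w /= w.sum()
    print("Round", m+1, "| weighted error: {:.3f}, learner weight: {:.2f}".format(err, alpha))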
#Find Optimal Depth of trees for Boosting
score_train, score_test, depth_start, depth_end = {}, {}, 2, 20
for i in tqdm(range(depth_start, depth_end)):
    model = AdaBoostClassifier(
        base_estimator=DecisionTreeClassifier(max_depth=i),
        n_estimators=100, learning_rate=0.05)
    model.fit(x_train, y_train)
    score_train[i] = accuracy_score(y_train, model.predict(x_train))
    score_test[i] = accuracy_score(y_test, model.predict(x_test))
#Plot
lists1 = sorted(score_train.items())
lists2 = sorted(score_test.items())
x1, y1 = zip(*lists1)
x2, y2 = zip(*lists2)
plt.ylabel("Accuracy")
plt.xlabel("Depth")
plt.title('Variation of Accuracy with Depth - Adaboost Classifier')
plt.plot(x1, y1, 'b-', label='Train')
plt.plot(x2, y2, 'g-', label='Test')
plt.legend()
plt.show()
Adaboost complexity depends on both the number of estimators and the complexity of the base estimator. As model complexity increases, we observe an increase in training accuracy, but as we go further to the right of the graph the model starts to overfit the data. To understand more formally what is going on here, we give a brief treatment of how bias and variance are related in the next section.
What would happen if we varied the depth and the number of estimators simultaneously? Is that the right way to assess the trade-off between performance and overfitting? A rough sketch follows below.
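As a sketch of varying both at once, we can fit AdaBoost at a few depths and use staged_score to trace test accuracy across the number of estimators (the grid values below are arbitrary and this cell can be slow):
#Sketch: vary tree depth and number of estimators together (illustrative grid)
for d in [1, 3, 5]:
    model = AdaBoostClassifier(
        base_estimator=DecisionTreeClassifier(max_depth=d),
        n_estimators=200, learning_rate=0.05)
    model.fit(x_train, y_train)
    test_scores = list(model.staged_score(x_test, y_test))
    plt.plot(test_scores, label='depth {}'.format(d))
plt.xlabel('Number of estimators (iterations)')
plt.ylabel('Test Accuracy')
plt.legend()
plt.show()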
Food for Thought : Are boosted models independent of one another? Do they need to wait for the previous model's residuals?
Part 6 : The Bias-Variance tradeoff
A central notion underlying what we've been learning in lectures and sections so far is the trade-off between overfitting and underfitting. If you remember back to Homework 3, we had a model that seemed to represent our data accurately. However, we saw that as we made it more and more accurate on the training set, it did not generalize well to unobserved data.
As a different example, in face recognition algorithms, such as the one on the iPhone X, an overfit model would be unable to identify someone who styled their hair differently that day, because the model may have learned irrelevant features in the training data. Conversely, an insufficiently trained model would not generalize well either; for example, it was recently reported that a face mask could fool the iPhone X.
A widely used solution in statistics to reduce overfitting consists of adding structure to the model, with something like regularization. This method favors simpler models during training.
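As a concrete example, here is a quick sketch using the LogisticRegressionCV we imported at the top, which selects the regularization strength by cross-validation (the Cs grid and max_iter are illustrative choices):
#Sketch: L2-regularized logistic regression with the penalty strength chosen by cross-validation
logit = LogisticRegressionCV(Cs=10, cv=5, max_iter=1000)
logit.fit(x_train, y_train)
print("Chosen C (inverse regularization strength):", logit.C_[0])
print("Accuracy, Testing Set: {:.2%}".format(accuracy_score(y_test, logit.predict(x_test))))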
The bias-variance dilemma is closely related. The bias of a model quantifies how far its predictions are, on average across training sets, from the truth. The variance quantifies how sensitive the model is to small changes in the training set; a robust model is not overly sensitive to such changes. The dilemma involves minimizing both bias and variance: we want an accurate and robust model. Simpler models tend to be less accurate but more robust; complex models tend to be more accurate but less robust.
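To see this tradeoff on our spam data, here is a minimal sketch using the learning_curve utility imported at the top (the depths and the train-size grid are illustrative): a shallow tree (high bias) should plateau early with train and CV accuracy close together, while a deep tree (high variance) should keep a larger gap between them.
#Sketch: learning curves for a simple (high-bias) tree and a complex (high-variance) tree
for d, style in [(2, '--'), (20, '-')]:
    sizes, train_sc, val_sc = learning_curve(
        DecisionTreeClassifier(max_depth=d), x_train, y_train,
        train_sizes=np.linspace(0.1, 1.0, 5), cv=5)
    plt.plot(sizes, train_sc.mean(axis=1), 'b' + style, label='train, depth {}'.format(d))
    plt.plot(sizes, val_sc.mean(axis=1), 'g' + style, label='CV, depth {}'.format(d))
plt.xlabel('Training set size')
plt.ylabel('Accuracy')
plt.title('Learning Curves - Simple vs Complex Tree')
plt.legend()
plt.show()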