CS109A Introduction to Data Science
Standard Section 8: Bagging and Random Forest
Harvard University
Fall 2017
Section Leaders: Mehul Smriti Raje, Ken Arnold, Karan R. Motwani, Cecilia Garraffo
Instructors: Pavlos Protopapas, Kevin Rader
#RUN THIS CELL
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
This section will work with a spam email dataset. Our ultimate goal is to build models that predict whether an email is spam or not spam based on word characteristics within each email. We will cover the Bagging, Random Forest, and AdaBoost methods and allow you to apply them to the homework.
Specifically, we will:
1. Load in the spam dataset and split the data into train and test.
2. Find the optimal depth for the Decision Tree model and evaluate performance.
3. Fit the Bagging model using multiple bootstrapped datasets and ensemble.
4. Fit the Random Forest Model and compare with Bagging.
5. Use the Adaboost method to visualize Bias-Variance tradeoff.
6. Example to better understand Bias vs Variance tradeoff.
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import sklearn.metrics as metrics
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix
%matplotlib inline
from tqdm import tqdm
pd.set_option('display.width', 1500)
pd.set_option('display.max_columns', 100)
from sklearn.model_selection import learning_curve
If you do not have tqdm installed, run ! pip install tqdm to install the tqdm package.
Part 0 : Introduction to the Spam Dataset
We will be working with a spam email dataset. The dataset has 57 predictors with a response variable called Spam that indicates whether an email is spam or not spam. The goal is to be able to create a classifier or method that acts as a spam filter.
#Import Dataframe and Set Column Names
spam_df = pd.read_csv('data/spam.csv', header=None)
columns = ["Column_"+str(i+1) for i in range(spam_df.shape[1]-1)] + ['Spam']
spam_df.columns = columns
display(spam_df.head())
The predictor variables are all continuous. They represent features such as the frequency of the word "discount". The exact specification and description of each predictor can be found online. We are not so much interested in interpreting individual predictors, so we will omit their exact names; we care more about the predictive performance of the algorithm, so we will treat each column simply as a predictor without going into detail on each one.
Link to description : https://archive.ics.uci.edu/ml/datasets/spambase
Let us split the dataset into a 70-30 train-test split using the following:
Note: While you will use train_test_split in your homeworks (a quick sketch of the equivalent call follows for reference), the masking code after it should help you visualize splitting/masking of a dataframe, which is helpful in general.
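A minimal sketch of the train_test_split equivalent; the random_state and the optional stratify argument are illustrative choices, not what we use below:
from sklearn.model_selection import train_test_split

#Sketch: equivalent 70-30 split with sklearn (illustrative only; we use the mask below instead)
data_train_alt, data_test_alt = train_test_split(
    spam_df, test_size=0.3, random_state=42, stratify=spam_df['Spam'])
print("Shape of Training Set :", data_train_alt.shape)
print("Shape of Testing Set :", data_test_alt.shape)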
#Split data into train and test
np.random.seed(42)
msk = np.random.rand(len(spam_df)) < 0.7
data_train = spam_df[msk]
data_test = spam_df[~msk]
#Split predictor and response columns
x_train, y_train = data_train.drop(['Spam'], axis=1), data_train['Spam']
x_test, y_test = data_test.drop(['Spam'], axis=1), data_test['Spam']
print("Shape of Training Set :",data_train.shape)
print("Shape of Testing Set :", data_test.shape)
#Display the first 10 rows of the dataset
spam_df.iloc[np.arange(10)]
We can check that the number of spam cases is roughly evenly represented in both the training and test set.
#Check Percentage of Spam in Train and Test Set
print("Percentage of Spam in Training Set :", str(100*y_train.sum()/len(y_train))+'%')
print("Percentage of Spam in Testing Set :",str(100*y_test.sum()/len(y_test))+'%')
Part 1 : Fitting an Optimal Single Decision Tree (by Depth)
Here we fit a single decision tree to our spam dataset and perform 5-fold cross-validation on the training set. For EACH depth of the tree, we fit a tree and compute the 5-fold CV scores; these scores are then averaged and compared across depths.
#Find optimal depth of trees
depth, tree_start, tree_end = {}, 3, 20
for i in range(tree_start, tree_end):
    model = DecisionTreeClassifier(max_depth=i)
    scores = cross_val_score(estimator=model, X=x_train, y=y_train, cv=5, n_jobs=-1)
    depth[i] = scores.mean()
#Plot
lists = sorted(depth.items())
x, y = zip(*lists)
plt.ylabel("Cross Validation Accuracy")
plt.xlabel("Maximum Depth")
plt.title('Variation of Accuracy with Depth - Simple Decision Tree')
plt.plot(x, y, 'b-', marker='o')
plt.show()
As we can see, the optimal depth is found to be 7. That said, does it make sense to choose 6 instead?
Let us set best_depth as a new parameter. Can you see why we coded the best depth parameter as we did below? (Hint: think about reproducibility.)
Also, if we wanted to get confidence bands around these results, how would we? It is as simple as combining the standard deviation from scores.std() with plt.fill_between(), as in the sketch below.
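A minimal sketch of those confidence bands, re-running the cross-validation above while also keeping each depth's standard deviation (the +/- 2 standard-deviation band is an illustrative choice):
#Sketch: cross-validation accuracy with +/- 2 standard deviation bands
depths = list(range(tree_start, tree_end))
means, stds = [], []
for i in depths:
    scores = cross_val_score(DecisionTreeClassifier(max_depth=i), x_train, y_train, cv=5, n_jobs=-1)
    means.append(scores.mean())
    stds.append(scores.std())
means, stds = np.array(means), np.array(stds)
plt.plot(depths, means, 'b-', marker='o')
plt.fill_between(depths, means - 2*stds, means + 2*stds, alpha=0.3)
plt.ylabel("Cross Validation Accuracy")
plt.xlabel("Maximum Depth")
plt.title('Variation of Accuracy with Depth - with Confidence Bands')
plt.show()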
#Make best depth a variable
best_depth = sorted(depth, key=depth.get, reverse=True)[0]
print("The best depth was found to be:", best_depth)
#Evaluate the performance at the best depth
model = DecisionTreeClassifier(max_depth=best_depth)
model.fit(x_train, y_train)
#Check Accuracy of Spam Detection in Train and Test Set
print("Accuracy, Training Set: {:.2%}".format(accuracy_score(y_train, model.predict(x_train))))
print("Accuracy, Testing Set: {:.2%}".format(accuracy_score(y_test, model.predict(x_test))))
#Get Performance by Class (Lookup Confusion Matrix)
pd.crosstab(y_test, model.predict(x_test), margins=True, rownames=['Actual'], colnames=['Predicted'])
Part 2: Bagging and Voting
Let's bootstrap our training dataset to create multiple bootstrapped datasets and fit a Decision Tree model to each.
(Resampling: we showed live that different samples give different results for statistics like sums, and that these vary more when the quantities we sum over have high variance themselves.)
# Column sums over the full training data
data_train.sum(axis=0).to_frame('sum').T
# Column sums over one bootstrap resample of the training data
data_train.sample(frac=1., replace=True).sum(axis=0).to_frame('sum').T
Now let's actually fit a tree to each bootstrap sample.
#Creating model
np.random.seed(0)
model = DecisionTreeClassifier(max_depth=5) # we tried a variety of depths here
#Initializing variables
n_trees = 100 # we tried a variety of numbers here
predictions_train = np.zeros((data_train.shape[0], n_trees))
predictions_test = np.zeros((data_test.shape[0], n_trees))
#Conduct bootstraping iterations
for i in range(n_trees):
    temp = data_train.sample(frac=1, replace=True)
    response_variable = temp['Spam']
    temp = temp.drop(['Spam'], axis=1)
    model.fit(temp, response_variable)
    predictions_train[:,i] = model.predict(x_train)
    predictions_test[:,i] = model.predict(x_test)
#Make Predictions Dataframe
columns = ["Bootstrap-Model_"+str(i+1) for i in range(n_trees)]
predictions_train = pd.DataFrame(predictions_train, columns=columns)
predictions_test = pd.DataFrame(predictions_test, columns=columns)
y_train = data_train['Spam'].values
y_test = data_test['Spam'].values
num_to_avg = 100 # we varied this line, from 1, 2, 3, 100, n_trees
fig, axs = plt.subplots(1, 2, figsize=(16, 7))
for (ax, label, predictions, y) in [
        (axs[0], 'train', predictions_train, y_train),
        (axs[1], 'test', predictions_test, y_test)
]:
    mean_predictions = predictions.iloc[:,:num_to_avg].mean(axis=1)
    mean_predictions[y == 1].hist(density=True, histtype='step', range=[0,1], label='Spam', lw=2, ax=ax)
    mean_predictions[y == 0].hist(density=True, histtype='step', range=[0,1], label='Not-Spam', lw=2, ax=ax)
    ax.legend(loc='upper center')
    ax.set_xlabel("Mean of ensemble predictions")
    ax.set_title(label)
(Try 100 trees of depth 1. Why do the plots look similar to when we did 2 trees of higher depth?)
And now vote!
#Function to ensemble the prediction of each bagged decision tree model
def get_prediction(df, count=-1):
    count = df.shape[1] if count == -1 else count
    temp = df.iloc[:,0:count]
    return np.mean(temp, axis=1) > 0.5
#Check Accuracy of Spam Detection in Train and Test Set
print("Accuracy, Training Set :", str(100*accuracy_score(y_train, get_prediction(predictions_train, count=-1)))+'%')
print("Accuracy, Testing Set :", str(100*accuracy_score(y_test, get_prediction(predictions_test, count=-1)))+'%')
The count argument in the code above can be used to set how many of the bagged models the vote is based on.
#Get Performance by Class for the ensemble vote (Lookup Confusion Matrix)
pd.crosstab(np.array(y_test), np.array(get_prediction(predictions_test, count=-1), dtype=int), margins=True, rownames=['Actual'], colnames=['Predicted'])
Does the accuracy vary with the number of bagged models we include in the vote?
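As a quick sketch, we can vote with only the first few bagged models by passing different values of count to the get_prediction helper defined above (the counts below are arbitrary):
#Sketch: test accuracy as a function of how many bagged models we vote over
for c in [1, 2, 5, 10, 25, 50, 100]:
    acc = accuracy_score(y_test, get_prediction(predictions_test, count=c))
    print("Models used in vote:", c, "| Test Accuracy: {:.2%}".format(acc))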
Food for Thought : Are these bagging models independent of each other? Can they be trained in parallel?
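Since each bootstrapped tree is fit without reference to the others, they can in principle be trained in parallel. As an aside, here is a minimal sketch using sklearn's BaggingClassifier, which parallelizes the bootstrapped trees via n_jobs (parameter values are illustrative; in newer sklearn versions the keyword is estimator rather than base_estimator):
from sklearn.ensemble import BaggingClassifier

#Sketch: bagged trees are fit independently, so sklearn can train them in parallel (n_jobs=-1)
bag_model = BaggingClassifier(
    base_estimator=DecisionTreeClassifier(max_depth=best_depth),
    n_estimators=100, n_jobs=-1, random_state=0)
bag_model.fit(x_train, y_train)
print("Accuracy, Testing Set: {:.2%}".format(accuracy_score(y_test, bag_model.predict(x_test))))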
Part 3 : Random Forest vs Bagging
Now we will fit an ensemble method, the Random Forest, which differs from a single decision tree. Refer to the lecture slides for a full treatment of how they differ. Let's use n_estimators = predictor_count/2 and max_depth = best_depth.
#Fit a Random Forest Model
#Training
model = RandomForestClassifier(n_estimators=int(x_train.shape[1]/2), max_depth=best_depth)
model.fit(x_train, y_train)
#Predict
y_pred_train = model.predict(x_train)
y_pred_test = model.predict(x_test)
#Performance Evaluation
train_score = accuracy_score(y_train, y_pred_train)*100
test_score = accuracy_score(y_test, y_pred_test)*100
print("Accuracy, Training Set :",str(train_score)+'%')
print("Accuracy, Testing Set :",str(test_score)+'%')
#Top Features
feature_importance = model.feature_importances_
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
#Plot
plt.figure(figsize=(10,12))
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, x_train.columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
Random Forest gives the above values via feature_importances_, which weight each predictor's impact by how often (and how much) it proves useful across the trees, giving an overall measure of significance for free. Explore the attributes of the fitted Random Forest object for more detail.
As we see above, the performance of both Bagging and Random Forest was similar, so what is the difference? Do both overfit the data just as much?
Hints :
What is the only extra parameter we declared when defining a Random Forest Model vs Bagging? Does it have an impact on overfitting?
When we ensembled trees using bagging, we used the class labels to get the majority vote. Can you think of an advantage to using prediction probabilities? Does Random Forest use labels or probabilities?
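To explore both hints, here is a rough sketch (the parameter choices are illustrative): setting max_features=None makes every split consider all predictors, which is essentially bagging, while restricting each split to a random subset (here 'sqrt') is what makes it a Random Forest; and predict_proba exposes the averaged class probabilities rather than a hard vote.
#Sketch: bagging vs random forest differ in how many features each split may consider
bagged_like = RandomForestClassifier(n_estimators=100, max_depth=best_depth,
                                     max_features=None, random_state=0)
forest = RandomForestClassifier(n_estimators=100, max_depth=best_depth,
                                max_features='sqrt', random_state=0)
for name, m in [('Bagging-like (all features per split)', bagged_like),
                ('Random Forest (sqrt features per split)', forest)]:
    m.fit(x_train, y_train)
    print(name, "| Test Accuracy: {:.2%}".format(accuracy_score(y_test, m.predict(x_test))))

#Averaged class probabilities instead of hard labels
print("First 5 predicted spam probabilities:", forest.predict_proba(x_test)[:5, 1])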
Part 5 : Fitting an Adaboost Model
Now let's try Boosting!
#Fit an Adaboost Model
#Training
model = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=5), n_estimators=100, learning_rate=0.05)
model.fit(x_train, y_train)
#Predict
y_pred_train = model.predict(x_train)
y_pred_test = model.predict(x_test)
#Performance Evaluation
train_score = accuracy_score(y_train, y_pred_train)*100
test_score = accuracy_score(y_test, y_pred_test)*100
print("Accuracy, Training Set :",str(train_score)+'%')
print("Accuracy, Testing Set :",str(test_score)+'%')
#Plot Iteration based score
train_scores = list(model.staged_score(x_train,y_train))
test_scores = list(model.staged_score(x_test, y_test))
plt.plot(train_scores,label='train')
plt.plot(test_scores,label='test')
plt.xlabel('Iteration')
plt.ylabel('Accuracy')
plt.title("Variation of Accuracy with Iterations")
plt.legend();
AdaBoost seems to perform better than a single Decision Tree and the Random Forest in terms of Test Set Accuracy. What is different about the approach it takes?
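The key difference is that boosting fits trees sequentially, reweighting the training points so that later trees focus on the examples earlier trees got wrong. A very rough sketch of that reweighting idea (a simplified SAMME-style loop for illustration, not sklearn's exact implementation):
#Sketch: simplified AdaBoost-style reweighting (illustrative, not sklearn's exact algorithm)
n_samples = x_train.shape[0]
w = np.ones(n_samples) / n_samples          # start with uniform sample weights
for m in range(5):                          # a few boosting rounds
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(x_train, y_train, sample_weight=w)
    miss = (stump.predict(x_train) != y_train)
    err = np.sum(w[miss]) / np.sum(w)       # weighted error of this round
    alpha = np.log((1 - err) / err)         # weight given to this weak learner
    w[miss] *= np.exp(alpha)                # upweight the misclassified points
    w /= w.sum()
    print("Round", m+1, "| weighted error: {:.3f}, learner weight: {:.2f}".format(err, alpha))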
#Find Optimal Depth of trees for Boosting
score_train, score_test, depth_start, depth_end = {}, {}, 2, 20
for i in tqdm(range(depth_start, depth_end)):
    model = AdaBoostClassifier(
        base_estimator=DecisionTreeClassifier(max_depth=i),
        n_estimators=100, learning_rate=0.05)
    model.fit(x_train, y_train)
    score_train[i] = accuracy_score(y_train, model.predict(x_train))
    score_test[i] = accuracy_score(y_test, model.predict(x_test))
#Plot
lists1 = sorted(score_train.items())
lists2 = sorted(score_test.items())
x1, y1 = zip(*lists1)
x2, y2 = zip(*lists2)
plt.ylabel("Accuracy")
plt.xlabel("Depth")
plt.title('Variation of Accuracy with Depth - Adaboost Classifier')
plt.plot(x1, y1, 'b-', label='Train')
plt.plot(x2, y2, 'g-', label='Test')
plt.legend()
plt.show()
Adaboost complexity depends on both the number of estimators and the complexity of the base estimator. As model complexity increases, we observe an increase in training accuracy, but as we go further to the right of the graph the model starts to overfit the data. To understand more formally what is going on here, we give a brief treatment of how bias and variance are related in the next section.
What would happen if we varied the depth and the number of estimators simultaneously? Is that the right way to assess the trade-off between performance and overfitting? A rough sketch follows below.
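As a sketch of varying both at once, we can fit AdaBoost at a few depths and use staged_score to trace test accuracy across the number of estimators (the grid values below are arbitrary and this cell can be slow):
#Sketch: vary tree depth and number of estimators together (illustrative grid)
for d in [1, 3, 5]:
    model = AdaBoostClassifier(
        base_estimator=DecisionTreeClassifier(max_depth=d),
        n_estimators=200, learning_rate=0.05)
    model.fit(x_train, y_train)
    test_scores = list(model.staged_score(x_test, y_test))
    plt.plot(test_scores, label='depth {}'.format(d))
plt.xlabel('Number of estimators (iterations)')
plt.ylabel('Test Accuracy')
plt.legend()
plt.show()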
Food for Thought : Are boosted models independent of one another? Do they need to wait for the previous model's residuals?
Part 6 : The Bias-Variance tradeoff
A central notion underlying what we've been learning in lectures and sections so far is the trade-off between overfitting and underfitting. If you remember back to Homework 3, we had a model that seemed to represent our data accurately. However, we saw that as we made it more and more accurate on the training set, it did not generalize well to unobserved data.
As a different example, in face recognition algorithms, such as the one on the iPhone X, an overfit model would be unable to identify someone who styled their hair differently that day, because the model may have learned irrelevant features in the training data. Conversely, an insufficiently trained model would not generalize well either; for example, it was recently reported that a face mask could fool the iPhone X.
A widely used solution in statistics to reduce overfitting consists of adding structure to the model, with something like regularization. This method favors simpler models during training.
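As a concrete example, here is a quick sketch using the LogisticRegressionCV we imported at the top, which selects the regularization strength by cross-validation (the Cs grid and max_iter are illustrative choices):
#Sketch: L2-regularized logistic regression with the penalty strength chosen by cross-validation
logit = LogisticRegressionCV(Cs=10, cv=5, max_iter=1000)
logit.fit(x_train, y_train)
print("Chosen C (inverse regularization strength):", logit.C_[0])
print("Accuracy, Testing Set: {:.2%}".format(accuracy_score(y_test, logit.predict(x_test))))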
The bias-variance dilemma is closely related. The bias of a model quantifies how far its predictions are, on average across training sets, from the truth. The variance quantifies how sensitive the model is to small changes in the training set; a robust model is not overly sensitive to such changes. The dilemma involves minimizing both bias and variance: we want an accurate and robust model. Simpler models tend to be less accurate but more robust; complex models tend to be more accurate but less robust.
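To see this tradeoff on our spam data, here is a minimal sketch using the learning_curve utility imported at the top (the depths and the train-size grid are illustrative): a shallow tree (high bias) should plateau early with train and CV accuracy close together, while a deep tree (high variance) should keep a larger gap between them.
#Sketch: learning curves for a simple (high-bias) tree and a complex (high-variance) tree
for d, style in [(2, '--'), (20, '-')]:
    sizes, train_sc, val_sc = learning_curve(
        DecisionTreeClassifier(max_depth=d), x_train, y_train,
        train_sizes=np.linspace(0.1, 1.0, 5), cv=5)
    plt.plot(sizes, train_sc.mean(axis=1), 'b' + style, label='train, depth {}'.format(d))
    plt.plot(sizes, val_sc.mean(axis=1), 'g' + style, label='CV, depth {}'.format(d))
plt.xlabel('Training set size')
plt.ylabel('Accuracy')
plt.title('Learning Curves - Simple vs Complex Tree')
plt.legend()
plt.show()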