CS109A Introduction to Data Science

Standard Section 8: Bagging and Random Forest

Harvard University
Fall 2017
Section Leaders: Mehul Smriti Raje, Ken Arnold, Karan R. Motwani, Cecilia Garraffo
Instructors: Pavlos Protopapas, Kevin Rader



This section will work with a spam email dataset. Our ultimate goal is to build models that predict whether an email is spam or not spam based on word characteristics within each email. We will cover the Bagging, Random Forest and AdaBoost methods so that you can apply them to the homework.

Specifically, we will:

1. Load in the spam dataset and split the data into train and test.
2. Find the optimal depth for the Decision Tree model and evaluate performance.
3. Fit the Bagging model using multiple bootstrapped datasets and ensemble. 
4. Fit the Random Forest Model and compare with Bagging.
5. Use the Adaboost method to visualize the Bias-Variance tradeoff.
6. Work through an example to better understand the Bias-Variance tradeoff.
In [1]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import sklearn.metrics as metrics
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix
%matplotlib inline
from tqdm import tqdm

pd.set_option('display.width', 1500)
pd.set_option('display.max_columns', 100)

from sklearn.model_selection import learning_curve

Note: if the tqdm import fails, run ! pip install tqdm to install the tqdm package.

Part 0 : Introduction to the Spam Dataset

We will be working with a spam email dataset. The dataset has 57 predictors with a response variable called Spam that indicates whether an email is spam or not spam. The goal is to be able to create a classifier or method that acts as a spam filter.

In [2]:
#Import Dataframe and Set Column Names
spam_df = pd.read_csv('data/spam.csv', header=None)
columns = ["Column_"+str(i+1) for i in range(spam_df.shape[1]-1)] + ['Spam']
spam_df.columns = columns
display(spam_df.head())
Column_1 Column_2 Column_3 Column_4 Column_5 Column_6 Column_7 Column_8 Column_9 Column_10 Column_11 Column_12 Column_13 Column_14 Column_15 Column_16 Column_17 Column_18 Column_19 Column_20 Column_21 Column_22 Column_23 Column_24 Column_25 Column_26 Column_27 Column_28 Column_29 Column_30 Column_31 Column_32 Column_33 Column_34 Column_35 Column_36 Column_37 Column_38 Column_39 Column_40 Column_41 Column_42 Column_43 Column_44 Column_45 Column_46 Column_47 Column_48 Column_49 Column_50 Column_51 Column_52 Column_53 Column_54 Column_55 Column_56 Column_57 Spam
0 0.00 0.64 0.64 0.0 0.32 0.00 0.00 0.00 0.00 0.00 0.00 0.64 0.00 0.00 0.00 0.32 0.00 1.29 1.93 0.00 0.96 0.0 0.00 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.00 0.0 0.00 0.00 0.0 0.0 0.00 0.000 0.0 0.778 0.000 0.000 3.756 61 278 1
1 0.21 0.28 0.50 0.0 0.14 0.28 0.21 0.07 0.00 0.94 0.21 0.79 0.65 0.21 0.14 0.14 0.07 0.28 3.47 0.00 1.59 0.0 0.43 0.43 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.07 0.0 0.0 0.00 0.0 0.0 0.00 0.0 0.00 0.00 0.0 0.0 0.00 0.132 0.0 0.372 0.180 0.048 5.114 101 1028 1
2 0.06 0.00 0.71 0.0 1.23 0.19 0.19 0.12 0.64 0.25 0.38 0.45 0.12 0.00 1.75 0.06 0.06 1.03 1.36 0.32 0.51 0.0 1.16 0.06 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.0 0.06 0.0 0.0 0.12 0.0 0.06 0.06 0.0 0.0 0.01 0.143 0.0 0.276 0.184 0.010 9.821 485 2259 1
3 0.00 0.00 0.00 0.0 0.63 0.00 0.31 0.63 0.31 0.63 0.31 0.31 0.31 0.00 0.00 0.31 0.00 0.00 3.18 0.00 0.31 0.0 0.00 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.00 0.0 0.00 0.00 0.0 0.0 0.00 0.137 0.0 0.137 0.000 0.000 3.537 40 191 1
4 0.00 0.00 0.00 0.0 0.63 0.00 0.31 0.63 0.31 0.63 0.31 0.31 0.31 0.00 0.00 0.31 0.00 0.00 3.18 0.00 0.31 0.0 0.00 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.00 0.0 0.00 0.00 0.0 0.0 0.00 0.135 0.0 0.135 0.000 0.000 3.537 40 191 1

The predictor variables are all continuous. They represent features such as the frequency of the word "discount". The exact specification and description of each predictor can be found online. Since we are more interested in the predictive performance of the algorithms than in inference about individual predictors, we omit the predictor names and treat each column simply as a predictor.

Let us split the dataset into a 70-30 split by using the following:

Note : While you will use train_test_split in your homeworks, the code below should help you visualize splitting/masking of a dataframe which will be helpful in general.

In [4]:
#Split data into train and test
np.random.seed(42)
msk = np.random.rand(len(spam_df)) < 0.7
data_train = spam_df[msk]
data_test = spam_df[~msk]

#Split predictor and response columns
x_train, y_train = data_train.drop(['Spam'], axis=1), data_train['Spam']
x_test, y_test = data_test.drop(['Spam'], axis=1), data_test['Spam']

print("Shape of Training Set :",data_train.shape)
print("Shape of Testing Set :", data_test.shape)
Shape of Training Set : (3262, 58)
Shape of Testing Set : (1339, 58)
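
For comparison with the masking approach above, here is a minimal sketch of the same kind of split done with train_test_split. The dataframe toy_df and its two feature columns are made up for illustration and merely stand in for spam_df. Note that the random mask gives an approximately 70-30 split, whereas train_test_split gives an exact one.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for spam_df (not the real spam set)
rng = np.random.RandomState(42)
toy_df = pd.DataFrame({
    "feature_a": rng.rand(1000),
    "feature_b": rng.rand(1000),
    "Spam": rng.randint(0, 2, size=1000),
})

# test_size=0.3 gives an exact 70-30 split; stratify keeps the share of
# spam (nearly) identical in the two parts
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.3, random_state=42, stratify=toy_df["Spam"]
)

print("Shape of Training Set :", toy_train.shape)
print("Shape of Testing Set :", toy_test.shape)
print("Share of spam (train, test) :", toy_train["Spam"].mean(), toy_test["Spam"].mean())
```

The stratify argument is optional; without it the class proportions in the two parts can drift apart by chance, which is exactly what the percentage check on the spam data is meant to catch.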
In [13]:
spam_df.iloc[np.arange(10)]
Out[13]:
Column_1 Column_2 Column_3 Column_4 Column_5 Column_6 Column_7 Column_8 Column_9 Column_10 Column_11 Column_12 Column_13 Column_14 Column_15 Column_16 Column_17 Column_18 Column_19 Column_20 Column_21 Column_22 Column_23 Column_24 Column_25 Column_26 Column_27 Column_28 Column_29 Column_30 Column_31 Column_32 Column_33 Column_34 Column_35 Column_36 Column_37 Column_38 Column_39 Column_40 Column_41 Column_42 Column_43 Column_44 Column_45 Column_46 Column_47 Column_48 Column_49 Column_50 Column_51 Column_52 Column_53 Column_54 Column_55 Column_56 Column_57 Spam
0 0.00 0.64 0.64 0.0 0.32 0.00 0.00 0.00 0.00 0.00 0.00 0.64 0.00 0.00 0.00 0.32 0.00 1.29 1.93 0.00 0.96 0.0 0.00 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.0 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.00 0.00 0.00 0.00 0.0 0.0 0.00 0.000 0.0 0.778 0.000 0.000 3.756 61 278 1
1 0.21 0.28 0.50 0.0 0.14 0.28 0.21 0.07 0.00 0.94 0.21 0.79 0.65 0.21 0.14 0.14 0.07 0.28 3.47 0.00 1.59 0.0 0.43 0.43 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.0 0.0 0.07 0.0 0.0 0.00 0.0 0.0 0.00 0.00 0.00 0.00 0.0 0.0 0.00 0.132 0.0 0.372 0.180 0.048 5.114 101 1028 1
2 0.06 0.00 0.71 0.0 1.23 0.19 0.19 0.12 0.64 0.25 0.38 0.45 0.12 0.00 1.75 0.06 0.06 1.03 1.36 0.32 0.51 0.0 1.16 0.06 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.0 0.0 0.00 0.0 0.0 0.06 0.0 0.0 0.12 0.00 0.06 0.06 0.0 0.0 0.01 0.143 0.0 0.276 0.184 0.010 9.821 485 2259 1
3 0.00 0.00 0.00 0.0 0.63 0.00 0.31 0.63 0.31 0.63 0.31 0.31 0.31 0.00 0.00 0.31 0.00 0.00 3.18 0.00 0.31 0.0 0.00 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.0 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.00 0.00 0.00 0.00 0.0 0.0 0.00 0.137 0.0 0.137 0.000 0.000 3.537 40 191 1
4 0.00 0.00 0.00 0.0 0.63 0.00 0.31 0.63 0.31 0.63 0.31 0.31 0.31 0.00 0.00 0.31 0.00 0.00 3.18 0.00 0.31 0.0 0.00 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.0 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.00 0.00 0.00 0.00 0.0 0.0 0.00 0.135 0.0 0.135 0.000 0.000 3.537 40 191 1
5 0.00 0.00 0.00 0.0 1.85 0.00 0.00 1.85 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.0 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.00 0.00 0.00 0.00 0.0 0.0 0.00 0.223 0.0 0.000 0.000 0.000 3.000 15 54 1
6 0.00 0.00 0.00 0.0 1.92 0.00 0.00 0.00 0.00 0.64 0.96 1.28 0.00 0.00 0.00 0.96 0.00 0.32 3.85 0.00 0.64 0.0 0.00 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.0 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.00 0.00 0.00 0.00 0.0 0.0 0.00 0.054 0.0 0.164 0.054 0.000 1.671 4 112 1
7 0.00 0.00 0.00 0.0 1.88 0.00 0.00 1.88 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.0 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.00 0.00 0.00 0.00 0.0 0.0 0.00 0.206 0.0 0.000 0.000 0.000 2.450 11 49 1
8 0.15 0.00 0.46 0.0 0.61 0.00 0.30 0.00 0.92 0.76 0.76 0.92 0.00 0.00 0.00 0.00 0.00 0.15 1.23 3.53 2.00 0.0 0.00 0.15 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.15 0.0 0.0 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.30 0.00 0.00 0.00 0.0 0.0 0.00 0.271 0.0 0.181 0.203 0.022 9.744 445 1257 1
9 0.06 0.12 0.77 0.0 0.19 0.32 0.38 0.00 0.06 0.00 0.00 0.64 0.25 0.00 0.12 0.00 0.00 0.12 1.67 0.06 0.71 0.0 0.19 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.0 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.00 0.06 0.00 0.00 0.0 0.0 0.04 0.030 0.0 0.244 0.081 0.000 1.729 43 749 1

We can check that spam emails make up roughly the same proportion of the training set and the test set.

In [11]:
#Check Percentage of Spam in Train and Test Set
print("Percentage of Spam in Training Set :", str(100*y_train.sum()/len(y_train))+'%')
print("Percentage of Spam in Testing Set :",str(100*y_test.sum()/len(y_test))+'%')
Percentage of Spam in Training Set : 39.17841814837523%
Percentage of Spam in Testing Set : 39.955190440627334%

Part 1 : Fitting an Optimal Single Decision Tree (by Depth) :

We fit here a single tree to our spam dataset and perform 5-fold cross validation on the training set. For EACH depth of the tree, we fit a tree and then compute the 5-fold CV scores. These scores are then averaged and compared across different depths.

In [6]:
#Find optimal depth of trees
depth, tree_start, tree_end = {}, 3, 20
for i in range(tree_start, tree_end):
    model = DecisionTreeClassifier(max_depth=i)
    scores = cross_val_score(estimator=model, X=x_train, y=y_train, cv=5, n_jobs=-1)
    depth[i] = scores.mean()
    
#Plot
lists = sorted(depth.items())
x, y = zip(*lists) 
plt.ylabel("Cross Validation Accuracy")
plt.xlabel("Maximum Depth")
plt.title('Variation of Accuracy with Depth - Simple Decision Tree')
plt.plot(x, y, 'b-', marker='o')
plt.show()

As we can see, the optimal depth is found to be 7. But would it also make sense to choose 6?

Let us set best_depth as a new parameter. Can you see why we coded the best depth parameter as we did below? (Hint: Think about reproducibility.)

Also, if we wanted confidence bands around these results, how would we get them? It's as simple as combining the spread of the fold scores, computed with scores.std() (the standard deviation across folds), with plt.fill_between().
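
A minimal sketch of such a band follows. It runs on synthetic data from make_classification (an assumption, used only so the example is self-contained), and the names X_demo, y_demo, cv_means and cv_stds are illustrative. The band of two standard deviations on either side of the mean is one common choice, not the only one.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for (x_train, y_train)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=0
)

depths = list(range(1, 11))
cv_means, cv_stds = [], []
for d in depths:
    tree_model = DecisionTreeClassifier(max_depth=d, random_state=0)
    fold_scores = cross_val_score(tree_model, X_demo, y_demo, cv=5)
    cv_means.append(fold_scores.mean())  # centre of the band
    cv_stds.append(fold_scores.std())    # spread across the 5 folds

cv_means, cv_stds = np.array(cv_means), np.array(cv_stds)

# Mean accuracy as a line, mean +/- 2 standard deviations as a shaded band
plt.plot(depths, cv_means, 'b-', marker='o', label='Mean CV accuracy')
plt.fill_between(depths, cv_means - 2 * cv_stds, cv_means + 2 * cv_stds,
                 alpha=0.2, label='+/- 2 std across folds')
plt.xlabel("Maximum Depth")
plt.ylabel("Cross Validation Accuracy")
plt.legend()
plt.show()
```

When the bands of two depths overlap heavily, the difference in their mean scores may not be meaningful, which is one argument for preferring the shallower tree.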

In [7]:
#Make best depth a variable
best_depth = sorted(depth, key=depth.get, reverse=True)[0]
print("The best depth was found to be:", best_depth)
The best depth was found to be: 7
In [8]:
#Evaluate the performance at the best depth
model = DecisionTreeClassifier(max_depth=best_depth)
model.fit(x_train, y_train)

#Check Accuracy of Spam Detection in Train and Test Set
print("Accuracy, Training Set: {:.2%}".format(accuracy_score(y_train, model.predict(x_train))))
print("Accuracy, Testing Set: {:.2%}".format(accuracy_score(y_test, model.predict(x_test))))
Accuracy, Training Set: 94.39%
Accuracy, Testing Set: 90.81%
In [9]:
#Get Performance by Class (Lookup Confusion Matrix)
pd.crosstab(y_test, model.predict(x_test), margins=True, rownames=['Actual'], colnames=['Predicted'])
Out[9]:
Predicted 0 1 All
Actual
0 758 46 804
1 77 458 535
All 835 504 1339

Part 2: Bagging and Voting

Let's bootstrap our training dataset to create multiple datasets, and fit a Decision Tree model to each.

(Resampling: we showed live that different samples give different results for things like sums, varying more when the things we sum over have high variance themselves.)

In [80]:
# Stat on all data
data_train.sum(axis=0).to_frame('sum').T
Out[80]:
Column_1 Column_2 Column_3 Column_4 Column_5 Column_6 Column_7 Column_8 Column_9 Column_10 Column_11 Column_12 Column_13 Column_14 Column_15 Column_16 Column_17 Column_18 Column_19 Column_20 Column_21 Column_22 Column_23 Column_24 Column_25 Column_26 Column_27 Column_28 Column_29 Column_30 Column_31 Column_32 Column_33 Column_34 Column_35 Column_36 Column_37 Column_38 Column_39 Column_40 Column_41 Column_42 Column_43 Column_44 Column_45 Column_46 Column_47 Column_48 Column_49 Column_50 Column_51 Column_52 Column_53 Column_54 Column_55 Column_56 Column_57 Spam
sum 345.72 673.94 929.37 215.52 1009.71 314.66 365.93 360.02 289.45 793.26 204.28 1754.83 313.3 183.96 162.91 867.46 455.96 615.97 5516.65 317.18 2661.76 392.17 319.55 289.32 1723.59 836.14 2532.76 419.81 319.23 361.45 211.09 164.33 334.18 165.17 367.15 334.49 443.42 48.39 266.91 224.98 129.31 402.75 148.0 236.4 1005.45 596.8 18.29 104.09 118.468 462.011 61.189 905.649 232.714 167.588 17290.936 172106.0 907663.0 1278.0
In [83]:
data_train.sample(frac=1., replace=True).sum(axis=0).to_frame('sum').T
Out[83]:
Column_1 Column_2 Column_3 Column_4 Column_5 Column_6 Column_7 Column_8 Column_9 Column_10 Column_11 Column_12 Column_13 Column_14 Column_15 Column_16 Column_17 Column_18 Column_19 Column_20 Column_21 Column_22 Column_23 Column_24 Column_25 Column_26 Column_27 Column_28 Column_29 Column_30 Column_31 Column_32 Column_33 Column_34 Column_35 Column_36 Column_37 Column_38 Column_39 Column_40 Column_41 Column_42 Column_43 Column_44 Column_45 Column_46 Column_47 Column_48 Column_49 Column_50 Column_51 Column_52 Column_53 Column_54 Column_55 Column_56 Column_57 Spam
sum 346.47 717.69 895.14 420.09 987.02 348.81 358.03 329.81 286.01 749.9 187.9 1761.05 341.26 194.12 163.01 813.03 428.91 635.57 5493.24 312.82 2628.2 418.56 335.29 281.29 1744.44 797.15 2110.56 442.27 341.18 360.57 196.14 147.81 338.25 149.91 370.14 328.65 429.03 45.06 240.01 218.37 144.31 449.42 145.63 196.12 1080.91 665.12 14.31 118.5 124.247 452.753 67.43 903.126 240.119 158.87 15532.539 180343.0 942985.0 1275.0
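
To quantify how much a bootstrapped sum moves around, the sketch below repeats the resampling many times on a made-up two-column dataframe (toy, with a low-variance and a high-variance column; both are assumptions for illustration) and compares the spread of the resulting sums.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)

# Two hypothetical columns with the same mean but very different variance
toy = pd.DataFrame({
    "low_var": rng.normal(loc=1.0, scale=0.1, size=500),
    "high_var": rng.normal(loc=1.0, scale=5.0, size=500),
})

# Column sums over 200 bootstrap resamples (same size as the original, with replacement)
boot_sums = pd.DataFrame([
    toy.sample(frac=1.0, replace=True, random_state=seed).sum(axis=0)
    for seed in range(200)
])

# Spread of the bootstrapped sums, per column
print(boot_sums.std(axis=0))
```

The sum of the high-variance column varies far more from resample to resample, which mirrors what we see for the columns with large values in the spam data.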

Now we actually fit a tree to each bootstrap sample.

In [85]:
#Creating model
np.random.seed(0)
model = DecisionTreeClassifier(max_depth=5) # we tried a variety of depths here

#Initializing variables
n_trees = 100 # we tried a variety of numbers here
predictions_train = np.zeros((data_train.shape[0], n_trees))
predictions_test = np.zeros((data_test.shape[0], n_trees))

#Conduct bootstrapping iterations
for i in range(n_trees):
    temp = data_train.sample(frac=1, replace=True)
    response_variable = temp['Spam']
    temp = temp.drop(['Spam'], axis=1)
    model.fit(temp, response_variable)  
    predictions_train[:,i] = model.predict(x_train)   
    predictions_test[:,i] = model.predict(x_test)
    
#Make Predictions Dataframe
columns = ["Bootstrap-Model_"+str(i+1) for i in range(n_trees)]
predictions_train = pd.DataFrame(predictions_train, columns=columns)
predictions_test = pd.DataFrame(predictions_test, columns=columns)
In [86]:
y_train = data_train['Spam'].values
y_test = data_test['Spam'].values
In [76]:
num_to_avg = 100 # we varied this line, from 1, 2, 3, 100, n_trees
fig, axs = plt.subplots(1, 2, figsize=(16, 7))
for (ax, label, predictions, y) in [
    (axs[0], 'train', predictions_train, y_train), 
    (axs[1], 'test', predictions_test, y_test)
]:
    mean_predictions = predictions.iloc[:,:num_to_avg].mean(axis=1)
    mean_predictions[y == 1].hist(density=True, histtype='step', range=[0,1], label='Spam', lw=2, ax=ax)
    mean_predictions[y == 0].hist(density=True, histtype='step', range=[0,1], label='Not-Spam', lw=2, ax=ax)
    ax.legend(loc='upper center');
    ax.set_xlabel("Mean of ensemble predictions")
    ax.set_title(label)

(Try 100 trees of depth 1. Why do the plots look similar to when we did 2 trees of higher depth?)

And now vote!

In [58]:
#Function to ensemble the prediction of each bagged decision tree model
def get_prediction(df, count=-1):
    count = df.shape[1] if count==-1 else count
    temp = df.iloc[:,0:count]
    return np.mean(temp, axis=1)>0.5

#Check Accuracy of Spam Detection in Train and Test Set
print("Accuracy, Training Set :", str(100*accuracy_score(y_train, get_prediction(predictions_train, count=-1)))+'%')
print("Accuracy, Testing Set :", str(100*accuracy_score(y_test, get_prediction(predictions_test, count=-1)))+'%')
Accuracy, Training Set : 94.54322501532802%
Accuracy, Testing Set : 92.3076923076923%

The count argument in the above code can be used to set the number of models (columns of the dataframe) that the vote is based on.

In [59]:
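#Get Performance by Class (Lookup Confusion Matrix)
#Note: model is the last bootstrapped tree fitted in the loop above, so this table describes that single tree.
#For the voted ensemble, pass get_prediction(predictions_test).astype(int) instead of model.predict(x_test).
pd.crosstab(np.array(y_test), model.predict(x_test), margins=True, rownames=['Actual'], colnames=['Predicted'])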
Out[59]:
Predicted 0 1 All
Actual
0 760 44 804
1 83 452 535
All 843 496 1339

Does the accuracy vary with the number of bagged models (columns) we consider?
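
One way to explore this question is to recompute the vote using only the first k trees, for every k. The sketch below does this with hand-rolled bagging on synthetic data from make_classification (an assumption, so that the example runs on its own); names such as votes_test and acc_by_count are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the spam data
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=20, n_informative=8, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

rng = np.random.RandomState(0)
n_trees = 50
votes_test = np.zeros((len(y_te), n_trees))

for i in range(n_trees):
    # Bootstrap: draw row indices with replacement
    idx = rng.randint(0, len(y_tr), size=len(y_tr))
    tree_i = DecisionTreeClassifier(max_depth=5, random_state=i)
    tree_i.fit(X_tr[idx], y_tr[idx])
    votes_test[:, i] = tree_i.predict(X_te)

# Column k holds the mean vote of the first k+1 trees
running_mean = np.cumsum(votes_test, axis=1) / np.arange(1, n_trees + 1)
acc_by_count = [
    accuracy_score(y_te, (running_mean[:, k] > 0.5).astype(int))
    for k in range(n_trees)
]

for k in (1, 5, 10, 25, 50):
    print("trees:", k, "test accuracy:", round(acc_by_count[k - 1], 3))
```

Plotting acc_by_count against the number of trees typically shows the accuracy fluctuating for small ensembles and then settling; how quickly it settles depends on the data and on the depth of the trees.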

Food for Thought : Are these bagging models independent of each other? Can they be trained in parallel?

Part 3 : Random Forest vs Bagging

Now we will fit another ensemble method, the Random Forest, which differs from both a single decision tree and plain bagging. Refer to the lecture slides for a full treatment of how they differ. Let's use n_estimators = predictor_count/2 and max_depth = best_depth.

In [15]:
#Fit a Random Forest Model

#Training
model = RandomForestClassifier(n_estimators=int(x_train.shape[1]/2), max_depth=best_depth)
model.fit(x_train, y_train)

#Predict
y_pred_train = model.predict(x_train)
y_pred_test = model.predict(x_test)

#Performance Evaluation
train_score = accuracy_score(y_train, y_pred_train)*100
test_score = accuracy_score(y_test, y_pred_test)*100

print("Accuracy, Training Set :",str(train_score)+'%')
print("Accuracy, Testing Set :",str(test_score)+'%')
Accuracy, Training Set : 94.78847332924586%
Accuracy, Testing Set : 93.42793129200896%
In [16]:
#Top Features
feature_importance = model.feature_importances_
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5

#Plot
plt.figure(figsize=(10,12))
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, x_train.columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
Out[16]:
Text(0.5,1,'Variable Importance')

Random Forest exposes the above values through its feature_importances_ attribute, which averages each predictor's impurity reduction over all trees and normalizes the result, so we get a measure of overall significance for free. Explore the attributes of the fitted Random Forest model object (for example estimators_) to inspect the best splitting nodes.

As we see above, the performance of both Bagging and Random Forest was similar, so what is the difference? Do both overfit the data just as much?

Hints :

  • What is the only extra parameter we declared when defining a Random Forest Model vs Bagging? Does it have an impact on overfitting?

  • When we ensembled trees using bagging, we used the class labels to get the majority vote. Can you think of an advantage to using prediction probabilities? Does Random Forest use labels or probabilities?
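
The second hint can be checked empirically. The sketch below, on synthetic data (an assumption) and with illustrative names such as soft_vote and hard_vote, compares averaging the trees' predicted probabilities with a majority vote over their labels, and compares both with what scikit-learn's RandomForestClassifier returns.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the spam data, with labels 0 and 1
X_demo, y_demo = make_classification(
    n_samples=800, n_features=20, n_informative=6, random_state=0
)

# max_features limits how many predictors each split may consider
forest = RandomForestClassifier(
    n_estimators=25, max_depth=3, max_features="sqrt", random_state=0
)
forest.fit(X_demo, y_demo)

# Soft vote: average the class probabilities of the individual trees
tree_probs = np.stack([t.predict_proba(X_demo) for t in forest.estimators_])
mean_probs = tree_probs.mean(axis=0)
soft_vote = forest.classes_[mean_probs.argmax(axis=1)]

# Hard vote: each tree casts one 0/1 label and the majority wins
tree_labels = np.stack([t.predict(X_demo) for t in forest.estimators_])
hard_vote = (tree_labels.mean(axis=0) > 0.5).astype(int)

forest_pred = forest.predict(X_demo)
print("forest agrees with soft vote:", np.mean(forest_pred == soft_vote))
print("forest agrees with hard vote:", np.mean(forest_pred == hard_vote))
```

Averaging probabilities lets a tree that is very sure of its answer count for more than a tree that is close to undecided, which is information a label-only vote throws away.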

Part 5 : Fitting an Adaboost Model

Now let's try Boosting!

In [17]:
#Fit an Adaboost Model

#Training
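#The base tree is passed positionally because its keyword name depends on the scikit-learn version (base_estimator before 1.2, estimator afterwards)
model = AdaBoostClassifier(DecisionTreeClassifier(max_depth=5), n_estimators=100, learning_rate=0.05)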
model.fit(x_train, y_train)

#Predict
y_pred_train = model.predict(x_train)
y_pred_test = model.predict(x_test)

#Performance Evaluation
train_score = accuracy_score(y_train, y_pred_train)*100
test_score = accuracy_score(y_test, y_pred_test)*100

print("Accuracy, Training Set :",str(train_score)+'%')
print("Accuracy, Testing Set :",str(test_score)+'%')
Accuracy, Training Set : 99.72409564684243%
Accuracy, Testing Set : 94.39880507841673%
In [18]:
#Plot Iteration based score
train_scores = list(model.staged_score(x_train,y_train))
test_scores = list(model.staged_score(x_test, y_test))

plt.plot(train_scores,label='train')
plt.plot(test_scores,label='test')
plt.xlabel('Iteration')
plt.ylabel('Accuracy')
plt.title("Variation of Accuracy with Iterations")
plt.legend();

Judging by test set accuracy, AdaBoost seems to perform better than both the single Decision Tree and the Random Forest. What is different about the approach it takes?
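
As a starting point for that question, here is a minimal from-scratch sketch of the textbook discrete AdaBoost procedure for two classes, using depth-1 trees (stumps) on synthetic data. It is an illustration under these assumptions and not necessarily the exact variant that AdaBoostClassifier runs; all names (weights, stumps, alphas, boosted_predict) are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with labels 0 and 1 (not the spam set)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=6, random_state=0
)

n_rounds = 30
weights = np.full(len(y_demo), 1.0 / len(y_demo))  # start with uniform weights
stumps, alphas = [], []

for _ in range(n_rounds):
    # Weak learner: a depth-1 tree fitted to the currently weighted data
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X_demo, y_demo, sample_weight=weights)
    miss = stump.predict(X_demo) != y_demo

    # Weighted error (weights sum to 1), clipped to keep the log finite
    err = np.clip(weights[miss].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)

    # Up-weight misclassified points, down-weight the rest, renormalize
    weights = weights * np.exp(np.where(miss, alpha, -alpha))
    weights = weights / weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

def boosted_predict(X):
    # Weighted vote over all stumps, with labels mapped to -1/+1
    score = sum(a * (2 * s.predict(X) - 1) for a, s in zip(alphas, stumps))
    return (score > 0).astype(int)

stump_acc = np.mean(stumps[0].predict(X_demo) == y_demo)
ensemble_acc = np.mean(boosted_predict(X_demo) == y_demo)
print("first stump:", round(stump_acc, 3), "boosted:", round(ensemble_acc, 3))
```

Unlike the bootstrapped trees in bagging, each stump here is fitted to data reweighted according to the mistakes of the stumps before it, so the rounds have to run one after another.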

In [19]:
#Find Optimal Depth of trees for Boosting
score_train, score_test, depth_start, depth_end = {}, {}, 2, 20
for i in tqdm(range(depth_start, depth_end)):
    #The base tree is passed positionally (keyword: base_estimator before scikit-learn 1.2, estimator afterwards)
    model = AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=i),
        n_estimators=100, learning_rate=0.05)
    model.fit(x_train, y_train)
    score_train[i] = accuracy_score(y_train, model.predict(x_train))
    score_test[i] = accuracy_score(y_test, model.predict(x_test))
100%|██████████| 18/18 [00:51<00:00,  2.85s/it]
In [20]:
#Plot
lists1 = sorted(score_train.items())
lists2 = sorted(score_test.items())
x1, y1 = zip(*lists1) 
x2, y2 = zip(*lists2) 
plt.ylabel("Accuracy")
plt.xlabel("Depth")
plt.title('Variation of Accuracy with Depth - Adaboost Classifier')
plt.plot(x1, y1, 'b-', label='Train')
plt.plot(x2, y2, 'g-', label='Test')
plt.legend()
plt.show()

The complexity of an AdaBoost model depends on both the number of estimators and the complexity of the base estimator. As the model complexity increases, we first observe an increase in accuracy, but as we go further to the right of the graph the model starts to overfit the training data. To formally understand what is going on here, we will give a brief treatment of how the bias and variance are related in the next section.

What would happen if we varied depth and number of estimators simultaneously? Is that the right way to assess the trade-off between performance and overfitting?
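
A hedged sketch of one way to vary both at once: for each depth, fit a single boosted model and read off the accuracy after every boosting round with staged_score. The data are synthetic and the scores are computed on a validation split carved out of the training data (both are assumptions of this example), so that the test set is not used for tuning.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=6, random_state=0
)
# Hold out part of the training data for validation
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

max_rounds = 40
val_grid = {}
for depth in (1, 2, 4):
    # Base tree passed positionally so the call works across scikit-learn versions
    booster = AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=depth),
        n_estimators=max_rounds, learning_rate=0.5, random_state=0
    )
    booster.fit(X_fit, y_fit)
    # staged_score yields the accuracy after each successive boosting round
    val_grid[depth] = np.array(list(booster.staged_score(X_val, y_val)))

for depth, scores in val_grid.items():
    best_rounds = int(scores.argmax()) + 1
    print("depth:", depth, "best rounds:", best_rounds,
          "validation accuracy:", round(scores.max(), 3))
```

Each row of this small grid costs only one fit, because a model boosted for 40 rounds already contains the models for every smaller number of rounds.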

Food for Thought : Are boosted models independent of one another? Do they need to wait for the previous model's residuals?

Part 6 : The Bias-Variance tradeoff

A central notion underlying what we've been learning in lectures and sections so far is the trade-off between overfitting and underfitting. If you remember back to Homework 3, we had a model that seemed to represent our data accurately. However, we saw that as we made it more and more accurate on the training set, it did not generalize well to unobserved data.

As a different example, in face recognition algorithms, such as the one on the iPhone X, a too-accurate model would be unable to identify someone who styled their hair differently that day. The reason is that the model may learn irrelevant features of the training data. Conversely, an insufficiently trained model would not generalize well either. For example, it was recently reported that a face mask could fool the iPhone X.

A widely used solution in statistics to reduce overfitting consists of adding structure to the model, with something like regularization. This method favors simpler models during training.

The bias-variance dilemma is closely related. The bias of a model quantifies how far its predictions, averaged across training sets, are from the truth. The variance quantifies how sensitive the model is to small changes in the training set. A robust model is not overly sensitive to small changes. The dilemma is that we would like to minimize both bias and variance, since we want an accurate and robust model, but reducing one tends to increase the other. Simpler models tend to be less accurate but more robust. Complex models tend to be more accurate but less robust.
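
These two quantities can be estimated by simulation. The sketch below is a toy illustration under stated assumptions: the true function (a sine curve), the noise level and the polynomial models are all made up for the example. It draws many training sets, fits a polynomial of a given degree to each, and then measures the squared bias and the variance of the fitted curves on a fixed grid.

```python
import numpy as np

rng = np.random.RandomState(0)

def true_f(x):
    # Hypothetical "truth" that the models try to recover
    return np.sin(2 * np.pi * x)

x_grid = np.linspace(0.05, 0.95, 50)       # points where predictions are compared
n_sets, n_points, noise_sd = 300, 30, 0.3  # training sets, points per set, noise level

def bias_variance(degree):
    # One row of predictions per simulated training set
    preds = np.empty((n_sets, x_grid.size))
    for s in range(n_sets):
        x = rng.uniform(0, 1, n_points)
        y = true_f(x) + rng.normal(0, noise_sd, n_points)
        coefs = np.polyfit(x, y, degree)
        preds[s] = np.polyval(coefs, x_grid)
    # Squared bias: gap between the average prediction and the truth
    bias_sq = np.mean((preds.mean(axis=0) - true_f(x_grid)) ** 2)
    # Variance: how much predictions move from one training set to the next
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

results = {d: bias_variance(d) for d in (1, 3, 9)}
for d, (b, v) in results.items():
    print("degree:", d, "bias^2:", round(b, 4), "variance:", round(v, 4))
```

The straight line (degree 1) barely changes between training sets but is systematically wrong, while the degree-9 polynomial tracks the sine curve on average but swings noticeably from one training set to the next.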