Key Word(s): bagging
Instructions:
- Read the dataset `agriland.csv`.
- Assign the predictor and response variables as `X` and `y`.
- Split the data into train and test sets with `test_size=0.2` and `random_state=44`.
- Fit a single `DecisionTreeClassifier()` and find the accuracy of your prediction.
- Complete the helper function `prediction_by_bagging()` to find the average predictions for a given number of bootstraps.
- Now perform bagging using the helper function, and compute the new accuracy.
- Plot the accuracy as the number of bootstraps increases.
- Finally, use the helper code to plot the decision boundaries for varying `max_depth` along with `num_bootstraps`. Investigate the effect of increasing the number of bootstraps on the variance.
Hints:
- `sklearn.tree.DecisionTreeClassifier()`: A decision tree classifier.
- `np.random.choice`: Generates a random sample from a given 1-D array (see the short sketch after this list).
- `plt.subplots()`: Creates a figure and a set of subplots.
- `ax.plot()`: Plots y versus x as lines and/or markers.
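To make the `np.random.choice` hint concrete, here is a minimal sketch (with a made-up toy array, not the exercise data) of drawing bootstrap indices with replacement, so that the same rows can later be picked from both the predictors and the response:

# Hypothetical toy example of bootstrap sampling via indices
import numpy as np
y_toy = np.array([10, 20, 30, 40, 50])
# Sample indices with replacement (the default), keeping the original size
idx = np.random.choice(np.arange(y_toy.shape[0]), size=y_toy.shape[0])
print(idx, y_toy[idx])  # e.g. [3 0 3 1 4] [40 10 40 20 50]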
Note: This exercise is auto-graded and you can try multiple attempts.
Bagging Classification
# Import required libraries
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
import scipy.optimize as opt
from sklearn.metrics import accuracy_score
# to be used for plotting later
from matplotlib.colors import ListedColormap
cmap_light = ListedColormap(['#FFF4E5','#D2E3EF'])
cmap_bold = ListedColormap(['#F7345E','#80C3BD'])
# Read the file 'agriland.csv' and take a quick look at your data
df = pd.read_csv('agriland.csv')
# Note that the latitude & longitude values are normalized
df.head()
# Set your predictor variables (latitude & longitude) as 'X' and your response variable as 'y'; make sure to use .values
X = ___
y = ___
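One possible completion, assuming the predictor columns are named `latitude` and `longitude` and the response column is `land_type` (the same names used in the visualization code further below):

# Possible completion (column names taken from the plotting code below)
X = df[['latitude', 'longitude']].values
y = df['land_type'].values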
# Split the data into train and test sets, with test_size=0.2 and random_state=44
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=44)
# Define the max_depth of your decision tree and set the random_state variable as 44
max_depth = ___
# Let's create and train our model
clf = DecisionTreeClassifier(max_depth=max_depth, random_state=44)
clf.fit(X_train, y_train)
# Predict on the test set and calculate the accuracy of a single decision tree
prediction = ___
single_acc = ___
print(f'Single tree Accuracy is {single_acc*100}%')
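A minimal sketch of one way to fill in the two blanks above, using the `accuracy_score` helper imported earlier:

# Possible completion: predict on the held-out set and score the predictions
prediction = clf.predict(X_test)
single_acc = accuracy_score(y_test, prediction)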
# Complete the function below to get predictions by bagging
# Inputs: X_train and y_train, the training data
# X_to_evaluate: samples that you are going to predict (evaluate)
# num_bootstraps: how many trees you want to train
# Output: an array of predicted classes for X_to_evaluate
def prediction_by_bagging(X_train, y_train, X_to_evaluate, num_bootstraps):
    # List to store each tree's array of predictions
    predictions = []
    # Generate num_bootstraps trees
    for i in range(num_bootstraps):
        # Bootstrap the indices rather than the rows, so the same subset is used for X_train and y_train
        resample_indexes = np.random.choice(np.arange(y_train.shape[0]), size=y_train.shape[0])
        # Get the bootstrapped set for 'X' and 'y' using the above indices
        X_boot = X_train[___]
        y_boot = y_train[___]
        # Create a decision tree with the same max_depth and random_state as above
        clf = ___
        # Fit the model on the bootstrapped training set
        clf.fit(___, ___)
        # Make predictions on the X_to_evaluate samples
        pred = clf.predict(___)
        predictions.append(pred)
    # We now have a list of prediction arrays, one per bootstrapped tree.
    # To get the majority vote for each sample, find the average prediction and threshold it at 0.5
    average_prediction = ___
    return average_prediction
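For reference, a hedged sketch of one possible completed helper. The name `prediction_by_bagging_solution` is made up here so it does not shadow the graded function, and the thresholding assumes the two classes are encoded as 0 and 1, as the 0.5-threshold comment above suggests:

# One possible completion (hypothetical name; assumes binary 0/1 class labels)
def prediction_by_bagging_solution(X_train, y_train, X_to_evaluate, num_bootstraps):
    predictions = []
    for i in range(num_bootstraps):
        # Draw indices with replacement so X and y stay aligned
        idx = np.random.choice(np.arange(y_train.shape[0]), size=y_train.shape[0])
        # Same hyperparameters as the single tree above
        tree = DecisionTreeClassifier(max_depth=max_depth, random_state=44)
        tree.fit(X_train[idx], y_train[idx])
        predictions.append(tree.predict(X_to_evaluate))
    # Majority vote: average the per-tree 0/1 predictions and threshold at 0.5
    return (np.mean(predictions, axis=0) > 0.5).astype(int)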
### edTest(test_bag_acc) ###
# Now we print the accuracy of bagging with decision trees
# Define the number of bootstraps to be used
num_bootstraps = 200
y_pred = prediction_by_bagging(X_train,y_train,X_test,num_bootstraps=num_bootstraps)
# Compare the average predictions to the true test set values and compute the accuracy
bagging_accuracy = ___
print(f'Accuracy with Bootstrapped Aggregation is {bagging_accuracy*100}%')
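A minimal sketch for the accuracy blank above; `np.mean` over the element-wise comparison (or, equivalently, `accuracy_score`) works:

# Possible completion: fraction of test samples the bagged model gets right
bagging_accuracy = np.mean(y_pred == y_test)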
# To visualize, let's plot accuracy as a function of the number of trees used in bagging
# Run the helper code below; if your function above is well defined, you should see a plot of accuracy vs. the number of bagged trees
n = np.linspace(1,250,250).astype(int)
acc = []
for n_i in n:
    acc.append(np.mean(prediction_by_bagging(X_train, y_train, X_test, n_i) == y_test))
plt.figure(figsize=(10,8))
plt.plot(n,acc,alpha=0.7,linewidth=3,color='#50AEA4', label='Model Prediction')
plt.title('Accuracy vs. Number of trees in Bagging ',fontsize=24)
plt.xlabel('Number of trees',fontsize=16)
plt.ylabel('Accuracy',fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.legend(loc='best',fontsize=12)
plt.show()
Bagging Visualization
Bagging does well to reduce overfitting, but only up to a certain extent.
Vary the `max_depth` and `numboot` variables in the visualization below to see how bagging helps reduce overfitting.
# We will make plots for three different values of `max_depth`
fig, axes = plt.subplots(1, 3, figsize=(20, 6))
# Make a list of three max_depths to investigate
max_depth = [2, 5, 100]
# Fix the number of bootstraps
numboot = 100
for index, ax in enumerate(axes):
    for i in range(numboot):
        # Draw a bootstrap sample of the full dataset
        df_new = df.sample(frac=1, replace=True)
        y = df_new.land_type.values
        X = df_new[['latitude', 'longitude']].values
        dtree = DecisionTreeClassifier(max_depth=max_depth[index])
        dtree.fit(X, y)
        ax.scatter(X[:, 0], X[:, 1], c=y-1, s=50, alpha=0.5, edgecolor="k", cmap=cmap_bold)
        plot_step_x1 = 0.1
        plot_step_x2 = 0.1
        x1min, x1max = X[:, 0].min(), X[:, 0].max()
        x2min, x2max = X[:, 1].min(), X[:, 1].max()
        x1, x2 = np.meshgrid(np.arange(x1min, x1max, plot_step_x1), np.arange(x2min, x2max, plot_step_x2))
        # Re-cast every coordinate in the meshgrid as a 2D point
        Xplot = np.c_[x1.ravel(), x2.ravel()]
        # Predict the class for every grid point and overlay each tree's (translucent) decision boundary
        y_grid = dtree.predict(Xplot).reshape(x1.shape)
        cs = ax.contourf(x1, x2, y_grid, alpha=0.02)
    ax.set_xlabel('Latitude', fontsize=14)
    ax.set_ylabel('Longitude', fontsize=14)
    ax.set_title(f'Max depth = {max_depth[index]}', fontsize=20)
Mindchow 🍲
Play around with the following parameters:
- max_depth
- numboot
Based on your observations, answer the questions below:
- How does the plot change with varying `max_depth`?
- How does the plot change with varying `numboot`?
- How are the three plots essentially different?
- Do more bootstraps reduce overfitting for:
  - High depth?
  - Low depth?