Key Word(s): Decision Trees, Regression Trees, Stopping Conditions, Pruning, Bagging, Overfitting
Title:
Bagging Classification with Decision Boundary
Description:
The goal of this exercise is to use Bagging (Bootstrap Aggregation) to solve a classification problem and to visualize the influence of Bagging on trees of varying depths.
Your final plot will resemble the one below.
Instructions:
- Read the dataset agriland.csv.
- Assign the predictor and response variables as X and y.
- Split the data into train and test sets with test_size=0.2 and random_state=44.
- Fit a single DecisionTreeClassifier() and find the accuracy of your prediction.
- Complete the helper function prediction_by_bagging() to find the average predictions for a given number of bootstraps.
- Perform Bagging using the helper function, and compute the new accuracy.
- Plot the accuracy as a function of the number of bootstraps.
- Use the helper code to plot the decision boundaries for varying max_depth along with num_bootstraps. Investigate the effect of increasing the number of bootstraps on the variance.
Hints:
sklearn.tree.DecisionTreeClassifier() A decision tree classifier.
DecisionTreeClassifier.fit() Build a decision tree classifier from the training set (X, y).
DecisionTreeClassifier.predict() Predict class or regression value for X.
train_test_split() Split arrays or matrices into random train and test subsets.
np.random.choice Generates a random sample from a given 1-D array.
plt.subplots() Create a figure and a set of subplots.
ax.plot() Plot y versus x as lines and/or markers.
Note: This exercise is auto-graded and you can try multiple attempts.
# Import necessary libraries
%matplotlib inline
import numpy as np
import pandas as pd
from sklearn import metrics
import scipy.optimize as opt
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
# Used for plotting later
from matplotlib.colors import ListedColormap
cmap_bold = ListedColormap(['#F7345E','#80C3BD'])
cmap_light = ListedColormap(['#FFF4E5','#D2E3EF'])
# Read the file 'agriland.csv' as a Pandas dataframe
df = pd.read_csv('agriland.csv')
# Take a quick look at the data
# Note that the latitude & longitude values are normalized
df.head()
# Set the values of latitude & longitude predictor variables
X = ___.values
# Use the column "land_type" as the response variable
y = ___.values
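For reference, one possible way to fill in the blanks above is sketched below; it assumes the predictors are the latitude and longitude columns and the response is land_type, matching the helper code later in this notebook.
# Possible completion (a sketch, not the only valid answer)
X = df[['latitude', 'longitude']].values
y = df['land_type'].values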
# Split the data into train and test sets, with test size = 0.2
# and set random state as 44
X_train, X_test, y_train, y_test = ___
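A sketch of the split, using the test size and random state given in the instructions:
# Possible completion of the split above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=44)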
# Define the max_depth of the decision tree
max_depth = ___
# Define a decision tree classifier with a max depth as defined above
# and set the random_state as 44
clf = ___
# Fit the model on the training data
___
# Use the trained model to predict on the test set
prediction = ___
# Calculate the accuracy of the test predictions of a single tree
single_acc = ___
# Print the accuracy of the tree
print(f'Single tree Accuracy is {single_acc*100}%')
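If you get stuck, the single-tree cell above could be completed along these lines; the value of max_depth is your choice, and max_depth=5 below is only an illustrative pick.
# Possible completion (max_depth=5 is just an example choice)
max_depth = 5
clf = DecisionTreeClassifier(max_depth=max_depth, random_state=44)
clf.fit(X_train, y_train)
prediction = clf.predict(X_test)
single_acc = accuracy_score(y_test, prediction)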
# Complete the function below to get the prediction by bagging
# Inputs: X_train, y_train to train your data
# X_to_evaluate: Samples that you are going to predict (evaluate)
# num_bootstraps: how many trees you want to train
# Output: An array of predicted classes for X_to_evaluate
def prediction_by_bagging(X_train, y_train, X_to_evaluate, num_bootstraps):
    # List to store every array of predictions
    predictions = []
    # Generate num_bootstraps number of trees
    for i in range(num_bootstraps):
        # Sample data to perform each bootstrap; here we actually bootstrap indices,
        # because we want the same subset for X_train and y_train
        resample_indexes = np.random.choice(np.arange(y_train.shape[0]), size=y_train.shape[0])
        # Get a bootstrapped version of the data using the above indices
        X_boot = X_train[___]
        y_boot = y_train[___]
        # Initialize a Decision Tree on bootstrapped data
        # Use the same max_depth and random_state as above
        clf = ___
        # Fit the model on the bootstrapped training set
        clf.fit(___,___)
        # Use the trained model to predict on X_to_evaluate samples
        pred = clf.predict(___)
        # Append the predictions to the predictions list
        predictions.append(pred)
    # The list "predictions" has [prediction_array_0, prediction_array_1, ..., prediction_array_n]
    # To get the majority vote for each sample, we can find the average
    # prediction and threshold it at 0.5
    average_prediction = ___
    # Return the average prediction
    return average_prediction
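For reference, a sketch of lines that could fill the blanks inside prediction_by_bagging(); it assumes land_type is binary and coded as 0/1, so thresholding the mean prediction at 0.5 is equivalent to a majority vote.
# Possible completions for the blanks inside the loop
X_boot = X_train[resample_indexes]
y_boot = y_train[resample_indexes]
clf = DecisionTreeClassifier(max_depth=max_depth, random_state=44)
clf.fit(X_boot, y_boot)
pred = clf.predict(X_to_evaluate)
# ...and for the final aggregation step (assumes 0/1 class labels)
average_prediction = (np.mean(predictions, axis=0) > 0.5).astype(int)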
### edTest(test_bag_acc) ###
# Define the number of bootstraps
num_bootstraps = 200
# Calling the prediction_by_bagging function with appropriate parameters
y_pred = prediction_by_bagging(X_train,y_train,X_test,num_bootstraps=num_bootstraps)
# Compare the average predictions to the true test set values
# and compute the accuracy
bagging_accuracy = ___
# Print the bagging accuracy
print(f'Accuracy with Bootstrapped Aggregation is {bagging_accuracy*100}%')
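The accuracy comparison above can be done with accuracy_score, or equivalently by averaging the element-wise matches:
# Possible completion
bagging_accuracy = accuracy_score(y_test, y_pred)
# equivalently: bagging_accuracy = np.mean(y_pred == y_test)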
# Helper code to plot accuracy vs number of bagged trees
n = np.linspace(1,250,250).astype(int)
acc = []
for n_i in n:
    acc.append(np.mean(prediction_by_bagging(X_train, y_train, X_test, n_i)==y_test))
plt.figure(figsize=(10,8))
plt.plot(n,acc,alpha=0.7,linewidth=3,color='#50AEA4', label='Model Prediction')
plt.title('Accuracy vs. Number of trees in Bagging ',fontsize=24)
plt.xlabel('Number of trees',fontsize=16)
plt.ylabel('Accuracy',fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.legend(loc='best',fontsize=12)
plt.show();
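As an aside (not part of the graded exercise), scikit-learn provides a ready-made sklearn.ensemble.BaggingClassifier that implements the same idea; a minimal sketch, reusing the train/test split and max_depth from above:
# Optional comparison with scikit-learn's built-in bagging ensemble
from sklearn.ensemble import BaggingClassifier

bag_clf = BaggingClassifier(DecisionTreeClassifier(max_depth=max_depth, random_state=44),
                            n_estimators=100, random_state=44)
bag_clf.fit(X_train, y_train)
print(f'sklearn BaggingClassifier accuracy is {bag_clf.score(X_test, y_test)*100}%')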
Bagging Visualization
Bagging helps reduce overfitting, but only up to a certain extent. Vary the max_depth and numboot variables to see how Bagging reduces overfitting in the visualization below.
# Making plots for three different values of `max_depth`
fig,axes = plt.subplots(1,3,figsize=(20,6))
# Make a list of three max_depths to investigate
max_depth = [2,5,100]
# Fix the number of bootstraps
numboot = 100
for index, ax in enumerate(axes):
    for i in range(numboot):
        df_new = df.sample(frac=1, replace=True)
        y = df_new.land_type.values
        X = df_new[['latitude', 'longitude']].values
        dtree = DecisionTreeClassifier(max_depth=max_depth[index])
        dtree.fit(X, y)
        ax.scatter(X[:, 0], X[:, 1], c=y-1, s=50, alpha=0.5, edgecolor="k", cmap=cmap_bold)
        plot_step_x1 = 0.1
        plot_step_x2 = 0.1
        x1min, x1max = X[:, 0].min(), X[:, 0].max()
        x2min, x2max = X[:, 1].min(), X[:, 1].max()
        x1, x2 = np.meshgrid(np.arange(x1min, x1max, plot_step_x1), np.arange(x2min, x2max, plot_step_x2))
        # Re-cast every coordinate in the meshgrid as a 2D point
        Xplot = np.c_[x1.ravel(), x2.ravel()]
        # Predict the class
        y = dtree.predict(Xplot)
        y = y.reshape(x1.shape)
        cs = ax.contourf(x1, x2, y, alpha=0.02)
    ax.set_xlabel('Latitude', fontsize=14)
    ax.set_ylabel('Longitude', fontsize=14)
    ax.set_title(f'Max depth = {max_depth[index]}', fontsize=20)
Mindchow 🍲
Play around with the following parameters:
- max_depth
- numboot
Based on your observations, answer the questions below:
- How does the plot change with varying max_depth?
- How does the plot change with varying numboot?
- How are the three plots essentially different?
- Do more bootstraps reduce overfitting for:
  - High depth
  - Low depth