{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Title :\n",
"Bagging Classification with Decision Boundary\n",
"\n",
"## Description :\n",
"The goal of this exercise is to use **Bagging** (Bootstrap Aggregated) to solve a classification problem and visualize the influence on Bagging on trees with varying depths.\n",
"\n",
"Your final plot will resemble the one below.\n",
"\n",
"
\n",
"\n",
"## Instructions:\n",
"\n",
"- Read the dataset `agriland.csv`.\n",
"- Assign the predictor and response variables as `X` and `y`.\n",
"- Split the data into train and test sets with `test_split=0.2` and `random_state=44`.\n",
"- Fit a single `DecisionTreeClassifier()` and find the accuracy of your prediction.\n",
"- Complete the helper function `prediction_by_bagging()` to find the average predictions for a given number of bootstraps.\n",
"- Perform `Bagging` using the helper function, and compute the new accuracy.\n",
"- Plot the accuracy as a function of the number of bootstraps.\n",
"- Use the helper code to plot the decision boundaries for varying max_depth along with `num_bootstraps`. Investigate the effect of increasing bootstraps on the variance.\n",
"\n",
"## Hints: \n",
"\n",
"sklearn.tree.DecisionTreeClassifier()\n",
"A decision tree classifier.\n",
"\n",
"DecisionTreeClassifier.fit()\n",
"Build a decision tree classifier from the training set (X, y).\n",
"\n",
"DecisionTreeClassifier.predict()\n",
"Predict class or regression value for X.\n",
"\n",
"train_test_split()\n",
"Split arrays or matrices into random train and test subsets.\n",
"\n",
"np.random.choice\n",
"Generates a random sample from a given 1-D array.\n",
"\n",
"plt.subplots()\n",
"Create a figure and a set of subplots.\n",
"\n",
"ax.plot()\n",
"Plot y versus x as lines and/or markers\n",
"\n",
"**Note: This exercise is auto-graded and you can try multiple attempts.**"
]
},
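{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before you start, here is a minimal, self-contained sketch of the bagging idea on toy data. Everything in it (the toy arrays, the number of trees, the tree depth) is illustrative only and is separate from the graded exercise below.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch, not part of the graded exercise:\n",
"# train several trees on bootstrap resamples and majority-vote their predictions\n",
"import numpy as np\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"\n",
"rng = np.random.default_rng(0)\n",
"X_demo = rng.normal(size=(100, 2))                       # toy predictors\n",
"y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)   # toy binary labels\n",
"\n",
"demo_preds = []\n",
"for _ in range(25):\n",
"    # Bootstrap: sample row indices with replacement\n",
"    idx = rng.choice(len(y_demo), size=len(y_demo), replace=True)\n",
"    tree = DecisionTreeClassifier(max_depth=3).fit(X_demo[idx], y_demo[idx])\n",
"    demo_preds.append(tree.predict(X_demo))\n",
"\n",
"# Majority vote: average the 0/1 predictions and threshold at 0.5\n",
"bagged_demo = (np.mean(demo_preds, axis=0) > 0.5).astype(int)\n",
"print('Toy bagged accuracy:', np.mean(bagged_demo == y_demo))\n"
]
},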
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Import necessary libraries\n",
"%matplotlib inline\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn import metrics\n",
"import scipy.optimize as opt\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.metrics import accuracy_score\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Used for plotting later\n",
"from matplotlib.colors import ListedColormap\n",
"cmap_bold = ListedColormap(['#F7345E','#80C3BD'])\n",
"cmap_light = ListedColormap(['#FFF4E5','#D2E3EF'])\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Read the file 'agriland.csv' as a Pandas dataframe\n",
"df = pd.read_csv('agriland.csv')\n",
"\n",
"# Take a quick look at the data\n",
"# Note that the latitude & longitude values are normalized\n",
"df.head()\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Set the values of latitude & longitude predictor variables\n",
"X = ___.values\n",
"\n",
"# Use the column \"land_type\" as the response variable\n",
"y = ___.values\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Split data in train an test, with test size = 0.2 \n",
"# and set random state as 44\n",
"X_train, X_test, y_train, y_test = ___\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Define the max_depth of the decision tree\n",
"max_depth = ___\n",
"\n",
"# Define a decision tree classifier with a max depth as defined above\n",
"# and set the random_state as 44\n",
"clf = ___\n",
"\n",
"# Fit the model on the training data\n",
"___\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Use the trained model to predict on the test set\n",
"prediction = ___\n",
"\n",
"# Calculate the accuracy of the test predictions of a single tree\n",
"single_acc = ___\n",
"\n",
"# Print the accuracy of the tree\n",
"print(f'Single tree Accuracy is {single_acc*100}%')\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Complete the function below to get the prediction by bagging\n",
"\n",
"# Inputs: X_train, y_train to train your data\n",
"# X_to_evaluate: Samples that you are goin to predict (evaluate)\n",
"# num_bootstraps: how many trees you want to train\n",
"# Output: An array of predicted classes for X_to_evaluate\n",
"\n",
"def prediction_by_bagging(X_train, y_train, X_to_evaluate, num_bootstraps):\n",
" \n",
" # List to store every array of predictions\n",
" predictions = []\n",
" \n",
" # Generate num_bootstraps number of trees\n",
" for i in range(num_bootstraps):\n",
" \n",
" # Sample data to perform first bootstrap, here, we actually bootstrap indices, \n",
" # because we want the same subset for X_train and y_train\n",
" resample_indexes = np.random.choice(np.arange(y_train.shape[0]), size=y_train.shape[0])\n",
" \n",
" # Get a bootstrapped version of the data using the above indices\n",
" X_boot = X_train[___]\n",
" y_boot = y_train[___]\n",
" \n",
" # Initialize a Decision Tree on bootstrapped data \n",
" # Use the same max_depth and random_state as above\n",
" clf = ___\n",
" \n",
" # Fit the model on bootstrapped training set\n",
" clf.fit(___,___)\n",
" \n",
" # Use the trained model to predict on X_to_evaluate samples\n",
" pred = clf.predict(___)\n",
" \n",
" # Append the predictions to the predictions list\n",
" predictions.append(pred)\n",
"\n",
" # The list \"predictions\" has [prediction_array_0, prediction_array_1, ..., prediction_array_n]\n",
" # To get the majority vote for each sample, we can find the average \n",
" # prediction and threshold them by 0.5\n",
" average_prediction = ___\n",
" \n",
" # Return the average prediction\n",
" return average_prediction\n"
]
},
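{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick illustration of the average-then-threshold step used in `prediction_by_bagging` (the arrays below are made up): with 0/1 labels, the mean across bootstraps is the fraction of trees voting 1, so thresholding at 0.5 recovers the majority vote.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative only: three hypothetical prediction arrays from three trees\n",
"demo_predictions = np.array([[1, 0, 1],\n",
"                             [1, 1, 0],\n",
"                             [0, 1, 1]])\n",
"\n",
"# Column-wise mean = fraction of trees voting 1 for each sample\n",
"vote_fraction = demo_predictions.mean(axis=0)\n",
"\n",
"# Threshold at 0.5 to get the majority-vote class per sample\n",
"majority_vote = (vote_fraction > 0.5).astype(int)\n",
"print(vote_fraction, majority_vote)\n"
]
},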
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_bag_acc) ### \n",
"\n",
"\n",
"# Define the number of bootstraps\n",
"num_bootstraps = 200\n",
"\n",
"# Calling the prediction_by_bagging function with appropriate parameters\n",
"y_pred = prediction_by_bagging(X_train,y_train,X_test,num_bootstraps=num_bootstraps)\n",
"\n",
"# Compare the average predictions to the true test set values \n",
"# and compute the accuracy \n",
"bagging_accuracy = ___\n",
"\n",
"# Print the bagging accuracy\n",
"print(f'Accuracy with Bootstrapped Aggregation is {bagging_accuracy*100}%')\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Helper code to plot accuracy vs number of bagged trees\n",
"\n",
"n = np.linspace(1,250,250).astype(int)\n",
"acc = []\n",
"for n_i in n:\n",
" acc.append(np.mean(prediction_by_bagging(X_train, y_train, X_test, n_i)==y_test))\n",
"plt.figure(figsize=(10,8))\n",
"plt.plot(n,acc,alpha=0.7,linewidth=3,color='#50AEA4', label='Model Prediction')\n",
"plt.title('Accuracy vs. Number of trees in Bagging ',fontsize=24)\n",
"plt.xlabel('Number of trees',fontsize=16)\n",
"plt.ylabel('Accuracy',fontsize=16)\n",
"plt.xticks(fontsize=12)\n",
"plt.yticks(fontsize=12)\n",
"plt.legend(loc='best',fontsize=12)\n",
"plt.show();\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Bagging Visualization\n",
"\n",
"Bagging does well to reduce overfitting, but only upto a certain extent.\n",
"\n",
"Vary the `max_depth` and `numboot` variables to see how Bagging helps reduce overfitting with the help of the visualization below"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Making plots for three different values of `max_depth`\n",
"fig,axes = plt.subplots(1,3,figsize=(20,6))\n",
"\n",
"# Make a list of three max_depths to investigate\n",
"max_depth = [2,5,100]\n",
"\n",
"# Fix the number of bootstraps\n",
"numboot = 100\n",
"\n",
"for index,ax in enumerate(axes):\n",
"\n",
" for i in range(numboot):\n",
" df_new = df.sample(frac=1,replace=True)\n",
" y = df_new.land_type.values\n",
" X = df_new[['latitude', 'longitude']].values\n",
" dtree = DecisionTreeClassifier(max_depth=max_depth[index])\n",
" dtree.fit(X, y)\n",
" ax.scatter(X[:, 0], X[:, 1], c=y-1, s=50,alpha=0.5,edgecolor=\"k\",cmap=cmap_bold) \n",
" plot_step_x1= 0.1\n",
" plot_step_x2= 0.1\n",
" x1min, x1max= X[:,0].min(), X[:,0].max()\n",
" x2min, x2max= X[:,1].min(), X[:,1].max()\n",
" x1, x2 = np.meshgrid(np.arange(x1min, x1max, plot_step_x1), np.arange(x2min, x2max, plot_step_x2) )\n",
" # Re-cast every coordinate in the meshgrid as a 2D point\n",
" Xplot= np.c_[x1.ravel(), x2.ravel()]\n",
"\n",
" # Predict the class\n",
" y = dtree.predict( Xplot )\n",
" y= y.reshape( x1.shape )\n",
" cs = ax.contourf(x1, x2, y, alpha=0.02)\n",
" \n",
" ax.set_xlabel('Latitude',fontsize=14)\n",
" ax.set_ylabel('Longitude',fontsize=14)\n",
" ax.set_title(f'Max depth = {max_depth[index]}',fontsize=20)\n",
"\n"
]
},
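{
"cell_type": "markdown",
"metadata": {},
"source": [
"The sketch below is a quick numerical check of the variance-reduction claim above. It is illustrative only: it assumes the `X_train`, `y_train`, `X_test`, and `y_test` cells above have been completed, assumes 0/1 labels (matching the 0.5 threshold used earlier), and the choice of 50 trees at depth 100 is arbitrary. Individual deep trees disagree on many test points, while their bagged average stays stable.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: per-sample spread of individual deep trees vs. the bagged vote\n",
"deep_preds = []\n",
"for i in range(50):\n",
"    # Bootstrap resample, then fit one deliberately overfit (very deep) tree\n",
"    idx = np.random.choice(len(y_train), size=len(y_train), replace=True)\n",
"    t = DecisionTreeClassifier(max_depth=100, random_state=i).fit(X_train[idx], y_train[idx])\n",
"    deep_preds.append(t.predict(X_test))\n",
"deep_preds = np.array(deep_preds)\n",
"\n",
"# High per-sample variance across trees means the individual trees disagree\n",
"print('Mean per-sample variance across trees:', deep_preds.var(axis=0).mean())\n",
"\n",
"# The averaged (bagged) prediction smooths out that disagreement\n",
"# (assumes 0/1 labels, matching the 0.5 threshold used earlier)\n",
"print('Bagged accuracy:', np.mean((deep_preds.mean(axis=0) > 0.5).astype(int) == y_test))\n"
]
},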
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Mindchow 🍲\n",
"Play around with the following parameters:\n",
"\n",
"- max_depth\n",
"- numboot\n",
"\n",
"Based on your observations, answer the questions below:\n",
"\n",
"- How does the plot change with varying `max_depth`\n",
"\n",
"- How does the plot change with varying `numboot`\n",
"\n",
"- How are the three plots essentially different?\n",
"\n",
"- Does more bootstraps reduce overfitting for\n",
" - High depth\n",
" - Low depth"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}