{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Title :\n", "Bagging Classification with Decision Boundary\n", "\n", "## Description :\n", "The goal of this exercise is to use **Bagging** (Bootstrap Aggregated) to solve a classification problem and visualize the influence on Bagging on trees with varying depths.\n", "\n", "Your final plot will resemble the one below.\n", "\n", "\n", "\n", "## Instructions:\n", "\n", "- Read the dataset `agriland.csv`.\n", "- Assign the predictor and response variables as `X` and `y`.\n", "- Split the data into train and test sets with `test_split=0.2` and `random_state=44`.\n", "- Fit a single `DecisionTreeClassifier()` and find the accuracy of your prediction.\n", "- Complete the helper function `prediction_by_bagging()` to find the average predictions for a given number of bootstraps.\n", "- Perform `Bagging` using the helper function, and compute the new accuracy.\n", "- Plot the accuracy as a function of the number of bootstraps.\n", "- Use the helper code to plot the decision boundaries for varying max_depth along with `num_bootstraps`. Investigate the effect of increasing bootstraps on the variance.\n", "\n", "## Hints: \n", "\n", "sklearn.tree.DecisionTreeClassifier()\n", "A decision tree classifier.\n", "\n", "DecisionTreeClassifier.fit()\n", "Build a decision tree classifier from the training set (X, y).\n", "\n", "DecisionTreeClassifier.predict()\n", "Predict class or regression value for X.\n", "\n", "train_test_split()\n", "Split arrays or matrices into random train and test subsets.\n", "\n", "np.random.choice\n", "Generates a random sample from a given 1-D array.\n", "\n", "plt.subplots()\n", "Create a figure and a set of subplots.\n", "\n", "ax.plot()\n", "Plot y versus x as lines and/or markers\n", "\n", "**Note: This exercise is auto-graded and you can try multiple attempts.**" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Import necessary libraries\n", "%matplotlib inline\n", "import numpy as np\n", "import pandas as pd\n", "from sklearn import metrics\n", "import scipy.optimize as opt\n", "import matplotlib.pyplot as plt\n", "from sklearn.metrics import accuracy_score\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.model_selection import train_test_split\n", "\n", "# Used for plotting later\n", "from matplotlib.colors import ListedColormap\n", "cmap_bold = ListedColormap(['#F7345E','#80C3BD'])\n", "cmap_light = ListedColormap(['#FFF4E5','#D2E3EF'])\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Read the file 'agriland.csv' as a Pandas dataframe\n", "df = pd.read_csv('agriland.csv')\n", "\n", "# Take a quick look at the data\n", "# Note that the latitude & longitude values are normalized\n", "df.head()\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Set the values of latitude & longitude predictor variables\n", "X = ___.values\n", "\n", "# Use the column \"land_type\" as the response variable\n", "y = ___.values\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Split data in train an test, with test size = 0.2 \n", "# and set random state as 44\n", "X_train, X_test, y_train, y_test = ___\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Define the max_depth of the decision tree\n", "max_depth = ___\n", "\n", "# Define a decision tree classifier 
{ "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Use the trained model to predict on the test set\n", "prediction = ___\n", "\n", "# Calculate the accuracy of the test predictions of a single tree\n", "single_acc = ___\n", "\n", "# Print the accuracy of the tree\n", "print(f'Single tree accuracy is {single_acc*100}%')\n" ] },
{ "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Complete the function below to get the prediction by bagging\n", "\n", "# Inputs: X_train, y_train: the data to train on\n", "# X_to_evaluate: samples you are going to predict on (evaluate)\n", "# num_bootstraps: how many trees you want to train\n", "# Output: an array of predicted classes for X_to_evaluate\n", "\n", "def prediction_by_bagging(X_train, y_train, X_to_evaluate, num_bootstraps):\n", "\n", "    # List to store every array of predictions\n", "    predictions = []\n", "\n", "    # Generate num_bootstraps number of trees\n", "    for i in range(num_bootstraps):\n", "\n", "        # Sample with replacement. We bootstrap indices rather than rows\n", "        # so that the same subset is taken from both X_train and y_train\n", "        resample_indexes = np.random.choice(np.arange(y_train.shape[0]), size=y_train.shape[0])\n", "\n", "        # Get a bootstrapped version of the data using the above indices\n", "        X_boot = X_train[___]\n", "        y_boot = y_train[___]\n", "\n", "        # Initialize a decision tree on the bootstrapped data\n", "        # Use the same max_depth and random_state as above\n", "        clf = ___\n", "\n", "        # Fit the model on the bootstrapped training set\n", "        clf.fit(___,___)\n", "\n", "        # Use the trained model to predict on the X_to_evaluate samples\n", "        pred = clf.predict(___)\n", "\n", "        # Append the predictions to the predictions list\n", "        predictions.append(pred)\n", "\n", "    # The list \"predictions\" holds [prediction_array_0, prediction_array_1, ..., prediction_array_n]\n", "    # To get the majority vote for each sample, we can take the average\n", "    # prediction across trees and threshold it at 0.5\n", "    average_prediction = ___\n", "\n", "    # Return the average prediction\n", "    return average_prediction\n" ] },
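{ "cell_type": "markdown", "metadata": {}, "source": [ "To see why averaging 0/1 predictions and thresholding at 0.5 implements a majority vote, consider the toy sketch below. The `tree_preds` array is made up for illustration: each row plays the role of one bootstrapped tree's predictions on four samples." ] },
{ "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Toy illustration of majority voting by averaging 0/1 predictions and\n", "# thresholding at 0.5. The three rows below are made-up prediction arrays,\n", "# standing in for three bootstrapped trees.\n", "tree_preds = np.array([[1, 0, 1, 1],   # predictions from tree 1\n", "                       [1, 1, 0, 1],   # predictions from tree 2\n", "                       [0, 0, 1, 1]])  # predictions from tree 3\n", "\n", "# Average across trees (axis=0); a sample is classified as 1 only when\n", "# more than half of the trees voted 1\n", "avg_vote = tree_preds.mean(axis=0)\n", "majority = (avg_vote > 0.5).astype(int)\n", "print(avg_vote)   # approximately [0.67 0.33 0.67 1.0]\n", "print(majority)   # [1 0 1 1]\n" ] },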
{ "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "### edTest(test_bag_acc) ###\n", "\n", "# Define the number of bootstraps\n", "num_bootstraps = 200\n", "\n", "# Call the prediction_by_bagging function with appropriate parameters\n", "y_pred = prediction_by_bagging(X_train, y_train, X_test, num_bootstraps=num_bootstraps)\n", "\n", "# Compare the average predictions to the true test set values\n", "# and compute the accuracy\n", "bagging_accuracy = ___\n", "\n", "# Print the bagging accuracy\n", "print(f'Accuracy with Bootstrap Aggregation is {bagging_accuracy*100}%')\n" ] },
{ "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Helper code to plot accuracy vs. number of bagged trees\n", "n = np.arange(1, 251)\n", "acc = []\n", "for n_i in n:\n", "    acc.append(np.mean(prediction_by_bagging(X_train, y_train, X_test, n_i) == y_test))\n", "\n", "plt.figure(figsize=(10,8))\n", "plt.plot(n, acc, alpha=0.7, linewidth=3, color='#50AEA4', label='Model Prediction')\n", "plt.title('Accuracy vs. Number of Trees in Bagging', fontsize=24)\n", "plt.xlabel('Number of trees', fontsize=16)\n", "plt.ylabel('Accuracy', fontsize=16)\n", "plt.xticks(fontsize=12)\n", "plt.yticks(fontsize=12)\n", "plt.legend(loc='best', fontsize=12)\n", "plt.show();\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Bagging Visualization\n", "\n", "Bagging helps reduce overfitting, but only up to a certain extent.\n", "\n", "Vary the `max_depth` and `numboot` variables in the visualization below to see how Bagging helps reduce overfitting." ] },
{ "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Make plots for three different values of max_depth\n", "fig, axes = plt.subplots(1, 3, figsize=(20,6))\n", "\n", "# Make a list of three max_depths to investigate\n", "max_depth = [2, 5, 100]\n", "\n", "# Fix the number of bootstraps\n", "numboot = 100\n", "\n", "for index, ax in enumerate(axes):\n", "\n", "    for i in range(numboot):\n", "        # Draw a bootstrap sample of the full dataset\n", "        df_new = df.sample(frac=1, replace=True)\n", "        y = df_new.land_type.values\n", "        X = df_new[['latitude', 'longitude']].values\n", "\n", "        # Fit a tree of the given depth on the bootstrap sample\n", "        dtree = DecisionTreeClassifier(max_depth=max_depth[index])\n", "        dtree.fit(X, y)\n", "        ax.scatter(X[:, 0], X[:, 1], c=y-1, s=50, alpha=0.5, edgecolor=\"k\", cmap=cmap_bold)\n", "\n", "        # Build a grid over the predictor space\n", "        plot_step_x1 = 0.1\n", "        plot_step_x2 = 0.1\n", "        x1min, x1max = X[:,0].min(), X[:,0].max()\n", "        x2min, x2max = X[:,1].min(), X[:,1].max()\n", "        x1, x2 = np.meshgrid(np.arange(x1min, x1max, plot_step_x1), np.arange(x2min, x2max, plot_step_x2))\n", "\n", "        # Re-cast every coordinate in the meshgrid as a 2D point\n", "        Xplot = np.c_[x1.ravel(), x2.ravel()]\n", "\n", "        # Predict the class for each grid point and draw the decision boundary\n", "        y_grid = dtree.predict(Xplot)\n", "        y_grid = y_grid.reshape(x1.shape)\n", "        cs = ax.contourf(x1, x2, y_grid, alpha=0.02)\n", "\n", "    ax.set_xlabel('Latitude', fontsize=14)\n", "    ax.set_ylabel('Longitude', fontsize=14)\n", "    ax.set_title(f'Max depth = {max_depth[index]}', fontsize=20)\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Mindchow 🍲\n", "Play around with the following parameters:\n", "\n", "- max_depth\n", "- numboot\n", "\n", "Based on your observations, answer the questions below:\n", "\n", "- How does the plot change with varying `max_depth`?\n", "\n", "- How does the plot change with varying `numboot`?\n", "\n", "- How are the three plots essentially different?\n", "\n", "- Do more bootstraps reduce overfitting for:\n", "  - High depth\n", "  - Low depth" ] },
{ "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }