{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Title\n",
    "\n",
    "**Exercise: Bagging Classification with Decision Boundary**\n",
    "\n",
    "# Description\n",
    "\n",
    "The goal of this exercise is to use **Bagging** (Bootstrap Aggregated) to solve a classification problem and visualize the influence on Bagging on trees with varying depths.\n",
    "\n",
    "Your final plot should resemble the one below."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"../img/image2.png\" style=\"width: 500px;\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Instructions:\n",
    "- Read the dataset `agriland.csv`.\n",
    "- Assign the predictor and response variables as `X` and `y`.\n",
    "- Split the data into train and test sets with `test_split=0.2` and `random_state=44`.\n",
    "- Fit a single `DecisionTreeClassifier()` and find the accuracy of your prediction.\n",
    "- Complete the helper function `prediction_by_bagging()` to find the average predictions for a given number of bootstraps.\n",
    "- Now perform **Bagging** using the helper function, and compute the new accuracy.\n",
    "- Proceed to plot of accuracy with increasing number of bootstraps.\n",
    "- Finally, use the helper code to plot the decision boundaries for varying `max_depth` along with `num_bootstraps`. Investigate the effect of increasing bootstraps on the variance.\n",
    "\n",
    "# Hints:\n",
    "\n",
    "<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn-tree-decisiontreeclassifier\" target=\"_blank\">sklearn.tree.DecisionTreeClassifier()</a> : A decision tree classifier.\n",
    "\n",
    "<a href=\"https://docs.scipy.org/doc//numpy-1.10.4/reference/generated/numpy.random.choice.html\" target=\"_blank\">np.random.choice</a> : Generates a random sample from a given 1-D array\n",
    "\n",
    "<a href=\"https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.subplots.html\" target=\"_blank\">plt.subplots()</a> : Create a figure and a set of subplots.\n",
    "\n",
    "<a href=\"https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.axes.Axes.plot.html\" target=\"_blank\">ax.plot()</a> : Plot y versus x as lines and/or markers\n",
    "\n",
    "**Note: This exercise is auto-graded and you can try multiple attempts.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Bagging Classification"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import required libraries\n",
    "\n",
    "%matplotlib inline\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn import metrics\n",
    "import scipy.optimize as opt\n",
    "from sklearn.metrics import accuracy_score\n",
    "\n",
    "# to be used for plotting later\n",
    "\n",
    "from matplotlib.colors import ListedColormap\n",
    "cmap_light = ListedColormap(['#FFF4E5','#D2E3EF'])\n",
    "cmap_bold = ListedColormap(['#F7345E','#80C3BD'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>latitude</th>\n",
       "      <th>longitude</th>\n",
       "      <th>land_type</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-0.071860</td>\n",
       "      <td>-1.297410</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-0.179482</td>\n",
       "      <td>-0.874892</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>-1.217428</td>\n",
       "      <td>-1.352105</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1.143306</td>\n",
       "      <td>-0.894172</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>-3.033199</td>\n",
       "      <td>0.818646</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   latitude  longitude  land_type\n",
       "0 -0.071860  -1.297410        1.0\n",
       "1 -0.179482  -0.874892        1.0\n",
       "2 -1.217428  -1.352105        0.0\n",
       "3  1.143306  -0.894172        1.0\n",
       "4 -3.033199   0.818646        0.0"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Read the file 'agriland.csv' and take a quick look at your data\n",
    "df = pd.read_csv('agriland.csv')\n",
    "\n",
    "# Note that the latitude & longitude values are normalized\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set your predictor variables(latitude &longitude) as 'X' and response variable as y and make sure to use .values\n",
    "\n",
    "X = ___\n",
    "\n",
    "y = ___"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#split data in train an test, with test size = 0.2 and randomstate=44\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=44)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define the max_depth of your decision tree and set the random_state variable as 44\n",
    "\n",
    "max_depth = ___\n",
    "\n",
    "# Lets create and train our model\n",
    "clf = DecisionTreeClassifier(max_depth=max_depth, random_state=44)\n",
    "\n",
    "clf.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Predict on the test set and calculate the accuracy of a single decision tree\n",
    "\n",
    "prediction = ___\n",
    "single_acc = ___\n",
    "print(f'Single tree Accuracy is {single_acc*100}%')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Complete the function below to get the prediction by bagging\n",
    "\n",
    "# Inputs: X_train, y_train to train your data\n",
    "# X_to_evaluate: Samples that you are goin to predict (evaluate)\n",
    "# num_bootstraps: how many trees you want to train\n",
    "# Output: An array of predicted classes for X_to_evaluate\n",
    "\n",
    "def prediction_by_bagging(X_train, y_train, X_to_evaluate, num_bootstraps):\n",
    "    \n",
    "    # list to store every array of predictions\n",
    "    \n",
    "    predictions = []\n",
    "    \n",
    "    #generate num_bootstraps number of trees\n",
    "    \n",
    "    for i in range(num_bootstraps):\n",
    "        \n",
    "        # sample data to perform first bootstrap, here, we actually bootstrap indices, because we want the same subset for X_train and y_train\n",
    "        \n",
    "        resample_indexes = np.random.choice(np.arange(y_train.shape[0]), size=y_train.shape[0])\n",
    "        \n",
    "        # get bootstrapped set for 'X' and 'y' using the above indices\n",
    "        \n",
    "        X_boot = X_train[___]\n",
    "        y_boot = y_train[___]\n",
    "        \n",
    "        # train decision tree on bootstrap set, use the same max_depth and random_state as above\n",
    "        \n",
    "        clf = ___\n",
    "        \n",
    "        # fit the model on bootstrapped training set\n",
    "        \n",
    "        clf.fit(___,___)\n",
    "        \n",
    "        #  make predictions on X_to_evaluate samples\n",
    "        \n",
    "        pred = clf.predict(___)\n",
    "        \n",
    "        predictions.append(pred)\n",
    "    # Now we have a list of predictions like [prediction_array_0, prediction_array_1, ..., prediction_array_n]\n",
    "    \n",
    "    # To get the majority vote for each sample, we can find the average prediction and threshold them by 0.5\n",
    "    \n",
    "    average_prediction = ___\n",
    "    \n",
    "    return average_prediction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "### edTest(test_bag_acc) ###         \n",
    "# now we print the accuracy of the bagging with decision trees\n",
    "\n",
    "#Define the number of bootstraps to be used\n",
    "\n",
    "num_bootstraps = 200\n",
    "\n",
    "y_pred = prediction_by_bagging(X_train,y_train,X_test,num_bootstraps=num_bootstraps)\n",
    "\n",
    "# Compare the average predictions to the true test set values and compute the accuracy \n",
    "\n",
    "bagging_accuracy = ___\n",
    "\n",
    "print(f'Accuracy with Bootstrapped Aggregation is  {bagging_accuracy*100}%')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# To visualize, lets plot accuracy as a function of the number of trees in the Bagging\n",
    "\n",
    "# Run the helper code below, and if your function is well defined above, you should see a plot of accuracy vs number of bagged trees\n",
    "\n",
    "n = np.linspace(1,250,250).astype(int)\n",
    "acc = []\n",
    "for n_i in n:\n",
    "    acc.append(np.mean(prediction_by_bagging(X_train, y_train, X_test, n_i)==y_test))\n",
    "plt.figure(figsize=(10,8))\n",
    "plt.plot(n,acc,alpha=0.7,linewidth=3,color='#50AEA4', label='Model Prediction')\n",
    "plt.title('Accuracy vs. Number of trees in Bagging ',fontsize=24)\n",
    "plt.xlabel('Number of trees',fontsize=16)\n",
    "plt.ylabel('Accuracy',fontsize=16)\n",
    "plt.xticks(fontsize=12)\n",
    "plt.yticks(fontsize=12)\n",
    "plt.legend(loc='best',fontsize=12)\n",
    "plt.show()\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Bagging Visualization\n",
    "\n",
    "Bagging does well to reduce overfitting, but only upto a certain extent.\n",
    "\n",
    "Vary the `max_depth` and `numboot` variables to see how Bagging helps reduce overfitting with the help of the visualization below"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We will make plots for three different values of `max_depth`\n",
    "\n",
    "fig,axes = plt.subplots(1,3,figsize=(20,6))\n",
    "\n",
    "# Make a list of three max_depths to investigate\n",
    "max_depth = [2,5,100]\n",
    "\n",
    "# Fix the number of bootstraps\n",
    "\n",
    "numboot = 100\n",
    "\n",
    "for index,ax in enumerate(axes):\n",
    "\n",
    "    for i in range(numboot):\n",
    "        df_new = df.sample(frac=1,replace=True)\n",
    "        y = df_new.land_type.values\n",
    "        X = df_new[['latitude', 'longitude']].values\n",
    "        dtree = DecisionTreeClassifier(max_depth=max_depth[index])\n",
    "        dtree.fit(X, y)\n",
    "        ax.scatter(X[:, 0], X[:, 1], c=y-1, s=50,alpha=0.5,edgecolor=\"k\",cmap=cmap_bold) \n",
    "        plot_step_x1= 0.1\n",
    "        plot_step_x2= 0.1\n",
    "        x1min, x1max= X[:,0].min(), X[:,0].max()\n",
    "        x2min, x2max= X[:,1].min(), X[:,1].max()\n",
    "        x1, x2 = np.meshgrid(np.arange(x1min, x1max, plot_step_x1), np.arange(x2min, x2max, plot_step_x2) )\n",
    "        # Re-cast every coordinate in the meshgrid as a 2D point\n",
    "        Xplot= np.c_[x1.ravel(), x2.ravel()]\n",
    "\n",
    "        # Predict the class\n",
    "        y = dtree.predict( Xplot )\n",
    "        y= y.reshape( x1.shape )\n",
    "        cs = ax.contourf(x1, x2, y, alpha=0.02)\n",
    "        \n",
    "        \n",
    "    ax.set_xlabel('Latitude',fontsize=14)\n",
    "    ax.set_ylabel('Longitude',fontsize=14)\n",
    "    ax.set_title(f'Max depth = {max_depth[index]}',fontsize=20)\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Mindchow 🍲\n",
    "Play around with the following parameters:\n",
    "\n",
    "- max_depth\n",
    "- numboot\n",
    "\n",
    "Based on your observations, answer the questions below:\n",
    "\n",
    "- How does the plot change with varying `max_depth`\n",
    "\n",
    "- How does the plot change with varying `numboot`\n",
    "\n",
    "- How are the three plots essentially different?\n",
    "\n",
    "- Does more bootstraps reduce overfitting for\n",
    "    - High depth\n",
    "    - Low depth"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}