{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Title :\n", "Bagging Classification with Decision Boundary\n", "\n", "## Description :\n", "The goal of this exercise is to use **Bagging** (Bootstrap Aggregated) to solve a classification problem and visualize the influence on Bagging on trees with varying depths.\n", "\n", "Your final plot will resemble the one below.\n", "\n", "\n", "\n", "## Instructions:\n", "\n", "- Read the dataset `agriland.csv`.\n", "- Assign the predictor and response variables as `X` and `y`.\n", "- Split the data into train and test sets with `test_split=0.2` and `random_state=44`.\n", "- Fit a single `DecisionTreeClassifier()` and find the accuracy of your prediction.\n", "- Complete the helper function `prediction_by_bagging()` to find the average predictions for a given number of bootstraps.\n", "- Perform `Bagging` using the helper function, and compute the new accuracy.\n", "- Plot the accuracy as a function of the number of bootstraps.\n", "- Use the helper code to plot the decision boundaries for varying max_depth along with `num_bootstraps`. Investigate the effect of increasing bootstraps on the variance.\n", "\n", "## Hints: \n", "\n", "sklearn.tree.DecisionTreeClassifier()\n", "A decision tree classifier.\n", "\n", "DecisionTreeClassifier.fit()\n", "Build a decision tree classifier from the training set (X, y).\n", "\n", "DecisionTreeClassifier.predict()\n", "Predict class or regression value for X.\n", "\n", "train_test_split()\n", "Split arrays or matrices into random train and test subsets.\n", "\n", "np.random.choice\n", "Generates a random sample from a given 1-D array.\n", "\n", "plt.subplots()\n", "Create a figure and a set of subplots.\n", "\n", "ax.plot()\n", "Plot y versus x as lines and/or markers\n", "\n", "**Note: This exercise is auto-graded and you can try multiple attempts.**" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Import necessary libraries\n", "%matplotlib inline\n", "import numpy as np\n", "import pandas as pd\n", "from sklearn import metrics\n", "import scipy.optimize as opt\n", "import matplotlib.pyplot as plt\n", "from sklearn.metrics import accuracy_score\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.model_selection import train_test_split\n", "\n", "# Used for plotting later\n", "from matplotlib.colors import ListedColormap\n", "cmap_bold = ListedColormap(['#F7345E','#80C3BD'])\n", "cmap_light = ListedColormap(['#FFF4E5','#D2E3EF'])\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Read the file 'agriland.csv' as a Pandas dataframe\n", "df = pd.read_csv('agriland.csv')\n", "\n", "# Take a quick look at the data\n", "# Note that the latitude & longitude values are normalized\n", "df.head()\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Set the values of latitude & longitude predictor variables\n", "X = ___.values\n", "\n", "# Use the column \"land_type\" as the response variable\n", "y = ___.values\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Split data in train an test, with test size = 0.2 \n", "# and set random state as 44\n", "X_train, X_test, y_train, y_test = ___\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Define the max_depth of the decision tree\n", "max_depth = ___\n", "\n", "# Define a decision tree classifier 
{ "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Use the trained model to predict on the test set\n", "prediction = ___\n", "\n", "# Calculate the accuracy of the test predictions of a single tree\n", "single_acc = ___\n", "\n", "# Print the accuracy of the tree\n", "print(f'Single tree accuracy is {single_acc*100}%')\n" ] },
{ "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Complete the function below to get the prediction by bagging\n", "\n", "# Inputs: X_train, y_train: the data to train on\n", "# X_to_evaluate: samples you are going to predict on (evaluate)\n", "# num_bootstraps: how many trees you want to train\n", "# Output: an array of predicted classes for X_to_evaluate\n", "\n", "def prediction_by_bagging(X_train, y_train, X_to_evaluate, num_bootstraps):\n", "\n", "    # List to store every array of predictions\n", "    predictions = []\n", "\n", "    # Generate num_bootstraps number of trees\n", "    for i in range(num_bootstraps):\n", "\n", "        # Sample with replacement. We bootstrap indices rather than rows\n", "        # so that the same subset is taken from both X_train and y_train\n", "        resample_indexes = np.random.choice(np.arange(y_train.shape[0]), size=y_train.shape[0])\n", "\n", "        # Get a bootstrapped version of the data using the above indices\n", "        X_boot = X_train[___]\n", "        y_boot = y_train[___]\n", "\n", "        # Initialize a decision tree on the bootstrapped data\n", "        # Use the same max_depth and random_state as above\n", "        clf = ___\n", "\n", "        # Fit the model on the bootstrapped training set\n", "        clf.fit(___,___)\n", "\n", "        # Use the trained model to predict on the X_to_evaluate samples\n", "        pred = clf.predict(___)\n", "\n", "        # Append the predictions to the predictions list\n", "        predictions.append(pred)\n", "\n", "    # The list \"predictions\" holds [prediction_array_0, prediction_array_1, ..., prediction_array_n]\n", "    # To get the majority vote for each sample, we can take the average\n", "    # prediction across trees and threshold it at 0.5\n", "    average_prediction = ___\n", "\n", "    # Return the average prediction\n", "    return average_prediction\n" ] },
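{ "cell_type": "markdown", "metadata": {}, "source": [ "To see why averaging 0/1 predictions and thresholding at 0.5 implements a majority vote, consider the toy sketch below. The `tree_preds` array is made up for illustration: each row plays the role of one bootstrapped tree's predictions on four samples." ] },
{ "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Toy illustration of majority voting by averaging 0/1 predictions and\n", "# thresholding at 0.5. The three rows below are made-up prediction arrays,\n", "# standing in for three bootstrapped trees.\n", "tree_preds = np.array([[1, 0, 1, 1],   # predictions from tree 1\n", "                       [1, 1, 0, 1],   # predictions from tree 2\n", "                       [0, 0, 1, 1]])  # predictions from tree 3\n", "\n", "# Average across trees (axis=0); a sample is classified as 1 only when\n", "# more than half of the trees voted 1\n", "avg_vote = tree_preds.mean(axis=0)\n", "majority = (avg_vote > 0.5).astype(int)\n", "print(avg_vote)   # approximately [0.67 0.33 0.67 1.0]\n", "print(majority)   # [1 0 1 1]\n" ] },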
{ "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "### edTest(test_bag_acc) ###\n", "\n", "# Define the number of bootstraps\n", "num_bootstraps = 200\n", "\n", "# Call the prediction_by_bagging function with appropriate parameters\n", "y_pred = prediction_by_bagging(X_train, y_train, X_test, num_bootstraps=num_bootstraps)\n", "\n", "# Compare the average predictions to the true test set values\n", "# and compute the accuracy\n", "bagging_accuracy = ___\n", "\n", "# Print the bagging accuracy\n", "print(f'Accuracy with Bootstrap Aggregation is {bagging_accuracy*100}%')\n" ] },
{ "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Helper code to plot accuracy vs. number of bagged trees\n", "n = np.arange(1, 251)\n", "acc = []\n", "for n_i in n:\n", "    acc.append(np.mean(prediction_by_bagging(X_train, y_train, X_test, n_i) == y_test))\n", "\n", "plt.figure(figsize=(10,8))\n", "plt.plot(n, acc, alpha=0.7, linewidth=3, color='#50AEA4', label='Model Prediction')\n", "plt.title('Accuracy vs. Number of Trees in Bagging', fontsize=24)\n", "plt.xlabel('Number of trees', fontsize=16)\n", "plt.ylabel('Accuracy', fontsize=16)\n", "plt.xticks(fontsize=12)\n", "plt.yticks(fontsize=12)\n", "plt.legend(loc='best', fontsize=12)\n", "plt.show();\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Bagging Visualization\n", "\n", "Bagging helps reduce overfitting, but only up to a certain extent.\n", "\n", "Vary the `max_depth` and `numboot` variables in the visualization below to see how Bagging helps reduce overfitting." ] },
{ "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Make plots for three different values of max_depth\n", "fig, axes = plt.subplots(1, 3, figsize=(20,6))\n", "\n", "# Make a list of three max_depths to investigate\n", "max_depth = [2, 5, 100]\n", "\n", "# Fix the number of bootstraps\n", "numboot = 100\n", "\n", "for index, ax in enumerate(axes):\n", "\n", "    for i in range(numboot):\n", "        # Draw a bootstrap sample of the full dataset\n", "        df_new = df.sample(frac=1, replace=True)\n", "        y = df_new.land_type.values\n", "        X = df_new[['latitude', 'longitude']].values\n", "\n", "        # Fit a tree of the given depth on the bootstrap sample\n", "        dtree = DecisionTreeClassifier(max_depth=max_depth[index])\n", "        dtree.fit(X, y)\n", "        ax.scatter(X[:, 0], X[:, 1], c=y-1, s=50, alpha=0.5, edgecolor=\"k\", cmap=cmap_bold)\n", "\n", "        # Build a grid over the predictor space\n", "        plot_step_x1 = 0.1\n", "        plot_step_x2 = 0.1\n", "        x1min, x1max = X[:,0].min(), X[:,0].max()\n", "        x2min, x2max = X[:,1].min(), X[:,1].max()\n", "        x1, x2 = np.meshgrid(np.arange(x1min, x1max, plot_step_x1), np.arange(x2min, x2max, plot_step_x2))\n", "\n", "        # Re-cast every coordinate in the meshgrid as a 2D point\n", "        Xplot = np.c_[x1.ravel(), x2.ravel()]\n", "\n", "        # Predict the class for each grid point and draw the decision boundary\n", "        y_grid = dtree.predict(Xplot)\n", "        y_grid = y_grid.reshape(x1.shape)\n", "        cs = ax.contourf(x1, x2, y_grid, alpha=0.02)\n", "\n", "    ax.set_xlabel('Latitude', fontsize=14)\n", "    ax.set_ylabel('Longitude', fontsize=14)\n", "    ax.set_title(f'Max depth = {max_depth[index]}', fontsize=20)\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Mindchow 🍲\n", "Play around with the following parameters:\n", "\n", "- max_depth\n", "- numboot\n", "\n", "Based on your observations, answer the questions below:\n", "\n", "- How does the plot change with varying `max_depth`?\n", "\n", "- How does the plot change with varying `numboot`?\n", "\n", "- How are the three plots essentially different?\n", "\n", "- Do more bootstraps reduce overfitting for:\n", "  - High depth\n", "  - Low depth" ] },
{ "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }