{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Title\n",
    "\n",
    "**Exercise: Regression with Bagging**\n",
    "\n",
    "# Description\n",
    "\n",
    "The aim of this exercise is to understand bagging regression. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"../img/image.png\" style=\"width: 500px;\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Instructions:\n",
    "- Read the dataset airquality.csv as a pandas dataframe.\n",
    "- Take a quick look at the dataset.\n",
    "- Split the data into train and test sets.\n",
    "- Specify the number of bootstraps as 30 and a maximum depth of 3.\n",
    "- Define a Bagging Regression model that uses Decision Tree as its base estimator.\n",
    "- Fit the model on the train data.\n",
    "- Use the helper code to predict using the mean model and individual estimators. The plot will look similar to the one given above.\n",
    "- Predict on the test data using the first estimator and the mean model.\n",
    "- Compute and display the test MSEs.\n",
    "\n",
    "# Hints:\n",
    "\n",
    "<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html\" target=\"_blank\">sklearn.train_test_split()</a> : Split arrays or matrices into random train and test subsets.\n",
    "\n",
    "<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html\" target=\"_blank\">BaggingRegressor()</a> : Returns a Bagging regressor instance.\n",
    "\n",
    "<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html\" target=\"_blank\">DecisionTreeRegressor()</a> : A decision tree regressor.\n",
    "\n",
    "<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html\" target=\"_blank\">DecisionTreeRegressor().estimators_</a> : A list of estimators. Use this to access any of the estimators. \n",
    "\n",
    "<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html\" target=\"_blank\">sklearn.mean_squared_error()</a> : Mean squared error regression loss."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import necessary libraries\n",
    "\n",
    "import numpy as np\n",
    "from numpy import mean\n",
    "from numpy import std\n",
    "from sklearn.datasets import make_regression\n",
    "from sklearn.ensemble import BaggingRegressor\n",
    "import matplotlib.pyplot as plt\n",
    "import pandas as pd \n",
    "import itertools\n",
    "from sklearn.tree import DecisionTreeRegressor\n",
    "from sklearn.metrics import mean_squared_error\n",
    "from sklearn.model_selection import train_test_split\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read the dataset\n",
    "df = pd.read_csv(\"airquality.csv\",index_col=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Take a quick look at the data\n",
    "df.head(10)\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We will only use Ozone for this exerice. Drop any notnas\n",
    "df = df[df.Ozone.notna()]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Assign \"x\" column as the predictor variable, only use Ozone, and \"y\" as the\n",
    "x = df[['Ozone']].values\n",
    "y = df['Temp']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Split the data into train and test sets with train size as 0.8 and random_state as 102\n",
    "x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=102)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Bagging Regressor"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Ozone</th>\n",
       "      <th>Solar.R</th>\n",
       "      <th>Wind</th>\n",
       "      <th>Temp</th>\n",
       "      <th>Month</th>\n",
       "      <th>Day</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>41.0</td>\n",
       "      <td>190.0</td>\n",
       "      <td>7.4</td>\n",
       "      <td>67</td>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>36.0</td>\n",
       "      <td>118.0</td>\n",
       "      <td>8.0</td>\n",
       "      <td>72</td>\n",
       "      <td>5</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>12.0</td>\n",
       "      <td>149.0</td>\n",
       "      <td>12.6</td>\n",
       "      <td>74</td>\n",
       "      <td>5</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>18.0</td>\n",
       "      <td>313.0</td>\n",
       "      <td>11.5</td>\n",
       "      <td>62</td>\n",
       "      <td>5</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>28.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>14.9</td>\n",
       "      <td>66</td>\n",
       "      <td>5</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>23.0</td>\n",
       "      <td>299.0</td>\n",
       "      <td>8.6</td>\n",
       "      <td>65</td>\n",
       "      <td>5</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>19.0</td>\n",
       "      <td>99.0</td>\n",
       "      <td>13.8</td>\n",
       "      <td>59</td>\n",
       "      <td>5</td>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>8.0</td>\n",
       "      <td>19.0</td>\n",
       "      <td>20.1</td>\n",
       "      <td>61</td>\n",
       "      <td>5</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>7.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>6.9</td>\n",
       "      <td>74</td>\n",
       "      <td>5</td>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>16.0</td>\n",
       "      <td>256.0</td>\n",
       "      <td>9.7</td>\n",
       "      <td>69</td>\n",
       "      <td>5</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>11.0</td>\n",
       "      <td>290.0</td>\n",
       "      <td>9.2</td>\n",
       "      <td>66</td>\n",
       "      <td>5</td>\n",
       "      <td>13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>14.0</td>\n",
       "      <td>274.0</td>\n",
       "      <td>10.9</td>\n",
       "      <td>68</td>\n",
       "      <td>5</td>\n",
       "      <td>14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>18.0</td>\n",
       "      <td>65.0</td>\n",
       "      <td>13.2</td>\n",
       "      <td>58</td>\n",
       "      <td>5</td>\n",
       "      <td>15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>14.0</td>\n",
       "      <td>334.0</td>\n",
       "      <td>11.5</td>\n",
       "      <td>64</td>\n",
       "      <td>5</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>34.0</td>\n",
       "      <td>307.0</td>\n",
       "      <td>12.0</td>\n",
       "      <td>66</td>\n",
       "      <td>5</td>\n",
       "      <td>17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>6.0</td>\n",
       "      <td>78.0</td>\n",
       "      <td>18.4</td>\n",
       "      <td>57</td>\n",
       "      <td>5</td>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>30.0</td>\n",
       "      <td>322.0</td>\n",
       "      <td>11.5</td>\n",
       "      <td>68</td>\n",
       "      <td>5</td>\n",
       "      <td>19</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>11.0</td>\n",
       "      <td>44.0</td>\n",
       "      <td>9.7</td>\n",
       "      <td>62</td>\n",
       "      <td>5</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>1.0</td>\n",
       "      <td>8.0</td>\n",
       "      <td>9.7</td>\n",
       "      <td>59</td>\n",
       "      <td>5</td>\n",
       "      <td>21</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>11.0</td>\n",
       "      <td>320.0</td>\n",
       "      <td>16.6</td>\n",
       "      <td>73</td>\n",
       "      <td>5</td>\n",
       "      <td>22</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>4.0</td>\n",
       "      <td>25.0</td>\n",
       "      <td>9.7</td>\n",
       "      <td>61</td>\n",
       "      <td>5</td>\n",
       "      <td>23</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>32.0</td>\n",
       "      <td>92.0</td>\n",
       "      <td>12.0</td>\n",
       "      <td>61</td>\n",
       "      <td>5</td>\n",
       "      <td>24</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28</th>\n",
       "      <td>23.0</td>\n",
       "      <td>13.0</td>\n",
       "      <td>12.0</td>\n",
       "      <td>67</td>\n",
       "      <td>5</td>\n",
       "      <td>28</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    Ozone  Solar.R  Wind  Temp  Month  Day\n",
       "1    41.0    190.0   7.4    67      5    1\n",
       "2    36.0    118.0   8.0    72      5    2\n",
       "3    12.0    149.0  12.6    74      5    3\n",
       "4    18.0    313.0  11.5    62      5    4\n",
       "6    28.0      NaN  14.9    66      5    6\n",
       "7    23.0    299.0   8.6    65      5    7\n",
       "8    19.0     99.0  13.8    59      5    8\n",
       "9     8.0     19.0  20.1    61      5    9\n",
       "11    7.0      NaN   6.9    74      5   11\n",
       "12   16.0    256.0   9.7    69      5   12\n",
       "13   11.0    290.0   9.2    66      5   13\n",
       "14   14.0    274.0  10.9    68      5   14\n",
       "15   18.0     65.0  13.2    58      5   15\n",
       "16   14.0    334.0  11.5    64      5   16\n",
       "17   34.0    307.0  12.0    66      5   17\n",
       "18    6.0     78.0  18.4    57      5   18\n",
       "19   30.0    322.0  11.5    68      5   19\n",
       "20   11.0     44.0   9.7    62      5   20\n",
       "21    1.0      8.0   9.7    59      5   21\n",
       "22   11.0    320.0  16.6    73      5   22\n",
       "23    4.0     25.0   9.7    61      5   23\n",
       "24   32.0     92.0  12.0    61      5   24\n",
       "28   23.0     13.0  12.0    67      5   28"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "\n",
    "# Specify the number of bootstraps as 30\n",
    "num_bootstraps = 30\n",
    "\n",
    "# Specify the maximum depth of the decision tree as 3\n",
    "max_depth = 3\n",
    "\n",
    "# Define the Bagging Regressor Model\n",
    "# Use Decision Tree as your base estimator with depth as mentioned in max_depth\n",
    "# Initialise number of estimators using the num_bootstraps value\n",
    "model = ___\n",
    "                        \n",
    "\n",
    "# Fit the model on the train data\n",
    "___\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Helper code to plot the predictions of individual estimators and \n",
    "plt.figure(figsize=(10,8))\n",
    "\n",
    "xrange = np.linspace(x.min(),x.max(),80).reshape(-1,1)\n",
    "plt.plot(x_train,y_train,'o',color='#EFAEA4', markersize=6, label=\"Train Data\")\n",
    "plt.plot(x_test,y_test,'o',color='#F6345E', markersize=6, label=\"Test Data\")\n",
    "\n",
    "plt.xlim()\n",
    "for i in model.estimators_:\n",
    "    y_pred1 = i.predict(xrange)\n",
    "    plt.plot(xrange,y_pred1,alpha=0.5,linewidth=0.5,color = '#ABCCE3')\n",
    "plt.plot(xrange,y_pred1,alpha=0.6,linewidth=1,color = '#ABCCE3',label=\"Prediction of Individual Estimators\")\n",
    "\n",
    "\n",
    "y_pred = model.predict(xrange)\n",
    "plt.plot(xrange,y_pred,alpha=0.7,linewidth=3,color='#50AEA4', label='Model Prediction')\n",
    "plt.xlabel(\"Ozone\", fontsize=16)\n",
    "plt.ylabel(\"Temperature\", fontsize=16)\n",
    "plt.xticks(fontsize=12)\n",
    "plt.yticks(fontsize=12)\n",
    "plt.legend(loc='best',fontsize=12)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compute the test MSE of the prediction of individual estimator\n",
    "y_pred1 = ___\n",
    "print(\"The test MSE of one estimator in the model is\", round(mean_squared_error(y_test,y_pred1),2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "### edTest(test_mse) ###\n",
    "# Compute the test MSE of the model prediction\n",
    "y_pred = ___\n",
    "print(\"The test MSE of the model is\",round(mean_squared_error(y_test,y_pred),2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Mindchow 🍲\n",
    "\n",
    "After marking, go back and change the number of bootstraps and the maximum depth of the tree. \n",
    "\n",
    "\n",
    "- Do you see any relation between them? \n",
    "\n",
    "- How does the variance change with change in maximum depth?\n",
    "\n",
    "- How does the variance change with change in number of bootstraps?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Your answer here*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}