{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Title :\n", "\n", "Exercise: Regression with Boosting\n", " \n", "## Description :\n", "\n", "The goal of this exercise is to understand Gradient Boosting Regression.\n", "\n", "\n", "\n", "## Instructions:\n", "\n", "- Part A: \n", " - Read the dataset airquality.csv as a pandas dataframe.\n", " - Take a quick look at the dataset.\n", " - Assign the predictor and response variables appropriately as mentioned in the scaffold.\n", " - Fit a single decision tree stump and predict on the entire data.\n", " - Calculate the residuals and fit another tree on the residuals.\n", " - Take a combination of the trees and fit on the model.\n", " - For each of these model use the helper code provided to plot the model prediction and data.\n", "\n", "- Part B: Compare to bagging \n", " - Split the data into train and test splits.\n", " - Specify the number of bootstraps for bagging to be 30 and a maximum depth of 3.\n", " - Define a Gradient Boosting Regression model that uses with 1000 estimators and depth of 1.\n", " - Define a Bagging Regression model that uses the Decision Tree as its base estimator.\n", " - Fit both the models on the train data.\n", " - Use the helper code to predict using the mean model and individual estimators. 
The plot will look similar to the one given above.\n", " - Compute the MSE of the two models on the test data.\n", "\n", "## Hints: \n", "\n", "sklearn.DecisionTreeRegressor()\n", "A decision tree regressor.\n", "\n", "regressor.fit()\n", "Build a decision tree regressor from the training set (X, y).\n", "\n", "sklearn.DecisionTreeClassifier()\n", "A decision tree classifier.\n", "\n", "classifier.fit()\n", "Build a decision tree classifier from the training set (X, y).\n", "\n", "sklearn.train_test_split()\n", "Split arrays or matrices into random train and test subsets.\n", "\n", "BaggingRegressor()\n", "Returns a Bagging regressor instance.\n", "\n", "sklearn.mean_squared_error()\n", "Mean squared error regression loss.\n", "\n", "GradientBoostingRegressor()\n", "Gradient Boosting for regression.\n", "\n", "**Note:** This exercise is **auto-graded and you can try multiple attempts.**" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Import necessary libraries\n", "import itertools\n", "import numpy as np\n", "import pandas as pd \n", "import matplotlib.pyplot as plt\n", "from sklearn.ensemble import BaggingRegressor\n", "from sklearn.tree import DecisionTreeRegressor\n", "from sklearn.metrics import mean_squared_error\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.ensemble import GradientBoostingRegressor\n", "%matplotlib inline\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Read the dataset airquality.csv\n", "df = pd.read_csv(\"airquality.csv\")\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Take a quick look at the data\n", "# Remove rows with missing values\n", "df = df[df.Ozone.notna()]\n", "df.head()\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Assign \"x\" column as the predictor variable and \"y\" as the response\n", "# We 
only use Ozone as a predictor for this exercise and Temp as the response\n", "x, y = df['Ozone'].values, df['Temp'].values\n", "\n", "# Sorting the data based on X values\n", "x, y = list(zip(*sorted(zip(x,y))))\n", "x, y = np.array(x).reshape(-1,1),np.array(y)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part A: Gradient Boosting by hand" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Initialise a single decision tree stump\n", "basemodel = ___\n", "\n", "# Fit the stump on the entire data\n", "___\n", "\n", "# Predict on the entire data\n", "y_pred = ___\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Helper code to plot the data\n", "plt.figure(figsize=(10,6))\n", "xrange = np.linspace(x.min(),x.max(),100)\n", "plt.plot(x,y,'o',color='#EFAEA4', markersize=6, label=\"True Data\")\n", "plt.xlim()\n", "plt.plot(x,y_pred,alpha=0.7,linewidth=3,color='#50AEA4', label='First Tree')\n", "plt.xlabel(\"Ozone\", fontsize=16)\n", "plt.ylabel(\"Temperature\", fontsize=16)\n", "plt.xticks(fontsize=12)\n", "plt.yticks(fontsize=12)\n", "plt.legend(loc='best',fontsize=12)\n", "plt.show()\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "### edTest(test_first_residuals) ###\n", "\n", "# Calculate the error residuals\n", "residuals = ___\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Helper code to plot the data with the residuals\n", "plt.figure(figsize=(10,6))\n", "plt.plot(x,y,'o',color='#EFAEA4', markersize=6, label=\"True Data\")\n", "plt.plot(x,residuals,'.-',color='#faa0a6', markersize=6, label=\"Residuals\")\n", "plt.plot([x.min(),x.max()],[0,0],'--')\n", "plt.xlim()\n", "plt.plot(x,y_pred,alpha=0.7,linewidth=3,color='#50AEA4', label='First Tree')\n", "plt.xlabel(\"Ozone\", fontsize=16)\n", "plt.ylabel(\"Temperature\", fontsize=16)\n", 
"plt.xticks(fontsize=12)\n", "plt.yticks(fontsize=12)\n", "plt.legend(loc='center right',fontsize=12)\n", "plt.show()\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "### edTest(test_fitted_residuals) ###\n", "\n", "# Initialise a tree stump\n", "dtr = ___\n", "\n", "# Fit the tree stump on the residuals\n", "___\n", "\n", "# Predict on the entire data\n", "y_pred_residuals = ___\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Helper code to add the fit of the residuals to the original plot \n", "plt.figure(figsize=(10,6))\n", "\n", "plt.plot(x,y,'o',color='#EFAEA4', markersize=6, label=\"True Data\")\n", "plt.plot(x,residuals,'.-',color='#faa0a6', markersize=6, label=\"Residuals\")\n", "plt.plot([x.min(),x.max()],[0,0],'--')\n", "plt.xlim()\n", "plt.plot(x,y_pred,alpha=0.7,linewidth=3,color='#50AEA4', label='First Tree')\n", "plt.plot(x,y_pred_residuals,alpha=0.7,linewidth=3,color='red', label='Residual Tree')\n", "plt.xlabel(\"Ozone\", fontsize=16)\n", "plt.ylabel(\"Temperature\", fontsize=16)\n", "plt.xticks(fontsize=12)\n", "plt.yticks(fontsize=12)\n", "plt.legend(loc='center right',fontsize=12)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "### edTest(test_new_pred) ###\n", "\n", "# Set a lambda value and compute the predictions based on \n", "# the residuals\n", "lambda_ = ___\n", "y_pred_new = ___\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Helper code to plot the boosted tree\n", "plt.figure(figsize=(10,8))\n", "plt.plot(x,y,'o',color='#EFAEA4', markersize=6, label=\"True Data\")\n", "plt.plot(x,residuals,'.-',color='#faa0a6', markersize=6, label=\"Residuals\")\n", "plt.plot([x.min(),x.max()],[0,0],'--')\n", "plt.xlim()\n", "plt.plot(x,y_pred,alpha=0.7,linewidth=3,color='#50AEA4', label='First Tree')\n", 
"plt.plot(x,y_pred_residuals,alpha=0.7,linewidth=3,color='red', label='Residual Tree')\n", "plt.plot(x,y_pred_new,alpha=0.7,linewidth=3,color='k', label='Boosted Tree')\n", "plt.xlabel(\"Ozone\", fontsize=16)\n", "plt.ylabel(\"Temperature\", fontsize=16)\n", "plt.xticks(fontsize=12)\n", "plt.yticks(fontsize=12)\n", "plt.legend(loc='center right',fontsize=12)\n", "plt.show()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 2: Comparison with Bagging\n", "\n", "To compare the two methods, we will be using sklearn's methods and not our own implementation from above. " ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Split the data into train and test sets with train size as 0.8 \n", "# and random_state as 102\n", "# The default value for shuffle is True for train_test_split, so the ordering we \n", "# did above is not a problem. \n", "x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=102)\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "### edTest(test_boosting) ###\n", "\n", "# Set a learning rate\n", "l_rate = ___\n", "\n", "# Initialise a Boosting model using sklearn's boosting model \n", "# Use 1000 estimators, depth of 1 and learning rate as defined above\n", "boosted_model = ___\n", "\n", "# Fit on the train data\n", "___\n", "\n", "# Predict on the test data\n", "y_pred = ___\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Specify the number of bootstraps\n", "num_bootstraps = 30\n", "\n", "# Specify the maximum depth of the decision tree\n", "max_depth = 100\n", "\n", "# Define the Bagging Regressor Model\n", "# Use Decision Tree as your base estimator with depth as mentioned in max_depth\n", "# Initialise number of estimators using the num_bootstraps value\n", "# Set max_samples as 1 and random_state as 3\n", "model = ___\n", " \n", "\n", "# Fit the model on 
the train data\n", "___\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Helper code to plot the bagging and boosting model predictions\n", "plt.figure(figsize=(10,8))\n", "xrange = np.linspace(x.min(),x.max(),100).reshape(-1,1)\n", "y_pred_boost = boosted_model.predict(xrange)\n", "y_pred_bag = model.predict(xrange)\n", "plt.plot(x,y,'o',color='#EFAEA4', markersize=6, label=\"True Data\")\n", "plt.xlim()\n", "plt.plot(xrange,y_pred_boost,alpha=0.7,linewidth=3,color='#77c2fc', label='Bagging')\n", "plt.plot(xrange,y_pred_bag,alpha=0.7,linewidth=3,color='#50AEA4', label='Boosting')\n", "plt.xlabel(\"Ozone\", fontsize=16)\n", "plt.ylabel(\"Temperature\", fontsize=16)\n", "plt.xticks(fontsize=12)\n", "plt.yticks(fontsize=12)\n", "plt.legend(loc='best',fontsize=12)\n", "plt.show()\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "### edTest(test_mse) ###\n", "\n", "# Compute the MSE of the Boosting model prediction on the test data\n", "boost_mse = ___\n", "print(\"The MSE of the Boosting model is\", boost_mse)\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Compute the MSE of the Bagging model prediction on the test data\n", "bag_mse = ___\n", "print(\"The MSE of the Bagging model is\", bag_mse)\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }