{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Title\n",
"\n",
"**Exercise: Regression with Boosting**\n",
"\n",
"# Description\n",
"\n",
"The goal of this exercise is to understand Gradient Boosting Regression by doing!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Instructions:\n",
"\n",
"## Part A: \n",
"- Read the dataset airquality.csv as a pandas dataframe.\n",
"- Take a quick look at the dataset.\n",
"- Assign the predictor and response variables appropriately as mentioned in the scaffold.\n",
"- Fit a single decision tree stump and predict on the entire data.\n",
"- Calculate the residuals and fit another tree on the residuals.\n",
"- Take a combination of the trees and fit on the model.\n",
"- For each of these model use the helper code provided to plot the model prediction and data.\n",
"\n",
"## Part B: Compare to bagging \n",
"- Split the data into train and test splits.\n",
"- Specify the number of bootstraps for bagging to be 30 and a maximum depth of 3.\n",
"- Define a Gradient Boosting Regression model that uses with 5000 estimators and depth of 1.\n",
"- Define a Bagging Regression model that uses the Decision Tree as its base estimator.\n",
"- Fit both the models on the train data.\n",
"- Use the helper code to predict using the mean model and individual estimators. The plot will look similar to the one given above.\n",
"- Compute the MSE of the 2 models on the test data.\n",
"\n",
"# Hints:\n",
"\n",
"sklearn.train_test_split() : Split arrays or matrices into om train and test subsets.\n",
"\n",
"BaggingRegressor() : Returns a Bagging regressor instance.\n",
"\n",
"DecisionTreeRegressor() : A decision tree regressor.\n",
"\n",
"sklearn.mean_squared_error() : Mean squared error regression loss.\n",
"\n",
"GradientBoostingRegressor() : Gradient Boosting for regression.\n",
"\n",
"Note: This exercise is **auto-graded and you can try multiple attempts.**"
]
},
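{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a reference for Part A, one round of boosting by hand amounts to the following update, where $\\lambda$ is the learning rate, $\\hat{y}$ is the first stump's prediction, and $\\hat{r}$ is the prediction of a second stump fit to the residuals:\n",
"\n",
"$$\\hat{y}_{new} = \\hat{y} + \\lambda \\, \\hat{r}, \\qquad \\hat{r} \\approx y - \\hat{y}$$\n",
"\n",
"Repeating this update many times is exactly what sklearn's `GradientBoostingRegressor` automates in Part B."
]
},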
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import necessary libraries\n",
"\n",
"# Evaluate bagging ensemble for regression\n",
"import numpy as np\n",
"from sklearn.ensemble import BaggingRegressor\n",
"from sklearn.ensemble import GradientBoostingRegressor\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd \n",
"import itertools\n",
"from sklearn.tree import DecisionTreeRegressor\n",
"from sklearn.metrics import mean_squared_error\n",
"from sklearn.model_selection import train_test_split\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Read the dataset airquality.csv\n",
"df = pd.read_csv(\"airquality.csv\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Take a quick look at the data\n",
"# Remove rows with missing values\n",
"df = df[df.Ozone.notna()]\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Assign \"x\" column as the predictor variable and \"y\" as the\n",
"# We only use Ozone as a predictor for this exercise and Temp as the response\n",
"\n",
"x,y = df['Ozone'].values,df['Temp'].values\n",
"\n",
"# Fancy way of sorting on X \n",
"# We do this now because we will not split the data \n",
"# into train/val/test in this part of the exercise\n",
"\n",
"x,y = list(zip(*sorted(zip(x,y))))\n",
"x,y = np.array(x).reshape(-1,1),np.array(y)"
]
},
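{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `zip`/`sorted`/`zip` trick above pairs each `x` with its `y`, sorts the pairs by `x`, and unzips them again. A sketch of an equivalent NumPy version, applied before the reshape (`order` is just an illustrative name):\n",
"\n",
"```python\n",
"order = np.argsort(x)      # indices that would sort x\n",
"x, y = x[order], y[order]  # reorder both arrays consistently\n",
"```"
]
},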
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part A: Gradient Boosting by hand"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Fit a single decision tree stump on the entire data\n",
"\n",
"basemodel = ___\n",
"___\n",
"\n",
"# Predict on the entire data\n",
"y_pred = ___"
]
},
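{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you are unsure of the API, here is a minimal sketch of fitting a stump (a tree of depth 1) with `DecisionTreeRegressor`; the variable names are illustrative, not part of the scaffold:\n",
"\n",
"```python\n",
"from sklearn.tree import DecisionTreeRegressor\n",
"\n",
"example_stump = DecisionTreeRegressor(max_depth=1)  # a stump has a single split\n",
"example_stump.fit(x, y)                             # x has shape (n_samples, 1)\n",
"example_preds = example_stump.predict(x)\n",
"```"
]
},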
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Helper code to plot the data\n",
"\n",
"plt.figure(figsize=(10,6))\n",
"xrange = np.linspace(x.min(),x.max(),100)\n",
"plt.plot(x,y,'o',color='#EFAEA4', markersize=6, label=\"True Data\")\n",
"plt.xlim()\n",
"plt.plot(x,y_pred,alpha=0.7,linewidth=3,color='#50AEA4', label='First Tree')\n",
"plt.xlabel(\"Ozone\", fontsize=16)\n",
"plt.ylabel(\"Temperature\", fontsize=16)\n",
"plt.xticks(fontsize=12)\n",
"plt.yticks(fontsize=12)\n",
"plt.legend(loc='best',fontsize=12)\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_first_residuals) ###\n",
"\n",
"# Calculate the Error Residuals\n",
"residuals = ___"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Helper code to plot the data with the residuals\n",
"plt.figure(figsize=(10,6))\n",
"\n",
"plt.plot(x,y,'o',color='#EFAEA4', markersize=6, label=\"True Data\")\n",
"plt.plot(x,residuals,'.-',color='#faa0a6', markersize=6, label=\"Residuals\")\n",
"plt.plot([x.min(),x.max()],[0,0],'--')\n",
"plt.xlim()\n",
"plt.plot(x,y_pred,alpha=0.7,linewidth=3,color='#50AEA4', label='First Tree')\n",
"plt.xlabel(\"Ozone\", fontsize=16)\n",
"plt.ylabel(\"Temperature\", fontsize=16)\n",
"plt.xticks(fontsize=12)\n",
"plt.yticks(fontsize=12)\n",
"plt.legend(loc='center right',fontsize=12)\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_fitted_residuals) ###\n",
"\n",
"# Fit another tree stump on the residuals\n",
"dtr = ___\n",
"___\n",
"\n",
"# Predict on the entire data\n",
"y_pred_residuals = ___"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Helper code to add the fit of the residuals to the original plot \n",
"plt.figure(figsize=(10,6))\n",
"\n",
"plt.plot(x,y,'o',color='#EFAEA4', markersize=6, label=\"True Data\")\n",
"plt.plot(x,residuals,'.-',color='#faa0a6', markersize=6, label=\"Residuals\")\n",
"plt.plot([x.min(),x.max()],[0,0],'--')\n",
"plt.xlim()\n",
"plt.plot(x,y_pred,alpha=0.7,linewidth=3,color='#50AEA4', label='First Tree')\n",
"plt.plot(x,y_pred_residuals,alpha=0.7,linewidth=3,color='red', label='Residual Tree')\n",
"plt.xlabel(\"Ozone\", fontsize=16)\n",
"plt.ylabel(\"Temperature\", fontsize=16)\n",
"plt.xticks(fontsize=12)\n",
"plt.yticks(fontsize=12)\n",
"plt.legend(loc='center right',fontsize=12)\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_new_pred) ###\n",
"\n",
"# Set a lambda value and compute the predictions based on the residuals\n",
"lambda_ = ___\n",
"y_pred_new = ___"
]
},
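{
"cell_type": "markdown",
"metadata": {},
"source": [
"The boosted prediction is the first tree's prediction plus the residual tree's prediction scaled by the learning rate. A sketch, assuming an illustrative learning rate of 0.5 (any value in $(0, 1]$ works here):\n",
"\n",
"```python\n",
"example_lambda = 0.5  # illustrative choice, not prescribed by the exercise\n",
"example_boosted = y_pred + example_lambda * y_pred_residuals\n",
"```\n",
"\n",
"Smaller learning rates shrink each tree's contribution; they typically require more boosting rounds but tend to generalize better."
]
},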
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Helper code to plot the boosted tree\n",
"plt.figure(figsize=(10,8))\n",
"\n",
"plt.plot(x,y,'o',color='#EFAEA4', markersize=6, label=\"True Data\")\n",
"plt.plot(x,residuals,'.-',color='#faa0a6', markersize=6, label=\"Residuals\")\n",
"plt.plot([x.min(),x.max()],[0,0],'--')\n",
"plt.xlim()\n",
"plt.plot(x,y_pred,alpha=0.7,linewidth=3,color='#50AEA4', label='First Tree')\n",
"plt.plot(x,y_pred_residuals,alpha=0.7,linewidth=3,color='red', label='Residual Tree')\n",
"plt.plot(x,y_pred_new,alpha=0.7,linewidth=3,color='k', label='Boosted Tree')\n",
"plt.xlabel(\"Ozone\", fontsize=16)\n",
"plt.ylabel(\"Temperature\", fontsize=16)\n",
"plt.xticks(fontsize=12)\n",
"plt.yticks(fontsize=12)\n",
"plt.legend(loc='center right',fontsize=12)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Mindchow 🍲 \n",
"You can continue doing this! Try at least one more interation. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Part 2: Comparison with Bagging\n",
"\n",
"To compare the two methods, we will be using sklearn's methods and not our own implementation from above. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Split the data into train and test sets with train size as 0.8 \n",
"# and random_state as 102\n",
"# The default value for shuffle is True for train_test_split, so the ordering we \n",
"# did above is not a problem. \n",
"\n",
"x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=102)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_boosting) ###\n",
"\n",
"# Set a learning rate\n",
"l_rate = ___\n",
"\n",
"# Initialise a Boosting model using sklearn's boosting model \n",
"# Use 5000 estimators, depth of 1 and learning rate as defined above\n",
"boosted_model = ___\n",
"\n",
"# Fit on the train data\n",
"___\n",
"\n",
"# Predict on the test data\n",
"y_pred = ___"
]
},
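{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, a minimal sketch of constructing and fitting the boosting model (the keyword arguments are sklearn's; the variable names are illustrative):\n",
"\n",
"```python\n",
"from sklearn.ensemble import GradientBoostingRegressor\n",
"\n",
"example_boost = GradientBoostingRegressor(n_estimators=5000,\n",
"                                          max_depth=1,\n",
"                                          learning_rate=l_rate)\n",
"example_boost.fit(x_train, y_train)\n",
"example_test_preds = example_boost.predict(x_test)\n",
"```"
]
},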
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Train a bagging model\n",
"# Specify the number of bootstraps\n",
"num_bootstraps = 30\n",
"\n",
"# Specify the maximum depth of the decision tree\n",
"max_depth = 100\n",
"\n",
"# Define the Bagging Regressor Model\n",
"# Use Decision Tree as your base estimator with depth as mentioned in max_depth\n",
"# Initialise number of estimators using the num_bootstraps value\n",
"# Set max_samples as 0.8 and random_state as 3\n",
"bagging_model = ___\n",
" \n",
"\n",
"# Fit the model on the train data\n",
"___\n",
"\n",
"# Predict on the test data\n",
"y_pred_bagging = ___ \n",
"\n"
]
},
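{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the bagging setup. Note that the keyword for the base tree was renamed from `base_estimator` to `estimator` in sklearn 1.2; passing it positionally, as below, works in both:\n",
"\n",
"```python\n",
"from sklearn.ensemble import BaggingRegressor\n",
"from sklearn.tree import DecisionTreeRegressor\n",
"\n",
"example_bag = BaggingRegressor(\n",
"    DecisionTreeRegressor(max_depth=max_depth),  # base tree for each bootstrap\n",
"    n_estimators=num_bootstraps,\n",
"    max_samples=0.8,\n",
"    random_state=3,\n",
")\n",
"example_bag.fit(x_train, y_train)\n",
"```"
]
},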
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Helper code to plot the bagging and boosting model predictions\n",
"plt.figure(figsize=(10,8))\n",
"\n",
"xrange = np.linspace(x.min(),x.max(),100).reshape(-1,1)\n",
"y_pred_boost = boosted_model.predict(xrange)\n",
"y_pred_bag = bagging_model.predict(xrange)\n",
"plt.plot(x,y,'o',color='#EFAEA4', markersize=6, label=\"True Data\")\n",
"plt.xlim()\n",
"plt.plot(xrange,y_pred_boost,alpha=0.7,linewidth=3,color='#77c2fc', label='Bagging')\n",
"plt.plot(xrange,y_pred_bag,alpha=0.7,linewidth=3,color='#50AEA4', label='Boosting')\n",
"plt.xlabel(\"Ozone\", fontsize=16)\n",
"plt.ylabel(\"Temperature\", fontsize=16)\n",
"plt.xticks(fontsize=12)\n",
"plt.yticks(fontsize=12)\n",
"plt.legend(loc='best',fontsize=12)\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_mse) ###\n",
"\n",
"# Compute the MSE of the Boosting model prediction on the test data\n",
"\n",
"boost_mse = ___\n",
"print(\"The MSE of the Boosting model is\", boost_mse)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Compute the MSE of the Bagging model prediction on the test data\n",
"\n",
"bag_mse = ___\n",
"print(\"The MSE of the Bagging model is\", bag_mse)"
]
},
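{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of computing a test MSE with `mean_squared_error` (true values first, predictions second):\n",
"\n",
"```python\n",
"from sklearn.metrics import mean_squared_error\n",
"\n",
"example_mse = mean_squared_error(y_test, y_pred_bagging)\n",
"```"
]
},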
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Mindchow 🍲\n",
"\n",
"To be fair, we should fine tune the hyper-parameters for both models. \n",
"\n",
"Go back and play with the `learning rate`,`n_estimators` and `max_depth` for Boosting and `n_estimators` and `max_depth` for Bagging. \n",
"\n",
"How does RF compare? \n",
" \n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Your answer here*"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}