{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Title :\n", "\n", "Exercise: Regression with Boosting\n", " \n", "## Description :\n", "\n", "The goal of this exercise is to understand Gradient Boosting Regression.\n", "\n", "\n", "\n", "## Instructions:\n", "\n", "- Part A: \n", " - Read the dataset airquality.csv as a pandas dataframe.\n", " - Take a quick look at the dataset.\n", " - Assign the predictor and response variables appropriately as mentioned in the scaffold.\n", " - Fit a single decision tree stump and predict on the entire data.\n", " - Calculate the residuals and fit another tree on the residuals.\n", " - Take a combination of the trees and fit on the model.\n", " - For each of these model use the helper code provided to plot the model prediction and data.\n", "\n", "- Part B: Compare to bagging \n", " - Split the data into train and test splits.\n", " - Specify the number of bootstraps for bagging to be 30 and a maximum depth of 3.\n", " - Define a Gradient Boosting Regression model that uses with 1000 estimators and depth of 1.\n", " - Define a Bagging Regression model that uses the Decision Tree as its base estimator.\n", " - Fit both the models on the train data.\n", " - Use the helper code to predict using the mean model and individual estimators. 
The plot will look similar to the one given above.\n", " - Compute the MSE of the two models on the test data.\n", "\n", "## Hints: \n", "\n", "sklearn.DecisionTreeRegressor()\n", "A decision tree regressor.\n", "\n", "regressor.fit()\n", "Build a decision tree regressor from the training set (X, y).\n", "\n", "sklearn.DecisionTreeClassifier()\n", "A decision tree classifier.\n", "\n", "classifier.fit()\n", "Build a decision tree classifier from the training set (X, y).\n", "\n", "sklearn.train_test_split()\n", "Split arrays or matrices into random train and test subsets.\n", "\n", "BaggingRegressor()\n", "Returns a Bagging regressor instance.\n", "\n", "sklearn.mean_squared_error()\n", "Mean squared error regression loss.\n", "\n", "GradientBoostingRegressor()\n", "Gradient Boosting for regression.\n", "\n", "**Note:** This exercise is **auto-graded and you can try multiple attempts.**" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Import necessary libraries\n", "import itertools\n", "import numpy as np\n", "import pandas as pd \n", "import matplotlib.pyplot as plt\n", "from sklearn.ensemble import BaggingRegressor\n", "from sklearn.tree import DecisionTreeRegressor\n", "from sklearn.metrics import mean_squared_error\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.ensemble import GradientBoostingRegressor\n", "%matplotlib inline\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Read the dataset airquality.csv\n", "df = pd.read_csv(\"airquality.csv\")\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Take a quick look at the data\n", "# Remove rows with missing values\n", "df = df[df.Ozone.notna()]\n", "df.head()\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Assign \"x\" column as the predictor variable and \"y\" as the response\n", "# We 
only use Ozone as a predictor for this exercise and Temp as the response\n", "x, y = df['Ozone'].values, df['Temp'].values\n", "\n", "# Sorting the data based on X values\n", "x, y = list(zip(*sorted(zip(x,y))))\n", "x, y = np.array(x).reshape(-1,1),np.array(y)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part A: Gradient Boosting by hand" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Initialise a single decision tree stump\n", "basemodel = ___\n", "\n", "# Fit the stump on the entire data\n", "___\n", "\n", "# Predict on the entire data\n", "y_pred = ___\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Helper code to plot the data\n", "plt.figure(figsize=(10,6))\n", "xrange = np.linspace(x.min(),x.max(),100)\n", "plt.plot(x,y,'o',color='#EFAEA4', markersize=6, label=\"True Data\")\n", "plt.xlim()\n", "plt.plot(x,y_pred,alpha=0.7,linewidth=3,color='#50AEA4', label='First Tree')\n", "plt.xlabel(\"Ozone\", fontsize=16)\n", "plt.ylabel(\"Temperature\", fontsize=16)\n", "plt.xticks(fontsize=12)\n", "plt.yticks(fontsize=12)\n", "plt.legend(loc='best',fontsize=12)\n", "plt.show()\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "### edTest(test_first_residuals) ###\n", "\n", "# Calculate the error residuals\n", "residuals = ___\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Helper code to plot the data with the residuals\n", "plt.figure(figsize=(10,6))\n", "plt.plot(x,y,'o',color='#EFAEA4', markersize=6, label=\"True Data\")\n", "plt.plot(x,residuals,'.-',color='#faa0a6', markersize=6, label=\"Residuals\")\n", "plt.plot([x.min(),x.max()],[0,0],'--')\n", "plt.xlim()\n", "plt.plot(x,y_pred,alpha=0.7,linewidth=3,color='#50AEA4', label='First Tree')\n", "plt.xlabel(\"Ozone\", fontsize=16)\n", "plt.ylabel(\"Temperature\", fontsize=16)\n", 
"plt.xticks(fontsize=12)\n", "plt.yticks(fontsize=12)\n", "plt.legend(loc='center right',fontsize=12)\n", "plt.show()\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "### edTest(test_fitted_residuals) ###\n", "\n", "# Initialise a tree stump\n", "dtr = ___\n", "\n", "# Fit the tree stump on the residuals\n", "___\n", "\n", "# Predict on the entire data\n", "y_pred_residuals = ___\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Helper code to add the fit of the residuals to the original plot \n", "plt.figure(figsize=(10,6))\n", "\n", "plt.plot(x,y,'o',color='#EFAEA4', markersize=6, label=\"True Data\")\n", "plt.plot(x,residuals,'.-',color='#faa0a6', markersize=6, label=\"Residuals\")\n", "plt.plot([x.min(),x.max()],[0,0],'--')\n", "plt.xlim()\n", "plt.plot(x,y_pred,alpha=0.7,linewidth=3,color='#50AEA4', label='First Tree')\n", "plt.plot(x,y_pred_residuals,alpha=0.7,linewidth=3,color='red', label='Residual Tree')\n", "plt.xlabel(\"Ozone\", fontsize=16)\n", "plt.ylabel(\"Temperature\", fontsize=16)\n", "plt.xticks(fontsize=12)\n", "plt.yticks(fontsize=12)\n", "plt.legend(loc='center right',fontsize=12)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "### edTest(test_new_pred) ###\n", "\n", "# Set a lambda value and compute the predictions based on \n", "# the residuals\n", "lambda_ = ___\n", "y_pred_new = ___\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Helper code to plot the boosted tree\n", "plt.figure(figsize=(10,8))\n", "plt.plot(x,y,'o',color='#EFAEA4', markersize=6, label=\"True Data\")\n", "plt.plot(x,residuals,'.-',color='#faa0a6', markersize=6, label=\"Residuals\")\n", "plt.plot([x.min(),x.max()],[0,0],'--')\n", "plt.xlim()\n", "plt.plot(x,y_pred,alpha=0.7,linewidth=3,color='#50AEA4', label='First Tree')\n", 
"plt.plot(x,y_pred_residuals,alpha=0.7,linewidth=3,color='red', label='Residual Tree')\n", "plt.plot(x,y_pred_new,alpha=0.7,linewidth=3,color='k', label='Boosted Tree')\n", "plt.xlabel(\"Ozone\", fontsize=16)\n", "plt.ylabel(\"Temperature\", fontsize=16)\n", "plt.xticks(fontsize=12)\n", "plt.yticks(fontsize=12)\n", "plt.legend(loc='center right',fontsize=12)\n", "plt.show()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 2: Comparison with Bagging\n", "\n", "To compare the two methods, we will be using sklearn's methods and not our own implementation from above. " ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Split the data into train and test sets with train size as 0.8 \n", "# and random_state as 102\n", "# The default value for shuffle is True for train_test_split, so the ordering we \n", "# did above is not a problem. \n", "x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=102)\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "### edTest(test_boosting) ###\n", "\n", "# Set a learning rate\n", "l_rate = ___\n", "\n", "# Initialise a Boosting model using sklearn's boosting model \n", "# Use 1000 estimators, depth of 1 and learning rate as defined above\n", "boosted_model = ___\n", "\n", "# Fit on the train data\n", "___\n", "\n", "# Predict on the test data\n", "y_pred = ___\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Specify the number of bootstraps\n", "num_bootstraps = 30\n", "\n", "# Specify the maximum depth of the decision tree\n", "max_depth = 100\n", "\n", "# Define the Bagging Regressor Model\n", "# Use Decision Tree as your base estimator with depth as mentioned in max_depth\n", "# Initialise number of estimators using the num_bootstraps value\n", "# Set max_samples as 1 and random_state as 3\n", "model = ___\n", " \n", "\n", "# Fit the model on 
the train data\n", "___\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Helper code to plot the bagging and boosting model predictions\n", "plt.figure(figsize=(10,8))\n", "xrange = np.linspace(x.min(),x.max(),100).reshape(-1,1)\n", "y_pred_boost = boosted_model.predict(xrange)\n", "y_pred_bag = model.predict(xrange)\n", "plt.plot(x,y,'o',color='#EFAEA4', markersize=6, label=\"True Data\")\n", "plt.xlim()\n", "plt.plot(xrange,y_pred_boost,alpha=0.7,linewidth=3,color='#77c2fc', label='Bagging')\n", "plt.plot(xrange,y_pred_bag,alpha=0.7,linewidth=3,color='#50AEA4', label='Boosting')\n", "plt.xlabel(\"Ozone\", fontsize=16)\n", "plt.ylabel(\"Temperature\", fontsize=16)\n", "plt.xticks(fontsize=12)\n", "plt.yticks(fontsize=12)\n", "plt.legend(loc='best',fontsize=12)\n", "plt.show()\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "### edTest(test_mse) ###\n", "\n", "# Compute the MSE of the Boosting model prediction on the test data\n", "boost_mse = ___\n", "print(\"The MSE of the Boosting model is\", boost_mse)\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Compute the MSE of the Bagging model prediction on the test data\n", "bag_mse = ___\n", "print(\"The MSE of the Bagging model is\", bag_mse)\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }