{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Title :\n",
"\n",
"Exercise: Regression with Boosting\n",
" \n",
"## Description :\n",
"\n",
"The goal of this exercise is to understand Gradient Boosting Regression.\n",
"\n",
"
\n",
"\n",
"## Instructions:\n",
"\n",
"- Part A: \n",
" - Read the dataset airquality.csv as a pandas dataframe.\n",
" - Take a quick look at the dataset.\n",
" - Assign the predictor and response variables appropriately as mentioned in the scaffold.\n",
" - Fit a single decision tree stump and predict on the entire data.\n",
" - Calculate the residuals and fit another tree on the residuals.\n",
" - Take a combination of the trees and fit on the model.\n",
" - For each of these model use the helper code provided to plot the model prediction and data.\n",
"\n",
"- Part B: Compare to bagging \n",
" - Split the data into train and test splits.\n",
" - Specify the number of bootstraps for bagging to be 30 and a maximum depth of 3.\n",
" - Define a Gradient Boosting Regression model that uses with 1000 estimators and depth of 1.\n",
" - Define a Bagging Regression model that uses the Decision Tree as its base estimator.\n",
" - Fit both the models on the train data.\n",
" - Use the helper code to predict using the mean model and individual estimators. The plot will look similar to the one given above.\n",
" - Compute the MSE of the 2 models on the test data.\n",
"\n",
"## Hints: \n",
"\n",
"sklearn.DecisionTreeRegressor()\n",
"A decision tree regressor.\n",
"\n",
"regressor.fit()\n",
"Build a decision tree regressor from the training set (X, y).\n",
"\n",
"sklearn.DecisionTreeClassifier()\n",
"Generates a Logistic Regression classifier.\n",
"\n",
"classifier.fit()\n",
"Build a decision tree classifier from the training set (X, y).\n",
"\n",
"sklearn.train_test_split()\n",
"Split arrays or matrices into om train and test subsets.\n",
"\n",
"BaggingRegressor()\n",
"Returns a Bagging regressor instance.\n",
"\n",
"sklearn.mean_squared_error()\n",
"Mean squared error regression loss.\n",
"\n",
"GradientBoostingRegressor()\n",
"Gradient Boosting for regression.\n",
"\n",
"**Note:** This exercise is **auto-graded and you can try multiple attempts.**"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Import necessary libraries\n",
"import itertools\n",
"import numpy as np\n",
"import pandas as pd \n",
"import matplotlib.pyplot as plt\n",
"from sklearn.ensemble import BaggingRegressor\n",
"from sklearn.tree import DecisionTreeRegressor\n",
"from sklearn.metrics import mean_squared_error\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.ensemble import GradientBoostingRegressor\n",
"%matplotlib inline\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Read the dataset airquality.csv\n",
"df = pd.read_csv(\"airquality.csv\")\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Take a quick look at the data\n",
"# Remove rows with missing values\n",
"df = df[df.Ozone.notna()]\n",
"df.head()\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Assign \"x\" column as the predictor variable and \"y\" as the\n",
"# We only use Ozone as a predictor for this exercise and Temp as the response\n",
"x, y = df['Ozone'].values, df['Temp'].values\n",
"\n",
"# Sorting the data based on X values\n",
"x, y = list(zip(*sorted(zip(x,y))))\n",
"x, y = np.array(x).reshape(-1,1),np.array(y)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part A: Gradient Boosting by hand"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Initialise a single decision tree stump\n",
"basemodel = ___\n",
"\n",
"# Fit the stump on the entire data\n",
"___\n",
"\n",
"# Predict on the entire data\n",
"y_pred = ___\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Helper code to plot the data\n",
"plt.figure(figsize=(10,6))\n",
"xrange = np.linspace(x.min(),x.max(),100)\n",
"plt.plot(x,y,'o',color='#EFAEA4', markersize=6, label=\"True Data\")\n",
"plt.xlim()\n",
"plt.plot(x,y_pred,alpha=0.7,linewidth=3,color='#50AEA4', label='First Tree')\n",
"plt.xlabel(\"Ozone\", fontsize=16)\n",
"plt.ylabel(\"Temperature\", fontsize=16)\n",
"plt.xticks(fontsize=12)\n",
"plt.yticks(fontsize=12)\n",
"plt.legend(loc='best',fontsize=12)\n",
"plt.show()\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_first_residuals) ###\n",
"\n",
"# Calculate the error residuals\n",
"residuals = ___\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Helper code to plot the data with the residuals\n",
"plt.figure(figsize=(10,6))\n",
"plt.plot(x,y,'o',color='#EFAEA4', markersize=6, label=\"True Data\")\n",
"plt.plot(x,residuals,'.-',color='#faa0a6', markersize=6, label=\"Residuals\")\n",
"plt.plot([x.min(),x.max()],[0,0],'--')\n",
"plt.xlim()\n",
"plt.plot(x,y_pred,alpha=0.7,linewidth=3,color='#50AEA4', label='First Tree')\n",
"plt.xlabel(\"Ozone\", fontsize=16)\n",
"plt.ylabel(\"Temperature\", fontsize=16)\n",
"plt.xticks(fontsize=12)\n",
"plt.yticks(fontsize=12)\n",
"plt.legend(loc='center right',fontsize=12)\n",
"plt.show()\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_fitted_residuals) ###\n",
"\n",
"# Initialise a tree stump\n",
"dtr = ___\n",
"\n",
"# Fit the tree stump on the residuals\n",
"___\n",
"\n",
"# Predict on the entire data\n",
"y_pred_residuals = ___\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Helper code to add the fit of the residuals to the original plot \n",
"plt.figure(figsize=(10,6))\n",
"\n",
"plt.plot(x,y,'o',color='#EFAEA4', markersize=6, label=\"True Data\")\n",
"plt.plot(x,residuals,'.-',color='#faa0a6', markersize=6, label=\"Residuals\")\n",
"plt.plot([x.min(),x.max()],[0,0],'--')\n",
"plt.xlim()\n",
"plt.plot(x,y_pred,alpha=0.7,linewidth=3,color='#50AEA4', label='First Tree')\n",
"plt.plot(x,y_pred_residuals,alpha=0.7,linewidth=3,color='red', label='Residual Tree')\n",
"plt.xlabel(\"Ozone\", fontsize=16)\n",
"plt.ylabel(\"Temperature\", fontsize=16)\n",
"plt.xticks(fontsize=12)\n",
"plt.yticks(fontsize=12)\n",
"plt.legend(loc='center right',fontsize=12)\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_new_pred) ###\n",
"\n",
"# Set a lambda value and compute the predictions based on \n",
"# the residuals\n",
"lambda_ = ___\n",
"y_pred_new = ___\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Helper code to plot the boosted tree\n",
"plt.figure(figsize=(10,8))\n",
"plt.plot(x,y,'o',color='#EFAEA4', markersize=6, label=\"True Data\")\n",
"plt.plot(x,residuals,'.-',color='#faa0a6', markersize=6, label=\"Residuals\")\n",
"plt.plot([x.min(),x.max()],[0,0],'--')\n",
"plt.xlim()\n",
"plt.plot(x,y_pred,alpha=0.7,linewidth=3,color='#50AEA4', label='First Tree')\n",
"plt.plot(x,y_pred_residuals,alpha=0.7,linewidth=3,color='red', label='Residual Tree')\n",
"plt.plot(x,y_pred_new,alpha=0.7,linewidth=3,color='k', label='Boosted Tree')\n",
"plt.xlabel(\"Ozone\", fontsize=16)\n",
"plt.ylabel(\"Temperature\", fontsize=16)\n",
"plt.xticks(fontsize=12)\n",
"plt.yticks(fontsize=12)\n",
"plt.legend(loc='center right',fontsize=12)\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Part 2: Comparison with Bagging\n",
"\n",
"To compare the two methods, we will be using sklearn's methods and not our own implementation from above. "
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Split the data into train and test sets with train size as 0.8 \n",
"# and random_state as 102\n",
"# The default value for shuffle is True for train_test_split, so the ordering we \n",
"# did above is not a problem. \n",
"x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=102)\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_boosting) ###\n",
"\n",
"# Set a learning rate\n",
"l_rate = ___\n",
"\n",
"# Initialise a Boosting model using sklearn's boosting model \n",
"# Use 1000 estimators, depth of 1 and learning rate as defined above\n",
"boosted_model = ___\n",
"\n",
"# Fit on the train data\n",
"___\n",
"\n",
"# Predict on the test data\n",
"y_pred = ___\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Specify the number of bootstraps\n",
"num_bootstraps = 30\n",
"\n",
"# Specify the maximum depth of the decision tree\n",
"max_depth = 100\n",
"\n",
"# Define the Bagging Regressor Model\n",
"# Use Decision Tree as your base estimator with depth as mentioned in max_depth\n",
"# Initialise number of estimators using the num_bootstraps value\n",
"# Set max_samples as 1 and random_state as 3\n",
"model = ___\n",
" \n",
"\n",
"# Fit the model on the train data\n",
"___\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Helper code to plot the bagging and boosting model predictions\n",
"plt.figure(figsize=(10,8))\n",
"xrange = np.linspace(x.min(),x.max(),100).reshape(-1,1)\n",
"y_pred_boost = boosted_model.predict(xrange)\n",
"y_pred_bag = model.predict(xrange)\n",
"plt.plot(x,y,'o',color='#EFAEA4', markersize=6, label=\"True Data\")\n",
"plt.xlim()\n",
"plt.plot(xrange,y_pred_boost,alpha=0.7,linewidth=3,color='#77c2fc', label='Bagging')\n",
"plt.plot(xrange,y_pred_bag,alpha=0.7,linewidth=3,color='#50AEA4', label='Boosting')\n",
"plt.xlabel(\"Ozone\", fontsize=16)\n",
"plt.ylabel(\"Temperature\", fontsize=16)\n",
"plt.xticks(fontsize=12)\n",
"plt.yticks(fontsize=12)\n",
"plt.legend(loc='best',fontsize=12)\n",
"plt.show()\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_mse) ###\n",
"\n",
"# Compute the MSE of the Boosting model prediction on the test data\n",
"boost_mse = ___\n",
"print(\"The MSE of the Boosting model is\", boost_mse)\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Compute the MSE of the Bagging model prediction on the test data\n",
"bag_mse = ___\n",
"print(\"The MSE of the Bagging model is\", bag_mse)\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}