Key Word(s): Boosting, Regression


Title

Exercise: Regression with Boosting

Description

The goal of this exercise is to understand Gradient Boosting Regression by doing!

Instructions:

Part A:

  • Read the dataset airquality.csv as a pandas dataframe.
  • Take a quick look at the dataset.
  • Assign the predictor and response variables appropriately as mentioned in the scaffold.
  • Fit a single decision tree stump and predict on the entire data.
  • Calculate the residuals and fit another tree on the residuals.
  • Combine the two trees (the first tree plus a scaled fit of the residuals) to form the boosted model (see the short note after this list).
  • For each of these models, use the helper code provided to plot the model prediction along with the data.
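
In brief, the procedure is: fit a first tree T1 to the data, compute the residuals r = y − T1(x), fit a second tree T2 to those residuals, and combine the two with a shrinkage factor λ:

boosted prediction = T1(x) + λ · T2(x)

For squared-error loss the residuals are proportional to the negative gradient of the loss, which is why this simple recipe amounts to one step of gradient boosting.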

Part B: Compare to bagging

  • Split the data into train and test splits.
  • Specify the number of bootstraps for bagging to be 30 and a maximum tree depth of 100 (as set in the scaffold).
  • Define a Gradient Boosting Regression model with 5000 estimators and a maximum depth of 1.
  • Define a Bagging Regression model that uses the Decision Tree as its base estimator.
  • Fit both the models on the train data.
  • Use the helper code to plot the predictions of both models along with the true data.
  • Compute the MSE of the 2 models on the test data.

Hints:

train_test_split() : Split arrays or matrices into random train and test subsets.

BaggingRegressor() : Returns a Bagging regressor instance.

DecisionTreeRegressor() : A decision tree regressor.

mean_squared_error() : Mean squared error regression loss.

GradientBoostingRegressor() : Gradient Boosting for regression.

Note: This exercise is auto-graded and you can try multiple attempts.

In [ ]:
# Import necessary libraries

# Evaluate bagging ensemble for regression
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import GradientBoostingRegressor
import matplotlib.pyplot as plt
import pandas as pd 
import itertools
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
%matplotlib inline
In [ ]:
# Read the dataset airquality.csv
df = pd.read_csv("airquality.csv")
In [ ]:
# Take a quick look at the data
# Remove rows with missing values
df = df[df.Ozone.notna()]
df.head()
In [ ]:
# Assign "x" column as the predictor variable and "y" as the
# We only use Ozone as a predictor for this exercise and Temp as the response

x,y = df['Ozone'].values,df['Temp'].values

# Fancy way of sorting on X 
# We do this now because we will not split the data 
# into train/val/test in this part of the exercise

x,y = list(zip(*sorted(zip(x,y))))
x,y = np.array(x).reshape(-1,1),np.array(y)

Part A: Gradient Boosting by hand

In [ ]:
# Fit a single decision tree stump on the entire data

basemodel = ___
___

# Predict on the entire data
y_pred = ___
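
One possible completion of the cell above (a sketch, not necessarily the graded solution): fit a depth-1 DecisionTreeRegressor on the full data and predict on the same x values.

# Possible sketch: a single decision stump (max_depth=1) fit on all of the data
basemodel = DecisionTreeRegressor(max_depth=1)
basemodel.fit(x, y)

# Predict on the entire data
y_pred = basemodel.predict(x)
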
In [ ]:
# Helper code to plot the data

plt.figure(figsize=(10,6))
xrange = np.linspace(x.min(),x.max(),100)
plt.plot(x,y,'o',color='#EFAEA4', markersize=6, label="True Data")
plt.xlim()
plt.plot(x,y_pred,alpha=0.7,linewidth=3,color='#50AEA4', label='First Tree')
plt.xlabel("Ozone", fontsize=16)
plt.ylabel("Temperature", fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.legend(loc='best',fontsize=12)
plt.show()
In [ ]:
### edTest(test_first_residuals) ###

# Calculate the Error Residuals
residuals = ___
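
A minimal sketch, assuming the residuals are defined as observed minus predicted values:

# Possible sketch: residuals of the first tree
residuals = y - y_pred
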
In [ ]:
# Helper code to plot the data with the residuals
plt.figure(figsize=(10,6))

plt.plot(x,y,'o',color='#EFAEA4', markersize=6, label="True Data")
plt.plot(x,residuals,'.-',color='#faa0a6', markersize=6, label="Residuals")
plt.plot([x.min(),x.max()],[0,0],'--')
plt.xlim()
plt.plot(x,y_pred,alpha=0.7,linewidth=3,color='#50AEA4', label='First Tree')
plt.xlabel("Ozone", fontsize=16)
plt.ylabel("Temperature", fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.legend(loc='center right',fontsize=12)
plt.show()
In [ ]:
### edTest(test_fitted_residuals) ###

# Fit another tree stump on the residuals
dtr = ___
___

# Predict on the entire data
y_pred_residuals = ___
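
One possible completion (a sketch): fit another depth-1 tree, this time on the residuals rather than on y.

# Possible sketch: a second stump fit on the residuals of the first tree
dtr = DecisionTreeRegressor(max_depth=1)
dtr.fit(x, residuals)

# Predict the residual corrections over the entire data
y_pred_residuals = dtr.predict(x)
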
In [ ]:
# Helper code to add the fit of the residuals to the original plot 
plt.figure(figsize=(10,6))

plt.plot(x,y,'o',color='#EFAEA4', markersize=6, label="True Data")
plt.plot(x,residuals,'.-',color='#faa0a6', markersize=6, label="Residuals")
plt.plot([x.min(),x.max()],[0,0],'--')
plt.xlim()
plt.plot(x,y_pred,alpha=0.7,linewidth=3,color='#50AEA4', label='First Tree')
plt.plot(x,y_pred_residuals,alpha=0.7,linewidth=3,color='red', label='Residual Tree')
plt.xlabel("Ozone", fontsize=16)
plt.ylabel("Temperature", fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.legend(loc='center right',fontsize=12)
plt.show()
In [ ]:
### edTest(test_new_pred) ###

# Set a lambda value and compute the predictions based on the residuals
lambda_ = ___
y_pred_new = ___
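
A sketch of one possible completion; the value of lambda_ here is an assumption (any value in (0, 1] is reasonable, and the auto-grader may expect a specific one):

# Possible sketch: boosted prediction = first tree + lambda * residual tree
lambda_ = 0.5
y_pred_new = y_pred + lambda_ * y_pred_residuals
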
In [ ]:
# Helper code to plot the boosted tree
plt.figure(figsize=(10,8))

plt.plot(x,y,'o',color='#EFAEA4', markersize=6, label="True Data")
plt.plot(x,residuals,'.-',color='#faa0a6', markersize=6, label="Residuals")
plt.plot([x.min(),x.max()],[0,0],'--')
plt.xlim()
plt.plot(x,y_pred,alpha=0.7,linewidth=3,color='#50AEA4', label='First Tree')
plt.plot(x,y_pred_residuals,alpha=0.7,linewidth=3,color='red', label='Residual Tree')
plt.plot(x,y_pred_new,alpha=0.7,linewidth=3,color='k', label='Boosted Tree')
plt.xlabel("Ozone", fontsize=16)
plt.ylabel("Temperature", fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.legend(loc='center right',fontsize=12)
plt.show()

Mindchow 🍲

You can continue doing this! Try at least one more iteration.

Part B: Comparison with Bagging

To compare the two approaches, we will use sklearn's implementations rather than our own implementation from above.

In [ ]:
# Split the data into train and test sets with train size as 0.8 
# and random_state as 102
# The default value for shuffle is True for train_test_split, so the ordering we 
# did above is not a problem. 

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=102)
In [ ]:
### edTest(test_boosting) ###

# Set a learning rate
l_rate = ___

# Initialise a Boosting model using sklearn's boosting model 
# Use 5000 estimators, depth of 1 and learning rate as defined above
boosted_model  = ___

# Fit on the train data
___

# Predict on the test data
y_pred = ___
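
One way to fill in the boosting cell above (a sketch; the learning rate value is an assumption, not prescribed by the exercise):

# Possible sketch: gradient boosting with 5000 depth-1 trees
l_rate = 0.01
boosted_model = GradientBoostingRegressor(n_estimators=5000, max_depth=1, learning_rate=l_rate)

# Fit on the train data
boosted_model.fit(x_train, y_train)

# Predict on the test data
y_pred = boosted_model.predict(x_test)
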
In [ ]:
# Train a bagging model
# Specify the number of bootstraps
num_bootstraps = 30

# Specify the maximum depth of the decision tree
max_depth = 100

# Define the Bagging Regressor Model
# Use Decision Tree as your base estimator with depth as mentioned in max_depth
# Initialise number of estimators using the num_bootstraps value
# Set max_samples as 0.8 and random_state as 3
bagging_model = ___
                        

# Fit the model on the train data
___

# Predict on the test data
y_pred_bagging = ___ 
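
A possible completion for the bagging cell (a sketch). Note that scikit-learn 1.2+ uses the estimator keyword, while older versions use base_estimator:

# Possible sketch: bagging 30 deep decision trees
bagging_model = BaggingRegressor(
    estimator=DecisionTreeRegressor(max_depth=max_depth),  # base_estimator= on older sklearn
    n_estimators=num_bootstraps,
    max_samples=0.8,
    random_state=3)

# Fit the model on the train data
bagging_model.fit(x_train, y_train)

# Predict on the test data
y_pred_bagging = bagging_model.predict(x_test)
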
In [ ]:
# Helper code to plot the bagging and boosting model predictions
plt.figure(figsize=(10,8))

xrange = np.linspace(x.min(),x.max(),100).reshape(-1,1)
y_pred_boost = boosted_model.predict(xrange)
y_pred_bag = bagging_model.predict(xrange)
plt.plot(x,y,'o',color='#EFAEA4', markersize=6, label="True Data")
plt.xlim()
plt.plot(xrange,y_pred_boost,alpha=0.7,linewidth=3,color='#77c2fc', label='Boosting')
plt.plot(xrange,y_pred_bag,alpha=0.7,linewidth=3,color='#50AEA4', label='Bagging')
plt.xlabel("Ozone", fontsize=16)
plt.ylabel("Temperature", fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.legend(loc='best',fontsize=12)
plt.show()
In [ ]:
### edTest(test_mse) ###

# Compute the MSE of the Boosting model prediction on the test data

boost_mse = ___
print("The MSE of the Boosting model is", boost_mse)
In [ ]:
# Compute the MSE of the Bagging model prediction on the test data

bag_mse = ___
print("The MSE of the Bagging model is", bag_mse)

Mindchow 🍲

To be fair, we should fine tune the hyper-parameters for both models.

Go back and play with the learning rate, n_estimators and max_depth for Boosting, and n_estimators and max_depth for Bagging.

How does a Random Forest compare?

Your answer here