Key Word(s): Gradient Boosting, Comparison of Models, Residuals, Gradient Descent, Learning Rate
Title:
Exercise: Regression with Boosting
Description:
The goal of this exercise is to understand Gradient Boosting Regression.
Instructions:
Part A:
- Read the dataset airquality.csv as a pandas dataframe.
- Take a quick look at the dataset.
- Assign the predictor and response variables appropriately as mentioned in the scaffold.
- Fit a single decision tree stump and predict on the entire data.
- Calculate the residuals and fit another tree on the residuals.
- Combine the two trees by adding a scaled version of the residual tree's predictions to the first tree's predictions.
- For each of these models, use the helper code provided to plot the model predictions along with the data.
Part B: Compare to bagging
- Split the data into train and test splits.
- Specify the number of bootstraps for bagging to be 30 and the maximum depth of the base decision tree to be 100.
- Define a Gradient Boosting Regression model with 1000 estimators and a maximum depth of 1.
- Define a Bagging Regression model that uses a decision tree as its base estimator.
- Fit both models on the train data.
- Use the helper code to plot the predictions of both models along with the data.
- Compute the MSE of the two models on the test data.
Hints:
sklearn.DecisionTreeRegressor() A decision tree regressor.
regressor.fit() Build a decision tree regressor from the training set (X, y).
sklearn.DecisionTreeClassifier() A decision tree classifier.
classifier.fit() Build a decision tree classifier from the training set (X, y).
sklearn.train_test_split() Split arrays or matrices into random train and test subsets.
BaggingRegressor() Returns a Bagging regressor instance.
sklearn.mean_squared_error() Mean squared error regression loss.
GradientBoostingRegressor() Gradient Boosting for regression.
Note: This exercise is auto-graded and you can try multiple attempts.
# Import necessary libraries
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
%matplotlib inline
# Read the dataset airquality.csv
df = pd.read_csv("airquality.csv")
# Take a quick look at the data
# Remove rows with missing values
df = df[df.Ozone.notna()]
df.head()
# Assign the "x" column as the predictor variable and "y" as the response
# We only use Ozone as a predictor for this exercise and Temp as the response
x, y = df['Ozone'].values, df['Temp'].values
# Sorting the data based on X values
x, y = list(zip(*sorted(zip(x,y))))
x, y = np.array(x).reshape(-1,1),np.array(y)
Part A: Gradient Boosting by hand
# Initialise a single decision tree stump
basemodel = ___
# Fit the stump on the entire data
___
# Predict on the entire data
y_pred = ___
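A minimal sketch of one way to complete the scaffold above, assuming a "stump" means a decision tree of depth 1:

# Possible completion (assumption: a stump is a DecisionTreeRegressor with max_depth=1)
basemodel = DecisionTreeRegressor(max_depth=1)
# Fit the stump on all of the data, then predict on all of the data
basemodel.fit(x, y)
y_pred = basemodel.predict(x)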
# Helper code to plot the data
plt.figure(figsize=(10,6))
plt.plot(x,y,'o',color='#EFAEA4', markersize=6, label="True Data")
plt.plot(x,y_pred,alpha=0.7,linewidth=3,color='#50AEA4', label='First Tree')
plt.xlabel("Ozone", fontsize=16)
plt.ylabel("Temperature", fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.legend(loc='best',fontsize=12)
plt.show()
### edTest(test_first_residuals) ###
# Calculate the error residuals
residuals = ___
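The residuals are the part of the response the first tree failed to capture; a minimal sketch:

# Possible completion: residual = true response minus the first tree's prediction
residuals = y - y_pred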
# Helper code to plot the data with the residuals
plt.figure(figsize=(10,6))
plt.plot(x,y,'o',color='#EFAEA4', markersize=6, label="True Data")
plt.plot(x,residuals,'.-',color='#faa0a6', markersize=6, label="Residuals")
plt.plot([x.min(),x.max()],[0,0],'--')
plt.plot(x,y_pred,alpha=0.7,linewidth=3,color='#50AEA4', label='First Tree')
plt.xlabel("Ozone", fontsize=16)
plt.ylabel("Temperature", fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.legend(loc='center right',fontsize=12)
plt.show()
### edTest(test_fitted_residuals) ###
# Initialise a tree stump
dtr = ___
# Fit the tree stump on the residuals
___
# Predict on the entire data
y_pred_residuals = ___
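A sketch of the second stage, fitting another stump to the residuals (same depth-1 assumption as above):

# Possible completion: fit a second stump to the residuals of the first tree
dtr = DecisionTreeRegressor(max_depth=1)
dtr.fit(x, residuals)
y_pred_residuals = dtr.predict(x)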
# Helper code to add the fit of the residuals to the original plot
plt.figure(figsize=(10,6))
plt.plot(x,y,'o',color='#EFAEA4', markersize=6, label="True Data")
plt.plot(x,residuals,'.-',color='#faa0a6', markersize=6, label="Residuals")
plt.plot([x.min(),x.max()],[0,0],'--')
plt.plot(x,y_pred,alpha=0.7,linewidth=3,color='#50AEA4', label='First Tree')
plt.plot(x,y_pred_residuals,alpha=0.7,linewidth=3,color='red', label='Residual Tree')
plt.xlabel("Ozone", fontsize=16)
plt.ylabel("Temperature", fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.legend(loc='center right',fontsize=12)
plt.show()
### edTest(test_new_pred) ###
# Set a lambda value and compute the new boosted predictions by adding
# lambda times the residual-tree predictions to the first tree's predictions
lambda_ = ___
y_pred_new = ___
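A sketch of the combination step; the lambda value here is an illustrative choice, not a prescribed one:

# Possible completion: boosted prediction = first tree + lambda * residual tree
lambda_ = 0.5  # illustrative choice of learning rate
y_pred_new = y_pred + lambda_ * y_pred_residuals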
# Helper code to plot the boosted tree
plt.figure(figsize=(10,8))
plt.plot(x,y,'o',color='#EFAEA4', markersize=6, label="True Data")
plt.plot(x,residuals,'.-',color='#faa0a6', markersize=6, label="Residuals")
plt.plot([x.min(),x.max()],[0,0],'--')
plt.plot(x,y_pred,alpha=0.7,linewidth=3,color='#50AEA4', label='First Tree')
plt.plot(x,y_pred_residuals,alpha=0.7,linewidth=3,color='red', label='Residual Tree')
plt.plot(x,y_pred_new,alpha=0.7,linewidth=3,color='k', label='Boosted Tree')
plt.xlabel("Ozone", fontsize=16)
plt.ylabel("Temperature", fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.legend(loc='center right',fontsize=12)
plt.show()
Part B: Comparison with Bagging
To compare the two methods, we will use sklearn's implementations rather than our own implementation from above.
# Split the data into train and test sets with train size as 0.8
# and random_state as 102
# The default value for shuffle is True for train_test_split, so the ordering we
# did above is not a problem.
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=102)
### edTest(test_boosting) ###
# Set a learning rate
l_rate = ___
# Initialise a boosting model using sklearn's GradientBoostingRegressor
# Use 1000 estimators, a maximum depth of 1, and the learning rate defined above
boosted_model = ___
# Fit on the train data
___
# Predict on the test data
y_pred = ___
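A sketch of the sklearn boosting model; the learning rate of 0.1 is an illustrative choice:

# Possible completion (l_rate = 0.1 is an illustrative choice)
l_rate = 0.1
boosted_model = GradientBoostingRegressor(n_estimators=1000, max_depth=1, learning_rate=l_rate)
boosted_model.fit(x_train, y_train)
y_pred = boosted_model.predict(x_test)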
# Specify the number of bootstraps
num_bootstraps = 30
# Specify the maximum depth of the decision tree
max_depth = 100
# Define the Bagging Regressor Model
# Use a decision tree as your base estimator with the depth given by max_depth
# Initialise the number of estimators using the num_bootstraps value
# Set max_samples to 1.0 (each bootstrap sample is as large as the train set) and random_state as 3
model = ___
# Fit the model on the train data
___
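A sketch of the bagging model; note that the base-estimator keyword is estimator in scikit-learn >= 1.2 and base_estimator in older versions:

# Possible completion (keyword is `estimator` in sklearn >= 1.2, `base_estimator` before)
model = BaggingRegressor(estimator=DecisionTreeRegressor(max_depth=max_depth),
                         n_estimators=num_bootstraps,
                         max_samples=1.0,
                         random_state=3)
model.fit(x_train, y_train)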
# Helper code to plot the bagging and boosting model predictions
plt.figure(figsize=(10,8))
xrange = np.linspace(x.min(),x.max(),100).reshape(-1,1)
y_pred_boost = boosted_model.predict(xrange)
y_pred_bag = model.predict(xrange)
plt.plot(x,y,'o',color='#EFAEA4', markersize=6, label="True Data")
plt.plot(xrange,y_pred_boost,alpha=0.7,linewidth=3,color='#77c2fc', label='Boosting')
plt.plot(xrange,y_pred_bag,alpha=0.7,linewidth=3,color='#50AEA4', label='Bagging')
plt.xlabel("Ozone", fontsize=16)
plt.ylabel("Temperature", fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.legend(loc='best',fontsize=12)
plt.show()
### edTest(test_mse) ###
# Compute the MSE of the Boosting model prediction on the test data
boost_mse = ___
print("The MSE of the Boosting model is", boost_mse)
# Compute the MSE of the Bagging model prediction on the test data
bag_mse = ___
print("The MSE of the Bagging model is", bag_mse)
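One possible completion of the two MSE cells above, scoring each fitted model on the held-out test data:

# Possible completion: compare both fitted models on the test set
boost_mse = mean_squared_error(y_test, boosted_model.predict(x_test))
bag_mse = mean_squared_error(y_test, model.predict(x_test))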