Key Word(s): bagging


Title

Exercise: Regression with Bagging

Description

The aim of this exercise is to understand bagging regression.

Instructions:

  • Read the dataset airquality.csv as a pandas dataframe.
  • Take a quick look at the dataset.
  • Split the data into train and test sets.
  • Specify the number of bootstraps as 30 and a maximum depth of 3.
  • Define a Bagging Regression model that uses Decision Tree as its base estimator.
  • Fit the model on the train data.
  • Use the helper code to predict using the mean model and individual estimators. The plot will look similar to the one given above.
  • Predict on the test data using the first estimator and the mean model.
  • Compute and display the test MSEs.
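
The "number of bootstraps" in the steps above is the number of resampled copies of the training set, each drawn with replacement and each used to fit one tree. A minimal sketch of that resampling with NumPy follows; the toy array `x_toy`, the seed, and the small value of `num_bootstraps` are illustrative assumptions, not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

# Hypothetical toy data standing in for the training set
x_toy = np.arange(10)

num_bootstraps = 3
n = len(x_toy)

bootstrap_samples = []
for b in range(num_bootstraps):
    # Draw n indices with replacement: some rows repeat, others are left out
    idx = rng.integers(0, n, size=n)
    bootstrap_samples.append(x_toy[idx])
    print(f"bootstrap {b}: {np.sort(x_toy[idx])}")
```

Each printed sample has the same size as the original data but a different mix of rows, which is why the individual trees differ from one another.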

Hints:

sklearn.model_selection.train_test_split() : Split arrays or matrices into random train and test subsets.

BaggingRegressor() : Returns a Bagging regressor instance.

DecisionTreeRegressor() : A decision tree regressor.

BaggingRegressor().estimators_ : The list of fitted base estimators, available after calling fit. Index into it to access any individual estimator.

sklearn.metrics.mean_squared_error() : Mean squared error regression loss.
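
As a rough illustration of how the hinted functions fit together, here is a small sketch on synthetic data. The data, the variable names (`x_demo`, `demo_model`, and so on) and the parameter values are assumptions chosen for the demonstration; they are not the values the exercise asks for.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the airquality dataset)
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(x_demo[:, 0]) + rng.normal(scale=0.3, size=200)

xd_train, xd_test, yd_train, yd_test = train_test_split(
    x_demo, y_demo, train_size=0.8, random_state=0
)

# The base estimator is passed positionally because its keyword name
# differs across scikit-learn versions (base_estimator vs. estimator)
demo_model = BaggingRegressor(
    DecisionTreeRegressor(max_depth=2), n_estimators=5, random_state=0
)
demo_model.fit(xd_train, yd_train)

# estimators_ holds the fitted trees, one per bootstrap sample
first_tree = demo_model.estimators_[0]
print("number of fitted trees:", len(demo_model.estimators_))
print("single-tree test MSE:", mean_squared_error(yd_test, first_tree.predict(xd_test)))
print("bagged test MSE:", mean_squared_error(yd_test, demo_model.predict(xd_test)))
```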

In [2]:
# Import necessary libraries

import numpy as np
from numpy import mean
from numpy import std
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
import matplotlib.pyplot as plt
import pandas as pd 
import itertools
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
%matplotlib inline
In [3]:
# Read the dataset
df = pd.read_csv("airquality.csv",index_col=0)
In [10]:
# Take a quick look at the data
df.head(10)
In [12]:
# We will only use Ozone as the predictor in this exercise. Drop the rows where Ozone is missing
df = df[df.Ozone.notna()]
In [17]:
# Assign Ozone as the predictor variable "x" and Temp as the response variable "y"
x = df[['Ozone']].values
y = df['Temp']
In [20]:
# Split the data into train and test sets with train size as 0.8 and random_state as 102
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=102)

Bagging Regressor

In [21]:
# Specify the number of bootstraps as 30
num_bootstraps = 30

# Specify the maximum depth of the decision tree as 3
max_depth = 3

# Define the Bagging Regressor Model
# Use Decision Tree as your base estimator with depth as mentioned in max_depth
# Initialise number of estimators using the num_bootstraps value
model = ___
                        

# Fit the model on the train data
___
Out[21]:
    Ozone  Solar.R  Wind  Temp  Month  Day
1    41.0    190.0   7.4    67      5    1
2    36.0    118.0   8.0    72      5    2
3    12.0    149.0  12.6    74      5    3
4    18.0    313.0  11.5    62      5    4
6    28.0      NaN  14.9    66      5    6
7    23.0    299.0   8.6    65      5    7
8    19.0     99.0  13.8    59      5    8
9     8.0     19.0  20.1    61      5    9
11    7.0      NaN   6.9    74      5   11
12   16.0    256.0   9.7    69      5   12
13   11.0    290.0   9.2    66      5   13
14   14.0    274.0  10.9    68      5   14
15   18.0     65.0  13.2    58      5   15
16   14.0    334.0  11.5    64      5   16
17   34.0    307.0  12.0    66      5   17
18    6.0     78.0  18.4    57      5   18
19   30.0    322.0  11.5    68      5   19
20   11.0     44.0   9.7    62      5   20
21    1.0      8.0   9.7    59      5   21
22   11.0    320.0  16.6    73      5   22
23    4.0     25.0   9.7    61      5   23
24   32.0     92.0  12.0    61      5   24
28   23.0     13.0  12.0    67      5   28
In [24]:
# Helper code to plot the predictions of the individual estimators and of the bagged (mean) model
plt.figure(figsize=(10,8))

xrange = np.linspace(x.min(),x.max(),80).reshape(-1,1)
plt.plot(x_train,y_train,'o',color='#EFAEA4', markersize=6, label="Train Data")
plt.plot(x_test,y_test,'o',color='#F6345E', markersize=6, label="Test Data")

plt.xlim()
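# Plot each individual estimator's prediction as a thin line
for i in model.estimators_:
    y_pred1 = i.predict(xrange)
    plt.plot(xrange,y_pred1,alpha=0.5,linewidth=0.5,color = '#ABCCE3')
# Re-plot the last estimator with a label so the legend gets a single entry for the individual estimators
plt.plot(xrange,y_pred1,alpha=0.6,linewidth=1,color = '#ABCCE3',label="Prediction of Individual Estimators")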


y_pred = model.predict(xrange)
plt.plot(xrange,y_pred,alpha=0.7,linewidth=3,color='#50AEA4', label='Model Prediction')
plt.xlabel("Ozone", fontsize=16)
plt.ylabel("Temperature", fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.legend(loc='best',fontsize=12)
plt.show()
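
The thick "Model Prediction" curve in the helper plot is the mean of the thin per-estimator curves. The sketch below checks this on synthetic data by averaging the individual trees by hand and comparing against `predict()`. The names `x_syn`, `bag`, and `grid` and all parameter values are hypothetical, and the check assumes the default `max_features=1.0`, so that every tree sees all features.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the airquality dataset)
rng = np.random.default_rng(1)
x_syn = rng.uniform(0, 10, size=(150, 1))
y_syn = 0.5 * x_syn[:, 0] + rng.normal(scale=1.0, size=150)

bag = BaggingRegressor(
    DecisionTreeRegressor(max_depth=3), n_estimators=10, random_state=1
)
bag.fit(x_syn, y_syn)

grid = np.linspace(0, 10, 50).reshape(-1, 1)

# Stack each tree's predictions: shape (n_estimators, n_grid_points)
per_tree = np.array([tree.predict(grid) for tree in bag.estimators_])
manual_mean = per_tree.mean(axis=0)

# Compare the hand-computed average with the ensemble's own prediction
print("average matches predict():", np.allclose(manual_mean, bag.predict(grid)))

# Spread of the individual trees around the mean at each grid point
print("largest per-point std across trees:", per_tree.std(axis=0).max())
```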
In [25]:
# Predict on the test data using the first estimator and compute its test MSE
y_pred1 = ___
print("The test MSE of one estimator in the model is", round(mean_squared_error(y_test,y_pred1),2))
In [ ]:
### edTest(test_mse) ###
# Compute the test MSE of the model prediction
y_pred = ___
print("The test MSE of the model is",round(mean_squared_error(y_test,y_pred),2))

Mindchow 🍲

After marking, go back and change the number of bootstraps and the maximum depth of the tree.

  • Do you see any relation between them?

  • How does the variance change as the maximum depth changes?

  • How does the variance change as the number of bootstraps changes?
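
One way to explore these questions empirically is to refit the bagged model on many freshly drawn training sets and measure how much its predictions move. The sketch below does this on a synthetic noisy sine curve. The helper `prediction_variance`, the data-generating process, and the parameter grid are all hypothetical choices for illustration, and the printed numbers depend on them.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

def prediction_variance(max_depth, num_bootstraps, n_repeats=20, seed=0):
    """Average variance of bagged predictions across freshly drawn training sets."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(0, 10, 40).reshape(-1, 1)
    preds = []
    for r in range(n_repeats):
        # Hypothetical data-generating process: noisy sine curve
        x_r = rng.uniform(0, 10, size=(100, 1))
        y_r = np.sin(x_r[:, 0]) + rng.normal(scale=0.5, size=100)
        bag = BaggingRegressor(
            DecisionTreeRegressor(max_depth=max_depth),
            n_estimators=num_bootstraps,
            random_state=r,
        )
        preds.append(bag.fit(x_r, y_r).predict(grid))
    # Variance over repeats at each grid point, then averaged over the grid
    return np.var(np.array(preds), axis=0).mean()

for depth in (1, 3, 10):
    for n_boot in (1, 30):
        print(f"max_depth={depth:2d}, bootstraps={n_boot:2d}: "
              f"variance={prediction_variance(depth, n_boot):.4f}")
```

Comparing rows with the same depth isolates the effect of the number of bootstraps, and comparing rows with the same number of bootstraps isolates the effect of depth.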

Your answer here

In [ ]: