Title

Exercise 1: Bagging vs Random Forest (Tree correlation)

Description

How does Random Forest improve on Bagging?

The goal of this exercise is to investigate the correlation between randomly selected trees from Bagging and Random Forest.
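
One way to put a number on "correlation between trees" is sketched below. It is only an illustration under stated assumptions: it uses a synthetic dataset from make_classification as a stand-in for diabetes.csv, and mean_tree_correlation is a hypothetical helper (not part of scikit-learn) that averages the pairwise Pearson correlation of the trees' predicted probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in for diabetes.csv (8 predictors, binary response)
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.8,
                                                  random_state=0)


def mean_tree_correlation(ensemble, X):
    """Hypothetical helper: average pairwise Pearson correlation between
    the class-1 probabilities predicted by each tree in the ensemble."""
    probs = np.array([tree.predict_proba(X)[:, 1]
                      for tree in ensemble.estimators_])
    corr = np.corrcoef(probs)                       # trees x trees matrix
    upper = corr[np.triu_indices_from(corr, k=1)]   # each pair counted once
    return float(np.nanmean(upper))                 # ignore constant trees


# The base tree is passed positionally so the call does not depend on
# the keyword name used by a particular scikit-learn version
bagging = BaggingClassifier(DecisionTreeClassifier(max_depth=3),
                            n_estimators=50, random_state=0)
bagging.fit(X_train, y_train)

forest = RandomForestClassifier(n_estimators=50, max_depth=3,
                                max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)

corr_bag = mean_tree_correlation(bagging, X_val)
corr_rf = mean_tree_correlation(forest, X_val)
print(f"Mean pairwise tree correlation (Bagging):       {corr_bag:.2f}")
print(f"Mean pairwise tree correlation (Random Forest): {corr_rf:.2f}")
```

The numbers depend on the data and the random seed, so treat them as something to explore rather than a fixed result.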

Instructions:

  • Read the dataset diabetes.csv as a pandas dataframe, and take a quick look at the data.
  • Split the data into train and validation sets.
  • Define a BaggingClassifier model that uses DecisionTreeClassifier as its base estimator.
  • Specify the number of bootstraps as 1000 and a maximum depth of 3.
  • Fit the BaggingClassifier model on the train data.
  • Use the helper code to predict using the mean model and individual estimators. The plot will look similar to the one given below.
  • Predict on the validation data using the first estimator and the mean model.
  • Compute and display the validation accuracy.
  • Repeat the modeling and classification process above, this time using a RandomForestClassifier.
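
The difference between a single estimator and the mean (aggregated) model mentioned in the steps above can be sketched as follows. This is a minimal illustration, assuming a synthetic dataset in place of diabetes.csv; the variable names acc_first and acc_mean are chosen here for illustration and do not come from the exercise's helper code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in for diabetes.csv (8 predictors, binary response)
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.8,
                                                  random_state=0)

# Bagged ensemble of shallow trees; the base tree is passed positionally
bagging = BaggingClassifier(DecisionTreeClassifier(max_depth=3),
                            n_estimators=100, random_state=0)
bagging.fit(X_train, y_train)

# A single fitted tree from the ensemble
first_tree = bagging.estimators_[0]
acc_first = accuracy_score(y_val, first_tree.predict(X_val))

# The aggregated ("mean") model combines all trees
acc_mean = accuracy_score(y_val, bagging.predict(X_val))

print(f"Validation accuracy, first estimator: {acc_first:.2f}")
print(f"Validation accuracy, mean model:      {acc_mean:.2f}")
```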

Your final output should look something like this:

Hints:

sklearn.model_selection.train_test_split() : Split arrays or matrices into random train and test subsets.

sklearn.ensemble.BaggingClassifier() : Returns a Bagging classifier instance.

sklearn.tree.DecisionTreeClassifier() : A Tree classifier can be used as the base model for the Bagging classifier.

sklearn.ensemble.RandomForestClassifier() : Defines a Random Forest classifier.

sklearn.metrics.accuracy_score(y_true, y_pred) : Accuracy classification score.

In [1]:
#!pip install -qq dtreeviz
import os, sys
sys.path.append(f"{os.getcwd()}/../")
In [2]:
# Import the main packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from dtreeviz.trees import dtreeviz

%matplotlib inline

colors = [None,  # 0 classes
          None,  # 1 class
          ['#FFF4E5','#D2E3EF'],# 2 classes
           ]

from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))
In [3]:
# Read the dataset and take a quick look

df = pd.read_csv("diabetes.csv")
df.head()
In [4]:
### edTest(test_assign) ###
# Assign the predictor and response variables. 
# "Outcome" is the response and all the other columns are the predictors

X = __ 
y = __  
In [5]:
# Fix a random_state and split the data 
# into train and validation sets

random_state = 144

X_train, X_val, y_train, y_val = train_test_split(__, __,
                                                  train_size=0.8,
                                                  random_state=random_state)

Bagging Implementation

In [6]:
# Define a Bagging classifier with random_state as above
# and with a DecisionTreeClassifier as the base model
# We fix the max_depth variable to 20 for all trees
max_depth = 20

# Set the maximum depth to be max_depth and use 1000 estimators
n_estimators = 1000
basemodel = __(max_depth=__, 
                random_state=__)

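# Note: scikit-learn 1.2 renamed base_estimator to estimator,
# and base_estimator was removed in 1.4
bagging = BaggingClassifier(base_estimator=basemodel, 
                            n_estimators=n_estimators)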
# Fit the model on the training set

bagging.fit(__, __)
In [7]:
### edTest(test_bagging) ###
# We make predictions on the validation set 

predictions = bagging.predict(X_val)

# compute the accuracy on the validation set

acc_bag = round(accuracy_score(y_val, predictions), 2)

print(f'For Bagging, the accuracy on the validation set is {acc_bag}')

Random Forest implementation

In [8]:
# Define a Random Forest classifier with random_state as above

# Set the maximum depth to be max_depth and use 1000 estimators

random_forest = __(max_depth=max_depth, 
                    random_state=random_state, 
                    n_estimators=n_estimators)

# Fit the model on the training set
random_forest.fit(__, __)
In [9]:
### edTest(test_RF) ###
# We make predictions on the validation set 

predictions = random_forest.predict(X_val)

# compute the accuracy on the validation set

acc_rf = round(accuracy_score(y_val, predictions), 2)

print(f'For Random Forest, the accuracy on the validation set is {acc_rf}')

Visualizing the trees - Bagging

In [10]:
# Reducing the max_depth for visualization 

max_depth = 3

basemodel = DecisionTreeClassifier(max_depth=max_depth, 
                                random_state=random_state)

bagging = BaggingClassifier(base_estimator=basemodel, 
                                n_estimators=1000)

# Fit the model on the training set

bagging.fit(X_train, y_train)

# Selecting two trees at random

bagvati1 = bagging.estimators_[0]
bagvati2 = bagging.estimators_[100]
In [11]:
vizA = dtreeviz(bagvati1, df.iloc[:,:8],df.Outcome,
               feature_names = df.columns[:8],
               target_name = 'Diabetes', class_names= ['No','Yes']
              ,orientation = 'TD',
               colors={'classes':colors},
               label_fontsize=14,
               ticks_fontsize=10,
               )
printmd('Bagging Tree 1')
vizA
In [12]:
vizB = dtreeviz(bagvati2, df.iloc[:,:8],df.Outcome,
               feature_names = df.columns[:8],
               target_name = 'Diabetes', class_names= ['No','Yes']
              ,orientation = 'TD',
               colors={'classes':colors},
               label_fontsize=14,
               ticks_fontsize=10,
                scale=1.1
               )
printmd('Bagging Tree 2')
vizB

Visualizing the trees - Random Forest

In [13]:
# Reducing the max_depth for visualization 

max_depth = 3

random_forest = RandomForestClassifier(max_depth=max_depth,
                                       random_state=random_state,
                                       n_estimators=1000,
                                       max_features="sqrt")

# Fit the model on the training set

random_forest.fit(X_train, y_train)

# Selecting two trees at random

forestvati1 = random_forest.estimators_[0]
forestvati2 = random_forest.estimators_[__]
In [14]:
vizC = dtreeviz(forestvati1, df.iloc[:,:8],df.Outcome,
               feature_names = df.columns[:8],
               target_name = 'Diabetes', class_names= ['No','Yes']
              ,orientation = 'TD',
               colors={'classes':colors},
               label_fontsize=14,
               ticks_fontsize=10,
                scale=1.1
               )
printmd('Random Forest Tree 1')
vizC
In [15]:
vizD = dtreeviz(forestvati2, df.iloc[:,:8],df.Outcome,
               feature_names = df.columns[:8],
               target_name = 'Diabetes', class_names= ['No','Yes']
              ,orientation = 'TD',
               colors={'classes':colors},
               label_fontsize=14,
               ticks_fontsize=10,
                scale=1.1
               )
printmd('Random Forest Tree 2')
vizD

Mindchow 🍲

  • Change the max_depth of Bagging and Random Forest and compare the resulting trees. Which model produces trees that differ more from one another?
  • Change the max_features in RandomForestClassifier to 8. How does this affect the correlation between the trees?
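
One possible way to explore the max_features question without plotting every tree is to tally which feature each tree splits on at its root. The sketch below is an assumption-laden illustration: it uses synthetic data rather than diabetes.csv, and root_feature_counts is a hypothetical helper written for this example.

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Assumption: synthetic stand-in for diabetes.csv (8 predictors, binary response)
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           random_state=0)


def root_feature_counts(max_features, n_estimators=100):
    """Hypothetical helper: fit a forest and count how often each feature
    index is chosen for the root split across its trees."""
    forest = RandomForestClassifier(n_estimators=n_estimators, max_depth=3,
                                    max_features=max_features, random_state=0)
    forest.fit(X, y)
    # tree_.feature[0] is the feature index used at the root node
    return Counter(int(tree.tree_.feature[0]) for tree in forest.estimators_)


counts_sqrt = root_feature_counts("sqrt")  # random subset of features per split
counts_all = root_feature_counts(8)        # all 8 features considered per split

print("Root split features with max_features='sqrt':", dict(counts_sqrt))
print("Root split features with max_features=8:     ", dict(counts_all))
```

Comparing how concentrated the two tallies are gives a rough sense of how similar the trees are to one another under each setting.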