Title¶

Class imbalance: Random Forest vs SMOTE Classification

Description¶

The goal of this exercise is to investigate the performance of Random Forest with and without class balancing techniques on a dataset with class imbalance.

The comparison will look a little something like this:

Instructions:¶

Read the dataset diabetes.csv as a pandas dataframe.
Take a quick look at the dataset.
Quantify the class imbalance of your response variable.
Assign the response variable as Outcome and everything else as a predictor.
Split the data into train and validation sets.
Fit a RandomForestClassifier() on the training data, without any consideration for class imbalance.
Predict on the validation set, compute the f1_score and the roc_auc_score, and save them to appropriate variables.
Fit a RandomForestClassifier() on the training data, but this time account for class imbalance by setting class_weight='balanced_subsample'.
Predict on the validation set, compute the f1_score and the roc_auc_score for this model, and save them to appropriate variables.
Use SMOTE to generate a balanced training set, then fit a RandomForestClassifier() with class_weight='balanced_subsample' on it.
Predict on the validation set, compute the f1_score and the roc_auc_score for this model, and save them to appropriate variables.
Finally, use the helper code to tabulate your results and compare the performance of each model.
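
The fit, predict, and score steps above can be sketched on synthetic data. This is a minimal illustration, not the exercise solution: the dataset comes from make_classification, the names X_demo, y_demo, and results are made up for the example, and SMOTE is left out so the snippet depends only on scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption: roughly 80% class 0, 20% class 1)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.8, 0.2], random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=0
)

results = {}
for label, cw in [("no correction", None),
                  ("class weighting", "balanced_subsample")]:
    # Same forest each time; only the class_weight setting changes
    model = RandomForestClassifier(
        n_estimators=10, max_depth=20, class_weight=cw, random_state=0
    )
    model.fit(X_tr, y_tr)
    f1 = f1_score(y_va, model.predict(X_va))
    # AUC is computed here from the predicted probability of class 1
    auc = roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])
    results[label] = (f1, auc)
    print(f"{label}: F1 = {f1:.3f}, AUC = {auc:.3f}")
```

The numbers depend on the random seed and on the synthetic data, so they say nothing about what to expect on diabetes.csv.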

Hints:¶

RandomForestClassifier() : A random forest classifier.

f1_score() : Compute the F1 score, also known as balanced F-score or F-measure

roc_auc_score() : Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores.
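
To make the two metrics concrete, here is a small hand-checkable sketch. The arrays y_true, y_pred, and y_score are invented for illustration. It also shows that roc_auc_score accepts either hard labels or continuous scores, and that the two choices can give different values.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Toy labels: 4 negatives, 2 positives (hypothetical values)
y_true = np.array([0, 0, 0, 0, 1, 1])
# Hard predictions: one false positive, one false negative
y_pred = np.array([0, 0, 0, 1, 1, 0])
# Scores that could have produced those predictions at a 0.5 threshold
y_score = np.array([0.1, 0.2, 0.3, 0.6, 0.8, 0.4])

# Precision = 1/2 and recall = 1/2, so F1 = 0.5
f1 = f1_score(y_true, y_pred)

# AUC from scores: 7 of the 8 (positive, negative) pairs are ranked correctly
auc_from_scores = roc_auc_score(y_true, y_score)

# AUC from hard labels: ties between pairs count as half
auc_from_labels = roc_auc_score(y_true, y_pred)

print(f1, auc_from_scores, auc_from_labels)
```

Passing scores (for example the second column of predict_proba) lets the AUC use the full ranking, while passing hard labels collapses it to a single threshold.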

In [ ]:

# Import the main packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import SMOTE
from prettytable import PrettyTable
%matplotlib inline

In [ ]:

# Read the dataset and take a quick look

df = pd.read_csv('diabetes.csv')
df.head()

In [ ]:

# Checking the value counts of the response variable ('Outcome') shows that there are fewer diabetics than non-diabetics

df['Outcome'].value_counts()

In [ ]:

### edTest(test_imbalance) ###

# To estimate the amount of data imbalance, find the ratio of class 1 (diabetics) to the size of the dataset.

imbalance_ratio = ___

print(f'The percentage of diabetics in the dataset is only {(imbalance_ratio)*100:.2f}%')
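
As a rough illustration of the ratio described above, the following sketch uses a made-up ten-row frame named toy, not the diabetes data. For a 0/1 response, the share of class 1 is the count of ones divided by the number of rows, which is the same as the column mean.

```python
import pandas as pd

# Hypothetical toy response column: 3 positives out of 10 rows
toy = pd.DataFrame({"Outcome": [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]})

counts = toy["Outcome"].value_counts()

# Share of class 1 = number of ones / total number of rows
minority_share = counts[1] / len(toy)

# For a binary 0/1 column, the mean gives the same number
share_via_mean = toy["Outcome"].mean()

print(f"Class 1 makes up {minority_share * 100:.2f}% of the toy data")
```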

In [ ]:

# Assign the predictor and response variables.

# Outcome is the response and all the other columns are the predictors

X = ___
y = ___

In [ ]:

# Fix a random_state and split the data into train and validation sets

random_state = 22

X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.8, random_state=random_state)
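
The split above is purely random, so the share of class 1 can differ between the train and validation sets. One optional variation, not required by this exercise, is the stratify argument of train_test_split, which keeps the class proportions the same in both parts. The sketch below uses hypothetical arrays X_toy and y_toy.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 positives and 80 negatives
y_toy = np.array([1] * 20 + [0] * 80)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy asks for the same class proportions in both splits
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, train_size=0.8, stratify=y_toy, random_state=22
)

print(f"train share of class 1: {y_tr.mean():.2f}")
print(f"validation share of class 1: {y_va.mean():.2f}")
```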

In [ ]:

# We fix max_depth to 20 for all trees. You can come back and change this to investigate the performance of the Random Forest

max_depth = 20

Strategy 1 - Vanilla Random Forest¶

No correction for imbalance

In [ ]:

# Define a Random Forest classifier with random_state as above
# Set the maximum depth to be max_depth and use 10 estimators
random_forest = ___

# Fit the model on the training set
random_forest.___

In [ ]:

# We make predictions on the validation set 
predictions = ___

# We also compute two metrics that better reflect misclassification of the minority class, i.e. the F1 score and the AUC
# Compute the f1-score and assign it to variable score1
score1 = ___

# Compute the `auc` and assign it to variable auc1
auc1 = ___

Strategy 2 - Random Forest with class weighting¶

Balancing the class imbalance in each bootstrap
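
To see what "balanced" weighting means numerically, the sketch below computes the weights on a made-up label vector y_toy using scikit-learn's compute_class_weight helper. The 'balanced' rule is n_samples / (n_classes * count of each class); 'balanced_subsample' applies the same rule, but to the bootstrap sample drawn for each tree rather than to the full training set.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 8 samples of class 0 and 2 of class 1
y_toy = np.array([0] * 8 + [1] * 2)
classes = np.unique(y_toy)

# Weights as computed by scikit-learn's 'balanced' rule
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# The same rule written out by hand
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

print(dict(zip(classes.tolist(), weights.tolist())))
```

The rare class gets the larger weight (2.5 versus 0.625 here), so its misclassifications count for more when the trees choose their splits.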

In [ ]:

# Define a Random Forest classifier with random_state as above

# Set the maximum depth to be max_depth and use 10 estimators

# Specify `class_weight='balanced_subsample'`

random_forest = ___

# Fit the model on the training data

random_forest.___

In [ ]:

# We make predictions on the validation set 

predictions = ___

# Again, we also compute two metrics that better reflect misclassification of the minority class, i.e. the F1 score and the AUC

# Compute the f1-score and assign it to variable score2

score2 = ___

# Compute the `auc` and assign it to variable auc2

auc2 = ___

Strategy 3 - RandomForest with SMOTE¶

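SMOTE oversamples the minority class by generating synthetic points. Combining it with the class weighting from the previous method may further improve our metrics.
Read more in the imbalanced-learn documentation for SMOTE.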
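
The core idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration, not imblearn's implementation: the function smote_like_oversample and the array X_min are invented for the example. Each synthetic point is placed at a random position on the line segment between a minority sample and one of its k nearest minority neighbours.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # Pairwise Euclidean distances between minority samples
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    # Pick a random base sample and one of its k nearest neighbours
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    # Interpolate: x_new = x + lam * (neighbour - x), with lam in [0, 1)
    lam = rng.random((n_new, 1))
    return X_min[base] + lam * (X_min[chosen] - X_min[base])

# Hypothetical minority-class samples with two features
X_min = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.2], [3.0, 3.0], [2.5, 2.2]])
X_new = smote_like_oversample(X_min, n_new=10)
print(X_new.shape)
```

Because every synthetic point is a convex combination of two real minority samples, it always stays inside the region those samples span.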

In [ ]:

# Run this cell to balance the training set with SMOTE
sm = SMOTE(random_state=3)

# Resample the training data only; the validation set is left untouched
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

# X_train_res now has more points than X_train, because synthetic minority samples were added

print(f'Number of points in balanced dataset is {X_train_res.shape[0]}')

In [ ]:

# Again, define a Random Forest classifier with random_state as above

# Set the maximum depth to be max_depth and use 10 estimators

# Again, specify `class_weight='balanced_subsample'`

random_forest = ___

# Fit the model on the new training data created above with SMOTE 

random_forest.___

In [ ]:

### edTest(test_smote) ###

# We make predictions on the validation set 

predictions = ___

# Again, we also compute two metrics that better reflect misclassification of the minority class, i.e. the F1 score and the AUC

# Compute the f1-score and assign it to variable score3

score3 = ___

# Compute the `auc` and assign it to variable auc3

auc3 = ___

In [ ]:

# Finally, we compare the results from the three implementations above

# Just run this cell to see your results

pt = PrettyTable()
pt.field_names = ["Strategy","F1 Score","AUC score"]
pt.add_row(["Random Forest - no correction",score1,auc1])
pt.add_row(["Random Forest - class weighting",score2,auc2])
pt.add_row(["Random Forest - SMOTE upsampling",score3,auc3])
print(pt)

Mindchow 🍲¶

Go back and change the max_depth and n_estimators parameters of the Random Forest. Do you see an improvement in results?

Your answer here