Key Word(s): Random Forest, hyper-parameters, ensemble methods, MDI, CART, Imbalanced classes, F-score, Gradient Boosting
Title:
Exercise: Random Forest with Class Imbalance
Description:
The goal of this exercise is to investigate the performance of a Random Forest on a dataset with class imbalance, and then to apply correction strategies to improve performance.
Your final comparison will be a summary table of the F1 and AUC scores of the four strategies, like the one printed at the end of this exercise (your exact values will differ).
Instructions:
- Read the dataset diabetes.csv as a pandas dataframe.
- Take a quick look at the dataset.
- Split the data into train and test sets.
- Perform classification with a vanilla Random Forest, which does not take class imbalance into account.
- Perform classification with a Balanced Random Forest, which does take class imbalance into account.
- Upsample the data and perform classification with a Balanced Random Forest.
- Downsample the data and perform classification with a Balanced Random Forest.
- Compare the F1 score and AUC score of all 4 models.
Hints:
np.ravel() Return a contiguous flattened array.
f1_score() Compute the F1 score, also known as balanced F-score or F-measure.
roc_auc_score() Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores.
sklearn.model_selection.train_test_split() Split arrays or matrices into random train and test subsets.
RandomForestClassifier() A random forest classifier; the documentation details its tunable parameters and their ranges.
RandomForestClassifier.fit() Build a forest of trees from the training set (X, y).
RandomForestClassifier.predict() Predict class for X.
BalancedRandomForestClassifier() A balanced random forest classifier.
BalancedRandomForestClassifier.fit() Build a forest of trees from the training set (X, y).
BalancedRandomForestClassifier.predict() Predict class for X.
SMOTE() Class to perform over-sampling using SMOTE.
SMOTE.fit_resample() Resample the dataset.
RandomUnderSampler() Class to perform random under-sampling.
RandomUnderSampler.fit_resample() Resample the dataset.
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from prettytable import PrettyTable
from imblearn.over_sampling import SMOTE
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.inspection import permutation_importance
from imblearn.under_sampling import RandomUnderSampler
from imblearn.ensemble import BalancedRandomForestClassifier
%matplotlib inline
# Code to read the dataset and take a quick look
df = pd.read_csv("diabetes.csv")
df.head()
# Investigate the response variable for data imbalance
count0, count1 = df['Outcome'].value_counts()
print(f'The percentage of diabetics in the dataset is only {100*count1/(count0+count1):.2f}%')
The percentage of diabetics in the dataset is only 34.90%
# Assign the predictor and response variables
# "Outcome" is the response and all the other columns are the predictors
# Use the values of these features and response
X = ___
y = ___
# Fix a random_state
random_state = 22
# Split the data into train and validation sets
# Set random state as defined above and use a train size of 0.8
X_train, X_val, y_train, y_val = ___
# Set the max_depth variable to 20 for all trees
max_depth = ___
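A minimal sketch of how the cell above could be completed, assuming the predictors are every column except Outcome, taken as NumPy arrays via .values (the autograder may expect a slightly different form, e.g. DataFrames):

# Sketch: all columns except "Outcome" are predictors
X = df.drop("Outcome", axis=1).values
y = df["Outcome"].values

# 80/20 train/validation split with the fixed random_state
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.8, random_state=random_state
)

# Maximum tree depth specified by the exercise
max_depth = 20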
Strategy 1 - Vanilla Random Forest
- No correction for imbalance
# Define a Random Forest classifier with random_state as above
# Set the maximum depth to be max_depth and use 10 estimators
random_forest = ___
# Fit the model on the training set
___
### edTest(test_vanilla) ###
# Use the trained model to predict on the validation set
predictions = ___
# Compute two metrics that better capture misclassification of the minority class,
# i.e., the `F1 score` and the `AUC`
# Compute the F1-score and assign it to variable score1
f_score = ___
score1 = round(f_score, 2)
# Compute the AUC and assign it to variable auc1
auc_score = ___
auc1 = round(auc_score, 2)
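A sketch of Strategy 1 under the settings stated in the comments (10 trees, max_depth=20, the fixed random_state). For the AUC this sketch passes the positive-class probabilities from predict_proba, which is what roc_auc_score treats as "prediction scores"; the autograder may instead accept the hard predictions:

# Vanilla random forest: no imbalance correction
random_forest = RandomForestClassifier(
    n_estimators=10, max_depth=max_depth, random_state=random_state
)
random_forest.fit(X_train, y_train)

# Predict on the validation set and score
predictions = random_forest.predict(X_val)
f_score = f1_score(y_val, predictions)
score1 = round(f_score, 2)
auc_score = roc_auc_score(y_val, random_forest.predict_proba(X_val)[:, 1])
auc1 = round(auc_score, 2)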
Strategy 2 - Random Forest with class weighting
- Balancing the class imbalance in each bootstrap
# Define a Random Forest classifier with random_state as above
# Set the maximum depth to be max_depth and use 10 estimators
# Use class_weight="balanced_subsample" to weight the classes accordingly
random_forest = ___
# Fit the model on the training set
___
### edTest(test_balanced) ###
# Use the trained model to predict on the validation set
predictions = ___
# Compute two metrics that better capture misclassification of the minority class,
# i.e., the `F1 score` and the `AUC`
# Compute the F1-score and assign it to variable score2
f_score = ___
score2 = round(f_score, 2)
# Compute the AUC and assign it to variable auc2
auc_score = ___
auc2 = round(auc_score, 2)
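Strategy 2 differs from Strategy 1 only in the class_weight argument; a sketch under the same assumptions as above. With class_weight="balanced_subsample", class weights are recomputed from the bootstrap sample of each tree rather than from the full training set. (The imblearn BalancedRandomForestClassifier imported above is an alternative that resamples each bootstrap instead of reweighting it.)

# Random forest with per-bootstrap class weighting
random_forest = RandomForestClassifier(
    n_estimators=10, max_depth=max_depth,
    class_weight="balanced_subsample", random_state=random_state
)
random_forest.fit(X_train, y_train)

predictions = random_forest.predict(X_val)
score2 = round(f1_score(y_val, predictions), 2)
auc2 = round(roc_auc_score(y_val, random_forest.predict_proba(X_val)[:, 1]), 2)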
Strategy 3 - Upsample the data
Using the imblearn SMOTE().
# Perform upsampling using SMOTE
# Define a SMOTE object with random_state=2
sm = ___
# Use the SMOTE object to upsample the train data
# You may have to use ravel()
X_train_res, y_train_res = ___
# Define a Random Forest classifier with random_state as above
# Set the maximum depth to be max_depth and use 10 estimators
# Use class_weight="balanced_subsample" to weight the classes accordingly
random_forest = ___
# Fit the Random Forest on upsampled data
___
### edTest(test_upsample) ###
# Use the trained model to predict on the validation set
predictions = ___
# Compute the F1-score and assign it to variable score3
f_score = ___
score3 = round(f_score, 2)
# Compute the AUC and assign it to variable auc3
auc_score = ___
auc3 = round(auc_score, 2)
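A sketch of Strategy 3. SMOTE.fit_resample expects a 1-D label array, hence the np.ravel call (a no-op if y_train is already 1-D). Note that only the training data is resampled; the validation set is left untouched so the scores remain comparable:

# Oversample the minority class with synthetic SMOTE samples
sm = SMOTE(random_state=2)
X_train_res, y_train_res = sm.fit_resample(X_train, np.ravel(y_train))

# Same forest configuration as Strategy 2, trained on the upsampled data
random_forest = RandomForestClassifier(
    n_estimators=10, max_depth=max_depth,
    class_weight="balanced_subsample", random_state=random_state
)
random_forest.fit(X_train_res, y_train_res)

predictions = random_forest.predict(X_val)
score3 = round(f1_score(y_val, predictions), 2)
auc3 = round(roc_auc_score(y_val, random_forest.predict_proba(X_val)[:, 1]), 2)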
Strategy 4 - Downsample the data
Using the imblearn RandomUnderSampler().
# Define a RandomUnderSampler instance with random_state=2
rs = ___
# Downsample the train data
# You may have to use ravel()
X_train_res, y_train_res = ___
# Define a Random Forest classifier with random_state as above
# Set the maximum depth to be max_depth and use 10 estimators
# Use class_weight="balanced_subsample" to weight the classes accordingly
random_forest = ___
# Fit the Random Forest on downsampled data
___
### edTest(test_downsample) ###
# Use the trained model to predict on the validation set
predictions = ___
# Compute two metrics that better capture misclassification of the minority class,
# i.e., the `F1 score` and the `AUC`
# Compute the F1-score and assign it to variable score4
f_score = ___
score4 = round(f_score, 2)
# Compute the AUC and assign it to variable auc4
auc_score = ___
auc4 = round(auc_score, 2)
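A sketch of Strategy 4, which mirrors Strategy 3 but discards majority-class rows instead of synthesizing minority-class ones:

# Randomly undersample the majority class in the training data
rs = RandomUnderSampler(random_state=2)
X_train_res, y_train_res = rs.fit_resample(X_train, np.ravel(y_train))

# Same forest configuration, trained on the downsampled data
random_forest = RandomForestClassifier(
    n_estimators=10, max_depth=max_depth,
    class_weight="balanced_subsample", random_state=random_state
)
random_forest.fit(X_train_res, y_train_res)

predictions = random_forest.predict(X_val)
score4 = round(f1_score(y_val, predictions), 2)
auc4 = round(roc_auc_score(y_val, random_forest.predict_proba(X_val)[:, 1]), 2)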
# Compile the results from the implementations above
pt = PrettyTable()
pt.field_names = ["Strategy", "F1 Score", "AUC Score"]
pt.add_row(["Random Forest - No imbalance correction", score1, auc1])
pt.add_row(["Random Forest - balanced_subsample", score2, auc2])
pt.add_row(["Random Forest - Upsampling (SMOTE)", score3, auc3])
pt.add_row(["Random Forest - Downsampling", score4, auc4])
print(pt)
### edTest(test_chow1) ###
# Submit an answer choice as a string below (e.g., if you choose option A, put 'A')
answer1 = '___'