Instructions:¶
- Read the dataset diabetes.csv as a pandas dataframe.
- Take a quick look at the dataset.
- Quantify the class imbalance of your response variable.
- Assign the response variable as Outcome and everything else as a predictor.
- Split the data into train and validation sets.
- Fit a RandomForestClassifier() on the training data, without any consideration for class imbalance.
- Predict on the validation set, compute the f1_score and the auc_score, and save them to appropriate variables.
- Fit a RandomForestClassifier() on the training data, but this time account for class imbalance by setting class_weight='balanced_subsample'.
- Predict on the validation set, compute the f1_score and the auc_score for this model, and save them to appropriate variables.
- Fit a RandomForestClassifier() on the training data generated using SMOTE, again using class_weight='balanced_subsample'.
- Predict on the validation set, compute the f1_score and the auc_score for this model, and save them to appropriate variables.
- Finally, use the helper code to tabulate your results and compare the performance of each model.
Hints:¶
RandomForestClassifier() : A random forest classifier.
f1_score() : Compute the F1 score, also known as balanced F-score or F-measure
roc_auc_score() : Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores.
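As a small illustration of the two metrics in the hints, the sketch below evaluates them on hand-made values. The arrays `y_true`, `y_pred`, and `y_score` are made up for this example and are not taken from the diabetes data. Note that `f1_score` expects hard class labels, whereas `roc_auc_score` is usually given scores such as the predicted probability of the positive class.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Made-up ground truth and predictions (illustrative only)
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 0])               # hard class labels
y_score = np.array([0.1, 0.2, 0.3, 0.6, 0.9, 0.4])  # predicted P(class 1)

toy_f1 = f1_score(y_true, y_pred)         # harmonic mean of precision and recall
toy_auc = roc_auc_score(y_true, y_score)  # ranking quality of the scores
print(f"F1 = {toy_f1:.3f}, AUC = {toy_auc:.3f}")
```

Here precision and recall are both 1/2, and 7 of the 8 positive/negative pairs are ranked correctly by the scores.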
In [ ]:
# Import the main packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import SMOTE
from prettytable import PrettyTable
%matplotlib inline
In [ ]:
# Read the dataset and take a quick look
df = pd.read_csv('diabetes.csv')
df.head()
In [ ]:
# Checking the value counts of the response variable ('Outcome') shows that there are fewer diabetics than non-diabetics
df['Outcome'].value_counts()
In [ ]:
### edTest(test_imbalance) ###
# To estimate the amount of data imbalance, find the ratio of the number of class 1 (diabetic) samples to the size of the dataset.
imbalance_ratio = ___
print(f'The percentage of diabetics in the dataset is only {(imbalance_ratio)*100:.2f}%')
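One way to think about such a ratio, sketched on a made-up binary column rather than the real `Outcome` data: for 0/1 labels, the mean of the column equals the fraction of class 1. The names `toy_outcome`, `toy_ratio`, and `toy_shares` are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Made-up binary response column (illustrative only)
toy_outcome = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="Outcome")

# For 0/1 labels, the mean is the fraction of samples in class 1
toy_ratio = toy_outcome.mean()

# value_counts(normalize=True) reports the share of every class
toy_shares = toy_outcome.value_counts(normalize=True)

print(f"Fraction of class 1: {toy_ratio:.2f}")
print(toy_shares)
```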
In [ ]:
# Assign the predictor and response variables.
# Outcome is the response and all the other columns are the predictors
X = ___
y = ___
In [ ]:
# Fix a random_state and split the data into train and validation sets
random_state = 22
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.8, random_state=random_state)
In [ ]:
# We fix max_depth to 20 for all trees; you can come back and change this to investigate the performance of the random forest
max_depth = 20
Strategy 1 - Vanilla Random Forest¶
- No correction for imbalance
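The general fit, predict, and score pattern for a random forest can be sketched on synthetic data as follows. Everything here is an assumption made for illustration: the data comes from `make_classification` (not the diabetes dataset), and the `toy_` names and the `max_depth=5` setting are arbitrary choices, not the values the exercise asks for.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 80% class 0); not the diabetes dataset
X_toy, y_toy = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.8, random_state=0
)

# A small forest with no correction for class imbalance
toy_forest = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=0)
toy_forest.fit(X_tr, y_tr)

toy_pred = toy_forest.predict(X_te)               # hard labels, used for F1
toy_proba = toy_forest.predict_proba(X_te)[:, 1]  # P(class 1), used for AUC

print(f"F1 = {f1_score(y_te, toy_pred):.3f}")
print(f"AUC = {roc_auc_score(y_te, toy_proba):.3f}")
```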
In [ ]:
# Define a Random Forest classifier with random_state as above
# Set the maximum depth to be max_depth and use 10 estimators
random_forest = ___
# Fit the model on the training set
random_forest.___
In [ ]:
# We make predictions on the validation set
predictions = ___
# We also compute two metrics that better represent misclassification of minority classes, i.e. `f1 score` and `AUC`
# Compute the f1-score and assign it to variable score1
score1 = ___
# Compute the `auc` and assign it to variable auc1
auc1 = ___
Strategy 2 - Random Forest with class weighting¶
- Balancing the class imbalance in each bootstrap
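To give a feel for what balanced class weights look like, the sketch below computes scikit-learn's `'balanced'` weights on made-up labels. The `'balanced_subsample'` mode uses the same formula, except that the weights are recomputed from the bootstrap sample of each tree. The array `toy_y` and the other names are hypothetical.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up imbalanced labels: six samples of class 0, two of class 1
toy_y = np.array([0, 0, 0, 0, 0, 0, 1, 1])
classes = np.unique(toy_y)

# 'balanced' weights: n_samples / (n_classes * number of samples in each class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=toy_y)

# The same formula written out by hand
manual = len(toy_y) / (len(classes) * np.bincount(toy_y))

print(dict(zip(classes.tolist(), weights.tolist())))
```

The rarer class receives the larger weight (2.0 versus about 0.67 here), so mistakes on it count for more during training.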
In [ ]:
# Define a Random Forest classifier with random_state as above
# Set the maximum depth to be max_depth and use 10 estimators
# Specify `class_weight='balanced_subsample'`
random_forest = ___
# Fit the model on the training data
random_forest.___
In [ ]:
# We make predictions on the validation set
predictions = ___
# Again, we also compute two metrics that better represent misclassification of minority classes, i.e. `f1 score` and `AUC`
# Compute the f1-score and assign it to variable score2
score2 = ___
# Compute the `auc` and assign it to variable auc2
auc2 = ___
In [ ]:
# Run this cell below to use SMOTE to balance our dataset
sm = SMOTE(random_state=3)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)
# If you check the shape now, you will see that X_train_res has more points than X_train
print(f'Number of points in balanced dataset is {X_train_res.shape[0]}')
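The core idea behind SMOTE, creating synthetic minority points by interpolating between a minority point and a nearby minority neighbour, can be sketched in plain NumPy. This is a deliberately simplified illustration and not the `imblearn` implementation: it always uses the single nearest neighbour, whereas SMOTE picks at random among several nearest neighbours. The points and the helper `smote_like_sample` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up minority-class points in 2D (illustrative only)
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5], [3.0, 3.0]])

def smote_like_sample(points, rng):
    """Return one synthetic point between a random point and its nearest neighbour."""
    i = rng.integers(len(points))
    dists = np.linalg.norm(points - points[i], axis=1)
    dists[i] = np.inf        # exclude the chosen point itself
    j = np.argmin(dists)     # index of the nearest minority neighbour
    lam = rng.random()       # interpolation factor in [0, 1)
    return points[i] + lam * (points[j] - points[i])

synthetic = np.array([smote_like_sample(minority, rng) for _ in range(5)])
print(synthetic)
```

Each synthetic point lies on the line segment between two existing minority points, which is why the resampled training set has more rows than the original.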
In [ ]:
# Again, define a Random Forest classifier with random_state as above
# Set the maximum depth to be max_depth and use 10 estimators
# Again, specify `class_weight='balanced_subsample'`
random_forest = ___
# Fit the model on the new training data created above with SMOTE
random_forest.___
In [ ]:
### edTest(test_smote) ###
# We make predictions on the validation set
predictions = ___
# Again, we also compute two metrics that better represent misclassification of minority classes, i.e. `f1 score` and `AUC`
# Compute the f1-score and assign it to variable score3
score3 = ___
# Compute the `auc` and assign it to variable auc3
auc3 = ___
In [ ]:
# Finally, we compare the results from the three implementations above
# Just run this cell to see your results
pt = PrettyTable()
pt.field_names = ["Strategy","F1 Score","AUC score"]
pt.add_row(["Random Forest - no correction",score1,auc1])
pt.add_row(["Random Forest - class weighting",score2,auc2])
pt.add_row(["Random Forest - SMOTE upsampling",score3,auc3])
print(pt)
Mindchow 🍲¶
- Go back and change the learning_rate parameter and n_estimators for Adaboost. Do you see an improvement in results?
Your answer here