{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Title :\n",
"\n",
"Exercise: Random Forest with Class Imbalance\n",
"\n",
"## Description :\n",
"\n",
"The goal of this exercise is to investigate the performance of Random Forest on a dataset with class imbalance, and then use corrections strategies to improve performance.\n",
"\n",
"Your final comparison may look like the table below (but not with these exact values):\n",
"\n",
"
\n",
"\n",
"## Instructions:\n",
"\n",
"- Read the dataset `diabetes.csv` as a pandas dataframe.\n",
"- Take a quick look at the dataset.\n",
"- Split the data into train and test sets.\n",
"- Perform classification with a Vanilla Random Forest which does not take into account class imbalance.\n",
"- Perform classification with a Balanced Random Forest which does take into account class imbalance.\n",
"- Upsample the data and perform classification with a Balanced Random Forest.\n",
"- Downsample the data and perform classification with a Balanced Random Forest.\n",
"- Compare the F1-Score and AUC Score of all 4 models.\n",
"\n",
"## Hints: \n",
"\n",
"np.ravel()\n",
"Return a contiguous flattened array.\n",
"\n",
"f1_score()\n",
"Compute the F1 score, also known as balanced F-score or F-measure.\n",
"\n",
"roc_auc_score()\n",
"Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores.\n",
"\n",
"sklearn.train_test_split()\n",
"Split arrays or matrices into random train and test subsets.\n",
"\n",
"RandomForestClassifier()\n",
"Defines the RandomForestClassifier and includes more details on the definition and range of values for its tunable parameters.\n",
"\n",
"RandomForestClassifier.fit()\n",
"Build a forest of trees from the training set (X, y).\n",
"\n",
"RandomForestClassifier.predict()\n",
"Predict class for X.\n",
"\n",
"BalancedRandomForestClassifier()\n",
"A balanced random forest classifier.\n",
"\n",
"BalancedRandomForestClassifier.fit()\n",
"Build a forest of trees from the training set (X, y).\n",
"\n",
"BalancedRandomForestClassifier.predict()\n",
"Predict class for X.\n",
"\n",
"SMOTE()\n",
"Class to perform over-sampling using SMOTE.\n",
"\n",
"SMOTE.fit_resample()\n",
"Resample the dataset.\n",
"\n",
"RandomUnderSampler()\n",
"Class to perform random under-sampling.\n",
"\n",
"RandomUnderSampler.fit_resample()\n",
"Resample the dataset."
]
},
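{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Illustrative sketch (not part of the graded scaffold):* the hints above boil down to a fit / predict / score pattern. Below is a minimal, self-contained example of the metric calls and of `np.ravel()`, using small hypothetical label arrays instead of the diabetes data.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.metrics import f1_score, roc_auc_score\n",
"\n",
"# Hypothetical true labels and hard predictions, just to show the call signatures\n",
"y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1])\n",
"y_pred = np.array([0, 0, 1, 0, 1, 0, 0, 1])\n",
"\n",
"print(round(f1_score(y_true, y_pred), 2))       # F1 on the positive (minority) class\n",
"print(round(roc_auc_score(y_true, y_pred), 2))  # AUC computed from hard label predictions\n",
"\n",
"# np.ravel() flattens a column vector into the 1-D shape that fit() expects\n",
"y_col = np.array([[0], [1], [0]])\n",
"print(np.ravel(y_col))\n",
"```"
]
},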
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Import necessary libraries\n",
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"from prettytable import PrettyTable\n",
"from imblearn.over_sampling import SMOTE\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.metrics import f1_score, roc_auc_score\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.inspection import permutation_importance\n",
"from imblearn.under_sampling import RandomUnderSampler\n",
"from imblearn.ensemble import BalancedRandomForestClassifier\n",
"%matplotlib inline\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Code to read the dataset and take a quick look\n",
"df = pd.read_csv(\"diabetes.csv\")\n",
"df.head()\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The percentage of diabetics in the dataset is only 34.90%\n"
]
}
],
"source": [
"# Investigate the response variable for data imbalance\n",
"count0, count1 = df['Outcome'].value_counts()\n",
"\n",
"print(f'The percentage of diabetics in the dataset is only {100*count1/(count0+count1):.2f}%')\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# Assign the predictor and response variables\n",
"# \"Outcome\" is the response and all the other columns are the predictors\n",
"# Use the values of these features and response\n",
"X = ___\n",
"y = ___\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# Fix a random_state\n",
"random_state = 22\n",
"\n",
"# Split the data into train and validation sets\n",
"# Set random state as defined above and use a train size of 0.8\n",
"X_train, X_val, y_train, y_val = ___\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# Set the max_depth variable to 20 for all trees\n",
"max_depth = ___\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Strategy 1 - Vanilla Random Forest\n",
"\n",
"- No correction for imbalance"
]
},
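{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Illustrative sketch (not part of the graded scaffold):* one way a vanilla forest like this could be set up, shown on a small hypothetical dataset from `make_classification` rather than on the diabetes data; the variable names (`X_toy`, `rf`, ...) are placeholders, not the ones the autograder expects.\n",
"\n",
"```python\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Hypothetical imbalanced toy data (roughly 3:1 class ratio)\n",
"X_toy, y_toy = make_classification(n_samples=500, weights=[0.75, 0.25], random_state=22)\n",
"X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, train_size=0.8, random_state=22)\n",
"\n",
"# Vanilla forest: no class weighting and no resampling\n",
"rf = RandomForestClassifier(n_estimators=10, max_depth=20, random_state=22)\n",
"rf.fit(X_tr, y_tr)\n",
"toy_predictions = rf.predict(X_va)\n",
"```"
]
},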
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Define a Random Forest classifier with randon_state as above\n",
"# Set the maximum depth to be max_depth and use 10 estimators\n",
"random_forest = ___\n",
"\n",
"# Fit the model on the training set\n",
"___\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_vanilla) ### \n",
"# Use the trained model to predict on the validation set \n",
"predictions = ___\n",
"\n",
"# Compute two metrics that better represent misclassification of minority classes \n",
"# i.e `F1 score` and `AUC`\n",
"\n",
"# Compute the F1-score and assign it to variable score1\n",
"f_score = ___\n",
"score1 = round(f_score, 2)\n",
"\n",
"# Compute the AUC and assign it to variable auc1\n",
"auc_score = ___\n",
"auc1 = round(auc_score, 2)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Strategy 2 - Random Forest with class weighting\n",
"- Balancing the class imbalance in each bootstrap"
]
},
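{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Illustrative sketch (not part of the graded scaffold):* the only change from Strategy 1 is the `class_weight='balanced_subsample'` argument, which computes class weights from each tree's bootstrap sample. A tiny hypothetical training set is used below just so the snippet runs on its own.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"\n",
"# Tiny hypothetical training set, imbalanced 6:2\n",
"X_toy = np.arange(16, dtype=float).reshape(8, 2)\n",
"y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1])\n",
"\n",
"# 'balanced_subsample' reweights the classes within each bootstrap sample\n",
"rf_weighted = RandomForestClassifier(n_estimators=10, max_depth=20,\n",
"                                     class_weight='balanced_subsample',\n",
"                                     random_state=22)\n",
"rf_weighted.fit(X_toy, y_toy)\n",
"print(rf_weighted.predict(X_toy))\n",
"```"
]
},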
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Define a Random Forest classifier with randon_state as above\n",
"# Set the maximum depth to be max_depth and use 10 estimators\n",
"# Use class_weight as balanced_subsample to weigh the class accordingly\n",
"random_forest = ___\n",
"\n",
"# Fit the model on the training set\n",
"___\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_balanced) ###\n",
"\n",
"# Use the trained model to predict on the validation set \n",
"predictions = ___\n",
"\n",
"# Compute two metrics that better represent misclassification of minority classes \n",
"# i.e `F1 score` and `AUC`\n",
"\n",
"# Compute the F1-score and assign it to variable score2\n",
"f_score = ___\n",
"score2 = round(f_score, 2)\n",
"\n",
"# Compute the AUC and assign it to variable auc2\n",
"auc_score = ___\n",
"auc2 = round(auc_score, 2)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Strategy 3 - Balanced Random Forest with SMOTE \n",
"\n",
"- Using the **imblearn** `BalancedRandomForestClassifier()` \n",
"- Read more about this implementation [here](https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.ensemble.BalancedRandomForestClassifier.html)"
]
},
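{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Illustrative sketch (not part of the graded scaffold):* how `SMOTE.fit_resample()` can be called, shown on a small hypothetical training set; in the exercise itself, the forest in the following cells would then be fit on `X_train_res`, `y_train_res` rather than on the original training arrays.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from imblearn.over_sampling import SMOTE\n",
"\n",
"# Small hypothetical imbalanced training set (12 majority, 8 minority)\n",
"rng = np.random.default_rng(0)\n",
"X_toy = rng.normal(size=(20, 3))\n",
"y_toy = np.array([0] * 12 + [1] * 8)\n",
"\n",
"# SMOTE synthesises new minority rows by interpolating between a minority\n",
"# point and one of its nearest minority-class neighbours\n",
"sm_toy = SMOTE(random_state=2)\n",
"X_res, y_res = sm_toy.fit_resample(X_toy, np.ravel(y_toy))\n",
"\n",
"print(np.bincount(y_toy), '->', np.bincount(y_res))  # classes are now balanced\n",
"```"
]
},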
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"# Perform upsampling using SMOTE\n",
"\n",
"# Define a SMOTE with random_state=2\n",
"sm = ___\n",
"\n",
"# Use the SMOTE object to upsample the train data\n",
"# You may have to use ravel() \n",
"X_train_res, y_train_res = ___\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Define a Random Forest classifier with randon_state as above\n",
"# Set the maximum depth to be max_depth and use 10 estimators\n",
"# Use class_weight as balanced_subsample to weigh the class accordingly\n",
"random_forest = ___\n",
"\n",
"# Fit the Random Forest on upsampled data \n",
"___\n"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_upsample) ###\n",
"\n",
"# Use the trained model to predict on the validation set \n",
"predictions = ___\n",
"\n",
"# Compute the F1-score and assign it to variable score3\n",
"f_score = ___\n",
"score3 = round(f_score, 2)\n",
"\n",
"# Compute the AUC and assign it to variable auc3\n",
"auc_score = ___\n",
"auc3 = round(auc_score, 2)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Strategy 4 - Downsample the data\n",
"\n",
"Using the imblearn RandomUnderSampler()."
]
},
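{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Illustrative sketch (not part of the graded scaffold):* `RandomUnderSampler.fit_resample()` drops majority-class rows at random until the class counts match. Again, a small hypothetical training set is used so the snippet runs on its own.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from imblearn.under_sampling import RandomUnderSampler\n",
"\n",
"# Small hypothetical imbalanced training set (14 majority, 6 minority)\n",
"rng = np.random.default_rng(1)\n",
"X_toy = rng.normal(size=(20, 3))\n",
"y_toy = np.array([0] * 14 + [1] * 6)\n",
"\n",
"rs_toy = RandomUnderSampler(random_state=2)\n",
"X_res, y_res = rs_toy.fit_resample(X_toy, np.ravel(y_toy))\n",
"\n",
"print(np.bincount(y_toy), '->', np.bincount(y_res))  # 14:6 becomes 6:6\n",
"```"
]
},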
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"# Define an RandomUnderSampler instance with random state as 2\n",
"rs = ___\n",
"\n",
"# Downsample the train data\n",
"# You may have to use ravel()\n",
"X_train_res, y_train_res = ___\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Define a Random Forest classifier with randon_state as above\n",
"# Set the maximum depth to be max_depth and use 10 estimators\n",
"# Use class_weight as balanced_subsample to weigh the class accordingly\n",
"random_forest = ___\n",
"\n",
"# Fit the Random Forest on downsampled data \n",
"___\n"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_downsample) ###\n",
"\n",
"# Use the trained model to predict on the validation set \n",
"predictions = ___\n",
"\n",
"# Compute two metrics that better represent misclassification of minority classes \n",
"# i.e `F1 score` and `AUC`\n",
"\n",
"# Compute the F1-score and assign it to variable score4\n",
"f_score = ___\n",
"score4 = round(f_score, 2)\n",
"\n",
"# Compute the AUC and assign it to variable auc4\n",
"auc_score = ___\n",
"auc4 = round(auc_score, 2)\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Compile the results from the implementations above\n",
"\n",
"pt = PrettyTable()\n",
"pt.field_names = [\"Strategy\",\"F1 Score\",\"AUC score\"]\n",
"pt.add_row([\"Random Forest - No imbalance correction\",score1,auc1])\n",
"pt.add_row([\"Random Forest - balanced_subsamples\",score2,auc2])\n",
"pt.add_row([\"Random Forest - Upsampling\",score3,auc3])\n",
"pt.add_row([\"Random Forest - Downsampling\",score4,auc4])\n",
"print(pt)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### ⏸ Which of the metrics given below is not recommended when there is a imbalance in the dataset?\n",
"\n",
"#### A. Precision\n",
"#### B. Recall\n",
"#### C. F1-Score\n",
"#### D. Accuracy\n",
"#### E. AUC-Score"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_chow1) ###\n",
"# Submit an answer choice as a string below (eg. if you choose option A, put 'A')\n",
"answer1 = '___'\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 1
}