{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Title\n",
"\n",
"**Class imbalance: Random Forest vs SMOTE Classification**\n",
"\n",
"# Description\n",
"\n",
"The goal of this exercise is to investigate the performance of Random Forest with and without class balancing techniques on a dataset with class imbalance.\n",
"\n",
"The comparison will look a little something like this:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Instructions:\n",
"- Read the dataset `diabetes.csv` as a pandas dataframe.\n",
"- Take a quick look at the dataset.\n",
"- Quantify the class imbalance of your response variable. \n",
"- Assign the response variable as `Outcome` and everything else as a predictor.\n",
"- Split the data into train and validation sets.\n",
"- Fit a `RandomForestClassifier()` on the training data, without any consideration for class imbalance.\n",
"- Predict on the validation set and compute the `f1_score` and the `auc_score` and save them to appropriate variables.\n",
"- Fit a `RandomForestClassifier()` on the training data, but this time make a consideration for class imbalance by setting `lass_weight='balanced_subsample'`.\n",
"- Predict on the validation set and compute the f1_score and the auc_score for this model and save them to appropriate variables.\n",
"- Fit a `RandomForestClassifier()` on the training data generated using SMOTE using `class_weight='balanced_subsample'`.\n",
"- Predict on the validation set and compute the `f1_score` and the `auc_score` for this model and save them to appropriate variables.\n",
"- Finally, use the helper code to tabulate your results and compare the performance of each model.\n",
"\n",
"\n",
"# Hints:\n",
"\n",
"RandomForestClassifier() : A random forest classifier.\n",
"\n",
"f1_score() : Compute the F1 score, also known as balanced F-score or F-measure\n",
"\n",
"roc_auc_score() : Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores."
]
},
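{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two metrics named in the hints can be tried on a toy example before touching the real data. The sketch below uses made-up labels and scores (not the diabetes dataset) to suggest why accuracy can look good under class imbalance while the F1 score does not. The names `y_true`, `y_pred_majority` and `y_score` are illustrative assumptions.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.metrics import accuracy_score, f1_score, roc_auc_score\n",
"\n",
"# Made-up labels: 8 negatives and 2 positives (20% minority class)\n",
"y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])\n",
"\n",
"# A classifier that always predicts the majority class\n",
"y_pred_majority = np.zeros_like(y_true)\n",
"\n",
"# Made-up predicted probabilities for class 1\n",
"y_score = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.4, 0.1, 0.6, 0.7, 0.5])\n",
"\n",
"acc = accuracy_score(y_true, y_pred_majority)\n",
"# zero_division=0 avoids a warning when no positives are predicted\n",
"f1 = f1_score(y_true, y_pred_majority, zero_division=0)\n",
"# roc_auc_score ranks samples, so it is given scores rather than hard labels\n",
"auc = roc_auc_score(y_true, y_score)\n",
"\n",
"print(f'accuracy={acc:.2f}, f1={f1:.2f}, auc={auc:.4f}')\n",
"```\n",
"\n",
"Here the always-negative classifier reaches 80% accuracy but never finds a positive case, which the F1 score exposes."
]
},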
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import the main packages\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.inspection import permutation_importance\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.metrics import f1_score\n",
"from sklearn.metrics import roc_auc_score\n",
"from imblearn.over_sampling import SMOTE\n",
"from prettytable import PrettyTable\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Read the dataset and take a quick look\n",
"\n",
"df = pd.read_csv('diabetes.csv')\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# On checking the response variable ('Outcome') value counts, you will notice that the number of diabetics are less than the number of non-diabetics \n",
"\n",
"df['Outcome'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_imbalance) ###\n",
"\n",
"# To estimate the amount of data imbalance, find the ratio of class 1(Diabetics) to the size of the dataset.\n",
"\n",
"imbalance_ratio = ___\n",
"\n",
"print(f'The percentage of diabetics in the dataset is only {(imbalance_ratio)*100:.2f}%')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Assign the predictor and response variables.\n",
"\n",
"# Outcome is the response and all the other columns are the predictors\n",
"\n",
"X = ___\n",
"y = ___"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Fix a random_state and split the data into train and validation sets\n",
"\n",
"random_state = 22\n",
"\n",
"X_train, X_val, y_train,y_val = train_test_split(X,y,train_size = 0.8,random_state =random_state)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# We fix the max_depth variable to 20 for all trees, you can come back and change this to investigate performance of RandomForest\n",
"\n",
"max_depth = 20"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Strategy 1 - Vanilla Random Forest\n",
"\n",
"- No correction for imbalance"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Define a Random Forest classifier with random_state as above\n",
"# Set the maximum depth to be max_depth and use 10 estimators\n",
"random_forest = ___\n",
"\n",
"# Fit the model on the training set\n",
"random_forest.___"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# We make predictions on the validation set \n",
"predictions = ___\n",
"\n",
"# We also compute two metrics that better represent misclassification of minority classes i.e `f1 score` and `AUC`\n",
"# Compute the f1-score and assign it to variable score1\n",
"score1 = ___\n",
"\n",
"# Compute the `auc` and assign it to variable auc1\n",
"auc1 = ___"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Strategy 2 - Random Forest with class weighting\n",
"- Balancing the class imbalance in each bootstrap"
]
},
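{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of what the balanced weighting heuristic does, the sketch below computes the weights that scikit-learn's `compute_class_weight` assigns to a made-up label array. The `'balanced'` mode is documented as `n_samples / (n_classes * np.bincount(y))`, and `'balanced_subsample'` applies the same formula to each tree's bootstrap sample. The array `y_toy` is an invented example, not the exercise data.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.utils.class_weight import compute_class_weight\n",
"\n",
"# Made-up labels: 6 samples of class 0 and 2 samples of class 1\n",
"y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1])\n",
"\n",
"# Weights from scikit-learn's helper\n",
"weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y_toy)\n",
"\n",
"# The same values computed by hand from the documented formula\n",
"manual = len(y_toy) / (2 * np.bincount(y_toy))\n",
"\n",
"print(weights, manual)\n",
"```\n",
"\n",
"The minority class receives the larger weight, so errors on it cost more when the trees are grown."
]
},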
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Define a Random Forest classifier with random_state as above\n",
"\n",
"# Set the maximum depth to be max_depth and use 10 estimators\n",
"\n",
"# Specify `class_weight='balanced_subsample'\n",
"\n",
"random_forest = ___\n",
"\n",
"# Fit the model on the training data\n",
"\n",
"random_forest.___"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# We make predictions on the validation set \n",
"\n",
"predictions = ___\n",
"\n",
"# Again we also compute two metrics that better represent misclassification of minority classes i.e `f1 score` and `AUC`\n",
"\n",
"# Compute the f1-score and assign it to variable score2\n",
"\n",
"score2 = ___\n",
"\n",
"# Compute the `auc` and assign it to variable auc2\n",
"\n",
"auc2 = ___"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Strategy 3 - RandomForest with SMOTE\n",
"- We can use SMOTE along with the previous method to further improve our metrics.\n",
"- Read more about imblearn's SMOTE [here](https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html)."
]
},
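{
"cell_type": "markdown",
"metadata": {},
"source": [
"The core idea behind SMOTE is to create a synthetic minority sample on the line segment between a minority point and one of its nearest minority neighbours. The sketch below is a simplified illustration of that interpolation step only, using a single nearest neighbour and made-up points. It is not imblearn's implementation, which by default samples among several neighbours. The helper `synthetic_point` is a hypothetical name.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(0)\n",
"\n",
"# Made-up minority-class points in two dimensions\n",
"minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5]])\n",
"\n",
"def synthetic_point(points, i, rng):\n",
"    # Find the nearest other minority point to points[i]\n",
"    dists = np.linalg.norm(points - points[i], axis=1)\n",
"    dists[i] = np.inf\n",
"    j = int(np.argmin(dists))\n",
"    # Place the new sample at a random position on the segment between them\n",
"    gap = rng.uniform(0.0, 1.0)\n",
"    return points[i] + gap * (points[j] - points[i])\n",
"\n",
"new_point = synthetic_point(minority, 0, rng)\n",
"print(new_point)\n",
"```\n",
"\n",
"Because the new point is interpolated rather than copied, the resampled training set gains minority samples that are similar to, but not duplicates of, the originals."
]
},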
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Run this cell below to use SMOTE to balance our dataset\n",
"sm = SMOTE(random_state=3)\n",
"\n",
"X_train_res, y_train_res = sm.fit_sample(X_train, y_train.ravel())\n",
"\n",
"#If you now see the shape, you will see that X_train_res has more number of points than X_train\n",
"\n",
"print(f'Number of points in balanced dataset is {X_train_res.shape[0]}')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Again Define a Random Forest classifier with random_state as above\n",
"\n",
"# Set the maximum depth to be max_depth and use 10 estimators\n",
"\n",
"# Again specify `class_weight='balanced_subsample'\n",
"\n",
"random_forest = ___\n",
"\n",
"# Fit the model on the new training data created above with SMOTE \n",
"\n",
"random_forest.___"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_smote) ###\n",
"\n",
"# We make predictions on the validation set \n",
"\n",
"predictions = ___\n",
"\n",
"# Again we also compute two metrics that better represent misclassification of minority classes i.e `f1 score` and `AUC`\n",
"\n",
"# Compute the f1-score and assign it to variable score3\n",
"\n",
"score3 = ___\n",
"\n",
"# Compute the `auc` and assign it to variable auc3\n",
"\n",
"auc3 = ___"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Finally, we compare the results from the three implementations above\n",
"\n",
"# Just run the cells below to see your results\n",
"\n",
"pt = PrettyTable()\n",
"pt.field_names = [\"Strategy\",\"F1 Score\",\"AUC score\"]\n",
"pt.add_row([\"Random Forest - no correction\",score1,auc1])\n",
"pt.add_row([\"Random Forest - class weighting\",score2,auc2])\n",
"pt.add_row([\"Random Forest - SMOTE upsampling\",score3,auc3])\n",
"print(pt)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Mindchow 🍲\n",
"\n",
"- Go back and change the `learning_rate` parameter and `n_estimators` for Adaboost. Do you see an improvement in results?\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Your answer here*"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}