{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Title\n", "\n", "**Class imbalance: Random Forest vs SMOTE Classification**\n", "\n", "# Description\n", "\n", "The goal of this exercise is to investigate the performance of Random Forest with and without class balancing techniques on a dataset with class imbalance.\n", "\n", "The comparison will look a little something like this:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Instructions:\n", "- Read the dataset `diabetes.csv` as a pandas dataframe.\n", "- Take a quick look at the dataset.\n", "- Quantify the class imbalance of your response variable.\n", "- Assign the response variable as `Outcome` and everything else as predictors.\n", "- Split the data into train and validation sets.\n", "- Fit a `RandomForestClassifier()` on the training data, without any consideration for class imbalance.\n", "- Predict on the validation set, compute the `f1_score` and the `auc_score`, and save them to appropriate variables.\n", "- Fit a `RandomForestClassifier()` on the training data, but this time account for class imbalance by setting `class_weight='balanced_subsample'`.\n", "- Predict on the validation set and compute the `f1_score` and the `auc_score` for this model and save them to appropriate variables.\n", "- Fit a `RandomForestClassifier()` with `class_weight='balanced_subsample'` on the training data generated using SMOTE.\n", "- Predict on the validation set and compute the `f1_score` and the `auc_score` for this model and save them to appropriate variables.\n", "- Finally, use the helper code to tabulate your results and compare the performance of each model.\n", "\n", "\n", "# Hints:\n", "\n", "`RandomForestClassifier()` : A random forest classifier.\n", "\n", "`f1_score()` : Compute the F1 score, also known as the balanced F-score or F-measure.\n", "\n", "`roc_auc_score()` : Compute Area Under the 
Receiver Operating Characteristic Curve (ROC AUC) from prediction scores." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import the main packages\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import f1_score\n", "from sklearn.metrics import roc_auc_score\n", "from imblearn.over_sampling import SMOTE\n", "from prettytable import PrettyTable\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Read the dataset and take a quick look\n", "\n", "df = pd.read_csv('diabetes.csv')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# On checking the response variable ('Outcome') value counts, you will notice that the number of diabetics is smaller than the number of non-diabetics\n", "\n", "df['Outcome'].value_counts()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### edTest(test_imbalance) ###\n", "\n", "# To estimate the amount of data imbalance, find the ratio of class 1 (diabetics) to the size of the dataset\n", "\n", "imbalance_ratio = ___\n", "\n", "print(f'The percentage of diabetics in the dataset is only {imbalance_ratio*100:.2f}%')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Assign the predictor and response variables\n", "\n", "# Outcome is the response and all the other columns are the predictors\n", "\n", "X = ___\n", "y = ___" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": 
[ "# Fix a random_state and split the data into train and validation sets\n", "\n", "random_state = 22\n", "\n", "X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.8, random_state=random_state)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We fix max_depth to 20 for all trees; you can come back and change this to investigate the performance of the Random Forest\n", "\n", "max_depth = 20" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Strategy 1 - Vanilla Random Forest\n", "\n", "- No correction for imbalance" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define a Random Forest classifier with random_state as above\n", "# Set the maximum depth to max_depth and use 10 estimators\n", "random_forest = ___\n", "\n", "# Fit the model on the training set\n", "random_forest.___" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We make predictions on the validation set\n", "predictions = ___\n", "\n", "# We also compute two metrics that better capture misclassification of the minority class, i.e. the F1 score and AUC\n", "# Compute the f1-score and assign it to the variable score1\n", "score1 = ___\n", "\n", "# Compute the AUC and assign it to the variable auc1\n", "auc1 = ___" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Strategy 2 - Random Forest with class weighting\n", "- Balancing the class imbalance in each bootstrap sample" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define a Random Forest classifier with random_state as above\n", "\n", "# Set the maximum depth to max_depth and use 10 estimators\n", "\n", "# Specify `class_weight='balanced_subsample'`\n", "\n", "random_forest = ___\n", "\n", "# Fit the model on the training data\n", "\n", "random_forest.___" ] }, { "cell_type": "code", "execution_count": 
null, "metadata": {}, "outputs": [], "source": [ "# We make predictions on the validation set\n", "\n", "predictions = ___\n", "\n", "# Again we compute the two metrics that better capture misclassification of the minority class, i.e. the F1 score and AUC\n", "\n", "# Compute the f1-score and assign it to the variable score2\n", "\n", "score2 = ___\n", "\n", "# Compute the AUC and assign it to the variable auc2\n", "\n", "auc2 = ___" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Strategy 3 - Random Forest with SMOTE\n", "- We can use SMOTE along with the previous method to further improve our metrics.\n", "- Read more about imblearn's SMOTE [here](https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Run this cell to use SMOTE to balance the training set\n", "sm = SMOTE(random_state=3)\n", "\n", "# Note: `fit_sample` was removed in newer versions of imblearn; `fit_resample` is the current name\n", "X_train_res, y_train_res = sm.fit_resample(X_train, y_train)\n", "\n", "# If you now check the shapes, you will see that X_train_res has more points than X_train\n", "\n", "print(f'Number of points in balanced dataset is {X_train_res.shape[0]}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Again, define a Random Forest classifier with random_state as above\n", "\n", "# Set the maximum depth to max_depth and use 10 estimators\n", "\n", "# Again specify `class_weight='balanced_subsample'`\n", "\n", "random_forest = ___\n", "\n", "# Fit the model on the new training data created above with SMOTE\n", "\n", "random_forest.___" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### edTest(test_smote) ###\n", "\n", "# We make predictions on the validation set\n", "\n", "predictions = ___\n", "\n", "# Again we compute the two metrics that better capture misclassification of the minority class, i.e. the F1 score and 
`AUC`\n", "\n", "# Compute the f1-score and assign it to the variable score3\n", "\n", "score3 = ___\n", "\n", "# Compute the AUC and assign it to the variable auc3\n", "\n", "auc3 = ___" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Finally, we compare the results from the three implementations above\n", "\n", "# Just run the cell below to see your results\n", "\n", "pt = PrettyTable()\n", "pt.field_names = [\"Strategy\", \"F1 Score\", \"AUC score\"]\n", "pt.add_row([\"Random Forest - no correction\", score1, auc1])\n", "pt.add_row([\"Random Forest - class weighting\", score2, auc2])\n", "pt.add_row([\"Random Forest - SMOTE upsampling\", score3, auc3])\n", "print(pt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Mindchow 🍲\n", "\n", "- Go back and change the `max_depth` and `n_estimators` parameters of the Random Forest. Do you see an improvement in results?\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Your answer here*" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }