{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Title :\n",
"\n",
"Exercise: Random Forest with Class Imbalance\n",
"\n",
"## Description :\n",
"\n",
"The goal of this exercise is to investigate the performance of Random Forest on a dataset with class imbalance, and then use corrections strategies to improve performance.\n",
"\n",
"Your final comparison may look like the table below (but not with these exact values):\n",
"\n",
"
\n",
"\n",
"## Instructions:\n",
"\n",
"- Read the dataset `diabetes.csv` as a pandas dataframe.\n",
"- Take a quick look at the dataset.\n",
"- Split the data into train and test sets.\n",
"- Perform classification with a Vanilla Random Forest which does not take into account class imbalance.\n",
"- Perform classification with a Balanced Random Forest which does take into account class imbalance.\n",
"- Upsample the data and perform classification with a Balanced Random Forest.\n",
"- Downsample the data and perform classification with a Balanced Random Forest.\n",
"- Compare the F1-Score and AUC Score of all 4 models.\n",
"\n",
"## Hints: \n",
"\n",
"np.ravel()\n",
"Return a contiguous flattened array.\n",
"\n",
"f1_score()\n",
"Compute the F1 score, also known as balanced F-score or F-measure.\n",
"\n",
"roc_auc_score()\n",
"Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores.\n",
"\n",
"sklearn.train_test_split()\n",
"Split arrays or matrices into random train and test subsets.\n",
"\n",
"RandomForestClassifier()\n",
"Defines the RandomForestClassifier and includes more details on the definition and range of values for its tunable parameters.\n",
"\n",
"RandomForestClassifier.fit()\n",
"Build a forest of trees from the training set (X, y).\n",
"\n",
"RandomForestClassifier.predict()\n",
"Predict class for X.\n",
"\n",
"BalancedRandomForestClassifier()\n",
"A balanced random forest classifier.\n",
"\n",
"BalancedRandomForestClassifier.fit()\n",
"Build a forest of trees from the training set (X, y).\n",
"\n",
"BalancedRandomForestClassifier.predict()\n",
"Predict class for X.\n",
"\n",
"SMOTE()\n",
"Class to perform over-sampling using SMOTE.\n",
"\n",
"SMOTE.fit_resample()\n",
"Resample the dataset.\n",
"\n",
"RandomUnderSampler()\n",
"Class to perform random under-sampling.\n",
"\n",
"RandomUnderSampler.fit_resample()\n",
"Resample the dataset."
]
},
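{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Illustrative sketch (not part of the graded scaffold):* the hints above boil down to a fit / predict / score pattern. Below is a minimal, self-contained example of the metric calls and of `np.ravel()`, using small hypothetical label arrays instead of the diabetes data.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.metrics import f1_score, roc_auc_score\n",
"\n",
"# Hypothetical true labels and hard predictions, just to show the call signatures\n",
"y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1])\n",
"y_pred = np.array([0, 0, 1, 0, 1, 0, 0, 1])\n",
"\n",
"print(round(f1_score(y_true, y_pred), 2))       # F1 on the positive (minority) class\n",
"print(round(roc_auc_score(y_true, y_pred), 2))  # AUC computed from hard label predictions\n",
"\n",
"# np.ravel() flattens a column vector into the 1-D shape that fit() expects\n",
"y_col = np.array([[0], [1], [0]])\n",
"print(np.ravel(y_col))\n",
"```"
]
},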
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Import necessary libraries\n",
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"from prettytable import PrettyTable\n",
"from imblearn.over_sampling import SMOTE\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.metrics import f1_score, roc_auc_score\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.inspection import permutation_importance\n",
"from imblearn.under_sampling import RandomUnderSampler\n",
"from imblearn.ensemble import BalancedRandomForestClassifier\n",
"%matplotlib inline\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Code to read the dataset and take a quick look\n",
"df = pd.read_csv(\"diabetes.csv\")\n",
"df.head()\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The percentage of diabetics in the dataset is only 34.90%\n"
]
}
],
"source": [
"# Investigate the response variable for data imbalance\n",
"count0, count1 = df['Outcome'].value_counts()\n",
"\n",
"print(f'The percentage of diabetics in the dataset is only {100*count1/(count0+count1):.2f}%')\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# Assign the predictor and response variables\n",
"# \"Outcome\" is the response and all the other columns are the predictors\n",
"# Use the values of these features and response\n",
"X = ___\n",
"y = ___\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# Fix a random_state\n",
"random_state = 22\n",
"\n",
"# Split the data into train and validation sets\n",
"# Set random state as defined above and use a train size of 0.8\n",
"X_train, X_val, y_train, y_val = ___\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# Set the max_depth variable to 20 for all trees\n",
"max_depth = ___\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Strategy 1 - Vanilla Random Forest\n",
"\n",
"- No correction for imbalance"
]
},
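{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Illustrative sketch (not part of the graded scaffold):* one way a vanilla forest like this could be set up, shown on a small hypothetical dataset from `make_classification` rather than on the diabetes data; the variable names (`X_toy`, `rf`, ...) are placeholders, not the ones the autograder expects.\n",
"\n",
"```python\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Hypothetical imbalanced toy data (roughly 3:1 class ratio)\n",
"X_toy, y_toy = make_classification(n_samples=500, weights=[0.75, 0.25], random_state=22)\n",
"X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, train_size=0.8, random_state=22)\n",
"\n",
"# Vanilla forest: no class weighting and no resampling\n",
"rf = RandomForestClassifier(n_estimators=10, max_depth=20, random_state=22)\n",
"rf.fit(X_tr, y_tr)\n",
"toy_predictions = rf.predict(X_va)\n",
"```"
]
},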
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Define a Random Forest classifier with randon_state as above\n",
"# Set the maximum depth to be max_depth and use 10 estimators\n",
"random_forest = ___\n",
"\n",
"# Fit the model on the training set\n",
"___\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_vanilla) ### \n",
"# Use the trained model to predict on the validation set \n",
"predictions = ___\n",
"\n",
"# Compute two metrics that better represent misclassification of minority classes \n",
"# i.e `F1 score` and `AUC`\n",
"\n",
"# Compute the F1-score and assign it to variable score1\n",
"f_score = ___\n",
"score1 = round(f_score, 2)\n",
"\n",
"# Compute the AUC and assign it to variable auc1\n",
"auc_score = ___\n",
"auc1 = round(auc_score, 2)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Strategy 2 - Random Forest with class weighting\n",
"- Balancing the class imbalance in each bootstrap"
]
},
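{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Illustrative sketch (not part of the graded scaffold):* the only change from Strategy 1 is the `class_weight='balanced_subsample'` argument, which computes class weights from each tree's bootstrap sample. A tiny hypothetical training set is used below just so the snippet runs on its own.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"\n",
"# Tiny hypothetical training set, imbalanced 6:2\n",
"X_toy = np.arange(16, dtype=float).reshape(8, 2)\n",
"y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1])\n",
"\n",
"# 'balanced_subsample' reweights the classes within each bootstrap sample\n",
"rf_weighted = RandomForestClassifier(n_estimators=10, max_depth=20,\n",
"                                     class_weight='balanced_subsample',\n",
"                                     random_state=22)\n",
"rf_weighted.fit(X_toy, y_toy)\n",
"print(rf_weighted.predict(X_toy))\n",
"```"
]
},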
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Define a Random Forest classifier with randon_state as above\n",
"# Set the maximum depth to be max_depth and use 10 estimators\n",
"# Use class_weight as balanced_subsample to weigh the class accordingly\n",
"random_forest = ___\n",
"\n",
"# Fit the model on the training set\n",
"___\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_balanced) ###\n",
"\n",
"# Use the trained model to predict on the validation set \n",
"predictions = ___\n",
"\n",
"# Compute two metrics that better represent misclassification of minority classes \n",
"# i.e `F1 score` and `AUC`\n",
"\n",
"# Compute the F1-score and assign it to variable score2\n",
"f_score = ___\n",
"score2 = round(f_score, 2)\n",
"\n",
"# Compute the AUC and assign it to variable auc2\n",
"auc_score = ___\n",
"auc2 = round(auc_score, 2)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Strategy 3 - Balanced Random Forest with SMOTE \n",
"\n",
"- Using the **imblearn** `BalancedRandomForestClassifier()` \n",
"- Read more about this implementation [here](https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.ensemble.BalancedRandomForestClassifier.html)"
]
},
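{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Illustrative sketch (not part of the graded scaffold):* how `SMOTE.fit_resample()` can be called, shown on a small hypothetical training set; in the exercise itself, the forest in the following cells would then be fit on `X_train_res`, `y_train_res` rather than on the original training arrays.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from imblearn.over_sampling import SMOTE\n",
"\n",
"# Small hypothetical imbalanced training set (12 majority, 8 minority)\n",
"rng = np.random.default_rng(0)\n",
"X_toy = rng.normal(size=(20, 3))\n",
"y_toy = np.array([0] * 12 + [1] * 8)\n",
"\n",
"# SMOTE synthesises new minority rows by interpolating between a minority\n",
"# point and one of its nearest minority-class neighbours\n",
"sm_toy = SMOTE(random_state=2)\n",
"X_res, y_res = sm_toy.fit_resample(X_toy, np.ravel(y_toy))\n",
"\n",
"print(np.bincount(y_toy), '->', np.bincount(y_res))  # classes are now balanced\n",
"```"
]
},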
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"# Perform upsampling using SMOTE\n",
"\n",
"# Define a SMOTE with random_state=2\n",
"sm = ___\n",
"\n",
"# Use the SMOTE object to upsample the train data\n",
"# You may have to use ravel() \n",
"X_train_res, y_train_res = ___\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Define a Random Forest classifier with randon_state as above\n",
"# Set the maximum depth to be max_depth and use 10 estimators\n",
"# Use class_weight as balanced_subsample to weigh the class accordingly\n",
"random_forest = ___\n",
"\n",
"# Fit the Random Forest on upsampled data \n",
"___\n"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_upsample) ###\n",
"\n",
"# Use the trained model to predict on the validation set \n",
"predictions = ___\n",
"\n",
"# Compute the F1-score and assign it to variable score3\n",
"f_score = ___\n",
"score3 = round(f_score, 2)\n",
"\n",
"# Compute the AUC and assign it to variable auc3\n",
"auc_score = ___\n",
"auc3 = round(auc_score, 2)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Strategy 4 - Downsample the data\n",
"\n",
"Using the imblearn RandomUnderSampler()."
]
},
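{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Illustrative sketch (not part of the graded scaffold):* `RandomUnderSampler.fit_resample()` drops majority-class rows at random until the class counts match. Again, a small hypothetical training set is used so the snippet runs on its own.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from imblearn.under_sampling import RandomUnderSampler\n",
"\n",
"# Small hypothetical imbalanced training set (14 majority, 6 minority)\n",
"rng = np.random.default_rng(1)\n",
"X_toy = rng.normal(size=(20, 3))\n",
"y_toy = np.array([0] * 14 + [1] * 6)\n",
"\n",
"rs_toy = RandomUnderSampler(random_state=2)\n",
"X_res, y_res = rs_toy.fit_resample(X_toy, np.ravel(y_toy))\n",
"\n",
"print(np.bincount(y_toy), '->', np.bincount(y_res))  # 14:6 becomes 6:6\n",
"```"
]
},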
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"# Define an RandomUnderSampler instance with random state as 2\n",
"rs = ___\n",
"\n",
"# Downsample the train data\n",
"# You may have to use ravel()\n",
"X_train_res, y_train_res = ___\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Define a Random Forest classifier with randon_state as above\n",
"# Set the maximum depth to be max_depth and use 10 estimators\n",
"# Use class_weight as balanced_subsample to weigh the class accordingly\n",
"random_forest = ___\n",
"\n",
"# Fit the Random Forest on downsampled data \n",
"___\n"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_downsample) ###\n",
"\n",
"# Use the trained model to predict on the validation set \n",
"predictions = ___\n",
"\n",
"# Compute two metrics that better represent misclassification of minority classes \n",
"# i.e `F1 score` and `AUC`\n",
"\n",
"# Compute the F1-score and assign it to variable score4\n",
"f_score = ___\n",
"score4 = round(f_score, 2)\n",
"\n",
"# Compute the AUC and assign it to variable auc4\n",
"auc_score = ___\n",
"auc4 = round(auc_score, 2)\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Compile the results from the implementations above\n",
"\n",
"pt = PrettyTable()\n",
"pt.field_names = [\"Strategy\",\"F1 Score\",\"AUC score\"]\n",
"pt.add_row([\"Random Forest - No imbalance correction\",score1,auc1])\n",
"pt.add_row([\"Random Forest - balanced_subsamples\",score2,auc2])\n",
"pt.add_row([\"Random Forest - Upsampling\",score3,auc3])\n",
"pt.add_row([\"Random Forest - Downsampling\",score4,auc4])\n",
"print(pt)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### ⏸ Which of the metrics given below is not recommended when there is a imbalance in the dataset?\n",
"\n",
"#### A. Precision\n",
"#### B. Recall\n",
"#### C. F1-Score\n",
"#### D. Accuracy\n",
"#### E. AUC-Score"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_chow1) ###\n",
"# Submit an answer choice as a string below (eg. if you choose option A, put 'A')\n",
"answer1 = '___'\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 1
}