{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Title\n", "\n", "**Class imbalance: Random Forest vs SMOTE Classification**\n", "\n", "# Description\n", "\n", "The goal of this exercise is to investigate the performance of Random Forest with and without class balancing techniques on a dataset with class imbalance.\n", "\n", "The comparison will look a little something like this:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Instructions:\n", "- Read the dataset `diabetes.csv` as a pandas dataframe.\n", "- Take a quick look at the dataset.\n", "- Quantify the class imbalance of your response variable.\n", "- Assign the response variable as `Outcome` and everything else as predictors.\n", "- Split the data into train and validation sets.\n", "- Fit a `RandomForestClassifier()` on the training data, without any consideration for class imbalance.\n", "- Predict on the validation set, compute the `f1_score` and the `auc_score`, and save them to appropriate variables.\n", "- Fit a `RandomForestClassifier()` on the training data, but this time account for class imbalance by setting `class_weight='balanced_subsample'`.\n", "- Predict on the validation set and compute the `f1_score` and the `auc_score` for this model and save them to appropriate variables.\n", "- Fit a `RandomForestClassifier()` with `class_weight='balanced_subsample'` on the training data generated using SMOTE.\n", "- Predict on the validation set and compute the `f1_score` and the `auc_score` for this model and save them to appropriate variables.\n", "- Finally, use the helper code to tabulate your results and compare the performance of each model.\n", "\n", "\n", "# Hints:\n", "\n", "`RandomForestClassifier()` : A random forest classifier.\n", "\n", "`f1_score()` : Compute the F1 score, also known as the balanced F-score or F-measure.\n", "\n", "`roc_auc_score()` : Compute Area Under the 
Receiver Operating Characteristic Curve (ROC AUC) from prediction scores." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import the main packages\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import f1_score\n", "from sklearn.metrics import roc_auc_score\n", "from imblearn.over_sampling import SMOTE\n", "from prettytable import PrettyTable\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Read the dataset and take a quick look\n", "\n", "df = pd.read_csv('diabetes.csv')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# On checking the response variable ('Outcome') value counts, you will notice that the number of diabetics is smaller than the number of non-diabetics\n", "\n", "df['Outcome'].value_counts()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### edTest(test_imbalance) ###\n", "\n", "# To estimate the amount of data imbalance, find the ratio of class 1 (diabetics) to the size of the dataset\n", "\n", "imbalance_ratio = ___\n", "\n", "print(f'The percentage of diabetics in the dataset is only {imbalance_ratio*100:.2f}%')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Assign the predictor and response variables\n", "\n", "# Outcome is the response and all the other columns are the predictors\n", "\n", "X = ___\n", "y = ___" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": 
[ "# Fix a random_state and split the data into train and validation sets\n", "\n", "random_state = 22\n", "\n", "X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.8, random_state=random_state)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We fix max_depth to 20 for all trees; you can come back and change this to investigate the performance of the Random Forest\n", "\n", "max_depth = 20" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Strategy 1 - Vanilla Random Forest\n", "\n", "- No correction for imbalance" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define a Random Forest classifier with random_state as above\n", "# Set the maximum depth to max_depth and use 10 estimators\n", "random_forest = ___\n", "\n", "# Fit the model on the training set\n", "random_forest.___" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We make predictions on the validation set\n", "predictions = ___\n", "\n", "# We also compute two metrics that better capture misclassification of the minority class, i.e. the F1 score and AUC\n", "# Compute the f1-score and assign it to the variable score1\n", "score1 = ___\n", "\n", "# Compute the AUC and assign it to the variable auc1\n", "auc1 = ___" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Strategy 2 - Random Forest with class weighting\n", "- Balancing the class imbalance in each bootstrap sample" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define a Random Forest classifier with random_state as above\n", "\n", "# Set the maximum depth to max_depth and use 10 estimators\n", "\n", "# Specify `class_weight='balanced_subsample'`\n", "\n", "random_forest = ___\n", "\n", "# Fit the model on the training data\n", "\n", "random_forest.___" ] }, { "cell_type": "code", "execution_count": 
null, "metadata": {}, "outputs": [], "source": [ "# We make predictions on the validation set\n", "\n", "predictions = ___\n", "\n", "# Again we compute the two metrics that better capture misclassification of the minority class, i.e. the F1 score and AUC\n", "\n", "# Compute the f1-score and assign it to the variable score2\n", "\n", "score2 = ___\n", "\n", "# Compute the AUC and assign it to the variable auc2\n", "\n", "auc2 = ___" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Strategy 3 - Random Forest with SMOTE\n", "- We can use SMOTE along with the previous method to further improve our metrics.\n", "- Read more about imblearn's SMOTE [here](https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Run this cell to use SMOTE to balance the training set\n", "sm = SMOTE(random_state=3)\n", "\n", "# Note: `fit_sample` was removed in newer versions of imblearn; `fit_resample` is the current name\n", "X_train_res, y_train_res = sm.fit_resample(X_train, y_train)\n", "\n", "# If you now check the shapes, you will see that X_train_res has more points than X_train\n", "\n", "print(f'Number of points in balanced dataset is {X_train_res.shape[0]}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Again, define a Random Forest classifier with random_state as above\n", "\n", "# Set the maximum depth to max_depth and use 10 estimators\n", "\n", "# Again specify `class_weight='balanced_subsample'`\n", "\n", "random_forest = ___\n", "\n", "# Fit the model on the new training data created above with SMOTE\n", "\n", "random_forest.___" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### edTest(test_smote) ###\n", "\n", "# We make predictions on the validation set\n", "\n", "predictions = ___\n", "\n", "# Again we compute the two metrics that better capture misclassification of the minority class, i.e. the F1 score and 
`AUC`\n", "\n", "# Compute the f1-score and assign it to the variable score3\n", "\n", "score3 = ___\n", "\n", "# Compute the AUC and assign it to the variable auc3\n", "\n", "auc3 = ___" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Finally, we compare the results from the three implementations above\n", "\n", "# Just run the cell below to see your results\n", "\n", "pt = PrettyTable()\n", "pt.field_names = [\"Strategy\", \"F1 Score\", \"AUC score\"]\n", "pt.add_row([\"Random Forest - no correction\", score1, auc1])\n", "pt.add_row([\"Random Forest - class weighting\", score2, auc2])\n", "pt.add_row([\"Random Forest - SMOTE upsampling\", score3, auc3])\n", "print(pt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Mindchow 🍲\n", "\n", "- Go back and change the `max_depth` and `n_estimators` parameters of the Random Forest. Do you see an improvement in results?\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Your answer here*" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }