{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Title :\n", "\n", "Case Study: Covid-19\n", "\n", "## Hints: \n", "\n", "sklearn.model_selection.GridSearchCV\n", "Exhaustive search over specified parameter values for an estimator.\n", "\n", "sklearn.metrics.confusion_matrix\n", "Compute confusion matrix to evaluate the accuracy of a classification.\n", "\n", "seaborn.heatmap\n", "Plot rectangular data as a color-encoded matrix.\n", "\n", "sklearn.metrics.roc_auc_score\n", "Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores.\n", "\n", "sklearn.metrics.roc_curve\n", "Compute Receiver operating characteristic (ROC)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# COVID-19 Machine Learning Dataset®\n", "\n", ">Adapted from the dataset provided by Dr. Karandeep Singh [@kdpsinghlab](https://twitter.com/kdpsinghlab/status/1239416911668092928)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The goal of this case study (intended for education) is to **predict the urgency** with which a COVID-19 patient will need to be admitted to the hospital from the time of onset of symptoms." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The original dataset is located on this [github repo](https://github.com/ml4lhs/covid19_ml_education/raw/master/covid_ml.csv). Please note that this dataset has been simplified for this case study.\n", "\n", "The raw data comes from the following [source](http://virological.org/t/epidemiological-data-from-the-ncov-2019-outbreak-early-descriptions-from-publicly-available-data/337)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Intended For Educational Use Only\n", "## Should this data be used for research?\n", "\n", "No. 
Students working with this dataset should understand that both the source data and the ML data have several limitations:\n", "- The source data is crowdsourced and may contain inaccuracies.\n", "- There may be duplicate patients in this dataset.\n", "- There is a substantial amount of missingness in the symptoms data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **And most importantly:**\n", "- The entire premise is flawed. The fact that a patient was admitted the same day as experiencing symptoms may have more to do with the availability of hospital beds than with the patient's acuity of illness.\n", "- Also, the fact that less sick or asymptomatic patients may not have been captured in the source dataset means that the probabilities estimated by any model fit on this data are unlikely to reflect reality." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Primary predictors:**\n", "- age (if an age range was provided in the source data, only the first number is used)\n", "- sex\n", "- cough, fever, chills, sore_throat, headache, fatigue (all derived from the symptoms column using regular expressions)\n", "\n", ">The goal of the exercise is to make a classification model to predict the **urgency_of_admission** based on the following criteria:\n", "1. 0-1 days from onset of symptoms to admission => High\n", "2. 
2+ days from onset of symptoms to admission *or* no admission => Low\n", " " ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Import necessary libraries\n", "# Feel free to import other modules and libraries as you deem fit\n", "import sklearn\n", "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "from prettytable import PrettyTable\n", "from sklearn.metrics import confusion_matrix\n", "from sklearn.model_selection import GridSearchCV\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.model_selection import train_test_split\n", "from IPython.core.interactiveshell import InteractiveShell\n", "from sklearn.metrics import classification_report,roc_auc_score, roc_curve, accuracy_score\n", "\n", "%matplotlib inline\n", "InteractiveShell.ast_node_interactivity = \"all\"\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Calling the dataset\n", ">We are using a modified dataset found here [covid_19 dataset source](https://web.us.edusercontent.com/hulb749ekf8a0pncvj99orb78k)\n", "\n", "The following changes were made:\n", "1. Categorical values changed to 1 and 0\n", "2. SMOTE used to upsample in order to balance the dataset. Refer to the class imbalance exercise in the Random Forest session." 
] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Read the train and test data\n", "# Take a look at the data to understand the features and response\n", "\n", "# Your code here\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Define X_train, y_train, X_test, and y_test\n", "# Urgency is the response variable; all other variables are the predictors\n", "\n", "# Your code here\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## GridSearchCV for Logistic Regression\n", "> Perform a hyper-parameter search to get the best C value for Logistic Regression using [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).\n", ">> For simplicity, use **accuracy** as the metric to choose the best hyper-parameter\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Perform GridSearchCV to get the best C value for a Logistic Regression model\n", "# Feel free to use the cv and set of C values of your choice\n", "# Remember to keep track of your best C value\n", "\n", "# Your code here\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fitting the data and making predictions\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Using the C value from above, initialize a Logistic Regression model\n", "# Fit the model on the train data\n", "# Predict on the test data\n", "\n", "# Your code here\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Compute the accuracy of the model\n", "logistic_acc = ___\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## GridSearchCV for KNN classification\n", "> Perform a hyper-parameter search to get the best k value for KNN Classification using 
[GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).\n", ">> For simplicity, use **accuracy** as the metric to choose the best hyper-parameter\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Perform GridSearchCV to get the best k value for a kNN Classification model\n", "# Feel free to use the cv and set of k values of your choice\n", "# Remember to keep track of your best k value\n", "\n", "# Your code here\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fitting the data and making predictions\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Using the k value from above, initialize a kNN Classification model\n", "# Fit the model on the train data\n", "# Predict on the test data\n", "\n", "# Your code here\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Compute the accuracy of the model\n", "knn_acc = ___\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Store the Confusion Matrix of the trained Logistic Regression Model on the test data in a variable\n", "\n", "\n", "# Your code here\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Store the Confusion Matrix of the trained kNN Classification Model on the test data in a variable\n", "\n", "# Your code here\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What is a Confusion Matrix?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " A classifier will get some samples right, and some wrong. 
Generally we see which ones it gets right and which ones it gets wrong on the test set.\n", " \n", " ![hwimages](images/confusionmatrix.png)\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### True Positive\n", "- Samples that are positive and that the classifier predicts as positive are called True Positives (TP)\n", "\n", "### False Positive\n", "- Samples that are negative and that the classifier (wrongly) predicts as positive are called False Positives (FP)\n", "\n", "### True Negative\n", "- Samples that are negative and that the classifier predicts as negative are called True Negatives (TN)\n", "\n", "### False Negative\n", "- Samples that are positive and that the classifier (wrongly) predicts as negative are called False Negatives (FN)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The boy who cried wolf: Data Science edition\n", "\n", "#### Predicted wolf, but no wolf \n", "\n", " ![hwimages](images/fp.jpeg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Predicted no wolf, but actually wolf\n", "\n", " ![hwimages](images/fn.jpeg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plot the Confusion Matrix\n", "\n", "Plot the Confusion Matrix for each of the above models as a heatmap. Your plot should look similar to the following:\n", "\n", " ![hwimages](images/cm_plot.png)\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Plot the Confusion Matrix for the Logistic Regression and kNN Classification models\n", "\n", "# Your code here\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Sensitivity**\n", "\n", "The **Sensitivity**, also known as **Recall** or **True Positive Rate (TPR)**, is defined as\n", "\n", "\n", "$$TPR = Recall = \\frac{TP}{OP} = \\frac{TP}{TP+FN},$$\n", "\n", "also called the **Hit Rate**: the fraction of observed positives (1s) the classifier gets right, or how many true positives were recalled. 
Maximizing the recall towards 1 means keeping down the false negative rate." ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Compute the Sensitivity for the Logistic Regression model\n", "logistic_recall = ___\n", "\n", "# Compute the Sensitivity for the kNN Classification model\n", "knn_recall = ___\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Specificity**\n", "The **Specificity** or **True Negative Rate** is defined as\n", "\n", "$$TNR = \\frac{TN}{ON} = \\frac{TN}{FP+TN}$$" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Compute the Specificity for the Logistic Regression model\n", "logistic_fpr = ___\n", "\n", "# Compute the Specificity for the kNN Classification model\n", "knn_fpr = ___\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Precision** (Positive Predictive Value)\n", "\n", "**Precision** tells you how many of the predicted positive (1) hits were truly positive\n", "\n", "$$Precision = \\frac{TP}{PP} = \\frac{TP}{TP+FP}.$$" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Compute the Precision for the Logistic Regression model\n", "logistic_precision = ___\n", "\n", "# Compute the Precision for the kNN Classification model\n", "knn_precision = ___\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **F1 score**\n", "The **F1** score is the Harmonic Mean of Precision and Recall.\n", "A high F1 score requires keeping both **false positives** and **false negatives** low simultaneously\n", "\n", "$$F1 = \\frac{2*Recall*Precision}{Recall + Precision}$$" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Compute the F1-Score for the Logistic Regression model\n", "logistic_fscore = ___\n", "\n", "# Compute the F1-Score for the kNN Classification model\n", "knn_fscore = ___\n" ] }, { "cell_type": "code", "execution_count": 0, 
"metadata": {}, "outputs": [], "source": [ "# Helper code to bring everything together\n", "pt = PrettyTable()\n", "\n", "pt.field_names = [\"Metric\", \"Logistic Regression\", \"kNN Classification\"]\n", "pt.add_row([\"Accuracy\", round(logistic_acc, 3), round(knn_acc, 3)])\n", "pt.add_row([\"Sensitivity(Recall)\", round(logistic_recall, 3), round(knn_recall, 3)])\n", "pt.add_row([\"Specificity\", round(logistic_fpr, 3), round(knn_fpr, 3)])\n", "pt.add_row([\"Precision\", round(logistic_precision, 3), round(knn_precision, 3)])\n", "pt.add_row([\"F1 Score\", round(logistic_fscore, 3), round(knn_fscore, 3)])\n", "\n", "print(pt)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# BACK TO THE LECTURE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bayes Theorem & Diagnostic testing\n", "\n", "Refer to Dr. Rahul Dave's [Covid19 Serological testing blog](https://covid19.posts.ai/2020/04/04/bayes-rule-and-serological-testing.html) for an excellent introduction to all the above concepts." 
] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Compute the area under the ROC curve for the Logistic Regression model\n", "logreg_auc = ___\n", "\n", "# Compute the area under the ROC curve for the kNN Classification model\n", "knnreg_auc = ___\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ROC Curve\n", "\n", "To make a ROC curve you plot the True Positive Rate against the False Positive Rate.\n", "\n", "The curve is actually a 3 dimensional plot, with each point representing a different value of the threshold.\n", "\n", "![ROC](images/roc.png)\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Plot the ROC curve for the Logistic Regression model and kNN Classification model\n", "# You can refer to the end of homework 5 for example code\n", "\n", "# Your code here\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Which classifier to choose?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Choice of classifier Scenario 1 - BRAZIL\n", "\n", "\"drawing\"\n", "\n", ">**The new variant of the Covid-19 virus is contagious and is infecting many Brazilians**\n", "\n", ">>*Brazilian officials, however, dictate that hospitals do not classify a large number of people as 'high' risk, to avoid bad press and subsequent global political backlash*\n", "\n", "**In numbers we need the best classifier with the following restriction**\n", "\n", "$$TPR + FPR \\le 0.5$$ \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Choice of classifier Scenario 2 - GERMANY\n", "\n", "\"drawing\"\n", "\n", ">**It's the month of February, and Germany is now aware that the Covid-19 pandemic has severely hit Europe. Italy is already decimated and there is suspected spread to other European nations as well**\n", "\n", ">> German officials have a clear target. They want the fatality ratio to be as low as possible. 
Thus, it is imperative to find cases in need of urgent attention and give them the best chance of survival.\n", "\n", "**In numbers we need the best classifier with the following restriction**\n", "\n", "$$ 0.8 \\le TPR \\le 0.85 $$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Choice of classifier Scenario 3 - INDIA\n", "\n", "\"drawing\"\n", "\n", ">**It's the month of May, and India, now severely impacted by Covid-19, has run into a major shortage of hospital beds for suspected cases**\n", "\n", ">> Owing to exponentially rising cases, Indian officials cannot afford many **False Positives** to be given hospital beds. Hence, it is dictated that hospitals do not classify a large number of people as 'high' risk, to avoid a bed shortage for **At risk** patients\n", "\n", ">> India has only 1 million beds left, and there are already 2 million people suspected of having the disease. The officials need to work out a strategy to find the people most in need of urgent care.\n", "\n", "**In numbers we need the best classifier with the following restriction**\n", "\n", "$$TP + FP \\le 1000000$$\n", "\n", "$$TP = TPR*OP$$ $$FP = FPR*ON$$\n", "\n", "$$TPR*OP + FPR*ON \\le 1000000$$\n", "\n", "$$Assuming\\ OP=ON = 1000000$$\n", "\n", "$$TPR + FPR \\le 1 $$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ROC curve with boundary conditions\n", "\n", "Plot the ROC curve of the Logistic Regression model and kNN Classification model, along with the boundary conditions for each of the scenarios.\n", "\n", "Each of the scenarios can be represented as a region governed by straight line(s) based on the given conditions. 
The resulting plot will look similar to the following:\n", "\n", "\"drawing\"\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Area under curve - Logistic regression & kNN\n", "# along with the boundary conditions\n", "\n", "# Your code here\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# BACK TO THE LECTURE" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }