Key Word(s): Case Study
Title :¶
Case Study: Covid-19
Hints:¶
sklearn.model_selection.GridSearchCV Exhaustive search over specified parameter values for an estimator.
sklearn.metrics.confusion_matrix Compute confusion matrix to evaluate the accuracy of a classification.
seaborn.heatmap Plot rectangular data as a color-encoded matrix.
sklearn.metrics.roc_auc_score Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores.
sklearn.metrics.roc_curve Compute Receiver operating characteristic (ROC).
COVID-19 Machine Learning Dataset¶
Adapted from the dataset provided by Dr. Karandeep Singh @kdpsinghlab
The goal of this case study (intended for education) is to predict the urgency with which a COVID-19 patient will need to be admitted to the hospital from the time of onset of symptoms.
The original dataset is located on this GitHub repo. Please note that this dataset has been simplified for this case study.
The raw data comes from the following source.
Intended For Educational Use Only¶
Should this data be used for research?¶
No. Students working with this dataset should understand that both the source data and the ML data have several limitations:
- The source data is crowdsourced and may contain inaccuracies.
- There may be duplicate patients in this dataset.
- There is a substantial amount of missingness in the symptoms data.
And most importantly:¶
- The entire premise is flawed. The fact that a patient was admitted the same day as experiencing symptoms may have more to do with the availability of hospital beds as opposed to the patient's acuity of illness.
- Also, the fact that less sick or asymptomatic patients may not have been captured in the source dataset means that the probabilities estimated by any model fit on this data are unlikely to reflect reality.
Primary predictors:
- age (if an age range was provided in the source data, only the first number is used)
- sex
- cough, fever, chills, sore_throat, headache, fatigue (all derived from the symptoms column using regular expressions)
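As a rough sketch of how such predictors could be derived, the snippet below extracts the first number of an age range and builds 0/1 symptom indicators with regular expressions. The raw rows, the column names `age` and `symptoms`, and the patterns are all assumptions made for illustration; the original preprocessing may differ.

```python
import pandas as pd

# Hypothetical raw rows; column names and values are illustrative only
raw = pd.DataFrame({
    "age": ["40-49", "23", "60-69"],
    "symptoms": ["fever, dry cough", "Sore throat; headache", None],
})

# Keep only the first number of an age or age range
raw["age_first"] = raw["age"].str.extract(r"(\d+)", expand=False).astype(float)

# One 0/1 indicator column per symptom, matched case-insensitively;
# a missing symptoms entry is treated as "symptom not reported"
patterns = {
    "cough": r"cough",
    "fever": r"fever",
    "sore_throat": r"sore\s*throat",
    "headache": r"headache",
}
for name, pattern in patterns.items():
    raw[name] = (
        raw["symptoms"].str.contains(pattern, case=False, regex=True, na=False).astype(int)
    )

print(raw[["age_first", "cough", "fever", "sore_throat", "headache"]])
```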
The goal of the exercise is to build a classification model that predicts the urgency_of_admission based on the following criteria:
- 0-1 days from onset of symptoms to admission => High
- 2+ days from onset of symptoms to admission or no admission => Low
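The labelling rule above can be written as a small function. This is only a sketch: the function name `urgency_label` is hypothetical, and encoding "no admission" as `None` or NaN is an assumption, not something the dataset documents.

```python
import math

def urgency_label(days_to_admission):
    """Return 'High' for admission within 0-1 days of symptom onset, else 'Low'.

    None or NaN stands for 'no admission' (an assumed encoding).
    """
    if days_to_admission is None or math.isnan(days_to_admission):
        return "Low"
    return "High" if days_to_admission <= 1 else "Low"

for days in [0, 1, 2, 7, None]:
    print(days, urgency_label(days))
```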
# Import necessary libraries
# Feel free to import other modules and libraries as you deem fit
import sklearn
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from prettytable import PrettyTable
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from IPython.core.interactiveshell import InteractiveShell
from sklearn.metrics import classification_report,roc_auc_score, roc_curve, accuracy_score
%matplotlib inline
InteractiveShell.ast_node_interactivity = "all"
Calling the dataset¶
We are using a modified version of the dataset found here: covid_19 dataset source.
The following changes were made:
- Categorical values changed to 1 and 0
- SMOTE used to upsample in order to balance the dataset. Refer to the class imbalance exercise in the Random Forest session.
# Read the train and test data
# Take a look at the data to understand the features and response
# Your code here
# define X_train, y_train, X_test, and y_test
# Urgency is the response variable, all other variables are the predictors
# Your code here
GridSearchCV for Logistic Regression¶
Perform a hyper-parameter search to get the best C value for Logistic Regression using GridSearchCV.
For simplicity, use accuracy as the metric to choose the best hyper-parameter.
# Perform GridSearchCV to get the best C value for a Logistic Regression model
# Feel free to use the cv and set of C values of your choice
# Remember to keep track of your best C value
# Your code here
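For reference, a minimal sketch of such a search is shown below. It runs on synthetic data from `make_classification` rather than the case-study files, and the grid of C values, `cv=5`, and `max_iter=1000` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption): swap in X_train and y_train
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Candidate C values are arbitrary; smaller C means stronger regularization
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_demo, y_demo)

best_C = search.best_params_["C"]
print("best C:", best_C, "mean CV accuracy:", round(search.best_score_, 3))
```

With the default `refit=True`, `search.best_estimator_` is already refit on all the data passed to `fit`, so it can be used directly for prediction.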
Fitting the data and making predictions¶
# Using the C value from above, initialize a Logistic Regression model
# Fit the model on the train data
# Predict on the test data
# Your code here
# Compute the accuracy of the model
logistic_acc = ___
GridSearchCV for KNN classification¶
Perform a hyper-parameter search to get the best k value for KNN Classification using GridSearchCV.
For simplicity, use accuracy as the metric to choose the best hyper-parameter.
# Perform GridSearchCV to get the best k value for a kNN Classification model
# Feel free to use the cv and set of k values of your choice
# Remember to keep track of your best k value
# Your code here
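The same pattern applies to kNN, with `n_neighbors` as the parameter being searched. The sketch below again uses synthetic data. The grid of odd k values from 1 to 21 is an arbitrary choice (odd values avoid tied votes in a two-class problem).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): swap in X_train and y_train
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Odd k values from 1 to 21; the range is an illustrative choice
knn_grid = {"n_neighbors": list(range(1, 22, 2))}
knn_search = GridSearchCV(KNeighborsClassifier(), knn_grid, cv=5, scoring="accuracy")
knn_search.fit(X_demo, y_demo)

best_k = knn_search.best_params_["n_neighbors"]
print("best k:", best_k, "mean CV accuracy:", round(knn_search.best_score_, 3))
```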
Fitting the data and making predictions¶
# Using the k value from above, initialize a kNN Classification model
# Fit the model on the train data
# Predict on the test data
# Your code here
# Compute the accuracy of the model
knn_acc = ___
# Store the Confusion Matrix of the trained Logistic Regression Model on the test data in a variable
# Your code here
# Store the Confusion Matrix of the trained kNN Classification Model on the test data in a variable
# Your code here
What is a Confusion Matrix?¶
A classifier will get some samples right and some wrong. We usually check which ones it gets right and which ones it gets wrong on the test set.
True Positive¶
- Samples that are +ive and the classifier predicts as +ive are called True Positives (TP)
False Positive¶
- Samples that are -ive and the classifier predicts (wrongly) as +ive are called False Positives (FP)
True Negative¶
- Samples that are -ive and the classifier predicts as -ive are called True Negatives (TN)
False Negative¶
- Samples that are +ive and the classifier predicts as -ive are called False Negatives (FN)
Predicted no wolf, but actually wolf¶
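To connect these four definitions to scikit-learn, the sketch below builds a confusion matrix from ten made-up labels (not the case-study data). For binary labels 0 and 1, `confusion_matrix` puts actual classes on the rows and predicted classes on the columns, so `ravel()` yields the counts in the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (assumption): 4 actual positives and 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)

# Row = actual class, column = predicted class
tn, fp, fn, tp = cm.ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```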
Plot the Confusion Matrix¶
Plot the Confusion Matrix for each of the above models as a heatmap. Your plot should look similar to the following:
# Plot the Confusion Matrix for the Logistic Regression and kNN Classification models
# Your code here
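One possible way to draw such a heatmap with `seaborn.heatmap` is sketched below. The counts are made up, and labelling class 0 as "Low" and class 1 as "High" is an assumption about how the response was encoded.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Toy confusion matrix (assumption): rows = actual, columns = predicted
cm_demo = np.array([[5, 1], [2, 2]])
labels = ["Low (0)", "High (1)"]  # assumed class encoding

fig, ax = plt.subplots(figsize=(4, 3))
sns.heatmap(cm_demo, annot=True, fmt="d", cmap="Blues", cbar=False,
            xticklabels=labels, yticklabels=labels, ax=ax)
ax.set_xlabel("Predicted")
ax.set_ylabel("Actual")
ax.set_title("Confusion matrix (toy counts)")
fig.tight_layout()
```

To compare two models side by side, the same call can be repeated on each axis returned by `plt.subplots(1, 2)`.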
Sensitivity¶
The Sensitivity, also known as Recall or True Positive Rate (TPR), is defined as
$$TPR = Recall = \frac{TP}{OP} = \frac{TP}{TP+FN},$$
where OP is the number of observed positives. It is also called the Hit Rate: the fraction of observed positives (1s) that the classifier gets right, or how many true positives were recalled. Maximizing the recall towards 1 means keeping down the false negative rate.
# Compute the Sensitivity for the Logistic Regression model
logistic_recall = ___
# Compute the Sensitivity for the kNN Classification model
knn_recall = ___
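As a quick check of the formula on toy labels (not the case-study data), the sensitivity computed from the confusion-matrix counts should agree with `sklearn.metrics.recall_score`:

```python
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels (assumption)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # TP / observed positives
print(sensitivity, recall_score(y_true, y_pred))
```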
Specificity¶
The Specificity, or True Negative Rate (TNR), is defined as
$$TNR = \frac{TN}{ON} = \frac{TN}{FP+TN}$$
# Compute the Specificity for the Logistic Regression model
# (the helper table below expects it in logistic_fpr, despite the name)
logistic_fpr = ___
# Compute the Specificity for the kNN Classification model
knn_fpr = ___
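scikit-learn has no dedicated specificity function, but specificity is the recall of the negative class. The sketch below, on toy labels, computes it from the counts and cross-checks it with `recall_score(..., pos_label=0)`. It also shows that the False Positive Rate is 1 minus the specificity.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels (assumption)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)      # TN / observed negatives
false_positive_rate = fp / (fp + tn)

print(specificity, recall_score(y_true, y_pred, pos_label=0))
print(false_positive_rate, 1 - specificity)
```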
Precision (Positive Predictive Value)¶
Precision tells you how many of the predicted positive (1) hits were truly positive:
$$Precision = \frac{TP}{PP} = \frac{TP}{TP+FP},$$
where PP is the number of predicted positives.
# Compute the Precision for the Logistic Regression model
logistic_precision = ___
# Compute the Precision for the kNN Classification model
knn_precision = ___
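On the same kind of toy labels, precision from the counts can be compared against `sklearn.metrics.precision_score`:

```python
from sklearn.metrics import confusion_matrix, precision_score

# Toy labels (assumption)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)  # TP / predicted positives
print(precision, precision_score(y_true, y_pred))
```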
F1 score¶
The F1 score gives us the harmonic mean of Precision and Recall. It tries to minimize both false positives and false negatives simultaneously.
$$F1 = \frac{2*Recall*Precision}{Recall + Precision}$$
# Compute the F1-Score for the Logistic Regression model
logistic_fscore = ___
# Compute the F1-Score for the kNN Classification model
knn_fscore = ___
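A small worked example on toy labels: with recall 1/2 and precision 2/3, the harmonic mean is 4/7, which also equals 2TP / (2TP + FP + FN). The sketch checks both forms against `sklearn.metrics.f1_score`.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Toy labels (assumption)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = tp / (tp + fn)
precision = tp / (tp + fp)

f1_from_rates = 2 * recall * precision / (recall + precision)
f1_from_counts = 2 * tp / (2 * tp + fp + fn)
print(f1_from_rates, f1_from_counts, f1_score(y_true, y_pred))
```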
# Helper code to bring everything together
pt = PrettyTable()
pt.field_names = ["Metric", "Logistic Regression", "kNN Classification"]
pt.add_row(["Accuracy", round(logistic_acc, 3), round(knn_acc, 3)])
pt.add_row(["Sensitivity(Recall)", round(logistic_recall, 3), round(knn_recall, 3)])
pt.add_row(["Specificity", round(logistic_fpr, 3), round(knn_fpr, 3)])
pt.add_row(["Precision", round(logistic_precision, 3), round(knn_precision, 3)])
pt.add_row(["F1 Score", round(logistic_fscore, 3), round(knn_fscore, 3)])
print(pt)
BACK TO THE LECTURE¶
Bayes Theorem & Diagnostic testing¶
Refer to Dr. Rahul Dave's Covid19 Serological testing blog for an excellent introduction to all the above concepts.
# Compute the area under the ROC curve for the Logistic Regression model
logreg_auc = ___
# Compute the area under the ROC curve for the kNN Classification model
knnreg_auc = ___
ROC Curve¶
To make a ROC curve, you plot the True Positive Rate against the False Positive Rate.
The curve is effectively a 3-dimensional plot, with each point corresponding to a different value of the classification threshold.
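The role of the threshold can be seen in a tiny example with four made-up scores. `roc_curve` returns one (FPR, TPR) pair per threshold, and `roc_auc_score` summarizes the curve in a single number. Note that both functions expect scores or probabilities for the positive class (for example `model.predict_proba(X)[:, 1]`), not hard 0/1 predictions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy labels and predicted probabilities of class 1 (assumption)
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Each threshold gives one point on the curve; the first threshold is a
# sentinel above every score, so nothing is predicted positive there
for f, t, thr in zip(fpr, tpr, thresholds):
    print(f"threshold={thr:.2f}  FPR={f:.2f}  TPR={t:.2f}")

auc = roc_auc_score(y_true, y_score)
print("AUC:", auc)
```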
# Plot the ROC curve for the Logistic Regression model and kNN Classification model
# You can refer to the end of homework 5 for example code
# Your code here
Which classifier to choose?¶
Choice of classifier Scenario 1 - BRAZIL¶
The new variant of the Covid-19 virus is contagious and is infecting many Brazilians.
Brazilian officials, however, dictate that hospitals do not classify a large number of people as 'high' risk, to avoid bad press and the subsequent global political backlash.
In numbers, we need the best classifier subject to the following restriction:
$$TPR + FPR \le 0.5$$
Choice of classifier Scenario 2 - GERMANY¶
It's the month of February, and Germany is now aware that the Covid-19 pandemic has severely hit Europe. Italy is already decimated and there is suspected spread to other European nations as well.
German officials have a clear target. They want the fatality ratio to be as low as possible. Thus, it is imperative to find cases in need of urgent attention and give them the best chance of survival.
In numbers, we need the best classifier subject to the following restriction:
$$0.8 \le TPR \le 0.85$$
Choice of classifier Scenario 3 - INDIA¶
It's the month of May, and India, now severely impacted by Covid-19, has run into a major shortage of hospital beds for suspected cases.
Owing to exponentially rising cases, Indian officials cannot afford to give many hospital beds to False Positives. Hence, it is dictated that hospitals do not classify a large number of people as 'high' risk, to avoid a bed shortage for at-risk patients.
India has only 1 million beds left, and there are already 2 million people suspected of having the disease. The officials need to work out a strategy to find the people most in need of urgent care.
In numbers, we need the best classifier subject to the following restriction:
$$TP + FP \le 1000000$$
$$TP = TPR*OP$$
$$FP = FPR*ON$$
$$TPR*OP + FPR*ON \le 1000000$$
$$Assuming\ OP = ON = 1000000$$
$$TPR + FPR \le 1$$
ROC curve with boundary conditions¶
Plot the ROC curve of the Logistic Regression model and kNN Classification model, along with the boundary conditions for each of the scenarios.
Each of the scenarios can be represented as a region governed by straight line(s) based on the given conditions. The resulting plot will look similar to the following:
# Area under curve - Logistic regression & kNN
# along with the boundary conditions
# Your code here
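Before plotting, it can help to express the three restrictions as boolean masks over the ROC points. The sketch below uses made-up ROC points. The helper `best_point` is hypothetical, and picking the feasible point with the highest TPR is an assumed reading of "best classifier", not something the scenarios specify.

```python
import numpy as np

def best_point(fpr, tpr, feasible):
    """Among feasible ROC points, return (fpr, tpr) with the highest TPR, or None."""
    idx = np.flatnonzero(feasible)
    if idx.size == 0:
        return None
    best = idx[np.argmax(tpr[idx])]
    return float(fpr[best]), float(tpr[best])

# Toy ROC points (assumption): replace with the output of roc_curve
fpr = np.array([0.0, 0.05, 0.10, 0.12, 0.25, 0.40, 1.0])
tpr = np.array([0.0, 0.30, 0.38, 0.70, 0.82, 0.95, 1.0])

# Brazil: region on or below the line TPR = 0.5 - FPR
brazil = best_point(fpr, tpr, tpr + fpr <= 0.5)
# Germany: horizontal band 0.8 <= TPR <= 0.85
germany = best_point(fpr, tpr, (tpr >= 0.8) & (tpr <= 0.85))
# India: region on or below the line TPR = 1 - FPR
india = best_point(fpr, tpr, tpr + fpr <= 1.0)

print("Brazil:", brazil)
print("Germany:", germany)
print("India:", india)
```

On the plot, the same conditions appear as straight lines: `TPR = 0.5 - FPR` and `TPR = 1 - FPR` are diagonals, and the German band is bounded by two horizontal lines.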