
Exercise 2 [Not Graded!] - Confusion Matrices & ROC Curves


The aim of this exercise is to evaluate classification models using confusion matrices, ROC curves, and the AUC metric. The end result is a plot of the ROC curves for the fitted models.

Dataset Description:

The dataset used here is the Heart dataset. It has several predictors, such as Age, Sex, and MaxHR; the response is AHD.


  1. Run the code, play around with thresholds and regularization parameters, and determine what the effect is on the misclassifications, ROC Curve, and AUC.
  2. Hint: why aren't we plotting like in the last exercise?


sklearn.linear_model.LogisticRegression() : Generates a logistic regression classifier

LogisticRegression.fit() : Fits the model to the given data

LogisticRegression.predict() : Uses the fitted model (logistic or kNN classifiers) to make pure classification predictions

LogisticRegression.predict_proba() : Uses the fitted model (logistic or kNN classifiers) to predict the probability of each class in the response (the probabilities add up to 1 for each observation)

LogisticRegression.coef_ and .intercept_ : Pull off the estimated $\beta$ coefficients of a logistic regression model

sklearn.metrics.confusion_matrix() : Calculates the confusion matrix

sklearn.metrics.roc_curve() : Calculates the ROC curve

sklearn.metrics.roc_auc_score() : Calculates the area under the ROC curve (AUC)
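
As a quick orientation, the sketch below runs the three metric functions listed above on a tiny example. The labels and probabilities are made up for illustration and have nothing to do with the Heart data; only the function calls carry over to the exercise.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score

# Toy data (made up for illustration, not from the Heart dataset)
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])   # predicted P(y = 1)

# Hard classifications at a 0.5 cut-off
y_pred = (y_prob > 0.5).astype(int)

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
print(cm)

# One (FPR, TPR) pair per candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
print(fpr, tpr)

# Area under the ROC curve, computed from the probabilities (not the hard labels)
auc = roc_auc_score(y_true, y_prob)
print(auc)
```

Note that `roc_curve` and `roc_auc_score` take the predicted probabilities, while `confusion_matrix` needs hard classifications at a chosen cut-off.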

Note: This exercise is NOT auto-graded.

In [1]:
import pandas as pd
import numpy as np
import sklearn as sk
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.preprocessing import PolynomialFeatures

from sklearn.model_selection import train_test_split
In [2]:
heart = pd.read_csv('Heart.csv')

# Force the response into a binary indicator:
heart['AHD'] = 1*(heart['AHD'] == "Yes")

heart_train, heart_test = train_test_split(heart, test_size=0.3, random_state = 109)
In [3]:
degree = 3
predictors = ['Age','Sex','MaxHR','RestBP','Chol']

# Build the polynomial expansion once on the training data and reuse it for the test data
poly = PolynomialFeatures(degree=degree, include_bias=False)
X_train = poly.fit_transform(heart_train[predictors])
y_train = heart_train['AHD']

X_test = poly.transform(heart_test[predictors])
y_test = heart_test['AHD']

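# Unregularized fit; scikit-learn >= 1.2 expects penalty=None (older releases used penalty='none')
logit = LogisticRegression(penalty=None, max_iter=10000).fit(X_train, y_train)
# Ridge (L2) fit; C is the inverse regularization strength, so C=0.001 is a strong penalty
logit_ridge = LogisticRegression(C=0.001, penalty='l2', solver='lbfgs', max_iter=10000).fit(X_train, y_train)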
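
In scikit-learn, `C` is the inverse of the regularization strength, so a small value such as `C=0.001` shrinks the coefficients heavily. The sketch below illustrates this on synthetic data from `make_classification`, which is an assumed stand-in and not the Heart dataset; the values of `C` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the Heart dataset)
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
X = StandardScaler().fit_transform(X)

# Fit the default L2-penalized model at several regularization strengths
coef_norms = {}
for C in [0.001, 0.1, 10.0]:
    model = LogisticRegression(C=C, max_iter=10000).fit(X, y)
    coef_norms[C] = np.linalg.norm(model.coef_)
    print(f"C = {C:>6}: ||beta|| = {coef_norms[C]:.3f}")
```

The coefficient norm grows as `C` increases, because the penalty on large coefficients becomes weaker.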
In [4]:
yhat_logit = logit.predict_proba(X_test)[:,1]
yhat_logit_ridge = logit_ridge.predict_proba(X_test)[:,1]

threshold = 0.5

print('The confusion matrix in test for logit when cut-off is',threshold, ': \n',
      sk.metrics.confusion_matrix(y_test, yhat_logit>threshold))
print('The confusion matrix in test for logit_ridge when cut-off is',threshold, ': \n',
      sk.metrics.confusion_matrix(y_test, yhat_logit_ridge>threshold))
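
To see why the cut-off matters, it can help to count the four cells of the confusion matrix by hand. The sketch below does this in plain Python for three thresholds; the labels, probabilities, and the helper `confusion_counts` are all made up for illustration.

```python
# Toy labels and predicted probabilities (made up for illustration)
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.05, 0.20, 0.45, 0.70, 0.30, 0.55, 0.80, 0.95]

def confusion_counts(y_true, y_prob, threshold):
    """Return (TN, FP, FN, TP), predicting 1 when the probability exceeds the threshold."""
    tn = fp = fn = tp = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p > threshold else 0
        if y == 0 and pred == 0:
            tn += 1
        elif y == 0 and pred == 1:
            fp += 1
        elif y == 1 and pred == 0:
            fn += 1
        else:
            tp += 1
    return tn, fp, fn, tp

# Raising the threshold trades false positives for false negatives
counts = {t: confusion_counts(y_true, y_prob, t) for t in (0.25, 0.5, 0.75)}
for t, (tn, fp, fn, tp) in counts.items():
    print(f"threshold = {t}: TN={tn} FP={fp} FN={fn} TP={tp}")
```

On this toy data, moving the threshold from 0.25 to 0.75 removes both false positives but introduces two false negatives.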
In [5]:
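# your code here
yhat_logit = logit.predict_proba(X_test)[:,1]
yhat_logit_ridge = logit_ridge.predict_proba(X_test)[:,1]

fpr, tpr, thresholds = metrics.roc_curve(y_test, yhat_logit)
fpr_ridge, tpr_ridge, thresholds_ridge = metrics.roc_curve(y_test, yhat_logit_ridge)

# Draw both ROC curves, with the test AUC in each legend entry
plt.plot(fpr, tpr, label="logit (AUC = %.3f)" % metrics.roc_auc_score(y_test, yhat_logit))
plt.plot(fpr_ridge, tpr_ridge, label="logit_ridge (AUC = %.3f)" % metrics.roc_auc_score(y_test, yhat_logit_ridge))
# Diagonal reference line: the expected ROC curve of random guessing
plt.plot([0, 1], [0, 1], 'k--', label="chance")

plt.ylabel("True Positive Rate")
plt.xlabel("False Positive Rate")
plt.title("ROC Curve for Predicting AHD in a Logistic Regression Model")
plt.legend()
plt.show()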
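
Each point on a ROC curve corresponds to one threshold. The sketch below traces the curve by hand in plain Python and integrates it with the trapezoid rule. It is an illustration of the idea, not scikit-learn's implementation; the data and the helpers `roc_points` and `trapezoid_auc` are made up for this example.

```python
# Toy labels and predicted probabilities (made up for illustration)
y_true = [0, 0, 0, 1, 1, 1]
y_prob = [0.2, 0.3, 0.6, 0.4, 0.7, 0.9]

def roc_points(y_true, y_prob):
    """Return (fpr, tpr) lists, sweeping the threshold from high to low."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Start above every score: nothing is predicted positive
    fpr, tpr = [0.0], [0.0]
    for t in sorted(set(y_prob), reverse=True):
        # Predict 1 whenever the probability is at least the threshold
        tp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 0)
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return fpr, tpr

def trapezoid_auc(fpr, tpr):
    """Area under the piecewise-linear curve through the (fpr, tpr) points."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))

fpr_manual, tpr_manual = roc_points(y_true, y_prob)
auc_manual = trapezoid_auc(fpr_manual, tpr_manual)
print(list(zip(fpr_manual, tpr_manual)))
print(auc_manual)   # 8/9, about 0.889
```

The AUC of 8/9 matches the pairwise interpretation: of the 9 (positive, negative) pairs, 8 have the positive observation ranked above the negative one.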
