Key Word(s): Logistic Regression, Classification, kNN
Dataset Description:
The dataset used here is the Heart dataset. It contains several predictors, such as Age, Sex, and MaxHR.
Instructions:
- Run the code, play around with the thresholds and regularization parameters, and determine their effect on the misclassifications, ROC curve, and AUC (sketches of both sweeps appear after the solution code below).
- Hint: why aren't we plotting as in the last exercise?
Hints:
sklearn.linear_model.LogisticRegression() : Generates a logistic regression classifier
LogisticRegression.fit() : Fits the model to the given data
LogisticRegression.predict() : Uses the fitted model (logistic or kNN classifier) to make hard class predictions
LogisticRegression.predict_proba() : Uses the fitted model (logistic or kNN classifier) to predict the probability of each class in the response (the probabilities sum to 1 for each observation)
LogisticRegression.coef_ and .intercept_ : Pull off the estimated $\beta$ coefficients of a fitted logistic regression model
sklearn.metrics.confusion_matrix() : Calculates the confusion matrix
sklearn.metrics.roc_curve() : Calculates the ROC curve
sklearn.metrics.roc_auc_score() : Calculates the area under the ROC curve
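Before the exercise code, here is a minimal sketch (on synthetic data; every name in it is hypothetical) of how the hinted functions fit together:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))                 # two synthetic predictors
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

demo_model = LogisticRegression().fit(X_demo, y_demo)
probs = demo_model.predict_proba(X_demo)           # shape (100, 2); rows sum to 1
print(np.allclose(probs.sum(axis=1), 1))           # True
# For a binary response, .predict() is equivalent to thresholding P(class 1) at 0.5:
print(np.array_equal(demo_model.predict(X_demo), (probs[:, 1] > 0.5).astype(int)))
print(demo_model.coef_, demo_model.intercept_)     # the estimated beta coefficients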
Note: This exercise is NOT auto-graded.
import pandas as pd
import numpy as np
import sklearn as sk
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
heart = pd.read_csv('Heart.csv')
# Force the response into a binary indicator:
heart['AHD'] = 1*(heart['AHD'] == "Yes")
heart_train, heart_test = train_test_split(heart, test_size=0.3, random_state = 109)
degree = 3
predictors = ['Age','Sex','MaxHR','RestBP','Chol']
X_train = PolynomialFeatures(degree=degree,include_bias=False).fit_transform(heart_train[predictors])
y_train = heart_train['AHD']
X_test = PolynomialFeatures(degree=degree,include_bias=False).fit_transform(heart_test[predictors])
y_test = heart_test['AHD']
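A quick sanity check (an aside, not required by the exercise): a degree-3 expansion of 5 predictors without the bias column should produce C(5+3, 3) - 1 = 55 feature columns.

# Aside: verify the size of the polynomial design matrices.
print(X_train.shape, X_test.shape)   # second dimension should be 55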
# Unpenalized logistic regression (use penalty='none' on scikit-learn < 1.2)
logit = LogisticRegression(penalty=None, max_iter=10000).fit(X_train, y_train)
# Strongly ridge-penalized logistic regression (small C means strong L2 regularization)
logit_ridge = LogisticRegression(C=0.001, penalty='l2', solver='lbfgs', max_iter=10000).fit(X_train, y_train)
# Predicted probabilities of the positive class (AHD = 1)
yhat_logit = logit.predict_proba(X_test)[:, 1]
yhat_logit_ridge = logit_ridge.predict_proba(X_test)[:, 1]
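To see the ridge penalty's shrinkage directly (an aside using the coef_ and .intercept_ attributes named in the hints), compare the coefficient magnitudes of the two fits:

# Aside: the ridge fit's coefficients should be shrunk toward zero
# relative to the unpenalized fit.
print('max |beta| (logit):      ', np.abs(logit.coef_).max())
print('max |beta| (logit_ridge):', np.abs(logit_ridge.coef_).max())
print('intercepts:', logit.intercept_, logit_ridge.intercept_)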
threshold = 0.5
print('The confusion matrix in test for logit when cut-off is',threshold, ': \n',
sk.metrics.confusion_matrix(y_test, yhat_logit>threshold))
print('The confusion matrix in test for logit_ridge when cut-off is',threshold, ': \n',
sk.metrics.confusion_matrix(y_test, yhat_logit_ridge>threshold))
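One way to "play around with thresholds," as the instructions suggest (a sketch; the grid of cut-offs is arbitrary): sweep the cut-off and track the test misclassification counts.

# Sketch: test misclassification count as a function of the cut-off.
for t in [0.1, 0.3, 0.5, 0.7, 0.9]:
    errs = np.sum((yhat_logit > t) != y_test.values)
    errs_ridge = np.sum((yhat_logit_ridge > t) != y_test.values)
    print(f'cut-off {t}: logit errors = {errs}, logit_ridge errors = {errs_ridge}')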
######
# your code here
######
fpr, tpr, thresholds = metrics.roc_curve(y_test, yhat_logit)
fpr_ridge, tpr_ridge, thresholds_ridge = metrics.roc_curve(y_test, yhat_logit_ridge)
x = np.linspace(0, 1, 100)
plt.plot(x, x, '--', color="gray", alpha=0.3)  # chance-level diagonal
plt.plot(fpr,tpr,label="logit")
plt.plot(fpr_ridge,tpr_ridge,label="logit_ridge")
plt.ylabel("True Positive Rate")
plt.xlabel("False Positive Rate")
plt.title("ROC Curve for Predicting AHD in a Logistic Regression Model")
plt.legend()
plt.show()
print('AUC for logit:      ', metrics.auc(fpr, tpr))
print('AUC for logit_ridge:', metrics.auc(fpr_ridge, tpr_ridge))
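Equivalently, and matching the hint above, metrics.roc_auc_score computes the same areas directly from the labels and predicted probabilities:

# Same AUCs via roc_auc_score, without computing the ROC curve first.
print(metrics.roc_auc_score(y_test, yhat_logit))
print(metrics.roc_auc_score(y_test, yhat_logit_ridge))

And a sketch of the regularization sweep the instructions ask for (the C grid here is arbitrary):

# Sketch: test AUC as a function of the ridge penalty strength C
# (smaller C = stronger regularization).
for C in [1e-4, 1e-3, 1e-2, 1e-1, 1, 10]:
    model = LogisticRegression(C=C, penalty='l2', solver='lbfgs',
                               max_iter=10000).fit(X_train, y_train)
    auc = metrics.roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f'C = {C}: test AUC = {auc:.3f}')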