Title

Exercise 2 - Simple k-NN Classification and Logistic Regression

Description

The aim of this exercise is to fit, interpret, predict, score, and plot simple $k$-NN classification and logistic regression models using the sklearn package.

In the end, you should get a plot that looks like this (way too busy for publication, but OK here for teaching purposes):

Dataset Description:

The dataset used here is the Heart dataset. It has several predictors, including Age, Sex, and MaxHR. For now, we will use only Age to predict whether or not someone has atherosclerotic heart disease (AHD).

Instructions:

  1. Read the Heart.csv file into a pandas data frame.
  2. Split the dataset into train and validation sets, with 75% of the data for training (matching `train_size = 0.75` in the split cell).
  3. Assign the predictor and response variables. Remember, the aim is to predict whether a patient has AHD.
  4. Fit a $k$-NN model (manually tuned) to the dataset and look at some predictions.
  5. Fit a logistic regression model to the dataset and interpret the coefficients.
  6. Use the estimated model to compute odds and probabilities by hand.
  7. Compute and print the train and validation accuracies for both models.
  8. Plot the predictions on the scatterplot.

Hints:

sklearn.neighbors.KNeighborsClassifier() : Generates a $k$-NN classifier

sklearn.linear_model.LogisticRegression() : Generates a logistic regression classifier

estimator.fit() : Fits the model to the given data

estimator.predict() : Uses the fitted model (logistic or $k$-NN classifier) to make pure classification predictions

estimator.predict_proba() : Uses the fitted model (logistic or $k$-NN classifier) to predict the probability of each class in the response (the probabilities should add up to 1 for each observation)

LogisticRegression.coef_ and .intercept_ : Hold the estimated $\beta$ coefficients of a fitted logistic regression model

estimator.score() : Returns the classification accuracy on the given data
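
The hints above can be sketched end to end on made-up data. This is only an illustration under stated assumptions: `X_toy` and `y_toy` are synthetic stand-ins (not the Heart dataset), and `n_neighbors=5` is an arbitrary choice, not the value the exercise asks for.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (NOT the Heart dataset): one predictor, binary response
rng = np.random.default_rng(0)
X_toy = rng.uniform(30, 80, size=(200, 1))   # 2D: (n_samples, n_features)
y_toy = (X_toy[:, 0] + rng.normal(0, 10, 200) > 55).astype(int)

for model in (KNeighborsClassifier(n_neighbors=5), LogisticRegression(max_iter=1000)):
    model.fit(X_toy, y_toy)                  # estimate the model
    classes = model.predict(X_toy[:3])       # hard 0/1 labels
    probs = model.predict_proba(X_toy[:3])   # one column per class
    acc = model.score(X_toy, y_toy)          # mean accuracy on the given data
    print(type(model).__name__, classes, probs.sum(axis=1), round(acc, 3))

# After the loop, `model` is the fitted logistic regression
print(model.intercept_, model.coef_)
```

Both classifiers share the same `fit` / `predict` / `predict_proba` / `score` methods; only the logistic regression exposes `coef_` and `intercept_`.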

Note: This exercise is auto-graded and you can try multiple attempts.

In [1]:
# import libraries

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier 

Read in the dataset

In [ ]:
# Read the "Heart.csv" dataset and take a quick look
heart = pd.read_csv('Heart.csv')

# Force the response into a binary indicator:
heart['AHD'] = 1*(heart['AHD'] == "Yes")

print(heart.shape)
In [ ]:
# split into train and validation
heart_train, heart_val = train_test_split(heart, train_size = 0.75, random_state = 109)

print(heart_train.shape, heart_val.shape)
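
To see how `train_size` controls the split, here is a minimal sketch on a hypothetical 100-row frame (the `toy` data is invented for illustration and is not the Heart dataset). Fixing `random_state` makes the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 100-row frame standing in for the Heart data
toy = pd.DataFrame({"Age": range(100), "AHD": [i % 2 for i in range(100)]})

# 75% of the rows go to training, the remainder to validation
toy_train, toy_val = train_test_split(toy, train_size=0.75, random_state=109)
print(toy_train.shape, toy_val.shape)
```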

$k$-NN model fitting

Define and fit a $k$-NN classification model with $k=20$ to predict AHD from Age.

In [ ]:
# select variables for model estimation: be careful of format 
# (aka, single or double square brackets)
x_train = heart_train[____]
y_train = heart_train[____]

# define the model
knn20 = KNeighborsClassifier(___)

# fit to the data
knn20.fit(___ , ___)
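
The single versus double bracket distinction mentioned in the comment above matters because sklearn expects a 2D predictor array. A small sketch on a hypothetical three-row frame (values invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"Age": [63, 67, 41], "AHD": [0, 1, 0]})  # hypothetical rows

X_2d = toy[["Age"]]   # double brackets -> DataFrame with shape (3, 1)
y_1d = toy["AHD"]     # single brackets -> Series with shape (3,)
print(type(X_2d).__name__, X_2d.shape)
print(type(y_1d).__name__, y_1d.shape)
```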

$k$-NN prediction

Perform some simple predictions: both the pure classifications and the probability estimates.

In [ ]:
### edTest(test_knn) ###

# there are two types of predictions in classification models in sklearn
# model.predict for pure classifications, and model.predict_proba for probabilities

# create the predictions based on the train data
yhat20_class = knn20.predict(___)
yhat20_prob = knn20.predict_proba(___)

# print out the first 10 predictions for the actual data
print(yhat20_class[:10])
print(yhat20_prob[:10])

What do you notice about the probability estimates? Which 'column' is which? How could you manually create yhat20_class from yhat20_prob?

Your answer here
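
One way to explore these questions is sketched below on synthetic data (not the Heart dataset). It assumes an odd `n_neighbors=5` so that exact 50/50 ties cannot occur with two classes; with an even $k$ such as 20, ties are possible and deserve a closer look.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: one predictor, binary response
rng = np.random.default_rng(1)
X_toy = rng.uniform(30, 80, size=(150, 1))
y_toy = (X_toy[:, 0] + rng.normal(0, 10, 150) > 55).astype(int)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_toy, y_toy)  # odd k avoids exact ties
probs = knn.predict_proba(X_toy)

# The columns of predict_proba follow the order of knn.classes_
print(knn.classes_)

# Pick the class with the larger estimated probability in each row
manual = knn.classes_[np.argmax(probs, axis=1)]
print(np.array_equal(manual, knn.predict(X_toy)))
```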

Simple logistic regression model fitting

Define and fit an unpenalized logistic regression model to predict AHD from Age.

In [ ]:
### edTest(test_logit) ###
# Create a logistic regression model with no penalty
# (the string 'none' in older scikit-learn versions; None in version 1.2 and later)

logit1 = LogisticRegression(penalty=____, max_iter = 1000)

#Fit the model using the training set

logit1.fit(____,____)

# Get the coefficient estimates

print("Logistic Regression Estimated Betas (B0,B1):",logit1.____,logit1.____)

Interpret the Coefficient Estimates

Write down the logistic regression model below (edit the provided latex formula). Use this formula to answer:

What are the estimated odds that a 60-year-old will have AHD in the ICU? What is the corresponding probability?

$$ \ln\left( \frac{P(Y=1)}{1-P(Y=1)} \right) = \hat{\beta}_0 + \hat{\beta}_1 X$$

your answer here
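
The arithmetic behind the formula can be sketched as follows. The coefficients `beta0_hat` and `beta1_hat` below are hypothetical values chosen for illustration only; substitute the estimates printed by your fitted model.

```python
import math

# Hypothetical coefficients for illustration only; NOT the fitted estimates
beta0_hat, beta1_hat = -3.0, 0.06
age = 60

log_odds = beta0_hat + beta1_hat * age   # right-hand side of the model
odds = math.exp(log_odds)                # P(Y=1) / (1 - P(Y=1))
prob = odds / (1 + odds)                 # solve the odds for P(Y=1)
print(f"log-odds={log_odds:.3f}, odds={odds:.3f}, probability={prob:.3f}")
```

Exponentiating the log-odds gives the odds, and dividing the odds by one plus the odds recovers the probability.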

In [ ]:
# Predict the classification for a new observation (for example, a 60-year-old)
# Be careful how you define the new observation.  Hint: double brackets are one way to do it

logit1.predict(____)

Accuracy computation

$k$-NN and logistic accuracy

In [ ]:
### edTest(test_accuracy) ###

# Define the equivalent validation variables from `heart_val`

x_val = heart_val[____]
y_val = heart_val[____]

# Compute the training & validation accuracy using the estimator.score() function

knn20_train_accuracy = knn20.score(x_train, y_train)
knn20_val_accuracy = knn20.score(x_val, y_val)

logit_train_accuracy = ____
logit_val_accuracy = ____

# Print the accuracies below

print("k-NN Train & Validation Accuracy:", knn20_train_accuracy, knn20_val_accuracy)
print("Logistic Train & Validation Accuracy:", logit_train_accuracy, logit_val_accuracy)
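
For a classifier, `score` returns the fraction of observations whose predicted class matches the observed class. The sketch below checks this on synthetic data (not the Heart dataset); the names `X_toy`, `y_toy`, and `manual_acc` are made up for the illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: one predictor, binary response
rng = np.random.default_rng(2)
X_toy = rng.uniform(30, 80, size=(120, 1))
y_toy = (X_toy[:, 0] + rng.normal(0, 10, 120) > 55).astype(int)

model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# Accuracy by hand: proportion of correct classifications
manual_acc = np.mean(model.predict(X_toy) == y_toy)
print(manual_acc, model.score(X_toy, y_toy))
```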

Plot the predictions

Use a 'dummy' x variable to plot the two models. Here we plot the train and validation data separately, along with the 4 different types of predictions (2 for each model: pure classification and probability estimation).

In [ ]:
# set-up the dummy x for plotting: we extend it a little bit beyond the range of observed values 
x = np.linspace(np.min(heart[['Age']])-10,____+10,200)


# be careful in pulling off only the correct column of the probability calculations: use `[:,1]`
yhat_class_knn20 = knn20.predict(x)
yhat_prob_knn20 = _____

yhat_class_logit = logit1.predict(x)
yhat_prob_logit = _____

# plot the observed data.  Note: we offset the validation points to make them more clearly differentiated from train
plt.plot(x_train, y_train, 'o' ,alpha=0.1, label='Train Data')
plt.plot(x_val, 0.94*y_val+0.03, 'o' ,alpha=0.1, label='Validation Data')

# plot the predictions
plt.plot(x, yhat_class_knn20, label='knn20 Classifications')
plt.plot(x, yhat_prob_knn20, label='knn20 Probabilities')
plt.plot(____, ____, label='logit1 Classifications')
plt.plot(____, ____, ____)

# put the lower-left part of the legend 5% to the right along the x-axis, and 45% up along the y-axis
plt.legend(loc=(0.05,0.45))

# Don't forget your axis labels!
plt.xlabel("Age")
plt.ylabel("Heart disease (AHD)")

plt.show()
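
The dummy grid passed to `predict` and `predict_proba` needs to be 2D, with one column per predictor. As a hedged sketch on synthetic data (not the Heart dataset), an explicit `reshape(-1, 1)` guarantees that shape regardless of how the minimum and maximum were computed; the names `ages`, `labels`, and `x_grid` are invented for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: one predictor, binary response
rng = np.random.default_rng(3)
ages = rng.uniform(30, 80, size=(100, 1))
labels = (ages[:, 0] + rng.normal(0, 10, 100) > 55).astype(int)
model = LogisticRegression(max_iter=1000).fit(ages, labels)

# 200 evenly spaced values, padded by 10 on each side, shaped as a single column
x_grid = np.linspace(ages.min() - 10, ages.max() + 10, 200).reshape(-1, 1)

# Column 1 of predict_proba holds the estimated probability of class 1
p_class1 = model.predict_proba(x_grid)[:, 1]
print(x_grid.shape, p_class1.shape)
```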

Post exercise questions

  1. Describe the estimated associations between AHD and Age based on these two models.

Your answer here

  2. Which model appears to be more overfit to the training data? How do you know? How can this be resolved?

Your answer here

  3. How could you engineer features for the logistic regression model so that the association between AHD and Age more closely resembles the general trend in the knn20 model?

Your answer here

In [ ]:
# Your answer here
  4. Refit the models above to predict AHD from 3 predictors, ['MaxHR','Age','Sex'] (please also play around with $k$). Investigate the associations in each of the two models and evaluate the predictive accuracy of each of these models.
In [ ]:
# Your answer here
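
One possible way to "play around with $k$" is a simple loop over candidate values, comparing train and validation accuracy. The sketch below uses three synthetic predictors (not MaxHR, Age, and Sex), and the candidate values of $k$ are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Three synthetic predictors standing in for the real ones
rng = np.random.default_rng(4)
X_toy = rng.normal(size=(300, 3))
y_toy = (X_toy @ np.array([1.0, -0.5, 0.8]) + rng.normal(0, 1, 300) > 0).astype(int)
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, train_size=0.75, random_state=109)

val_acc = {}
for k in (1, 5, 20, 50):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    val_acc[k] = knn.score(X_va, y_va)
    # A large gap between train and validation accuracy suggests overfitting
    print(k, round(knn.score(X_tr, y_tr), 3), round(val_acc[k], 3))
```

Because $k$-NN is distance-based, predictors on very different scales (such as MaxHR and Sex) may warrant standardizing before fitting.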