Key Word(s): Logistic Regression, Classification, kNN
Title¶
Exercise 2 - Simple k-NN Classification and Logistic Regression
Description¶
The aim of this exercise is to fit, interpret, predict, score, and plot simple $k$-NN classification and logistic regression models using the sklearn package.
In the end, you should get a plot that looks like this (way too busy for publication, but OK here for teaching purposes):
Dataset Description:¶
The dataset used here is called the Heart dataset. It has several predictors, such as Age, Sex, and MaxHR. For now, we will just use Age to predict whether or not someone has atherosclerotic heart disease (AHD).
Instructions:¶
- Read the Heart.csv file into a pandas data frame.
- Split the dataset into train and validation sets, with 75% of the data for training.
- Assign the predictor and response variables. Remember the aim is to predict whether a patient has AHD.
- Fit a $k$-NN model (manually tuned) to the dataset and look at some predictions.
- Fit a logistic regression model to the dataset and interpret the coefficients.
- Do some work mathematically based on the estimated model.
- Compute and print the train and validation accuracies for both models.
- Plot the predictions on the scatterplot.
Hints:¶
sklearn.neighbors.KNeighborsClassifier() : Generates a $k$-NN classifier
sklearn.linear_model.LogisticRegression() : Generates a logistic regression classifier
model.fit() : Fits the model to the given data
model.predict() : Uses the estimated model (logistic or $k$-NN classifier) to make pure classification predictions
model.predict_proba() : Uses the estimated model (logistic or $k$-NN classifier) to predict the probability of each class in the response (these should add up to 1 for each observation)
model.coef_ and model.intercept_ : Pull off the estimated $\beta$ coefficients from a fitted logistic regression model
model.score() : Returns the classification accuracy on the given data and labels
Note: This exercise is auto-graded and you can try multiple attempts.
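To see how the calls in the hints fit together, here is a minimal sketch on synthetic data. Everything in it (the data, the variable names, `n_neighbors=15`, and the default penalty) is an assumption made for illustration; it is not the Heart dataset and not the exercise solution.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 1-D data (hypothetical, not the Heart dataset):
# the chance of class 1 rises with x.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))      # 2-D: (n_samples, n_features)
p_true = 1 / (1 + np.exp(-(X[:, 0] - 5)))
y = (rng.uniform(size=200) < p_true).astype(int)

models = {
    "knn": KNeighborsClassifier(n_neighbors=15),
    "logit": LogisticRegression(max_iter=1000),  # sklearn's default penalty
}
for name, model in models.items():
    model.fit(X, y)
    classes = model.predict(X)       # shape (200,), one label per row
    probs = model.predict_proba(X)   # shape (200, 2), one column per class
    print(name, model.classes_, probs.shape, round(model.score(X, y), 3))

logit = models["logit"]
print(logit.intercept_, logit.coef_)  # shapes (1,) and (1, 1)
```

Note that `X` is two-dimensional even with a single predictor; this is why the exercise warns about single versus double square brackets.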
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
Read-in the dataset¶
# Read the "Heart.csv" dataset and take a quick look
heart = pd.read_csv('Heart.csv')
# Force the response into a binary indicator:
heart['AHD'] = 1*(heart['AHD'] == "Yes")
print(heart.shape)
# split into train and validation
heart_train, heart_val = train_test_split(heart, train_size = 0.75, random_state = 109)
print(heart_train.shape, heart_val.shape)
$k$-NN model fitting¶
Define and fit a $k$-NN classification model with $k=20$ to predict AHD from Age.
# select variables for model estimation: be careful of format
# (aka, single or double square brackets)
x_train = heart_train[____]
y_train = heart_train[____]
# define the model
knn20 = KNeighborsClassifier(___)
# fit to the data
knn20.fit(___ , ___)
$k$-NN prediction¶
Perform some simple predictions: both the pure classifications and the probability estimates.
### edTest(test_knn) ###
# there are two types of predictions in classification models in sklearn
# model.predict for pure classifications, and model.predict_proba for probabilities
# create the predictions based on the train data
yhat20_class = knn20.predict(___)
yhat20_prob = knn20.predict_proba(___)
# print out the first 10 predictions for the training data
print(yhat20_class[:10])
print(yhat20_prob[:10])
What do you notice about the probability estimates? Which 'column' is which? How could you manually create yhat20_class from yhat20_prob?
Your answer here
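As a way to reason about the questions above, here is a pure-Python sketch of what a uniform-weight $k$-NN classifier computes for one query point: the class-1 probability is the fraction of the $k$ nearest neighbors labeled 1. The helper `knn_proba` and the toy ages and labels are hypothetical and are not part of sklearn or the Heart data.

```python
def knn_proba(x0, xs, ys, k):
    """Return [P(y=0), P(y=1)] for query x0 using the k nearest 1-D neighbors."""
    # sort training indices by distance to x0 and keep the k closest
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    p1 = sum(ys[i] for i in nearest) / k   # fraction of neighbors with y = 1
    return [1 - p1, p1]

# toy data (hypothetical ages and 0/1 labels)
ages = [35, 40, 45, 50, 55, 60, 65, 70]
ahd = [0, 0, 0, 1, 0, 1, 1, 1]

probs = knn_proba(62, ages, ahd, k=4)
# the pure classification is the class with the largest probability
label = max(range(2), key=lambda c: probs[c])
print(probs, label)   # [0.25, 0.75] 1
```

With $k=20$ the same logic means every estimated probability is a multiple of $1/20$.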
Simple logistic regression model fitting¶
Define and fit a simple logistic regression model to predict AHD from Age.
### edTest(test_logit) ###
# Create a logistic regression model with no penalty
# (penalty=None in scikit-learn >= 1.2; older versions use the string 'none')
logit1 = LogisticRegression(penalty=____, max_iter = 1000)
#Fit the model using the training set
logit1.fit(____,____)
# Get the coefficient estimates
print("Logistic Regression Estimated Betas (B0,B1):",logit1.____,logit1.____)
Interpret the Coefficient Estimates¶
Write down the logistic regression model below (edit the provided latex formula). Use this formula to answer:
What are the estimated odds that a 60-year-old will have AHD in the ICU? What is the probability?
your answer here
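The arithmetic behind this question can be sketched as follows. The coefficients `b0` and `b1` are hypothetical placeholders, chosen only so the numbers come out round; substitute the estimates printed by your own fitted model.

```python
import math

# Hypothetical coefficients, NOT estimates from Heart.csv
b0, b1 = -3.0, 0.05
age = 60

log_odds = b0 + b1 * age      # linear predictor: log(p / (1 - p))
odds = math.exp(log_odds)     # odds = p / (1 - p)
prob = odds / (1 + odds)      # invert the odds to get p
print(log_odds, odds, prob)
```

With these made-up values the log-odds at age 60 are 0, so the odds are 1 and the probability is 0.5. Each extra year of age multiplies the odds by `math.exp(b1)`.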
# Make the predictions on the training & validation set
# Be careful as to how you define the new observation. Hint: double brackets is one way to do it
logit1.predict(____)
Accuracy computation¶
$k$-NN and logistic accuracy¶
### edTest(test_accuracy) ###
# Define the equivalent validation variables from `heart_val`
x_val = heart_val[____]
y_val = heart_val[____]
# Compute the training & validation accuracy using the estimator.score() function
knn20_train_accuracy = knn20.score(x_train, y_train)
knn20_val_accuracy = knn20.score(x_val, y_val)
logit_train_accuracy = ____
logit_val_accuracy = ____
# Print the accuracies below
print("k-NN Train & Validation Accuracy:", knn20_train_accuracy, knn20_val_accuracy)
print("Logistic Train & Validation Accuracy:", logit_train_accuracy, logit_val_accuracy)
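For a classifier, the accuracy reported by `.score()` is the fraction of observations whose predicted class matches the true class. A minimal sketch of that calculation, using a hypothetical helper `accuracy` and made-up labels:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# made-up labels: 3 of the 5 predictions are correct
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
acc = accuracy(y_true, y_pred)
print(acc)   # 0.6
```

Comparing this number on the training set with the same number on the validation set is one simple way to judge overfitting.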
Plot the predictions¶
Use a 'dummy' x variable for plotting for the two different models. Here we plot the train and validation data separately, and the 4 different types of predictions (2 for each model: pure classification and probability estimation)
# set-up the dummy x for plotting: we extend it a little bit beyond the range of observed values
x = np.linspace(np.min(heart[['Age']])-10,____+10,200)
# be careful in pulling off only the correct column of the probability calculations: use `[:,1]`
yhat_class_knn20 = knn20.predict(x)
yhat_prob_knn20 = _____
yhat_class_logit = logit1.predict(x)
yhat_prob_logit = _____
# plot the observed data. Note: we offset the validation points to make them more clearly differentiated from train
plt.plot(x_train, y_train, 'o' ,alpha=0.1, label='Train Data')
plt.plot(x_val, 0.94*y_val+0.03, 'o' ,alpha=0.1, label='Validation Data')
# plot the predictions
plt.plot(x, yhat_class_knn20, label='knn20 Classifications')
plt.plot(x, yhat_prob_knn20, label='knn20 Probabilities')
plt.plot(____, ____, label='logit1 Classifications')
plt.plot(____, ____, ____)
# put the lower-left part of the legend 5% to the right along the x-axis, and 45% up along the y-axis
plt.legend(loc=(0.05,0.45))
# Don't forget your axis labels!
plt.xlabel("Age")
plt.ylabel("Heart disease (AHD)")
plt.show()
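One detail worth checking when building the dummy x variable: sklearn estimators expect a 2-D input of shape (n_samples, n_features), even with a single predictor. The sketch below uses a hypothetical array of ages and shows one way to get that shape from `np.linspace`; it is an illustration, not the fill-in for the blank above.

```python
import numpy as np

ages = np.array([29, 41, 54, 63, 77])   # hypothetical observed ages

# 200 evenly spaced values extending 10 years beyond the observed range
grid_1d = np.linspace(ages.min() - 10, ages.max() + 10, 200)   # shape (200,)

# reshape to a single column so it can be passed to predict / predict_proba
grid_2d = grid_1d.reshape(-1, 1)                               # shape (200, 1)
print(grid_1d.shape, grid_2d.shape)
```

After calling `predict_proba` on such a grid, the class-1 probabilities are in the second column, selected with `[:, 1]`.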
Post exercise questions¶
- Describe the estimated associations between AHD and Age based on these two models.
Your answer here
- Which model appears to be more overfit to the training data? How do you know? How can this be resolved?
Your answer here
- How could you engineer features for the logistic regression model so that the association between AHD and Age more closely resembles the general trend in the knn20 model?
Your answer here
# Your answer here
- Refit the models above to predict AHD from 3 predictors, ['MaxHR','Age','Sex'] (please also play around with $k$). Investigate the associations in each of the two models and evaluate the predictive accuracy of each of these models.
# Your answer here