{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Title\n",
"\n",
"**Exercise 2 - Simple k-NN Classification and Logistic Regression**\n",
"\n",
"# Description\n",
"The aim of this exercise is to fit, interpret, predict, score, and plot simple $k$-NN classification and logistic regression models using the `sklearn` package.\n",
"\n",
"In the end, you should get a plot that looks like this (way too busy for publication, but OK here for teaching purposes):"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Dataset Description:\n",
"\n",
"The dataset used here is called the Heart dataset. This dataset has several predictors such as `Age`, `Sex`, and `MaxHR`, etc. For now, we will just use `Age` to predict whether or not someone has atherosclerotic heart disease (AHD).\n",
"\n",
"# Instructions:\n",
"1. Read the `Heart.csv` file into a pandas data frame.\n",
"2. Split the dataset into train and validation sets, with 80% of the data for training\n",
"3. Assign the predictor and response variables. Remember the aim is to predict whether a patient has `AHD`\n",
"4. Fit a $k$-NN model (manually tuned) to the dataset and look at some predictions.\n",
"5. Fit a logistic regression model to the dataset and interpret the coefficients. \n",
"6. Do some work mathematically based on the estimated model.\n",
"7. Compute and print the train and validation accuracies for both\n",
"8. Plot the predictions on the scatterplot.\n",
"\n",
"# Hints:\n",
"\n",
"sklearn.KNeighborsClassifier() : Generates a $k$-NN classifier\n",
"\n",
"sklearn.LogisticRegression() : Generates a Logistic Regression classifier\n",
"\n",
"sklearn.fit() : Fits the model to the given data\n",
"\n",
"sklearn.predict() : Predict using the estimated model (Logistic or knn classifiers) to perform pure classification predictions\n",
"\n",
"sklearn.predict_proba() : Predict using the estimated model (Logistic or knn classifiers) to perform probability predictions of all the classes in the response (they should add up to 1 for each observation)\n",
"\n",
"sklearn.LogisticRegression.coef_ and .intercept_ : Pull off the estimated $\\beta$ coefficients in a Logistic Regression model\n",
"\n",
"sklearn.score() : Accuracy classification score.\n",
"\n",
"**Note: This exercise is auto-graded and you can try multiple attempts.**"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# import libraries\n",
"\n",
"import pandas as pd \n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.neighbors import KNeighborsClassifier "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Read-in the dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Read the \"Heart.csv\" dataset and take a quick look\n",
"heart = pd.read_csv('Heart.csv')\n",
"\n",
"# Force the response into a binary indicator:\n",
"heart['AHD'] = 1*(heart['AHD'] == \"Yes\")\n",
"\n",
"print(heart.shape)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# split into train and validation\n",
"heart_train, heart_val = train_test_split(heart, train_size = 0.75, random_state = 109)\n",
"\n",
"print(heart_train.shape, heart_val.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### $k$-NN model fitting\n",
"\n",
"Define and fit a $k$-NN classification model with $k=20$ to predict `AHD` from `Age`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# select variables for model estimation: be careful of format \n",
"# (aka, single or double square brackets)\n",
"x_train = heart_train[____]\n",
"y_train = heart_train[____]\n",
"\n",
"# define the model\n",
"knn20 = KNeighborsClassifier(___)\n",
"\n",
"# fit to the data\n",
"knn20.fit(___ , ___)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### $k$-NN prediction\n",
"\n",
"Perform some simple predictions: both the pure classifications and the probability estimates."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_knn) ###\n",
"\n",
"# there are two types of predictions in classification models in sklearn\n",
"# model.predict for pure classifications, and model.predict_proba for probabilities\n",
"\n",
"# create the predictions based on the train data\n",
"yhat20_class = knn20.predict(___)\n",
"yhat20_prob = knn20.predict_proba(___)\n",
"\n",
"# print out the first 10 predictions for the actual data\n",
"print(yhat20_class[1:10])\n",
"print(yhat20_prob[1:10])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What do you notice about the probability estimates? Which 'column' is which? How could you manually create `yhat20_class` from `yhat20_prob`? "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Your answer here*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Simple logisitc regression model fitting\n",
"\n",
"Define and fit a $k$-NN classification model with $k=20$ to predict `AHD` from `Age`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_logit) ###\n",
"# Create a logistic regression model, with 'none' as the penalty\n",
"\n",
"logit1 = LogisticRegression(penalty=____, max_iter = 1000)\n",
"\n",
"#Fit the model using the training set\n",
"\n",
"logit1.fit(____,____)\n",
"\n",
"# Get the coefficient estimates\n",
"\n",
"print(\"Logistic Regression Estimated Betas (B0,B1):\",logit1.____,logit1.____)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Interpret the Coefficient Estimates"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Write down the logistic regression model below (edit the provided latex formula). Use this formula to answer: \n",
"\n",
"What is the estimated odds that a 60 year old will have AHD in the ICU? What is the probability?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$$ \\ln\\left( \\frac{P(Y=1)}{1-P(Y=1)} \\right) = \\hat{\\beta}_0 + \\hat{\\beta}_1 X$$\n",
"\n",
"*your answer here*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Make the predictions on the training & validation set\n",
"# Be careful as to how you define the new observation. Hint: double brackets is one way to do it\n",
"\n",
"logit1.predict(____)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Accuracy computation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### $k$-NN and logistic accuracy"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_accuracy) ###\n",
"\n",
"# Define the equivalent validation variables from `heart_val`\n",
"\n",
"x_val = heart_val[____]\n",
"y_val = heart_val[____]\n",
"\n",
"# Compute the training & validation accuracy using the estimator.score() function\n",
"\n",
"knn20_train_accuracy = knn20.score(x_train, y_train)\n",
"knn20_val_accuracy = knn20.score(x_val, y_val)\n",
"\n",
"logit_train_accuracy = ____\n",
"logit_val_accuracy = ____\n",
"\n",
"# Print the accuracies below\n",
"\n",
"print(\"k-NN Train & Validation Accuracy:\", knn20_train_accuracy, knn20_val_accuracy)\n",
"print(\"Logisitic Train & Validation Accuracy:\", logit_train_accuracy, logit_val_accuracy)\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Plot the predictions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use a 'dummy' x variable for plotting for the two different models. Here we plot the train and validation data separately, and the 4 different types of predictions (2 for each model: pure classification and probability estimation)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"# set-up the dummy x for plotting: we extend it a little bit beyond the range of observed values \n",
"x = np.linspace(np.min(heart[['Age']])-10,____+10,200)\n",
"\n",
"\n",
"# be careful in pulling off only the correct column of the probability calculations: use `[:,1]`\n",
"yhat_class_knn20 = knn20.predict(x)\n",
"yhat_prob_knn20 = _____\n",
"\n",
"yhat_class_logit = logit1.predict(x)\n",
"yhat_prob_logit = _____\n",
"\n",
"# plot the observed data. Note: we offset the validation points to make them more clearly differentiated from train\n",
"plt.plot(x_train, y_train, 'o' ,alpha=0.1, label='Train Data')\n",
"plt.plot(x_val, 0.94*y_val+0.03, 'o' ,alpha=0.1, label='Validation Data')\n",
"\n",
"# plot the predictions\n",
"plt.plot(x, yhat_class_knn20, label='knn20 Classifications')\n",
"plt.plot(x, yhat_prob_knn20, label='knn20 Probabilities')\n",
"plt.plot(____, ____, label='logit1 Classifications')\n",
"plt.plot(____, ____, ____)\n",
"\n",
"# put the lower-left part of the legend 5% to the right along the x-axis, and 45% up along the y-axis\n",
"plt.legend(loc=(0.05,0.45))\n",
"\n",
"# Don't forget your axis labels!\n",
"plt.xlabel(\"Age\")\n",
"plt.ylabel(\"Heart disease (AHD)\")\n",
"\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Post exercise questions\n",
"\n",
"1. Describe the estimated associations between AHD and Age based on these two models."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Your answer here*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2. Which model apears to be more overfit to the training data? How do you know? How can this be resolved?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Your answer here*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"3. How could you engineer features for the logistic regression model so that the association between AHD and Age resembles the general trend in the knn20 model more similarly?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Your answer here*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Your answer here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"4. Refit the models above to predict `AHD` from 3 predictors (please also play around with $k$):\n",
"```{python}\n",
"['MaxHR','Age','Sex']\n",
"```\n",
"Investigate the associations in each of the two models and evaluate the predictive accuracy of each of these models"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Your answer here"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}