{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Title\n",
"\n",
"**Exercise 1 - Regularization and Decision Boundaries in Logistic Regression**\n",
"\n",
"# Description\n",
"\n",
"The goal of the exercise is to produce a plot similar to the one given below, by performing classification predictions on a logistic regression model ."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Instructions: \n",
"- We are trying to predict who will have AHD based on Age and MaxHAR. To do so we need to:\n",
"- Read the `Heart.csv` as a data frame and split into train and test.\n",
"- Assign the predictor and response variables.\n",
"- Fit logistic regression models and interpret results\n",
"- Compute the accuracy of the model.\n",
"- Plot the classification boundaries against the two predictors\n",
"- Fit an untuned regularized logistic regression model and compare the classification boundary\n",
"\n",
"# Hints:\n",
"sklearn.LogisticRegression() : Generates a Logistic Regression classifier\n",
"\n",
"sklearn.fit() : Fits the model to the given data\n",
"\n",
"sklearn.predict() : Predict using the estimated model (Logistic or knn classifiers) to perform pure classification predictions\n",
"\n",
"sklearn.predict_proba() : Predict using the estimated model (Logistic or knn classifiers) to perform probability predictions of all the classes in the response (they should add up to 1 for each observation)\n",
"\n",
"sklearn.LogisticRegression.coef_ and .intercept_ : Pull off the estimated $\\beta$ coefficients in a Logistic Regression model\n",
"\n",
"sklearn.score() : Accuracy classification score.\n",
"\n",
"sklearn.accuracy_score() : Accuracy classification score\n",
"\n",
"matplotlib.pcolormesh() : Accuracy classification score\n",
"\n",
"**Note: This exercise is auto-graded and you can try multiple attempts.**"
]
},
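{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Illustration (not graded):* the difference between `predict()` and `predict_proba()` can be seen on a small synthetic dataset. The data below is made up purely for demonstration; it is not the Heart data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch on synthetic, made-up data (not the Heart data; not graded)\n",
"import numpy as np\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"rng = np.random.default_rng(0)\n",
"X_toy = rng.normal(size=(100, 2))\n",
"y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)\n",
"\n",
"clf_toy = LogisticRegression().fit(X_toy, y_toy)\n",
"\n",
"# predict() returns hard 0/1 labels; predict_proba() returns class probabilities\n",
"print(clf_toy.predict(X_toy[:5]))\n",
"print(clf_toy.predict_proba(X_toy[:5]).sum(axis=1))  # each row sums to 1"
]
},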
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import sklearn as sk\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.preprocessing import PolynomialFeatures"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"heart = pd.read_csv('Heart.csv')\n",
"\n",
"# Force the response into a binary indicator:\n",
"heart['AHD'] = 1*(heart['AHD'] == \"Yes\")\n",
"\n",
"print(heart.shape)\n",
"#heart.head()\n",
"heart.describe()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"heart_train, heart_test = train_test_split(heart, test_size=0.3, random_state = 109)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q1.1** Below we fit an unregularized logistic regression model (`logit1`) to predict `AHD` from `Age` and `MaxHR` in the training set (with `penalty='none'`). Print out the coefficient estimates, and interpret general trends."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"degree = 1\n",
"predictors = ['Age','MaxHR']\n",
"\n",
"X_train1 = PolynomialFeatures(degree=degree,include_bias=False).fit_transform(heart_train[predictors])\n",
"y_train = heart_train['AHD']\n",
"\n",
"\n",
"logit1 = LogisticRegression(penalty='none', max_iter = 5000).fit(X_train1, y_train)\n",
"\n",
"print(\"Logistic Regression Estimated Betas:\",\n",
" logit1.___,logit1.___)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*your interpretation here*"
]
},
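{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Illustration (not graded):* one common way to interpret a logistic regression coefficient is through its odds ratio, $e^{\\beta}$. The coefficient value below is hypothetical, chosen only to show the calculation; it is not your fitted estimate."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical coefficient for MaxHR (NOT the fitted value from logit1)\n",
"import numpy as np\n",
"\n",
"beta_maxhr = -0.04\n",
"odds_ratio = np.exp(beta_maxhr)\n",
"# A one-unit increase in MaxHR multiplies the odds of AHD by this factor,\n",
"# holding Age fixed (a value below 1 means the odds decrease).\n",
"print(odds_ratio)"
]
},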
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q1.1** Fit an unregularized 4th order polynomial (with interactions) logistic regression model (`logit4`) to predict `AHD` from `Age` and `MaxHR` in the training set (with `penalty='none'`). Print out the coefficient estimates."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_logit4) ###\n",
"\n",
"degree = ___\n",
"predictors = ['Age','MaxHR']\n",
"\n",
"X_train4 = PolynomialFeatures(degree=degree,include_bias=False).fit_transform(___)\n",
"\n",
"logit4 = LogisticRegression(penalty='none', max_iter = 5000).fit(___)\n",
"\n",
"print(\"Logistic Regression Estimated Betas:\",\n",
" logit4.___,logit4.___)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q1.2** Evaluate the models based on misclassification rate in both the test set. "
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_misclass) ###\n",
"\n",
"######\n",
"# your code here\n",
"######\n",
"\n",
"predictors = ['Age','MaxHR']\n",
"X_test1 = PolynomialFeatures(degree=1,include_bias=False).fit_transform(heart_test[predictors])\n",
"X_test4 = PolynomialFeatures(degree=4,include_bias=False).fit_transform(heart_test[predictors])\n",
"y_test = heart_test['AHD']\n",
"\n",
"# use logit.score()\n",
"misclass_logit1 = ___\n",
"misclass_logit4 = ___\n",
"\n",
"print(\"Overall misclassification rate in test for logit1:\",misclass_logit1)\n",
"print(\"Overall misclassification rate in test for logit4:\",misclass_logit4)\n"
]
},
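{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Illustration (not graded):* the misclassification rate is simply 1 minus the accuracy. A tiny sketch with made-up labels:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Made-up labels for illustration: misclassification rate = 1 - accuracy\n",
"from sklearn.metrics import accuracy_score\n",
"\n",
"y_true = [0, 1, 1, 0, 1]\n",
"y_pred = [0, 1, 0, 0, 1]  # one of the five labels is wrong\n",
"\n",
"misclass = 1 - accuracy_score(y_true, y_pred)\n",
"print(misclass)  # 1 - 0.8, i.e. a 20% misclassification rate"
]
},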
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The code below performs the classification predictions for the model at all values in the range of the two predictors for `logit1`. Then the predictions and the train dataset are added to a scatterplot in the second code chunk:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"n = 100\n",
"\n",
"x1=np.linspace(np.min(heart[['Age']]),np.max(heart[['Age']]),n)\n",
"x2=np.linspace(np.min(heart[['MaxHR']]),np.max(heart[['MaxHR']]),n)\n",
"x1v, x2v = np.meshgrid(x1, x2)\n",
"\n",
"# This is how we would typically do the prediction (have a vector of yhats)\n",
"#yhat10 = knn10.predict(np.array([x1v.flatten(),x2v.flatten()]).reshape(-1,2))\n",
"\n",
"# To do the predictions and keep the yhats on 2-D (to match the dummy predictor shapes), use this\n",
"X = np.c_[x1v.ravel(), x2v.ravel()]\n",
"X_dummy = PolynomialFeatures(degree=1,include_bias=False).fit_transform(X)\n",
"\n",
"\n",
"yhat1 = logit1.predict(X_dummy)\n"
]
},
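{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Illustration (not graded):* the shape bookkeeping above (grid → flat design matrix → predictions reshaped back to the grid) can be checked on a tiny example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Tiny shape check for the meshgrid -> ravel -> reshape round trip\n",
"import numpy as np\n",
"\n",
"a, b = np.meshgrid(np.arange(3), np.arange(4))  # both have shape (4, 3)\n",
"flat = np.c_[a.ravel(), b.ravel()]              # (12, 2): one row per grid point\n",
"back = flat[:, 0].reshape(a.shape)              # (4, 3): aligned with the grid again\n",
"\n",
"print(a.shape, flat.shape, back.shape)"
]
},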
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"\n",
"plt.pcolormesh(x1v, x2v, yhat1.reshape(x1v.shape),alpha = 0.05) \n",
"plt.scatter(heart_train['Age'],heart_train['MaxHR'],c=heart_train['AHD'])\n",
"plt.ylabel(\"MaxHR\")\n",
"plt.xlabel(\"Age\")\n",
"plt.title(\"Yellow = Predicted to have AHD, Purple = Predicted to not have AHD\")\n",
"plt.colorbar()\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"#Perform the same calculation above, but for the 4th order polynomial\n",
"\n",
"X_dummy = PolynomialFeatures(degree=4,include_bias=False).fit_transform(X)\n",
"yhat4 = logit4.predict(___)\n",
"\n",
"plt.pcolormesh(x1v, x2v, yhat4.reshape(x1v.shape),alpha = 0.05) \n",
"plt.scatter(heart_train['Age'],heart_train['MaxHR'],c=heart_train['AHD'])\n",
"plt.ylabel(\"MaxHR\")\n",
"plt.xlabel(\"Age\")\n",
"plt.title(\"Yellow = Predicted to have AHD, Purple = Predicted to not have AHD\")\n",
"plt.colorbar()\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q1.3** Compare the two models above on how they create the classification boundary. Which is more likely to be overfit? How would regularization affect these boundaries?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*your answer here*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q1.4** Fit a ridge-like Logistic Regression model with `C=0.0001` on the 4th order polynomial as before. Compare this regularized model with the unregularized one by using the classification boundary."
]
},
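{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Illustration (not graded):* in `sklearn`, `C` is the inverse of the regularization strength, so a very small `C` such as 0.0001 penalizes the coefficients heavily. The effect can be seen on synthetic, made-up data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch on synthetic data: smaller C (stronger L2 penalty) shrinks coefficients\n",
"import numpy as np\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"rng = np.random.default_rng(1)\n",
"X_syn = rng.normal(size=(200, 2))\n",
"y_syn = (X_syn[:, 0] - X_syn[:, 1] > 0).astype(int)\n",
"\n",
"strong_reg = LogisticRegression(C=0.0001, max_iter=5000).fit(X_syn, y_syn)\n",
"weak_reg = LogisticRegression(C=1000.0, max_iter=5000).fit(X_syn, y_syn)\n",
"\n",
"print(np.abs(strong_reg.coef_).sum())  # much closer to 0\n",
"print(np.abs(weak_reg.coef_).sum())"
]
},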
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_ridge) ###\n",
"\n",
"logit_ridge = LogisticRegression(___, max_iter = 5000).fit(___, ___)\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"\n",
"#yhat_ridge = logit_ridge.predict_proba(X_dummy)[:,1]\n",
"yhat_ridge = ___\n",
"\n",
"plt.pcolormesh(x1v, x2v, yhat_ridge.reshape(x1v.shape),alpha = 0.05) \n",
"plt.scatter(heart_train['Age'],heart_train['MaxHR'],c=heart_train['AHD'])\n",
"plt.ylabel(\"MaxHR\")\n",
"plt.xlabel(\"Age\")\n",
"plt.title(\"Yellow = Predicted to have AHD, Purple = Predicted to not have AHD\")\n",
"plt.colorbar()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*your answer here*"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}