{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Title\n", "\n", "**Exercise 1 - Regularization and Decision Boundaries in Logistic Regression**\n", "\n", "# Description\n", "\n", "The goal of the exercise is to produce a plot similar to the one given below, by performing classification predictions on a logistic regression model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Instructions: \n", "- We are trying to predict who will have AHD based on Age and MaxHR. To do so we need to:\n", "- Read the `Heart.csv` as a data frame and split into train and test.\n", "- Assign the predictor and response variables.\n", "- Fit logistic regression models and interpret results.\n", "- Compute the accuracy of the model.\n", "- Plot the classification boundaries against the two predictors.\n", "- Fit an untuned regularized logistic regression model and compare the classification boundary.\n", "\n", "# Hints:\n", "sklearn.LogisticRegression() : Generates a Logistic Regression classifier\n", "\n", "sklearn.fit() : Fits the model to the given data\n", "\n", "sklearn.predict() : Predict using the estimated model (Logistic or knn classifiers) to perform pure classification predictions\n", "\n", "sklearn.predict_proba() : Predict using the estimated model (Logistic or knn classifiers) to perform probability predictions of all the classes in the response (they should add up to 1 for each observation)\n", "\n", "sklearn.LogisticRegression.coef_ and .intercept_ : Pull off the estimated $\\beta$ coefficients in a Logistic Regression model\n", "\n", "sklearn.score() : Returns the mean accuracy of the fitted model on the given data and labels\n", "\n", "sklearn.accuracy_score() : Accuracy classification score\n", "\n", "matplotlib.pcolormesh() : Creates a pseudocolor plot with a non-regular rectangular grid\n", "\n", "**Note: This exercise is auto-graded and you can try multiple attempts.**" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, 
"outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import sklearn as sk\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.preprocessing import PolynomialFeatures" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "heart = pd.read_csv('Heart.csv')\n", "\n", "# Force the response into a binary indicator:\n", "heart['AHD'] = 1*(heart['AHD'] == \"Yes\")\n", "\n", "print(heart.shape)\n", "#heart.head()\n", "heart.describe()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "heart_train, heart_test = train_test_split(heart, test_size=0.3, random_state = 109)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q1.1** Below we fit an unregularized logistic regression model (`logit1`) to predict `AHD` from `Age` and `MaxHR` in the training set (with `penalty='none'`). Print out the coefficient estimates, and interpret general trends." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "degree = 1\n", "predictors = ['Age','MaxHR']\n", "\n", "X_train1 = PolynomialFeatures(degree=degree,include_bias=False).fit_transform(heart_train[predictors])\n", "y_train = heart_train['AHD']\n", "\n", "\n", "logit1 = LogisticRegression(penalty='none', max_iter = 5000).fit(X_train1, y_train)\n", "\n", "print(\"Logistic Regression Estimated Betas:\",\n", " logit1.___,logit1.___)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*your interpretation here*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q1.1** Fit an unregularized 4th order polynomial (with interactions) logistic regression model (`logit4`) to predict `AHD` from `Age` and `MaxHR` in the training set (with `penalty='none'`). Print out the coefficient estimates." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "### edTest(test_logit4) ###\n", "\n", "degree = ___\n", "predictors = ['Age','MaxHR']\n", "\n", "X_train4 = PolynomialFeatures(degree=degree,include_bias=False).fit_transform(___)\n", "\n", "logit4 = LogisticRegression(penalty='none', max_iter = 5000).fit(___)\n", "\n", "print(\"Logistic Regression Estimated Betas:\",\n", " logit4.___,logit4.___)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q1.2** Evaluate both models based on their misclassification rate in the test set." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "### edTest(test_misclass) ###\n", "\n", "######\n", "# your code here\n", "######\n", "\n", "predictors = ['Age','MaxHR']\n", "X_test1 = PolynomialFeatures(degree=1,include_bias=False).fit_transform(heart_test[predictors])\n", "X_test4 = PolynomialFeatures(degree=4,include_bias=False).fit_transform(heart_test[predictors])\n", "y_test = heart_test['AHD']\n", "\n", "# misclassification rate = 1 - accuracy; use logit.score()\n", "misclass_logit1 = ___\n", "misclass_logit4 = ___\n", "\n", "print(\"Overall misclassification rate in test for logit1:\",misclass_logit1)\n", "print(\"Overall misclassification rate in test for logit4:\",misclass_logit4)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code below computes the classification predictions of `logit1` on a grid covering the range of the two predictors. 
Then the predictions and the train dataset are added to a scatterplot in the second code chunk:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "n = 100\n", "\n", "x1=np.linspace(np.min(heart[['Age']]),np.max(heart[['Age']]),n)\n", "x2=np.linspace(np.min(heart[['MaxHR']]),np.max(heart[['MaxHR']]),n)\n", "x1v, x2v = np.meshgrid(x1, x2)\n", "\n", "# This is how we would typically do the prediction (have a vector of yhats)\n", "#yhat10 = knn10.predict(np.array([x1v.flatten(),x2v.flatten()]).reshape(-1,2))\n", "\n", "# To do the predictions and keep the yhats on 2-D (to match the dummy predictor shapes), use this\n", "X = np.c_[x1v.ravel(), x2v.ravel()]\n", "X_dummy = PolynomialFeatures(degree=1,include_bias=False).fit_transform(X)\n", "\n", "\n", "yhat1 = logit1.predict(X_dummy)\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "\n", "plt.pcolormesh(x1v, x2v, yhat1.reshape(x1v.shape),alpha = 0.05) \n", "plt.scatter(heart_train['Age'],heart_train['MaxHR'],c=heart_train['AHD'])\n", "plt.ylabel(\"MaxHR\")\n", "plt.xlabel(\"Age\")\n", "plt.title(\"Yellow = Predicted to have AHD, Purple = Predicted to not have AHD\")\n", "plt.colorbar()\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "#Perform the same calculation above, but for the 4th order polynomial\n", "\n", "X_dummy = PolynomialFeatures(degree=4,include_bias=False).fit_transform(X)\n", "yhat4 = logit4.predict(___)\n", "\n", "plt.pcolormesh(x1v, x2v, yhat4.reshape(x1v.shape),alpha = 0.05) \n", "plt.scatter(heart_train['Age'],heart_train['MaxHR'],c=heart_train['AHD'])\n", "plt.ylabel(\"MaxHR\")\n", "plt.xlabel(\"Age\")\n", "plt.title(\"Yellow = Predicted to have AHD, Purple = Predicted to not have AHD\")\n", "plt.colorbar()\n", "plt.show()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q1.3** Compare the two models above on how they create the 
classification boundary. Which is more likely to be overfit? How would regularization affect these boundaries?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*your answer here*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q1.4** Fit a ridge-like Logistic Regression model with `C=0.0001` on the 4th order polynomial as before. Compare this regularized model with the unregularized one by using the classification boundary." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "### edTest(test_ridge) ###\n", "\n", "logit_ridge = LogisticRegression(___, max_iter = 5000).fit(___, ___)\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "\n", "#yhat_ridge = logit_ridge.predict_proba(X_dummy)[:,1]\n", "yhat_ridge = ___\n", "\n", "plt.pcolormesh(x1v, x2v, yhat_ridge.reshape(x1v.shape),alpha = 0.05) \n", "plt.scatter(heart_train['Age'],heart_train['MaxHR'],c=heart_train['AHD'])\n", "plt.ylabel(\"MaxHR\")\n", "plt.xlabel(\"Age\")\n", "plt.title(\"Yellow = Predicted to have AHD, Purple = Predicted to not have AHD\")\n", "plt.colorbar()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*your answer here*" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }