{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Title :\n", "Exercise: Decision Boundaries\n", "\n", "## Description :\n", "In this exercise we will compare the classification boundaries produced by regularized and unregularized logistic regression models.\n", "\n", "Don't forget to consult the `LogisticRegression` documentation." ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import sklearn as sk\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.preprocessing import PolynomialFeatures\n", "import statsmodels.api as sm" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "heart = pd.read_csv('Heart.csv')\n", "\n", "# Force the response into a binary indicator (1 = \"Yes\", 0 = \"No\"):\n", "heart['AHD'] = (heart['AHD'] == \"Yes\").astype(int)\n", "\n", "heart.describe()" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Split the data into training (70%) and test (30%) sets\n", "heart_train, heart_test = train_test_split(heart, test_size=0.3, random_state=109)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fit an unregularized logistic regression model (`logit1`) to predict `AHD` from `Age` and `MaxHR` in the training set (with `penalty='none'` and `max_iter = 5000`; scikit-learn 1.2 and later expect `penalty=None` in place of the string `'none'`). Print out the coefficient estimates and interpret the general trends." 
] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "### edTest(test_logit1) ###\n", "\n", "degree = 1\n", "predictors = ['Age','MaxHR']\n", "\n", "X_train1 = PolynomialFeatures(degree=degree,include_bias=False).fit_transform(heart_train[predictors])\n", "y_train = heart_train['AHD']\n", "\n", "logit1 = ___\n", "\n", "print(\"Logistic Regression Estimated Betas:\",\n", "      logit1.intercept_,logit1.coef_)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fit an unregularized 4th-order polynomial (with interactions) logistic regression model (`logit4`) to predict `AHD` from `Age` and `MaxHR` in the training set (with `penalty='none'` and `max_iter = 5000`). Print out the coefficient estimates." ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "degree = ___\n", "predictors = ['Age','MaxHR']\n", "\n", "X_train4 = PolynomialFeatures(degree=degree,include_bias=False).fit_transform(heart_train[predictors])\n", "\n", "logit4 = ___\n", "\n", "print(\"Logistic Regression Estimated Betas:\",\n", "      logit4.intercept_,logit4.coef_)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Evaluate both models based on their misclassification rate in the test set." ] },
{ "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "### edTest(test_misclass) ###\n", "\n", "######\n", "# your code here\n", "######\n", "\n", "predictors = ['Age','MaxHR']\n", "X_test1 = PolynomialFeatures(degree=1,include_bias=False).fit_transform(heart_test[predictors])\n", "X_test4 = PolynomialFeatures(degree=4,include_bias=False).fit_transform(heart_test[predictors])\n", "y_test = heart_test['AHD']\n", "\n", "misclass_logit1 = ___\n", "misclass_logit4 = ___\n", "\n", "print(\"Overall misclassification rate in test for logit1:\",misclass_logit1)\n", "print(\"Overall misclassification rate in test for logit4:\",misclass_logit4)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first code chunk below uses `logit1` to predict the class at every point of a grid spanning the observed range of the two predictors. The second chunk plots these predictions as a shaded background, with the training data overlaid as a scatterplot:" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "n = 100\n", "\n", "# Evenly spaced values across the observed range of each predictor\n", "x1 = np.linspace(heart['Age'].min(), heart['Age'].max(), n)\n", "x2 = np.linspace(heart['MaxHR'].min(), heart['MaxHR'].max(), n)\n", "x1v, x2v = np.meshgrid(x1, x2)\n", "\n", "# Flatten the grid into an (n*n, 2) matrix for prediction; yhat is reshaped back to 2-D when plotting\n", "X = np.c_[x1v.ravel(), x2v.ravel()]\n", "X_dummy = PolynomialFeatures(degree=1,include_bias=False).fit_transform(X)\n", "yhat1 = logit1.predict(X_dummy)" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "plt.pcolormesh(x1v, x2v, yhat1.reshape(x1v.shape),alpha = 0.05)\n", "plt.scatter(heart_train['Age'],heart_train['MaxHR'],c=heart_train['AHD'])\n", "plt.ylabel(\"MaxHR\")\n", "plt.xlabel(\"Age\")\n", "plt.title(\"Yellow = Predicted to have AHD, Purple = Predicted to not have AHD\")\n", "plt.colorbar()\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 0, 
"metadata": {}, "outputs": [], "source": [ "X_dummy = PolynomialFeatures(degree=4,include_bias=False).fit_transform(X)\n", "yhat4 = logit4.predict(X_dummy)\n", "\n", "plt.pcolormesh(x1v, x2v, yhat4.reshape(x1v.shape),alpha = 0.05) \n", "plt.scatter(heart_train['Age'],heart_train['MaxHR'],c=heart_train['AHD'])\n", "plt.ylabel(\"MaxHR\")\n", "plt.xlabel(\"Age\")\n", "plt.title(\"Yellow = Predicted to have AHD, Purple = Predicted to not have AHD\")\n", "plt.colorbar()\n", "plt.show()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compare the two models above on how they create the classification boundary. Which is more likely to be overfit? How would regularization affect these boundaries?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*your answer here*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fit a ridge-like Logistic Regression model with `C=0.0001` and `max_iter=5000` on the 4th order polynomial as before. Compare this regularized model with the unregularized one by using the classification boundary." ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "### edTest(test_ridge) ###\n", "\n", "logit_ridge = LogisticRegression(___).fit(X_train4, y_train)\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "yhat_ridge = logit_ridge.predict(X_dummy)\n", "\n", "plt.pcolormesh(x1v, x2v, yhat_ridge.reshape(x1v.shape),alpha = 0.05) \n", "plt.scatter(heart_train['Age'],heart_train['MaxHR'],c=heart_train['AHD'])\n", "plt.ylabel(\"MaxHR\")\n", "plt.xlabel(\"Age\")\n", "plt.title(\"Yellow = Predicted to have AHD, Purple = Predicted to not have AHD\")\n", "plt.colorbar()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*your answer here*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Perfect Separation\n", "We modify the data to demonstrate perfect separation." 
] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "predictors = ['Age','MaxHR']\n", "X_train_new = heart_train[predictors].copy()\n", "# Shift Age by 100 for every AHD case so that Age alone separates the two classes\n", "X_train_new['Age'] = X_train_new['Age'] + 100*y_train.values\n", "\n", "\n", "plt.plot(X_train_new['Age'], y_train ,'o', markersize=7,color=\"#011DAD\",label=\"Data\")\n", "\n", "plt.xlabel(\"Age\")\n", "plt.ylabel(\"AHD\")\n", "plt.yticks((0,1), labels=('No', 'Yes'))\n", "\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Try to fit a logistic regression model to the perfectly separated data\n", "\n", "X_train_new = sm.add_constant(X_train_new)\n", "\n", "try:\n", "    logreg = sm.Logit(y_train, X_train_new).fit()\n", "except Exception as e:\n", "    print(e)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 1 }