{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Title :\n",
    "Exercise: Decision Boundaries\n",
    "\n",
    "## Description :\n",
    "In this exercise we will be comparing the classification boundaries we receive from regularized and unregularized logistic regression models.\n",
    "\n",
    "Don't forget the <a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html\" target=\"_blank\">LogisticRegression</a> documentation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import sklearn as sk\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.preprocessing import PolynomialFeatures\n",
    "import statsmodels.api as sm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {},
   "outputs": [],
   "source": [
    "heart = pd.read_csv('Heart.csv')\n",
    "\n",
    "# Force the response into a binary indicator:\n",
    "heart['AHD'] = 1*(heart['AHD'] == \"Yes\")\n",
    "\n",
    "heart.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {},
   "outputs": [],
   "source": [
    "# split train and test data\n",
    "heart_train, heart_test = train_test_split(heart, test_size=0.3, random_state = 109)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Fit an unregularized logistic regression model (`logit1`) to predict `AHD` from `Age` and `MaxHR` in the training set (with `penalty='none'` and `max_iter = 5000`).  Print out the coefficient estimates, and interpret general trends."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {},
   "outputs": [],
   "source": [
    "### edTest(test_logit1) ###\n",
    "\n",
    "degree = 1\n",
    "predictors = ['Age','MaxHR']\n",
    "\n",
    "X_train1 = PolynomialFeatures(degree=degree,include_bias=False).fit_transform(heart_train[predictors])\n",
    "y_train = heart_train['AHD']\n",
    "\n",
    "logit1 = ___\n",
    "\n",
    "print(\"Logistic Regression Estimated Betas:\",\n",
    "      logit1.intercept_,logit1.coef_)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Fit an unregularized 4th order polynomial (with interactions) logistic regression model (`logit4`) to predict `AHD` from `Age` and `MaxHR` in the training set (with `penalty='none'` and `max_iter = 5000`).  Print out the coefficient estimates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {},
   "outputs": [],
   "source": [
    "degree = ___\n",
    "predictors = ['Age','MaxHR']\n",
    "\n",
    "X_train4 = PolynomialFeatures(degree=degree,include_bias=False).fit_transform(heart_train[predictors])\n",
    "\n",
    "logit4 = ___\n",
    "\n",
    "print(\"Logistic Regression Estimated Betas:\",\n",
    "      logit4.intercept_,logit4.coef_)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Evaluate the models based on misclassification rate in both the test set. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {},
   "outputs": [],
   "source": [
    "### edTest(test_misclass) ###\n",
    "\n",
    "######\n",
    "# your code here\n",
    "######\n",
    "\n",
    "predictors = ['Age','MaxHR']\n",
    "X_test1 = PolynomialFeatures(degree=1,include_bias=False).fit_transform(heart_test[predictors])\n",
    "X_test4 = PolynomialFeatures(degree=4,include_bias=False).fit_transform(heart_test[predictors])\n",
    "y_test = heart_test['AHD']\n",
    "\n",
    "misclass_logit1 = ___\n",
    "misclass_logit4 = ___\n",
    "\n",
    "print(\"Overall misclassification rate in test for logit1:\",misclass_logit1)\n",
    "print(\"Overall misclassification rate in test for logit4:\",misclass_logit4)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The code below performs the classification predictions for the model at all values in the range of the two predictors for `logit1`.  Then the predictions and the train dataset are added to a scatterplot in the second code chunk:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {},
   "outputs": [],
   "source": [
    "n = 100\n",
    "\n",
    "x1=np.linspace(np.min(heart[['Age']]),np.max(heart[['Age']]),n)\n",
    "x2=np.linspace(np.min(heart[['MaxHR']]),np.max(heart[['MaxHR']]),n)\n",
    "x1v, x2v = np.meshgrid(x1, x2)\n",
    "\n",
    "# To do the predictions and keep the yhats on 2-D (to match the dummy predictor shapes), use this\n",
    "X = np.c_[x1v.ravel(), x2v.ravel()]\n",
    "X_dummy = PolynomialFeatures(degree=1,include_bias=False).fit_transform(X)\n",
    "yhat1 = logit1.predict(X_dummy)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.pcolormesh(x1v, x2v, yhat1.reshape(x1v.shape),alpha = 0.05) \n",
    "plt.scatter(heart_train['Age'],heart_train['MaxHR'],c=heart_train['AHD'])\n",
    "plt.ylabel(\"MaxHR\")\n",
    "plt.xlabel(\"Age\")\n",
    "plt.title(\"Yellow = Predicted to have AHD, Purple = Predicted to not have AHD\")\n",
    "plt.colorbar()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_dummy = PolynomialFeatures(degree=4,include_bias=False).fit_transform(X)\n",
    "yhat4 = logit4.predict(X_dummy)\n",
    "\n",
    "plt.pcolormesh(x1v, x2v, yhat4.reshape(x1v.shape),alpha = 0.05) \n",
    "plt.scatter(heart_train['Age'],heart_train['MaxHR'],c=heart_train['AHD'])\n",
    "plt.ylabel(\"MaxHR\")\n",
    "plt.xlabel(\"Age\")\n",
    "plt.title(\"Yellow = Predicted to have AHD, Purple = Predicted to not have AHD\")\n",
    "plt.colorbar()\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compare the two models above on how they create the classification boundary.  Which is more likely to be overfit?  How would regularization affect these boundaries?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*your answer here*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Fit a ridge-like Logistic Regression model with `C=0.0001` and `max_iter=5000` on the 4th order polynomial as before.  Compare this regularized model with the unregularized one by using the classification boundary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {},
   "outputs": [],
   "source": [
    "### edTest(test_ridge) ###\n",
    "\n",
    "logit_ridge = LogisticRegression(___).fit(X_train4, y_train)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {},
   "outputs": [],
   "source": [
    "yhat_ridge = logit_ridge.predict(X_dummy)\n",
    "\n",
    "plt.pcolormesh(x1v, x2v, yhat_ridge.reshape(x1v.shape),alpha = 0.05) \n",
    "plt.scatter(heart_train['Age'],heart_train['MaxHR'],c=heart_train['AHD'])\n",
    "plt.ylabel(\"MaxHR\")\n",
    "plt.xlabel(\"Age\")\n",
    "plt.title(\"Yellow = Predicted to have AHD, Purple = Predicted to not have AHD\")\n",
    "plt.colorbar()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*your answer here*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Perfect Separation\n",
    "We modify the data to demonstrate perfect separation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {},
   "outputs": [],
   "source": [
    "predictors = ['Age','MaxHR']\n",
    "X_train_new = heart_train[predictors].copy()\n",
    "X_train_new['Age'] = X_train_new['Age'] + 100*y_train.values\n",
    "\n",
    "\n",
    "plt.plot(X_train_new['Age'], y_train ,'o', markersize=7,color=\"#011DAD\",label=\"Data\")\n",
    "\n",
    "plt.xlabel(\"Age\")\n",
    "plt.ylabel(\"AHD\")\n",
    "plt.yticks((0,1), labels=('No', 'Yes'))\n",
    "\n",
    "plt.legend()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Try to train a logistic regression model\n",
    "\n",
    "X_train_new = sm.add_constant(X_train_new)\n",
    "\n",
    "try:\n",
    "    logreg = sm.Logit(y_train, X_train_new).fit()\n",
    "except Exception as e: \n",
    "    print(e)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}