{ "cells": [ { "cell_type": "markdown", "metadata": { "button": false, "hide": true, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "# CS-109A Introduction to Data Science\n", "\n", "\n", "## Lab 7: $k$-NN Classification and Imputation\n", "\n", "**Harvard University**
\n", "**Fall 2019**
\n", "**Instructors:** Pavlos Protopapas, Kevin Rader, Chris Tanner
\n", "**Lab Instructors:** Chris Tanner and Eleni Kaxiras.
\n", "**Contributors:** Kevin Rader\n", "\n", "---" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "outputs": [ { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES\n", "import requests\n", "from IPython.display import HTML\n", "styles = requests.get(\"https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css\").text\n", "HTML(styles)" ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "## Learning Goals \n", "In this lab, we'll explore classification models to predict the health status of survey respondents, and we'll choose a classification decision boundary that suits the resulting unbalanced classes.\n", "\n", "By the end of this lab, you should:\n", "- Be familiar with the `sklearn` implementations of\n", "  - $k$-NN Classification\n", "  - ROC curves and classification metrics\n", "- Be able to optimize some loss function based on misclassification rates\n", "- Be able to impute missing values\n", "- Be comfortable with the different approaches to handling missingness" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "button": false, "hide": true, "new_sheet": false, "run_control": { "read_only": false } }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib\n", "import matplotlib.pyplot as plt\n", "import statsmodels.api as sm\n", "from statsmodels.api import OLS\n", "import sklearn as sk\n", "from sklearn.decomposition import PCA\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.linear_model import LogisticRegressionCV\n", "from sklearn.utils import resample\n", "from sklearn.model_selection import cross_val_score\n", "from sklearn.metrics import accuracy_score\n", "# %matplotlib inline\n", "# seaborn.apionly is deprecated; import seaborn directly\n", "import seaborn as sns" ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "## Part 1: General Social Survey Data + EDA\n", "\n", "The dataset contains a subset of data from the General Social Survey (GSS), a biennial survey of roughly 2000 Americans. We will be using a small subset of the approximately 4000 questions they ask. Specifically, we'll use: \n", "\n", "- **id:** respondent's unique ID\n", "- **health:** self-reported health level with 4 categories: poor, fair, good, excellent\n", "- **partyid:** political party affiliation with categories dem, rep, or other\n", "- **age:** age in years\n", "- **sex:** male or female\n", "- **sexornt:** sexual orientation with categories hetero, gay, or bisexual/other\n", "- **educ:** number of years of formal education (capped at 20 years)\n", "- **marital:** marital status with categories married, never married, and no longer married\n", "- **race:** with categories black, white, and other\n", "- **income:** in thousands of dollars\n", "\n", "Our goal is to predict whether or not someone is in **poor health** based on the other measures.\n", "\n", "For this task, we will exercise our normal data science pipeline -- from EDA to modeling and visualization. 
In particular, we will show the performance of 2 classifiers:\n", "\n", "- Logistic Regression\n", "- $k$-NN Classification\n", "\n", "So without further ado..." ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "### EDA" ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "Do the following basic EDA (always good ideas):\n", "1. Determine the dimensions of the data set.\n", "2. Get a glimpse of the data set.\n", "3. Calculate basic summary/descriptive statistics of the variables.\n", "\n", "We also ask that you do the following:\n", "4. Create a binary variable called `poorhealth`. \n", "5. Explore the distribution of the responses, `health` and `poorhealth`. \n", "6. Explore what variables may be related to whether or not someone is in poor health. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The dimensions of the data set are: 1569 observations and 10 variables.\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " id health partyid age sex sexornt educ \\\n", "0 1 good rep 43.0 male bisexual/other 14.0 \n", "1 2 excellent dem 74.0 female hetero 10.0 \n", "2 5 excellent rep 71.0 male hetero 18.0 \n", "3 6 good dem 67.0 female bisexual/other 16.0 \n", "4 7 good dem 59.0 female bisexual/other 13.0 \n", "\n", " marital race income \n", "0 never married white NaN \n", "1 no longer married white NaN \n", "2 no longer married black NaN \n", "3 no longer married white NaN \n", "4 no longer married black 18.75 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gssdata = pd.read_csv(\"data/gsshealth18.csv\")\n", "\n", "#####\n", "# Your code here: EDA\n", "# 1. Determine the dimensions of the data set.\n", "# 2. Get a glimpse of the data set.\n", "####\n", "\n", "print(\"The dimensions of the data set are:\", gssdata.shape[0], \"observations and\", gssdata.shape[1], \"variables.\" )\n", "gssdata.head()\n" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "good 771\n", "excellent 359\n", "fair 355\n", "poor 84\n", "Name: health, dtype: int64\n", "dem 708\n", "rep 514\n", "other 347\n", "Name: partyid, dtype: int64\n", "female 872\n", "male 697\n", "Name: sex, dtype: int64\n", "bisexual/other 907\n", "hetero 640\n", "gay 22\n", "Name: sexornt, dtype: int64\n", "married 655\n", "never married 458\n", "no longer married 454\n", "Name: marital, dtype: int64\n", "white 1137\n", "black 259\n", "other 173\n", "Name: race, dtype: int64\n" ] } ], "source": [ "# 3. 
Calculate basic summary/descriptive statistics of the variables.\n", "gssdata.describe()\n", "\n", "\n", "print(gssdata['health'].value_counts())\n", "print(gssdata['partyid'].value_counts())\n", "print(gssdata['sex'].value_counts())\n", "print(gssdata['sexornt'].value_counts())\n", "print(gssdata['marital'].value_counts())\n", "print(gssdata['race'].value_counts())" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "outputs": [ { "data": { "text/plain": [ "0.05353728489483748" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#####\n", "# Your code here: EDA\n", "# 4. Create a binary variable called `poorhealth`. \n", "# 5. Explore the distribution of the responses, `health` and `poorhealth`. \n", "# 6. Explore what variables may be related to whether or not someone is in poor health.\n", "####\n", "\n", "gssdata['poorhealth']=1*(gssdata['health']=='poor')\n", "gssdata['poorhealth'].mean()" ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "**Question**: What classification accuracy could you achieve if you simply predicted `poorhealth` without a model? What classification accuracy would you get if you were to predict the multi-class `health` variable? Is accuracy the correct metric?" ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "**Solution**: Poor health is quite a rare health status: only 5.35\\% of respondents said they were in poor health. If we predicted all persons to be in better than poor health, our naive classifier would have $1-0.0535 = 0.9465 = 94.65\\%$ accuracy. For the multi-class `health` variable, always predicting the most common category (good, 771 of 1569 respondents) would give only $771/1569 \\approx 49.1\\%$ accuracy. 
Accuracy is almost certainly not the ideal metric to use here: we'd be better off looking at the false positive and false negative rates instead (it is more important to correctly classify those in poor health than those in better than poor health)." ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "### Data Cleaning - Basic Handling of Missingness" ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "Let's begin by fitting an unregularized logistic regression model to predict poor health based on all the other predictors in the model and three $k$-NN models with $k=1,15,25$.\n", "\n", "First we need to do a small amount of data clean-up. \n", "1. Determine the amount of missingness in each variable. If there is *a lot*, we will drop the variable from the predictor set (not quite yet). If there is a little, we will impute.\n", "2. Drop any variables with lots of missingness (in a new data set).\n", "3. Do simple imputations for variables with a little bit of missingness.\n", "4. Create dummies for categorical predictors.\n" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "outputs": [ { "data": { "text/plain": [ "id 0\n", "health 0\n", "partyid 0\n", "age 2\n", "sex 0\n", "sexornt 0\n", "educ 2\n", "marital 2\n", "race 0\n", "income 661\n", "poorhealth 0\n", "dtype: int64" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#########\n", "# 1. Determine the amount of missingness in each variable. 
\n", "# Use isna() in combination with .sum()\n", "########\n", "\n", "# Your code here\n", "\n", "gssdata.isna().sum()" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "outputs": [ { "data": { "text/plain": [ "age 2\n", "educ 2\n", "female 0\n", "marital_never married 0\n", "marital_no longer married 0\n", "race_other 0\n", "race_white 0\n", "sexornt_gay 0\n", "sexornt_hetero 0\n", "partyid_other 0\n", "partyid_rep 0\n", "dtype: int64" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#######\n", "# And then build your predictor set\n", "# 2. Drop any variables with lots of missingness (in a new data set).\n", "# 3. Do simple imputations for variables with a little bit of missingness.\n", "# 4. Create dummies for categorical predictors.\n", "#########\n", "\n", "# get the predictors without a ton of missingness \n", "# (income was not included since it had so much missingness)\n", "# .copy() gives X its own data, so adding columns below does not\n", "# trigger pandas' SettingWithCopyWarning\n", "X = gssdata[['partyid','age','sex','sexornt','educ','marital','race']].copy()\n", "\n", "# create dummies 2 different ways (there are lots of ways to do it):\n", "# (a) by hand for sex, (b) with pd.get_dummies for the other categoricals\n", "X['female'] = 1*(gssdata['sex']==\"female\")\n", "dummies = pd.get_dummies(X[['marital','race','sexornt','partyid']],drop_first=True)\n", "\n", "# add the dummies in via the join command.\n", "X = X.join(dummies)\n", "\n", "# let's drop the redundant variables no longer needed since we created the dummies\n", "X = X.drop(['partyid','sex','sexornt','marital','race'],axis=1)\n", "\n", "# now check the 'nulls'\n", "X.isna().sum()" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "# handle missingness in age and education\n", "# missingness in marital was handled by get_dummies: a missing value gets 0 in\n", "# every marital dummy, so it is treated as the dropped reference category (married)\n", "\n", "# impute the median age\n", "X['age']=X['age'].fillna(X['age'].median())\n", "\n", "# impute the most common education: having a HS degree (13 years)\n", "# see histogram for justification\n", "X['educ']=X['educ'].fillna(13)" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAXcAAAD4CAYAAAAXUaZHAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjAsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+17YcXAAAQbElEQVR4nO3df6zddX3H8edrFHRTx8/CWNtYnc2i+0MkDevGZpw4xg9j2SILxowGmzRmmGjcMruZOLfsD9gy2VgWlk6MxTiFqYxGcNoAxiwZ6AX5KWoLqdK1o1WgaIib6Ht/nM91l9tze8+9vedc+vH5SE6+3+/n8zn3++73fO+r3/s5v1JVSJL68jPLXYAkaekZ7pLUIcNdkjpkuEtShwx3SerQiuUuAOC0006rtWvXLncZknRMueeee75TVSuH9b0gwn3t2rVMTU0tdxmSdExJ8q25+pyWkaQOGe6S1CHDXZI6ZLhLUocMd0nqkOEuSR0y3CWpQ4a7JHXIcJekDr0g3qEq6XBrt966LPvdc9XFy7JfLS2v3CWpQ4a7JHXIcJekDhnuktQhw12SOmS4S1KHDHdJ6pDhLkkdMtwlqUOGuyR1yHCXpA4Z7pLUIcNdkjpkuEtShwx3SeqQ4S5JHTLcJalDI4V7kj1JHkxyX5Kp1nZKkp1JdrXlya09Sa5NsjvJA0nOHuc/QJJ0uIVcuf9WVZ1VVevb9lbg9qpaB9zetgEuBNa12xbguqUqVpI0mqOZltkIbG/r24FLZrTfUAN3ASclOfMo9iNJWqBRw72ALyS5J8mW1nZGVe0HaMvTW/sq4PEZ993b2p4nyZYkU0mmDh48uLjqJUlDrRhx3LlVtS/J6cDOJF8/wtgMaavDGqq2AdsA1q9ff1i/JGnxRrpyr6p9bXkAuBk4B3hierqlLQ+04XuBNTPuvhrYt1QFS5LmN2+4J3lJkpdNrwPnAw8BO4BNbdgm4Ja2vgO4vL1qZgNwaHr6RpI0GaNMy5wB3Jxkevy/VNW/J/kKcFOSzcC3gUvb+NuAi4DdwLPAFUtetSTpiOYN96p6DHjtkPbvAucNaS/gyiWpTpK0KL5DVZI6ZLhLUocMd0nqkOEuSR0y3CWpQ4a7JHXIcJekDhnuktQhw12SOmS4S1KHDHdJ6pDhLkkdMtwlqUOGuyR1yHCXpA4Z7pLUIcNdkjpkuEtShwx3SeqQ4S5JHTLcJalDhrskdchwl6QOGe6S1CHDXZI6ZLhLUocMd0nqkOEuSR0aOdyTHJfkq0k+27ZfkeTuJLuS3JjkhNb+ora9u/WvHU/pkqS5LOTK/d3AIzO2rwauqap1wFPA5ta+GXiqql4FXNPGSZImaKRwT7IauBj4cNsO8EbgU23IduCStr6xbdP6z2vjJUkTMuqV+98BfwL8uG2fCjxdVc+17b3Aqra+CngcoPUfauOfJ8mWJFNJpg4ePLjI8iVJw8wb7kneDByoqntmNg8ZWiP0/X9D1baqWl9V61euXDlSsZKk0awYYcy5wFuSXAS8GPh5BlfyJyVZ0a7OVwP72vi9wBpgb5IVwInAk0teuSRpTvNeuVfVn1bV6qpaC1wG3FFVbwfuBN7ahm0CbmnrO9o2rf+Oqjrsyl2SND5H8zr39wHvTbKbwZz69a39euDU1v5eYOvRlShJWqhRpmV+oqq+CHyxrT8GnDNkzA+AS5egNknSIvkOVUnqkOEuSR0y3CWpQ4a7JHXIcJekDhnuktQhw12SOmS4S1KHDHdJ6pDhLkkdMtwlqUML+mwZSRqntVtvXZb97rnq4mXZ7zh55S5JHTLcJalDhrskdchwl6QOGe6S1CHDXZI6ZLhLUocMd0nqkOEuSR0y3CWpQ4a7JHXIcJekDhnuktQhw12SOmS4S1KHDHdJ6tC84Z7kxUm+nOT+JA8n+YvW/ookdyfZleTGJCe09he17d2tf+14/wmSpNlGuXL/H+CNVfVa4CzggiQbgKuBa6p
qHfAUsLmN3ww8VVWvAq5p4yRJEzRvuNfA99vm8e1WwBuBT7X27cAlbX1j26b1n5ckS1axJGleI825JzkuyX3AAWAn8CjwdFU914bsBVa19VXA4wCt/xBw6pCfuSXJVJKpgwcPHt2/QpL0PCOFe1X9qKrOAlYD5wCvHjasLYddpddhDVXbqmp9Va1fuXLlqPVKkkawoFfLVNXTwBeBDcBJSVa0rtXAvra+F1gD0PpPBJ5cimIlSaNZMd+AJCuBH1bV00l+FngTgydJ7wTeCnwS2ATc0u6yo23/Z+u/o6oOu3KX9MK0duuty12ClsC84Q6cCWxPchyDK/2bquqzSb4GfDLJXwFfBa5v468HPpZkN4Mr9svGULck6QjmDfeqegB43ZD2xxjMv89u/wFw6ZJUJ0laFN+hKkkdMtwlqUOGuyR1yHCXpA4Z7pLUIcNdkjpkuEtShwx3SeqQ4S5JHTLcJalDhrskdchwl6QOGe6S1CHDXZI6ZLhLUocMd0nqkOEuSR0y3CWpQ4a7JHXIcJekDhnuktQhw12SOmS4S1KHDHdJ6pDhLkkdMtwlqUOGuyR1yHCXpA7NG+5J1iS5M8kjSR5O8u7WfkqSnUl2teXJrT1Jrk2yO8kDSc4e9z9CkvR8o1y5Pwf8UVW9GtgAXJnkNcBW4PaqWgfc3rYBLgTWtdsW4Lolr1qSdETzhntV7a+qe9v694BHgFXARmB7G7YduKStbwRuqIG7gJOSnLnklUuS5rSgOfcka4HXAXcDZ1TVfhj8BwCc3oatAh6fcbe9rU2SNCEjh3uSlwKfBt5TVc8caeiQthry87YkmUoydfDgwVHLkCSNYKRwT3I8g2D/eFV9pjU/MT3d0pYHWvteYM2Mu68G9s3+mVW1rarWV9X6lStXLrZ+SdIQo7xaJsD1wCNV9aEZXTuATW19E3DLjPbL26tmNgCHpqdvJEmTsWKEMecCfwA8mOS+1vZnwFXATUk2A98GLm19twEXAbuBZ4ErlrRiSdK85g33qvoPhs+jA5w3ZHwBVx5lXZKko+A7VCWpQ4a7JHXIcJekDhnuktQhw12SOmS4S1KHDHdJ6pDhLkkdGuUdqpLUtbVbb122fe+56uKx/Fyv3CWpQ4a7JHXIcJekDhnuktQhw12SOmS4S1KHDHdJ6pDhLkkdMtwlqUOGuyR1yHCXpA4Z7pLUIcNdkjpkuEtShwx3SeqQ4S5JHTLcJalDhrskdchwl6QO+R2q0hEs53drSkdj3iv3JB9JciDJQzPaTkmyM8mutjy5tSfJtUl2J3kgydnjLF6SNNwo0zIfBS6Y1bYVuL2q1gG3t22AC4F17bYFuG5pypQkLcS84V5VXwKenNW8Edje1rcDl8xov6EG7gJOSnLmUhUrSRrNYp9QPaOq9gO05emtfRXw+Ixxe1vbYZJsSTKVZOrgwYOLLEOSNMxSv1omQ9pq2MCq2lZV66tq/cqVK5e4DEn66bbYcH9ierqlLQ+09r3AmhnjVgP7Fl+eJGkxFhvuO4BNbX0TcMuM9svbq2Y2AIemp28kSZMz7+vck3wCeANwWpK9wJ8DVwE3JdkMfBu4tA2/DbgI2A08C1wxhpolSfOYN9yr6m1zdJ03ZGwBVx5tUZKko+PHD0hShwx3SeqQ4S5JHTLcJalDhrskdchwl6QOGe6S1CHDXZI6ZLhLUof8mj0dE/y6O2lhvHKXpA4Z7pLUIcNdkjpkuEtShwx3SeqQ4S5JHTLcJalDhrskdchwl6QOGe6S1CHDXZI6ZLhLUocMd0nqkOEuSR0y3CWpQ4a7JHXIL+vQgvilGdKxwSt3SerQWK7ck1wA/D1wHPDhqrpqHPtZbst1FbvnqouXZb+Sjh1LHu5JjgP+EfhtYC/wlSQ7quprS70v+OmcJvhp/DdLWphxTMucA+yuqseq6n+BTwIbx7AfSdIcxjEtswp4fMb2XuBXZw9KsgXY0ja/n+Qbi9zfacB3FnnfcbKuhbGuhXuh1mZdC5C
rj6qul8/VMY5wz5C2Oqyhahuw7ah3lkxV1fqj/TlLzboWxroW7oVam3UtzLjqGse0zF5gzYzt1cC+MexHkjSHcYT7V4B1SV6R5ATgMmDHGPYjSZrDkk/LVNVzSd4FfJ7BSyE/UlUPL/V+ZjjqqZ0xsa6Fsa6Fe6HWZl0LM5a6UnXYdLgk6RjnO1QlqUOGuyR16JgJ9yQXJPlGkt1Jtg7pf1GSG1v/3UnWTqCmNUnuTPJIkoeTvHvImDckOZTkvnb7wLjravvdk+TBts+pIf1Jcm07Xg8kOXsCNf3yjONwX5Jnkrxn1piJHa8kH0lyIMlDM9pOSbIzya62PHmO+25qY3Yl2TTmmv4mydfb43RzkpPmuO8RH/Mx1fbBJP814/G6aI77HvH3dwx13Tijpj1J7pvjvmM5ZnNlw0TPr6p6wd8YPDH7KPBK4ATgfuA1s8b8IfBPbf0y4MYJ1HUmcHZbfxnwzSF1vQH47DIcsz3AaUfovwj4HIP3JWwA7l6Gx/S/gZcv1/ECXg+cDTw0o+2vga1tfStw9ZD7nQI81pYnt/WTx1jT+cCKtn71sJpGeczHVNsHgT8e4bE+4u/vUtc1q/9vgQ9M8pjNlQ2TPL+OlSv3UT7SYCOwva1/CjgvybA3VC2ZqtpfVfe29e8BjzB4h+6xYCNwQw3cBZyU5MwJ7v884NGq+tYE9/k8VfUl4MlZzTPPo+3AJUPu+jvAzqp6sqqeAnYCF4yrpqr6QlU91zbvYvDekYmb43iNYqwfSXKkuloG/D7wiaXa34g1zZUNEzu/jpVwH/aRBrND9Cdj2i/CIeDUiVQHtGmg1wF3D+n+tST3J/lckl+ZUEkFfCHJPRl81MNsoxzTcbqMuX/hluN4TTujqvbD4BcUOH3ImOU8du9g8BfXMPM95uPyrjZl9JE5phmW83j9JvBEVe2ao3/sx2xWNkzs/DpWwn2UjzQY6WMPxiHJS4FPA++pqmdmdd/LYOrhtcA/AP82iZqAc6vqbOBC4Mokr5/Vv5zH6wTgLcC/DuleruO1EMty7JK8H3gO+PgcQ+Z7zMfhOuCXgLOA/QymQGZbtnMNeBtHvmof6zGbJxvmvNuQtgUfr2Ml3Ef5SIOfjEmyAjiRxf0JuSBJjmfw4H28qj4zu7+qnqmq77f124Djk5w27rqqal9bHgBuZvCn8UzL+TERFwL3VtUTszuW63jN8MT09FRbHhgyZuLHrj2p9mbg7dUmZmcb4TFfclX1RFX9qKp+DPzzHPtclnOt5cDvATfONWacx2yObJjY+XWshPsoH2mwA5h+VvmtwB1z/RIslTafdz3wSFV9aI4xvzA995/kHAbH/LtjruslSV42vc7gCbmHZg3bAVyegQ3Aoek/Fydgzqup5Thes8w8jzYBtwwZ83ng/CQnt2mI81vbWGTw5TfvA95SVc/OMWaUx3wctc18nuZ359jncn0kyZuAr1fV3mGd4zxmR8iGyZ1fS/0s8bhuDF7d8U0Gz7q/v7X9JYMTHuDFDP7M3w18GXjlBGr6DQZ/Lj0A3NduFwHvBN7ZxrwLeJjBKwTuAn59AnW9su3v/rbv6eM1s64w+FKVR4EHgfUTehx/jkFYnzijbVmOF4P/YPYDP2RwtbSZwfM0twO72vKUNnY9g28Vm77vO9q5thu4Ysw17WYwBzt9jk2/KuwXgduO9JhP4Hh9rJ0/DzAIrjNn19a2D/v9HWddrf2j0+fVjLETOWZHyIaJnV9+/IAkdehYmZaRJC2A4S5JHTLcJalDhrskdchwl6QOGe6S1CHDXZI69H/HTRiwVWTnOAAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.hist(X['educ']);" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "25 50.0\n", "1478 65.0\n", "Name: age, dtype: float64\n", "529 24.0\n", "1546 75.0\n", "Name: age, dtype: float64\n" ] } ], "source": [ "# we checked these to see if there were any patterns in the missingness. \n", "# Nothing really showed up.\n", "print(gssdata['age'][pd.isna(gssdata['marital'])])\n", "print(gssdata['age'][pd.isna(gssdata['educ'])])" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "outputs": [ { "data": { "text/plain": [ "age 0\n", "educ 0\n", "female 0\n", "marital_never married 0\n", "marital_no longer married 0\n", "race_other 0\n", "race_white 0\n", "sexornt_gay 0\n", "sexornt_hetero 0\n", "partyid_other 0\n", "partyid_rep 0\n", "dtype: int64" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Just to make sure missingness is gone\n", "X.isna().sum()" ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "## Part 2: Fit Basic Models\n", "\n", "In this section we ask you to:\n", "\n", "1. Split the data into a 70-30 train-test split (use the code provided...should have been done before EDA :( )\n", "2. Fit an unregularized logistic regression model to predict `poorhealth` from all predictors except income.\n", " \n", " 2b. If you have time: use 'LogisticRegressionCV' to find a well-tuned L2 regularized model.\n", " \n", " \n", "3. Fit $k$-NN classification models with $k=1,15,25$ to predict `poorhealth` from all predictors except income.\n", "4. Report classification accuracy on both the train and test sets for all models." 
] }, { "cell_type": "code", "execution_count": 44, "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "outputs": [], "source": [ "#######\n", "# Use the following train_test_split code to: \n", "# 1. Split the data into 70-30 train-test splits\n", "#######\n", "from sklearn.model_selection import train_test_split\n", "itrain, itest = train_test_split(range(gssdata.shape[0]), train_size=0.70)\n", "\n", "# Note: the train-test split above is for the INDICES for splitting in case we \n", "# want to use them again in the future...we can have an identical split\n", "X_train = X.loc[itrain]\n", "X_test = X.loc[itest]\n", "\n", "y_train = gssdata['poorhealth'][itrain]\n", "y_test = gssdata['poorhealth'][itest]" ] }, { "cell_type": "code", "execution_count": 108, "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "//anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.\n", " FutureWarning)\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "######\n", "# 2. Fit an unregularized logistic regression model to predict `poorhealth` \n", "# from all predictors except income.\n", "# 2b. If you have time: use 'LogisticRegressionCV' to find a well-tuned L2 regularized model.\n", "# 3. Fit $k$-NN classification models with k=1,15,25 to predict `poorhealth` \n", "# from all predictors except income.\n", "######\n", "\n", "from sklearn.neighbors import KNeighborsClassifier\n", "\n", "# (effectively) unregularized Logistic Regression:\n", "# a very large C makes the default L2 penalty negligible\n", "logit = LogisticRegression(C=100000)\n", "logit.fit(X_train,y_train)\n", "\n", "# k-NN for k=1, 15, and 25\n", "knn1 = KNeighborsClassifier(1)\n", "knn1.fit(X_train,y_train)\n", "\n", "knn15 = KNeighborsClassifier(15)\n", "knn15.fit(X_train,y_train)\n", "\n", "knn25 = KNeighborsClassifier(25)\n", "knn25.fit(X_train,y_train)\n", "\n", "# visualize the predicted probabilities of poor health via boxplots\n", "plt.boxplot([logit.predict_proba(X_train)[:,1],knn1.predict_proba(X_train)[:,1],\n", "             knn15.predict_proba(X_train)[:,1],knn25.predict_proba(X_train)[:,1]])\n", "# label each box on the x-axis (a legend does not map onto boxplot positions)\n", "plt.xticks([1, 2, 3, 4], [\"logistic\", \"knn1\", \"knn15\", \"knn25\"]);" ] }, { "cell_type": "code", "execution_count": 109, "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Classification accuracy for logistic were: \n", " Train = 0.9462659380692168 , Test = 0.9469214437367304\n", "Classification accuracy for knn1 were: \n", " Train = 1.0 , Test = 0.9193205944798302\n", "Classification accuracy for knn15 were: \n", " Train = 0.9462659380692168 , Test = 0.9469214437367304\n", "Classification accuracy for knn25 were: \n", " Train = 0.9462659380692168 , Test = 0.9469214437367304\n" ] } ], "source": [ "######\n", "# 4. 
Report classification accuracy on both train and test set for all models.\n", "######\n", "\n", "print(\"Classification accuracy for logistic were: \\n Train =\",\n", " logit.score(X_train,y_train),\", Test =\", logit.score(X_test,y_test))\n", "print(\"Classification accuracy for knn1 were: \\n Train =\",\n", " knn1.score(X_train,y_train),\", Test =\", knn1.score(X_test,y_test))\n", "print(\"Classification accuracy for knn15 were: \\n Train =\",\n", " knn15.score(X_train,y_train),\", Test =\", knn15.score(X_test,y_test))\n", "print(\"Classification accuracy for knn25 were: \\n Train =\",\n", " knn25.score(X_train,y_train),\", Test =\", knn25.score(X_test,y_test))\n", "\n", "# Note the severe overfitting of knn1 (perfect on train, worst on test), while the\n", "# other three have identical accuracy: the same accuracy we would get by\n", "# predicting that nobody is in poor health." ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "## Part 3: Evaluate Models via Confusion Matrices and ROC Curves\n", "\n", "In this part we ask that you:\n", "1. Plot the histograms of predicted probabilities for your favorite model from above\n", "2. Create the confusion matrices for (a) the default threshold for classification and (b) a well-chosen threshold for classification to balance errors more equally.\n", "3. Make ROC curves to evaluate a model's overall usability.\n", "4. Use the ROC curves to select a threshold to balance the two types of errors." 
] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "As a reminder of Confusion Matrices:\n", "- the samples that are +ive and the classifier predicts as +ive are called True Positives (TP)\n", "- the samples that are -ive and the classifier predicts (wrongly) as +ive are called False Positives (FP)\n", "- the samples that are -ive and the classifier predicts as -ive are called True Negatives (TN)\n", "- the samples that are +ive and the classifier predicts (wrongly) as -ive are called False Negatives (FN)\n", "\n", "A classifier produces a confusion matrix which looks like this:\n", "\n", "![confusionmatrix](confusionmatrix_360.png)\n", "\n", "\n", "IMPORTANT NOTE: In sklearn, to obtain the confusion matrix in the form above, always pass the observed `y` first, i.e., call `confusion_matrix(y_true, y_pred)`\n", "\n" ] }, { "cell_type": "code", "execution_count": 110, "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "outputs": [ { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAZYAAAELCAYAAAD6AKALAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjAsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+17YcXAAAT0UlEQVR4nO3de7BdVWHH8e/PQFAMhmqH4hjkIWir1YbyEOu0SYud8UGtDFI7UFsc7YwkqB0RiYP2YbUG0Wm1Cp0OpRkqaVV8TC1KFW0iDIhQjSBKB8UAZYRakZQAEiSrf+x9k8Ph3MM5N+s87r3fz8yZdc/ea6299uJyf9lnP05KKUiSVMsTJj0ASdLCYrBIkqoyWCRJVRkskqSqDBZJUlV7TXoAk5bkm8ChwHbgexMejiTNF4cDy4AflFKO7FyRxX65cZJ7geWTHockzVPbSin7dy5Y9EcsNEcqy5cvX87KlSsnPRZJmhe2bNnCtm3boPkb+igGS/Px1zNWrlzJpk2bJj0WSZoXVq9ezebNm6HHKQRP3kuSqjJYJElVGSySpKoMFklSVQaLJKkqg0WSVJXBIkmqyvtY9tAh6y6b2La3rn/FxLYtSbPxiEWSVJXBIkmqymCRJFVlsEiSqjJYJElVGSySpKoMFklSVQaLJKkqg0WSVJXBIkmqymCRJFVlsEiSqjJYJElVGSySpKoMFklSVQMFS5K9kxyf5INJvpbkh0l2JLkzyaVJVj9O+1OSXJlkW5LtSa5PsjZJ3+0neWmSLya5J8kDSb6d5Jwk+wyxj5KkMRr0iGUVcAXwVuBg4D+BzwD3ACcB/5Hk3b0aJvkocAlwNHAl8CXg2cBHgEuTLJml3duBLwC/BXwDuAw4AHgPsCnJvgOOXZI0RoMGy07gU8BvlFKeXko5oZTymlLK84HfBx4B3pXkNzsbJTkJWAPcBbygbXcicATwXeBE4IzujSU5GlgPPAC8uJTyklLKycBhwFeB44D3Dr+7kqRRGyhYSilfKaW8upRyZY91Hwc2tG//oGv1O9ry7FLKLR1t7gZOb9+u6/GR2DogwLmllGs72m0HXkcTdGuS7D/I+CVJ41Pr5P0323LFzIIkK4CjgB3AJ7sblFI2A3cCB9Icgcy0Wwq8rH17SY92twLXAEuBl9cZviSpllrBckRb/rBj2ZFteVMp5cFZ2l3XVRfgOcC+wD2llO8P0U6SNAX2OFiSHAic1r79VMeqQ9vytj7Nb++q2/nz7cyuVztJ0hTYa08aJ9kL+BiwHPhyKeVzHauXteX9fbrY3pb7VWjXOa7T2B12j2flgPUkSQPYo2AB/g44HriDx564T1uWIfuca7tOh9BcIi1JGrM5B0uSDwGvp7mU+PhSyl1dVe5ry2XMbmbdfR3L5tqu01Zgc5/2nVbSHHFJkiqYU7Ak+SDwZuBHNKFyS49qW9vy4D5dHdRVt/PnZw7ZbpdSygZ2XwLdV5JNeHQjSdUMffI+yftp7sD/MfDbpZTvzFJ15hLk5yV50ix1jumqC3Az8CDw1CTPmqXdsT3aSZKmwFDBkmQ9cBbwE5pQ+dZsdUspd9A8imUpcHKPvlbR3PdyF819KTPtdtA8ygXg1B7tDgNeRHN/zGXDjF+SNHoDfxSW5C+Bs4F7aUJlkKOF99HcHHlukqtLKd9r+zoAOL+ts76UsrOr3Xqax72cneTyUsrX23bLgItoAvH8Usq9g45/lC7c+7yxbu8ND5811u1J0jAGCpYkrwTe2b79HvCmJL2q3lxKWT/zppRyaZILaB7fcmOSK4CHaa4kewrwWZqHUT5KKeW6JOuAc4Grk3yFJtBW0TyI8lrgnIH2UJI0VoMesTy14+ej21cvm2mONnYppaxJchWwliYYltCcR7kIuKDH0cpMu/cnuQE4k+ZczBOBW4EPAx8opTw04NglSWM0ULAMc5XVLO03Ahvn0O5y4PK5bleSNH5+g6QkqSqDRZJUlcEiSar
KYJEkVWWwSJKqMlgkSVUZLJKkqgwWSVJVBoskqSqDRZJUlcEiSarKYJEkVWWwSJKqMlgkSVUZLJKkqgwWSVJVBoskqSqDRZJUlcEiSarKYJEkVWWwSJKqMlgkSVUZLJKkqgwWSVJVBoskqSqDRZJUlcEiSarKYJEkVWWwSJKqMlgkSVUZLJKkqgwWSVJVBoskqSqDRZJUlcEiSarKYJEkVWWwSJKqMlgkSVXtNekBaP45ZN1lE9nu1vWvmMh2JQ3HIxZJUlUGiySpKoNFklSVwSJJqspgkSRVZbBIkqoyWCRJVRkskqSqDBZJUlUGiySpKoNFklSVwSJJqspgkSRVZbBIkqoyWCRJVRkskqSqDBZJUlUGiySpKoNFklSVwSJJqspgkSRVZbBIkqoyWCRJVRkskqSqDBZJUlUGiySpKoNFklSVwSJJqspgkSRVZbBIkqoyWCRJVRkskqSqDBZJUlV7TXoAGt6Fe5/X/LDx4vFs8JSPj2c7khYEj1gkSVUZLJKkqgYOliTPSfKWJB9LcnOSnUlKklcP0PaUJFcm2ZZke5Lrk6xN0nf7SV6a5ItJ7knyQJJvJzknyT6DjluSNF7DnGM5HXjLsBtI8lFgDfBT4MvAw8DxwEeA45OcXEp5pEe7twPnAo8Am4CfAKuA9wAnJDm+lPLAsOORJI3WMB+FfRs4D3gNcDiw+fEaJDmJJlTuAl5QSjmhlHIicATwXeBE4Iwe7Y4G1gMPAC8upbyklHIycBjwVeA44L1DjF2SNCYDB0sp5cJSyttLKZ8opXx/wGbvaMuzSym3dPR1N80REMC6Hh+JrQMCnFtKubaj3XbgdcBOYE2S/QcdvyRpPEZ28j7JCuAoYAfwye71pZTNwJ3AgTRHIDPtlgIva99e0qPdrcA1wFLg5dUHLknaI6O8KuzItryplPLgLHWu66oL8BxgX+CePkdGvdpJkqbAKIPl0La8rU+d27vqdv58O7Pr1U6SNAVGeef9sra8v0+d7W25X4V2uyQ5DTit//B2WTlgPUnSAEYZLGnLMqZ2nQ6huTRZkjRmowyW+9pyWZ86M+vu61g213adtjLA5dCtlcDyAetKkh7HKINla1se3KfOQV11O39+5pDtdimlbAA29Gm/S5JNeHQjSdWM8uT9N9vyeUmeNEudY7rqAtwMPAg8NcmzZml3bI92kqQpMLJgKaXcAXyD5n6Tk7vXJ1kFrKC5K/+ajnY7gC+0b0/t0e4w4EU098dcVn3gkqQ9MuqnG7+vLc9NcvjMwiQHAOe3b9eXUnZ2tVtPc/L+7CTHdrRbBlxEM+7zSyn3jmzkkqQ5GfgcS5JfZXcYADy3Lf8qydtmFpZSjuv4+dIkF9A8vuXGJFew+yGUTwE+S/MwykcppVyXZB3NQyivTvIV4F6acyEHANcC5ww6dknS+Axz8v4pwAt7LD+iX6NSypokVwFraYJhCc15lIuAC3ocrcy0e3+SG4Azac7FPBG4Ffgw8IFSykNDjF2SNCYDB0spZRO77zEZSillI7BxDu0uBy6fyzYlSZPhN0hKkqoyWCRJVRkskqSqDBZJUlUGiySpKoNFklSVwSJJqspgkSRVZbBIkqoyWCRJVRkskqSqRvkNklooNr7mUW8v3PvukW3qDQ+fNbK+JY2HRyySpKoMFklSVQaLJKkqg0WSVJXBIkmqymCRJFVlsEiSqjJYJElVGSySpKoMFklSVQaLJKkqg0WSVJUPoZQGcMi6yyay3a3rXzGR7Up7wiMWSVJVBoskqSqDRZJUlcEiSarKYJEkVWWwSJKqMlgkSVUZLJKkqgwWSVJVBoskqSqDRZJUlcEiSarKYJEkVWWwSJKqMlgkSVUZLJKkqgwWSVJVBoskqSqDRZJUlcEiSapqr0kPQOp04d7nzb5y48X1N3jKx+v3KS1yHrFIkqoyWCRJVRkskqSqDBZJUlUGiySpKoNFklSVwSJJqspgkSRVZbBIkqoyWCRJVRkskqSqDBZJUlU
GiySpKoNFklSVwSJJqspgkSRV5Rd9SerpkHWXTWS7W9e/YiLbVT0esUiSqjJYJElVGSySpKo8x6LFbeNrBqp24d53j3ggs9h48dzbnvLxeuOQhuARiySpKoNFklSVwSJJqspgkSRVZbBIkqoyWCRJVRkskqSqvI9Fklo+H62OqT9iSXJKkiuTbEuyPcn1SdYmmfqxS9JiNNV/nJN8FLgEOBq4EvgS8GzgI8ClSZZMcHiSpB6mNliSnASsAe4CXlBKOaGUciJwBPBd4ETgjAkOUZLUwzSfY3lHW55dSrllZmEp5e4kpwObgHVJ/raUsnMSA5Sm2oDPQZvNsM9He8PDZ+3R9rRwTOURS5IVwFHADuCT3etLKZuBO4EDgePGOzpJUj/TesRyZFveVEp5cJY61wHPaOtePZZRSVqULtz7vNFuoPsp1vP8ydTTGiyHtuVtferc3lV3lySnAacNuK0XAWzZsoXVq1cP2GS3u279MWfmjqHb1fBzT146ke3+5P4dE9nupPYX3OdBLOfUKttd/a+T2+fls+zzmWMex3+te/7YtnXcYU+bU7stW7bM/Hh497ppDZZlbXl/nzrb23K/HusOAVYNs8Ft27axefPmYZrscsOcWknSbLaNbUub9/zfxcu6F0xrsKQtyxzbbwUGTYmjgCXAPcD3htzOSmA5zW/Blsepqz3jXI+X8z0+83WuD6cJlR90r5jWYLmvLR+ThB1m1t3XvaKUsgHYUHdIj5VkE82R0ZZSyupRb28xc67Hy/ken4U411N5VRjNEQfAwX3qHNRVV5I0BaY1WL7Zls9L8qRZ6hzTVVeSNAWmMlhKKXcA3wCWAid3r0+yClhBc1f+NeMdnSSpn6kMltb72vLcJLsuZ0tyAHB++3a9d91L0nSZ1pP3lFIuTXIBcDpwY5IrgIeB44GnAJ+leRilJGmKTG2wAJRS1iS5ClhLc9XEEuBm4CLgAo9WJGn6THWwAJRSNgIbJz0OSdJgpvkciyRpHjJYJElVTf1HYVNuA833wmyd6CgWhw041+O0Aed7XDawwOY6pcz1cVySJD2WH4VJkqoyWCRJVRksHZKckuTKJNuSbE9yfZK1SeY0T0lemuSLSe5J8kCSbyc5J8k+tcc+39Sa6yQHJTk9yT8kuSHJz5KUJG8b1djnoxrzneQJSX4tyXvavv47yY4kdyf5fJJXjXIf5ouKv9unJvmnJDcm+VGSh5P8JMlVSc5Isveo9mGPlVJ8NeeZPkrz/S8PAv8GfAb4v3bZp4ElQ/b39rbtz4ArgE8C/9MuuwbYd9L7vBDmGviTtl33622T3s9pedWab5rv35iZ3x8D/w78C/D1juX/SHvudjG+Kv9uXwU8AtwIfB74Z5rvmdrR8XfkyZPe555jn/QApuEFnNT+h/ohcETH8l8AvtOue8sQ/R0N7KT5BswXdixf1v5iFOCvJ73fC2Sufxf4G+C1wC8BFxsso5lv4FnAl4GXdv+BpHkyxva2v9dNer/n+1y37Y4F9u+xfAXw3ba/v5j0fvcc+6QHMA0v4Pr2P9If9li3quOX5QkD9ndp2+ZPe6w7rP1XyEO9fmkW+qv2XPfoY4PBMr757urvnW1/X570fi+CuX5t29/Vk97vXq9Ff44lyQqaryfeQfNx1aOUUjYDdwIHAscN0N9S4GXt20t69HcrzSHsUuDlcx74PFR7rtXfBOZ75ruRVlToa16ZwFz/rC1/WqGv6hZ9sABHtuVNpZQHZ6lzXVfdfp4D7AvcU0r5foX+FpLac63+xj3fR7TlDyv0Nd+Mba6T/DxwVvv2c3vS16h45z0c2pa39alze1fdQfq7vU+dYfpbSGrPtfob23wn2Rd4c/v2U3vS1zw1srlO8js052+WAE8HXgw8keZj36n86hCDpTmhDs2J9tlsb8v9JtDfQuLcjNc45/t8mj+Y3wH+fg/7mo9GOde/AvxR17IPAX9WSnl4yL7Gwo/CIG1Z69k2tftbSJyb8RrLfCd
5F80fvm3A75VSHhrl9qbUyOa6lPKeUkqAfYBn01wk8XrgW0meW3t7NRgscF9bLutTZ2bdfX3qjKq/hcS5Ga+Rz3eStwLvpvnX+MtKKTfNpZ8FYORzXUrZUUq5pZTyXuA04GDg4iTp33L8DJbdTxQ9uE+dg7rqDtLfMyv1t5Bsbctac63+trblSOY7yZuAD9LcDHhCKeWaYftYQLa25bh+tz9Nc+PlUcAhFfqrymDZfYnk85I8aZY6x3TV7edmmv/RnprkWbPUOXaI/haS2nOt/kY230nWAh+mudz1le3ltIvZWH+3S3Mzy4/btwfsaX+1LfpgKaXcAXyD5r6Sk7vXJ1lFc13+XTT3nzxefzuAL7RvT+3R32HAi2iud79szgOfh2rPtfob1XwneSPN1UgPAa8qpVxRZcDz2Lh/t5McSnOkshO4dU/7q27Sd2hOwwt4Nbvvij28Y/kBwE30eBQDcAbN0cnFPfo7ht2PdDm2Y/kymi/0WcyPdKk61z3634B33o9svoE/bn+3fwq8fNL7N02vmnMNPBd4I7Bfj+38Mrvv8r900vvdcy4mPYBpedFcLjnz8LjP0XyGua1d9hke+2ykP2/XbZqlv86HUH4R+ARwd7vsayzuh1BWm2ua6/q/1vH6UVv3tq7lT5/0fs/3+QZWtqFSaJ5VtWGW1wcmvc8LYK5Xt8vvB75K8wDKT9McFc38N7gWeNqk97nXy/tYWqWUNUmuAtbSPNdnCc2/JC4CLiil7Byyv/cnuQE4k+YI5ok0h6wfpvkfbzFekglUn+t9gBf2WP5MHn0BxaL9qoKK870/uy+r/cX21cttwKL82oKKc30TzWXFv04zz0fR3Hf4vzQftX8C+Fgp5ZG6e1CHX00sSapq0Z+8lyTVZbBIkqoyWCRJVRkskqSqDBZJUlUGiySpKoNFklSVwSJJqspgkSRVZbBIkqr6fxqBCqeMfQK1AAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "#####\n", "# 1. Plot the histograms of predicted probabilities on test for your favorite \n", "# model from above\n", "#####\n", "\n", "# We plot them for the logistic and knn15\n", "plt.hist(knn15.predict_proba(X_test)[:,1])\n", "plt.hist(logit.predict_proba(X_test)[:,1],alpha=0.7);\n", "\n", "# Note this illustrates the fact that neither model predicted proabilities above 0.5 \n", "# and thus all prediced classification were 0 by default.. It also shows that knn15 \n", "# predictions are in increments of 1/15, while the logistic has predicted probabilties\n", "# in a more 'continuous' like range of values." ] }, { "cell_type": "code", "execution_count": 111, "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "outputs": [ { "ename": "SyntaxError", "evalue": "invalid syntax (, line 23)", "output_type": "error", "traceback": [ "\u001b[0;36m File \u001b[0;32m\"\"\u001b[0;36m, line \u001b[0;32m23\u001b[0m\n\u001b[0;31m print(confusion_matrix(y_test,t_repredict(knn15),0.06,X_test)))\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m invalid syntax\n" ] } ], "source": [ "#####\n", "# 2. 
Create the confusion matrices for (a) the default threshold for classification and \n", "# (b) a well-chosen threshold for classification to balance errors more equally.\n", "#####\n", "\n", "from sklearn.metrics import confusion_matrix\n", "\n", "# this function may help to manually make a confusion table from a different threshold\n", "def t_repredict(est, t, xtest):\n", "    probs = est.predict_proba(xtest)\n", "    p1 = probs[:,1]  # predicted probability of class 1\n", "    ypred = (p1 > t)*1  # predict 1 whenever that probability exceeds t\n", "    return ypred\n", "\n", "# Using the knn15 model throughout:\n", "\n", "# Re-calculating the default confusion matrix\n", "print(confusion_matrix(y_test,t_repredict(knn15,0.5,X_test)))\n", "\n", "# And then looking at smaller threshold values: 0.32 and 0.06\n", "print(confusion_matrix(y_test,t_repredict(knn15,0.32,X_test)))\n", "print(confusion_matrix(y_test,t_repredict(knn15,0.06,X_test)))\n" ] }, { "cell_type": "code", "execution_count": 118, "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "#####\n", "# 3. Make ROC curves to evaluate a model's overall useability.\n", "#####\n", "\n", "from sklearn.metrics import roc_curve, auc\n", "\n", "# a function to make 'pretty' ROC curves for this model\n", "def make_roc(name, clf, ytest, xtest, ax=None, labe=5, proba=True, skip=0):\n", " initial=False\n", " if not ax:\n", " ax=plt.gca()\n", " initial=True\n", " if proba:#for stuff like logistic regression\n", " fpr, tpr, thresholds=roc_curve(ytest, clf.predict_proba(xtest)[:,1])\n", " else:#for stuff like SVM\n", " fpr, tpr, thresholds=roc_curve(ytest, clf.decision_function(xtest))\n", " roc_auc = auc(fpr, tpr)\n", " if skip:\n", " l=fpr.shape[0]\n", " ax.plot(fpr[0:l:skip], tpr[0:l:skip], '.-', alpha=0.3, label='ROC curve for %s (area = %0.2f)' % (name, roc_auc))\n", " else:\n", " ax.plot(fpr, tpr, '.-', alpha=0.3, label='ROC curve for %s (area = %0.2f)' % (name, roc_auc))\n", " label_kwargs = {}\n", " label_kwargs['bbox'] = dict(\n", " boxstyle='round,pad=0.3', alpha=0.2,\n", " )\n", " if labe!=None:\n", " for k in range(0, fpr.shape[0],labe):\n", " #from https://gist.github.com/podshumok/c1d1c9394335d86255b8\n", " threshold = str(np.round(thresholds[k], 2))\n", " ax.annotate(threshold, (fpr[k], tpr[k]), **label_kwargs)\n", " if initial:\n", " ax.plot([0, 1], [0, 1], 'k--')\n", " ax.set_xlim([0.0, 1.0])\n", " ax.set_ylim([0.0, 1.05])\n", " ax.set_xlabel('False Positive Rate')\n", " ax.set_ylabel('True Positive Rate')\n", " ax.set_title('ROC')\n", " ax.legend(loc=\"lower right\")\n", " return ax\n", "\n", "\n", "sns.set_context(\"poster\")\n", "make_roc(\"Logistic\", logit, y_test, X_test, ax=None, labe=20, proba=True, skip=1);\n", " " ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "**Question**\n", "4. 
Use the ROC curves to select a threshold to balance the two types of errors.\n", "\n", "**Answer** Based on the logistic model's ROC curve, a threshold of somewhere around 0.04 gives a high true positive rate (80\\% or so) before taking on \"too high\" a false positive rate (just a tad over 50\\%).\n" ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "## Part 4: Imputation\n", "\n", "In this part we ask that you explore the effects of imputation:\n", "1. Plot the histogram of `income`.\n", "2. Create a new variable `income_imp` that imputes the median or mean income for all the missing values and plot the histogram for this new variable.\n", "3. Compare the histograms above.\n", "\n", "\n", "4. Update your `poorhealth` prediction model(s) by incorporating `income_imp`. \n", "5. Compare the accuracy of this new model.\n", "\n", "\n", "And if there is time:\n", " \n", "6. Create a new variable `income_imp2` that imputes the value via a model.\n", "7. Update your `poorhealth` prediction model(s) by incorporating `income_imp2`. \n", "8. Compare the accuracy of this newest model." ] }, { "cell_type": "code", "execution_count": 96, "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "#####\n", "# 1. Plot the histogram of `income`.\n", "# 2. Create a new variable `income_imp` that imputes the median or \n", "# mean income for all the missing values and plot the histogram for this new variable.\n", "#####\n", "\n", "\n", "\n", "# First create 'income_imp', and then add it into the predictor set\n", "income_imp = gssdata['income'].fillna(gssdata['income'].median())\n", "X['income_imp'] = income_imp\n", "\n", "# plot the original and the version with imputations\n", "plt.hist(gssdata['income'])\n", "plt.hist(X,alpha=0.5);\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "**Question:**\n", "3. Compare the histograms above.\n", "\n", "**Solution:** There is now a spike at the median compared to what was there before. They distributions are not all that similar in shape or spread (but center is very similar)." ] }, { "cell_type": "code", "execution_count": 173, "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[316 130]\n", " [ 14 11]]\n", "[[316 130]\n", " [ 14 11]]\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "//anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.\n", " FutureWarning)\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>age</th><th>educ</th><th>female</th><th>marital_never married</th><th>marital_no longer married</th><th>race_other</th><th>race_white</th><th>sexornt_gay</th><th>sexornt_hetero</th><th>partyid_other</th><th>partyid_rep</th><th>income_imp</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>69</th><td>42.0</td><td>14.0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>1</td><td>1</td><td>0</td><td>37.50</td></tr>\n",
"    <tr><th>1082</th><td>32.0</td><td>15.0</td><td>1</td><td>1</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>21.25</td></tr>\n",
"    <tr><th>1143</th><td>22.0</td><td>14.0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>37.50</td></tr>\n",
"    <tr><th>871</th><td>24.0</td><td>12.0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td><td>1</td><td>67.50</td></tr>\n",
"    <tr><th>380</th><td>40.0</td><td>12.0</td><td>1</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>3.50</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " age educ female marital_never married marital_no longer married \\\n", "69 42.0 14.0 1 0 0 \n", "1082 32.0 15.0 1 1 0 \n", "1143 22.0 14.0 0 1 0 \n", "871 24.0 12.0 0 0 0 \n", "380 40.0 12.0 1 0 1 \n", "\n", " race_other race_white sexornt_gay sexornt_hetero partyid_other \\\n", "69 0 1 0 1 1 \n", "1082 0 1 0 0 0 \n", "1143 0 1 0 0 0 \n", "871 0 1 0 1 0 \n", "380 0 0 0 0 0 \n", "\n", " partyid_rep income_imp \n", "69 0 37.50 \n", "1082 0 21.25 \n", "1143 0 37.50 \n", "871 1 67.50 \n", "380 0 3.50 " ] }, "execution_count": 173, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#####\n", "# 4. Update your `poorhealth` prediction model(s) by incorporating `income_imp`. \n", "# 5. Calculate and compare the accuracy of this new model.\n", "#####\n", "\n", "# re-use indices for splitting since now we added the imputed 'income' variable\n", "# Note: the response is unaffected so does not need to be redefined.\n", "X_train = X.loc[itrain]\n", "X_test = X.loc[itest]\n", "\n", "logit_imp1 = sk.linear_model.LogisticRegression(C=100000)\n", "logit_imp1.fit(X_train,y_train)\n", "\n", "knn15_imp1 = KNeighborsClassifier(15)\n", "knn15_imp1.fit(X_train,y_train)\n", "\n", "print(confusion_matrix(y_test,t_repredict(logit,0.07,X_test)))\n", "print(confusion_matrix(y_test,t_repredict(logit_imp1,0.07,X_test)))\n", "\n", "X_train.head()" ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "**Question:**\n", "5. Compare the accuracies.\n", "\n", "**Answer:** Nothing has improved. The accuracies are identical for both the logistic and knn models (looking at varios different thresholds." 
] }, { "cell_type": "code", "execution_count": 170, "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[330 116]\n", " [ 15 10]]\n", "[[322 124]\n", " [ 15 10]]\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "//anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.\n", " FutureWarning)\n" ] } ], "source": [ "#####\n", "# And if there is time:\n", "# 6. Create a new variable `income_imp2` that imputes the value via a model.\n", "# 7. Update your `poorhealth` prediction model(s) by incorporating `income_imp2`. \n", "# 8. Calculate and compare the accuracy of this newest model.\n", "#####\n", "\n", "# the y_imp is observed incomes in train for building an imputation model to \n", "# impute values on training and test, and the X_imp is everything else in train \n", "# (we could use the true response health or poorhealth, but we decided not to here).\n", "\n", "income_train = gssdata['income'][itrain]\n", "\n", "# all the missing observations' predictors\n", "miss_index = gssdata['income'][gssdata['income'].isna()].index\n", "X_miss = X.loc[gssdata['income'].isna(),:]\n", "X_miss = X_miss.drop('income_imp',axis=1)\n", "\n", "# all the available observed incomes (and predictors) within train to be used to \n", "# build a model to predict income\n", "y_imp = income_train.dropna()\n", "X_imp = X_train.drop('income_imp',axis=1).loc[itrain]\n", "X_imp = X_imp.loc[income_train.isna()==False,:]\n", "\n", "# fit the model\n", "lm = sk.linear_model.LinearRegression()\n", "lm.fit(X_imp,y_imp)\n", "\n", "# do the predictions without noise, and turn it into a series for imputation\n", "y_miss = lm.predict(X_miss)\n", "y_miss_series = pd.Series(data = y_miss, index = miss_index)\n", "\n", "# create the imputed income variable for 
all observations\n", "income_imp2 = gssdata['income'].fillna(y_miss_series)\n", "\n", "# overwrite income_imp in the train and test sets with the model-based imputation\n", "X_train['income_imp'] = income_imp2[itrain]\n", "X_test['income_imp'] = income_imp2[itest]\n", "\n", "# go back to the primary classification modeling...\n", "logit_imp2 = sk.linear_model.LogisticRegression(C=100000)\n", "logit_imp2.fit(X_train,y_train)\n", "\n", "knn15_imp2 = KNeighborsClassifier(15)\n", "knn15_imp2.fit(X_train,y_train)\n", "\n", "# quick peek at predictions to see if anything has changed \n", "print(confusion_matrix(y_test,t_repredict(logit,0.07,X_test)))\n", "print(confusion_matrix(y_test,t_repredict(logit_imp2,0.07,X_test)))\n" ] }, { "cell_type": "code", "execution_count": 155, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>age</th><th>educ</th><th>female</th><th>marital_never married</th><th>marital_no longer married</th><th>race_other</th><th>race_white</th><th>sexornt_gay</th><th>sexornt_hetero</th><th>partyid_other</th><th>partyid_rep</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>1082</th><td>32.0</td><td>15.0</td><td>1</td><td>1</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>871</th><td>24.0</td><td>12.0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td><td>1</td></tr>\n",
"    <tr><th>380</th><td>40.0</td><td>12.0</td><td>1</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>648</th><td>43.0</td><td>16.0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>1264</th><td>61.0</td><td>10.0</td><td>1</td><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td><td>1</td><td>1</td><td>0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " age educ female marital_never married marital_no longer married \\\n", "1082 32.0 15.0 1 1 0 \n", "871 24.0 12.0 0 0 0 \n", "380 40.0 12.0 1 0 1 \n", "648 43.0 16.0 1 0 0 \n", "1264 61.0 10.0 1 0 1 \n", "\n", " race_other race_white sexornt_gay sexornt_hetero partyid_other \\\n", "1082 0 1 0 0 0 \n", "871 0 1 0 1 0 \n", "380 0 0 0 0 0 \n", "648 0 1 0 0 0 \n", "1264 0 1 0 1 1 \n", "\n", " partyid_rep \n", "1082 0 \n", "871 1 \n", "380 0 \n", "648 0 \n", "1264 0 " ] }, "execution_count": 155, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_imp.head()" ] }, { "cell_type": "code", "execution_count": 171, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[319 127]\n", " [ 15 10]]\n", "[[324 122]\n", " [ 14 11]]\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "//anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. 
Specify a solver to silence this warning.\n", " FutureWarning)\n" ] } ], "source": [ "######\n", "# Now to do imputation with uncertainty:\n", "# we can use the same imputation model from the previous part \n", "######\n", "\n", "# first compute the standard y-hats just like before\n", "y_miss = lm.predict(X_miss)\n", "\n", "# we need to estimate the residual variance (MSE), sigma2_hat, from the observed incomes \n", "# that were used to train the model\n", "y_hat = lm.predict(X_imp)\n", "sigma2_hat = sk.metrics.mean_squared_error(y_imp,y_hat)\n", "\n", "# sample one residual per missing observation from the assumed normal distribution\n", "e_miss = np.random.normal(loc=0,scale=np.sqrt(sigma2_hat),size=y_miss.shape[0])\n", "\n", "# create the income measurement with uncertainty to be imputed\n", "y_miss_series = pd.Series(data = y_miss+e_miss, index = miss_index)\n", "\n", "# impute them into where they belong\n", "income_imp3 = gssdata['income'].fillna(y_miss_series)\n", "\n", "# overwrite income_imp in the train and test sets with the noisy imputation\n", "X_train['income_imp'] = income_imp3[itrain]\n", "X_test['income_imp'] = income_imp3[itest]\n", "\n", "# go back to the primary classification modeling...\n", "logit_imp3 = sk.linear_model.LogisticRegression(C=100000)\n", "logit_imp3.fit(X_train,y_train)\n", "\n", "knn15_imp3 = KNeighborsClassifier(15)\n", "knn15_imp3.fit(X_train,y_train)\n", "\n", "print(confusion_matrix(y_test,t_repredict(logit,0.07,X_test)))\n", "print(confusion_matrix(y_test,t_repredict(logit_imp3,0.07,X_test)))\n" ] }, { "cell_type": "code", "execution_count": 169, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "AUC for logistic model when income was dropped: \n", " 0.6481614349775785\n", "AUC for logistic model when imputations were done with median imputation: \n", " 0.6481614349775785\n", "AUC for logistic model when imputation were done via linear regression: \n", " 0.6476233183856502\n", "AUC for logistic model when imputation were done 
via linear regression with uncertainty: \n", " 0.6590134529147982\n" ] } ], "source": [ "# Let's use AUC for evaluations \n", "\n", "fpr, tpr, thresholds=roc_curve(y_test, logit.predict_proba(X_test)[:,1])\n", "print(\"AUC for logistic model when income was dropped: \\n\",\n", " auc(fpr, tpr))\n", "\n", "fpr, tpr, thresholds=roc_curve(y_test, logit_imp1.predict_proba(X_test)[:,1])\n", "print(\"AUC for logistic model when imputations were done with median imputation: \\n\",\n", " auc(fpr, tpr))\n", "\n", "fpr, tpr, thresholds=roc_curve(y_test, logit_imp2.predict_proba(X_test)[:,1])\n", "print(\"AUC for logistic model when imputation were done via linear regression: \\n\",\n", " auc(fpr, tpr))\n", "fpr, tpr, thresholds=roc_curve(y_test, logit_imp3.predict_proba(X_test)[:,1])\n", "print(\"AUC for logistic model when imputation were done via linear regression with uncertainty: \\n\",\n", " auc(fpr, tpr))" ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "**Question:**\n", "8. Compare the accuracies.\n", "\n", "**Answer:** Things have improved! Using the uncertainty in the imputations has slightly improved the AUC for the classification model. But this is ONLY slightly!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 1 }