{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "button": false,
    "hide": true,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "source": [
    "# <img style=\"float: left; padding-right: 10px; width: 45px\" src=\"https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/iacs.png\"> CS-109A Introduction to Data Science\n",
    "\n",
    "\n",
    "## Lab 7: $k$-NN Classification and Imputation\n",
    "\n",
    "**Harvard University**<br>\n",
    "**Fall 2019**<br>\n",
    "**Instructors:** Pavlos Protopapas, Kevin Rader, Chris Tanner<br>\n",
    "**Lab Instructors:** Chris Tanner and Eleni Kaxiras.  <br>\n",
    "**Contributors:** Kevin Rader\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>\n",
       "blockquote { background: #AEDE94; }\n",
       "h1 { \n",
       "    padding-top: 25px;\n",
       "    padding-bottom: 25px;\n",
       "    text-align: left; \n",
       "    padding-left: 10px;\n",
       "    background-color: #DDDDDD; \n",
       "    color: black;\n",
       "}\n",
       "h2 { \n",
       "    padding-top: 10px;\n",
       "    padding-bottom: 10px;\n",
       "    text-align: left; \n",
       "    padding-left: 5px;\n",
       "    background-color: #EEEEEE; \n",
       "    color: black;\n",
       "}\n",
       "\n",
       "div.exercise {\n",
       "\tbackground-color: #ffcccc;\n",
       "\tborder-color: #E9967A; \t\n",
       "\tborder-left: 5px solid #800080; \n",
       "\tpadding: 0.5em;\n",
       "}\n",
       "\n",
       "span.sub-q {\n",
       "\tfont-weight: bold;\n",
       "}\n",
       "div.theme {\n",
       "\tbackground-color: #DDDDDD;\n",
       "\tborder-color: #E9967A; \t\n",
       "\tborder-left: 5px solid #800080; \n",
       "\tpadding: 0.5em;\n",
       "\tfont-size: 18pt;\n",
       "}\n",
       "div.gc { \n",
       "\tbackground-color: #AEDE94;\n",
       "\tborder-color: #E9967A; \t \n",
       "\tborder-left: 5px solid #800080; \n",
       "\tpadding: 0.5em;\n",
       "\tfont-size: 12pt;\n",
       "}\n",
       "p.q1 { \n",
       "    padding-top: 5px;\n",
       "    padding-bottom: 5px;\n",
       "    text-align: left; \n",
       "    padding-left: 5px;\n",
       "    background-color: #EEEEEE; \n",
       "    color: black;\n",
       "}\n",
       "header {\n",
       "   padding-top: 35px;\n",
       "    padding-bottom: 35px;\n",
       "    text-align: left; \n",
       "    padding-left: 10px;\n",
       "    background-color: #DDDDDD; \n",
       "    color: black;\n",
       "}\n",
       "</style>\n",
       "\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES\n",
    "import requests\n",
    "from IPython.core.display import HTML\n",
    "styles = requests.get(\"https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css\").text\n",
    "HTML(styles)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "source": [
    "## Learning Goals \n",
    "In this lab, we'll explore classification models to predict the health status of survey respondents and be able to build a classification decision boundary to predict the resultsing unbalanced classes.\n",
    "\n",
    "By the end of this lab, you should:\n",
    "- Be familiar with the `sklearn` implementations of\n",
    " - $k$-NN Regression\n",
    " - ROC curves and classification metrics\n",
    "- Be able to optimize some loss function based on misclassification rates\n",
    "- Be able to impute for missing values\n",
    "- Be comfortable in the different approaches in handling missingness"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "button": false,
    "hide": true,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib\n",
    "import matplotlib.pyplot as plt\n",
    "import statsmodels.api as sm\n",
    "from statsmodels.api import OLS\n",
    "import sklearn as sk\n",
    "from sklearn.decomposition import PCA\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.linear_model import LogisticRegressionCV\n",
    "from sklearn.utils import resample\n",
    "from sklearn.model_selection import cross_val_score\n",
    "from sklearn.metrics import accuracy_score\n",
    "# %matplotlib inline\n",
    "import seaborn.apionly as sns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "source": [
    "## Part 1:  General Social Survey Data + EDA\n",
    "\n",
    "The dataset contains a subset of data from the General Social Survey (GSS) that is a bi-annual survey of roughly 2000 Americans.  We will be using a small subset of the approx 4000 questions they ask.  Specifically we'll use:  \n",
    "\n",
    "- **id:** respondant's unique ID\n",
    "- **health:** self-reported health level with 4 categories: poor, fair, good, excellent\n",
    "- **partyid:** political party affiliation with categories dem, rep, or other\n",
    "- **age:** age in years\n",
    "- **sex:** male or female\n",
    "- **sexornt:** sexual orientation with categories hetero, gay, or bisexual/other\n",
    "- **educ:** number of years of formal education (capped at 20 years)\n",
    "- **marital:** marital status with categories married, never married, and no longer married\n",
    "- **race:** with categories black, white, and other\n",
    "- **income:** in thousands of dollars\n",
    "\n",
    "Our goal is to predict whether or not someone is in **poor health** based on the other measures.\n",
    "\n",
    "For this task, we will exercise our normal data science pipeline -- from EDA to modeling and visualization. In particular, we will show the performance of 2 classifiers:\n",
    "\n",
    "- Logistic Regression\n",
    "- $k$-NN Regression\n",
    "\n",
    "So without further ado..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "source": [
    "### EDA"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "source": [
    "Do the following basic EDA (always good ideas):\n",
    "1. Determine the dimensions of the data set.\n",
    "2. Get a glimpse of the data set.\n",
    "3. Calculate basic summary/descriptive statistics of the variables.\n",
    "\n",
    "We also ask that you do the following:\n",
    "4. Create a binary called `poorhealth`.  \n",
    "5. Explore the distribution of the responses, `health` and `poorhealth`, \n",
    "6. Explore what variables may be related to whether or not some is of poor health.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>health</th>\n",
       "      <th>partyid</th>\n",
       "      <th>age</th>\n",
       "      <th>sex</th>\n",
       "      <th>sexornt</th>\n",
       "      <th>educ</th>\n",
       "      <th>marital</th>\n",
       "      <th>race</th>\n",
       "      <th>income</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>good</td>\n",
       "      <td>rep</td>\n",
       "      <td>43.0</td>\n",
       "      <td>male</td>\n",
       "      <td>bisexual/other</td>\n",
       "      <td>14.0</td>\n",
       "      <td>never married</td>\n",
       "      <td>white</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>excellent</td>\n",
       "      <td>dem</td>\n",
       "      <td>74.0</td>\n",
       "      <td>female</td>\n",
       "      <td>hetero</td>\n",
       "      <td>10.0</td>\n",
       "      <td>no longer married</td>\n",
       "      <td>white</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>5</td>\n",
       "      <td>excellent</td>\n",
       "      <td>rep</td>\n",
       "      <td>71.0</td>\n",
       "      <td>male</td>\n",
       "      <td>hetero</td>\n",
       "      <td>18.0</td>\n",
       "      <td>no longer married</td>\n",
       "      <td>black</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>6</td>\n",
       "      <td>good</td>\n",
       "      <td>dem</td>\n",
       "      <td>67.0</td>\n",
       "      <td>female</td>\n",
       "      <td>bisexual/other</td>\n",
       "      <td>16.0</td>\n",
       "      <td>no longer married</td>\n",
       "      <td>white</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>7</td>\n",
       "      <td>good</td>\n",
       "      <td>dem</td>\n",
       "      <td>59.0</td>\n",
       "      <td>female</td>\n",
       "      <td>bisexual/other</td>\n",
       "      <td>13.0</td>\n",
       "      <td>no longer married</td>\n",
       "      <td>black</td>\n",
       "      <td>18.75</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   id     health partyid   age     sex         sexornt  educ  \\\n",
       "0   1       good     rep  43.0    male  bisexual/other  14.0   \n",
       "1   2  excellent     dem  74.0  female          hetero  10.0   \n",
       "2   5  excellent     rep  71.0    male          hetero  18.0   \n",
       "3   6       good     dem  67.0  female  bisexual/other  16.0   \n",
       "4   7       good     dem  59.0  female  bisexual/other  13.0   \n",
       "\n",
       "             marital   race  income  \n",
       "0      never married  white     NaN  \n",
       "1  no longer married  white     NaN  \n",
       "2  no longer married  black     NaN  \n",
       "3  no longer married  white     NaN  \n",
       "4  no longer married  black   18.75  "
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gssdata = pd.read_csv(\"../data/gsshealth18.csv\")\n",
    "\n",
    "#####\n",
    "# You code here: EDA\n",
    "# 1. Determine the dimensions of the data set.\n",
    "# 2. Get a glimpse of the data set.\n",
    "# 3. Calculate basic summary/descriptive statistics of the variables.\n",
    "####\n",
    "gssdata.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "id           0\n",
       "health       0\n",
       "partyid      0\n",
       "age          2\n",
       "sex          0\n",
       "sexornt      0\n",
       "educ         2\n",
       "marital      2\n",
       "race         0\n",
       "income     661\n",
       "dtype: int64"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gssdata.isna().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "I_test = gssdata[gssdata['marital'].isna()]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_test = I_test['marital']\n",
    "I_test = I_test.drop(columns=['marital'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>health</th>\n",
       "      <th>partyid</th>\n",
       "      <th>age</th>\n",
       "      <th>sex</th>\n",
       "      <th>sexornt</th>\n",
       "      <th>educ</th>\n",
       "      <th>race</th>\n",
       "      <th>income</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>25</td>\n",
       "      <td>40</td>\n",
       "      <td>good</td>\n",
       "      <td>other</td>\n",
       "      <td>50.0</td>\n",
       "      <td>female</td>\n",
       "      <td>bisexual/other</td>\n",
       "      <td>12.0</td>\n",
       "      <td>black</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1478</td>\n",
       "      <td>2222</td>\n",
       "      <td>fair</td>\n",
       "      <td>other</td>\n",
       "      <td>65.0</td>\n",
       "      <td>male</td>\n",
       "      <td>bisexual/other</td>\n",
       "      <td>19.0</td>\n",
       "      <td>white</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        id health partyid   age     sex         sexornt  educ   race  income\n",
       "25      40   good   other  50.0  female  bisexual/other  12.0  black     NaN\n",
       "1478  2222   fair   other  65.0    male  bisexual/other  19.0  white     NaN"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "I_test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "25      NaN\n",
       "1478    NaN\n",
       "Name: marital, dtype: object"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y_test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>health</th>\n",
       "      <th>partyid</th>\n",
       "      <th>age</th>\n",
       "      <th>sex</th>\n",
       "      <th>sexornt</th>\n",
       "      <th>educ</th>\n",
       "      <th>marital</th>\n",
       "      <th>race</th>\n",
       "      <th>income</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>good</td>\n",
       "      <td>rep</td>\n",
       "      <td>43.0</td>\n",
       "      <td>male</td>\n",
       "      <td>bisexual/other</td>\n",
       "      <td>14.0</td>\n",
       "      <td>never married</td>\n",
       "      <td>white</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>excellent</td>\n",
       "      <td>dem</td>\n",
       "      <td>74.0</td>\n",
       "      <td>female</td>\n",
       "      <td>hetero</td>\n",
       "      <td>10.0</td>\n",
       "      <td>no longer married</td>\n",
       "      <td>white</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>5</td>\n",
       "      <td>excellent</td>\n",
       "      <td>rep</td>\n",
       "      <td>71.0</td>\n",
       "      <td>male</td>\n",
       "      <td>hetero</td>\n",
       "      <td>18.0</td>\n",
       "      <td>no longer married</td>\n",
       "      <td>black</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>6</td>\n",
       "      <td>good</td>\n",
       "      <td>dem</td>\n",
       "      <td>67.0</td>\n",
       "      <td>female</td>\n",
       "      <td>bisexual/other</td>\n",
       "      <td>16.0</td>\n",
       "      <td>no longer married</td>\n",
       "      <td>white</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>7</td>\n",
       "      <td>good</td>\n",
       "      <td>dem</td>\n",
       "      <td>59.0</td>\n",
       "      <td>female</td>\n",
       "      <td>bisexual/other</td>\n",
       "      <td>13.0</td>\n",
       "      <td>no longer married</td>\n",
       "      <td>black</td>\n",
       "      <td>18.75</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   id     health partyid   age     sex         sexornt  educ  \\\n",
       "0   1       good     rep  43.0    male  bisexual/other  14.0   \n",
       "1   2  excellent     dem  74.0  female          hetero  10.0   \n",
       "2   5  excellent     rep  71.0    male          hetero  18.0   \n",
       "3   6       good     dem  67.0  female  bisexual/other  16.0   \n",
       "4   7       good     dem  59.0  female  bisexual/other  13.0   \n",
       "\n",
       "             marital   race  income  \n",
       "0      never married  white     NaN  \n",
       "1  no longer married  white     NaN  \n",
       "2  no longer married  black     NaN  \n",
       "3  no longer married  white     NaN  \n",
       "4  no longer married  black   18.75  "
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "I_train = gssdata.dropna(subset=['marital'])\n",
    "I_train.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0        never married\n",
       "1    no longer married\n",
       "2    no longer married\n",
       "3    no longer married\n",
       "4    no longer married\n",
       "5        never married\n",
       "6    no longer married\n",
       "7    no longer married\n",
       "8              married\n",
       "9    no longer married\n",
       "Name: marital, dtype: object"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y_train = I_train['marital']\n",
    "y_train.head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "id           0\n",
       "health       0\n",
       "partyid      0\n",
       "age          2\n",
       "sex          0\n",
       "sexornt      0\n",
       "educ         2\n",
       "marital      0\n",
       "race         0\n",
       "income     659\n",
       "dtype: int64"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "I_train.isna().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/eleni/anaconda2/envs/bunny/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.\n",
      "  FutureWarning)\n"
     ]
    },
    {
     "ename": "ValueError",
     "evalue": "could not convert string to float: 'white'",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-43-a6425462e53d>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0mlogit\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mLogisticRegression\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mC\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1000000\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mlogit\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mI_train\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my_train\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m      3\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mlogit\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mscore\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mI_test\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0my_test\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0mlogit\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcoef_\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/anaconda2/envs/bunny/lib/python3.6/site-packages/sklearn/linear_model/logistic.py\u001b[0m in \u001b[0;36mfit\u001b[0;34m(self, X, y, sample_weight)\u001b[0m\n\u001b[1;32m   1530\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1531\u001b[0m         X, y = check_X_y(X, y, accept_sparse='csr', dtype=_dtype, order=\"C\",\n\u001b[0;32m-> 1532\u001b[0;31m                          accept_large_sparse=solver != 'liblinear')\n\u001b[0m\u001b[1;32m   1533\u001b[0m         \u001b[0mcheck_classification_targets\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1534\u001b[0m         \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mclasses_\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0munique\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/anaconda2/envs/bunny/lib/python3.6/site-packages/sklearn/utils/validation.py\u001b[0m in \u001b[0;36mcheck_X_y\u001b[0;34m(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)\u001b[0m\n\u001b[1;32m    717\u001b[0m                     \u001b[0mensure_min_features\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mensure_min_features\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    718\u001b[0m                     \u001b[0mwarn_on_dtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mwarn_on_dtype\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 719\u001b[0;31m                     estimator=estimator)\n\u001b[0m\u001b[1;32m    720\u001b[0m     \u001b[0;32mif\u001b[0m \u001b[0mmulti_output\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    721\u001b[0m         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,\n",
      "\u001b[0;32m~/anaconda2/envs/bunny/lib/python3.6/site-packages/sklearn/utils/validation.py\u001b[0m in \u001b[0;36mcheck_array\u001b[0;34m(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)\u001b[0m\n\u001b[1;32m    494\u001b[0m             \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    495\u001b[0m                 \u001b[0mwarnings\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msimplefilter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'error'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mComplexWarning\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 496\u001b[0;31m                 \u001b[0marray\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0masarray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0morder\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0morder\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    497\u001b[0m             \u001b[0;32mexcept\u001b[0m \u001b[0mComplexWarning\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    498\u001b[0m                 raise ValueError(\"Complex data not supported\\n\"\n",
      "\u001b[0;32m~/anaconda2/envs/bunny/lib/python3.6/site-packages/numpy/core/numeric.py\u001b[0m in \u001b[0;36masarray\u001b[0;34m(a, dtype, order)\u001b[0m\n\u001b[1;32m    536\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    537\u001b[0m     \"\"\"\n\u001b[0;32m--> 538\u001b[0;31m     \u001b[0;32mreturn\u001b[0m \u001b[0marray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcopy\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0morder\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0morder\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    539\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    540\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mValueError\u001b[0m: could not convert string to float: 'white'"
     ]
    }
   ],
   "source": [
    "logit = LogisticRegression(C=1000000)\n",
    "logit.fit(I_train, y_train) \n",
    "print(logit.score(I_test,y_test))\n",
    "logit.coef_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>sex</th>\n",
       "      <th>female</th>\n",
       "      <th>male</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>health</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>excellent</td>\n",
       "      <td>198</td>\n",
       "      <td>161</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>fair</td>\n",
       "      <td>188</td>\n",
       "      <td>167</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>good</td>\n",
       "      <td>442</td>\n",
       "      <td>329</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>poor</td>\n",
       "      <td>44</td>\n",
       "      <td>40</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "sex        female  male\n",
       "health                 \n",
       "excellent     198   161\n",
       "fair          188   167\n",
       "good          442   329\n",
       "poor           44    40"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.crosstab(gssdata['health'], gssdata['sex'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "\n",
    "#####\n",
    "# You code here: EDA\n",
    "# 4. Create a binary called `poorhealth`.  \n",
    "# 5. Explore the distribution of the responses, `health` and `poorhealth`, \n",
    "# 6. Explore what variables may be related to whether or not some is of poor health.\n",
    "####\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "source": [
    "**Question**: What classification accuracy could you achieve if you simply predicted `poorhealth` without a model?  What classification accuracy would you get if you were to predict the multi-class `health` variable? Is accuracy the correct metric?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "source": [
    "*your answer here*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "source": [
    "### Data Cleaning - Basic Handling of Missingness"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "source": [
    "Let's begin by fitting an unregularized logistic regression model to predict poor health based on all the other predictors in the model and three $k$-NN models with $k=5,10,20$.\n",
    "\n",
    "First we need to do a small amount of data clean-up.  \n",
    "1. Determine the amount of missingness in each variable.  If there is *a lot*, we will drop the variable from the predictor set (not quite yet).  If there is a little, we will impute.\n",
    "2. Drop any variables with lots of missingnes (in a new data set).\n",
    "3. Do simple imputations for variables with a little bit of missingness.\n",
    "4. Create dummies for categorical predictors.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "button": false,
    "collapsed": true,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "#########\n",
    "# 1. Determine the amount of missingness in each variable. \n",
    "# Use is.na() in combination with .sum()\n",
    "########\n",
    "\n",
    "# Your code here\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "button": false,
    "collapsed": true,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "#######\n",
    "# And then build your predictor set\n",
    "# 2. Drop any variables with lots of missingnes (in a new data set).\n",
    "# 3. Do simple imputations for variables with a little bit of missingness.\n",
    "# 4. Create dummies for categorical predictors.\n",
    "#########\n",
    "X = gssdata[['partyid','age','sex','sexornt','educ','marital','race','income']]\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from IPython.core.display import HTML\n",
    "display(HTML(gssdata.to_html())) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "button": false,
    "collapsed": true,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "#create dummies (lots of ways to do it, two ways will be in the solutions\n",
    "\n",
    "\n",
    "#print(X.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "source": [
    "## Part 2:  Fit Basic Models\n",
    "\n",
    "In this section we ask you to:\n",
    "\n",
    "1. Split the data into 70-30 train-test splits (use the code provided...should have been done before EDA :( )\n",
    "2. Fit an unregularize logistic regression model to predict `poorhealth` from all predictors except income.\n",
    "    \n",
    "    2b. If you have time: use 'LogisticRegressionCV' to find a well-tuned L2 regularized model.\n",
    "    \n",
    "    \n",
    "3. Fit $k$-NN classification models with $k=1,15,25$ to predict `poorhealth` from all predictors except income.\n",
    "4. Report classification accuracy on both train and test set for all models."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "button": false,
    "collapsed": true,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "#######\n",
    "# Use the following train_test_split code to: \n",
    "# 1. Split the data into 70-30 train-test splits\n",
    "#######\n",
    "from sklearn.model_selection import train_test_split\n",
    "itrain, itest = train_test_split(range(gssdata.shape[0]), train_size=0.70)\n",
    "\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "\n",
    "KNeighborsClassifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "button": false,
    "collapsed": true,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "######\n",
    "# 2. Fit an unregularize logistic regression model to predict `poorhealth` \n",
    "#    from all predictors except income.\n",
    "# 2b. If you have time: use 'LogisticRegressionCV' to find a well-tuned L2 regularized model.\n",
    "# 3. Fit $k$-NN classification models with k=1,15,25 to predict `poorhealth` \n",
    "#    from all predictors except income.\n",
    "######"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "button": false,
    "collapsed": true,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "######\n",
    "# 4. Report classification accuracy on both train and test set for all models.\n",
    "######\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "source": [
    "## Part 3: Evaluate Models via Confusion matrices and ROC Curves\n",
    "\n",
    "In this part we ask that you:\n",
    "1. Plot the histograms of predicted probabilities for your favorite model from above\n",
    "2. Create the confusion matrices for (a) the default threshold for classification and (b) a well-chosen threshold for classification to balance errors more equally.\n",
    "3. Make ROC curves to evaluate a model's overall useability.\n",
    "4. Use the ROC curves to select a threshold to balance the two types of errors."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "source": [
    "As a reminder of Confustion Matrices:\n",
    "- the samples that are +ive and the classifier predicts as +ive are called True Positives (TP)\n",
    "- the samples that are -ive and the classifier predicts (wrongly) as +ive are called False Positives (FP)\n",
    "- the samples that are -ive and the classifier predicts as -ive are called True Negatives (TN)\n",
    "- the samples that are +ive and the classifier predicts as -ive are called False Negatives (FN)\n",
    "\n",
    "A classifier produces a confusion matrix which looks like this:\n",
    "\n",
    "![confusionmatrix](confusionmatrix_360.png)\n",
    "\n",
    "\n",
    "IMPORTANT NOTE: In sklearn, to obtain the confusion matrix in the form above, always have the observed `y` first, i.e.: use as `confusion_matrix(y_true, y_pred)`\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "button": false,
    "collapsed": true,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "#####\n",
    "# 1. Plot the histograms of predicted probabilities on train for your favorite \n",
    "#    model from above\n",
    "#####"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "button": false,
    "collapsed": true,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "#####\n",
    "#  2. Create the confusion matrices for (a) the default threshold for classification and \n",
    "#     (b) a well-chosen threshold for classification to balance errors more equally.\n",
    "#####\n",
    "\n",
    "from sklearn.metrics import confusion_matrix\n",
    "\n",
    "# this function may help to manually make confusion table from a different threshold\n",
    "def t_repredict(est, t, xtest):\n",
    "    probs = est.predict_proba(xtest)\n",
    "    p0 = probs[:,0]\n",
    "    p1 = probs[:,1]\n",
    "    ypred = (p1 > t)*1\n",
    "    return ypred\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "button": false,
    "collapsed": true,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "#####\n",
    "# 3. Make ROC curves to evaluate a model's overall useability.\n",
    "#####\n",
    "\n",
    "from sklearn.metrics import roc_curve, auc\n",
    "\n",
    "# a function to make 'pretty' ROC curves for this model\n",
    "def make_roc(name, clf, ytest, xtest, ax=None, labe=5, proba=True, skip=0):\n",
    "    initial=False\n",
    "    if not ax:\n",
    "        ax=plt.gca()\n",
    "        initial=True\n",
    "    if proba:#for stuff like logistic regression\n",
    "        fpr, tpr, thresholds=roc_curve(ytest, clf.predict_proba(xtest)[:,1])\n",
    "    else:#for stuff like SVM\n",
    "        fpr, tpr, thresholds=roc_curve(ytest, clf.decision_function(xtest))\n",
    "    roc_auc = auc(fpr, tpr)\n",
    "    if skip:\n",
    "        l=fpr.shape[0]\n",
    "        ax.plot(fpr[0:l:skip], tpr[0:l:skip], '.-', alpha=0.3, label='ROC curve for %s (area = %0.2f)' % (name, roc_auc))\n",
    "    else:\n",
    "        ax.plot(fpr, tpr, '.-', alpha=0.3, label='ROC curve for %s (area = %0.2f)' % (name, roc_auc))\n",
    "    label_kwargs = {}\n",
    "    label_kwargs['bbox'] = dict(\n",
    "        boxstyle='round,pad=0.3', alpha=0.2,\n",
    "    )\n",
    "    if labe!=None:\n",
    "        for k in range(0, fpr.shape[0],labe):\n",
    "            #from https://gist.github.com/podshumok/c1d1c9394335d86255b8\n",
    "            threshold = str(np.round(thresholds[k], 2))\n",
    "            ax.annotate(threshold, (fpr[k], tpr[k]), **label_kwargs)\n",
    "    if initial:\n",
    "        ax.plot([0, 1], [0, 1], 'k--')\n",
    "        ax.set_xlim([0.0, 1.0])\n",
    "        ax.set_ylim([0.0, 1.05])\n",
    "        ax.set_xlabel('False Positive Rate')\n",
    "        ax.set_ylabel('True Positive Rate')\n",
    "        ax.set_title('ROC')\n",
    "    ax.legend(loc=\"lower right\")\n",
    "    return ax\n",
    "\n",
    "sns.set_context(\"poster\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "source": [
    "4. Use the ROC curves to select a threshold to balance the two types of errors.\n",
    "\n",
    "*your answer here*\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "source": [
    "## Part 4: Imputation\n",
    "\n",
    "In this part we ask that you explore the effects of imputation:\n",
    "1. Plot the histogram of `income`.\n",
    "2. Create a new variable `income_imp` that imputes the median or mean income for all the missing values and plot the histogram for this new variable.\n",
    "3. Compare the histograms above.\n",
    "\n",
    "\n",
    "4. Update your `poorhealth` prediction model(s) by incorporating `income_imp`. \n",
    "5. Compare the accuracy of this new model.\n",
    "\n",
    "\n",
    "And if there is time:\n",
    "       \n",
    "6. Create a new variable `income_imp2` that imputes the value via a model.\n",
    "7. Update your `poorhealth` prediction model(s) by incorporating `income_imp2`. \n",
    "8. Compare the accuracy of this newest model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "button": false,
    "collapsed": true,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "#####\n",
    "# 1. Plot the histogram of `income`.\n",
    "# 2. Create a new variable `income_imp` that imputes the median or \n",
    "#    mean income for all the missing values and plot the histogram for this new variable.\n",
    "#####"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "source": [
    "3. Compare the histograms above.\n",
    "\n",
    "*your answer here*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "button": false,
    "collapsed": true,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "#####\n",
    "# 4. Update your `poorhealth` prediction model(s) by incorporating `income_imp`. \n",
    "# 5. Calculate and compare the accuracy of this new model.\n",
    "# And if there is time:\n",
    "# 6. Create a new variable `income_imp2` that imputes the value via a model.\n",
    "# 7. Update your `poorhealth` prediction model(s) by incorporating `income_imp2`. \n",
    "# 8. Calculate and compare the accuracy of this newest model.\n",
    "#####\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "source": [
    "5 and 8. Compare the accuracies.\n",
    "\n",
    "*your answer here*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "button": false,
    "collapsed": true,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}