CS-109A Introduction to Data Science

Lab 7: $k$-NN Classification and Imputation

Harvard University
Fall 2019
Instructors: Pavlos Protopapas, Kevin Rader, Chris Tanner
Lab Instructors: Chris Tanner and Eleni Kaxiras.
Contributors: Kevin Rader


In [1]:
## RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
Out[1]:

Learning Goals

In this lab, we'll explore classification models to predict the health status of survey respondents and build classification decision boundaries for the resulting unbalanced classes.

By the end of this lab, you should:

  • Be familiar with the sklearn implementations of
    • $k$-NN Classification
    • ROC curves and classification metrics
  • Be able to optimize a loss function based on misclassification rates
  • Be able to impute missing values
  • Be comfortable with the different approaches to handling missingness
In [10]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.api import OLS
import sklearn as sk
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.utils import resample
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
%matplotlib inline
import seaborn as sns  # seaborn.apionly was removed in seaborn 0.9

Part 1: General Social Survey Data + EDA

The dataset contains a subset of data from the General Social Survey (GSS), a biennial survey of roughly 2,000 Americans. We will be using a small subset of the approximately 4,000 questions they ask. Specifically, we'll use:

  • id: respondent's unique ID
  • health: self-reported health level with 4 categories: poor, fair, good, excellent
  • partyid: political party affiliation with categories dem, rep, or other
  • age: age in years
  • sex: male or female
  • sexornt: sexual orientation with categories hetero, gay, or bisexual/other
  • educ: number of years of formal education (capped at 20 years)
  • marital: marital status with categories married, never married, and no longer married
  • race: with categories black, white, and other
  • income: in thousands of dollars

Our goal is to predict whether or not someone is in poor health based on the other measures.

For this task, we will exercise our normal data science pipeline -- from EDA to modeling and visualization. In particular, we will show the performance of 2 classifiers:

  • Logistic Regression
  • $k$-NN Classification

So without further ado...

EDA

Do the following basic EDA (always good ideas):

  1. Determine the dimensions of the data set.
  2. Get a glimpse of the data set.
  3. Calculate basic summary/descriptive statistics of the variables.

We also ask that you do the following:

  1. Create a binary variable called poorhealth.
  2. Explore the distribution of the responses, health and poorhealth.
  3. Explore what variables may be related to whether or not someone is in poor health.
In [32]:
gssdata = pd.read_csv("../data/gsshealth18.csv")

#####
# Your code here: EDA
# 1. Determine the dimensions of the data set.
# 2. Get a glimpse of the data set.
# 3. Calculate basic summary/descriptive statistics of the variables.
####
gssdata.head()
Out[32]:
id health partyid age sex sexornt educ marital race income
0 1 good rep 43.0 male bisexual/other 14.0 never married white NaN
1 2 excellent dem 74.0 female hetero 10.0 no longer married white NaN
2 5 excellent rep 71.0 male hetero 18.0 no longer married black NaN
3 6 good dem 67.0 female bisexual/other 16.0 no longer married white NaN
4 7 good dem 59.0 female bisexual/other 13.0 no longer married black 18.75
In [33]:
gssdata.isna().sum()
Out[33]:
id           0
health       0
partyid      0
age          2
sex          0
sexornt      0
educ         2
marital      2
race         0
income     661
dtype: int64
An aside: marital has only two missing values. As a preview of model-based imputation (see Part 4), let's split off those two rows and try to predict marital from the other columns.

In [34]:
I_test = gssdata[gssdata['marital'].isna()]
In [35]:
y_test = I_test['marital']
I_test = I_test.drop(columns=['marital'])
In [36]:
I_test
Out[36]:
id health partyid age sex sexornt educ race income
25 40 good other 50.0 female bisexual/other 12.0 black NaN
1478 2222 fair other 65.0 male bisexual/other 19.0 white NaN
In [37]:
y_test
Out[37]:
25      NaN
1478    NaN
Name: marital, dtype: object
In [39]:
I_train = gssdata.dropna(subset=['marital'])
I_train.head()
Out[39]:
id health partyid age sex sexornt educ marital race income
0 1 good rep 43.0 male bisexual/other 14.0 never married white NaN
1 2 excellent dem 74.0 female hetero 10.0 no longer married white NaN
2 5 excellent rep 71.0 male hetero 18.0 no longer married black NaN
3 6 good dem 67.0 female bisexual/other 16.0 no longer married white NaN
4 7 good dem 59.0 female bisexual/other 13.0 no longer married black 18.75
In [42]:
y_train = I_train['marital']
y_train.head(10)
Out[42]:
0        never married
1    no longer married
2    no longer married
3    no longer married
4    no longer married
5        never married
6    no longer married
7    no longer married
8              married
9    no longer married
Name: marital, dtype: object
In [40]:
I_train.isna().sum()
Out[40]:
id           0
health       0
partyid      0
age          2
sex          0
sexornt      0
educ         2
marital      0
race         0
income     659
dtype: int64
In [43]:
logit = LogisticRegression(C=1000000)
logit.fit(I_train, y_train) 
print(logit.score(I_test,y_test))
logit.coef_
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
...
ValueError: could not convert string to float: 'white'

The fit fails because sklearn estimators require purely numeric feature matrices: string-valued columns such as race must be dummy-encoded, and the remaining NaNs handled, before fitting.
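A minimal sketch of a working version (the dropped columns and the median fill are our own choices here, not prescribed by the lab):

In [ ]:
# sklearn needs numeric inputs: dummy-encode the categoricals and fill NaNs
cat_cols = ['health', 'partyid', 'sex', 'sexornt', 'race']
X_tr = pd.get_dummies(I_train.drop(columns=['id', 'marital', 'income']), columns=cat_cols)
X_tr = X_tr.fillna(X_tr.median())                  # age/educ have 2 NaNs each

X_te = pd.get_dummies(I_test.drop(columns=['id', 'income']), columns=cat_cols)
X_te = X_te.reindex(columns=X_tr.columns, fill_value=0)

logit = LogisticRegression(C=1000000, solver='lbfgs', multi_class='auto', max_iter=1000)
logit.fit(X_tr, y_train)
# the two held-out rows have no observed marital value, so we can only
# predict (i.e., impute) -- there is nothing to score against:
print(logit.predict(X_te))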
In [12]:
pd.crosstab(gssdata['health'], gssdata['sex'])
Out[12]:
sex female male
health
excellent 198 161
fair 188 167
good 442 329
poor 44 40
In [13]:
#####
# Your code here: EDA
# 4. Create a binary called `poorhealth`.  
# 5. Explore the distribution of the responses, `health` and `poorhealth`.
# 6. Explore what variables may be related to whether or not someone is in poor health.
####
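One possible approach (a sketch, not the official solution; the crosstab below is just one example of exploring a relationship):

In [ ]:
# 4. binary response: 1 if self-reported health is 'poor', else 0
gssdata['poorhealth'] = (gssdata['health'] == 'poor').astype(int)

# 5. distribution of the responses
print(gssdata['health'].value_counts())
print(gssdata['poorhealth'].mean())   # proportion in poor health

# 6. one example: is poor health related to marital status?
pd.crosstab(gssdata['poorhealth'], gssdata['marital'], normalize='index')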

Question: What classification accuracy could you achieve if you simply predicted poorhealth without a model? What classification accuracy would you get if you were to predict the multi-class health variable? Is accuracy the correct metric?

your answer here
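To ground your answer: with no model, the best you can do is always predict the majority class, so the baseline accuracy is the majority-class proportion. A sketch (assumes the poorhealth column from above):

In [ ]:
# accuracy of always predicting the majority class:
print(1 - gssdata['poorhealth'].mean())                       # binary poorhealth
print(gssdata['health'].value_counts(normalize=True).max())   # multi-class health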

Data Cleaning - Basic Handling of Missingness

Let's begin by fitting an unregularized logistic regression model to predict poor health based on all the other predictors in the model and three $k$-NN models with $k=5,10,20$.

First we need to do a small amount of data clean-up.

  1. Determine the amount of missingness in each variable. If there is a lot, we will drop the variable from the predictor set (not quite yet). If there is a little, we will impute.
  2. Drop any variables with lots of missingness (in a new data set).
  3. Do simple imputations for variables with a little bit of missingness.
  4. Create dummies for categorical predictors.
In [ ]:
#########
# 1. Determine the amount of missingness in each variable. 
# Use .isna() in combination with .sum()
########

# Your code here
In [ ]:
#######
# And then build your predictor set
# 2. Drop any variables with lots of missingness (in a new data set).
# 3. Do simple imputations for variables with a little bit of missingness.
# 4. Create dummies for categorical predictors.
#########
X = gssdata[['partyid','age','sex','sexornt','educ','marital','race','income']]
In [ ]:
from IPython.core.display import HTML
display(HTML(gssdata.to_html())) 
In [ ]:
# create dummies (lots of ways to do it; two ways will be in the solutions)


#print(X.shape)
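For reference, a sketch of steps 2-4 in one cell (the median/mode imputations and drop_first encoding are our choices, not prescribed by the lab):

In [ ]:
# 2. drop income (by far the most missingness) from the predictor set
X = X.drop(columns=['income'])
# 3. simple imputations for the small amounts of missingness
X['age'] = X['age'].fillna(X['age'].median())
X['educ'] = X['educ'].fillna(X['educ'].median())
X['marital'] = X['marital'].fillna(X['marital'].mode()[0])
# 4. dummies for the categorical predictors
X = pd.get_dummies(X, columns=['partyid', 'sex', 'sexornt', 'marital', 'race'],
                   drop_first=True)
print(X.shape)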

Part 2: Fit Basic Models

In this section we ask you to:

  1. Split the data into 70-30 train-test splits (use the code provided...should have been done before EDA :( )
  2. Fit an unregularized logistic regression model to predict poorhealth from all predictors except income.

    2b. If you have time: use 'LogisticRegressionCV' to find a well-tuned L2 regularized model.

  3. Fit $k$-NN classification models with $k=1,15,25$ to predict poorhealth from all predictors except income.
  4. Report classification accuracy on both train and test sets for all models.
In [ ]:
#######
# Use the following train_test_split code to: 
# 1. Split the data into 70-30 train-test splits
#######
from sklearn.model_selection import train_test_split
itrain, itest = train_test_split(range(gssdata.shape[0]), train_size=0.70)

from sklearn.neighbors import KNeighborsClassifier

In [ ]:
######
# 2. Fit an unregularized logistic regression model to predict `poorhealth` 
#    from all predictors except income.
# 2b. If you have time: use 'LogisticRegressionCV' to find a well-tuned L2 regularized model.
# 3. Fit $k$-NN classification models with k=1,15,25 to predict `poorhealth` 
#    from all predictors except income.
######
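A sketch of one way to do steps 2-4 (assumes the cleaned X from Part 1 and the poorhealth column; itrain/itest come from the split cell above):

In [ ]:
# a sketch (assumes X and poorhealth were built as in Part 1):
y = gssdata['poorhealth']
X_train, X_test = X.iloc[itrain], X.iloc[itest]
y_train, y_test = y.iloc[itrain], y.iloc[itest]

# 2. unregularized logistic regression (a huge C effectively removes the penalty)
logit = LogisticRegression(C=1000000, solver='lbfgs', max_iter=1000)
logit.fit(X_train, y_train)

# 3. k-NN classifiers for k = 1, 15, 25
knns = {k: KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        for k in [1, 15, 25]}

# 4. train and test accuracy for all models
models = {'logistic': logit}
models.update({'knn%d' % k: m for k, m in knns.items()})
for name, model in models.items():
    print(name, model.score(X_train, y_train), model.score(X_test, y_test))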
In [ ]:
######
# 4. Report classification accuracy on both train and test set for all models.
######

Part 3: Evaluate Models via Confusion matrices and ROC Curves

In this part we ask that you:

  1. Plot the histograms of predicted probabilities for your favorite model from above
  2. Create the confusion matrices for (a) the default threshold for classification and (b) a well-chosen threshold for classification to balance errors more equally.
  3. Make ROC curves to evaluate a model's overall usability.
  4. Use the ROC curves to select a threshold to balance the two types of errors.

As a reminder of Confusion Matrices:

  • the samples that are +ive and the classifier predicts as +ive are called True Positives (TP)
  • the samples that are -ive and the classifier predicts (wrongly) as +ive are called False Positives (FP)
  • the samples that are -ive and the classifier predicts as -ive are called True Negatives (TN)
  • the samples that are +ive and the classifier predicts as -ive are called False Negatives (FN)

A classifier produces a confusion matrix which looks like this:

[figure: confusion matrix layout — rows are observed classes, columns are predicted classes, with TN and FP in the negative row, FN and TP in the positive row]

IMPORTANT NOTE: In sklearn, to obtain the confusion matrix in the form above, always have the observed y first, i.e.: use as confusion_matrix(y_true, y_pred)

In [ ]:
#####
# 1. Plot the histograms of predicted probabilities on train for your favorite 
#    model from above
#####
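For example, for the logistic model from Part 2 (a sketch):

In [ ]:
# histogram of predicted probabilities on train for the logistic model
phat_train = logit.predict_proba(X_train)[:, 1]
plt.hist(phat_train, bins=30)
plt.axvline(0.5, color='red', linestyle='--')   # default 0.5 threshold
plt.xlabel('predicted P(poor health)')
plt.ylabel('count')
plt.show()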
In [ ]:
#####
#  2. Create the confusion matrices for (a) the default threshold for classification and 
#     (b) a well-chosen threshold for classification to balance errors more equally.
#####

from sklearn.metrics import confusion_matrix

# this function may help to manually make confusion table from a different threshold
def t_repredict(est, t, xtest):
    probs = est.predict_proba(xtest)
    p0 = probs[:,0]
    p1 = probs[:,1]
    ypred = (p1 > t)*1
    return ypred
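For example (a sketch for the logistic model; the 0.1 threshold is purely illustrative):

In [ ]:
# (a) default 0.5 threshold
print(confusion_matrix(y_test, logit.predict(X_test)))
# (b) a lower threshold trades false positives for fewer false negatives
print(confusion_matrix(y_test, t_repredict(logit, 0.1, X_test)))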
In [ ]:
#####
# 3. Make ROC curves to evaluate a model's overall usability.
#####

from sklearn.metrics import roc_curve, auc

# a function to make 'pretty' ROC curves for this model
def make_roc(name, clf, ytest, xtest, ax=None, labe=5, proba=True, skip=0):
    initial=False
    if not ax:
        ax=plt.gca()
        initial=True
    if proba:#for stuff like logistic regression
        fpr, tpr, thresholds=roc_curve(ytest, clf.predict_proba(xtest)[:,1])
    else:#for stuff like SVM
        fpr, tpr, thresholds=roc_curve(ytest, clf.decision_function(xtest))
    roc_auc = auc(fpr, tpr)
    if skip:
        l=fpr.shape[0]
        ax.plot(fpr[0:l:skip], tpr[0:l:skip], '.-', alpha=0.3, label='ROC curve for %s (area = %0.2f)' % (name, roc_auc))
    else:
        ax.plot(fpr, tpr, '.-', alpha=0.3, label='ROC curve for %s (area = %0.2f)' % (name, roc_auc))
    label_kwargs = {}
    label_kwargs['bbox'] = dict(
        boxstyle='round,pad=0.3', alpha=0.2,
    )
    if labe is not None:
        for k in range(0, fpr.shape[0],labe):
            #from https://gist.github.com/podshumok/c1d1c9394335d86255b8
            threshold = str(np.round(thresholds[k], 2))
            ax.annotate(threshold, (fpr[k], tpr[k]), **label_kwargs)
    if initial:
        ax.plot([0, 1], [0, 1], 'k--')
        ax.set_xlim([0.0, 1.0])
        ax.set_ylim([0.0, 1.05])
        ax.set_xlabel('False Positive Rate')
        ax.set_ylabel('True Positive Rate')
        ax.set_title('ROC')
    ax.legend(loc="lower right")
    return ax

sns.set_context("poster")
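A sketch of how this helper might be called for the models from Part 2 (labe=None suppresses the threshold annotations):

In [ ]:
# overlay ROC curves for the logistic and k-NN models
ax = make_roc("logistic", logit, y_test, X_test, labe=None)
for k, m in knns.items():
    make_roc("knn%d" % k, m, y_test, X_test, ax=ax, labe=None)
plt.show()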
  4. Use the ROC curves to select a threshold to balance the two types of errors.

your answer here

Part 4: Imputation

In this part we ask that you explore the effects of imputation:

  1. Plot the histogram of income.
  2. Create a new variable income_imp that imputes the median or mean income for all the missing values and plot the histogram for this new variable.
  3. Compare the histograms above.
  4. Update your poorhealth prediction model(s) by incorporating income_imp.
  5. Compare the accuracy of this new model.

And if there is time:

  6. Create a new variable income_imp2 that imputes the value via a model.
  7. Update your poorhealth prediction model(s) by incorporating income_imp2.
  8. Compare the accuracy of this newest model.
In [ ]:
#####
# 1. Plot the histogram of `income`.
# 2. Create a new variable `income_imp` that imputes the median or 
#    mean income for all the missing values and plot the histogram for this new variable.
#####
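A sketch of one approach (median imputation shown; mean imputation is analogous):

In [ ]:
# 1./2. compare observed income with the median-imputed version
plt.hist(gssdata['income'].dropna(), bins=30, alpha=0.5, label='observed')
income_imp = gssdata['income'].fillna(gssdata['income'].median())
plt.hist(income_imp, bins=30, alpha=0.5, label='median-imputed')
plt.legend()
plt.xlabel('income ($1000s)')
plt.show()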
  3. Compare the histograms above.

your answer here

In [ ]:
#####
# 4. Update your `poorhealth` prediction model(s) by incorporating `income_imp`. 
# 5. Calculate and compare the accuracy of this new model.
# And if there is time:
# 6. Create a new variable `income_imp2` that imputes the value via a model.
# 7. Update your `poorhealth` prediction model(s) by incorporating `income_imp2`. 
# 8. Calculate and compare the accuracy of this newest model.
#####
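For step 6, one simple model-based approach is to regress the observed incomes on the dummy-encoded predictors and predict the missing ones. A sketch (assumes the cleaned X from Part 1):

In [ ]:
from sklearn.linear_model import LinearRegression

# fit a linear model on the rows where income is observed
observed = gssdata['income'].notna()
reg = LinearRegression().fit(X[observed], gssdata.loc[observed, 'income'])

# predict income for the rows where it is missing
income_imp2 = gssdata['income'].copy()
income_imp2[~observed] = reg.predict(X[~observed])

# income_imp2 can now be added to the predictor set and the Part 2 models refit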

5 and 8. Compare the accuracies.

your answer here
