CSCI E-109A¶

Standard Section 8: Multiclass Model and Midterm Review¶

Harvard University
Fall 2017
Section Leaders: Albert Wu, Nathaniel Burbank
Instructors: Pavlos Protopapas, Kevin Rader, Rahul Dave, Margo Levine

**Download this notebook from the CS109 repo or here:** **http://bit.ly/109_S8**

This section can be split into two major parts. The first covers multiclass model fitting and is designed to help you get started on HW 7. The second part goes over midterm 2 from 2016 and is designed to help you prepare for the upcoming midterm.

Specifically, we will:

1. Use the iris dataset, which we used in section 3, to cover multiclass models and fitting them.
2. Dive into midterm 2 from last year and go through a thorough review of the solutions that were posted.

In [1]:

import numpy as np
import pandas as pd
import scipy as sp
from sklearn import preprocessing
from sklearn.cross_validation import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import PolynomialFeatures
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.tree import DecisionTreeClassifier as DecisionTree
from sklearn.ensemble import RandomForestClassifier as RandomForest
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC
from sklearn.cross_validation import train_test_split
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
pd.options.display.max_columns = 500
%matplotlib inline

/anaconda/lib/python3.6/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)

In [2]:

#--------  plot_decision_boundary
# A function that visualizes the data and the decision boundaries
# Input: 
#      x (predictors)
#      y (labels)
#      model (the classifier you want to visualize)
#      title (title for plot)
#      ax (a set of axes to plot on)
#      poly_degree (highest degree of polynomial terms included in the model; None by default)

def plot_decision_boundary(x, y, model, title, ax, poly_degree=None):
    
    # Create mesh
    offset = .1
    # Interval of points for Sepal length
    min0 = x[:,0].min() - offset
    max0 = x[:,0].max() + offset
    interval0 = np.arange(min0, max0, (max0-min0)/300)
    n0 = np.size(interval0)
    
    # Interval of points for Petal width
    min1 = x[:,1].min() - offset
    max1 = x[:,1].max() + offset
    interval1 = np.arange(min1, max1, (max1-min1)/300)
    n1 = np.size(interval1)

    # Create mesh grid of points
    x1, x2 = np.meshgrid(interval0, interval1)
    x1 = x1.reshape(-1,1)
    x2 = x2.reshape(-1,1)
    xx = np.concatenate((x1, x2), axis=1)

    # Predict on mesh of points
    # Check if polynomial terms need to be included
    if(poly_degree!=None):
        # Use PolynomialFeatures to generate polynomial terms
        poly = PolynomialFeatures(poly_degree,include_bias = False)
        xx_ = poly.fit_transform(xx)
        yy = model.predict(xx_) 
        
    else:   
        yy = model.predict(xx)
        
    yy = yy.reshape((n0, n1))

    # Plot decision surface
    x1 = x1.reshape(n0, n1)
    x2 = x2.reshape(n0, n1)
    ax.contourf(x1, x2, yy, cmap=plt.cm.coolwarm, alpha=0.1)
    
    # Plot scatter plot of data
    yy = y.reshape(-1,)
    ax.scatter(x[yy==0,0], x[yy==0,1], c='blue', label='Class 0', cmap=plt.cm.coolwarm)
    
    ax.scatter(x[yy==1,0], x[yy==1,1], c='cyan', label='Class 1',cmap=plt.cm.coolwarm)
    
    ax.scatter(x[yy==2,0], x[yy==2,1], c='red', label='Class 2', cmap=plt.cm.coolwarm)
    
    # Label axis, title
    ax.set_title(title)
    ax.legend()
    ax.set_xlabel('Sepal length')
    ax.set_ylabel('Petal width')

Part 0: Fitting Multiclass Models to the Iris Dataset¶

This first part, part 0, is designed to help you get started on the first part of the homework 7. The dataset we use is similar to that of section 3, where we had flower type as our predictor (0,1,2). Here we will focus on sepal length and petal width as our predictors. We will fit and compare the training and test accuracies of the following classification methods:

Multiclass Logistic Regression (Multinomial and one-vs-rest (OvR))
Multiclass Logistic Regression cubic terms
Linear Discriminant Analysis
Quadratic Discriminant Analysis
k-Nearest Neighbors

In [3]:

df = pd.read_csv('https://raw.githubusercontent.com/cs109/a-2017/master/Sections/Standard/s8_data/section_8_data.csv')
np.random.seed(seed=109)
msk = np.random.rand(len(df)) < 0.5
data_train_2 = df[msk]
data_test_2 = df[~msk]
data_train_2.head()

Out[3]:

	slength	pwidth
1	4.9	0.2
3	4.6	0.2
6	4.6	0.3
8	4.4	0.2
9	4.9	0.1

Let's plot the distribution of each variable and see what we get:

In [4]:

sns.set() # Calling this defaults to regular seaborn plotting, with the gray background.
sns.distplot(df.slength)
sns.distplot(df.pwidth).set(xlabel='Distributions of Sepal Length and Petal Width')
plt.legend(['Sepal Length', 'Sepal Width'])
plt.xlim(-2,10);

The column target contains our three types of flowers (0,1,2), while slength and pwidth are the sepal legnth and petal width for each specific flower observation.

In [5]:

# Plot the training points
colors = ['blue', 'cyan', 'red']
plt.scatter(data_train_2.iloc[:, 0], data_train_2.iloc[:, 1], c=np.array(colors)[data_train_2.iloc[:,2].values], 
            cmap=plt.cm.coolwarm, edgecolor='k')
plt.xlabel('Sepal length')
plt.ylabel('Petal width')

Out[5]:

It appears that for our case, a linear classifier will not be able to perfectly separate the points. But it does appear that a linear classifier should seem to do relatively well here.

Here we will use multinomial logistic regression and one-vs-rest (OvR) logistic regression methods for fitting a multiclass classifier. In OvR, a separate binary classifier is fit for each class, whereas in multinomial logistic regression a single classifier is fit for all classes.

In [6]:

X_train_2 = data_train_2[['slength', 'pwidth']]
y_train_2 = data_train_2['target']
X_test_2 = data_test_2[['slength', 'pwidth']]
y_test_2 = data_test_2['target']

Here, we will fit both a OvR logistic regression and Multinomial logistic regression model. First, we look at the implied decision boundaries from both models on the training set.

In [7]:

logregcv = LogisticRegressionCV(multi_class='ovr')
logregcv.fit(X_train_2, y_train_2)
logregcv_2 = LogisticRegressionCV(multi_class='multinomial')
logregcv_2.fit(X_train_2, y_train_2)

Out[7]:

LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='multinomial', n_jobs=1, penalty='l2',
           random_state=None, refit=True, scoring=None, solver='lbfgs',
           tol=0.0001, verbose=0)

In [8]:

f, axes = plt.subplots(2, 2, figsize = (13,13))

plot_decision_boundary(X_train_2.values, y_train_2.values, 
                               logregcv, "LogReg OvR - Train", axes[0,0])

plot_decision_boundary(X_train_2.values, y_train_2.values, 
                               logregcv_2, "LogReg Multinomial - Train", axes[0,1])

plot_decision_boundary(X_test_2.values, y_test_2.values, 
                               logregcv, "LogReg OvR - Test", axes[1,0])

plot_decision_boundary(X_test_2.values, y_test_2.values, 
                               logregcv_2, "LogReg Multinomial - Test", axes[1,1])

Next we look we look at the training and testing accuracies.

In [9]:

print("Train OvR:", logregcv.score(X_train_2, y_train_2))
print("Test OvR:", logregcv.score(X_test_2, y_test_2))
print("Train Multinomial:", logregcv_2.score(X_train_2, y_train_2))
print("Test Multinomial:",logregcv_2.score(X_test_2, y_test_2))

Train OvR: 0.971014492754
Test OvR: 0.925925925926
Train Multinomial: 0.971014492754
Test Multinomial: 0.913580246914

Now, we show how to fit a Multiclass Logistic Regression model with cubic terms:

In [10]:

poly_degree = 3
poly = PolynomialFeatures(poly_degree, include_bias=False)
X_train_poly_cubic = poly.fit_transform(X_train_2)
X_test_poly_cubic = poly.fit_transform(X_test_2)

logregcv_cubic = LogisticRegressionCV()
logregcv_cubic.fit(X_train_poly_cubic, y_train_2)

logregcv_2_poly_cubic = LogisticRegressionCV(multi_class='multinomial')
logregcv_2_poly_cubic.fit(X_train_poly_cubic, y_train_2)

Out[10]:

LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='multinomial', n_jobs=1, penalty='l2',
           random_state=None, refit=True, scoring=None, solver='lbfgs',
           tol=0.0001, verbose=0)

In [11]:

f, axes = plt.subplots(1, 2, figsize = (15,8))

plot_decision_boundary(X_train_poly_cubic, y_train_2.values, 
                               logregcv_cubic, "LogReg OvR Cubic", axes[0],poly_degree)

plot_decision_boundary(X_train_poly_cubic, y_train_2.values, 
                               logregcv_2_poly_cubic, "LogReg Multinomial Cubic", axes[1],poly_degree)

In [12]:

print("Train LR Cubic Features OvR:", logregcv_cubic.score(X_train_poly_cubic, y_train_2))
print("Test LR Cubic Features OvR:", logregcv_cubic.score(X_test_poly_cubic, y_test_2))
print("Train LR Cubic Features Multinomial:", logregcv_2_poly_cubic.score(X_train_poly_cubic, y_train_2))
print("Test LR Cubic Features Multinomial:",logregcv_2_poly_cubic.score(X_test_poly_cubic, y_test_2))

Train LR Cubic Features OvR: 0.971014492754
Test LR Cubic Features OvR: 0.925925925926
Train LR Cubic Features Multinomial: 0.971014492754
Test LR Cubic Features Multinomial: 0.913580246914

We can then compare our results to that given by LDA and QDA:

In [13]:

lda = LDA()
lda.fit(X_train_2, y_train_2)
qda = QDA()
qda.fit(X_train_2, y_train_2)

Out[13]:

QuadraticDiscriminantAnalysis(priors=None, reg_param=0.0,
               store_covariances=False, tol=0.0001)

In [14]:

f, axes = plt.subplots(2, 2, figsize = (12,12))

plot_decision_boundary(X_train_2.values, y_train_2.values, 
                               lda, "Linear Discriminant Analysis (LDA) - Train", axes[0,0])

plot_decision_boundary(X_train_2.values, y_train_2.values, 
                               qda, "Quadratic Discriminant Analysis (QDA) - Train", axes[0,1])

plot_decision_boundary(X_test_2.values, y_test_2.values, 
                               lda, "Linear Discriminant Analysis (LDA) - Test", axes[1,0])

plot_decision_boundary(X_test_2.values, y_test_2.values, 
                               qda, "Quadratic Discriminant Analysis (QDA) - Test", axes[1,1])

plt.show()

In [15]:

print("Train LDA:", lda.score(X_train_2, y_train_2))
print("Test LDA:", lda.score(X_test_2, y_test_2))
print("Train QDA:", qda.score(X_train_2, y_train_2))
print("Test QDA:",qda.score(X_test_2, y_test_2))

Train LDA: 0.971014492754
Test LDA: 0.950617283951
Train QDA: 0.971014492754
Test QDA: 0.925925925926

Finally, we cover the fitting of k-Nearest Neighbors for $k = 2,5$:

In [16]:

knn_2 = KNN(n_neighbors=2)
knn_2.fit(X_train_2, y_train_2)

knn_5 = KNN(n_neighbors=5)
knn_5.fit(X_train_2, y_train_2)

Out[16]:

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [17]:

f, axes = plt.subplots(2, 2, figsize = (12,12))

plot_decision_boundary(X_train_2.values, y_train_2.values, 
                               knn_2, "KNN 2 - Train", axes[0,0])

plot_decision_boundary(X_train_2.values, y_train_2.values, 
                               knn_5, "KNN 5 - Train", axes[0,1])

plot_decision_boundary(X_test_2.values, y_test_2.values, 
                               knn_2, "KNN 2 - Test", axes[1,0])

plot_decision_boundary(X_test_2.values, y_test_2.values, 
                               knn_5, "KNN 5 - Test", axes[1,1])
plt.show()

In [18]:

print("Train KNN 2:", knn_2.score(X_train_2, y_train_2))
print("Test KNN 2:", knn_2.score(X_test_2, y_test_2))
print("Train KNN 5:", knn_5.score(X_train_2, y_train_2))
print("Test KNN 5:", knn_5.score(X_test_2, y_test_2))

Train KNN 2: 0.95652173913
Test KNN 2: 0.975308641975
Train KNN 5: 0.95652173913
Test KNN 5: 0.950617283951

Now we will dive into the last part of HW 7:¶

In [19]:

cost_func = lambda y_true, y_pred, cost_val: float((1000*np.sum(y_pred == 0) + 
                        cost_val*np.sum(y_true[y_pred != 0] != y_pred[y_pred != 0])))/y_true.shape[0]

In [20]:

logregcv = LogisticRegressionCV(multi_class = 'ovr')
logregcv.fit(X_train_2, y_train_2)
print("Cost of OvR Classifier: ", cost_func(y_test_2, logregcv.predict(X_test_2), 5000))

Cost of OvR Classifier:  740.7407407407408

In [21]:

largest_probs = np.max(logregcv.predict_proba(X_test_2), axis=1)
y_hat = logregcv.predict(X_test_2)
threshold = 0.7
for k, val in enumerate(largest_probs):
    if val < threshold:
        y_hat[k] = 0
print("Cost at cutoff",threshold, ":",cost_func(y_test_2, y_hat, 5000))

Cost at cutoff 0.7 : 617.283950617284

We will now move on to the Midterm 2 solutions from last year:

Part I: Diagnosing the Simian Flu 2016¶

You are given the early data for an outbreak of a dangerous virus originating from a group of primates being kept in a Massechussetts biomedical research lab, this virus is dubbed the "Simian Flu".

You have the medical records of $n$ number of patients in 'flu_train.csv. There are two general types of patients in the data, flu patients and healthy (this is recorded in the column labeled flu, a 0 indicates the absences of the virus and a 1 indicates presence). Furthermore, scientists have found that there are two strains of the virus, each requiring a different type of treatment (this is recorded in the column labeled flutype, a 1 indicates the absences of the virus, a 2 indicates presence of strain 1 and a 3 indicates the presence of strain 2).

Your task: build a model to predict if a given patient has the flu. Your goal is to catch as many flu patients as possible without misdiagnosing too many healthy patients.

The deliverable: a function called flu_predict which satisfies:

input: x_test, a set of medical predictors for a group of patients
output: y_pred, a set of labels, one for each patient; 0 for healthy and 1 for infected with the flu virus

The MA state government will use your model to diagnose sets of future patients (held by us). You can expect that there will be an increase in the number of flu patients in any groups of patients in the future.

We provide you with some benchmarks for comparison.

Baseline Model:

~50% expected accuracy on healthy patients in observed data
~50% expected accuracy on flu patients in observed data
~50% expected accuracy on healthy patients in future data
~50% expected accuracy on flu patients in future data
time to build: 5 min

Reasonable Model:

~69% expected accuracy on healthy patients in observed data
~55% expected accuracy on flu patients, in observed data
~69% expected accuracy on healthy patients in future data
~60% expected accuracy on flu patients, in future data
time to build: 20 min

Grading: Your grade will be based on:

your model's ability to out-perform our benchmarks
your ability to carefully and thoroughly follow the data science pipeline (see lecture slides for definition)
the extend to which all choices are reasonable and defensible by methods you have learned in this class

Solutions:

Step 1: Read the data, clean and explore the data¶

In [22]:

flu_train = pd.read_csv('https://raw.githubusercontent.com/cs109/a-2017/master/Midterms/2016%20Midterm%202/data/flu_train.csv')
flu_test = pd.read_csv('https://raw.githubusercontent.com/cs109/a-2017/master/Midterms/2016%20Midterm%202/data/flu_test.csv')

flu_train.head()

Out[22]:

	ID	Gender	Age	AgeDecade	AgeMonths	Race1	Race3	Education	MaritalStatus	HHIncome	HHIncomeMid	Poverty	HomeRooms	HomeOwn	Work	Weight	Length	HeadCirc	Height	BMI	BMICatUnder20yrs	BMI_WHO	Pulse	BPSysAve	BPDiaAve	BPSys1	BPDia1	BPSys2	BPDia2	BPSys3	BPDia3	Testosterone	DirectChol	TotChol	UrineVol1	UrineFlow1	UrineVol2	UrineFlow2	Diabetes	DiabetesAge	HealthGen	DaysMentHlthBad	LittleInterest	Depressed	nPregnancies	nBabies	Age1stBaby	SleepHrsNight	SleepTrouble	PhysActive	PhysActiveDays	TVHrsDay	CompHrsDay	TVHrsDayChild	CompHrsDayChild	Alcohol12PlusYr	AlcoholDay	AlcoholYear	SmokeNow	Smoke100	Smoke100n	SmokeAge	Marijuana	AgeFirstMarij	RegularMarij	AgeRegMarij	HardDrugs	SexEver	SexAge	SexNumPartnLife	SexNumPartYear	SameSex	SexOrientation	PregnantNow	flutype
0	51624	male	34	30-39	409.0	White	NaN	High School	Married	25000-34999	30000.0	1.36	6.0	Own	NotWorking	87.4	NaN	NaN	164.7	32.22	NaN	30.0_plus	70.0	113.0	85.0	114.0	88.0	114.0	88.0	112.0	82.0	NaN	1.29	3.49	352.0	NaN	NaN	NaN	No	NaN	Good	15.0	Most	Several	NaN	NaN	NaN	4.0	Yes	No	NaN	NaN	NaN	NaN	NaN	Yes	NaN	0.0	No	Yes	Smoker	18.0	Yes	17.0	No	NaN	Yes	Yes	16.0	8.0	1.0	No	Heterosexual	NaN	1
1	51630	female	49	40-49	596.0	White	NaN	Some College	LivePartner	35000-44999	40000.0	1.91	5.0	Rent	NotWorking	86.7	NaN	NaN	168.4	30.57	NaN	30.0_plus	86.0	112.0	75.0	118.0	82.0	108.0	74.0	116.0	76.0	NaN	1.16	6.70	77.0	0.094	NaN	NaN	No	NaN	Good	10.0	Several	Several	2.0	2.0	27.0	8.0	Yes	No	NaN	NaN	NaN	NaN	NaN	Yes	2.0	20.0	Yes	Yes	Smoker	38.0	Yes	18.0	No	NaN	Yes	Yes	12.0	10.0	1.0	Yes	Heterosexual	NaN	1
2	51638	male	9	0-9	115.0	White	NaN	NaN	NaN	75000-99999	87500.0	1.84	6.0	Rent	NaN	29.8	NaN	NaN	133.1	16.82	NaN	12.0_18.5	82.0	86.0	47.0	84.0	50.0	84.0	50.0	88.0	44.0	NaN	1.34	4.86	123.0	1.538	NaN	NaN	No	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	5.0	0.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1
3	51646	male	8	0-9	101.0	White	NaN	NaN	NaN	55000-64999	60000.0	2.33	7.0	Own	NaN	35.2	NaN	NaN	130.6	20.64	NaN	18.5_to_24.9	72.0	107.0	37.0	114.0	46.0	108.0	36.0	106.0	38.0	NaN	1.55	4.09	238.0	1.322	NaN	NaN	No	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	6.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1
4	51647	female	45	40-49	541.0	White	NaN	College Grad	Married	75000-99999	87500.0	5.00	6.0	Own	Working	75.7	NaN	NaN	166.7	27.24	NaN	25.0_to_29.9	62.0	118.0	64.0	106.0	62.0	118.0	68.0	118.0	60.0	NaN	2.12	5.82	106.0	1.116	NaN	NaN	No	NaN	Vgood	3.0	None	None	1.0	NaN	NaN	8.0	No	Yes	5.0	NaN	NaN	NaN	NaN	Yes	3.0	52.0	NaN	No	Non-Smoker	NaN	Yes	13.0	No	NaN	No	Yes	13.0	20.0	0.0	Yes	Bisexual	NaN	1

In [23]:

flu_train.flu.mean(),flu_test.flu.mean()

Out[23]:

(0.05909264201296226, 0.1963470319634703)

So first, only about 6% of the patients have the flu in our training data, and in terms of the predictors, we have a mixture of categorical and numerical variables with a fair amount of both missing and semi-duplicate information. To train a model to predict whether new patients will have the flu, we'll have to convert the categorical variables to indicators and handle all of the missing values.

Let's inspect the column types and how many missing values they have:

In [24]:

def column_type(column, df):
    return 'categ' if df[column].dtype == np.dtype('O') else 'numer'

def nan_fraction(column, df):
    return len(df[df[column].isnull()]) * 1. / len(df)

for i, column in enumerate(flu_train.columns):
    if column == 'flu' or column == 'flutype':
        continue
    elif i % 2 == 0:
        print('')
    else:
        print('\t', end='')
    print('{:16} ({}), NaNs: {:.2f}/{:.2f} train/test'.format(
            column, column_type(column, flu_train),
            nan_fraction(column, flu_train),
            nan_fraction(column, flu_test)), end='')

ID               (numer), NaNs: 0.00/0.00 train/test	Gender           (categ), NaNs: 0.00/0.00 train/test
Age              (numer), NaNs: 0.00/0.00 train/test	AgeDecade        (categ), NaNs: 0.04/0.04 train/test
AgeMonths        (numer), NaNs: 0.48/0.46 train/test	Race1            (categ), NaNs: 0.00/0.00 train/test
Race3            (categ), NaNs: 0.52/0.54 train/test	Education        (categ), NaNs: 0.32/0.30 train/test
MaritalStatus    (categ), NaNs: 0.32/0.30 train/test	HHIncome         (categ), NaNs: 0.09/0.09 train/test
HHIncomeMid      (numer), NaNs: 0.09/0.09 train/test	Poverty          (numer), NaNs: 0.08/0.08 train/test
HomeRooms        (numer), NaNs: 0.01/0.01 train/test	HomeOwn          (categ), NaNs: 0.01/0.01 train/test
Work             (categ), NaNs: 0.26/0.25 train/test	Weight           (numer), NaNs: 0.01/0.01 train/test
Length           (numer), NaNs: 0.93/0.93 train/test	HeadCirc         (numer), NaNs: 0.99/0.99 train/test
Height           (numer), NaNs: 0.04/0.04 train/test	BMI              (numer), NaNs: 0.04/0.04 train/test
BMICatUnder20yrs (categ), NaNs: 0.86/0.87 train/test	BMI_WHO          (categ), NaNs: 0.05/0.05 train/test
Pulse            (numer), NaNs: 0.17/0.17 train/test	BPSysAve         (numer), NaNs: 0.17/0.17 train/test
BPDiaAve         (numer), NaNs: 0.17/0.17 train/test	BPSys1           (numer), NaNs: 0.20/0.21 train/test
BPDia1           (numer), NaNs: 0.20/0.21 train/test	BPSys2           (numer), NaNs: 0.19/0.19 train/test
BPDia2           (numer), NaNs: 0.19/0.19 train/test	BPSys3           (numer), NaNs: 0.19/0.19 train/test
BPDia3           (numer), NaNs: 0.19/0.19 train/test	Testosterone     (numer), NaNs: 0.62/0.64 train/test
DirectChol       (numer), NaNs: 0.17/0.18 train/test	TotChol          (numer), NaNs: 0.17/0.18 train/test
UrineVol1        (numer), NaNs: 0.12/0.12 train/test	UrineFlow1       (numer), NaNs: 0.18/0.18 train/test
UrineVol2        (numer), NaNs: 0.85/0.86 train/test	UrineFlow2       (numer), NaNs: 0.85/0.86 train/test
Diabetes         (categ), NaNs: 0.02/0.02 train/test	DiabetesAge      (numer), NaNs: 0.94/0.92 train/test
HealthGen        (categ), NaNs: 0.28/0.27 train/test	DaysMentHlthBad  (numer), NaNs: 0.28/0.27 train/test
LittleInterest   (categ), NaNs: 0.37/0.36 train/test	Depressed        (categ), NaNs: 0.37/0.36 train/test
nPregnancies     (numer), NaNs: 0.76/0.73 train/test	nBabies          (numer), NaNs: 0.77/0.74 train/test
Age1stBaby       (numer), NaNs: 0.82/0.80 train/test	SleepHrsNight    (numer), NaNs: 0.26/0.25 train/test
SleepTrouble     (categ), NaNs: 0.26/0.25 train/test	PhysActive       (categ), NaNs: 0.20/0.19 train/test
PhysActiveDays   (numer), NaNs: 0.55/0.54 train/test	TVHrsDay         (categ), NaNs: 0.54/0.56 train/test
CompHrsDay       (categ), NaNs: 0.54/0.56 train/test	TVHrsDayChild    (numer), NaNs: 0.92/0.92 train/test
CompHrsDayChild  (numer), NaNs: 0.92/0.92 train/test	Alcohol12PlusYr  (categ), NaNs: 0.38/0.37 train/test
AlcoholDay       (numer), NaNs: 0.54/0.57 train/test	AlcoholYear      (numer), NaNs: 0.45/0.45 train/test
SmokeNow         (categ), NaNs: 0.71/0.66 train/test	Smoke100         (categ), NaNs: 0.32/0.30 train/test
Smoke100n        (categ), NaNs: 0.32/0.30 train/test	SmokeAge         (numer), NaNs: 0.72/0.68 train/test
Marijuana        (categ), NaNs: 0.54/0.56 train/test	AgeFirstMarij    (numer), NaNs: 0.75/0.74 train/test
RegularMarij     (categ), NaNs: 0.54/0.56 train/test	AgeRegMarij      (numer), NaNs: 0.88/0.87 train/test
HardDrugs        (categ), NaNs: 0.47/0.47 train/test	SexEver          (categ), NaNs: 0.47/0.47 train/test
SexAge           (numer), NaNs: 0.49/0.49 train/test	SexNumPartnLife  (numer), NaNs: 0.47/0.47 train/test
SexNumPartYear   (numer), NaNs: 0.54/0.56 train/test	SameSex          (categ), NaNs: 0.47/0.47 train/test
SexOrientation   (categ), NaNs: 0.55/0.57 train/test	PregnantNow      (categ), NaNs: 0.83/0.85 train/test

There are a large number of missing values in the data. Nearly all predictors have some degree of missingness. Some include potentially useful information, while others do not. NaN in the 'pregnancy' column is meaningful and informative, as patients with NaN's in the pregnancy column are males, where as NaN's in other predictors may appear randomly.

As a first pass, we drop any columns with an extremely high fraction of NaNs (along with any other columns that explicitly don't make sense to include, like ID). We encode the remaining categorical columns, reserving a separate category for missing values. We do mean imputation on the remaining quantitative variables.

However, there are many acceptable strategies here, as long as the appropriateness of the method in the context of the task and the data is discussed.

In [25]:

def clean_and_impute(df):
    # don't modify the original frame
    df = df.copy()
    del df['ID']
    del df['AgeMonths'] # almost a duplicate of Age but with more NaNs
    
    for column in df.columns:
        if nan_fraction(column, flu_train) > 0.75:
            print("Dropping {0}".format(column))
            del df[column]
    
    encode = preprocessing.LabelEncoder()

    for column in df.columns:
        if df[column].dtype == np.object:
            df[column] = df[column].fillna('NaN') #Categorical blank values get set as separate category 
            df.loc[:, column] = encode.fit_transform(df[column])
        elif nan_fraction(column, flu_train) > 0:
            df[column+'_missing'] = df[column].isnull().astype(int)
            df[column] = df[column].fillna(flu_train[column].mean()) 
            
    df = df[[c for c in df.columns if c not in ['flu','flutype']] + ['flu','flutype']]
    return df

In [26]:

#Clean and encode

train = clean_and_impute(flu_train)
train.head()

Dropping Length
Dropping HeadCirc
Dropping BMICatUnder20yrs
Dropping UrineVol2
Dropping UrineFlow2
Dropping DiabetesAge
Dropping nPregnancies
Dropping nBabies
Dropping Age1stBaby
Dropping TVHrsDayChild
Dropping CompHrsDayChild
Dropping AgeRegMarij
Dropping PregnantNow

Out[26]:

	Gender	Age	AgeDecade	Race1	Race3	Education	MaritalStatus	HHIncome	HHIncomeMid	Poverty	HomeRooms	HomeOwn	Work	Weight	Height	BMI	BMI_WHO	Pulse	BPSysAve	BPDiaAve	BPSys1	BPDia1	BPSys2	BPDia2	BPSys3	BPDia3	Testosterone	DirectChol	TotChol	UrineVol1	UrineFlow1	Diabetes	HealthGen	DaysMentHlthBad	LittleInterest	Depressed	SleepHrsNight	SleepTrouble	PhysActive	PhysActiveDays	TVHrsDay	CompHrsDay	Alcohol12PlusYr	AlcoholDay	AlcoholYear	SmokeNow	Smoke100	Smoke100n	SmokeAge	Marijuana	AgeFirstMarij	RegularMarij	HardDrugs	SexEver	SexAge	SexNumPartnLife	SexNumPartYear	SameSex	SexOrientation	Testosterone_missing	UrineFlow1_missing	DaysMentHlthBad_missing	SleepHrsNight_missing	PhysActiveDays_missing	AlcoholDay_missing	AlcoholYear_missing	SmokeAge_missing	AgeFirstMarij_missing	SexAge_missing	SexNumPartnLife_missing	SexNumPartYear_missing	flutype
0	1	34	3	4	4	3	2	5	30000.0	1.36	6.0	2	2	87.4	164.7	32.22	3	70.0	113.0	85.0	114.0	88.0	114.0	88.0	112.0	82.0	190.980538	1.29	3.49	352.0	0.957834	1	2	15.000000	0	3	4.000000	2	1	3.747683	7	7	2	2.959651	0.000000	1	2	2	18.000000	2	17.000000	1	2	2	16.000000	8.000000	1.000000	1	1	1	1	0	0	1	1	0	0	0	0	0	0	1
1	0	49	4	4	4	5	1	6	40000.0	1.91	5.0	3	2	86.7	168.4	30.57	3	86.0	112.0	75.0	118.0	82.0	108.0	74.0	116.0	76.0	190.980538	1.16	6.70	77.0	0.094000	1	2	10.000000	3	3	8.000000	2	1	3.747683	7	7	2	2.000000	20.000000	2	2	2	38.000000	2	18.000000	1	2	2	12.000000	10.000000	1.000000	2	1	1	0	0	0	1	0	0	0	0	0	0	0	1
2	1	9	0	4	4	4	3	10	87500.0	1.84	6.0	3	1	29.8	133.1	16.82	0	82.0	86.0	47.0	84.0	50.0	84.0	50.0	88.0	44.0	190.980538	1.34	4.86	123.0	1.538000	1	3	3.946268	1	1	6.913895	0	0	3.747683	7	7	0	2.959651	72.353921	0	0	0	17.939148	0	16.983471	0	0	0	17.456042	14.617021	1.354678	0	3	1	0	1	1	1	1	1	1	1	1	1	1	1
3	1	8	0	4	4	4	3	8	60000.0	2.33	7.0	2	1	35.2	130.6	20.64	1	72.0	107.0	37.0	114.0	46.0	108.0	36.0	106.0	38.0	190.980538	1.55	4.09	238.0	1.322000	1	3	3.946268	1	1	6.913895	0	0	3.747683	7	7	0	2.959651	72.353921	0	0	0	17.939148	0	16.983471	0	0	0	17.456042	14.617021	1.354678	0	3	1	0	1	1	1	1	1	1	1	1	1	1	1
4	0	45	4	4	4	2	2	10	87500.0	5.00	6.0	2	3	75.7	166.7	27.24	2	62.0	118.0	64.0	106.0	62.0	118.0	68.0	118.0	60.0	190.980538	2.12	5.82	106.0	1.116000	1	5	3.000000	2	2	8.000000	1	2	5.000000	7	7	2	3.000000	52.000000	0	1	1	17.939148	2	13.000000	1	1	2	13.000000	20.000000	0.000000	2	0	1	0	0	0	0	0	0	1	0	0	0	0	1

In [27]:

test = clean_and_impute(flu_test)
test.head()

Dropping Length
Dropping HeadCirc
Dropping BMICatUnder20yrs
Dropping UrineVol2
Dropping UrineFlow2
Dropping DiabetesAge
Dropping nPregnancies
Dropping nBabies
Dropping Age1stBaby
Dropping TVHrsDayChild
Dropping CompHrsDayChild
Dropping AgeRegMarij
Dropping PregnantNow

Out[27]:

	Gender	Age	AgeDecade	Race1	Race3	Education	MaritalStatus	HHIncome	HHIncomeMid	Poverty	HomeRooms	HomeOwn	Work	Weight	Height	BMI	BMI_WHO	Pulse	BPSysAve	BPDiaAve	BPSys1	BPDia1	BPSys2	BPDia2	BPSys3	BPDia3	Testosterone	DirectChol	TotChol	UrineVol1	UrineFlow1	Diabetes	HealthGen	DaysMentHlthBad	LittleInterest	Depressed	SleepHrsNight	SleepTrouble	PhysActive	PhysActiveDays	TVHrsDay	CompHrsDay	Alcohol12PlusYr	AlcoholDay	AlcoholYear	SmokeNow	Smoke100	Smoke100n	SmokeAge	Marijuana	AgeFirstMarij	RegularMarij	HardDrugs	SexEver	SexAge	SexNumPartnLife	SexNumPartYear	SameSex	SexOrientation	Pulse_missing	BPSysAve_missing	BPDiaAve_missing	BPSys1_missing	BPDia1_missing	BPSys2_missing	BPDia2_missing	BPSys3_missing	BPDia3_missing	Testosterone_missing	DirectChol_missing	TotChol_missing	UrineVol1_missing	UrineFlow1_missing	DaysMentHlthBad_missing	SleepHrsNight_missing	PhysActiveDays_missing	AlcoholDay_missing	AlcoholYear_missing	SmokeAge_missing	AgeFirstMarij_missing	SexAge_missing	SexNumPartnLife_missing	SexNumPartYear_missing	flu	flutype
0	1	4	0	3	4	4	3	4	22500.0	1.07	9.0	2	1	17.0	105.4	15.30	0	73.546618	117.68368	66.554131	118.668882	67.511987	118.036125	66.70936	117.507627	66.311195	190.980538	1.357699	4.835764	117.41557	0.957834	1	3	3.946268	1	1	6.913895	0	0	3.747683	7	7	0	2.959651	72.353921	0	0	0	17.939148	0	16.983471	0	0	0	17.456042	14.617021	1.354678	0	3	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	0	1
1	1	60	6	4	4	3	2	3	17500.0	1.03	5.0	2	3	74.6	169.9	25.84	2	84.000000	152.00000	100.000000	154.000000	98.000000	150.000000	98.00000	154.000000	102.000000	190.980538	1.340000	6.390000	238.00000	0.957834	1	5	0.000000	2	2	6.000000	2	1	3.747683	7	7	2	6.000000	36.000000	1	2	2	16.000000	0	16.983471	0	1	2	20.000000	1.000000	1.354678	1	3	0	0	0	0	0	0	0	0	0	1	0	0	0	1	0	0	1	0	0	0	1	0	0	1	0	1
2	1	38	3	4	4	5	2	4	22500.0	1.15	6.0	2	3	117.3	180.9	35.84	3	60.000000	147.00000	81.000000	160.000000	84.000000	150.000000	80.00000	144.000000	82.000000	190.980538	0.930000	4.680000	73.00000	0.468000	1	4	4.000000	3	2	5.000000	1	1	3.747683	7	7	1	2.959651	0.000000	0	1	1	17.939148	1	16.983471	1	1	2	23.000000	1.000000	1.000000	1	1	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	1	1	0	1	1	0	0	0	0	1
3	1	8	0	4	4	4	3	9	70000.0	3.55	5.0	2	1	40.1	128.4	24.32	1	76.000000	93.00000	54.000000	86.000000	52.000000	94.000000	56.00000	92.000000	52.000000	190.980538	1.530000	4.140000	124.00000	0.366000	1	3	3.946268	1	1	6.913895	0	0	3.747683	7	7	0	2.959651	72.353921	0	0	0	17.939148	0	16.983471	0	0	0	17.456042	14.617021	1.354678	0	3	0	0	0	0	0	0	0	0	0	1	0	0	0	0	1	1	1	1	1	1	1	1	1	1	0	1
4	0	59	5	3	4	0	6	4	22500.0	1.37	4.0	3	2	54.3	145.1	25.79	2	84.000000	150.00000	0.000000	144.000000	0.000000	150.000000	0.00000	150.000000	0.000000	190.980538	1.060000	4.160000	42.00000	0.389000	2	3	3.946268	1	1	5.000000	2	1	3.747683	7	7	0	2.959651	72.353921	0	1	1	17.939148	0	16.983471	0	0	0	17.456042	14.617021	1.354678	0	3	0	0	0	0	0	0	0	0	0	1	0	0	0	0	1	0	1	1	1	1	1	1	1	1	1	3

In [28]:

#What's up in each set

x = train[[c for c in train.columns if c not in ['flu','flutype']]].values
y = train[['flu']].values.reshape(len(x))

x_test = test[[c for c in train.columns if c not in ['flu','flutype']]].values
y_test = test[['flu']].values.reshape(len(x_test))

print('x train shape:', x.shape)
print('x test shape:', x_test.shape)
print('train class 0: {}, train class 1: {}'.format(len(y[y==0]), len(y[y==1])))
print('train class 0: {}, train class 1: {}'.format(len(y_test[y_test==0]), len(y_test[y_test==1])))

x train shape: (5246, 89)
x test shape: (1533, 89)
train class 0: 4936, train class 1: 310
train class 0: 1232, train class 1: 301

Step 2: Model Choice¶

The first task is to decide which, of the large number of classifiers we have learned during this semester, would best suit our task and our data.

It would be possible to do brute force model comparison here - i.e. tune all models and compare which does best with respect to various benchmarks. However, it is also reasonable to do a first round of model comparison by running models (with out of the box parameter settings) on bootstrapped training data and eliminating models which performed very poorly.

In [29]:

def expected_score(model, x_train, y_train):
    overall = 0
    class_0 = 0
    class_1 = 0
    for i in range(100):
        np.random.seed(i)
        sample = np.random.choice(len(x_train), len(x_train))
        x_sub_train = x_train[sample]
        y_sub_train = y_train[sample]
        
        overall += model.score(x_sub_train, y_sub_train)
        class_0 += model.score(x_sub_train[y_sub_train==0], y_sub_train[y_sub_train==0])
        class_1 += model.score(x_sub_train[y_sub_train==1], y_sub_train[y_sub_train==1])

    return pd.Series([overall / 100., 
                      class_0 / 100.,
                      class_1 / 100.],
                      index=['overall accuracy', 'accuracy on class 0', 'accuracy on class 1'])

score = lambda model, x_train, y_train: pd.Series([model.score(x_train, y_train), 
                                                 model.score(x_train[y_train==0], y_train[y_train==0]),
                                                 model.score(x_train[y_train==1], y_train[y_train==1])], 
                                                index=['overall accuracy', 'accuracy on class 0', 'accuracy on class 1'])

In [30]:

#KNN
knn = KNN(n_neighbors=2)
knn.fit(x, y)

knn_scores = score(knn, x, y)
print('knn')

#Unweighted logistic regression
unweighted_logistic = LogisticRegression(C=1000)
unweighted_logistic.fit(x, y)

unweighted_log_scores = score(unweighted_logistic, x, y)
print('unweighted log')


#Weighted logistic regression
weighted_logistic = LogisticRegression(C=1000, class_weight='balanced')
weighted_logistic.fit(x, y)

weighted_log_scores = score(weighted_logistic, x, y)
print('weighted log')


#LDA
lda = LDA()
lda.fit(x, y)

lda_scores = score(lda, x, y)
print('lda')

#QDA
qda = QDA()
qda.fit(x, y)

qda_scores = score(qda, x, y)
print('qda')

#Decision Tree
tree = DecisionTree(max_depth=50, class_weight='balanced', criterion='entropy')
tree.fit(x, y)

tree_scores = score(tree, x, y)
print('tree')


#Random Forest
rf = RandomForest(class_weight='balanced')
rf.fit(x, y)

rf_scores = score(rf, x, y)

print('rf')

#SVC
svc = SVC(C=100, class_weight='balanced')
svc.fit(x, y)

svc_scores = score(svc, x, y)

print('svc')

knn
unweighted log
weighted log
lda
qda
tree

/anaconda/lib/python3.6/site-packages/sklearn/discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
/anaconda/lib/python3.6/site-packages/sklearn/discriminant_analysis.py:695: UserWarning: Variables are collinear
  warnings.warn("Variables are collinear")

rf
svc

In [31]:

#Score Dataframe
score_df = pd.DataFrame({'knn': knn_scores, 
                         'unweighted logistic': unweighted_log_scores,
                         'weighted logistic': weighted_log_scores,
                         'lda': lda_scores,
                         'qda': qda_scores,
                         'tree': tree_scores,
                         'rf': rf_scores, 
                         'svc': svc_scores})
score_df

Out[31]:

	knn	lda	qda	rf	svc	tree	unweighted logistic	weighted logistic
overall accuracy	0.943195	0.936332	0.133816	0.987991	0.999619	1.0	0.941098	0.702440
accuracy on class 0	1.000000	0.983590	0.079417	1.000000	0.999595	1.0	1.000000	0.703606
accuracy on class 1	0.038710	0.183871	1.000000	0.796774	1.000000	1.0	0.003226	0.683871

It looks like we can rule out KNN, LDA and unweighted logistic.

What we do: We are going to pick weighted logistic regression and just tune the regularization parameter to beat the test benchmarks. Can you figure out why we chose this course of action? Hint: Instead of looking at overall accuracy, look at accuracy on class 1.

What's probably good to do: QDA, random forest, tree, SVC and weighted logistic are beating our train benchmarks as is. We will tune them to beat the test benchmarks by picking the model and parameter set with the highest CV accuracy.

In [32]:

Cs = 10.**np.arange(-3, 4, 1)
scores = []
for C in Cs:
    print('C:', C)
    weighted_log_scores = np.array([0., 0., 0.])
    kf = KFold(len(x), n_folds=10, shuffle=True, random_state=10)
    for train_index, test_index in kf:
        x_validate_train, x_validate_test = x[train_index], x[test_index]
        y_validate_train, y_validate_test = y[train_index], y[test_index]

        weighted_logistic = LogisticRegression(C=C, class_weight='balanced')
        weighted_logistic.fit(x_validate_train, y_validate_train)

        weighted_log_scores += score(weighted_logistic, x_validate_test, y_validate_test).values

    scores.append(weighted_log_scores / 10.)

scores = pd.DataFrame(np.array(scores).T, columns=[str(C) for C in Cs], index=['overall accuracy', 'accuracy on class 0', 'accuracy on class 1'])

C: 0.001
C: 0.01
C: 0.1
C: 1.0
C: 10.0
C: 100.0
C: 1000.0

What C means here: The C parameter in the LogisticRegression method is the inverse of the regularization parameter $\lambda$. In other words, C $= \frac{1}{\lambda}$. As C gets larger, $\lambda$ is getting smaller. Therefore, when we set C to be very very large in our earlier homework assignments to make sure there was no regularization, this was the same as making $\lambda$ go to zero. When $\lambda$ is zero, there is no regularization applied.

In [33]:

print(scores.T['accuracy on class 1'].max())
scores

0.577597540758

Out[33]:

	0.001	0.01	0.1	1.0	10.0	100.0	1000.0
overall accuracy	0.718639	0.706054	0.694044	0.693281	0.695185	0.692708	0.693090
accuracy on class 0	0.727878	0.714560	0.702211	0.701246	0.703662	0.700433	0.701446
accuracy on class 1	0.577598	0.575547	0.565820	0.568876	0.561737	0.573060	0.561737

To beat the future benchmark, we'll select the parameter which yields the highest accuracy on class 1 (while still beating the benchmark on class 0).

Now let's test our model on the test data:

In [34]:

#Weighted logistic regression
weighted_logistic = LogisticRegression(C=10, class_weight='balanced')
weighted_logistic.fit(x, y)
weighted_log_scores = score(weighted_logistic, x_test, y_test)
weighted_log_scores

Out[34]:

overall accuracy       0.692759
accuracy on class 0    0.712662
accuracy on class 1    0.611296
dtype: float64

Yay, we beat all the benchmarks!

Part II: Diagnosing Strains of the Simian Flu¶

From a public health perspective, we want to balance the cost of vaccinations, early interventions and the cost of treating flu complications of unvaccinated people.

There are two different strains of the flu: strain 1 has a cheaper early intervention as well as a cheaper treatment for flu complications, but patients with strain 1 has a higher rate of developing complications if treated with the wrong intervention. Strain 2 has a more expensive early intervention as well as a more costly treatment for flu complications, but patients with strain 2 has a lower rate of developing complications if treated with the wrong intervention. With no intervention, flu patients develop complications at the same rate regardless of the strain.

Your task: build a model to predict if a given patient has the flu and identify the flu strain. The state government of MA will use your model to inform public health policies: we will vaccinate people you've identified as healthy and apply corresponding interventions to patients with different strains of the flu. We have provided you with a function to compute the total expected cost of this policy decision that takes into account the cost of the vaccine, the interventions and the cost of the treatments for flu complications resulting from misdiagnosing patients. Your goal is to make sure your model produces a public health policy with the lowest associated expected cost.

The deliverable: a function called flu_predict which satisfies:

input: x_test, a set of medical predictors for a group of patients
output: y_pred, a set of labels, one for each patient; 0 for healthy, 1 for infected with strain 1, and 2 for infected with strain 2.

We provide you with some benchmarks for comparison.

Three Baseline Models:

expected cost on observed data: \$6,818,206, \$7,035,735, \$8,297,197.5
these values correspond to the all healthy, all 1, and all 2 classifiers, respectively.
time to build: 1 min

Reasonable Model:

expected cost on observed data: $6,300,000
time to build: 20 min

Grading: Your grade will be based on:

your model's ability to out-perform our benchmarks
your ability to carefully and thoroughly follow the data science pipeline (see lecture slides for definition)
the extend to which all choices are reasonable and defensible by methods you have learned in this class

In [35]:

#--------  cost
# A function that computes the expected cost of the public healthy policy based on the 
# classifications generated by your model
# Input: 
#      y_true (true class labels: 0, 1, 2)
#      y_pred (predicted class labels: 0, 1, 2)
# Returns: 
#      total_cost (expected total cost)

def cost(y_true, y_pred):
    cost_of_treatment_1 = 29500
    cost_of_treatment_2 = 45000
    cost_of_intervention_1 = 4150
    cost_of_intervention_2 = 4250
    cost_of_vaccine = 15
    
    prob_complications_untreated = 0.65
    prob_complications_1 = 0.30
    prob_complications_2 = 0.15
    
    trials = 1000    
    
    intervention_cost = cost_of_intervention_1 * len(y_pred[y_pred==1]) + cost_of_intervention_2 * len(y_pred[y_pred==2])

    vaccine_cost = cost_of_vaccine * len(y_pred[y_pred==0])
    
    false_neg_1 = ((y_true == 1) & (y_pred == 2)).sum()
    false_neg_2 = ((y_true == 2) & (y_pred == 1)).sum()
    
    untreated_1 = ((y_true == 1) & (y_pred == 0)).sum()    
    untreated_2 = ((y_true == 2) & (y_pred == 0)).sum()
    
    false_neg_1_cost = np.random.binomial(1, prob_complications_1, (false_neg_1, trials)) * cost_of_treatment_1
    false_neg_2_cost = np.random.binomial(1, prob_complications_2, (false_neg_2, trials)) * cost_of_treatment_2
    untreated_1_cost = np.random.binomial(1, prob_complications_untreated, (untreated_1, trials)) * cost_of_treatment_1
    untreated_2_cost = np.random.binomial(1, prob_complications_untreated, (untreated_2, trials)) * cost_of_treatment_2
    
    false_neg_1_cost = false_neg_1_cost.sum(axis=0)
    expected_false_neg_1_cost = false_neg_1_cost.mean()
    
    false_neg_2_cost = false_neg_2_cost.sum(axis=0)
    expected_false_neg_2_cost = false_neg_2_cost.mean()
    
    untreated_1_cost = untreated_1_cost.sum(axis=0)
    expected_untreated_1_cost = untreated_1_cost.mean()
    
    untreated_2_cost = untreated_2_cost.sum(axis=0)
    expected_untreated_2_cost = untreated_2_cost.mean()
    
    total_cost = vaccine_cost + intervention_cost + expected_false_neg_1_cost + expected_false_neg_2_cost + expected_untreated_1_cost + expected_untreated_2_cost
    
    return total_cost

We're just going to take the weighted logistic model, again, and tune the regularization parameter to both beat the benchmark on the observed data and minimize expected cost on unseen data (i.e. to prevent overfitting). Instead of using 'balanced' class weights, we're using a custom weighting scheme for the three classes (this parameter should really be tuned!).

It would probally also be go through the whole "choosing a model, tuning these models"-process again, this time to minimize cost.

Note: Be aware that the cost is now sensitive to sample size! The smaller the pool of patients the less the cost. If you are evaluating cost on a held-out test set then you can artificially make the cost very small. The benchmarks we give are for the test set.

In [36]:

x = train.values[:, :-2]
y = train.values[:, -1]
y = y - 1

x_test = test.values[:, :-2]
y_test = test.values[:, -1]

y_test = y_test - 1

We can take a look at our label frequencies, notice how they are imbalanced.

In [37]:

np.bincount(y.astype(int))/len(y)

Out[37]:

array([ 0.94090736,  0.04327106,  0.01582158])

We'll define a function for computing the accuracy rates so that we can conveniently call it later.

In [38]:

score = lambda model, x_test, y_test: pd.Series([model.score(x_test, y_test), 
                                                 model.score(x_test[y_test==0], y_test[y_test==0]),
                                                 model.score(x_test[y_test==1], y_test[y_test==1]), 
                                                 model.score(x_test[y_test==2], y_test[y_test==2]), 
                                                 cost(y_test, model.predict(x_test))],
                                                index=['overall accuracy', 'accuracy on class 0', 'accuracy on class 1', 'accuracy on class 2', 'total cost'])

Let’s make some benchmarks to compare the cost of our model against...¶

First, let's cost a perfect model...

In [39]:

l = [('Perfect', cost(y, y), cost(y_test, y_test))] 

print('minimimum cost on train:', cost(y, y))
print('minimimum cost on test:', cost(y_test, y_test))

minimimum cost on train: 1368840.0
minimimum cost on test: 1277630.0

What if we presume that all patients are healthy?

In [40]:

l.append(('All Healthy', cost(y, np.array([0] * len(y))), cost(y_test, np.array([0] * len(y_test)))))

print('simple model cost on train (all 0):', cost(y, np.array([0] * len(y))))
print('simple model cost on test (all 0):', cost(y_test, np.array([0] * len(y_test))))

simple model cost on train (all 0): 6860801.5
simple model cost on test (all 0): 6796092.5

All have strain 1 ...

In [41]:

l.append(('All 1', cost(y, np.array([1] * len(y))),cost(y_test, np.array([1] * len(y_test)))))

print('simple model cost on train (all 1):', cost(y, np.array([1] * len(y))))
print('simple model cost on test: (all 1):', cost(y_test, np.array([1] * len(y_test))))

simple model cost on train (all 1): 22333985.0
simple model cost on test: (all 1): 7040595.0

All have strain 2 ...

In [42]:

l.append(('All 2', cost(y, np.array([2] * len(y))),cost(y_test, np.array([2] * len(y_test)))))

print('simple model cost on train (all 2):', cost(y, np.array([2] * len(y))))
print('simple model cost on test (all 2):', cost(y_test, np.array([2] * len(y_test))))

simple model cost on train (all 2): 24307488.5
simple model cost on test (all 2): 8293303.5

In [43]:

pd.DataFrame(l,columns=["Model","Cost on Train (Mil.)","Cost on Test (Mil.)"]).set_index('Model')/10**6

Out[43]:

	Cost on Train (Mil.)	Cost on Test (Mil.)
Model
Perfect	1.368840	1.277630
All Healthy	6.848833	6.786303
All 1	22.334660	7.034025
All 2	24.300055	8.297640

Now let’s cost out the models we trained above…

In [44]:

models = {'knn': knn, 
         'unweighted logistic': unweighted_logistic,
         'weighted logistic': weighted_logistic,
        'lda': lda,
         'qda': qda,
         'tree': tree,
          'rf': rf, 
         'svc': svc}

for k in models:
    print("\n\n**************")
    print(k, "Train")
    mod = models[k]
    mod.fit(x, y)
    scores = score(mod, x, y)
    print(scores.round(decimals=3))
    train_cost = scores['total cost']
    scores = score(mod, x_test, y_test)
    print(k, "Train")
    print(scores.round(decimals=3))
    test_cost = scores['total cost']
    l.append((k,train_cost,test_cost))


**************
knn Train
overall accuracy             0.943
accuracy on class 0          1.000
accuracy on class 1          0.040
accuracy on class 2          0.012
total cost             6652962.000
dtype: float64
knn Train
overall accuracy             0.800
accuracy on class 0          0.995
accuracy on class 1          0.005
accuracy on class 2          0.000
total cost             6767277.500
dtype: float64


**************
unweighted logistic Train
overall accuracy             0.941
accuracy on class 0          1.000
accuracy on class 1          0.000
accuracy on class 2          0.000
total cost             6822579.500
dtype: float64
unweighted logistic Train
overall accuracy             0.804
accuracy on class 0          1.000
accuracy on class 1          0.000
accuracy on class 2          0.000
total cost             6803437.000
dtype: float64


**************
weighted logistic Train
overall accuracy             0.917
accuracy on class 0          0.964
accuracy on class 1          0.106
accuracy on class 2          0.301
total cost             6441930.500
dtype: float64
weighted logistic Train
overall accuracy             0.801
accuracy on class 0          0.966
accuracy on class 1          0.085
accuracy on class 2          0.210
total cost             5958928.500
dtype: float64


**************
lda Train
overall accuracy             0.935
accuracy on class 0          0.984
accuracy on class 1          0.101
accuracy on class 2          0.277
total cost             6156034.000
dtype: float64
lda Train
overall accuracy             0.808
accuracy on class 0          0.987
accuracy on class 1          0.060
accuracy on class 2          0.110
total cost             6200839.000
dtype: float64


**************
qda Train
overall accuracy       1.230000e-01
accuracy on class 0    8.000000e-02
accuracy on class 1    1.000000e+00
accuracy on class 2    2.650000e-01
total cost             2.055036e+07
dtype: float64
qda Train
overall accuracy             0.192
accuracy on class 0          0.089
accuracy on class 1          0.915
accuracy on class 2          0.000
total cost             6999286.500
dtype: float64


**************
tree Train

/anaconda/lib/python3.6/site-packages/sklearn/discriminant_analysis.py:455: UserWarning: The priors do not sum to 1. Renormalizing
  UserWarning)
/anaconda/lib/python3.6/site-packages/sklearn/discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
/anaconda/lib/python3.6/site-packages/sklearn/discriminant_analysis.py:695: UserWarning: Variables are collinear
  warnings.warn("Variables are collinear")

overall accuracy             1.0
accuracy on class 0          1.0
accuracy on class 1          1.0
accuracy on class 2          1.0
total cost             1368840.0
dtype: float64
tree Train
overall accuracy             0.766
accuracy on class 0          0.938
accuracy on class 1          0.060
accuracy on class 2          0.060
total cost             6626960.500
dtype: float64


**************
rf Train
overall accuracy             0.990
accuracy on class 0          1.000
accuracy on class 1          0.797
accuracy on class 2          0.940
total cost             2184576.000
dtype: float64
rf Train
overall accuracy             0.803
accuracy on class 0          0.999
accuracy on class 1          0.000
accuracy on class 2          0.000
total cost             6807378.000
dtype: float64


**************
svc Train
overall accuracy             1.0
accuracy on class 0          1.0
accuracy on class 1          1.0
accuracy on class 2          1.0
total cost             1377110.0
dtype: float64
svc Train
overall accuracy             0.800
accuracy on class 0          0.994
accuracy on class 1          0.000
accuracy on class 2          0.020
total cost             6783098.500
dtype: float64

In [45]:

pd.DataFrame(l,columns=["Model","Cost on Train (Mil.)","Cost on Test (Mil.)"]).set_index('Model')/10**6

Out[45]:

	Cost on Train (Mil.)	Cost on Test (Mil.)
Model
Perfect	1.368840	1.277630
All Healthy	6.848833	6.786303
All 1	22.334660	7.034025
All 2	24.300055	8.297640
knn	6.652962	6.767277
unweighted logistic	6.822579	6.803437
weighted logistic	6.441930	5.958928
lda	6.156034	6.200839
qda	20.550355	6.999287
tree	1.368840	6.626961
rf	2.184576	6.807378
svc	1.377110	6.783099

Weighted Logistic Reg...¶

Now let’s see if we can beat our benchmarks…

In [46]:

Cs = 10.**np.arange(-3, 4, 1)
scores = []
for C in Cs:
    print('C:', C)
    weighted_log_scores = np.array([0., 0., 0., 0., 0.])
    kf = KFold(len(x), n_folds=10, shuffle=True, random_state=10)
    for train_index, test_index in kf:
        x_validate_train, x_validate_test = x[train_index], x[test_index]
        y_validate_train, y_validate_test = y[train_index], y[test_index]

        weighted_logistic = LogisticRegression(C=C, class_weight={0:0.7, 1:10, 2:10})
        weighted_logistic.fit(x_validate_train, y_validate_train)

        weighted_log_scores += score(weighted_logistic, x_validate_test, y_validate_test).values

    scores.append(weighted_log_scores / 10.)

scores = pd.DataFrame(np.array(scores).T, columns=[str(C) for C in Cs], index=['overall accuracy', 'accuracy on class 0', 'accuracy on class 1', 'accuracy on class 2', 'total cost'])

scores

C: 0.001
C: 0.01
C: 0.1
C: 1.0
C: 10.0
C: 100.0
C: 1000.0

Out[46]:

	0.001	0.01	0.1	1.0	10.0	100.0	1000.0
overall accuracy	0.934241	0.932144	0.932144	0.932143	0.931571	0.931761	0.931572
accuracy on class 0	0.986803	0.984975	0.984166	0.984180	0.983358	0.983573	0.983561
accuracy on class 1	0.115197	0.095073	0.103919	0.108443	0.111300	0.108681	0.108681
accuracy on class 2	0.068939	0.091558	0.108983	0.099892	0.099892	0.109892	0.099892
total cost	642216.950000	643906.800000	639099.100000	639168.100000	639347.800000	639408.300000	640704.400000

In [47]:

#Weighted logistic regression on training set 
weighted_logistic = LogisticRegression(C=10, class_weight={0:0.7, 1:10, 2:10})
weighted_logistic.fit(x, y)
weighted_log_scores = score(weighted_logistic, x, y)
weighted_log_scores.round(decimals=3)

Out[47]:

overall accuracy             0.936
accuracy on class 0          0.987
accuracy on class 1          0.115
accuracy on class 2          0.145
total cost             6235795.000
dtype: float64

In [48]:

#Weighted logistic regression on test set
weighted_log_scores = score(weighted_logistic, x_test, y_test)
weighted_log_scores.round(decimals=3)

Out[48]:

overall accuracy             0.807
accuracy on class 0          0.982
accuracy on class 1          0.100
accuracy on class 2          0.070
total cost             6176224.000
dtype: float64

In [49]:

l.append(('Tuned Weighted logistic regression', score(weighted_logistic, x, y)['total cost'],score(weighted_logistic, x_test, y_test)['total cost']))
pd.DataFrame(l,columns=["Model","Cost on Train (Mil.)","Cost on Test (Mil.)"]).set_index('Model')/10**6

Out[49]:

	Cost on Train (Mil.)	Cost on Test (Mil.)
Model
Perfect	1.368840	1.277630
All Healthy	6.848833	6.786303
All 1	22.334660	7.034025
All 2	24.300055	8.297640
knn	6.652962	6.767277
unweighted logistic	6.822579	6.803437
weighted logistic	6.441930	5.958928
lda	6.156034	6.200839
qda	20.550355	6.999287
tree	1.368840	6.626961
rf	2.184576	6.807378
svc	1.377110	6.783099
Tuned Weighted logistic regression	6.251840	6.183774

Yay! We beat the benchmarks on the observed data and did pretty well on test data!