Key Word(s): Random Forests, AdaBoost
This section can be split into two major parts. The first covers multiclass model fitting and is designed to help you get started on HW 7. The second part goes over midterm 2 from 2016 and is designed to help you prepare for the upcoming midterm.
Specifically, we will:
1. Use the iris dataset, which we used in section 3, to cover multiclass models and fitting them.
2. Dive into midterm 2 from last year and go through a thorough review of the solutions that were posted.
import numpy as np
import pandas as pd
import scipy as sp
from sklearn import preprocessing
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import PolynomialFeatures
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.tree import DecisionTreeClassifier as DecisionTree
from sklearn.ensemble import RandomForestClassifier as RandomForest
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
pd.options.display.max_columns = 500
%matplotlib inline
#-------- plot_decision_boundary
# A function that visualizes the data and the decision boundaries
# Input:
# x (predictors)
# y (labels)
# model (the classifier you want to visualize)
# title (title for plot)
# ax (a set of axes to plot on)
# poly_degree (highest degree of polynomial terms included in the model; None by default)
def plot_decision_boundary(x, y, model, title, ax, poly_degree=None):
    # Create mesh
    offset = .1
    # Interval of points for Sepal length
    min0 = x[:, 0].min() - offset
    max0 = x[:, 0].max() + offset
    interval0 = np.arange(min0, max0, (max0 - min0) / 300)
    n0 = np.size(interval0)
    # Interval of points for Petal width
    min1 = x[:, 1].min() - offset
    max1 = x[:, 1].max() + offset
    interval1 = np.arange(min1, max1, (max1 - min1) / 300)
    n1 = np.size(interval1)
    # Create mesh grid of points
    x1, x2 = np.meshgrid(interval0, interval1)
    x1 = x1.reshape(-1, 1)
    x2 = x2.reshape(-1, 1)
    xx = np.concatenate((x1, x2), axis=1)
    # Predict on mesh of points
    # Check if polynomial terms need to be included
    if poly_degree is not None:
        # Use PolynomialFeatures to generate polynomial terms
        poly = PolynomialFeatures(poly_degree, include_bias=False)
        xx_ = poly.fit_transform(xx)
        yy = model.predict(xx_)
    else:
        yy = model.predict(xx)
    yy = yy.reshape((n0, n1))
    # Plot decision surface
    x1 = x1.reshape(n0, n1)
    x2 = x2.reshape(n0, n1)
    ax.contourf(x1, x2, yy, cmap=plt.cm.coolwarm, alpha=0.1)
    # Plot scatter plot of data
    yy = y.reshape(-1,)
    ax.scatter(x[yy == 0, 0], x[yy == 0, 1], c='blue', label='Class 0')
    ax.scatter(x[yy == 1, 0], x[yy == 1, 1], c='cyan', label='Class 1')
    ax.scatter(x[yy == 2, 0], x[yy == 2, 1], c='red', label='Class 2')
    # Label axes and title
    ax.set_title(title)
    ax.legend()
    ax.set_xlabel('Sepal length')
    ax.set_ylabel('Petal width')
Part 0: Fitting Multiclass Models to the Iris Dataset
This first part, part 0, is designed to help you get started on the first part of homework 7. The dataset we use is similar to that of section 3, where we had flower type (0, 1, 2) as our response. Here we will focus on sepal length and petal width as our predictors. We will fit and compare the training and test accuracies of the following classification methods:
- Multiclass Logistic Regression (Multinomial and one-vs-rest (OvR))
- Multiclass Logistic Regression with cubic terms
- Linear Discriminant Analysis
- Quadratic Discriminant Analysis
- k-Nearest Neighbors
df = pd.read_csv('https://raw.githubusercontent.com/cs109/a-2017/master/Sections/Standard/s8_data/section_8_data.csv')
np.random.seed(seed=109)
msk = np.random.rand(len(df)) < 0.5
data_train_2 = df[msk]
data_test_2 = df[~msk]
data_train_2.head()
Let's plot the distribution of each variable and see what we get:
sns.set() # Calling this defaults to regular seaborn plotting, with the gray background.
sns.distplot(df.slength)
sns.distplot(df.pwidth).set(xlabel='Distributions of Sepal Length and Petal Width')
plt.legend(['Sepal Length', 'Petal Width'])
plt.xlim(-2,10);
The column target contains our three types of flowers (0, 1, 2), while slength and pwidth are the sepal length and petal width for each specific flower observation.
# Plot the training points
colors = ['blue', 'cyan', 'red']
plt.scatter(data_train_2.iloc[:, 0], data_train_2.iloc[:, 1], c=np.array(colors)[data_train_2.iloc[:,2].values],
cmap=plt.cm.coolwarm, edgecolor='k')
plt.xlabel('Sepal length')
plt.ylabel('Petal width')
It appears that, in our case, a linear classifier will not be able to perfectly separate the points, but it should still do relatively well here.
Here we will use multinomial logistic regression and one-vs-rest (OvR) logistic regression methods for fitting a multiclass classifier. In OvR, a separate binary classifier is fit for each class, whereas in multinomial logistic regression a single classifier is fit for all classes.
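To make the OvR idea concrete, here is a minimal hand-rolled sketch (on a small synthetic dataset, not the iris data; the names X_toy, y_toy, and binary_models are ours) that fits one binary logistic regression per class and predicts the class whose binary model is most confident. sklearn's multi_class='ovr' option does essentially this for us.
# Hand-rolled one-vs-rest: one binary classifier per class (illustrative sketch only)
rng = np.random.RandomState(0)
X_toy = np.vstack([rng.normal(loc=c, scale=0.5, size=(20, 2)) for c in range(3)])
y_toy = np.repeat([0, 1, 2], 20)
binary_models = []
for c in range(3):
    clf = LogisticRegression()
    clf.fit(X_toy, (y_toy == c).astype(int))  # class c vs. everything else
    binary_models.append(clf)
# Predict the class whose binary model assigns the highest probability
probs = np.column_stack([m.predict_proba(X_toy)[:, 1] for m in binary_models])
y_pred_ovr = probs.argmax(axis=1)
print("Hand-rolled OvR training accuracy:", (y_pred_ovr == y_toy).mean())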
X_train_2 = data_train_2[['slength', 'pwidth']]
y_train_2 = data_train_2['target']
X_test_2 = data_test_2[['slength', 'pwidth']]
y_test_2 = data_test_2['target']
Here, we will fit both an OvR logistic regression model and a multinomial logistic regression model. First, we look at the implied decision boundaries from both models on the training set.
logregcv = LogisticRegressionCV(multi_class='ovr')
logregcv.fit(X_train_2, y_train_2)
logregcv_2 = LogisticRegressionCV(multi_class='multinomial')
logregcv_2.fit(X_train_2, y_train_2)
f, axes = plt.subplots(2, 2, figsize = (13,13))
plot_decision_boundary(X_train_2.values, y_train_2.values,
logregcv, "LogReg OvR - Train", axes[0,0])
plot_decision_boundary(X_train_2.values, y_train_2.values,
logregcv_2, "LogReg Multinomial - Train", axes[0,1])
plot_decision_boundary(X_test_2.values, y_test_2.values,
logregcv, "LogReg OvR - Test", axes[1,0])
plot_decision_boundary(X_test_2.values, y_test_2.values,
logregcv_2, "LogReg Multinomial - Test", axes[1,1])
Next, we look at the training and testing accuracies.
print("Train OvR:", logregcv.score(X_train_2, y_train_2))
print("Test OvR:", logregcv.score(X_test_2, y_test_2))
print("Train Multinomial:", logregcv_2.score(X_train_2, y_train_2))
print("Test Multinomial:",logregcv_2.score(X_test_2, y_test_2))
Now, we show how to fit a Multiclass Logistic Regression model with cubic terms:
poly_degree = 3
poly = PolynomialFeatures(poly_degree, include_bias=False)
X_train_poly_cubic = poly.fit_transform(X_train_2)
X_test_poly_cubic = poly.transform(X_test_2)  # transform the test set with the object fit on the training set
logregcv_cubic = LogisticRegressionCV(multi_class='ovr')
logregcv_cubic.fit(X_train_poly_cubic, y_train_2)
logregcv_2_poly_cubic = LogisticRegressionCV(multi_class='multinomial')
logregcv_2_poly_cubic.fit(X_train_poly_cubic, y_train_2)
f, axes = plt.subplots(1, 2, figsize = (15,8))
plot_decision_boundary(X_train_poly_cubic, y_train_2.values,
logregcv_cubic, "LogReg OvR Cubic", axes[0],poly_degree)
plot_decision_boundary(X_train_poly_cubic, y_train_2.values,
logregcv_2_poly_cubic, "LogReg Multinomial Cubic", axes[1],poly_degree)
print("Train LR Cubic Features OvR:", logregcv_cubic.score(X_train_poly_cubic, y_train_2))
print("Test LR Cubic Features OvR:", logregcv_cubic.score(X_test_poly_cubic, y_test_2))
print("Train LR Cubic Features Multinomial:", logregcv_2_poly_cubic.score(X_train_poly_cubic, y_train_2))
print("Test LR Cubic Features Multinomial:",logregcv_2_poly_cubic.score(X_test_poly_cubic, y_test_2))
We can then compare our results to those given by LDA and QDA:
lda = LDA()
lda.fit(X_train_2, y_train_2)
qda = QDA()
qda.fit(X_train_2, y_train_2)
f, axes = plt.subplots(2, 2, figsize = (12,12))
plot_decision_boundary(X_train_2.values, y_train_2.values,
lda, "Linear Discriminant Analysis (LDA) - Train", axes[0,0])
plot_decision_boundary(X_train_2.values, y_train_2.values,
qda, "Quadratic Discriminant Analysis (QDA) - Train", axes[0,1])
plot_decision_boundary(X_test_2.values, y_test_2.values,
lda, "Linear Discriminant Analysis (LDA) - Test", axes[1,0])
plot_decision_boundary(X_test_2.values, y_test_2.values,
qda, "Quadratic Discriminant Analysis (QDA) - Test", axes[1,1])
plt.show()
print("Train LDA:", lda.score(X_train_2, y_train_2))
print("Test LDA:", lda.score(X_test_2, y_test_2))
print("Train QDA:", qda.score(X_train_2, y_train_2))
print("Test QDA:",qda.score(X_test_2, y_test_2))
Finally, we cover the fitting of k-Nearest Neighbors for $k = 2,5$:
knn_2 = KNN(n_neighbors=2)
knn_2.fit(X_train_2, y_train_2)
knn_5 = KNN(n_neighbors=5)
knn_5.fit(X_train_2, y_train_2)
f, axes = plt.subplots(2, 2, figsize = (12,12))
plot_decision_boundary(X_train_2.values, y_train_2.values,
knn_2, "KNN 2 - Train", axes[0,0])
plot_decision_boundary(X_train_2.values, y_train_2.values,
knn_5, "KNN 5 - Train", axes[0,1])
plot_decision_boundary(X_test_2.values, y_test_2.values,
knn_2, "KNN 2 - Test", axes[1,0])
plot_decision_boundary(X_test_2.values, y_test_2.values,
knn_5, "KNN 5 - Test", axes[1,1])
plt.show()
print("Train KNN 2:", knn_2.score(X_train_2, y_train_2))
print("Test KNN 2:", knn_2.score(X_test_2, y_test_2))
print("Train KNN 5:", knn_5.score(X_train_2, y_train_2))
print("Test KNN 5:", knn_5.score(X_test_2, y_test_2))
Now we will dive into the last part of HW 7:
# Average cost per patient: each prediction of class 0 incurs a fixed cost of 1000, and
# each non-zero prediction that is wrong incurs cost_val.
cost_func = lambda y_true, y_pred, cost_val: float((1000*np.sum(y_pred == 0) +
    cost_val*np.sum(y_true[y_pred != 0] != y_pred[y_pred != 0])))/y_true.shape[0]
logregcv = LogisticRegressionCV(multi_class = 'ovr')
logregcv.fit(X_train_2, y_train_2)
print("Cost of OvR Classifier: ", cost_func(y_test_2, logregcv.predict(X_test_2), 5000))
largest_probs = np.max(logregcv.predict_proba(X_test_2), axis=1)
y_hat = logregcv.predict(X_test_2)
threshold = 0.7
for k, val in enumerate(largest_probs):
    if val < threshold:
        y_hat[k] = 0
print("Cost at cutoff",threshold, ":",cost_func(y_test_2, y_hat, 5000))
We will now move on to the Midterm 2 solutions from last year:
Part I: Diagnosing the Simian Flu 2016
You are given the early data for an outbreak of a dangerous virus originating from a group of primates being kept in a Massachusetts biomedical research lab; this virus is dubbed the "Simian Flu".
You have the medical records of $n$ patients in 'flu_train.csv'. There are two general types of patients in the data, flu patients and healthy patients (recorded in the column labeled flu, where 0 indicates absence of the virus and 1 indicates presence). Furthermore, scientists have found that there are two strains of the virus, each requiring a different type of treatment (recorded in the column labeled flutype, where 1 indicates absence of the virus, 2 indicates presence of strain 1, and 3 indicates presence of strain 2).
Your task: build a model to predict if a given patient has the flu. Your goal is to catch as many flu patients as possible without misdiagnosing too many healthy patients.
The deliverable: a function called flu_predict which satisfies:
- input: x_test, a set of medical predictors for a group of patients
- output: y_pred, a set of labels, one for each patient; 0 for healthy and 1 for infected with the flu virus
The MA state government will use your model to diagnose sets of future patients (held by us). You can expect that there will be an increase in the number of flu patients in any groups of patients in the future.
We provide you with some benchmarks for comparison.
Baseline Model:
- ~50% expected accuracy on healthy patients in observed data
- ~50% expected accuracy on flu patients in observed data
- ~50% expected accuracy on healthy patients in future data
- ~50% expected accuracy on flu patients in future data
- time to build: 5 min
Reasonable Model:
- ~69% expected accuracy on healthy patients in observed data
- ~55% expected accuracy on flu patients in observed data
- ~69% expected accuracy on healthy patients in future data
- ~60% expected accuracy on flu patients in future data
- time to build: 20 min
Grading: Your grade will be based on:
- your model's ability to out-perform our benchmarks
- your ability to carefully and thoroughly follow the data science pipeline (see lecture slides for definition)
- the extent to which all choices are reasonable and defensible by methods you have learned in this class
Solutions:
Step 1: Read, clean, and explore the data
flu_train = pd.read_csv('https://raw.githubusercontent.com/cs109/a-2017/master/Midterms/2016%20Midterm%202/data/flu_train.csv')
flu_test = pd.read_csv('https://raw.githubusercontent.com/cs109/a-2017/master/Midterms/2016%20Midterm%202/data/flu_test.csv')
flu_train.head()
flu_train.flu.mean(),flu_test.flu.mean()
So first, only about 6% of the patients have the flu in our training data, and in terms of the predictors, we have a mixture of categorical and numerical variables with a fair amount of both missing and semi-duplicate information. To train a model to predict whether new patients will have the flu, we'll have to convert the categorical variables to indicators and handle all of the missing values.
Let's inspect the column types and how many missing values they have:
def column_type(column, df):
    return 'categ' if df[column].dtype == np.dtype('O') else 'numer'

def nan_fraction(column, df):
    return len(df[df[column].isnull()]) * 1. / len(df)

for i, column in enumerate(flu_train.columns):
    if column == 'flu' or column == 'flutype':
        continue
    elif i % 2 == 0:
        print('')
    else:
        print('\t', end='')
    print('{:16} ({}), NaNs: {:.2f}/{:.2f} train/test'.format(
        column, column_type(column, flu_train),
        nan_fraction(column, flu_train),
        nan_fraction(column, flu_test)), end='')
There are a large number of missing values in the data; nearly all predictors have some degree of missingness. Some of these NaNs carry potentially useful information, while others do not. A NaN in the 'pregnancy' column is meaningful and informative, since patients with NaNs in that column are males, whereas NaNs in other predictors may appear at random.
As a first pass, we drop any columns with an extremely high fraction of NaNs (along with any other columns that clearly don't make sense to include, like ID). We encode the remaining categorical columns, reserving a separate category for missing values, and do mean imputation on the remaining quantitative variables.
However, there are many acceptable strategies here, as long as the appropriateness of the method in the context of the task and the data is discussed.
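For instance, one quick alternative sketch (a hypothetical quick_clean helper, not the graded solution below) is to one-hot encode the categorical columns with pd.get_dummies, giving NaN its own indicator, and median-impute the numeric columns while keeping a missingness flag:
def quick_clean(df, ref=None):
    # Illustrative alternative only: ref is the frame whose medians we impute with
    # (pass flu_train when cleaning flu_test so test NaNs use training medians).
    df = df.copy()
    ref = df if ref is None else ref
    cat_cols = list(df.select_dtypes(include=['object']).columns)
    for col in df.columns.difference(cat_cols):
        if df[col].isnull().any():
            df[col + '_missing'] = df[col].isnull().astype(int)
            df[col] = df[col].fillna(ref[col].median())
    # One-hot encode categoricals; dummy_na=True gives missing values their own column
    return pd.get_dummies(df, columns=cat_cols, dummy_na=True)
# Example usage (sketch): quick_clean(flu_test, ref=flu_train)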
def clean_and_impute(df):
    # Don't modify the original frame
    df = df.copy()
    del df['ID']
    del df['AgeMonths']  # almost a duplicate of Age but with more NaNs
    # Drop columns that are mostly missing (threshold computed on the training set)
    for column in df.columns:
        if nan_fraction(column, flu_train) > 0.75:
            print("Dropping {0}".format(column))
            del df[column]
    encode = preprocessing.LabelEncoder()
    for column in df.columns:
        if df[column].dtype == object:
            df[column] = df[column].fillna('NaN')  # categorical blanks become their own category
            df.loc[:, column] = encode.fit_transform(df[column])
        elif nan_fraction(column, flu_train) > 0:
            df[column + '_missing'] = df[column].isnull().astype(int)
            df[column] = df[column].fillna(flu_train[column].mean())
    # Move the response columns to the end
    df = df[[c for c in df.columns if c not in ['flu', 'flutype']] + ['flu', 'flutype']]
    return df
#Clean and encode
train = clean_and_impute(flu_train)
train.head()
test = clean_and_impute(flu_test)
test.head()
# Build the predictor matrices and response vectors for each set
x = train[[c for c in train.columns if c not in ['flu','flutype']]].values
y = train[['flu']].values.reshape(len(x))
x_test = test[[c for c in train.columns if c not in ['flu','flutype']]].values
y_test = test[['flu']].values.reshape(len(x_test))
print('x train shape:', x.shape)
print('x test shape:', x_test.shape)
print('train class 0: {}, train class 1: {}'.format(len(y[y==0]), len(y[y==1])))
print('test class 0: {}, test class 1: {}'.format(len(y_test[y_test==0]), len(y_test[y_test==1])))
Step 2: Model Choice
The first task is to decide which of the many classifiers we have learned this semester would best suit our task and our data.
It would be possible to do brute-force model comparison here, i.e. tune all models and compare which does best with respect to various benchmarks. However, it is also reasonable to do a first round of model comparison by running the models (with out-of-the-box parameter settings) on bootstrapped training data and eliminating models that perform very poorly.
def expected_score(model, x_train, y_train):
    # Average accuracies over 100 bootstrap samples of the training data
    overall = 0
    class_0 = 0
    class_1 = 0
    for i in range(100):
        np.random.seed(i)
        sample = np.random.choice(len(x_train), len(x_train))
        x_sub_train = x_train[sample]
        y_sub_train = y_train[sample]
        overall += model.score(x_sub_train, y_sub_train)
        class_0 += model.score(x_sub_train[y_sub_train == 0], y_sub_train[y_sub_train == 0])
        class_1 += model.score(x_sub_train[y_sub_train == 1], y_sub_train[y_sub_train == 1])
    return pd.Series([overall / 100.,
                      class_0 / 100.,
                      class_1 / 100.],
                     index=['overall accuracy', 'accuracy on class 0', 'accuracy on class 1'])
score = lambda model, x_train, y_train: pd.Series([model.score(x_train, y_train),
model.score(x_train[y_train==0], y_train[y_train==0]),
model.score(x_train[y_train==1], y_train[y_train==1])],
index=['overall accuracy', 'accuracy on class 0', 'accuracy on class 1'])
#KNN
knn = KNN(n_neighbors=2)
knn.fit(x, y)
knn_scores = score(knn, x, y)
print('knn')
#Unweighted logistic regression
unweighted_logistic = LogisticRegression(C=1000)
unweighted_logistic.fit(x, y)
unweighted_log_scores = score(unweighted_logistic, x, y)
print('unweighted log')
#Weighted logistic regression
weighted_logistic = LogisticRegression(C=1000, class_weight='balanced')
weighted_logistic.fit(x, y)
weighted_log_scores = score(weighted_logistic, x, y)
print('weighted log')
#LDA
lda = LDA()
lda.fit(x, y)
lda_scores = score(lda, x, y)
print('lda')
#QDA
qda = QDA()
qda.fit(x, y)
qda_scores = score(qda, x, y)
print('qda')
#Decision Tree
tree = DecisionTree(max_depth=50, class_weight='balanced', criterion='entropy')
tree.fit(x, y)
tree_scores = score(tree, x, y)
print('tree')
#Random Forest
rf = RandomForest(class_weight='balanced')
rf.fit(x, y)
rf_scores = score(rf, x, y)
print('rf')
#SVC
svc = SVC(C=100, class_weight='balanced')
svc.fit(x, y)
svc_scores = score(svc, x, y)
print('svc')
#Score Dataframe
score_df = pd.DataFrame({'knn': knn_scores,
'unweighted logistic': unweighted_log_scores,
'weighted logistic': weighted_log_scores,
'lda': lda_scores,
'qda': qda_scores,
'tree': tree_scores,
'rf': rf_scores,
'svc': svc_scores})
score_df
It looks like we can rule out KNN, LDA and unweighted logistic.
What we do: We are going to pick weighted logistic regression and just tune the regularization parameter to beat the test benchmarks. Can you figure out why we chose this course of action? Hint: Instead of looking at overall accuracy, look at accuracy on class 1.
What's probably good to do: QDA, random forest, tree, SVC, and weighted logistic all beat our train benchmarks as is; we could tune each of them and then pick the model and parameter set with the highest CV accuracy.
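For intuition on why the weighted model handles the rare flu class well: per the sklearn documentation, class_weight='balanced' reweights each class by n_samples / (n_classes * count of that class), so the small class 1 gets a much larger weight in the loss. A quick sketch of those weights on our training labels (counts and balanced_weights are our own names):
# Weights implied by class_weight='balanced' on the training labels
counts = np.bincount(y.astype(int))
balanced_weights = len(y) / (2 * counts)  # n_samples / (n_classes * count per class)
print(dict(enumerate(balanced_weights)))  # the flu class (1) gets a weight far above class 0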
Cs = 10.**np.arange(-3, 4, 1)
scores = []
for C in Cs:
    print('C:', C)
    weighted_log_scores = np.array([0., 0., 0.])
    kf = KFold(n_splits=10, shuffle=True, random_state=10)
    for train_index, test_index in kf.split(x):
        x_validate_train, x_validate_test = x[train_index], x[test_index]
        y_validate_train, y_validate_test = y[train_index], y[test_index]
        weighted_logistic = LogisticRegression(C=C, class_weight='balanced')
        weighted_logistic.fit(x_validate_train, y_validate_train)
        weighted_log_scores += score(weighted_logistic, x_validate_test, y_validate_test).values
    scores.append(weighted_log_scores / 10.)
scores = pd.DataFrame(np.array(scores).T, columns=[str(C) for C in Cs],
                      index=['overall accuracy', 'accuracy on class 0', 'accuracy on class 1'])
What C means here: The C parameter in LogisticRegression is the inverse of the regularization parameter $\lambda$; in other words, C $= \frac{1}{\lambda}$. As C gets larger, $\lambda$ gets smaller. So when we set C to be very large in earlier homework assignments to make sure there was no regularization, that was the same as sending $\lambda$ toward zero, and when $\lambda$ is zero no regularization is applied.
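Concretely, for the binary L2-penalized case, sklearn's documentation gives the objective (with labels $y_i \in \{-1, 1\}$):

$$\min_{w,\, b} \;\; \frac{1}{2} w^T w \; + \; C \sum_{i=1}^{n} \log\left(1 + e^{-y_i (x_i^T w + b)}\right)$$

A large $C$ lets the data-fit term dominate (little regularization), while a small $C$ shrinks $w$ toward zero.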
print(scores.T['accuracy on class 1'].max())
scores
To beat the future benchmark, we'll select the parameter which yields the highest accuracy on class 1 (while still beating the benchmark on class 0).
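One way to automate that selection from the cross-validation table above (a sketch; cv, candidates, and best_C are our own names, and 0.69 is the class-0 benchmark of the reasonable model):
# Among C values whose CV accuracy on class 0 beats the 0.69 benchmark,
# pick the one with the highest CV accuracy on class 1.
cv = scores.T  # rows: C values (as strings), columns: the accuracy measures
candidates = cv[cv['accuracy on class 0'] > 0.69]
best_C = float(candidates['accuracy on class 1'].idxmax())
print('selected C:', best_C)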
Now let's test our model on the test data:
#Weighted logistic regression
weighted_logistic = LogisticRegression(C=10, class_weight='balanced')
weighted_logistic.fit(x, y)
weighted_log_scores = score(weighted_logistic, x_test, y_test)
weighted_log_scores
Yay, we beat all the benchmarks!
Part II: Diagnosing Strains of the Simian Flu
From a public health perspective, we want to balance the cost of vaccinations, early interventions and the cost of treating flu complications of unvaccinated people.
There are two different strains of the flu: strain 1 has a cheaper early intervention as well as a cheaper treatment for flu complications, but patients with strain 1 have a higher rate of developing complications if treated with the wrong intervention. Strain 2 has a more expensive early intervention as well as a more costly treatment for flu complications, but patients with strain 2 have a lower rate of developing complications if treated with the wrong intervention. With no intervention, flu patients develop complications at the same rate regardless of the strain.
Your task: build a model to predict if a given patient has the flu and identify the flu strain. The state government of MA will use your model to inform public health policies: we will vaccinate people you've identified as healthy and apply corresponding interventions to patients with different strains of the flu. We have provided you with a function to compute the total expected cost of this policy decision that takes into account the cost of the vaccine, the interventions and the cost of the treatments for flu complications resulting from misdiagnosing patients. Your goal is to make sure your model produces a public health policy with the lowest associated expected cost.
The deliverable: a function called flu_predict which satisfies:
- input: x_test, a set of medical predictors for a group of patients
- output: y_pred, a set of labels, one for each patient; 0 for healthy, 1 for infected with strain 1, and 2 for infected with strain 2
The MA state government will use your model to diagnose sets of future patients (held by us). You can expect that there will be an increase in the number of flu patients in any groups of patients in the future.
We provide you with some benchmarks for comparison.
Three Baseline Models:
- expected cost on observed data: $6,818,206, $7,035,735, $8,297,197.5
- these values correspond to the all healthy, all 1, and all 2 classifiers, respectively.
- time to build: 1 min
Reasonable Model:
- expected cost on observed data: $6,300,000
- time to build: 20 min
Grading: Your grade will be based on:
- your model's ability to out-perform our benchmarks
- your ability to carefully and thoroughly follow the data science pipeline (see lecture slides for definition)
- the extent to which all choices are reasonable and defensible by methods you have learned in this class
#-------- cost
# A function that computes the expected cost of the public health policy based on the
# classifications generated by your model
# Input:
# y_true (true class labels: 0, 1, 2)
# y_pred (predicted class labels: 0, 1, 2)
# Returns:
# total_cost (expected total cost)
def cost(y_true, y_pred):
    cost_of_treatment_1 = 29500
    cost_of_treatment_2 = 45000
    cost_of_intervention_1 = 4150
    cost_of_intervention_2 = 4250
    cost_of_vaccine = 15
    prob_complications_untreated = 0.65
    prob_complications_1 = 0.30
    prob_complications_2 = 0.15
    trials = 1000

    intervention_cost = cost_of_intervention_1 * len(y_pred[y_pred == 1]) + cost_of_intervention_2 * len(y_pred[y_pred == 2])
    vaccine_cost = cost_of_vaccine * len(y_pred[y_pred == 0])

    # Patients given the wrong intervention (false_neg_*) or no intervention at all (untreated_*)
    false_neg_1 = ((y_true == 1) & (y_pred == 2)).sum()
    false_neg_2 = ((y_true == 2) & (y_pred == 1)).sum()
    untreated_1 = ((y_true == 1) & (y_pred == 0)).sum()
    untreated_2 = ((y_true == 2) & (y_pred == 0)).sum()

    # Simulate whether each misdiagnosed patient develops complications, and the treatment cost
    false_neg_1_cost = np.random.binomial(1, prob_complications_1, (false_neg_1, trials)) * cost_of_treatment_1
    false_neg_2_cost = np.random.binomial(1, prob_complications_2, (false_neg_2, trials)) * cost_of_treatment_2
    untreated_1_cost = np.random.binomial(1, prob_complications_untreated, (untreated_1, trials)) * cost_of_treatment_1
    untreated_2_cost = np.random.binomial(1, prob_complications_untreated, (untreated_2, trials)) * cost_of_treatment_2

    false_neg_1_cost = false_neg_1_cost.sum(axis=0)
    expected_false_neg_1_cost = false_neg_1_cost.mean()
    false_neg_2_cost = false_neg_2_cost.sum(axis=0)
    expected_false_neg_2_cost = false_neg_2_cost.mean()
    untreated_1_cost = untreated_1_cost.sum(axis=0)
    expected_untreated_1_cost = untreated_1_cost.mean()
    untreated_2_cost = untreated_2_cost.sum(axis=0)
    expected_untreated_2_cost = untreated_2_cost.mean()

    total_cost = (vaccine_cost + intervention_cost + expected_false_neg_1_cost +
                  expected_false_neg_2_cost + expected_untreated_1_cost + expected_untreated_2_cost)
    return total_cost
We're just going to take the weighted logistic model, again, and tune the regularization parameter to both beat the benchmark on the observed data and minimize expected cost on unseen data (i.e. to prevent overfitting). Instead of using 'balanced' class weights, we're using a custom weighting scheme for the three classes (this parameter should really be tuned!).
It would probably also be good to go through the whole "choose a model, tune the model" process again, this time to minimize cost.
Note: Be aware that the cost is now sensitive to sample size! The smaller the pool of patients, the lower the cost. If you evaluate cost on a small held-out set, you can make the cost look artificially small. The benchmarks we give are for the test set.
x = train.values[:, :-2]
y = train.values[:, -1]
y = y - 1
x_test = test.values[:, :-2]
y_test = test.values[:, -1]
y_test = y_test - 1
We can take a look at our label frequencies; notice how imbalanced they are.
np.bincount(y.astype(int))/len(y)
We'll define a function for computing the accuracy rates so that we can conveniently call it later.
score = lambda model, x_test, y_test: pd.Series([model.score(x_test, y_test),
model.score(x_test[y_test==0], y_test[y_test==0]),
model.score(x_test[y_test==1], y_test[y_test==1]),
model.score(x_test[y_test==2], y_test[y_test==2]),
cost(y_test, model.predict(x_test))],
index=['overall accuracy', 'accuracy on class 0', 'accuracy on class 1', 'accuracy on class 2', 'total cost'])
Let’s make some benchmarks to compare the cost of our model against...
First, let's cost a perfect model...
l = [('Perfect', cost(y, y), cost(y_test, y_test))]
print('minimum cost on train:', cost(y, y))
print('minimum cost on test:', cost(y_test, y_test))
What if we presume that all patients are healthy?
l.append(('All Healthy', cost(y, np.array([0] * len(y))), cost(y_test, np.array([0] * len(y_test)))))
print('simple model cost on train (all 0):', cost(y, np.array([0] * len(y))))
print('simple model cost on test (all 0):', cost(y_test, np.array([0] * len(y_test))))
All have strain 1 ...
l.append(('All 1', cost(y, np.array([1] * len(y))),cost(y_test, np.array([1] * len(y_test)))))
print('simple model cost on train (all 1):', cost(y, np.array([1] * len(y))))
print('simple model cost on test: (all 1):', cost(y_test, np.array([1] * len(y_test))))
All have strain 2 ...
l.append(('All 2', cost(y, np.array([2] * len(y))),cost(y_test, np.array([2] * len(y_test)))))
print('simple model cost on train (all 2):', cost(y, np.array([2] * len(y))))
print('simple model cost on test (all 2):', cost(y_test, np.array([2] * len(y_test))))
pd.DataFrame(l,columns=["Model","Cost on Train (Mil.)","Cost on Test (Mil.)"]).set_index('Model')/10**6
Now let’s cost out the models we trained above…
models = {'knn': knn,
'unweighted logistic': unweighted_logistic,
'weighted logistic': weighted_logistic,
'lda': lda,
'qda': qda,
'tree': tree,
'rf': rf,
'svc': svc}
for k in models:
    print("\n\n**************")
    print(k, "Train")
    mod = models[k]
    mod.fit(x, y)
    scores = score(mod, x, y)
    print(scores.round(decimals=3))
    train_cost = scores['total cost']
    scores = score(mod, x_test, y_test)
    print(k, "Test")
    print(scores.round(decimals=3))
    test_cost = scores['total cost']
    l.append((k, train_cost, test_cost))
pd.DataFrame(l,columns=["Model","Cost on Train (Mil.)","Cost on Test (Mil.)"]).set_index('Model')/10**6
Weighted Logistic Regression
Now let’s see if we can beat our benchmarks…
Cs = 10.**np.arange(-3, 4, 1)
scores = []
for C in Cs:
    print('C:', C)
    weighted_log_scores = np.array([0., 0., 0., 0., 0.])
    kf = KFold(n_splits=10, shuffle=True, random_state=10)
    for train_index, test_index in kf.split(x):
        x_validate_train, x_validate_test = x[train_index], x[test_index]
        y_validate_train, y_validate_test = y[train_index], y[test_index]
        weighted_logistic = LogisticRegression(C=C, class_weight={0: 0.7, 1: 10, 2: 10})
        weighted_logistic.fit(x_validate_train, y_validate_train)
        weighted_log_scores += score(weighted_logistic, x_validate_test, y_validate_test).values
    scores.append(weighted_log_scores / 10.)
scores = pd.DataFrame(np.array(scores).T, columns=[str(C) for C in Cs],
                      index=['overall accuracy', 'accuracy on class 0', 'accuracy on class 1', 'accuracy on class 2', 'total cost'])
scores
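As noted above, the class weights should really be tuned rather than fixed at {0: 0.7, 1: 10, 2: 10}. A rough sketch of how that could be done (weight_grid, cv_costs, and best_weights are our own names; this is not part of the original solution) is to cross-validate a small grid of weight dictionaries and keep the one with the lowest average cost:
# Sketch: grid-search the class-weight dictionary by cross-validated cost
weight_grid = [{0: 0.7, 1: 5, 2: 5}, {0: 0.7, 1: 10, 2: 10}, {0: 1, 1: 20, 2: 20}]
cv_costs = []
for weights in weight_grid:
    fold_cost = 0.
    kf = KFold(n_splits=10, shuffle=True, random_state=10)
    for train_index, test_index in kf.split(x):
        model = LogisticRegression(C=10, class_weight=weights)
        model.fit(x[train_index], y[train_index])
        fold_cost += cost(y[test_index], model.predict(x[test_index]))
    cv_costs.append(fold_cost / 10.)
best_weights = weight_grid[int(np.argmin(cv_costs))]
print('lowest CV cost:', min(cv_costs), 'with weights', best_weights)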
#Weighted logistic regression on training set
weighted_logistic = LogisticRegression(C=10, class_weight={0:0.7, 1:10, 2:10})
weighted_logistic.fit(x, y)
weighted_log_scores = score(weighted_logistic, x, y)
weighted_log_scores.round(decimals=3)
#Weighted logistic regression on test set
weighted_log_scores = score(weighted_logistic, x_test, y_test)
weighted_log_scores.round(decimals=3)
l.append(('Tuned Weighted logistic regression', score(weighted_logistic, x, y)['total cost'],score(weighted_logistic, x_test, y_test)['total cost']))
pd.DataFrame(l,columns=["Model","Cost on Train (Mil.)","Cost on Test (Mil.)"]).set_index('Model')/10**6
Yay! We beat the benchmarks on the observed data and did pretty well on test data!