CS 109A/STAT 121A/AC 209A/CSCI E-109A

Standard Section 6: Logistic Regression and Bootstrapping

Harvard University
Fall 2017
Section Leaders: Albert Wu, Nathaniel Burbank
Instructors: Pavlos Protopapas, Kevin Rader, Rahul Dave, Margo Levine

**Download this notebook from the CS109 repo or here:**
**http://bit.ly/109_S6**

In this section we will cover logistic regression using a cancer dataset. The goal is to get you familiar with implementing logistic regression and bootstrapping on a dataset we haven't seen before. The exercises below will help you answer parts of Homework 5.

Specifically, we will:

1. Load in and get a feel for the prostate cancer dataset, which is similar in structure to our homework.
2. Fit a logistic regression model to only one predictor and understand how to interpret the results.
3. Compare to a linear regression model and then fit the logistic regression model to the entire dataset.
4. Use bootstrapping on the logistic regression model to construct confidence intervals for its coefficients.

For this section we will be using the following packages:

In [ ]:
import sys
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 999)
pd.set_option('display.max_columns', 999)
pd.set_option('display.width', 1000)
pd.set_option('display.notebook_repr_html', True)
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
from statsmodels.api import OLS
from statsmodels.api import add_constant
from statsmodels.regression.linear_model import RegressionResults
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.neighbors import KNeighborsRegressor
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import accuracy_score
# Note -- Requires sklearn version 0.18 or higher

from sklearn.metrics import r2_score
from collections import Counter
sns.set(style="ticks")
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

matplotlib.rcParams['figure.figsize'] = (13.0, 6.0)

assert sys.version_info.major == 3, sys.version
# Python 3 is required

Warmup – back to the Titanic dataset

In [ ]:
import statsmodels.formula.api as smf  # aliased as smf so it doesn't shadow statsmodels.api (imported as sm above)
titanic = sns.load_dataset("titanic")
titanic.head()
In [ ]:
# Subset to only cols we want
titanic = titanic[['sex', 'age', 'class', 'survived']]

# Rename 'class' col to 'pclass' to avoid namespace issues
titanic.columns = ['sex', 'age', 'pclass', 'survived']

#Drop any row with NA values in any col in dataframe
titanic = titanic.dropna()

is_female = pd.get_dummies(titanic.sex)['female']
pclass_coded = pd.get_dummies(titanic.pclass)[['First','Second']]

titanic_c = pd.concat([is_female,pclass_coded,titanic[['age','survived']]],axis=1)
titanic_c.head()
In [ ]:
train, test =  train_test_split(titanic_c, test_size=.2, random_state=123)
model = smf.logit(formula="survived ~ female + First + Second + age", data=train)
model = model.fit()
model.summary()

Note that all our predictors are significant here, and none of the confidence intervals include zero.
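
To check this directly, we can pull the p-values and 95% confidence intervals out of the fitted results (a quick sketch using the standard statsmodels results API):

In [ ]:
# Inspect the fitted coefficients directly
print(model.pvalues)     # all should be well below 0.05
print(model.conf_int())  # columns are the lower and upper 95% bounds; none should straddle zero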

Interpreting logistic regression

Being a first-class passenger increases the logit of the probability of surviving by X, relative to being a third-class passenger.

In [ ]:
# The coefficient on First is the change in the log-odds (logit) of survival for a
# first-class vs. a third-class passenger (the omitted baseline), holding the other predictors fixed
model.params['First']

The odds of a first-class passenger surviving are X times the odds of a third-class passenger surviving.

In [ ]:
# Exponentiating a logit coefficient turns it into an odds ratio
np.exp(model.params['First'])

Calculate the probability of each passenger in our training set surviving, without using model.predict

Hint: $$ p = \cfrac{e^\eta}{1+e^\eta} = \cfrac{1}{1+e^{-\eta}} $$

Where, $\eta = \alpha + \beta_1 X_1 + ... + \beta_n X_n$

In [ ]:
x_train = train[['female','First','Second','age']]  # predictors for the training set
# Build eta = alpha + beta_1*X_1 + ... + beta_n*X_n by hand, then apply the logistic function
eta = model.params['Intercept'] + np.dot(x_train.values, model.params[['female','First','Second','age']].values)
p_manual = 1 / (1 + np.exp(-eta))
p_manual[:10]
In [ ]:
# Compare with model.predict
model.predict(x_train)

Part (1): Load in our data and conduct basic EDA

We will first load in our dataset below and look at the first few rows. Then we use the describe function to get a sense of the data. NOTE: a friendly reminder that this dataset is different from the one in the homework. Please take care not to accidentally use this data for the homework!

In [ ]:
df = pd.read_csv('https://github.com/albertw1/data/raw/master/section_6_cancer_data.csv')
df.head()
In [ ]:
df.describe()

As we can see, Y here is the response variable (Y = 1 is cancer and Y = 0 is no cancer) and the 6033 predictor variables are gene expressions. The individual columns have no easily interpretable meaning; the way to think about it is that the larger the value for a gene, the more likely that gene contributes to someone having cancer. Each row represents a person.
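
Before modeling, it is worth a quick sanity check of the dimensions and the class balance (a small sketch, assuming the response column is named 'Y' as above):

In [ ]:
# Confirm that predictors far outnumber observations, and check the class balance
print(df.shape)
print(df['Y'].value_counts())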

Now, let's split this dataset up into a testing and training set using a 50-50 split.

In [ ]:
np.random.seed(9001)
msk = np.random.rand(len(df)) < 0.5
data_train_ns = df[msk] # ns for non-standardized data; we will rescale it right after this
data_test_ns = df[~msk]
data_train_ns.head()

Now let us rescale our data so that each column lies between 0 and 1 (min-max scaling).

In [ ]:
# Min-max scale each column to [0, 1]. As an assumption, each set is scaled with its own
# min and max here; reusing the training min/max on the test set is the stricter approach.
data_train = (data_train_ns - data_train_ns.min()) / (data_train_ns.max() - data_train_ns.min())
data_test = (data_test_ns - data_test_ns.min()) / (data_test_ns.max() - data_test_ns.min())

data_train['Y']=data_train_ns['Y']
data_test['Y']=data_test_ns['Y']
data_train.describe()

Now let us subset the data and create convenient formats for the train and test sets.

In [ ]:
X_train = data_train.drop('Y', axis=1)
X_test  = data_test.drop('Y', axis=1)
X_train_np = X_train.values # This is to create a numpy array version of the training and testing set.
X_test_np  = X_test.values
y_train = data_train['Y'].values
y_test = data_test['Y'].values

Part (2): Using a single predictor to fit a Linear Regression Model

Let's use a single predictor variable, Gene 10, and fit a linear regression model to it.

In [ ]:
import statsmodels.api as sm
X_train_gene_10 = sm.add_constant(X_train["Gene 10"])
X_test_gene_10 = sm.add_constant(X_test["Gene 10"])

ols = OLS(endog=y_train, exog=X_train_gene_10).fit()
ols.summary()

Let’s plot our model…

In [ ]:
sns.regplot(x=X_train["Gene 10"], y=y_train)
In [ ]:
y_hat_train = ols.predict(X_train_gene_10)
y_hat_test = ols.predict(X_test_gene_10)
print(y_hat_test)

Do you see anything "wrong" above with the predictions we made for the test set using the linear regression model? Hint: look at entry number 41.
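
One way to make the problem concrete is to check for predicted values that fall outside [0, 1], something a probability can never do (a quick check using the test-set predictions above):

In [ ]:
# Linear regression is unbounded, so some predicted "probabilities" escape [0, 1]
y_hat_test[(y_hat_test < 0) | (y_hat_test > 1)]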

Notice that we can convert the values above into binary classifications with the following code.

In [ ]:
y_hat_train_0_1 = y_hat_train >.5
y_hat_test_0_1 = y_hat_test >.5

print("Training Set Accuracy for Linear Regression: ", accuracy_score(y_train, y_hat_train_0_1))
print("Test Set Accuracy for Linear Regression: ", accuracy_score(y_test, y_hat_test_0_1))

Part (3): Using a single predictor to fit a Logistic Regression Model instead

Now we can instead fit a logistic regression model to our data and see how things change.

In [ ]:
logistic_mod = LogisticRegression(C=1000000, fit_intercept=True) # A very large C effectively turns off sklearn's default L2 regularization
X_train_mul_wcons = sm.add_constant(X_train["Gene 10"])
X_test_mul_wcons = sm.add_constant(X_test["Gene 10"])
logistic_mod.fit(X_train_mul_wcons, y_train)

y_hat_train_logistic_mod = logistic_mod.predict(X_train_mul_wcons)
y_hat_test_logistic_mod = logistic_mod.predict(X_test_mul_wcons)

print("Training Accuracy for Logistic Regression: ", accuracy_score(y_train, y_hat_train_logistic_mod))
print("Test Accuracy for Logistic Regression: ", accuracy_score(y_test, y_hat_test_logistic_mod))

As we can see, the training accuracy is basically the same, but the test accuracy with logistic regression is better because the logistic model handles the cases where Gene 10 is very large or very small: its predictions are squashed into [0, 1] rather than growing without bound. The takeaway is that logistic regression will outperform linear regression for classification when predictor values tend to be extreme.
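
To see the bounding directly, we can compare the linear model's raw predictions with the logistic model's predicted probabilities (a small sketch using the single-predictor models fit above):

In [ ]:
# Logistic probabilities always stay inside [0, 1], unlike the linear model's predictions
p_test_logistic = logistic_mod.predict_proba(X_test_mul_wcons)[:, 1]  # estimated P(Y = 1)
print("Linear regression prediction range:  ", y_hat_test.min(), y_hat_test.max())
print("Logistic regression probability range:", p_test_logistic.min(), p_test_logistic.max())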

Part (4): Using all predictor variables in a logistic regression

In [ ]:
logistic_mod_all = LogisticRegression(C=100000, fit_intercept=True)
X_train_mul_wcons = sm.add_constant(X_train)
X_test_mul_wcons = sm.add_constant(X_test)

logistic_mod_all.fit(X_train_mul_wcons, y_train)
y_hat_train_logistic_mod_all = logistic_mod_all.predict(X_train_mul_wcons)
y_hat_test_logistic_mod_all = logistic_mod_all.predict(X_test_mul_wcons)
print("Training accuracy for Logistic Regression: ", accuracy_score(y_train, y_hat_train_logistic_mod_all))
print("Test accuracy for Logistic Regression: ", accuracy_score(y_test, y_hat_test_logistic_mod_all))
In [ ]:
predicted = np.round(y_hat_test_logistic_mod_all)
expected = y_test
print(metrics.classification_report(expected, predicted))
In [ ]:
pd.crosstab(expected,predicted,margins=True, rownames=['Actual'], colnames=['Predicted'])

Using this matrix, we can see that of the people who did not have cancer, 6 were said to have cancer (false positives), while of the people who actually did have cancer, 4 were mistakenly said not to have it (false negatives). In biomedical applications, false negatives usually carry the greater consequences: a patient would generally rather be mistakenly told they had cancer, which would simply lead to more testing, than have a real cancer missed.
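
If false negatives are the costlier error, one common remedy is to lower the classification threshold below 0.5 so that borderline patients get flagged. Here is a sketch; the 0.3 cutoff is an arbitrary illustrative choice, not a recommendation:

In [ ]:
# Trade false negatives for false positives by lowering the decision threshold
p_test = logistic_mod_all.predict_proba(X_test_mul_wcons)[:, 1]  # estimated P(cancer)
y_hat_lower_thresh = (p_test > 0.3).astype(int)                  # 0.3 is a hypothetical cutoff
pd.crosstab(y_test, y_hat_lower_thresh, margins=True, rownames=['Actual'], colnames=['Predicted'])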

Part (5): Incorporating Bootstrapping in a Logistic Regression Model

The main idea behind bootstrapping is to resample our data many times. EACH time we resample, we refit the logistic regression model and record its coefficients; across all of the resamples, this gives us a confidence interval for each coefficient. Notice that we have many more predictors than observations. Intuitively, that creates a problem: we do not have enough observations to "represent" each predictor variable. The resampling gives us a way to approximate the variability we would see if we did have enough observations.

In [ ]:
B = 100 # Number of iterations

boot_coefs = np.zeros((X_train.shape[1]+1,B))

for i in range(B):
    # Resample the training rows with replacement
    boot_idx = np.random.choice(len(y_train), size=len(y_train), replace=True)
    X_boot = X_train_mul_wcons.values[boot_idx]
    y_boot = y_train[boot_idx]
    # Refit the (effectively unregularized) logistic model on the bootstrap sample;
    # the constant column is already in X, so we turn off sklearn's own intercept
    logistic_mod_boot = LogisticRegression(C=100000, fit_intercept=False)
    logistic_mod_boot.fit(X_boot, y_boot)
    boot_coefs[:,i] = logistic_mod_boot.coef_.ravel()  # coef_ has shape (1, p+1); flatten it

boot_coefs.shape

If we want to find the 95 percent confidence interval, we can use the np.percentile function to find the values that correspond to the upper and lower bounds. (Why 97.5? A 95% interval excludes 5% of the bootstrap distribution, split evenly as 2.5% in each tail, so the upper bound is the 97.5th percentile and the lower bound the 2.5th.)

In [ ]:
ci_upper = np.percentile(boot_coefs, 97.5, axis=1)
ci_lower = np.percentile(boot_coefs, 2.5, axis=1)
In [ ]:
# Count significant predictors
sig_b_ct = 0

# If a CI contains 0, that coefficient is not significant at the 5% level
for i in range(len(ci_upper)):
    if ci_upper[i]<0 or ci_lower[i]>0:
        sig_b_ct += 1

print("Significant coefficients at 5pct level = %i / %i" % (sig_b_ct, boot_coefs.shape[0]))