CS109A Introduction to Data Science
Standard Section 5: Logistic Regression and PCA¶
Harvard University
Fall 2018
Section Leaders: Mehul Smriti Raje, Ken Arnold, Karan Motwani, Cecilia Garraffo
Instructors: Pavlos Protopapas, Kevin Rader
#RUN THIS CELL
from IPython.core.display import HTML
HTML('')
In this section we will be covering Logistic Regression and PCA using the flowers (iris) dataset. The goal is to get you familiarized with the implementation of Logistic Regression and PCA through a dataset we haven't seen before. The exercises below will help you answer parts of Homework 5.
Specifically, we will:
1. Import the data and explore the predictors' relationships with the response.
2. Fit a linear regression model for classification, understand its drawbacks and interpret the results.
3. Fit a logistic regression model for classification, compare performance and interpret the results.
4. Use PCA to reduce dimensionality, fit a model to the PCA components and compare performance.
For this section we will be using the following packages:
# Data and Stats packages
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.api import OLS
from sklearn import metrics, datasets
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.preprocessing import PolynomialFeatures
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import MinMaxScaler
# Visualization packages
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
matplotlib.rcParams['figure.figsize'] = (13.0, 6.0)
# Other
import itertools
import tqdm
# Aesthetic settings
from IPython.display import display
pd.set_option('display.max_columns', 999)
pd.set_option('display.width', 500)
sns.set_style('whitegrid')
sns.set_context('talk')
Part One¶
Import Dataset¶
#Import dataframe
flowers_df = pd.read_csv('flowers.csv')
del flowers_df['Unnamed: 0']
display(flowers_df.head())
As we can see, species here is the response variable, which defines the flower type, and the 4 predictor variables are flower characteristics.
#Convert species into a binary response (True if versicolor) and describe the dataset
flowers_df['species'] = flowers_df['species'] == 'versicolor'
display(flowers_df.describe())
Explore Data¶
Check which variables have high correlations and distinctive patterns with the response. Any patterns worth mentioning?
sns.pairplot(flowers_df, hue="species")
plt.show()
This plot is great because it not only shows pairwise correlations but also each predictor's relationship with the response. The diagonal shows the histogram of each predictor, split by the response class. You can re-create the diagonal with the code below.¶
#Explore dataset for relationships with Species
subset_species_0, subset_species_1 = flowers_df[flowers_df['species'] == 0], flowers_df[flowers_df['species'] == 1]
#Get set of predictors
columns = list(flowers_df.columns)
columns.remove('species')
#Plot relationship
fig, ax = plt.subplots(2, 2, figsize=(15, 15))
alpha = 0.4
count = 0
for i in range(2):
    for j in range(2):
        ax[i, j].hist(subset_species_0[columns[count]], alpha=alpha, label='Species = 0', density=True)
        ax[i, j].hist(subset_species_1[columns[count]], alpha=alpha, label='Species = 1', density=True)
        ax[i, j].set_title(columns[count]+' vs Species')
        ax[i, j].set_xlabel(columns[count])
        ax[i, j].set_ylabel('Density')
        ax[i, j].legend()
        count += 1
plt.show()
We can observe that petal_length and petal_width are useful distinguishing predictors for flower species.
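For a compact numeric view of the correlations mentioned above, a correlation heatmap is one option; a minimal sketch using the pandas/seaborn imports already loaded (the binary species column is cast to int so it is included):
#Sketch: correlation matrix of the predictors and the binary response
corr = flowers_df.assign(species=flowers_df['species'].astype(int)).corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Pairwise Correlations')
plt.show()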
Split and Normalize Data¶
#Split data into train and test
np.random.seed(1001)
msk = np.random.rand(len(flowers_df)) < 0.7
data_train = flowers_df[msk]
data_test = flowers_df[~msk]
#Split predictor and response columns
x_train, y_train = data_train.drop(['species'], axis=1), data_train['species']
x_test, y_test = data_test.drop(['species'], axis=1), data_test['species']
print("Length of Training Set :",len(data_train))
print("Length of Testing Set :",len(data_test))
What is the right way to normalize?¶
#Function to normalize data
def normalize(df, train_dict, flag=True):
    for i in df.columns:
        if flag:
            #When normalizing the training set, record each column's (min, max)
            train_dict[i] = (df[i].min(), df[i].max())
        #Scale every column using the training set's (min, max)
        df[i] = (df[i] - train_dict[i][0]) / (train_dict[i][1] - train_dict[i][0])
    return df, train_dict
#Normalize train and get characteristics of training columns
x_train, train_dict = normalize(x_train, {}, True)
x_test, train_dict = normalize(x_test, train_dict, False)
Peeking at any information from the test set is cheating; at most, the training set should be used for data exploration. No information, including the min, max, mean, or std, should come from the test set or be influenced by it. Thus, the right way to normalize is to store the characteristics of the training set and apply them to both datasets independently.
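As an aside, sklearn's MinMaxScaler (imported above but not yet used) follows the same fit-on-train, transform-both pattern; a minimal sketch on the raw (un-normalized) predictor columns, shown for comparison only since it returns NumPy arrays rather than DataFrames:
#Sketch: the same min-max normalization using sklearn's MinMaxScaler
#fit() learns min/max from the training predictors only; transform() applies them to both sets
scaler = MinMaxScaler()
x_train_scaled = scaler.fit_transform(data_train.drop(['species'], axis=1))
x_test_scaled = scaler.transform(data_test.drop(['species'], axis=1))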
print("Training Dataframe :")
display(x_train.describe())
print("\n\nTesting Dataframe :")
display(x_test.describe())
Some maximum values in the Testing Dataframe are not 1 and some minimum values are greater than 0. Is this wrong? What could explain this? Could we see negative values in the test set, or values greater than 1?
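One quick check is to look for normalized test values outside [0, 1] directly; values outside that range are possible (and not wrong), because the scaling uses only the training set's min and max:
#Sketch: check whether the normalized test set falls outside the training range [0, 1]
print("Any test values below 0 :", (x_test < 0).any().any())
print("Any test values above 1 :", (x_test > 1).any().any())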
Part Two¶
Perform Classification using Multiple Linear Regression¶
#Add constant to x_train and x_test
x_train_cst = sm.add_constant(x_train)
x_test_cst = sm.add_constant(x_test)
#Training
model = OLS(y_train, x_train_cst).fit()
#Predict
y_pred_train = model.predict(x_train_cst)>0.5
y_pred_test = model.predict(x_test_cst)>0.5
#Performance Evaluation
train_score = accuracy_score(y_train, y_pred_train)*100
test_score = accuracy_score(y_test, y_pred_test)*100
print("Training Set Accuracy:",str(train_score)+'%')
print("Testing Set Accuracy:",str(test_score)+'%')
#Get significant predictors and Confidence Intervals using Stats Model Summary
print("Significant Predictors :\n",)
print(model.pvalues[model.pvalues<0.01],"\n")
thresh = 0.01
intervals = model.conf_int(alpha=thresh)
intervals = intervals.rename(index=str, columns={0:str(thresh/2*100)+"%",1:str((1-thresh/2)*100)+"%"})
print("Confidence Intervals:",)
display(intervals)
Here, we have used a stricter significance threshold (p < 0.01) since we have very few datapoints. Does this make sense?
petal_width seems like the only significant predictor. Is this intuitive given the histograms above?
#Get Performance by Class (Lookup Confusion Matrix)
pd.crosstab(y_test, y_pred_test, margins=True, rownames=['Actual'], colnames=['Predicted'])
Can you calculate True Positive Rate? False Positive Rate?
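As a reference, these rates can also be computed from sklearn's confusion matrix; a minimal sketch, treating species = 1 as the positive class:
#Sketch: True Positive Rate and False Positive Rate for the test predictions
tn, fp, fn, tp = metrics.confusion_matrix(y_test, y_pred_test).ravel()
print("True Positive Rate  :", tp / (tp + fn))
print("False Positive Rate :", fp / (fp + tn))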
If this was a medical test for cancer, what do we want ideally, Low False Positive rate or Low False Negative rate? What are the implications?
Part Three¶
Perform Classification using Multiple Logistic Regression¶
#Add constant to x_train and x_test
x_train_cst = sm.add_constant(x_train)
x_test_cst = sm.add_constant(x_test)
#Training
model = LogisticRegression(C=100000).fit(x_train_cst, y_train)
#Predict
y_pred_train = model.predict(x_train_cst)
y_pred_test = model.predict(x_test_cst)
#Performance Evaluation
train_score = accuracy_score(y_train, y_pred_train)*100
test_score = accuracy_score(y_test, y_pred_test)*100
print("Training Set Accuracy:",str(train_score)+'%')
print("Testing Set Accuracy:",str(test_score)+'%')
#Get Performance by Class
pd.crosstab(y_test, y_pred_test, margins=True, rownames=['Actual'], colnames=['Predicted'])
#Helper function for visualization
def visualize_prob(model, x, y, ax):
    """
    Function Name : visualize_prob
    Description : A function to visualize the probabilities predicted by a Logistic Regression model
    Input:
        model (Logistic regression model)
        x (n x d array of predictors in training data)
        y (n x 1 array of response variable vals in training data: 0 or 1)
        ax (an axis object to generate the plot)
    """
    #Use the model to predict probabilities for x
    y_pred = model.predict_proba(x)
    #Separate the predictions on the label 1 and label 0 points
    ypos = y_pred[y==1]
    yneg = y_pred[y==0]
    #Count the number of label 1 and label 0 points
    npos = ypos.shape[0]
    nneg = yneg.shape[0]
    #Plot the probabilities on a vertical line at x = 0,
    #with the positive points in blue and negative points in red
    pos_handle = ax.plot(np.zeros((npos,1)), ypos[:,1], 'bo', label = 'Species 1', alpha=0.2)
    neg_handle = ax.plot(np.zeros((nneg,1)), yneg[:,1], 'ro', label = 'Species 0', alpha=0.2)
    #Line to mark prob 0.5
    ax.axhline(y = 0.5, color = 'k', linestyle = '--')
    #Add y-label and legend, do not display x-axis
    ax.set_ylabel('Probability of Species Class')
    ax.legend(loc = 'best')
    ax.get_xaxis().set_visible(False)
#Create Plot
fig = plt.figure(figsize=(12,6))
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
#Plot Training
visualize_prob(model, x_train_cst, y_train, ax1)
ax1.set_title('Training Dataset')
#Plot Testing
visualize_prob(model, x_test_cst, y_test, ax2)
ax2.set_title('Testing Dataset')
plt.show()
Does Logistic Regression perform better or worse? Has it overfitted to the training set? What would points near the 0.5 line mean?
Analyzing Significance of Coefficients¶
How many of the coefficients estimated by the multiple logistic regression model are significantly different from zero at the 95% confidence level?
#Creating model
model = LogisticRegression(C=100000)
#Initializing variables
bootstrap_iterations = 1000
coeffs = np.zeros((bootstrap_iterations, data_train.shape[1]-1))
#Conduct bootstrapping iterations
for i in range(bootstrap_iterations):
    temp = data_train.sample(frac=1, replace=True)
    response_variable = temp['species']
    temp = temp.drop(['species'], axis=1)
    model.fit(temp, response_variable)
    coeffs[i,:] = model.coef_
#Find Significant Columns, Count
coeffs_count, significant_cols = 0, []
for i in range(coeffs.shape[1]):
    coeff_samples = coeffs[:,i]
    lower_bound = np.percentile(coeff_samples, 2.5)
    upper_bound = np.percentile(coeff_samples, 97.5)
    if lower_bound>0 or upper_bound<0:
        coeffs_count += 1
        significant_cols.append(data_train.columns[i])
print('Columns :', significant_cols)
print('Count of 95% statistically significant coefficients :', coeffs_count)
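An alternative to bootstrapping is to fit an unregularized logistic regression in statsmodels, whose summary reports a p-value and confidence interval for every coefficient; a minimal sketch (assuming the classes are not perfectly separable, in which case the fit may not converge):
#Sketch: coefficient p-values and confidence intervals via statsmodels Logit
logit_model = sm.Logit(y_train.astype(int), x_train_cst).fit(disp=False)
print(logit_model.summary())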
Part Four¶
Although PCA is typically used for dimensionality reduction when there are many predictors, let's apply it to this problem. Find the number of principal components needed to explain 85% of the variance in the data, then compare the performance of a model fitted to those components.
#Create and fit PCA object
pca = PCA()
pca.fit(x_train)
#Transforming x_train and x_test
x_train_pca = pca.transform(x_train)
x_test_pca = pca.transform(x_test)
#Find number of components that explain predefined variance threshold
sum_variance, component_count = 0, 0
while sum_variance < 0.85:
    sum_variance += pca.explained_variance_ratio_[component_count]
    component_count += 1
print('Number of Principal Components that explain >=85% of Variance: ', component_count)
print('Total Variance Explained by '+str(component_count)+' components:', str(sum_variance*100)+'%')
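Equivalently, the same count can be read off the cumulative explained variance ratio in one step; a minimal sketch:
#Sketch: number of components needed via cumulative explained variance
cum_var = np.cumsum(pca.explained_variance_ratio_)
print('Components needed for >=85% of Variance:', np.argmax(cum_var >= 0.85) + 1)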
#Create and fit PCA object
pca = PCA(n_components=component_count)
pca.fit(x_train)
#Transforming x_train and x_test
x_train_pca = pca.transform(x_train)
x_test_pca = pca.transform(x_test)
#Add constant to x_train and x_test
x_train_pca_cst = sm.add_constant(x_train_pca)
x_test_pca_cst = sm.add_constant(x_test_pca)
#Training
model = LogisticRegression(C=100000).fit(x_train_pca_cst, y_train)
#Predict
y_pred_train = model.predict(x_train_pca_cst)
y_pred_test = model.predict(x_test_pca_cst)
#Performance Evaluation
train_score = accuracy_score(y_train, y_pred_train)*100
test_score = accuracy_score(y_test, y_pred_test)*100
print("Training Set Accuracy:",str(train_score)+'%')
print("Testing Set Accuracy:",str(test_score)+'%')
We observe that 2 principal components were enough to match the accuracy of the logistic regression fitted on all predictors. Imagine the reduction in model complexity when we have thousands of predictors but a few principal components can match their performance. The drawback? Interpretability.
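One way to recover some interpretability is to inspect the loadings of the retained components, i.e. how much each original predictor contributes to each PC; a minimal sketch:
#Sketch: loadings of each retained principal component on the original predictors
loadings = pd.DataFrame(pca.components_,
                        columns=x_train.columns,
                        index=['PC'+str(i+1) for i in range(pca.n_components_)])
display(loadings)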
#Create Plot
fig = plt.figure(figsize=(12,6))
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
#Plot Training
visualize_prob(model, x_train_pca_cst, y_train, ax1)
ax1.set_title('Training Dataset')
#Plot Testing
visualize_prob(model, x_test_pca_cst, y_test, ax2)
ax2.set_title('Testing Dataset')
plt.show()
#Get Performance by Class
pd.crosstab(y_test, y_pred_test, margins=True, rownames=['Actual'], colnames=['Predicted'])
Where does PCA perform better or worse? Does trying to explain variance sacrifice the predictive power of the original predictors? Is the variance we ignore, the last 15%, better left out for the generalizability of the model, i.e. can ignoring it prevent overfitting to the training set?
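One way to probe the overfitting question is to compare cross-validated accuracies on the training set, with all predictors versus the retained principal components; a minimal sketch using cross_val_score (already imported):
#Sketch: 5-fold cross-validated accuracy, all predictors vs. PCA components
cv_full = cross_val_score(LogisticRegression(C=100000), x_train_cst, y_train, cv=5)
cv_pca = cross_val_score(LogisticRegression(C=100000), x_train_pca_cst, y_train, cv=5)
print("CV Accuracy (all predictors) :", cv_full.mean())
print("CV Accuracy (PCA components) :", cv_pca.mean())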