


Title

Pre-Class Notebook

Description:

This notebook is from CS109A. You can run it on Ed. If you choose to download, make sure you download the Heart-2.csv file as well.

This is material you should be familiar with. It is provided here as a review.

In [1]:
%matplotlib inline
import sys
import numpy as np
import pylab as pl
import pandas as pd
import sklearn as sk
import statsmodels.api as sm
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
import sklearn.metrics as met


pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)

PCA

Part 0: Reading the data

In this notebook, we will be using the same Heart dataset from last lecture. As a reminder, the variables we will be using today include:

  • AHD: whether or not the patient presents atherosclerotic heart disease (a heart attack): Yes or No
  • Sex: a binary indicator for whether the patient is male (Sex=1) or female (Sex=0)
  • Age: age of patient, in years
  • MaxHR: the maximum heart rate of the patient based on exercise testing
  • RestBP: the resting systolic blood pressure of the patient
  • Chol: the HDL cholesterol level of the patient
  • Oldpeak: ST depression induced by exercise relative to rest (on an ECG)
  • Slope: the slope of the peak exercise ST segment (1 = upsloping; 2 = flat; 3 = downsloping)
  • Ca: number of major vessels (0-3) colored by fluoroscopy

For further information on the dataset, please see the UC Irvine Machine Learning Repository.

In [2]:
df_heart = pd.read_csv('Heart-2.csv')

# Force the response into a binary indicator:
df_heart['AHD'] = 1*(df_heart['AHD'] == "Yes")

print(df_heart.shape)
df_heart.head()
(303, 15)
Out[2]:
Unnamed: 0 Age Sex ChestPain RestBP Chol Fbs RestECG MaxHR ExAng Oldpeak Slope Ca Thal AHD
0 1 63 1 typical 145 233 1 2 150 0 2.3 3 0.0 fixed 0
1 2 67 1 asymptomatic 160 286 0 2 108 1 1.5 2 3.0 normal 1
2 3 67 1 asymptomatic 120 229 0 2 129 1 2.6 2 2.0 reversable 1
3 4 37 1 nonanginal 130 250 0 0 187 0 3.5 3 0.0 normal 0
4 5 41 0 nontypical 130 204 0 2 172 0 1.4 1 0.0 normal 0

Here are some basic summaries and EDA we've seen before:
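
For example, a minimal sketch of such summaries (reusing df_heart and the imports from above; these particular plots are illustrative, not the original cell):

# Sketch: quick numeric summary and a couple of simple plots for df_heart
print(df_heart.describe())

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(df_heart['Age'], bins=20)
ax1.set_xlabel("Age")
ax1.set_ylabel("Count")
ax2.scatter(df_heart['MaxHR'], df_heart['Age'], c=df_heart['AHD'])
ax2.set_xlabel("MaxHR")
ax2.set_ylabel("Age");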


Part 1: Principal Components Analysis (PCA)

Q1.1 Just a sidebar (and a curiosity): what happens when two identical copies of a predictor are used in logistic regression? Is an error raised? Should one be? Investigate by predicting AHD from two copies of Age, and compare to the simple logistic regression model with Age alone.

In [3]:
y = df_heart['AHD']

logit1 = LogisticRegression(penalty="none",solver="lbfgs").fit(df_heart[['Age']],y)

# investigating what happens when two identical predictors of 'Age' are used
logit2 = LogisticRegression(penalty="none",solver="lbfgs").fit(df_heart[['Age','Age']],y)

print("The coef estimate for Age (when in the model once):",logit1.coef_)
print("The coef estimates for Age (when in the model twice):",logit2.coef_)
The coef estimate for Age (when in the model once): [[0.05198618]]
The coef estimates for Age (when in the model twice): [[0.02599311 0.02599311]]

your answer here

The single coefficient for Age is split equally across the two copies (0.026 + 0.026 = 0.052), and no error is raised. This is a reasonable way for the solver to handle the perfect collinearity: the combined effect matches the single-predictor model, so the predictions remain stable, as the quick check below illustrates.
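
A minimal check (a sketch reusing logit1, logit2, and df_heart from the cells above):

# Sketch: the duplicated-predictor model should give essentially the same predicted
# probabilities as the single-Age model, since the split coefficients sum to the original one
p1 = logit1.predict_proba(df_heart[['Age']])[:, 1]
p2 = logit2.predict_proba(df_heart[['Age', 'Age']])[:, 1]
print("Max difference in predicted probabilities:", np.abs(p1 - p2).max())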

We will apply PCA to the heart dataset using just 7 of the predictors (remember: PCA is most useful when dimensionality is high, i.e., lots of predictors, but a small set will help us get our heads around what is going on):

In [4]:
# For pedagogical purposes, let's simplify our lives and use just 7 predictors
X = df_heart[['Age','RestBP','Chol','MaxHR','Sex','Oldpeak','Slope']]
y = df_heart['AHD']
X.describe()
Out[4]:
Age RestBP Chol MaxHR Sex Oldpeak Slope
count 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000
mean 54.438944 131.689769 246.693069 149.607261 0.679868 1.039604 1.600660
std 9.038662 17.599748 51.776918 22.875003 0.467299 1.161075 0.616226
min 29.000000 94.000000 126.000000 71.000000 0.000000 0.000000 1.000000
25% 48.000000 120.000000 211.000000 133.500000 0.000000 0.000000 1.000000
50% 56.000000 130.000000 241.000000 153.000000 1.000000 0.800000 2.000000
75% 61.000000 140.000000 275.000000 166.000000 1.000000 1.600000 2.000000
max 77.000000 200.000000 564.000000 202.000000 1.000000 6.200000 3.000000
In [5]:
# Here is the table of correlations between our predictors.  This will be useful later on
X.corr()
Out[5]:
Age RestBP Chol MaxHR Sex Oldpeak Slope
Age 1.000000 0.284946 0.208950 -0.393806 -0.097542 0.203805 0.161770
RestBP 0.284946 1.000000 0.130120 -0.045351 -0.064456 0.189171 0.117382
Chol 0.208950 0.130120 1.000000 -0.003432 -0.199915 0.046564 -0.004062
MaxHR -0.393806 -0.045351 -0.003432 1.000000 -0.048663 -0.343085 -0.385601
Sex -0.097542 -0.064456 -0.199915 -0.048663 1.000000 0.102173 0.037533
Oldpeak 0.203805 0.189171 0.046564 -0.343085 0.102173 1.000000 0.577537
Slope 0.161770 0.117382 -0.004062 -0.385601 0.037533 0.577537 1.000000

Q1.2 Is there any evidence of multicollinearity in the set of predictors? How do you know? How will PCA handle these correlations?

Solution:

Yes. The correlation table above shows several moderately correlated predictors: Oldpeak and Slope have a correlation of about 0.58, and MaxHR is correlated with Age, Oldpeak, and Slope (roughly -0.34 to -0.39). PCA handles these correlations by construction: the principal components are orthogonal to one another, so the transformed predictors are uncorrelated.
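
One common numerical check (a sketch, not part of the original notebook) is the variance inflation factor for each predictor, using the statsmodels import from above; values well above 1 indicate correlated predictors.

# Sketch: variance inflation factors for the 7 predictors (values near 1 = little collinearity)
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_const = sm.add_constant(X)  # include an intercept column so the VIFs are computed correctly
for i, col in enumerate(X_const.columns):
    if col != 'const':
        print(col, round(variance_inflation_factor(X_const.values, i), 2))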

Next we apply the PCA transformation in a few steps, and show some of the results below:

In [6]:
# create/fit the 'full' pca transformation
pca = PCA().fit(X)

# apply the pca transformation to the full predictor set
pcaX = pca.transform(X)

# convert to a data frame
pcaX_df = pd.DataFrame(pcaX, columns=['PCA1', 'PCA2', 'PCA3', 'PCA4', 'PCA5', 'PCA6', 'PCA7'])

# here are the weightings (eigenvectors) of the variables (the first 2 at least)
print("First PCA Component (w1):",pca.components_[0,:])
print("Second PCA Component (w2):",pca.components_[1,:])

# here is the variance explained:
print("Variance explained by each component:",pca.explained_variance_ratio_)
First PCA Component (w1): [ 3.84007875e-02  5.04636205e-02  9.97978046e-01 -3.74374033e-03
 -1.80930152e-03  1.15452228e-03 -3.60044587e-06]
Second PCA Component (w2): [ 1.80599125e-01  1.05022594e-01 -1.59461519e-02 -9.77583674e-01
  8.45378533e-04  1.79192401e-02  1.04056605e-02]
Variance explained by each component: [7.47915963e-01 1.50222918e-01 8.52586952e-02 1.61372448e-02
 3.47699826e-04 6.25312302e-05 5.49481983e-05]
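
A quick way to visualize this (a sketch using the pca object fit above) is the cumulative proportion of variance explained:

# Sketch: cumulative proportion of variance explained by the principal components
plt.plot(np.arange(1, len(pca.explained_variance_ratio_) + 1),
         np.cumsum(pca.explained_variance_ratio_), 'o-')
plt.xlabel("Number of PCA components")
plt.ylabel("Cumulative proportion of variance explained");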

Q1.3 Now try the PCA decomposition on the standardized version of X instead.

In [7]:
### edTest(test_pcaZ) ###

# create/fit the standardized version of X
Z = sk.preprocessing.StandardScaler().fit(X).transform(X)

# create/fit the 'full' pca transformation on Z
pca_standard = PCA().fit(Z)
pcaZ = pca_standard.transform(Z)

# convert to a data frame
pcaZ_df = pd.DataFrame(pcaZ, columns=['PCA1', 'PCA2', 'PCA3', 'PCA4', 'PCA5', 'PCA6', 'PCA7'])
In [8]:
# Let's look at them to see what they are comprised of:
pd.DataFrame.from_dict({'Variable': X.columns,
                        'PCA1': pca.components_[0],
                        'PCA2': pca.components_[1],
                        'PCA-Z1': pca_standard.components_[0],
                        'PCA-Z2': pca_standard.components_[1]})
Out[8]:
Variable PCA1 PCA2 PCA-Z1 PCA-Z2
0 Age 0.038401 0.180599 0.411736 0.353496
1 RestBP 0.050464 0.105023 0.269612 0.344915
2 Chol 0.997978 -0.015946 0.123722 0.570322
3 MaxHR -0.003744 -0.977584 -0.473658 0.121837
4 Sex -0.001809 0.000845 0.010681 -0.546336
5 Oldpeak 0.001155 0.017919 0.515666 -0.221566
6 Slope -0.000004 0.010406 0.502094 -0.261511

Q2.3 Interpret the results above. What does $w_1$ represent based on the untransformed data? What does $w_1$ represent based on the standardized data? Which is a better representation of the data?

your answer here

$w_1$ represents the transformation (change of basis) that converts the columns of $\mathbf{X}$ into the first PCA vector, $z_1$. Its elements, when squared, sum to 1, so each element's magnitude represents its (Euclidean) weight in the transformation: a larger magnitude means that variable contributes more to the component.

In [9]:
np.sum(pca.components_[0,:]**2)
Out[9]:
1.0000000000000007
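
More generally, the component vectors form an orthonormal basis; a quick check (sketch, using the pca object from above):

# Sketch: W W^T should be (numerically) the identity matrix, since the components are orthonormal
W = pca.components_
print(np.allclose(W @ W.T, np.eye(W.shape[0])))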

It is common, when working with high-dimensional data (lots of predictors), to plot the observations along the first 2 PCA components (with the classification boundaries added). Below is the scatter plot for these data (without a classification boundary, since we do not have a model yet):

In [10]:
# Plot the response over the first 2 PCA component vectors

fig,(ax1,ax2) =  plt.subplots(1, 2, figsize = (12,5))

ax1.scatter(pcaX_df[['PCA1']][y==0],pcaX_df[['PCA2']][y==0])
ax1.scatter(pcaX_df[['PCA1']][y==1],pcaX_df[['PCA2']][y==1])
ax1.legend(["AHD = No","AHD = Yes"])
ax1.set_xlabel("First PCA Component Vector")
ax1.set_ylabel("Second PCA Component Vector");

ax2.scatter(pcaZ_df[['PCA1']][y==0],pcaZ_df[['PCA2']][y==0])
ax2.scatter(pcaZ_df[['PCA1']][y==1],pcaZ_df[['PCA2']][y==1])
ax2.legend(["AHD = No","AHD = Yes"])
ax2.set_xlabel("First PCA Component Vector (on standardized data)")
ax2.set_ylabel("Second PCA Component Vector (on standardized data)");
[Figure: side-by-side scatter plots of the observations on the first two PCA components, raw data (left) and standardized data (right), colored by AHD.]

Q2.4 Does there appear to be good potential for separating the two classes here? Which form of the data appears to provide better information for separating the classes in the response? What would a classification boundary look like if a logistic regression model were fit using the first 2 principal components as the predictors?

your answer here

It would again be linear in the two components; here, the boundary would most likely be a line with negative slope. The standardized version of the data is generally the better representation for this purpose, since no single high-variance predictor (here, Chol) dominates the first component. A sketch of such a boundary is shown below.
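
As an illustration (a sketch, not part of the original notebook; the names logit_pc2, x_grid, and boundary are just for this example), one could fit a logistic regression on the first two standardized-data components and draw its linear boundary:

# Sketch: logistic regression on the first 2 standardized-data PCA components, with its boundary
logit_pc2 = LogisticRegression(C=1000000, solver="lbfgs").fit(pcaZ[:, :2], y)
b0 = logit_pc2.intercept_[0]
b1, b2 = logit_pc2.coef_[0]

x_grid = np.linspace(pcaZ[:, 0].min(), pcaZ[:, 0].max(), 100)
boundary = -(b0 + b1 * x_grid) / b2  # points where the predicted probability is 0.5

plt.scatter(pcaZ[y == 0, 0], pcaZ[y == 0, 1], label="AHD = No")
plt.scatter(pcaZ[y == 1, 0], pcaZ[y == 1, 1], label="AHD = Yes")
plt.plot(x_grid, boundary, 'k--', label="Boundary")
plt.xlabel("First PCA Component Vector (on standardized data)")
plt.ylabel("Second PCA Component Vector (on standardized data)")
plt.legend();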

Part 2: PCA in Regression (PCR)

First let's fit the full logistic regression model to predict AHD from the 7 predictors above.

Remember: PCA is an approach to handling the predictors (unsupervised), so it does not matter if we are using it for a regression or classification type problem.
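
One convenient way to package these steps is a scikit-learn Pipeline; this is a sketch (the pipeline object and its step names here are illustrative, not from the original notebook), and the number of components kept is a modeling choice:

# Sketch: scaling -> PCA -> logistic regression packaged as a Pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pcr_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("logit", LogisticRegression(C=1000000, solver="lbfgs")),
]).fit(X, y)

print("Training accuracy of the 2-component PCR pipeline:", pcr_pipe.score(X, y))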

In [11]:
# fit the 'full' model on the 7 predictors and print out the coefficients
logit_full = LogisticRegression(penalty="none",solver="lbfgs",max_iter=2000).fit(X,y)

beta = logit_full.coef_[0]

print(beta)
[ 0.01194387  0.01540617  0.00698371 -0.03597127  1.74038737  0.55627766
  0.30007484]

Below is the result of the PCR-1 (logistic) to predict AHD from the first principal component vector.

In [12]:
logit_pcr1 = LogisticRegression(penalty="none",solver="lbfgs").fit(pcaX_df[['PCA1']],y)

print("Intercept from simple PCR-Logistic:",logit_pcr1.intercept_)
print("'Slope' from simple PCR-Logistic:", logit_pcr1.coef_)

print("First PCA Component (w1):",pca.components_[0:1,:])
Intercept from simple PCR-Logistic: [-0.16620988]
'Slope' from simple PCR-Logistic: [[0.00351119]]
First PCA Component (w1): [[ 3.84007875e-02  5.04636205e-02  9.97978046e-01 -3.74374033e-03
  -1.80930152e-03  1.15452228e-03 -3.60044587e-06]]

Q2.1 What does this PCR-1 model tell us about how the predictors relate to the response (aka, estimate the coefficient(s) in the original predictor space)? Is it truly a simple logistic regression model in the original predictor space?

In [13]:
# your code here: do a multiplication of pcr_1's coefficients times the first component vector from PCA

(logit_pcr1.coef_*pca.components_[0:1,:])
Out[13]:
array([[ 1.34832630e-04,  1.77187582e-04,  3.50409493e-03,
        -1.31450001e-05, -6.35280937e-06,  4.05375218e-06,
        -1.26418654e-08]])

your answer here

The estimated slope from PCR-1 ($\hat{\beta} = 0.00351$) is distributed across all 7 original predictors (only the first four, which carry nearly all of the weight, are written out below), so that the formula would be:

$$\hat{y} = 0.00351\,Z_1 = 0.00351\,(w^T_1\mathbf{X}) = 0.00351\,(0.0384X_1+0.0505X_2+0.998X_3-0.0037X_4+\cdots) \\ = 0.000135X_1+0.000177X_2+0.00350X_3-0.0000131X_4+\cdots$$

This is how to interpret the estimated coefficients from a regression with PCA components as the predictors: some transformation back to the original space is required.
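
A quick way to confirm the back-transformation (a sketch; note that pca.transform centers the predictors, so the centered X is used here, and beta_back is a name introduced just for this example):

# Sketch: the back-calculated coefficients applied to the centered predictors reproduce
# PCR-1's linear predictor
beta_back = (logit_pcr1.coef_ * pca.components_[0:1, :]).ravel()
lin_pred_pcr = logit_pcr1.intercept_ + pcaX_df[['PCA1']].values @ logit_pcr1.coef_.T
lin_pred_back = logit_pcr1.intercept_ + (X - X.mean()).values @ beta_back
print(np.allclose(lin_pred_pcr.ravel(), lin_pred_back))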

Here is the above calculation repeated for each PCR logistic regression up to the 7th, and then plotted on a 'pretty' plot:

In [14]:
# Fit the remaining PCRs, using 2 through 7 principal components
#pcaX_df.iloc[:,np.arange(0,5)].head()

logit_pcr2 = LogisticRegression(C=1000000,solver="lbfgs").fit(pcaX_df[['PCA1','PCA2']],y)
logit_pcr3 = LogisticRegression(C=1000000,solver="lbfgs").fit(pcaX_df[['PCA1','PCA2','PCA3']],y)
logit_pcr4 = LogisticRegression(C=1000000,solver="lbfgs").fit(pcaX_df[['PCA1','PCA2','PCA3','PCA4']],y)
logit_pcr5 = LogisticRegression(C=1000000,solver="lbfgs").fit(pcaX_df[['PCA1','PCA2','PCA3','PCA4','PCA5']],y)
logit_pcr6 = LogisticRegression(C=1000000,solver="lbfgs").fit(pcaX_df[['PCA1','PCA2','PCA3','PCA4','PCA5','PCA6']],y)
logit_pcr7 = LogisticRegression(C=1000000,solver="lbfgs").fit(pcaX_df,y)

pcr1=(logit_pcr1.coef_*np.transpose(pca.components_[0:1,:])).sum(axis=1)
pcr2=(logit_pcr2.coef_*np.transpose(pca.components_[0:2,:])).sum(axis=1)
pcr3=(logit_pcr3.coef_*np.transpose(pca.components_[0:3,:])).sum(axis=1)
pcr4=(logit_pcr4.coef_*np.transpose(pca.components_[0:4,:])).sum(axis=1)
pcr5=(logit_pcr5.coef_*np.transpose(pca.components_[0:5,:])).sum(axis=1)
pcr6=(logit_pcr6.coef_*np.transpose(pca.components_[0:6,:])).sum(axis=1)
pcr7=(logit_pcr7.coef_*np.transpose(pca.components_[0:7,:])).sum(axis=1)

results = np.vstack((pcr1,pcr2,pcr3,pcr4,pcr5,pcr6,pcr7,beta))
print(results)
[[ 1.34832630e-04  1.77187582e-04  3.50409493e-03 -1.31450001e-05
  -6.35280937e-06  4.05375218e-06 -1.26418654e-08]
 [ 8.22888139e-03  4.90460029e-03  3.52035681e-03 -4.36772643e-02
   3.00832874e-05  8.05215560e-04  4.64728134e-04]
 [ 9.67851637e-03  1.57232886e-02  2.98524575e-03 -4.25103732e-02
   1.66262028e-05  9.08384148e-04  4.91486409e-04]
 [ 4.87937328e-03  1.65220860e-02  3.12832994e-03 -4.33112238e-02
   3.92241495e-05  9.15507194e-04  5.08938426e-04]
 [ 1.97160112e-03  1.21592916e-02  3.37846293e-03 -3.28220970e-02
   3.64476979e-02  6.57141425e-01  2.16558303e-01]
 [ 1.76384914e-03  1.28415055e-02  3.91486367e-03 -3.62050051e-02
   3.73981251e-01  8.31601131e-01 -3.35831180e-01]
 [ 1.19527616e-02  1.54021528e-02  6.98403516e-03 -3.59641629e-02
   1.74063117e+00  5.56305309e-01  3.00152829e-01]
 [ 1.19438739e-02  1.54061660e-02  6.98371275e-03 -3.59712705e-02
   1.74038737e+00  5.56277658e-01  3.00074835e-01]]
In [15]:
plt.plot(['PCR1','PCR2', 'PCR3', 'PCR4','PCR5','PCR6','PCR7', 'Logistic'],results)

plt.ylabel("Back-calculated Beta Coefficients");

#plt.legend(X.columns);
[Figure: back-calculated beta coefficients for PCR-1 through PCR-7 and the full logistic regression.]

Q2.5 Interpret the plot above. Specifically, compare how each PCA vector "contributes" to the original logistic regression model using all 7 original predictors. How does PCR-4 compare to the original logistic regression model (in estimated coefficients)?

your answer here

This plot shows that as more PCA vectors are included in the PCA regression, the estimated $\beta$s from the original logistic regression model are gradually recovered: if PCR($p$) is used (where $p$ is the number of predictors we started with), the two models are mathematically equivalent. PCR-4, by contrast, still differs noticeably from the full model: its back-calculated coefficients for Sex, Oldpeak, and Slope remain near zero, since those predictors only get substantial weight in the last three components.
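
As a check of that equivalence (a sketch reusing logit_full and logit_pcr7 from above):

# Sketch: PCR with all 7 components and the full logistic regression give essentially the
# same predicted probabilities (they are the same model expressed in a rotated basis)
p_full = logit_full.predict_proba(X)[:, 1]
p_pcr7 = logit_pcr7.predict_proba(pcaX_df)[:, 1]
print("Max difference in predicted probabilities:", np.abs(p_full - p_pcr7).max())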

All of this PCA work should have been done using the standardized versions of the predictors. Below is the code that does exactly that:

In [16]:
scaler = sk.preprocessing.StandardScaler()
scaler.fit(X)
Z = scaler.transform(X)
pca = PCA(n_components=7).fit(Z)
pcaZ = pca.transform(Z)
pcaZ_df = pd.DataFrame(pcaZ, columns=['PCA1', 'PCA2', 'PCA3', 'PCA4', 'PCA5', 'PCA6', 'PCA7'])

print("First PCA Component (w1):",pca.components_[0,:])
print("Second PCA Component (w2):",pca.components_[1,:])
First PCA Component (w1): [ 0.41173599  0.26961154  0.12372188 -0.47365763  0.01068056  0.51566635
  0.50209418]
Second PCA Component (w2): [ 0.35349616  0.34491485  0.57032206  0.121837   -0.54633621 -0.22156592
 -0.26151097]
In [17]:
# fit the 'full' model on the 7 standardized predictors and print out the coefficients
logit_full = LogisticRegression(C=1000000,solver="lbfgs").fit(Z,y)


betaZ = logit_full.coef_[0]

print("Logistic coef. on standardized predictors:",betaZ)
Logistic coef. on standardized predictors: [ 0.1079209   0.27060575  0.36100403 -0.82125596  0.81211581  0.64488006
  0.1846543 ]
In [18]:
# Fit the PCR
logit_pcr1Z = LogisticRegression(C=1000000,solver="lbfgs").fit(pcaZ_df[['PCA1']],y)
logit_pcr2Z = LogisticRegression(C=1000000,solver="lbfgs").fit(pcaZ_df[['PCA1','PCA2']],y)
logit_pcr3Z = LogisticRegression(C=1000000,solver="lbfgs").fit(pcaZ_df[['PCA1','PCA2','PCA3']],y)
logit_pcr4Z = LogisticRegression(C=1000000,solver="lbfgs").fit(pcaZ_df[['PCA1','PCA2','PCA3','PCA4']],y)
logit_pcr7Z = LogisticRegression(C=1000000,solver="lbfgs").fit(pcaZ_df,y)

pcr1Z=(logit_pcr1Z.coef_*np.transpose(pca.components_[0:1,:])).sum(axis=1)
pcr2Z=(logit_pcr2Z.coef_*np.transpose(pca.components_[0:2,:])).sum(axis=1)
pcr3Z=(logit_pcr3Z.coef_*np.transpose(pca.components_[0:3,:])).sum(axis=1)
pcr4Z=(logit_pcr4Z.coef_*np.transpose(pca.components_[0:4,:])).sum(axis=1)
pcr7Z=(logit_pcr7Z.coef_*np.transpose(pca.components_)).sum(axis=1)

resultsZ = np.vstack((pcr1Z,pcr2Z,pcr3Z,pcr4Z,pcr7Z,betaZ))
print(resultsZ)

plt.plot(['PCR1-Z' , 'PCR2-Z', 'PCR3-Z', 'PCR4-Z', 'PCR7-Z', 'Logistic'],resultsZ)

plt.ylabel("Back-calculated Beta Coefficients");

plt.legend(X.columns);
[[ 0.36018456  0.23585481  0.10823127 -0.4143533   0.0093433   0.45110231
   0.43922945]
 [ 0.23486422  0.11213646 -0.09984974 -0.46425637  0.21008141  0.53808503
   0.54073257]
 [ 0.25062392  0.32032679 -0.14162369 -0.38896297  0.34364814  0.56142451
   0.51311087]
 [ 0.36663298  0.29038468 -0.18487686 -0.48912818  0.40597179  0.4822163
   0.43312993]
 [ 0.1079209   0.27060575  0.36100403 -0.82125596  0.81211581  0.64488006
   0.1846543 ]
 [ 0.1079209   0.27060575  0.36100403 -0.82125596  0.81211581  0.64488006
   0.1846543 ]]
[Figure: back-calculated beta coefficients for the standardized PCRs and the full logistic regression, with a legend identifying the 7 predictors.]

Q2.6 Compare this plot to the previous one; why does this plot make sense? What does this illustrate?

Solution:

This plot shows that the components are now more evenly composed of the predictors, rather than the first component being dominated by the predictor with the most variability (Chol). The 7 lines also move more similarly to one another here than in the previous plot, where each component was essentially dominated by a single predictor.
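
One way to see this numerically (a sketch; X is the raw predictor frame and Z the standardized array from above):

# Sketch: proportion of variance explained, raw vs. standardized predictors.
# Without scaling, Chol (the largest-variance predictor) dominates the first component;
# with scaling, the variance is spread much more evenly across components.
print("Raw X:        ", PCA().fit(X).explained_variance_ratio_.round(3))
print("Standardized Z:", PCA().fit(Z).explained_variance_ratio_.round(3))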
