Title: Lecture Notebook
Hints:
Statsmodels
statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.
Basic code structure is shown below:
import statsmodels.api as sm
# X is our dataset of predictors, y is the response
# Add intercept (bias constant):
X = sm.add_constant(X)
# Fit regression model:
results = sm.OLS(y, X).fit()
# Inspect the results:
print(results.summary())
CS109A Introduction to Data Science
Principal Component Analysis
Harvard University
Fall 2021
Instructors: Pavlos Protopapas, Natesh Pillai
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy
import seaborn as sns
import plotly
import plotly.graph_objs as go
import sklearn as sk
from sklearn.decomposition import PCA
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
pd.set_option("display.width", 500)
pd.set_option("display.max_columns", 100)
sns.set_style("darkgrid")
sns.set_palette("colorblind")
PCA
Part 0: Reading the data
In this notebook, we will be using a Heart dataset. The variables we will be using today include:

AHD: whether or not the patient presents atherosclerotic heart disease (a heart attack): Yes or No
Sex: a binary indicator for whether the patient is male (Sex=1) or female (Sex=0)
Age: age of patient, in years
MaxHR: the maximum heart rate of the patient based on exercise testing
RestBP: the resting systolic blood pressure of the patient
Chol: the HDL cholesterol level of the patient
Oldpeak: ST depression induced by exercise relative to rest (on an ECG)
Slope: the slope of the peak exercise ST segment (1 = upsloping; 2 = flat; 3 = downsloping)
Ca: number of major vessels (0-3) colored by fluoroscopy

For further information on the dataset, please see the UC Irvine Machine Learning Repository.
df_heart = pd.read_csv("Heart.csv")
# Force the response into a binary indicator:
df_heart["AHD"] = (df_heart["AHD"] == "Yes").astype("int")
print(df_heart.shape)
df_heart.head()
df_heart.drop(columns = ['Unnamed: 0'], inplace=True)
df_heart
Here are some basic summaries and EDA from last time:
df_heart.describe()
pd.crosstab(df_heart["Sex"], df_heart["AHD"])
pd.crosstab(df_heart["Thal"], df_heart["AHD"])
pd.crosstab(df_heart["ChestPain"], df_heart["AHD"])
_ = sns.histplot(data=df_heart, x="Age", hue="AHD")
_ = sns.histplot(data=df_heart, x="MaxHR", hue="AHD")
Part 1: Principal Components Analysis (PCA)
Q1.1 Just a sidebar (and a curiosity): what happens when two identical copies of a predictor are used in linear regression? Is an error raised? Should one be? Investigate by predicting AHD from two copies of Age, and compare to the simple linear regression model with Age alone.
X = sm.add_constant(df_heart[["Age"]])
y = df_heart["AHD"]
reg1 = sm.OLS(y, X).fit()
reg1.summary()
Solution:
The single coefficient for Age is distributed equally across the two predictors. This is a very reasonable approach as predictions will still be stable.
# investigating what happens when two identical predictors are used
######
# your code here
######
X = sm.add_constant(df_heart[["Age", "Age"]])
reg2 = sm.OLS(y, X).fit()
print(reg2.summary())
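As a quick check (an added sketch, not from the original notebook): even though the coefficient on Age is split across the duplicated columns, the two models should produce identical fitted values, since both design matrices span the same column space.
# The duplicated-predictor model spans the same column space as the single-Age model,
# so the fitted values should agree up to numerical precision.
print(np.allclose(reg1.fittedvalues, reg2.fittedvalues))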
We will apply PCA to the heart dataset using just 7 predictors (remember: PCA is typically used when dimensionality is high, i.e., lots of predictors, but this smaller set will help us get our heads around what is going on):
columns = ["Age", "RestBP", "Chol", "MaxHR", "Sex", "Oldpeak", "Slope"]
X = df_heart[columns]
y = df_heart["AHD"]
X.describe()
X.corr()
First, let's fit the full linear regression model to predict AHD from the 7 predictors above.
Remember: PCA is an approach to handling the predictors, so it does not matter if we are using it for a regression or classification type problem.
reg_full = sm.OLS(y, sm.add_constant(X)).fit()
reg_full.summary()
Q1.2 Is there any evidence of multicollinearity in the set of predictors? How do you know? How will PCA handle these correlations?
Solution:
# VIF dataframe
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
# calculating VIF for each feature
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))]
vif_data
The high VIFs indicate that there is multicollinearity among these predictors. PCA will handle these correlations by replacing the original, correlated predictors with orthogonal (uncorrelated) linear combinations of them.
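As a rough screen (an added sketch; the VIF > 5 cutoff is a common rule of thumb, not something from the original notebook), we can flag the offending features directly:
# Features whose VIF exceeds the common rule-of-thumb threshold of 5
print(vif_data[vif_data["VIF"] > 5])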
Part 2: PCA in Regression (PCR)
Next we apply the PCA transformation in a few steps, and show some of the results below:
# create/fit the 'full' pca transformation
pca = PCA().fit(X)
# apply the pca transformation to the full predictor set
pcaX = pca.transform(X)
# convert to a data frame
pcr_columns = ["PCA1" , "PCA2", "PCA3", "PCA4", "PCA5", "PCA6", "PCA7"]
pcaX_df = pd.DataFrame(pcaX, columns=pcr_columns)
# here are the weighting (eigen-vectors) of the variables (first 2 at least)
print("First PCA Component (w1):", pca.components_[0,:])
print("Second PCA Component (w2):", pca.components_[1,:])
pcaX_df
# here is the variance explained:
print("Variance explained by each component:", pca.explained_variance_ratio_)
blue = sns.color_palette("colorblind")[0]
sns.barplot(y=list(range(1,8)), x=pca.explained_variance_ratio_, orient="h", color=blue)
plt.xscale("log")
_ = sns.barplot(y=list(range(1,8)), x=pca.explained_variance_ratio_, orient="h", color=blue)
# create/fit the pca transformation on the standardized predictors
Z = sk.preprocessing.StandardScaler().fit(X).transform(X)
pca_standard = PCA().fit(Z)
pcaZ = pca_standard.transform(Z)
# convert to a data frame
pcaZ_df = pd.DataFrame(pcaZ, columns=pcr_columns)
print(pca_standard.components_.shape)
print(pcaZ.shape)
pd.DataFrame.from_dict({"Variable": X.columns,
"PCA1": pca.components_[0],
"PCA2": pca.components_[1],
"PCA-Z1": pca_standard.components_[0],
"PCA-Z2": pca_standard.components_[1]})
Q2.3 Interpret the results above. What does $w_1$ represent? Why do the values make sense? What do its squared values sum up to? Why does this make sense?
Solution:
$w_1$ represents the transformation (change of basis) that converts the columns of $\mathbf{X}$ into the first PCA vector, $z_1$. Its elements, after squaring, sum to 1, so each magnitude represents the Euclidean weight of a variable in the transformation (a larger value means more weight in the transformation).
np.sum(pca.components_[0,:]**2)
It is common for a model with high-dimensional data (lots of predictors) to be plotted along the first 2 PCA components (with the classification boundaries added). Below is the scatter plot for these data (without a classification boundary, since we do not have a model yet):
# Plot the response over the first 2 PCA component vectors
sns.scatterplot(data=pcaX_df, x="PCA1", y="PCA2", hue=df_heart["AHD"], legend="full")
plt.xlabel("First PCA Component Vector (Z1)")
plt.ylabel("Second PCA Component Vector (Z2)");
Q2.4 What would a classification boundary look like if a linear regression model were fit using the first 2 principal components as the predictors? Does there appear to be good potential here?
Solution:
It would again be linear. Here, most likely the boundary would be a line with negative slope.
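A minimal sketch of that boundary (an addition, not in the original notebook): fit a linear regression of AHD on the first two components and overlay the line where the fitted value equals 0.5.
# Fit a linear (probability) regression on the first two PCA components
reg_pc2 = sm.OLS(y, sm.add_constant(pcaX_df[["PCA1", "PCA2"]])).fit()
b0, b1, b2 = reg_pc2.params

# Boundary: b0 + b1*Z1 + b2*Z2 = 0.5  =>  Z2 = (0.5 - b0 - b1*Z1) / b2
z1_grid = np.linspace(pcaX_df["PCA1"].min(), pcaX_df["PCA1"].max(), 100)
z2_boundary = (0.5 - b0 - b1 * z1_grid) / b2

sns.scatterplot(data=pcaX_df, x="PCA1", y="PCA2", hue=df_heart["AHD"], legend="full")
plt.plot(z1_grid, z2_boundary, "k--", label="fitted value = 0.5")
plt.ylim(pcaX_df["PCA2"].min(), pcaX_df["PCA2"].max())
plt.xlabel("First PCA Component Vector (Z1)")
plt.ylabel("Second PCA Component Vector (Z2)")
plt.legend();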
Below is the result of the PCR-1 (linear) model to predict AHD from the first principal component vector.
X = sm.add_constant(pcaX_df[["PCA1"]])
reg_pcr1 = sm.OLS(y, X).fit()
reg_pcr1.summary()
print("First PCA Component (w1):", pca.components_[0:1,:])
Q2.5 What does this PCR-1 model tell us about how the predictors relate to the response (aka, estimate the coefficient(s) in the original predictor space)? Is it truly a simple linear regression model in the original predictor space?
beta = reg_pcr1.params[1]
(beta*pca.components_[0:1,:])
Solution:
The estimated slope from PCR1 ($\hat{\beta} \approx 0.0009$) is distributed across the 7 actual predictors, so that the formula would be:
$$\begin{aligned}
\hat{y} &= 0.0009(Z_1) + 0.4587 = 0.0009(w^T_1\mathbf{X}) + 0.4587 \\
&= 0.0009(0.0384X_1+0.0505X_2+0.998X_3-0.00374X_4-0.0018X_5+0.00115X_6-0.0000036X_7) + 0.4587 \\
&= 3.31 \cdot 10^{-5} X_1 + 4.35 \cdot 10^{-5} X_2 + 8.6 \cdot 10^{-4} X_3 - 3.23 \cdot 10^{-6} X_4 - 1.56 \cdot 10^{-6} X_5 + 9.955 \cdot 10^{-7} X_6 - 3.1 \cdot 10^{-9} X_7 + 0.4587
\end{aligned}$$

This is how to interpret the estimated coefficients from a regression with PCA components as the predictors: some transformation back to the original space is required.
Here is the above calculation for all 7 PCR linear regressions, with the results then plotted on a pretty plot:
results_arr = []
for i in range(1, 8):
reg_pcr_tmp = sm.OLS(y, sm.add_constant(pcaX_df[pcr_columns[:i]])).fit()
pcr_tmp = np.transpose(pca.components_[:i,:]) @ reg_pcr_tmp.params[1:i+1]
results_arr.append(pcr_tmp)
betas = reg_full.params[1:]
results_arr.append(betas)
results = np.vstack(results_arr)
print(results)
plt.plot(pcr_columns + ["Linear"], results)
plt.ylabel("Back-calculated Beta Coefficients")
plt.legend(columns)  # one line per original predictor
Q2.6 Interpret the plot above. Specifically, compare how each PCA vector "contributes" to the original linear regression model using all 7 original predictors. How does PCR-7 compare to the original linear regression model (in estimated coefficients)?
Solution:
This plot shows that as more PCA vectors are included in the PCA regression, the estimated $\beta$s from the original regression model are recovered: if PCR($p$) is used (where $p$ is the number of predictors we started with), the two models are mathematically equivalent.
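A quick numerical check of that equivalence (an added sketch): the last two rows of the results array, the back-transformed PCR-7 coefficients and the original OLS coefficients, should agree up to floating-point error.
# PCR with all 7 components recovers the full OLS coefficients (should print True)
print(np.allclose(results[-2], results[-1]))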
All of this PCA work should have been done using the standardized versions of the predictors. Below is the code that does exactly that:
X = df_heart[columns]
scaler = sk.preprocessing.StandardScaler()
Z = scaler.fit_transform(X)
pca = PCA().fit(Z)
pcaZ = pca.transform(Z)
pcaZ_df = pd.DataFrame(pcaZ, columns=pcr_columns)
print("First PCA Component (w1):", pca.components_[0,:])
print("Second PCA Component (w2):", pca.components_[1,:])
regZ_full = sm.OLS(y, sm.add_constant(pd.DataFrame(Z, columns=columns))).fit()
regZ_full.summary()
# Fit the PCR
results_arr = []
for i in range(1, 8):
reg_pcrZ_tmp = sm.OLS(y, sm.add_constant(pcaZ_df[pcr_columns[:i]])).fit()
pcrZ_tmp = np.transpose(pca.components_[:i,:]) @ reg_pcrZ_tmp.params[1:i+1]
results_arr.append(pcrZ_tmp)
betasZ = regZ_full.params[1:]
results_arr.append(betasZ)
resultsZ = np.vstack(results_arr)
print(resultsZ)
plt.plot(pcr_columns + ["Linear"],resultsZ)
plt.ylabel("Back-calculated Beta Coefficients");
plt.legend(X.columns);
Q2.7 Compare this plot to the previous one. Why does this plot make sense? What does this illustrate?
Solution:
This plot shows that the components are now more evenly composed of the predictors, rather than the first component being dominated by the predictor with the most variability. The 7 lines move more similarly here than in the previous plot, where each component was essentially tied to a single predictor.
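One way to see this numerically (an added sketch): compare the proportion of variance explained before and after standardization. Without scaling, the first component is dominated by Chol, the predictor with the largest variance; with scaling, the variance is spread far more evenly across components.
# Variance explained without standardization (first component dominated by Chol)
print("Unscaled:", PCA().fit(df_heart[columns]).explained_variance_ratio_.round(3))
# Variance explained after standardization (the current pca, which was fit on Z)
print("Scaled:  ", pca.explained_variance_ratio_.round(3))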
Part 3: Underlying Math

What is PCA doing with these eigenvectors? Why does it all work? To answer these questions, it is easiest to restrict ourselves to two dimensions so that we can easily visualize what is happening. We will focus on Age and MaxHR, because there is a clear negative relationship between these two due to biology.
_ = sns.scatterplot(data=df_heart, x="Age", y="MaxHR")
Note: be careful when looking at plots with unequal axes, as relationships can be obscured. Once we set the axes to equal proportions, the strongly linear relationship reveals itself.
sns.scatterplot(data=df_heart, x="Age", y="MaxHR")
_ = plt.axis("equal")
Now, let's suppose we wanted to summarize these data. How would you do this? One way is to give the direction that explains the greatest variance. Why variance? The sample (co)variance matrix is $S = \frac{(X-\mu)^\top(X-\mu)}{n-1}$, which centers the data for us, and so the direction of greatest variance really describes the direction in which the data tend to spread. In practice, we can drop the $(n-1)$ in the denominator because the operations that follow are scale invariant.
X = df_heart[["Age", "MaxHR"]].values
mu = np.mean(X, axis=0)
S = (X - mu).T @ (X - mu)
S
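As a sanity check (an added sketch), this matrix matches numpy's covariance computation once we put back the $(n-1)$ factor we dropped:
# np.cov divides by (n - 1); multiplying it back recovers S
print(np.allclose(S, np.cov(X, rowvar=False) * (len(X) - 1)))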
As the variance is now a matrix, finding the direction of greatest variance is equivalent to solving $\max_{\lVert w \rVert = 1} w^\top S w$ (the variance of the data projected onto $w$), and the maximizer turns out to be the eigenvector of $S$ associated with its largest eigenvalue. So, the direction of largest variance is simply the leading eigenvector of $S$.
eigen_values, eigen_vectors = scipy.linalg.eig(S)
w_1 = eigen_vectors[:, np.argmax(eigen_values)]
w_1
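To ground the claim above (an added sketch): $w_1$ is a unit vector and satisfies the eigenvector equation $S w_1 = \lambda_{\max} w_1$.
lam_max = np.real(np.max(eigen_values))
print(np.isclose(np.linalg.norm(w_1), 1.0))   # unit norm
print(np.allclose(S @ w_1, lam_max * w_1))    # eigenvector equation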
To find the direction of second-greatest variance, we should remove the effect of the first direction. This is done by projecting $X$ onto $w_1$ and subtracting that projection from $X$.
X_hat = X - X @ np.outer(w_1, w_1)
mu_hat = np.mean(X_hat, axis=0)
S_hat = (X_hat - mu_hat).T @ (X_hat - mu_hat)
S_hat
By the same reasoning as before, the leading eigenvector of this new matrix will correspond to the second direction.
eigen_values, eigen_vectors = scipy.linalg.eig(S_hat)
w_2 = eigen_vectors[:, np.argmax(eigen_values)]
w_2
np.sqrt(np.max(eigen_values))
Comparing to the results from sklearn, we see that these correspond precisely (up to sign) to the components retrieved by PCA. Thus, PCA can be viewed as an iterative procedure: find the direction of greatest variance as the leading eigenvector of the sample covariance matrix, remove its effect via projection, then repeat.
pca = PCA().fit(X)
pca.components_
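As a quick check (an added sketch; eigenvectors are only defined up to sign, so we compare absolute values), our hand-computed directions match sklearn's components:
# Compare manual eigenvector directions to sklearn's PCA components, ignoring sign
print(np.allclose(np.abs(np.real(w_1)), np.abs(pca.components_[0])))
print(np.allclose(np.abs(np.real(w_2)), np.abs(pca.components_[1])))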
To gain a geometric intuition, let us plot the points in the original space (blue) and after projecting out the largest eigenvector (orange), alongside both eigenvectors. What we notice is that the direction of the first eigenvector is indeed responsible for most of the variance: just by looking, the original data have a spread of roughly 100, whereas the orange points have a spread closer to 50. The second thing to notice is that both eigenvectors together completely summarize the data. That is, after also removing the second eigenvector, the data would collapse to the origin. Thus, using all the eigenvectors (PCA components) ultimately retains the information in the original data.
sns.scatterplot(x=X[:,0], y=X[:,1])
sns.scatterplot(x=X_hat[:,0], y=X_hat[:,1])
x = np.stack([X[:,0], X_hat[:,0]]).T
y = np.stack([X[:,1], X_hat[:,1]]).T
for i in range(len(X)//20):
sns.lineplot(x=x[i], y=y[i], color="k")
x = [0, -100*w_1[0]]
y = [0, -100*w_1[1]]
sns.lineplot(x=x, y=y)
x = [0, 100*w_2[0]]
y = [0, 100*w_2[1]]
sns.lineplot(x=x, y=y)
plt.xlabel("Age")
plt.ylabel("MaxHR")
_ = plt.axis("equal")
Let us repeat this procedure for data in 3 dimensions, so that you can try to extend the visualization to higher dimensions. Here, we switch to plotly because it handles 3D much better. You will immediately notice the linear relationship between all 3 variables.
X = df_heart[["Age", "MaxHR", "RestBP"]].values
# Configure Plotly to be rendered inline in the notebook.
plotly.offline.init_notebook_mode()
# Configure the trace.
trace = go.Scatter3d(
x=X[:,0],
y=X[:,1],
z=X[:,2],
mode="markers",
marker={
"size": 10,
"opacity": 0.8,
}
)
# Configure the layout.
layout = go.Layout(
margin={"l": 0, "r": 0, "b": 0, "t": 0},
scene=go.layout.Scene(
xaxis=go.layout.scene.XAxis(title="Age"),
yaxis=go.layout.scene.YAxis(title="MaxHR"),
zaxis=go.layout.scene.ZAxis(title="RestBP")
)
)
data = [trace]
plot_figure = go.Figure(data=data, layout=layout)
# Render the plot.
plotly.offline.iplot(plot_figure)
Now, let's do the PCA procedure.
X = df_heart[["Age", "MaxHR", "RestBP"]].values
X_orig = X.copy()
# Configure Plotly to be rendered inline in the notebook.
plotly.offline.init_notebook_mode()
# Configure the trace.
trace = go.Scatter3d(
x=X[:,0],
y=X[:,1],
z=X[:,2],
mode="markers",
name="Original Data",
marker={
"size": 10,
"opacity": 0.8,
}
)
# Configure the layout.
layout = go.Layout(
margin={"l": 0, "r": 0, "b": 0, "t": 0},
scene=go.layout.Scene(
xaxis=go.layout.scene.XAxis(title="Age"),
yaxis=go.layout.scene.YAxis(title="MaxHR"),
zaxis=go.layout.scene.ZAxis(title="RestBP")
)
)
data = [trace]
plot_figure = go.Figure(data=data, layout=layout)
## First projection
mu = np.mean(X, axis=0)
S = (X - mu).T @ (X - mu)
eigen_values, eigen_vectors = scipy.linalg.eig(S)
w_1 = eigen_vectors[:, np.argmax(eigen_values)]
X_prev = X.copy()
X = X_orig - X_orig @ np.outer(w_1, w_1)
# Configure the trace.
trace = go.Scatter3d(
x=X[:,0],
y=X[:,1],
z=X[:,2],
mode="markers",
name="First Projection",
marker={
"size": 10,
"opacity": 0.8,
}
)
data.append(trace)
x_lines = []
y_lines = []
z_lines = []
#create the coordinate list for the lines
for i in range(len(X)//10):
trace = go.Scatter3d(
x=[X_prev[i,0], X[i,0]],
y=[X_prev[i,1], X[i,1]],
z=[X_prev[i,2], X[i,2]],
mode="lines",
showlegend=False,
line=go.scatter3d.Line(color="black")
)
data.append(trace)
## Second projection
mu = np.mean(X, axis=0)
S = (X - mu).T @ (X - mu)
eigen_values, eigen_vectors = scipy.linalg.eig(S)
w_2 = eigen_vectors[:, np.argmax(eigen_values)]
X_prev = X.copy()
X = X_orig - X_orig @ np.outer(w_1, w_1) - X_orig @ np.outer(w_2, w_2)
# Configure the trace.
trace = go.Scatter3d(
x=X[:,0],
y=X[:,1],
z=X[:,2],
mode="markers",
name="Second Projection",
marker={
"size": 10,
"opacity": 0.8,
}
)
data.append(trace)
#create the coordinate list for the lines
for i in range(len(X)//10):
trace = go.Scatter3d(
x=[X_prev[i,0], X[i,0]],
y=[X_prev[i,1], X[i,1]],
z=[X_prev[i,2], X[i,2]],
mode="lines",
showlegend=False,
line=go.scatter3d.Line(color="black")
)
data.append(trace)
## Third projection
mu = np.mean(X, axis=0)
S = (X - mu).T @ (X - mu)
eigen_values, eigen_vectors = scipy.linalg.eig(S)
w_3 = eigen_vectors[:, np.argmax(eigen_values)]
## Eigenvectors
trace = go.Scatter3d(
x=[0, 200*w_1[0]],
y=[0, 200*w_1[1]],
z=[0, 200*w_1[2]],
mode="lines",
name="First Eigenvector",
line=go.scatter3d.Line(color="blue")
)
data.append(trace)
trace = go.Scatter3d(
x=[0, 200*w_2[0]],
y=[0, 200*w_2[1]],
z=[0, 200*w_2[2]],
mode="lines",
name="Second Eigenvector",
line=go.scatter3d.Line(color="red")
)
data.append(trace)
trace = go.Scatter3d(
x=[0, 200*w_3[0]],
y=[0, 200*w_3[1]],
z=[0, 200*w_3[2]],
mode="lines",
name="Third Eigenvector",
line=go.scatter3d.Line(color="green")
)
data.append(trace)
plot_figure = go.Figure(data=data, layout=layout)
# Render the plot.
plotly.offline.iplot(plot_figure)
Notice again that the lines are parallel to each other. Now, the original data, a cloud of points in 3D, first gets projected to a plane (the red points), then projected to a line (the green points). Imagine first squishing a ball of Play-Doh into a pancake and then squishing the pancake from the outside to form a rope. This is exactly what PCA is doing, except that it does the squishing in the directions of maximal variance. Why is this all useful? Because, instead of projecting the points as we have done, we can perform dimensionality reduction by projecting the original data onto a subset of the eigenvectors. For example, we can take our 3D cloud of points and reduce the dimensionality by 66% by keeping only the first eigenvector, which is responsible for most of the variance and so retains most of the information in these features.
X = df_heart[["Age", "MaxHR", "RestBP"]].values
X_orig = X.copy()
# Configure Plotly to be rendered inline in the notebook.
plotly.offline.init_notebook_mode()
# Configure the trace.
trace = go.Scatter3d(
x=X[:,0],
y=X[:,1],
z=X[:,2],
mode="markers",
name="Original Data",
marker={
"size": 10,
"opacity": 0.8,
}
)
# Configure the layout.
layout = go.Layout(
margin={"l": 0, "r": 0, "b": 0, "t": 0},
scene=go.layout.Scene(
xaxis=go.layout.scene.XAxis(title="Age"),
yaxis=go.layout.scene.YAxis(title="MaxHR"),
zaxis=go.layout.scene.ZAxis(title="RestBP")
)
)
data = [trace]
plot_figure = go.Figure(data=data, layout=layout)
## Remove smallest eigenvalues
X = X_orig - X_orig @ np.outer(w_2, w_2) - X_orig @ np.outer(w_3, w_3)
# Configure the trace.
trace = go.Scatter3d(
x=X[:,0],
y=X[:,1],
z=X[:,2],
mode="markers",
name="Data Along First Eigenvector",
marker={
"size": 10,
"opacity": 0.8,
}
)
data.append(trace)
plot_figure = go.Figure(data=data, layout=layout)
# Render the plot.
plotly.offline.iplot(plot_figure)
As you can see, the red line is only 1-dimensional, but the spread between its two furthest points is almost equal to that of the original data. We can use this idea to construct curves of how many dimensions are needed to retain a certain amount of variance. For instance, suppose we want to reduce the dimensionality of our 7-dimensional dataset so that we may visualize the data more readily, and suppose furthermore that we want to keep 90% of the original variance.
columns = ["Age", "RestBP", "Chol", "MaxHR", "Sex", "Oldpeak", "Slope"]
X = df_heart[columns].values
mu = np.mean(X, axis=0)
S = (X - mu).T @ (X - mu) / (len(X) - 1)
total_variance = np.diag(S).sum()
print(f"Total variance is: {total_variance}")
n = len(columns)
var_arr = []
eigenvector_arr = []
for i in range(n):
eigen_values, eigen_vectors = scipy.linalg.eig(S)
w = np.real(eigen_vectors[:, np.argmax(eigen_values)])
eigenvector_arr.append(w)
X = X - X @ np.outer(w, w)
mu = np.mean(X, axis=0)
S = (X - mu).T @ (X - mu) / (len(X) - 1)
variance = np.diag(S).sum()
var_arr.append((total_variance-variance)/total_variance)
sns.lineplot(x=list(range(1, n+1)), y=var_arr)
plt.xlabel("Number of Components")
plt.ylabel("Proportion of Total Variance")
print(var_arr)
So we see that 2 dimensions keep almost 90% of the original variance, and jumping to 3 dimensions keeps about 98% of it. Of course, this is a very common task for PCA and so it is provided directly by many packages.
X = df_heart[columns].values
pca = PCA().fit(X)
print(pca.explained_variance_ratio_)
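Reading off the number of components needed for a given variance threshold is then a one-liner (an added sketch using the cumulative sum):
cum_var = np.cumsum(pca.explained_variance_ratio_)
print(cum_var.round(3))
# Smallest number of components whose cumulative ratio reaches 90%
print("Components needed for 90%:", np.argmax(cum_var >= 0.90) + 1)
sklearn can also do this selection for you: passing a fraction such as PCA(n_components=0.90) keeps just enough components to explain that share of the variance.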
So, we can compress our data to 2 dimensions and visualize our 7-dimensional dataset.
n_components = 2
X_hat = pca.transform(X)[:,:n_components]
sns.scatterplot(x=X_hat[:,0], y=X_hat[:,1], hue=df_heart["AHD"], legend="full")
plt.xlabel("First PCA Component Vector (Z1)")
plt.ylabel("Second PCA Component Vector (Z2)");
_ = plt.axis("equal")
To summarize, PCA is not a tool to help you make better predictions; it cannot be, because it consists only of linear transformations of the data. However, it gives us a way to compress the data and to visualize it without losing much information. Through the lens of compression, PCA can be thought of as feature engineering: the new, compressed features retain much of the information in the original dataset while using far fewer dimensions.