CS109A Introduction to Data Science

Standard Section 6: PCA and Logistic Regression

Harvard University
Fall 2019
Instructors: Pavlos Protopapas, Kevin Rader, and Chris Tanner
Section Leaders: Marios Mattheakis, Abhimanyu (Abhi) Vasishth, Robbert (Rob) Struyven

In [ ]:
import requests
from IPython.core.display import HTML
styles = requests.get("http://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text

For this section, our goal is to get you familiarized with Dimensionality Reduction using Principal Components Analysis (PCA) and to recap Logistic Regression from the last homework. This medium article was referenced extensively while creating this notebook.

Specifically, we will:

  • Understand how to define the terms big data and high-dimensionality and see the motivation for PCA
  • Learn what PCA is
  • Use PCA in order to visualize a high-dimensional problem in 2-dimensions
  • Learn about the sklearn PCA library and its nuances
  • Get familiar with the Linear Algebra of PCA
  • Meet the MNIST handwritten digit dataset (and hopefully stay friends for a while)
  • Use PCA in order to improve model training time and understand the speed-accuracy trade-off
  • Discuss when to use PCA and when not to use it

In [ ]:
# Data and stats packages
import numpy as np
import pandas as pd

# Visualization packages
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# Other packages
pd.set_option('display.max_columns', 50)
import warnings


Principal Components Analysis helps us deal with high-dimensionality in big-data.

But first...

High-dimensionality is the case when p is large i.e. there are a lot of predictors. This is sometimes a problem because:

  1. Our models may be overfit
  2. There may be multi-collinearity
  3. Matrices may not be invertible (in the case of OLS)

Our challenge: is to represent these p dimensions by a smaller number (m) dimensions without losing too much information. Then, we can fit a model using these m predictors, which addresses the three problems listed above. Here's where PCA comes into play.

What is Principal Components Analysis (PCA)?

A Framework For Dimensionality Reduction

We said that one way to reduce the dimensions of the feature space is to create a new, smaller set of predictors by taking linear combinations of the original predictors. Our original model (let's say it is a Linear Regression Model) looks like this:

$$ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \epsilon $$

We choose $Z_1$, $Z_2$,$\dots$, $Z_m$, where $m < p$ and where each $Z_i$ is a linear combination of the original p predictors, $X_1$ to $X_p$. We can say that:

$$ Z_i = \sum_{j=1}^{p} c_{ij} X_i $$

for fixed constants $c_{ij}$ (PCA can determines them). As an example, we could say that:

$$ Z_1 = 3.3 X_1 + 4 X_2 + 0 X_3 + \dots + 1.2 X_p $$

In the above equation, we see that $Z_1$ is a linear combination of the original predictors. Then we can build a linear regression model using the new predictors as follows:

$$ Y = \theta_0 + \theta_1 Z_1 + \theta_2 Z_2 + \dots + \theta_m Z_m + \epsilon $$

Notice that this model has a smaller number (m+1 < p+1) of parameters. Each $Z_i$ is called PRINCIPAL COMPONENT. The principcal components consist an $m$-dimensional orthonormal system of coordinates.

PCA is a method to identify a new set of predictors, as linear combinations of the original ones, that captures the 'maximum amount' of variance in the observed data. This is the basic assumption in the PCA.

We see that the "best line" is the one where there is maximal variance along the line. Source here.

In principle, we could explore all the rotations, that is, rotating our coordinate system under all the angles, and find which rotation yields the maximum variance or the smallest covariance. However, when the dimensionality (p) is large this is very time consuming and inefficient technique. In that case we may use PCA which is systematic way to find the best rotation or the best coordinate system. PCA is a mathematical method based on linear algebra, for more details and rigorous formulation see the notes in the advanced section for PCA.

Applications of PCA

One major application of PCA is to address the issues we pointed out earlier (reduce the number of predictors).

In addition, another major application of PCA is in visualization. Specifically, if we have an N-dimensional dataset, how do we visualize it?

One option:

A more practical option: use PCA to get the top 2-3 principal components and plot these components on 2-d or 3-d plots!

PCA for Visualization

Data Source: MTCars Dataset

Here are a few resources that use this dataset and apply PCA for visualization. This notebook references this PCA tutorial in R, these lecture notes from CMU, this blog, and this blog which has some nice visualizations of PCA.

Loading in The Cars Dataset and carry out EDA

This dataset consists of data on 32 models of car, taken from an American motoring magazine (1974 Motor Trend magazine). For each car, you have 11 features, expressed in varying units (US units), They are as follows (source):

  • mpg: Fuel consumption (Miles per (US) gallon): more powerful and heavier cars tend to consume more fuel.
  • cyl: Number of cylinders: more powerful cars often have more cylinders
  • disp: Displacement (cu.in.): the combined volume of the engine's cylinders
  • hp: Gross horsepower: this is a measure of the power generated by the car
  • drat: Rear axle ratio: this describes how a turn of the drive shaft corresponds to a turn of the wheels. Higher values will decrease fuel efficiency.
  • wt: Weight (1000 lbs): pretty self-explanatory!
  • qsec: 1/4 mile time: the cars speed and acceleration
  • vs: Engine block: this denotes whether the vehicle's engine is shaped like a "V", or is a more common straight shape.
  • am: Transmission: this denotes whether the car's transmission is automatic (0) or manual (1).
  • gear: Number of forward gears: sports cars tend to have more gears.
  • carb: Number of carburetors: associated with more powerful engines

Note that the units used vary and occupy different scales.

We are dropping the categorical variables vs and am before we progress any further, and only keeping in the continuous predictors.

In [ ]:
cars_df = pd.read_csv('../data/mtcars.csv')
cars_df = cars_df[cars_df.columns.difference(['am', 'vs'])]
In [ ]:

Our task is to try to visualize this data in a meaningful way. Obviously we can't make a 9-dimensional plot, but we can try to make several different plots using the pairplot function from seaborn.

In [ ]:

But there are numerous variables and numerous more relationships between these variables. We can do better through PCA.

A Better Visualization using PCA

Standardizing Variables

Standardization is crucial for PCA. For more details check the notes from the advanced section.

In [ ]:
from sklearn.preprocessing import StandardScaler

# separating the quantitative predictors from the model of the car (a string)
model = cars_df['model']
quant_df = cars_df[cars_df.columns.difference(['model'])]

# Standardization
quant_scaled = StandardScaler().fit_transform(quant_df)
cars_df_scaled = pd.DataFrame(quant_scaled, columns=quant_df.columns)

# bringing back the model name
cars_df_scaled['model'] = cars_df['model']

Carrying out PCA

Sklearn PCA documentation

In [ ]:
from sklearn.decomposition import PCA

# drop again the model predictor
quant_df = cars_df_scaled[cars_df_scaled.columns.difference(['model'])]

# fitting the PCA object onto our dataframe (excluding the model name column)
pca = PCA().fit(quant_df)

# transforming the dataframe
quant_df_pca = pca.transform(quant_df)


Let us examine some of the attributes we obtain from PCA.

  1. explained_variance_: The amount of variance explained by each of the selected principal components.
  2. explained_variance_ratio_: Percentage of variance explained by each of the selected principal components. By default, if n_components is not set then all components are stored and the sum of the ratios is equal to 1.0.
In [ ]:
fig, ax = plt.subplots(ncols=2, figsize=(20,6))
ax1, ax2 = ax.ravel()

ratio = pca.explained_variance_ratio_
ax1.bar(range(len(ratio)), ratio, color='purple', alpha=0.8)
ax1.set_title('Explained Variance Ratio PCA', fontsize=20)
ax1.set_xticklabels(['PC {}'.format(i+1) for i in range(len(ratio))])
ax1.set_ylabel('Explained Variance Ratio')

# ratio[0]=0
ratio = pca.explained_variance_ratio_
ax2.plot(np.cumsum(ratio), 'o-')

ax2.set_title('Cumulative Sum of Explained Variance Ratio PCA', fontsize=20)

ax2.set_xticklabels(['PC {}'.format(i+1) for i in range(len(ratio))])
ax2.set_ylabel('Cumulative Sum of Explained Variance Ratio');

We see that over 85% of the variance is explained by the first 2 principal components!

  1. components_: This represents the principal components i.e. directions of maximum variance in the data. The components are sorted by explained_variance_.

Let us write the equation for all the principal components using our formulation of the principal components above:

$$ Z_i = \sum_{j=1}^{p} w_{ij} X_i $$
In [ ]:
for i, comp in enumerate(pca.components_):
    expression = 'Z_{} = '.format(i+1)
    for c, x in zip(comp, quant_df.columns):
        if c < 0:
            expression += str(np.round(c,2)) + '*' + x + ' '
            expression += '+' + str(np.round(c,2)) + '*' + x + ' '
    print(expression + '\n')

Using the printed equations above, we can create vectors showing where each feature has a high value. Let us do this for the first 2 principal components (using $v$ to denote a vector):

$$ \begin{aligned} v_{carb} = \begin{pmatrix}-0.24 \\ 0.48 \end{pmatrix}, \; v_{cyl} = \begin{pmatrix}-0.4 \\ 0.02 \end{pmatrix}, \; v_{disp} = \begin{pmatrix}-0.4 \\ -0.09 \end{pmatrix}, \\ v_{drat} = \begin{pmatrix}0.31 \\ 0.34 \end{pmatrix}, \; v_{gear} = \begin{pmatrix}0.21 \\ 0.55 \end{pmatrix}, \; v_{hp} = \begin{pmatrix}-0.37 \\ 0.27 \end{pmatrix}, \\ v_{mpg} = \begin{pmatrix}0.39 \\ 0.03 \end{pmatrix}, \; v_{qsec} = \begin{pmatrix}0.22 \\ -0.48 \end{pmatrix}, \; v_{wt} = \begin{pmatrix}-0.37 \\ -0.17 \end{pmatrix} \end{aligned} $$

Checking if our vectors are orthonormal

Orthonormal vectors are the vectors which are orthogonal (zero dot product) with length equal to one (unit vectors).


We use the dot product between two vectors to check if the vectors are orthogonal or not. If the dot product is 0, that means that the two vectors are orthogonal. The dot product between two vectors is (geometrically):

$$ \textbf{a} \cdot \textbf{b} = ||\textbf{a}|| ||\textbf{b}|| \cos(\theta) $$

Where $\theta$ is the angle between the two vectors and $||\cdot||$ denotes the norm of the vector. Since we assume that the norm of a vector is non-zero, the only way the dot product of two vectors to be zero is when the angle between them is 90 degrees (since $\cos(90) = 0$). Thus, the dot product is a good way to check if two vectors are perpendicular.

Unit vectors

In order to calculate the length $||\textbf{a}||$ of a vector we can take the dot product of a vector with itself, namely $$ ||\textbf{a}|| = \textbf{a}\cdot \textbf{a} $$

In [ ]:
vec1 = pca.components_[0]
vec2 = pca.components_[1]

# print(np.dot(vec1.T, vec2))
print('The dot product between the first two principal components is ',np.round(np.dot(vec1, vec2),5))
print('The length of the first  principal component is ',np.round(np.dot(vec1, vec1),5))

We see that the first two principal components are orthogonal and the first principal component is also a unit vector. Check other pairs of principal components in order to convince yourself that all principal components are always pairwise orthogonal unit vectors.

Plotting Results

In [ ]:
# to plot vectors from the center
vecs = pca.components_[0:2].T*2

fig, ax = plt.subplots(figsize=(16,8))
ax.plot(quant_df_pca[:,0], quant_df_pca[:,1], 'o', markersize=0.01)
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_title('Cars Dataset plotted using first 2 Principal Components', fontsize=20)

# plotting arrowheads
for i, vec in enumerate(vecs):
    ax.arrow(0,0,vec[0],vec[1], color='brown', head_width=0.1)
    s = 1.3
    ax.annotate(quant_df.columns[i], (s*vec[0], s*vec[1]), color='brown')

# annotating text
for i, txt in enumerate(cars_df_scaled['model']):
    ax.annotate(txt, (quant_df_pca[:,0][i], quant_df_pca[:,1][i]), size=12)

Any patterns of interest? Let us examine the geography more closely. Source: this blog.

In [ ]:
country = ["Japan", "US", "EU", "US", "EU", "Japan", "US", "EU", "US", "EU"]
times = [3, 4, 7, 3, 1, 3, 4, 3, 1, 3]
country_list = np.array(sum(([x]*y for x,y in zip(country, times)),[]))
In [ ]:
fig, ax = plt.subplots(figsize=(16,8))

# main plot
ax.plot(quant_df_pca[:,0], quant_df_pca[:,1], 'o', markersize=0.01)
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_title('Cars Dataset plotted using first 2 Principal Components', fontsize=20)

# plotting arrowheads
for i, vec in enumerate(vecs):
    ax.arrow(0,0,vec[0],vec[1], color='brown', head_width=0.05)
    s = 1.3
    ax.annotate(quant_df.columns[i], (s*vec[0], s*vec[1]), color='brown')

# plotting names
cs = [sns.xkcd_rgb["magenta"], sns.xkcd_rgb["denim blue"], sns.xkcd_rgb["medium green"]]
colors = {"Japan": cs[0], "US": cs[1], "EU": cs[2]}

# dummy plots to show up in the legend
ax.plot(0,0, color=cs[0], label='Japan')
ax.plot(0,0, color=cs[1], label='US')
ax.plot(0,0, color=cs[2], label='EU')

# plotting text with color
for i, txt in enumerate(cars_df_scaled['model']):
    country = country_list[i]
    ax.annotate(txt, (quant_df_pca[:,0][i], quant_df_pca[:,1][i]), color=colors[country], size=12)

What patterns do you see now?

For instance, we can observe that fuel consumption (mpg) is increased for the japanese cars and weight (wt) is increased for US cars.

Addressing n_components

We notice that PCA takes in 1 parameter: n_components. This is the number of principal components that PCA will use. By default, the number of the used principal components is the minimum of the number of rows and the number of columns in the dataset.

Note: Setting the default parameter for n_components and taking the top-k principal components is equivalent to setting n_components=k. Let's check this.

In [ ]:
old_components = pca.components_[0:2]

# doing pca with 2 components
pca2 = PCA(n_components=2).fit(quant_df)

new_components = pca2.components_

# checking equivalence
print(new_components.all() == old_components.all())

PCA to speed up classification of Handwritten Digits

This example, using the MNIST dataset, was borrowed from this Towards Data Science blog post. In this example, we will be classifying hand-written digits.

Data Loading and EDA

In [ ]:
# we'll use keras a lot more in the last few weeks of the course
from keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

print(x_train.shape, y_train.shape, x_test.shape, y_test.shape)

Our training set (x_train) contains 60000 images of size 28 by 28. Our training labels (y_train) are numbers from 0 to 9. Let's examine one of these values below.

In [ ]:
fig, ax = plt.subplots()
ax.imshow(x_train[0], cmap='gray');

Our task is to classify the test set digits as accurately as possible.

We notice that the shape of the training set is $6000 \times 28 \times 28$ which is a 3-dimensional array. We have not dealt with these kinds of arrays before. We will deal with images in greater detail (and not only!!!) in the follow-up course, CS 109b, if you are interested in doing more of this kind of stuff you should take this course. For now, we will reshape the array into a 2-dimensional array of shape $6000\times 784$.

In [ ]:
x_train = x_train.reshape(x_train.shape[0], 784)
x_test = x_test.reshape(x_test.shape[0], 784)

# check if the shapes are ok
print(x_train.shape, y_train.shape, x_test.shape, y_test.shape)

Normalizing Data

Image data is usually between 0 and 255 (0 indicating a black pixel and 255 indicating a white pixel). We can normalize these values by dividing by 255.

In [ ]:
# checking the min and max of x_train and x_test
print(x_train.min(), x_train.max(), x_test.min(), x_test.max())

x_train = (x_train - x_train.min())/(x_train.max() - x_train.min()) 
x_test = (x_test - x_train.min())/(x_train.max() - x_train.min()) 

print(x_train.min(), x_train.max(), x_test.min(), x_test.max())

Modeling using Logistic Regression

In [ ]:
from sklearn.linear_model import LogisticRegression
from time import time

start = time()
#‘lbfgs’ solver handles multinomial loss in multiclass problems 
logreg_model = LogisticRegression(solver='lbfgs').fit(x_train, y_train)
end = time()

full_logreg_time = end-start
print('Time to fit: {}s'.format(full_logreg_time))
In [ ]:
from sklearn.metrics import accuracy_score

y_preds_train = logreg_model.predict(x_train)
y_preds_test = logreg_model.predict(x_test)

full_logreg_score_train = accuracy_score(y_train, y_preds_train)
full_logreg_score_test = accuracy_score(y_test, y_preds_test)

# Evaluation
print('Training Set Score: {}'.format(full_logreg_score_train))
print('Test Set Score: {}'.format(full_logreg_score_test))
In [ ]:
# get performance by class 
pd.crosstab(y_test, y_preds_test, margins=True, rownames=['Actual'], colnames=['Predicted'])

We get a high training and test set score but it takes a relatively long time to fit a model. Let us see if we can speed things up when using PCA

Logistic Regression Model after PCA

In [ ]:
# Do PCA onto our training set and inspect
pca = PCA(n_components=100).fit(x_train)

fig, ax = plt.subplots(ncols=2, figsize=(20,6))
ax1, ax2 = ax.ravel()

ratio = pca.explained_variance_ratio_
ax1.plot(range(1,len(ratio)+1), ratio, 'o-')
ax1.set_title('Explained Variance Ratio PCA', fontsize=20)
ax1.set_ylabel('Explained Variance Ratio')

ratio = pca.explained_variance_ratio_
ax2.plot(range(1,len(ratio)+1), np.cumsum(ratio), 'o-')
ax2.set_title('Cumulative Sum of Explained Variance Ratio PCA', fontsize=20)
ax2.set_ylabel('Cumulative Sum of Explained Variance Ratio');

We see that the first 100 principal components hold over 90% of the variance and the first 50 principal components hold over 80% of the variance! We have significantly reduced the dimensionality of our problem! Let us use PCA to find the first 100 principal components of our dataset and transform our x_train and x_test accordingly.

In [ ]:
x_train_pca = pca.transform(x_train)
x_test_pca = pca.transform(x_test)

print(x_train_pca.shape, x_test_pca.shape)
In [ ]:
start = time()
logreg_model_pca = LogisticRegression(solver='lbfgs').fit(x_train_pca, y_train)
end = time()

print('Time to fit model (100 PCs): {}s'.format(end-start))
print('Time to fit model (full dataset): {}s'.format(full_logreg_time))
print('So to fit the model with the full dataset is about', np.round(full_logreg_time/(end-start),0), ' times slower than using PCA')

fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(0, full_logreg_time, width=0.5)
ax.bar(1, end-start, width=0.5)
ax.set_xticklabels(['Full Dataset', '100 PCs'])
ax.set_ylabel('Time to Fit Model (s)')
ax.set_title('Time taken to fit different models (s)');

Note: The time taken to fit our model is considerably smaller! Now let us check our accuracy

In [ ]:
y_preds_train_pca = logreg_model_pca.predict(x_train_pca)
y_preds_test_pca = logreg_model_pca.predict(x_test_pca)

# Evaluation
print('Training Set Score (100 PCs): {}'.format(accuracy_score(y_train, y_preds_train_pca)))
print('Test Set Score (100 PCs): {}\n'.format(accuracy_score(y_test, y_preds_test_pca)))

print('Training Set Score (full dataset): {}'.format(full_logreg_score_train))
print('Test Set Score (full dataset): {}'.format(full_logreg_score_test))
In [ ]:
# get performance by class 
pd.crosstab(y_test, y_preds_test_pca, margins=True, rownames=['Actual'], colnames=['Predicted'])

Not a significant drop in accuracy!! But, since we are losing information by not accounting for all the variance, we are faced with a speed accuracy tradeoff. Explore the case of keeping less principal components.

Plotting PCA

Plotting the Reconstructed Image

In [ ]:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(8, 4))
ax1, ax2 = ax.ravel()

ax1.imshow(x_train[0].reshape(28,28), cmap='gray')
ax1.set_title('Original Image with 784 components')

ax2.imshow(x_train_pca[1].reshape(10,10), cmap='gray')
ax2.set_title('Image after PCA with 100 components')


Uhhh... this is terrible. But we can use PCA to carry out an inverse transform in order to get a reconstructed image. Let's try again, using pca.inverse_transform()! Source: this github repo

In [ ]:
img_reconstructed = pca.inverse_transform(x_train_pca[0])

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(8, 4))
ax1, ax2 = ax.ravel()

ax1.imshow(x_train[0].reshape(28,28), cmap='gray')
ax1.set_title('Original Image with 784 components')

ax2.imshow(img_reconstructed.reshape(28,28), cmap='gray')
ax2.set_title('Reconstructed Image after PCA with 100 components')


Plotting all our points on a 2-dimensional plot given by the first 2 principal components of PCA

This towards data science article has a few similar plots that are pretty cool!

In [ ]:
pca = PCA(n_components=2).fit(x_train)
x_train_2 = pca.transform(x_train)
In [ ]:
fig, ax = plt.subplots(figsize=(16,8))
for i in range(10):
    indices = np.where(y_train == i)[0]
    data = x_train_2[indices]
    ax.plot(data[:,0], data[:,1], 'o', label='{}'.format(i), alpha=0.5)
ax.set_title('First 2 Principal Components of MNIST Data', fontsize=20)
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')

Any patterns of interest?

So, should I always use PCA?

PCA is great for:

  1. Speeding up the training without significant descrease of the predictivity ability of a model compared to a model with all p predictors.
  2. Visualizing how predictive your features can be of your response, especially in the case of classification.
  3. Reducing multicollinearity, and thus may improve the computational time of fitting models.
  4. Reducing dimensionality in very high dimensional settings.

PCA is not so good in certain situations because:

  1. Interpretation of coefficients in PCR is completely lost. So do not do PCA if interpretation is important.
  2. When the predictors' distribution deviates significantly from a multivariable Normal distribution.
  3. When the high variance does not indicate high importance.
  4. When the hidden dimensions are not orthonormal.

End of Standard Section