CS-109A Introduction to Data Science

Lab 8: Principal Component Analysis (PCA)

Harvard University
Fall 2019
Instructors: Pavlos Protopapas, Kevin Rader, Chris Tanner
Lab Instructors: Chris Tanner and Eleni Kaxiras.
Contributors: Will Claybaugh, David Sondak, Chris Tanner


In [1]:
## RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
Out[1]:

Learning Goals

In this lab, we will look at how to use PCA to reduce a dataset to a smaller number of dimensions. The goal is for students to:

  • Understand what PCA is and why it's useful
  • Feel comfortable performing PCA on a new dataset
  • Understand what it means for each component to capture variance from the original dataset
  • Be able to extract the `variance explained` by components.
  • Perform modelling with the PCA components
In [17]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.linear_model import LogisticRegressionCV
from sklearn.linear_model import LassoCV
from sklearn.metrics import accuracy_score
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)

from sklearn.model_selection import train_test_split

Part 1: Introduction

What is PCA? PCA is a deterministic technique that transforms data to a lower-dimensional representation, where each new dimension (component) captures as much of the remaining variance as possible.

Why do we care to use it?

  • Visualizing the components can be useful
  • Allows for more efficient use of resources (time, memory)
  • Statistical reasons: fewer dimensions -> better generalization
  • noise removal / collinearity (improving data quality)

Imagine a dataset where we have two features that are fairly redundant. For example, maybe we have data on elite runners, and two of the features are VO2 max and heart rate. These are highly correlated, so we probably don't need both; neither offers much information beyond the other. Using a great visual example from online, let's say that this unlabelled graph (always label your axes!) represents those two features:

VO2 Max vs Heart Rate

Let's say this is our entire dataset: just 2 dimensions. If we wish to reduce the dimensionality, the only option is to reduce it to 1 dimension. A straight line is 1-dimensional (to help clarify this, imagine the line as being an axis: each point gets a single value along that axis, and that's it; there is no second dimension). So, how should PCA select a straight line through this data?

Below, the animation shows all possible projection lines that pass through the center of the data:

Animation of possible lines

PCA picks the line that:

  • captures the most variance possible
  • minimizes the total distance between the original points and their projections onto the new line

The animation suggests that these two criteria are actually the same, and in fact they are, though the full proof is beyond the scope of this material. Feel free to read more at this explanation and via Andrew Ng's notes.
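As a quick sketch of why the two criteria coincide (assuming the data is centered and the candidate line is a unit vector $w$), the Pythagorean theorem applied to each point and its projection gives:

$$\frac{1}{n}\sum_{i=1}^{n}\lVert x_i\rVert^2 \;=\; \frac{1}{n}\sum_{i=1}^{n}(w^\top x_i)^2 \;+\; \frac{1}{n}\sum_{i=1}^{n}\lVert x_i - (w^\top x_i)\,w\rVert^2$$

The left side (the total variance) does not depend on $w$, so choosing $w$ to maximize the captured variance (first term on the right) is the same as choosing $w$ to minimize the average squared projection distance (second term).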

In short, PCA is a mathematical technique that works with the covariance matrix -- the matrix that describes how each pair of features varies with one another. The covariance of two variables measures the degree to which they vary in the same direction. A positive covariance means they are positively related (i.e., x1 tends to increase as x2 increases); a negative covariance means they are negatively related (x1 tends to decrease as x2 increases).
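As a quick illustration (a minimal sketch using made-up numbers, not the wine data), a covariance matrix can be computed directly with numpy:

In [ ]:
import numpy as np

# two features that mostly move together, plus one that moves opposite to the first
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([1.1, 1.9, 3.2, 3.8, 5.1])   # roughly tracks x1 -> positive covariance
x3 = np.array([5.0, 4.1, 2.9, 2.2, 1.0])   # moves opposite to x1 -> negative covariance

# np.cov expects each row to be a variable (feature) and each column an observation
cov_matrix = np.cov(np.vstack([x1, x2, x3]))
print(cov_matrix)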

In data science and machine learning, our models are often just finding patterns in the data. This is easier when the data is spread out across each dimension and when the features are independent from one another (imagine if there were no variance at all; we couldn't do anything). Can we transform the data into a new set of features that are linear combinations of the original features?

PCA finds a new set of dimensions (basis vectors) that are orthogonal, and hence linearly independent, and ranks them by the variance each captures (its eigenvalue). That is, the first component is the most important, as it captures the most variance.
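As a minimal sketch of what this looks like with scikit-learn (run on standardized toy data here, not the wine dataset; the variable names below are just placeholders):

In [ ]:
# minimal PCA sketch on random toy data (placeholder, not the wine data)
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
toy_x = rng.normal(size=(100, 5))                      # 100 observations, 5 features
toy_x_scaled = StandardScaler().fit_transform(toy_x)   # PCA is sensitive to feature scale

pca = PCA(n_components=2)                   # keep the top 2 components
toy_x_pca = pca.fit_transform(toy_x_scaled)

print(toy_x_pca.shape)                      # (100, 2): same rows, fewer columns
print(pca.explained_variance_ratio_)        # fraction of variance captured by each component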

Part 2: The Wine Dataset

Imagine that a wine sommelier has tasted and rated 1,000 distinct wines, and now that she's highly experienced, she is curious if she can more efficiently rate wines without even trying them. That is, perhaps her tasting preferences follow a pattern, allowing her to predict the rating of a new wine without even tasting it!

The dataset contains 11 chemical features, along with a quality rating on a 1-10 scale; however, only values of 3-9 actually appear in the data. (The data also include a binary red column distinguishing red from white wines.) The ever-elusive perfect wine has yet to be tasted.

NOTE: While this dataset involves the topic of alcohol, we, the CS109A staff, along with Harvard at large, are in no way encouraging alcohol use, and this example should not be interpreted as any endorsement of it; it is merely a pedagogical example. I apologize if this example offends anyone or is off-putting.

Read-in and checking

First, let's read-in the data and verify it:

In [18]:
wines_df = pd.read_csv("../data/wines.csv", index_col=0)
wines_df.head()
Out[18]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality red
0 8.9 0.590 0.50 2.0 0.337 27.0 81.0 0.99640 3.04 1.61 9.5 6 1
1 7.7 0.690 0.22 1.9 0.084 18.0 94.0 0.99610 3.31 0.48 9.5 5 1
2 8.8 0.685 0.26 1.6 0.088 16.0 23.0 0.99694 3.32 0.47 9.4 5 1
3 11.4 0.460 0.50 2.7 0.122 4.0 17.0 1.00060 3.13 0.70 10.2 5 1
4 8.8 0.240 0.54 2.5 0.083 25.0 57.0 0.99830 3.39 0.54 9.2 5 1
In [19]:
wines_df.describe()
Out[19]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality red
count 1000.000000 1000.000000 1000.00000 1000.000000 1000.000000 1000.00000 1000.00000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.00000
mean 7.558400 0.397455 0.30676 4.489250 0.067218 25.29650 91.03100 0.995351 3.251980 0.572990 10.489433 5.796000 0.50000
std 1.559455 0.189923 0.16783 4.112419 0.046931 17.06237 59.57269 0.002850 0.164416 0.169583 1.151195 0.844451 0.50025
min 3.800000 0.080000 0.00000 0.800000 0.009000 1.00000 6.00000 0.987400 2.740000 0.280000 8.500000 3.000000 0.00000
25% 6.500000 0.260000 0.22000 1.800000 0.042000 12.00000 37.75000 0.993480 3.140000 0.460000 9.500000 5.000000 0.00000
50% 7.200000 0.340000 0.30000 2.400000 0.060000 22.00000 86.00000 0.995690 3.240000 0.550000 10.300000 6.000000 0.50000
75% 8.200000 0.520000 0.40000 6.100000 0.080000 35.00000 135.00000 0.997400 3.360000 0.650000 11.300000 6.000000 1.00000
max 15.500000 1.580000 1.00000 26.050000 0.611000 131.00000 313.00000 1.003690 3.900000 2.000000 14.000000 8.000000 1.00000

For this exercise, let's say that the wine expert is curious if she can predict, as a rough approximation, the categorical quality -- bad, average, or great. Let's define those categories as follows:

  • bad is for wines with a quality <= 5
  • average is for wines with a quality of 6 or 7
  • great is for wines with a quality >= 8
In [20]:
# copy the original data so that we're free to make changes
wines_df_recode = wines_df.copy()

# use the 'cut' function to reduce 'quality' down to the aforementioned bins
# (the right boundary of each bin is inclusive: (0,5] -> 0, (5,7] -> 1, (7,10] -> 2)
wines_df_recode['quality'] = pd.cut(wines_df_recode['quality'],[0,5,7,10], labels=[0,1,2])
# (note: this line selects the 'average' wines, but its result isn't stored or displayed)
wines_df_recode.loc[wines_df_recode['quality'] == 1]

# separate the predictors from the response
x_data = wines_df_recode.drop(['quality'], axis=1)
y_data = wines_df_recode['quality']

x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=.2, random_state=8, stratify=y_data)

# preview the data to check that we correctly constructed the labels (we did)
print(wines_df['quality'].head())
print(wines_df_recode['quality'].head())
0    6
1    5
2    5
3    5
4    5
Name: quality, dtype: int64
0    1
1    0
2    0
3    0
4    0
Name: quality, dtype: category
Categories (3, int64): [0 < 1 < 2]

As a sanity check, let's see how many wines are in each category:

In [72]:
y_data.value_counts()
Out[72]:
1    598
0    379
2     23
Name: quality, dtype: int64

Now that we've split the data, let's look for any obvious patterns (correlations between different variables).

In [22]:
from pandas.plotting import scatter_matrix
scatter_matrix(wines_df_recode, figsize=(30,20));
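
If the scatter matrix is hard to read at this size, a compact alternative (a sketch, not part of the original lab code) is to inspect the numeric correlation matrix of the predictors directly, using the x_data frame defined above:

In [ ]:
# a compact, numeric view of the same pairwise relationships
corr = x_data.corr()          # pairwise Pearson correlations of the predictors
print(corr.round(2))

# optionally, visualize the correlation matrix as a heatmap
plt.figure(figsize=(10, 8))
plt.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
plt.colorbar(label='correlation')
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.show()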