Key Word(s): pca, principal component analysis, dimensionality reduction

# CS-109A Introduction to Data Science

## Lab 8: Principal Component Analysis (PCA)

**Harvard University**

**Fall 2019**

**Instructors:** Pavlos Protopapas, Kevin Rader, Chris Tanner

**Lab Instructors:** Chris Tanner and Eleni Kaxiras.

**Contributors:** Will Claybaugh, David Sondak, Chris Tanner

```
## RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
```

## Learning Goals

In this lab, we will look at how to use PCA to reduce a dataset to a smaller number of dimensions. The goal is for students to:

- Understand what PCA is and why it's useful
- Feel comfortable performing PCA on a new dataset
- Understand what it means for each component to capture variance from the original dataset
- Be able to extract the variance explained by components
- Perform modeling with the PCA components

```
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV
from sklearn.linear_model import LassoCV
from sklearn.metrics import accuracy_score
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
from sklearn.model_selection import train_test_split
```

## Part 1: Introduction

What is PCA? PCA is a deterministic technique that transforms data to a lower-dimensional representation in which each new dimension, in order, captures as much of the remaining variance as possible.

Why do we care to use it?

- Visualizing the components can be useful
- Allows for more efficient use of resources (time, memory)
- Statistical reasons: fewer dimensions -> better generalization
- Noise removal / reducing collinearity (improving data quality)

Imagine a dataset where we have two features that are fairly redundant. For example, maybe we have data concerning elite runners. Two of the features may be `VO2 max` and `heart rate`. These are highly correlated, so we probably don't need both; they don't offer much information beyond each other. Using a great visual example from online, let's say that this unlabelled graph **(always label your axes)** represents those two features:

Let's say that this is our entire dataset, just 2 dimensions. If we wish to reduce the dimensionality, our only option is to reduce it to 1 dimension. A straight line is 1-dimensional (to help clarify this: imagine your straight line as being the x-axis; values can lie somewhere along this axis, but that's it, as there is no y-axis dimension for a straight line). So, how should PCA select a straight line through this data?

Below, the image shows all possible projections, each centered in the data:

PCA picks the line that:

- captures the most variance possible
- minimizes the distance of the transformed points (distance from the original to the new space)

The animation **suggests** that these two aspects are actually the same. In fact, this is provably true, but the proof is beyond the scope of the material for now. Feel free to read more at this explanation and via Andrew Ng's notes.
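We can check this equivalence numerically on a small synthetic dataset (the data here is made up for illustration): for a centered 2-D cloud of points, the unit direction that maximizes the variance of the projected points is the same direction that minimizes the squared reconstruction error. A minimal sketch:

```python
import numpy as np

# synthetic correlated 2-D data, centered (hypothetical example data)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([x, 0.8 * x + 0.3 * rng.normal(size=200)])
data -= data.mean(axis=0)

# scan candidate line directions by angle
angles = np.linspace(0, np.pi, 500)
variances, errors = [], []
for theta in angles:
    u = np.array([np.cos(theta), np.sin(theta)])  # unit direction vector
    proj = data @ u                               # 1-D coordinate of each point along u
    recon = np.outer(proj, u)                     # points mapped back into 2-D
    variances.append(proj.var())                  # variance of the projected points
    errors.append(((data - recon) ** 2).sum())    # squared reconstruction error

# the same direction wins both criteria
print(angles[np.argmax(variances)], angles[np.argmin(errors)])
```

This works because, for centered data, the total sum of squares splits (by the Pythagorean theorem) into the part along the line plus the residual part, so maximizing one is exactly minimizing the other.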

In short, PCA is a mathematical technique that works with the covariance matrix -- the matrix that describes how all pairs of features are correlated with one another. The covariance of two variables measures the degree to which they vary in the same direction. A positive covariance means they are positively related (i.e., x1 increases as x2 does); a negative covariance means they are inversely related (x1 decreases as x2 increases).
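As a tiny illustration of covariance signs (the numbers here are made up), `np.cov` returns the covariance matrix, and the off-diagonal entry is the covariance between the two variables:

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2_pos = np.array([2.0, 4.0, 6.0, 8.0])   # moves with x1
x2_neg = np.array([8.0, 6.0, 4.0, 2.0])   # moves against x1

print(np.cov(x1, x2_pos)[0, 1])   # positive: the variables rise together
print(np.cov(x1, x2_neg)[0, 1])   # negative: one falls as the other rises
```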

In data science and machine learning, our models are often just finding patterns in the data. This is easier if the data is spread out across each dimension and if the features are independent of one another (imagine if there were no variance at all; we couldn't do anything). Can we transform the data into a new set of features that are linear combinations of the originals?

PCA finds new dimensions (a set of basis vectors) such that all the dimensions are orthogonal, and hence linearly independent, and ranked according to the variance they capture (their eigenvalues). That is, the first component is the most important, as it captures the most variance.
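This can be sketched from scratch with NumPy (a minimal sketch on made-up data, not the `sklearn` implementation we'll use later): eigen-decompose the covariance matrix of the centered data, sort the eigenvectors by eigenvalue, and verify that the resulting components are orthonormal:

```python
import numpy as np

# hypothetical 3-feature dataset with some built-in correlation
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
X[:, 1] += 0.9 * X[:, 0]          # make feature 1 correlated with feature 0
X -= X.mean(axis=0)               # PCA assumes centered features

cov = np.cov(X, rowvar=False)     # 3x3 covariance matrix of the features
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices, ascending order
order = np.argsort(eigvals)[::-1]        # rank components by variance captured
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

Z = X @ eigvecs                   # data expressed in the new (decorrelated) basis
print(eigvals)                    # variance captured by each component, descending
print(np.round(eigvecs.T @ eigvecs, 10))  # identity matrix: components are orthonormal
```

Note that `eigvals` is sorted in decreasing order, which is exactly the "first component captures the most variance" ranking described above.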

## Part 2: The Wine Dataset

Imagine that a wine sommelier has tasted and rated 1,000 distinct wines, and now that she's highly experienced, she is curious whether she can rate wines more efficiently without even trying them. That is, perhaps her tasting preferences follow a pattern, allowing her to predict the rating of a new wine without even trying it!

The dataset contains 11 chemical features, along with a quality scale from 1-10; however, only values of 3-9 are actually used in the data. The ever-elusive perfect wine has yet to be tasted.

**NOTE:** While this dataset involves the topic of alcohol, we, the CS109A staff, along with Harvard at large, are in no way encouraging alcohol use, and this example should not be interpreted as any endorsement of such; it is merely a pedagogical example. We apologize if this example offends anyone or is off-putting.

### Read-in and checking

First, let's read in the data and verify it:

```
wines_df = pd.read_csv("../data/wines.csv", index_col=0)
wines_df.head()
```

```
wines_df.describe()
```

For this exercise, let's say that the wine expert is curious if she can predict, as a rough approximation, the **categorical quality -- bad, average, or great.** Let's define those categories as follows:

- `bad` is for wines with a quality <= 5
- `average` is for wines with a quality of 6 or 7
- `great` is for wines with a quality >= 8

```
# copy the original data so that we're free to make changes
wines_df_recode = wines_df.copy()
# use pd.cut to bin quality into the three categories defined above
# (bin edges are right-inclusive: (0,5] -> bad, (5,7] -> average, (7,10] -> great)
wines_df_recode['quality'] = pd.cut(wines_df_recode['quality'], [0, 5, 7, 10], labels=[0, 1, 2])
# separate the features from the target
x_data = wines_df_recode.drop(['quality'], axis=1)
y_data = wines_df_recode['quality']
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=.2, random_state=8, stratify=y_data)
# preview the data to check that we correctly constructed the labels (we did)
print(wines_df['quality'].head())
print(wines_df_recode['quality'].head())
```

For sanity, let's see how many wines are in each category:

```
y_data.value_counts()
```

Now that we've split the data, let's look to see if there are any obvious patterns (correlations between different variables).

```
from pandas.plotting import scatter_matrix
scatter_matrix(wines_df_recode, figsize=(30,20));
```