{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Title :\n", "Lecture Notebook\n", "\n", "## Description :\n", "\n", "## Data Description:\n", "\n", "## Instructions:\n", "\n", "## Hints: \n", "\n", "Statsmodels\n", "\n", "statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration. \n", "\n", "Basic code structure is shown below:\n", "\n", "`import statsmodels.api as sm`\n", "\n", "X is our dataset\n", "\n", "Add intercept (bias constant):\n", "\n", "X = sm.add_constant(X)\n", "\n", "Fit regression model:\n", "\n", "results = sm.OLS(y, X).fit()\n", "\n", "Inspect the results:\n", "print(results.summary())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# CS109A Introduction to Data Science\n", "\n", "## Principal Component Analysis\n", "\n", "\n", "**Harvard University**
\n", "**Fall 2021**
\n", "**Instructors**: Pavlos Protopapas, Natesh Pillai\n", "\n", "---" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": true }, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import scipy\n", "import seaborn as sns\n", "\n", "import plotly\n", "import plotly.graph_objs as go\n", "\n", "import sklearn as sk\n", "from sklearn.decomposition import PCA\n", "\n", "import statsmodels.api as sm\n", "from statsmodels.stats.outliers_influence import variance_inflation_factor\n", "\n", "pd.set_option(\"display.width\", 500)\n", "pd.set_option(\"display.max_columns\", 100)\n", "\n", "sns.set_style(\"darkgrid\")\n", "sns.set_palette(\"colorblind\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# PCA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 0: Reading the data \n", "\n", "In this notebook, we will be using a Heart dataset. The variables we will be using today include:\n", "\n", "- `AHD`: whether or not the patient presents atherosclerotic heart disease (a heart attack): `Yes` or `No`\n", "- `Sex`: a binary indicator for whether the patient is male (Sex=1) or female (Sex=0)\n", "- `Age`: age of patient, in years\n", "- `MaxHR`: the maximum heart rate of patient based on exercise testing\n", "- `RestBP`: the resting systolic blood pressure of the patient\n", "- `Chol`: the HDL cholesterol level of the patient\n", "- `Oldpeak`: ST depression induced by exercise relative to rest (on an ECG)\n", "- `Slope`: the slope of the peak exercise ST segment (1 = upsloping; 2 = flat; 3 = downsloping)\n", "- `Ca`: number of major vessels (0-3) colored by flourosopy\n", "\n", "For further information on the dataset, please see the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Heart+Disease)." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "df_heart = pd.read_csv(\"Heart.csv\")\n", "\n", "# Force the response into a binary indicator:\n", "df_heart[\"AHD\"] = (df_heart[\"AHD\"] == \"Yes\").astype(\"int\")\n", "\n", "print(df_heart.shape)\n", "df_heart.head()\n", "df_heart.drop(columns = ['Unnamed: 0'], inplace=True)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "df_heart" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are some basic summaries and EDA from last time:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "df_heart.describe()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "pd.crosstab(df_heart[\"Sex\"], df_heart[\"AHD\"])" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "pd.crosstab(df_heart[\"Thal\"], df_heart[\"AHD\"])" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "pd.crosstab(df_heart[\"ChestPain\"], df_heart[\"AHD\"])" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "_ = sns.histplot(data=df_heart, x=\"Age\", hue=\"AHD\")" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "_ = sns.histplot(data=df_heart, x=\"MaxHR\", hue=\"AHD\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1: Principal Components Analysis (PCA) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q1.1** Just a sidebar (and a curiosity), what happens when two of the identical predictor is used in linear regression? Is an error created? Should one be? Investigate by predicting `AHD` from two copies of `Age`, and compare to the simple linear regression model with `Age` alone." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "X = sm.add_constant(df_heart[[\"Age\"]])\n", "y = df_heart[\"AHD\"]\n", "\n", "reg1 = sm.OLS(y, X).fit()\n", "reg1.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution:**\n", " \n", "The single coefficient for `Age` is distributed equally across the two predictors. This is a very reasonable approach as predictions will still be stable." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# investigating what happens when two identical predictors are used\n", "\n", "######\n", "# your code here\n", "######\n", "\n", "X = sm.add_constant(df_heart[[\"Age\", \"Age\"]])\n", "reg2 = sm.OLS(y, X).fit()\n", "print(reg2.summary())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will apply PCA to the heart dataset when there are just 7 predictors considered (remember: PCA is used when dimensionality is high (lots of predictors), but this will help us get our heads around what is going on):" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "columns = [\"Age\", \"RestBP\", \"Chol\", \"MaxHR\", \"Sex\", \"Oldpeak\", \"Slope\"]\n", "\n", "X = df_heart[columns]\n", "y = df_heart[\"AHD\"]\n", "X.describe()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "X.corr()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First let's fit the full linear regression model to predict `AHD` from the 7 predictors above.\n", "\n", "Remember: PCA is an approach to handling the predictors, so it does not matter if we are using it for a regression or classification type problem." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "reg_full = sm.OLS(y, sm.add_constant(X)).fit()\n", "\n", "reg_full.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q1.2** Is there any evidence of multicollinearity in the set of predictors? How do you know? How will PCA handle these correlations?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution:**" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# VIF dataframe\n", "vif_data = pd.DataFrame()\n", "vif_data[\"feature\"] = X.columns\n", " \n", "# calculating VIF for each feature\n", "vif_data[\"VIF\"] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))]\n", " \n", "vif_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because we have high VIFs, this indicates that we have multicollinearity." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 2: PCA in Regression (PCR) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we apply the [PCA transformation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) in a few steps, and show some of the results below:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "# create/fit the 'full' pca transformation\n", "pca = PCA().fit(X)\n", "\n", "# apply the pca transformation to the full predictor set\n", "pcaX = pca.transform(X)\n", "\n", "# convert to a data frame\n", "pcr_columns = [\"PCA1\" , \"PCA2\", \"PCA3\", \"PCA4\", \"PCA5\", \"PCA6\", \"PCA7\"]\n", "pcaX_df = pd.DataFrame(pcaX, columns=pcr_columns)\n", "\n", "# here are the weighting (eigen-vectors) of the variables (first 2 at least)\n", "print(\"First PCA Component (w1):\", pca.components_[0,:])\n", "print(\"Second PCA Component (w2):\", pca.components_[1,:])" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "pcaX_df" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "# here is the variance explained:\n", "print(\"Variance explained by each component:\", pca.explained_variance_ratio_)\n", "\n", "blue = sns.color_palette(\"colorblind\")[0]\n", "sns.barplot(y=list(range(1,8)), x=pca.explained_variance_ratio_, orient=\"h\", color=blue)\n", "plt.xscale(\"log\")" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "_ = sns.barplot(y=list(range(1,8)), x=pca.explained_variance_ratio_, orient=\"h\", color=blue)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "# create/fit the 'full' pca transformation\n", "Z = sk.preprocessing.StandardScaler().fit(X).transform(X)\n", "pca_standard = PCA().fit(Z)\n", "pcaZ = pca_standard.transform(Z)\n", "\n", "# convert to a data frame\n", "pcaZ_df = pd.DataFrame(pcaZ, columns=pcr_columns)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "print(pca_standard.components_.shape)\n", "print(pcaZ.shape)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "scrolled": true }, "outputs": [], "source": [ "pd.DataFrame.from_dict({\"Variable\": X.columns,\n", " \"PCA1\": pca.components_[0],\n", " \"PCA2\": pca.components_[1],\n", " \"PCA-Z1\": pca_standard.components_[0],\n", " \"PCA-Z2\": pca_standard.components_[1]})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q2.3** Interpret the results above. What doss $w_1$ represent? Why do the values make sense? What does it's values squared sum up to? Why does this make sense?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution:**\n", "\n", "$w_1$ represents the transformation (change in basis) to convert the columns of $\\mathbf{X}$ to the first PCA vector, $z_1$. They elements after quaring sum up to 1, so the magnitude represents euclidean weighting in the transformation (the larger value means more weight in the transformation). " ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "np.sum(pca.components_[0,:]**2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is common for a model with high dimensional data (lots of predictors) to be plotted along the first 2 PCA components (with the classification boundaries added). 
Below is the scatter plot for these data (without a classification boundary, since we do not have a model yet):" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "# Plot the response over the first 2 PCA component vectors\n", "\n", "sns.scatterplot(data=pcaX_df, x=\"PCA1\", y=\"PCA2\", hue=df_heart[\"AHD\"], legend=\"full\")\n", "\n", "plt.xlabel(\"First PCA Component Vector (Z1)\")\n", "plt.ylabel(\"Second PCA Component Vector (Z2)\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q2.4** What would a classification boundary look like if a linear regression model were fit using the first 2 principal components as the predictors? Does there appear to be good potential here?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution:**\n", "\n", "It would again be linear. Here, most likely the boundary would be a line with negative slope." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below is the result of PCR-1: a linear regression to predict `AHD` from just the first principal component vector." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "X = sm.add_constant(pcaX_df[[\"PCA1\"]])\n", "reg_pcr1 = sm.OLS(y, X).fit()\n", "\n", "reg_pcr1.summary()" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "print(\"First PCA Component (w1):\", pca.components_[0:1,:])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q2.5** What does this PCR-1 model tell us about how the predictors relate to the response (i.e., estimate the coefficient(s) in the original predictor space)? Is it truly a simple linear regression model in the original predictor space?" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "beta = reg_pcr1.params[1]\n", "\n", "(beta*pca.components_[0:1,:])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution:**\n", "\n", "The estimated slope from PCR1 ($\\hat{\\beta} \\approx 0.0009$) is distributed across the 7 actual predictors, so that the formula would be:\n", "\n", "$$\\hat{y} = 0.0009(Z_1) + 0.4587 = 0.0009(w^T_1\\mathbf{X}) + 0.4587 \\\\\n", "= 0.0009(0.0384X_1+0.0505X_2+0.998X_3-0.00374X_4-0.0018X_5+0.00115X_6-0.0000036X_7) + 0.4587 \\\\\n", "= 3.31 \\cdot 10^{-5} X_1 + 4.35 \\cdot 10^{-5} X_2 + 8.6 \\cdot 10^{-4} X_3 - 3.23 \\cdot 10^{-6} X_4 - 1.56 \\cdot 10^{-6} X_5 + 9.955 \\cdot 10^{-7} X_6 - 3.1 \\cdot 10^{-9} X_7 + 0.4587$$\n", "\n", "This is how to interpret the estimated coefficients from a regression with PCA components as the predictors: some transformation back to the original space is required."
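] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make the back-transformation concrete (a sketch assuming `pca`, `reg_pcr1`, `df_heart`, and `columns` from the cells above): because sklearn's `PCA` subtracts the column means before projecting, the back-calculated slopes are $\\hat{\\beta}_1 w_1$ and the centering only shifts the intercept. The PCR-1 fitted values can then be rebuilt directly from the original predictors." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch: rebuild the PCR-1 fitted values in the original predictor space.\n", "w1 = pca.components_[0, :]    # first loading vector\n", "b0, b1 = reg_pcr1.params      # intercept and slope on PCA1\n", "\n", "slopes_orig = b1 * w1                        # slopes on the original predictors\n", "intercept_orig = b0 - b1 * (pca.mean_ @ w1)  # PCA centering folds into the intercept\n", "\n", "yhat_manual = intercept_orig + df_heart[columns].values @ slopes_orig\n", "print(np.allclose(yhat_manual, reg_pcr1.fittedvalues))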
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the above claculation for all 7 PCR linear regressions, and then plotted on a pretty plot:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "results_arr = []\n", "\n", "for i in range(1, 8):\n", " reg_pcr_tmp = sm.OLS(y, sm.add_constant(pcaX_df[pcr_columns[:i]])).fit()\n", " pcr_tmp = np.transpose(pca.components_[:i,:]) @ reg_pcr_tmp.params[1:i+1]\n", " results_arr.append(pcr_tmp)\n", "\n", "betas = reg_full.params[1:]\n", "results_arr.append(betas)\n", "results = np.vstack(results_arr)\n", "print(results)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "plt.plot(pcr_columns + [\"Linear\"], results)\n", "plt.ylabel(\"Back-calculated Beta Coefficients\")\n", "plt.legend(df_heart.columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q2.6** Interpret the plot above. Specifically, compare how each PCA vector \"contributes\" to the original linear regression model using all 7 original predictors. How Does PCR-7 compare to the original linear regression model (in estimated coefficients)?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution:**\n", "\n", "This plot shows that as more PCA vectors are included in the PCA-Regression, the estimated $\\beta$s from the original regression model are recovered: if PCR($p$) is used (where $p$ is the number of predictors we started with), they are mathemtaically equivalent. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All of this PCA work should have been done using the standardized versions of the predictors. Below is the code that does exactly that:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "X = df_heart[columns]\n", "\n", "scaler = sk.preprocessing.StandardScaler()\n", "Z = scaler.fit_transform(X)\n", "pca = PCA().fit(Z)\n", "pcaZ = pca.transform(Z)\n", "pcaZ_df = pd.DataFrame(pcaZ, columns=pcr_columns)\n", "\n", "print(\"First PCA Component (w1):\", pca.components_[0,:])\n", "print(\"Second PCA Component (w2):\", pca.components_[1,:])" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "regZ_full = sm.OLS(y, sm.add_constant(pd.DataFrame(Z, columns=columns))).fit()\n", "regZ_full.summary()" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Fit the PCR\n", "\n", "results_arr = []\n", "\n", "for i in range(1, 8):\n", " reg_pcrZ_tmp = sm.OLS(y, sm.add_constant(pcaZ_df[pcr_columns[:i]])).fit()\n", " pcrZ_tmp = np.transpose(pca.components_[:i,:]) @ reg_pcrZ_tmp.params[1:i+1]\n", " results_arr.append(pcrZ_tmp)\n", "\n", "betasZ = regZ_full.params[1:]\n", "results_arr.append(betasZ)\n", "resultsZ = np.vstack(results_arr)\n", "print(resultsZ)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "plt.plot(pcr_columns + [\"Linear\"],resultsZ)\n", "plt.ylabel(\"Back-calculated Beta Coefficients\");\n", "plt.legend(X.columns);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q2.7** Compare this plot to the previous one; why does this plot make sense?. What does this illustrate? " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution:**\n", "\n", "This plot shows that the components are now more evenly composed of the predictors, rather than the first component being dominated by the predictor with the most variability. 
The 7 lines move more similarly to one another here than in the previous plot, where essentially each component picked up a single predictor." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 3: Underlying Math\n", "\n", "What is PCA doing with these eigenvectors? Why does it all work? To answer these questions, it is easiest to restrict ourselves to two dimensions so that we can easily visualize. To show what is going on, we will focus on `Age` and `MaxHR` because there is a clear negative relationship between these two due to biology." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "_ = sns.scatterplot(data=df_heart, x=\"Age\", y=\"MaxHR\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: be careful when looking at plots with unequal axes, as relationships can be obscured. If we set the axes to equal proportions, the negative linear relationship reveals itself more clearly." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "sns.scatterplot(data=df_heart, x=\"Age\", y=\"MaxHR\")\n", "_ = plt.axis(\"equal\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, suppose we wanted to summarize this data: how would we do this? One way is to give the direction that explains the greatest variance. Why variance? The sample covariance matrix is $S = \\frac{(X-\\mu)^T(X-\\mu)}{n-1}$, which centers the data for us, and so the direction of greatest variance really describes the direction in which the data tends to go. In practice, we can get rid of the $(n-1)$ in the denominator because the operations that follow are scale invariant." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "X = df_heart[[\"Age\", \"MaxHR\"]].values\n", "\n", "mu = np.mean(X, axis=0)\n", "S = (X - mu).T @ (X - mu)\n", "\n", "S" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As the variance is now a matrix, finding the direction of greatest variance is equivalent to solving $\\max_{\\lVert w \\rVert = 1} w^T S w$, whose maximum value is the largest eigenvalue of $S$. So, the direction of largest variance is simply the eigenvector of $S$ corresponding to its largest eigenvalue." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "eigen_values, eigen_vectors = scipy.linalg.eig(S)\n", "w_1 = eigen_vectors[:, np.argmax(eigen_values)]\n", "w_1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To find the second greatest direction of variance, we should remove the effects of the first direction. This is done by projecting $X$ onto $w_1$ and subtracting that projection from $X$." ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "X_hat = X - X @ np.outer(w_1, w_1)\n", "\n", "mu_hat = np.mean(X_hat, axis=0)\n", "S_hat = (X_hat - mu_hat).T @ (X_hat - mu_hat)\n", "\n", "S_hat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By the same reasoning as before, the eigenvector of this new matrix with the largest eigenvalue will correspond to the second direction."
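] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A quick sanity check on this deflation step (a sketch assuming `w_1` and `S_hat` from the cells above): after subtracting the projection onto $w_1$, the deflated data should have essentially no variance left in the $w_1$ direction, with all of the remaining variance along the orthogonal direction." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch: the deflated data should carry ~zero variance along w_1.\n", "print(\"Variance along w_1 after deflation:\", w_1 @ S_hat @ w_1)\n", "print(\"Total (unnormalized) variance remaining:\", np.trace(S_hat))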
] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "eigen_values, eigen_vectors = scipy.linalg.eig(S_hat)\n", "w_2 = eigen_vectors[:, np.argmax(eigen_values)]\n", "w_2" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "np.sqrt(np.max(eigen_values))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Comparing to the results from Sklearn, we see that these precisely correspond to the components retrieved by PCA. Thus, PCA is simply an iterative procedure of finding the direction of greatest variance using the sample variance matrix by finding the largest eigenvector, removing its effects via projection, then repeating the procedure." ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "pca = PCA().fit(X)\n", "pca.components_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To gain a geometric intuition, let us plot the points in the original space (blue), then projected to remove the largest eigenvector (orange) alongside both the eigenvectors. What we notice is that the direction of the first eigenvector is indeed responsible for most of the variance. Just by looking, the data has a spread of length 100 whereas the orange points is something closer to a spread of 50. The second thing to notice is that both eigenvectors together completely summarize the data. That is, after removing the second eigenvector, the data would collapse to the origin. Thus, using all the eigenvectors (PCA components) ultimately retrieves the information in the original data." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "sns.scatterplot(x=X[:,0], y=X[:,1])\n", "sns.scatterplot(x=X_hat[:,0], y=X_hat[:,1])\n", "\n", "x = np.stack([X[:,0], X_hat[:,0]]).T\n", "y = np.stack([X[:,1], X_hat[:,1]]).T\n", "\n", "for i in range(len(X)//20):\n", " sns.lineplot(x=x[i], y=y[i], color=\"k\")\n", " \n", "x = [0, -100*w_1[0]]\n", "y = [0, -100*w_1[1]]\n", " \n", "sns.lineplot(x=x, y=y)\n", "\n", "x = [0, 100*w_2[0]]\n", "y = [0, 100*w_2[1]]\n", " \n", "sns.lineplot(x=x, y=y)\n", "\n", "plt.xlabel(\"Age\")\n", "plt.ylabel(\"MaxHR\")\n", "\n", "_ = plt.axis(\"equal\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us repeat this procedure but for data in 3-dimensions so that you can try to extend the visualization to higher dimensions. Here, we switch to plotly because it handles 3D much better. You immediately notice the linear relationship between all 3 variables." 
] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "X = df_heart[[\"Age\", \"MaxHR\", \"RestBP\"]].values\n", "\n", "# Configure Plotly to be rendered inline in the notebook.\n", "plotly.offline.init_notebook_mode()\n", "\n", "# Configure the trace.\n", "trace = go.Scatter3d(\n", " x=X[:,0],\n", " y=X[:,1],\n", " z=X[:,2],\n", " mode=\"markers\",\n", " marker={\n", " \"size\": 10,\n", " \"opacity\": 0.8,\n", " }\n", ")\n", "\n", "# Configure the layout.\n", "layout = go.Layout(\n", " margin={\"l\": 0, \"r\": 0, \"b\": 0, \"t\": 0},\n", " scene=go.layout.Scene(\n", " xaxis=go.layout.scene.XAxis(title=\"Age\"),\n", " yaxis=go.layout.scene.YAxis(title=\"MaxHR\"),\n", " zaxis=go.layout.scene.ZAxis(title=\"RestBP\")\n", " )\n", ")\n", "\n", "\n", "data = [trace]\n", "\n", "plot_figure = go.Figure(data=data, layout=layout)\n", "\n", "# Render the plot.\n", "plotly.offline.iplot(plot_figure)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's do the PCA procedure." ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "X = df_heart[[\"Age\", \"MaxHR\", \"RestBP\"]].values\n", "X_orig = X.copy()\n", "\n", "# Configure Plotly to be rendered inline in the notebook.\n", "plotly.offline.init_notebook_mode()\n", "\n", "# Configure the trace.\n", "trace = go.Scatter3d(\n", " x=X[:,0],\n", " y=X[:,1],\n", " z=X[:,2],\n", " mode=\"markers\",\n", " name=\"Original Data\",\n", " marker={\n", " \"size\": 10,\n", " \"opacity\": 0.8,\n", " }\n", ")\n", "\n", "# Configure the layout.\n", "layout = go.Layout(\n", " margin={\"l\": 0, \"r\": 0, \"b\": 0, \"t\": 0},\n", " scene=go.layout.Scene(\n", " xaxis=go.layout.scene.XAxis(title=\"Age\"),\n", " yaxis=go.layout.scene.YAxis(title=\"MaxHR\"),\n", " zaxis=go.layout.scene.ZAxis(title=\"RestBP\")\n", " )\n", ")\n", "\n", "\n", "data = [trace]\n", "\n", "plot_figure = go.Figure(data=data, layout=layout)\n", "\n", "\n", "## First projection\n", "\n", "mu = np.mean(X, axis=0)\n", "S = (X - mu).T @ (X - mu)\n", "\n", "eigen_values, eigen_vectors = scipy.linalg.eig(S)\n", "w_1 = eigen_vectors[:, np.argmax(eigen_values)]\n", "\n", "X_prev = X.copy()\n", "X = X_orig - X_orig @ np.outer(w_1, w_1)\n", "\n", "# Configure the trace.\n", "trace = go.Scatter3d(\n", " x=X[:,0],\n", " y=X[:,1],\n", " z=X[:,2],\n", " mode=\"markers\",\n", " name=\"First Projection\",\n", " marker={\n", " \"size\": 10,\n", " \"opacity\": 0.8,\n", " }\n", ")\n", "\n", "data.append(trace)\n", "\n", "x_lines = []\n", "y_lines = []\n", "z_lines = []\n", "\n", "#create the coordinate list for the lines\n", "for i in range(len(X)//10):\n", " \n", " trace = go.Scatter3d(\n", " x=[X_prev[i,0], X[i,0]],\n", " y=[X_prev[i,1], X[i,1]],\n", " z=[X_prev[i,2], X[i,2]],\n", " mode=\"lines\",\n", " showlegend=False,\n", " line=go.scatter3d.Line(color=\"black\")\n", " )\n", "\n", " data.append(trace)\n", "\n", "## Second projection\n", "\n", "mu = np.mean(X, axis=0)\n", "S = (X - mu).T @ (X - mu)\n", "\n", "eigen_values, eigen_vectors = scipy.linalg.eig(S)\n", "w_2 = eigen_vectors[:, np.argmax(eigen_values)]\n", "\n", "X_prev = X.copy()\n", "X = X_orig - X_orig @ np.outer(w_1, w_1) - X_orig @ np.outer(w_2, w_2)\n", "\n", "# Configure the trace.\n", "trace = go.Scatter3d(\n", " x=X[:,0],\n", " y=X[:,1],\n", " z=X[:,2],\n", " mode=\"markers\",\n", " name=\"Second Projection\",\n", " marker={\n", " \"size\": 10,\n", " \"opacity\": 0.8,\n", " }\n", ")\n", "\n", "data.append(trace)\n", "\n", "#create the coordinate 
list for the lines\n", "for i in range(len(X)//10):\n", "\n", " trace = go.Scatter3d(\n", " x=[X_prev[i,0], X[i,0]],\n", " y=[X_prev[i,1], X[i,1]],\n", " z=[X_prev[i,2], X[i,2]],\n", " mode=\"lines\",\n", " showlegend=False,\n", " line=go.scatter3d.Line(color=\"black\")\n", " )\n", "\n", " data.append(trace)\n", "\n", "## Third projection\n", "\n", "mu = np.mean(X, axis=0)\n", "S = (X - mu).T @ (X - mu)\n", "\n", "eigen_values, eigen_vectors = scipy.linalg.eig(S)\n", "w_3 = eigen_vectors[:, np.argmax(eigen_values)]\n", "\n", "\n", "## Eigenvectors\n", "\n", "trace = go.Scatter3d(\n", " x=[0, 200*w_1[0]],\n", " y=[0, 200*w_1[1]],\n", " z=[0, 200*w_1[2]],\n", " mode=\"lines\",\n", " name=\"First Eigenvector\",\n", " line=go.scatter3d.Line(color=\"blue\")\n", ")\n", "\n", "data.append(trace)\n", "\n", "trace = go.Scatter3d(\n", " x=[0, 200*w_2[0]],\n", " y=[0, 200*w_2[1]],\n", " z=[0, 200*w_2[2]],\n", " mode=\"lines\",\n", " name=\"Second Eigenvector\",\n", " line=go.scatter3d.Line(color=\"red\")\n", ")\n", "\n", "data.append(trace)\n", "\n", "trace = go.Scatter3d(\n", " x=[0, 200*w_3[0]],\n", " y=[0, 200*w_3[1]],\n", " z=[0, 200*w_3[2]],\n", " mode=\"lines\",\n", " name=\"Third Eigenvector\",\n", " line=go.scatter3d.Line(color=\"green\")\n", ")\n", "\n", "data.append(trace)\n", "\n", "\n", "plot_figure = go.Figure(data=data, layout=layout)\n", "\n", "# Render the plot.\n", "plotly.offline.iplot(plot_figure)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice again that the lines are parallel to each other. Now, the original data, a cloud of points in 3D, first gets projected to a plane (the red points), then projected to a line (the green points). Imagine first squishing a ball of Play-Doh into a pancake and then taking the pancake and squishing the outsides to form a rope. This is exactly what PCA is doing, except that it does the squishing in directions that have the maximal variance. Why is this all useful? Because, instead of projecting the points as we have done, we can perform dimensionality reduction by projecting the original data onto a subset of the eigenvectors. For example, we can take our 3D cloud of points and reduce the dimensions by 66% by only keeping the first eigenvector, which is responsible for most of the variance and so keeps most of the information in those features."
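] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In code, \"keeping only the first eigenvector\" just means storing a single coordinate per point: the projection onto $w_1$ (a sketch assuming `df_heart` and the 3D `w_1` from the cells above; sklearn's `PCA(n_components=1)` does the same thing, except that it also centers the data first, so its scores can differ by a constant shift and possibly a sign flip)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch: reduce the 3D cloud to one coordinate per point via projection onto w_1.\n", "X3 = df_heart[[\"Age\", \"MaxHR\", \"RestBP\"]].values\n", "scores_1d = X3 @ w_1    # one number per patient instead of three\n", "print(X3.shape, \"->\", scores_1d.shape)\n", "\n", "# The sklearn route (which centers the data first):\n", "scores_sk = PCA(n_components=1).fit_transform(X3)\n", "print(scores_sk.shape)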
] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "X = df_heart[[\"Age\", \"MaxHR\", \"RestBP\"]].values\n", "X_orig = X.copy()\n", "\n", "# Configure Plotly to be rendered inline in the notebook.\n", "plotly.offline.init_notebook_mode()\n", "\n", "# Configure the trace.\n", "trace = go.Scatter3d(\n", " x=X[:,0],\n", " y=X[:,1],\n", " z=X[:,2],\n", " mode=\"markers\",\n", " name=\"Original Data\",\n", " marker={\n", " \"size\": 10,\n", " \"opacity\": 0.8,\n", " }\n", ")\n", "\n", "# Configure the layout.\n", "layout = go.Layout(\n", " margin={\"l\": 0, \"r\": 0, \"b\": 0, \"t\": 0},\n", " scene=go.layout.Scene(\n", " xaxis=go.layout.scene.XAxis(title=\"Age\"),\n", " yaxis=go.layout.scene.YAxis(title=\"MaxHR\"),\n", " zaxis=go.layout.scene.ZAxis(title=\"RestBP\")\n", " )\n", ")\n", "\n", "\n", "data = [trace]\n", "\n", "plot_figure = go.Figure(data=data, layout=layout)\n", "\n", "\n", "## Remove smallest eigenvalues\n", "\n", "X = X_orig - X_orig @ np.outer(w_2, w_2) - X_orig @ np.outer(w_3, w_3)\n", "\n", "# Configure the trace.\n", "trace = go.Scatter3d(\n", " x=X[:,0],\n", " y=X[:,1],\n", " z=X[:,2],\n", " mode=\"markers\",\n", " name=\"Data Along First Eigenvector\",\n", " marker={\n", " \"size\": 10,\n", " \"opacity\": 0.8,\n", " }\n", ")\n", "\n", "data.append(trace)\n", "\n", "plot_figure = go.Figure(data=data, layout=layout)\n", "\n", "\n", "# Render the plot.\n", "plotly.offline.iplot(plot_figure)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the red line is only 1 dimensional but the spread between the two furthest points is almost equal to the original data. We can use this to construct curves of how many dimensions are needed to retain a certain amount of variance. For instance, suppose we want to decrease the dimensions of our 7 dimensional dataset so that we may visualize the data more readily. Suppose furthermore we want the data to keep 90% of the original variance." ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "columns = [\"Age\", \"RestBP\", \"Chol\", \"MaxHR\", \"Sex\", \"Oldpeak\", \"Slope\"]\n", "\n", "X = df_heart[columns].values\n", "\n", "mu = np.mean(X, axis=0)\n", "S = (X - mu).T @ (X - mu) / (len(X) - 1)\n", "\n", "total_variance = np.diag(S).sum()\n", "print(f\"Total variance is: {total_variance}\")" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "n = len(columns)\n", "var_arr = []\n", "eigenvector_arr = []\n", "\n", "for i in range(n):\n", " eigen_values, eigen_vectors = scipy.linalg.eig(S)\n", " w = np.real(eigen_vectors[:, np.argmax(eigen_values)])\n", " eigenvector_arr.append(w)\n", "\n", " X = X - X @ np.outer(w, w)\n", "\n", " mu = np.mean(X, axis=0)\n", " S = (X - mu).T @ (X - mu) / (len(X) - 1)\n", "\n", " variance = np.diag(S).sum()\n", " var_arr.append((total_variance-variance)/total_variance)\n", " \n", "sns.lineplot(x=list(range(1, n+1)), y=var_arr)\n", "plt.xlabel(\"Number of Components\")\n", "plt.ylabel(\"Proportion of Total Variance\")\n", "print(var_arr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So we see that 2 dimensions keeps almost 90% of the original variance and when we jump to 3 dimension it keeps 98% of the original data. Of course, this is a very common task for PCA and so is provided by many packages." 
] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "X = df_heart[columns].values\n", "\n", "pca = PCA().fit(X)\n", "print(pca.explained_variance_ratio_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, we can compress our data in 2 dimensions and now visualize our 7 dimensional dataset." ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "n_components = 2\n", "\n", "X_hat = pca.transform(X)[:,:n_components]\n", "\n", "sns.scatterplot(x=X_hat[:,0], y=X_hat[:,1], hue=df_heart[\"AHD\"], legend=\"full\")\n", "\n", "plt.xlabel(\"First PCA Component Vector (Z1)\")\n", "plt.ylabel(\"Second PCA Component Vector (Z2)\");\n", "\n", "_ = plt.axis(\"equal\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To summarize, PCA is not a tool to help you make better predictions. It cannot be because it is simple linear transformations of the data. However, it gives one a way to compress the data and to better visualize it without losing information. In the lens of compression, PCA can be thought of as feature engineering as your new compressed data retains much of the information that is now exogenous of the dataset." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }