Key Word(s): PCA, Model Selection
Download this notebook from the CS109 repo or here: http://bit.ly/Sec_5_109a (right-click and save to your computer).
For this section, our goal is to review and deepen our understanding of Principal Components Analysis (PCA). PCA is highly effective when applied to high-dimensional datasets, which is what we will work with here. In particular, this section is designed to help us answer Homework 4, part (h).
Specifically, we will:
1. Review the basics of Principal Components Analysis and hone our intuition
2. Discuss implementation of PCA within Python and coding issues to keep in mind
3. Use the principles of model selection we have learned in lecture to find a "best" PCA feature set.
4. Compare our PCA model with other models we have fit in labs and lecture and discuss coefficient meanings.
For this section we will be using the following packages:
import sys
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 999)
pd.set_option('display.max_columns', 999)
pd.set_option('display.width', 1000)
pd.set_option('display.notebook_repr_html', True)
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from statsmodels.api import OLS
from statsmodels.api import add_constant
from statsmodels.regression.linear_model import RegressionResults
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.neighbors import KNeighborsRegressor
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
# Note -- requires sklearn version 0.18 or higher
from sklearn.metrics import r2_score
from collections import Counter
sns.set(style="ticks")
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
matplotlib.rcParams['figure.figsize'] = (13.0, 6.0)
assert sys.version_info.major == 3, sys.version
# Python 3 or higher is required
Part (1): Load in our data and conduct basic EDA
We will first load in our dataset below and look at the first few rows. Then, we use the describe function to get a sense of the data.
crime_df = pd.read_csv('https://raw.githubusercontent.com/albertw1/data/master/Crime.csv').drop(['Date', 'Year'], axis=1)
crime_df.head()
crime_df.describe()
Convert the columns that are categorical variables into dummy variables by one-hot encoding.
categorical_columns = ['Month', 'Weekend', 'Season', 'DOW']
numerical_columns = ['Temp', 'Dewpoint', 'Windspeed', 'Pressure', 'Precipitation', 'TMAX_C', 'TMIN_C']
crime_df = pd.get_dummies(crime_df, columns=categorical_columns, drop_first=True)
crime_df.head()
Now, let's split this dataset up into a testing and training set.
train, test = train_test_split(crime_df, test_size=.2, random_state=123)
train.shape,test.shape
Now let us standardize the numerical variables only, using the training mean and standard deviation for both sets.
mean = train[numerical_columns].mean()
std = train[numerical_columns].std()
train[numerical_columns] = (train[numerical_columns] - mean)/std
test[numerical_columns] = (test[numerical_columns] - mean)/std
Now let us subset the data and create convenient formats (DataFrames and NumPy arrays) for the training and test sets.
all_predictors = ['Temp','Dewpoint','Windspeed','Pressure','Precipitation','TMAX_C','TMIN_C','Month_2','Month_3','Month_4','Month_5','Month_6','Month_7','Month_8','Month_9','Month_10','Month_11','Month_12','Weekend_1','Season_Spring','Season_Summer','Season_Winter','DOW_Monday','DOW_Saturday','DOW_Sunday','DOW_Thursday','DOW_Tuesday','DOW_Wednesday']
X_train_df = train[all_predictors]
X_test_df = test[all_predictors]
X_train_np = train[all_predictors].values
X_test_np = test[all_predictors].values
y_train = train['Incidence'].values
y_test = test['Incidence'].values
y_test.shape
Part (2): Use subset selection to fit a linear regression model
Let's use the forward/backward subset selection method from HW 3 to fit a linear regression model on the data.
def step_forwards_backwards(df, y_val, direction='forward'):
assert direction in ['forward', 'backward']
y = y_val.reshape(-1,1)
predictors = set(df.columns)
selected_predictors = set() if direction=='forward' else set(predictors)
n = df.shape[0]
best_bic = np.inf
best_bics = []
best_models = []
if direction == 'forward':
X = np.ones(n).reshape(-1,1)
X = np.concatenate([X, df[list(selected_predictors)].values], axis=1)
while (True):
possible_bic_scores = []
possible_predictors = list(selected_predictors ^ predictors)
if len(possible_predictors) == 0:
break
for predictor in possible_predictors:
x_temp = np.concatenate([X, df[predictor].values.reshape(-1,1)], axis=1)
model = OLS(endog=y, exog=x_temp).fit()
bic = model.bic
possible_bic_scores.append(bic)
best_predictor_ix = np.argmin(possible_bic_scores)
best_predictor = possible_predictors[best_predictor_ix]
best_bic = np.min(possible_bic_scores)
best_bics.append(best_bic)
selected_predictors.add(best_predictor)
X = np.concatenate([X, df[best_predictor].values.reshape(-1,1)], axis=1)
best_models.append(list(selected_predictors))
else:
while (True):
possible_bic_scores = []
possible_predictors = list(selected_predictors)
if len(possible_predictors) == 0:
break
for predictor in possible_predictors:
X = np.concatenate([np.ones(n).reshape(-1,1), df[list(selected_predictors - set([predictor]))].values], axis=1)
model = OLS(endog=y, exog=X).fit()
bic = model.bic
possible_bic_scores.append(bic)
best_predictor_ix = np.argmin(possible_bic_scores)
best_predictor = possible_predictors[best_predictor_ix]
best_bic = possible_bic_scores[best_predictor_ix]
selected_predictors.discard(best_predictor)
best_bics.append(best_bic)
best_models.append(list(selected_predictors))
index_of_best_bic = np.argmin(best_bics)
return best_models[index_of_best_bic]
Let's run the subset selection function and see which variables were included in the best model:
predictors_forward = # Your code goes here
predictors_forward
predictors_backward = # Your code goes here
predictors_backward
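For reference, one possible way to fill in these cells, mirroring the calls made on the polynomial data frame later in this section, is sketched below (try it yourself first):
# One possible solution sketch, using the data frame and response defined above
predictors_forward = step_forwards_backwards(X_train_df, y_train, direction='forward')
predictors_backward = step_forwards_backwards(X_train_df, y_train, direction='backward')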
Based on these variables, we can see what the R-squared values are for our training and testing sets.
X = sm.add_constant(X_train_df[predictors_backward])
X_test = sm.add_constant(X_test_df[predictors_backward])
y = train['Incidence'].values.reshape(-1,1)
model = OLS(endog=y, exog=X)
result = model.fit()
y_hat_train = result.predict()
y_hat_test = result.predict(exog=X_test)
print('Backward Selection Training R2 = ', r2_score(y_train, y_hat_train))
print('Backward Selection Testing R2 = ', r2_score(y_test, y_hat_test))
X = sm.add_constant(X_train_df[predictors_forward])
X_test = sm.add_constant(X_test_df[predictors_forward])
y = train['Incidence'].values.reshape(-1,1)
model = OLS(endog=y, exog=X)
result = model.fit()
y_hat_train = result.predict()
y_hat_test = result.predict(exog=X_test)
print('Forward Selection Training R2 = ', r2_score(y_train, y_hat_train))
print('Forward Selection Testing R2 = ', r2_score(y_test, y_hat_test))
Part (3): Create a data frame with continuous predictors taken to polynomial power
Now we will work through an example where we manually raise the numeric predictors to polynomial powers. The next cell is a simple example of how to do this.
np.hstack([np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])**(i+1) for i in range(3)])
We want to create a data frame in which the continuous predictors are raised to powers up to 3, while keeping the categorical predictors unchanged.
X_train_numerical_powers = # Your code goes here
print('Number of Total Predictors with Continuous Polynomial Terms Added is', X_train_numerical_powers.shape[1])
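One possible way to fill in the blank above (a sketch, following the np.hstack pattern from the toy example and matching the column names defined below):
# Sketch: stack the numeric predictors raised to powers 1, 2, and 3 side by side
X_train_numerical_powers = np.hstack([train[numerical_columns].values**(i+1) for i in range(3)])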
X_train_np_powers = np.concatenate((X_train_numerical_powers,X_train_df.drop(numerical_columns, axis=1)),axis=1)
X_train_df_powers = pd.DataFrame(X_train_np_powers)
newcolname = ['Temp', 'Dewpoint', 'Windspeed', 'Pressure', 'Precipitation', 'TMAX_C', 'TMIN_C', 'Temp^2', 'Dewpoint^2', 'Windspeed^2', 'Pressure^2', 'Precipitation^2', 'TMAX_C^2', 'TMIN_C^2', 'Temp^3', 'Dewpoint^3', 'Windspeed^3', 'Pressure^3', 'Precipitation^3', 'TMAX_C^3', 'TMIN_C^3'] + list(X_train_df.drop(numerical_columns, axis=1))
X_train_df_powers.columns = newcolname
X_train_df_powers.head()
We can do the same with the test set:
X_test_numerical_powers = # Your code goes here
X_test_np_powers = np.concatenate((X_test_numerical_powers,X_test_df.drop(numerical_columns, axis=1)),axis=1)
X_test_df_powers = pd.DataFrame(X_test_np_powers)
newcolname = ['Temp', 'Dewpoint', 'Windspeed', 'Pressure', 'Precipitation', 'TMAX_C', 'TMIN_C', 'Temp^2', 'Dewpoint^2', 'Windspeed^2', 'Pressure^2', 'Precipitation^2', 'TMAX_C^2', 'TMIN_C^2', 'Temp^3', 'Dewpoint^3', 'Windspeed^3', 'Pressure^3', 'Precipitation^3', 'TMAX_C^3', 'TMIN_C^3'] + list(X_train_df.drop(numerical_columns, axis=1))
X_test_df_powers.columns = newcolname
X_test_df_powers.head()
We can do forward and backward selection as well on this new data frame.
predictors_forward = step_forwards_backwards(X_train_df_powers, y_train, direction='forward')
predictors_forward
predictors_backward = step_forwards_backwards(X_train_df_powers, y_train, direction='backward')
predictors_backward
X = sm.add_constant(X_train_df_powers[predictors_backward])
X_test = sm.add_constant(X_test_df_powers[predictors_backward])
y = train['Incidence'].values.reshape(-1,1)
model = OLS(endog=y, exog=X)
result = model.fit()
y_hat_train = result.predict()
y_hat_test = result.predict(exog=X_test)
print('Backward Selection Training R2 = ', r2_score(y_train, y_hat_train))
print('Backward Selection Testing R2 = ', r2_score(y_test, y_hat_test))
X = sm.add_constant(X_train_df_powers[predictors_forward])
X_test = sm.add_constant(X_test_df_powers[predictors_forward])
y = train['Incidence'].values.reshape(-1,1)
model = OLS(endog=y, exog=X)
result = model.fit()
y_hat_train = result.predict()
y_hat_test = result.predict(exog=X_test)
print('Forward Selection Training R2 = ', r2_score(y_train, y_hat_train))
print('Forward Selection Testing R2 = ', r2_score(y_test, y_hat_test))
Part (4): Create the expanded matrix containing all terms
Let's now create a design matrix that includes all polynomial terms up to the third order, including all interactions.
all_poly_terms = PolynomialFeatures(degree=3, interaction_only=False, include_bias=False)
X_train_full_poly = all_poly_terms.fit_transform(X_train_df)
X_test_full_poly = all_poly_terms.transform(X_test_df)
print('number of total predictors', X_train_full_poly.shape[1])
X_train_full_poly.shape
We can use the following method to get an idea of the generated terms, including the interactions between variables (printing the first 25 feature names).
all_poly_terms.get_feature_names()[0:25]
X_train_full_poly
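If the generic x0, x1, ... labels are hard to read, get_feature_names can optionally be given the original column names (a sketch; in newer versions of sklearn the equivalent method is get_feature_names_out):
# Label the polynomial/interaction terms with the original predictor names
all_poly_terms.get_feature_names(list(X_train_df.columns))[0:25]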
Some of the generated columns are identically zero (for example, products of mutually exclusive dummy variables such as two different months). If we want to drop these all-zero columns, we can do the following:
zero_column_index = np.where(~X_train_full_poly.any(axis=0))[0]
X_train_full_poly_nonzero_col = np.delete(X_train_full_poly, zero_column_index, axis = 1)
X_test_full_poly_nonzero_col = np.delete(X_test_full_poly, zero_column_index, axis = 1)
print(X_train_full_poly_nonzero_col)
Now we can fit our PCA model:
pca = PCA(n_components=5)
pca.fit(X_train_full_poly)
train_pca = pca.transform(X_train_full_poly)
test_pca = pca.transform(X_test_full_poly)
print('Explained variance ratio:', pca.explained_variance_ratio_)
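It is also helpful to look at the cumulative share of variance captured as components are added (a quick check using the fitted object above):
# Cumulative proportion of variance explained by the first k components
np.cumsum(pca.explained_variance_ratio_)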
We can obtain the corresponding coefficients of each principal component.
pca.components_
# First "row" of array corresponds to first component weights.
If you recall, the weights of each principal component must have squared values that sum to one (each component vector has unit length). Let's check that this is the case:
# Your code goes here
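One way to check this (a sketch; each row of pca.components_ should have unit length):
# Sum of squared weights for each of the 5 components -- each should be ~1
np.sum(pca.components_**2, axis=1)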
Now we will look at the components, i.e. the weight vectors of our 5 principal components. Because PCA works on centered data, each weight tells us how strongly, and in which direction, the corresponding feature contributes to that component.
feature_frame = pd.DataFrame(pca.components_,columns=all_poly_terms.get_feature_names(),index = ['PC-1','PC-2','PC-3','PC-4','PC-5'])
feature_frame
Now, let's take a look at the largest weights for the first principal component.
feature_frame[0:1].sort_values(by=['PC-1'], axis = 1, ascending = False).T[0:3]
The first principal component is dominated by x4^3, x2 x4^2, and x4^2 (recall that x4 and x2 are positional names for the fifth and third columns of X_train_df, i.e. Precipitation and Windspeed). Observations with large values on these features score highly on this component, and these features tend to vary together.
Finally, let's look at the train and test R-squared for models built on the first 1 through 10 principal components.
R2_pca_train = []
R2_pca_test = []
for i in range(10):
pca = PCA(n_components=i+1)
pca.fit(X_train_full_poly_nonzero_col)
X_train_pca = pca.transform(X_train_full_poly_nonzero_col)
X_test_pca = pca.transform(X_test_full_poly_nonzero_col)
regression_model = LinearRegression(fit_intercept=True)
regression_model.fit(X_train_pca, y_train)
R2_pca_train.append(regression_model.score(X_train_pca, y_train))
R2_pca_test.append(regression_model.score(X_test_pca, y_test))
print('Explained variance ratio for Model with',i+1, 'components:', pca.explained_variance_ratio_)
R2_pca_train
R2_pca_test
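To make the comparison easier, we can plot the train and test R-squared against the number of retained components (a quick sketch using the lists we just built):
# Plot R^2 on train and test as a function of the number of principal components
n_components = np.arange(1, 11)
plt.plot(n_components, R2_pca_train, 'o-', label='Train $R^2$')
plt.plot(n_components, R2_pca_test, 'o-', label='Test $R^2$')
plt.xlabel('Number of principal components')
plt.ylabel('$R^2$')
plt.legend()
plt.show()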