Key Word(s): pandas
CS109A Introduction to Data Science
Lab 4: Multiple Regression and Feature engineering¶
Harvard University
Fall 2021
Instructors: Pavlos Protopapas and Natesh Pillai
Lab Team: Marios Mattheakis, Hayden Joy, Chris Gumb, and Eleni Kaxiras
Authors: Eleni Kaxiras, Rahul Dave, David Sondak, Will Claybaugh, and Pavlos Protopapas
## RUN THIS CELL TO GET THE RIGHT FORMATTING
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
Learning Objectives¶
After this lab, you should be able to
- Implement multiple regression models with sklearn.
- Work with categorical variables, including transforming them.
- Incorporate pipelines into your workflow.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# from sklearn import preprocessing
from sklearn.preprocessing import PolynomialFeatures, MinMaxScaler
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from pandas.api.types import CategoricalDtype
from sklearn.compose import make_column_transformer, TransformedTargetRegressor
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.linear_model import Ridge
from sklearn.impute import SimpleImputer
from pandas.plotting import scatter_matrix
import seaborn as sns
%matplotlib inline
1 - Exploring the Football data¶
Introduction¶
Our goal is to fit models that predict the players' market value (what the player could earn when hired by a new team).¶
There are all sorts of questions we could answer; for example, is there a relationship between a player's popularity and his market value? We could also make interesting observations about players in the top 6 teams.
The data were scraped by Shubham Maurya from a variety of sources, including transfermrkt.com and Fantasy Premier League (FPL). They record various facts about players in the English Premier League.
Data description¶
- name: Name of the player
- club: Club of the player
- age: Age of the player
- position: The usual position on the pitch
- position_cat: 1 for attackers, 2 for midfielders, 3 for defenders, 4 for goalkeepers
- page_views: Average daily Wikipedia page views from September 1, 2016 to May 1, 2017
- fpl_points: FPL points accumulated over the previous season
- region: 1 for England, 2 for EU, 3 for Americas, 4 for Rest of World
- nationality: Player's nationality
- new_signing: Whether a new signing for 2017/18 (till 20th July)
- new_foreign: Whether a new signing from a different league, for 2017/18 (till 20th July)
- club_id: A numerical version of the club feature

Our response variable:
- market_value: As on transfermrkt.com on July 20th, 2017
Import the data¶
league_df = pd.read_csv("league_data.csv")
league_df.head()
league_df.shape
league_df.isnull().sum()
We have not yet talked about handling missing values, so for now we will simply drop the rows that contain them.¶
league_df = league_df.dropna()
league_df.isnull().sum()
response = 'market_value'
y = league_df[response]
league_df.describe(include="all")
- The people who hired us to predict on these data want to know whether being in a big club affects a player's market value. So we need to create a new binary categorical variable named big_club, with values 0 or 1 designating whether a club belongs to the Top 6 clubs:
big_clubs = ['Arsenal', 'Chelsea', 'Liverpool', 'Manchester+City', 'Manchester+United', 'Tottenham']
- They also want to look at players in age groups and not just by age. Put the age feature in bins according to the values below, and name the variable age_cat. pandas has the .cut() method, which breaks a variable into bins with labels:
age_bins = [___]
age_labels = [___]
league_df['age_cat'] = pd.cut(x=league_df['age'], bins=age_bins, labels=age_labels)
# 1. your code here
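One possible approach for step 1 (a sketch; a list comprehension or np.where would work equally well):
big_clubs = ['Arsenal', 'Chelsea', 'Liverpool', 'Manchester+City', 'Manchester+United', 'Tottenham']
# 1 if the player's club is one of the Top 6 clubs, 0 otherwise
league_df['big_club'] = league_df['club'].isin(big_clubs).astype(int)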
# check
list(league_df[['club', 'big_club']].groupby(['big_club']).apply(np.unique))
Applying functions to pandas DataFrames and Series¶
A simpler but less generic way to do the previous exercise would be
league_df['big_club2'] = league_df.apply(lambda row: 1 if row['club'] in big_clubs else 0, axis=1)
If the function is simple enough, there is an even more direct way to create the new column (feature), e.g.:
df['new_column'] = df['column']**2
# 2. your code here
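One possible approach for step 2 (a sketch; the bin edges and labels below are placeholders, since the intended values are not reproduced here — substitute the values from the exercise prompt):
# NOTE: illustrative bin edges and labels only; replace with the values given in the lab
age_bins = [15, 20, 25, 30, 35, 40]
age_labels = [1, 2, 3, 4, 5]
league_df['age_cat'] = pd.cut(x=league_df['age'], bins=age_bins, labels=age_labels)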
# check
list(league_df[['age_cat', 'age', ]].sort_values(by='age_cat').groupby(['age_cat']).apply(np.unique))
Looking at data types more closely¶
league_df.dtypes
# let's see what features we want to use in the model
categorical_cols = ['position_cat', 'new_signing', 'big_club', 'age_cat', 'region'] # non-ordinal
numerical_cols = ['age', 'page_views', 'fpl_points']
ordinal_cols = [] # we do not have any
league_df.head()
# cast categorical variables as pandas type `category`
cat_type = CategoricalDtype(ordered=False)
for var in categorical_cols:
league_df[var] = league_df[var].astype(cat_type)
league_df[categorical_cols+numerical_cols].dtypes
# Shape of things
league_df.age.values.reshape(-1,1).shape
Stratified train/test split¶
We want to split before we do any EDA since, ideally, we do not want our test set to influence our design decisions. We also want to make sure that the training and test data have appropriate representation of each region; it would be bad for the training data to entirely miss a region. This is especially important because some regions are rather rare.
The function returns the train and test subsets:
sklearn.model_selection.train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
Use the train_test_split function and its stratify argument to split the data so that each region is represented in the same proportion in the training and test sets.
Note: This will not work if the dataset contains missing data.
# your code here
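A minimal sketch, stratifying on region (the test_size and random_state values are arbitrary choices):
# split features and response together, keeping the region proportions the same in both sets
train_data, test_data, y_train, y_test = train_test_split(
    league_df, y, test_size=0.2, random_state=109, stratify=league_df['region'])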
# check
train_data.shape, test_data.shape, y_train.shape, y_test.shape
Now that we won't be peeking at the test set, let's explore and look for patterns! We'll practice a number of useful pandas and numpy functions along the way.
We notice that our dataset contains columns with different data types, and we need to apply specific preprocessing to each of them. Categorical variables that are ordinal need to be coded as integers, while the rest need to be one-hot-encoded. We can do this sequentially by hand or, better, use sklearn's pipeline structure. Our pipeline could also conveniently include any standardization/normalization of numerical values; for now we will leave them as they are.
train_data.head()
sns.pairplot(train_data[['age', 'page_views', 'market_value']], \
kind='reg', diag_kind='hist');
train_data.columns
train_data[['club','club_id']].\
groupby(['club_id']).agg({'club' : np.unique,
})
train_data.groupby('position').agg({
'market_value': np.mean,
'page_views': np.median,
'fpl_points': np.max
})
2 - Transform categorical variables¶
categorical_cols, numerical_cols
X_train = train_data[categorical_cols+numerical_cols].copy()
X_test = test_data[categorical_cols+numerical_cols].copy()
X_train.shape, X_test.shape, y_train.shape, y_test.shape
Using sklearn OneHotEncoder()
¶
By default, OneHotEncoder keeps all of the one-hot columns it creates. Its drop parameter offers a fine-grained drop mechanism: it can drop a column only for binary features, drop the first category of every feature, or drop a specific category per feature.
drop : {'first', 'if_binary'} or array-like of shape (n_features,), default=None
It also has a mechanism for handling the presence of unknown categories in the test set.
handle_unknown : {'error', 'ignore'}, default='error'
oh = OneHotEncoder(drop='if_binary', sparse=False, handle_unknown='error')
oh_train = oh.fit_transform(train_data[categorical_cols])
oh_train[:10]
list(zip(categorical_cols, oh.categories_))
oh_train.shape, train_data[categorical_cols].shape
oh_test = oh.transform(test_data[categorical_cols])
oh_test.shape, test_data[categorical_cols].shape
# remember these are "views" of the dataframe
# the dataframe remains unchanged
train_data[categorical_cols].head(5)
train_data[numerical_cols].values.shape, oh_train.shape
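If we wanted to assemble a design matrix by hand from these pieces, one option (a sketch, not the route we take below) is to stack the dense one-hot block next to the numerical columns:
# concatenate the one-hot matrix with the numerical features, column-wise
design_train = np.hstack([oh_train, train_data[numerical_cols].values])
design_test = np.hstack([oh_test, test_data[numerical_cols].values])
design_train.shape, design_test.shape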
Using pandas get_dummies()
¶
By default, get_dummies keeps all $k$ dummies for a feature with $k$ categorical levels. It can be made to remove the first level (drop_first=True), so that we are left with $k-1$ dummies.
dummies_train = pd.get_dummies(train_data[categorical_cols]) #drop_first=True
dummies_train.head()
# transform the test set
dummies_test = pd.get_dummies(test_data[categorical_cols])
Note: unlike OneHotEncoder, get_dummies knows nothing about the training categories; if the test dataset has a category that does not exist in the training set (or is missing one), the resulting dummy columns will not match and the model will fail downstream.
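One common workaround (a sketch; here it is a no-op because the train and test sets share the same categories) is to re-index the test dummies to the training columns, filling any missing ones with 0:
# align the test dummy columns to the training dummy columns
dummies_test = dummies_test.reindex(columns=dummies_train.columns, fill_value=0)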
pd.set_option('display.max_columns', None)
# create the design matrix for the train set
design_train_df = pd.concat([train_data[numerical_cols], dummies_train], axis=1)
design_train_df.head()
# for the test set
design_test_df = pd.concat([test_data[numerical_cols], dummies_test], axis=1)
design_train_df.dtypes
# the dataframe remains unchanged
train_data[categorical_cols].head(5)
list(zip(categorical_cols, oh.categories_))
Now, let's run the model using our design matrices¶
#create linear model
regression = LinearRegression()
#fit linear model
regression.fit(design_train_df, y_train)
y_pred = regression.predict(design_test_df)
r2_train = regression.score(design_train_df, y_train)
r2_test = regression.score(design_test_df, y_test)
print(f'R^2 train = {r2_train:.5}')
print(f'R^2 test = {r2_test:.5}')
3 - Using Transformation Pipelines¶
There may be many transformations that need to be executed sequentially in order to construct the design matrix. As we saw, it is possible to handcraft the design matrix ourselves by transforming individual columns, but it is more efficient and less error-prone to create an sklearn pipeline to do this for us. sklearn can work directly with numpy arrays or pandas DataFrames.
When using the latter, sklearn.compose.ColumnTransformer is useful, as it applies transformers to columns of an array or pandas DataFrame. This estimator allows different columns or column subsets of the input to be transformed separately, and the features generated by each transformer are concatenated to form the design matrix.
Making a pipeline¶
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# a schematic pipeline skeleton; we build the real one below
my_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),        # fill in missing values (we will use this later)
    ('std_scaler', StandardScaler()),                      # optional scaling of numerical features
    ('selector', ColumnTransformer(transformers=[...])),   # column-wise transformations, e.g. one-hot encoding
    ('regressor', LinearRegression())                      # the actual regressor model
])
# transform categoricals
categorical_encoder = OneHotEncoder(drop='if_binary', handle_unknown='error') #handle_unknown='ignore'
# transform numericals
numerical_pipe = Pipeline([
('imputer', SimpleImputer(strategy='mean')), # for later
#('stdscaler', StandardScaler()) # for later
])
# bring all transformations together
preprocessor = ColumnTransformer([
('cat', categorical_encoder, categorical_cols),
('num', numerical_pipe, numerical_cols)
])
# add a regressor
lr = LinearRegression()
model = Pipeline([
('preprocessor', preprocessor),
('regressor', lr)
])
model.fit(X_train, y_train)
ohe = (model.named_steps['preprocessor'].named_transformers_['cat'])
feature_names = ohe.get_feature_names(input_features=categorical_cols)
feature_names = np.r_[feature_names, numerical_cols]
feature_names = list(feature_names)
feature_names
print(f'LR train R^2: {model.score(X_train, y_train):.3f}')
print(f'LR test R^2: {model.score(X_test, y_test):.3f}')
# grab the linear regressor
linear_regressor = model.named_steps['regressor']
linear_regressor.coef_.shape
pd.DataFrame(zip(feature_names, linear_regressor.coef_), columns=['feature', 'coeff'])
A different way to construct the pipeline¶
preprocessor = make_column_transformer(
(OneHotEncoder(drop='if_binary', handle_unknown='error'), categorical_cols),
#(StandardScaler(), numerical_columns),
(SimpleImputer(strategy='mean'), numerical_cols),
remainder='passthrough'
)
model = make_pipeline(
preprocessor,
LinearRegression()
)
model.fit(X_train, y_train)
feature_names = (model.named_steps['columntransformer']
.named_transformers_['onehotencoder']
.get_feature_names(input_features=categorical_cols))
feature_names = np.concatenate(
[feature_names, numerical_cols])
coefs = pd.DataFrame(
model.named_steps['linearregression'].coef_,
columns=['Coefficients'], index=feature_names
)
coefs
print(f'LR train R^2: {model.score(X_train, y_train):.3f}')
print(f'LR test R^2: {model.score(X_test, y_test):.3f}')
4 - Feature Engineering¶
Let's focus on introducing new features to see if our model performs better. After talking to our client for four hours and giving it some thought, we concluded that the mean predicted market value should be:
$$\hat{y} = \beta_0 + \beta_1\cdot \text{fpl\_points} + \beta_2\cdot\text{age} + \beta_3\cdot\text{age}^2 + \beta_4\cdot \text{new\_signing} + \beta_5\cdot \text{big\_club} + \beta_6\cdot \text{position\_cat} \\ + \beta_7\cdot \text{age\_cat} + \beta_8\cdot \text{page\_views}\times \text{fpl\_points}$$

We're including a 2nd-degree polynomial in age because we expect pay to increase as a player gains experience, but then decrease as they continue aging. We also include an interaction term between page_views and fpl_points.
- Build a design matrix function and fit this model to the training data. How good is the overall model?
- Interpret the regression model. What is the meaning of the coefficient for:
- age and age$^2$
- big_club
- What should a player do in order to improve their market value? How many additional page views would a player need to increase their market value by 10?
# load a fresh train and test set.
train_data = pd.read_csv("train_data.csv")
test_data = pd.read_csv("test_data.csv")
train_data.head(2)
# your code here
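A possible sketch of the design matrix function, assuming the fresh CSVs already contain the engineered big_club and age_cat columns (recreate them as above if not); the helper name make_design and the interaction column name are illustrative, while age_sq matches the name used in the cells below.
def make_design(df):
    # numerical terms, the quadratic age term, and the interaction term
    X = pd.DataFrame({
        'fpl_points': df['fpl_points'],
        'age': df['age'],
        'age_sq': df['age']**2,
        'new_signing': df['new_signing'],
        'big_club': df['big_club'],
        'page_views_x_fpl': df['page_views'] * df['fpl_points'],
    }, index=df.index)
    # one-hot encode the multi-level categoricals, dropping one level each (we keep the intercept)
    X = pd.concat([X,
                   pd.get_dummies(df['position_cat'], prefix='position_cat', drop_first=True),
                   pd.get_dummies(df['age_cat'], prefix='age_cat', drop_first=True)],
                  axis=1)
    return X

X_train, y_train = make_design(train_data), train_data['market_value']
X_test, y_test = make_design(test_data), test_data['market_value']
# guard against category mismatches between the two sets
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

model = LinearRegression().fit(X_train, y_train)
coefs = pd.DataFrame(model.coef_, columns=['Coefficients'], index=X_train.columns)
coefs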
# check
print(f'LR train R^2: {model.score(X_train, y_train):.3f}')
print(f'LR test R^2: {model.score(X_test, y_test):.3f}')
Conceptual questions¶
- The model is reasonably good. We're capturing about 76% of the variation in market values, and the test set confirms that we're not overfitting too badly.
- Look at the coefficients; the exact values depend on your split.
- Linear regression on non-experimental data can't determine causation, so we can't prove that a given relationship runs in the direction we might think. For instance, doing whatever it takes to get more page views probably doesn't meaningfully increase market value; it's likely the causation runs in the other direction and great players get more views. Even so, we can use page views to help us tell who is a great player and thus likely to be paid well.
agecoef = float(coefs.loc['age'].values)
age2coef = float(coefs.loc['age_sq'].values)
agecoef, age2coef
x_vals = np.linspace(-100,100,1000)
y_vals = agecoef*x_vals +age2coef*x_vals**2
plt.plot(x_vals, y_vals)
plt.title("Effect of Age on Player Market value")
plt.xlabel("Age")
plt.ylabel("Contribution to Predicted Market Value")
plt.show()
Conceptual questions¶
- If our model does not have a constant, we must include all four dummy variable columns. If we drop one, we're not modeling any effect of being in that category, and effectively assuming the dropped category's effect is 0.
- Being in position 2 (instead of position 1) has an impact between -1.54 and +2.38 on a player's market value. Since we're using an intercept, the dropped category becomes the baseline and the effect of any dummy variable is the effect of being in that category instead of the baseline category.
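A tiny toy example (hypothetical numbers) illustrating the baseline interpretation: with all $k$ dummies and no intercept, each coefficient is that category's mean; with $k-1$ dummies plus an intercept, the intercept is the baseline category's mean and each coefficient is a difference from that baseline.
toy = pd.DataFrame({'position_cat': ['1', '1', '2', '2', '3', '3'],
                    'market_value': [10., 12., 20., 22., 5., 7.]})
# all k dummies, no intercept: coefficients are the category means (11, 21, 6)
X_full = pd.get_dummies(toy['position_cat'])
m_full = LinearRegression(fit_intercept=False).fit(X_full, toy['market_value'])
print(dict(zip(X_full.columns, m_full.coef_)))
# k-1 dummies plus an intercept: the intercept is the baseline mean (11),
# and each coefficient is the difference from the baseline (10, -5)
X_drop = pd.get_dummies(toy['position_cat'], drop_first=True)
m_drop = LinearRegression().fit(X_drop, toy['market_value'])
print(m_drop.intercept_, dict(zip(X_drop.columns, m_drop.coef_)))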