Key Word(s): logistic regression, linear regression, mle
CS-109A Introduction to Data Science
Lab 6: Logistic Regression¶
Harvard University
Fall 2019
Instructors: Pavlos Protopapas, Kevin Rader, Chris Tanner
Lab Instructors: Chris Tanner and Eleni Kaxiras.
Contributors: Chris Tanner
## RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
Learning Goals¶
In this lab, we'll explore different models used to predict which of several labels applies to a new datapoint based on labels observed in the training data.
By the end of this lab, you should:
- Be familiar with the sklearn implementations of:
  - Linear Regression
  - Logistic Regression
- Be able to make an informed choice of model based on the data at hand
- (Bonus) Structure your sklearn code into Pipelines to make building, fitting, and tracking your models easier (a brief sketch follows below)
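For the bonus goal, here is a minimal sketch (not part of the original lab) of what an sklearn Pipeline could look like once the data is prepared later in the lab; the step names are illustrative:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# illustrative: chain preprocessing and a classifier so fitting and predicting happen through one object
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])
# once x_train/y_train are built below, usage would be:
# pipe.fit(x_train, y_train['affordable']); pipe.predict(x_dev)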
# IMPORTS GALORE
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix
import statsmodels.api as sm
from statsmodels.api import OLS
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
Part 1: The AirBnB NYC 2019 Dataset + EDA¶
The dataset contains information about AirBnB listings in NYC from 2019. There are ~49k listings and 16 features for each:
- id: listing ID
- name: name of the listing
- host_id: host ID
- host_name: name of the host
- neighbourhood_group: NYC borough
- neighbourhood: neighborhood
- latitude: latitude coordinates
- longitude: longitude coordinates
- room_type: listing space type (e.g., private room, entire home)
- price: price in dollars per night
- minimum_nights: number of min. nights required for booking
- number_of_reviews: number of reviews
- last_review: date of the last review
- reviews_per_month: number of reviews per month
- calculated_host_listings_count: number of listings the host has
- availability_365: number of days the listing is available for booking
Our goal is to predict the price of unseen housing units as being 'affordable' or 'unaffordable', by using their features. We will assume that this task is for a particular client who has a specific budget and would like to simplify the problem by classifying any unit that costs less than \$150 per night as 'affordable' and any unit that costs \$150 or more as 'unaffordable'.
For this task, we will exercise our normal data science pipeline -- from EDA to modelling and visualization. In particular, we will show the performance of 3 classifiers:
- Maximum Likelihood Estimate (MLE)
- Linear Regression
- Logistic Regression
Let's get started! And awaaaaay we go!
Read-in and checking¶
We do the usual read-in and verification of the data:
df = pd.read_csv("../data/nyc_airbnb.csv") #, index_col=0)
df.head()
Building the training/dev/testing data¶
As usual, we split the data before we begin our analysis. It would be unfair to cheat by looking at the testing data. Let's divide the data into 60% training, 20% development (aka validation), and 20% testing. However, before we split the data, let's make the simple transformation of converting the prices into categories of affordable or not.
df['affordable'] = np.where(df['price'] < 150, 1, 0)
df
NOTE: The affordable column now has a value of 1 whenever the price is < 150, and 0 otherwise.
Also, the feature named neighbourhood_group can easily be confused with neighbourhood, so let's go ahead and rename it to borough, as that is more distinct:
df.rename(columns={"neighbourhood_group": "borough"}, inplace=True)
df
Without looking at the full data yet, let's just ensure our prices are within valid ranges:
df['price'].describe()
Uh-oh. We see that price has a minimum value of \$0. I highly doubt any unit in NYC is free. These data instances are garbage, so let's go ahead and remove any instance that has a price of \$0.
print("original training size:", df.shape)
df = df.loc[df['price'] != 0]
print("new training size:", df.shape)
Now, let's split the data while ensuring that our test set has a fair distribution of affordable units, then further split our training set so as to create the development set:
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42, stratify=df['affordable'])
df_train, df_dev = train_test_split(df_train, test_size=0.25, random_state=99) #stratify=df_train['affordable'])
# ensure our dataset splits are of the % sizes we want
total_size = len(df_train) + len(df_dev) + len(df_test)
print("train:", len(df_train), "=>", len(df_train) / total_size)
print("dev:", len(df_dev), " =>", len(df_dev) / total_size)
print("test:", len(df_test), "=>", len(df_test) / total_size)
Let's remove the target value (i.e., affordable) from our feature dataframes and store it in separate target dataframes.
# training
x_train = df_train.drop(['price', 'affordable'], axis=1)
y_train = pd.DataFrame(data=df_train['affordable'], columns=["affordable"])
# dev
x_dev = df_dev.drop(['price', 'affordable'], axis=1)
y_dev = pd.DataFrame(data=df_dev['affordable'], columns=["affordable"])
# test
x_test = df_test.drop(['price', 'affordable'], axis=1)
y_test = pd.DataFrame(data=df_test['affordable'], columns=["affordable"])
From now on, we will do EDA and cleaning based on the training set, x_train.
for col in x_train.columns:
    print(col, ":", x_train[col].isnull().sum())
Oh dear. It appears ~6k of the rows have missing values concerning the reviews. It seems impossible to impute the last_review feature with reasonable values, as this is very specific to each unit. At best, we could guess the date based on reviews_per_month, but that feature is missing for the same rows. Further, it might be difficult to replace reviews_per_month with reasonable values -- sure, we could fill in the median value, but that seems wrong to generalize so heavily, especially for over 20% of our data. Consequently, let's just ignore these two columns.
x_train = x_train.drop(['last_review', 'reviews_per_month'], axis=1)
x_dev = x_dev.drop(['last_review', 'reviews_per_month'], axis=1)
x_test = x_test.drop(['last_review', 'reviews_per_month'], axis=1)
Let's look at the summary statistics of the data:
x_train.describe()
Next, we see that the minimum_nights feature has a maximum value of 1,250. That's almost 3.5 years, which is probably longer than the duration that most people rent an apartment. This seems anomalous and wrong. Let's discard it and any other units that are outrageous. Well, what constitutes 'outrageous'? We see that the standard deviation for minimum_nights is 21.24. If we assume our values are normally distributed, then keeping only values within 2 standard deviations of the mean would leave us with ~95% of the original data. However, we have no reason to believe our data is actually normally distributed, especially since our mean is 7. To get a better idea of our actual values, let's plot them as a histogram.
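Before plotting, here's a quick sanity check (a small aside, not part of the original lab) of how much data a naive 2-standard-deviation rule would actually keep:
# fraction of listings within 2 standard deviations of the mean minimum_nights
mn = x_train['minimum_nights']
lower, upper = mn.mean() - 2 * mn.std(), mn.mean() + 2 * mn.std()
print("fraction within 2 SDs of the mean: {:.3f}".format(mn.between(lower, upper).mean()))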
fig, ax = plt.subplots(1,1)
ax.hist(x_train['minimum_nights'], 25, log=True)
plt.xlabel('minimum_nights')
plt.ylabel('count')
Yea, that instance was a strong outlier, and the host was being ridiculously greedy -- that's a clever way to turn a listing into a multi-year lease. Notice that we are using a log scale. Clearly, most of our mass comes from units requiring fewer than 365 nights. To get a better sense of that subset, let's re-plot only the units with minimum_nights < 365.
subset = x_train['minimum_nights']<365
fig, ax = plt.subplots(1,1)
ax.hist(x_train['minimum_nights'][subset], 30, log=True)
plt.xlabel('minimum_nights')
plt.ylabel('count')
Ok, that doesn't look too bad, as most units require < 30 nights. It's surprising that some hosts list such unreasonable minimum-night requirements. There is a risk that any host who lists such an unreasonable value might also have other incorrect information. Personally, I think anything beyond 30 days could be suspicious. If we were to exclude any unit that requires more than 30 nights, how many instances would we be ignoring?
len(x_train.loc[x_train['minimum_nights']>30])
Alright, we'd be throwing away 436 of our ~30k entries. That's roughly 1.5% of our data. While we generally want to keep and use as much data as we can, I think this is an acceptable amount to discard, especially considering (1) we have a decently large amount of data remaining, and (2) the entries beyond a 30-day minimum could be unreliable.
good_subset = x_train['minimum_nights'] <= 30
x_train = x_train.loc[good_subset]
y_train = y_train.loc[good_subset]
Notice that we only trimmed our training data, not our development or testing data. I am making this choice because in real scenarios, we would not know the nature of the testing data values. We pre-processed our data to ignore all rows that have a price of \$0 and to drop certain columns (even in the testing set), but that was fair because those rows and columns proved to be obvious, bogus elements of the dataset. However, it would be unfair to inspect the values of the training set and then further trim the development and testing sets accordingly, conditioned on certain data values.
The remaining columns of our training data all have reasonable summary statistics. None of the minimums or maximums are cause for concern, and we have no reason to assert a certain distribution of values. Since all the feature values are within reasonable ranges, and there are no missing values (NaNs) remaining, we can confidently move forward. To recap, our remaining columns are now:
[col for col in x_train.columns] # easier to read vertically than horizontally
We don't have a terribly large number of features, which allows us to inspect every pairwise interaction. A scatter matrix is great for this, as it provides a high-level picture of how every pair of features correlates. If any subplot depicts a linear relationship (i.e., a clear, concentrated path of mass), then we can assume there is some collinearity -- the two features overlap in what they capture and are not independent of each other.
scatter_matrix(x_train, figsize=(30,20));
Part 2: Predicting with MLE¶
Maximum-likelihood estimation (MLE) gives a very simple model which does not require learning any weight/coefficient parameters. Specifically, MLE selects the label value ($y$) that makes the observed data most probable, so as to maximize the likelihood function. This choice of $y$ is completely independent of $x$. That is, an MLE model simply returns the $y$-value that was most probable in the data it has seen.
# the most frequent label in the training set is the MLE prediction for every unit
mle_y = y_train['affordable'].value_counts().idxmax()
# dev accuracy is simply the fraction of dev labels that equal this prediction
dev_accuracy = y_dev['affordable'].value_counts()[mle_y] / len(y_dev['affordable'])
dev_accuracy
Part 3: Predicting with Linear Regression¶
Now, let's actually use our features to make more informed predictions. Since our model needs to use numeric values, not textual ones, let's use ONLY the following features for our linear model:
- borough, using 1-hot encodings. There are 5 distinct boroughs, so represent them via 4 unique columns.
- latitude
- longitude
- room_type, using 1-hot encodings. There are 3 distinct room_types, so represent them via 2 unique columns.
- minimum_nights
- number_of_reviews
- calculated_host_listings_count
- availability_365
x_train = pd.get_dummies(x_train, columns=['borough', 'room_type'], drop_first=True)
x_train = x_train.drop(['id', 'name', 'host_id', 'host_name', 'neighbourhood'], axis=1)
x_dev = pd.get_dummies(x_dev, columns=['borough', 'room_type'], drop_first=True)
x_dev = x_dev.drop(['id', 'name', 'host_id', 'host_name', 'neighbourhood'], axis=1)
x_test = pd.get_dummies(x_test, columns=['borough', 'room_type'], drop_first=True)
x_test = x_test.drop(['id', 'name', 'host_id', 'host_name', 'neighbourhood'], axis=1)
# training set
x_train_padded = sm.add_constant(x_train) # to allow for beta_0
y_train_lr = y_train['affordable'].values.reshape(-1,1)
# development set
x_dev_padded = sm.add_constant(x_dev)
y_dev_lr = y_dev['affordable'].values.reshape(-1,1)
model = OLS(y_train_lr, x_train_padded)
results = model.fit()
results.summary()
# your code here
y_hat_dev = results.predict(exog=x_dev_padded)
# calculating and reporting the requested values, particularly the dev-set R^2
print('Train R^2 = {:.4}'.format(results.rsquared))
print('Dev R^2   = {:.4}'.format(r2_score(y_dev_lr, y_hat_dev)))
# I'm using numpy's round() function, instead of manually checking for values above 0.5
accuracy_score(y_dev, np.round(y_hat_dev))
best_accuracy = -1
best_model = None
for cur_alpha in [0.001, .01, .05, .1, .5, 1, 5, 10, 50, 100, 500]:
    # fit (using Ridge Regression), predict, and score
    fitted_ridge = Ridge(alpha=cur_alpha).fit(x_train, y_train_lr)
    y_hat_dev = fitted_ridge.predict(x_dev).reshape(1,-1)[0]
    cur_accuracy = accuracy_score(y_dev['affordable'].to_numpy(), np.round(y_hat_dev))
    if cur_accuracy > best_accuracy:
        best_accuracy = cur_accuracy
        best_model = fitted_ridge

    # fit (using Lasso Regression), predict, and score
    fitted_lasso = Lasso(alpha=cur_alpha).fit(x_train, y_train_lr)
    y_hat_dev = fitted_lasso.predict(x_dev).reshape(1,-1)[0]
    cur_accuracy = accuracy_score(y_dev['affordable'].to_numpy(), np.round(y_hat_dev))
    if cur_accuracy > best_accuracy:
        best_accuracy = cur_accuracy
        best_model = fitted_lasso

print("best_model:", best_model, "yielded accuracy of:", best_accuracy)
Note that we did not perform cross-validation, so perhaps our model could have performed even better had we done so.
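As a rough illustration (not part of the original solution), tuning alpha with cross-validation could look something like the following, using sklearn's RidgeCV with 5 folds; the alpha grid is illustrative:
from sklearn.linear_model import RidgeCV
# pick alpha via 5-fold cross-validation on the training set, then score on the dev set as before
ridge_cv = RidgeCV(alphas=[0.001, 0.01, 0.1, 1, 10, 100, 500], cv=5).fit(x_train, y_train_lr)
y_hat_dev_cv = ridge_cv.predict(x_dev).reshape(1,-1)[0]
print("chosen alpha:", ridge_cv.alpha_)
print("dev accuracy:", accuracy_score(y_dev['affordable'].to_numpy(), np.round(y_hat_dev_cv)))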
# construct training residuals
y_hat_train = best_model.predict(x_train).reshape(-1)  # flatten in case predict returns a 2-D array
training_residuals = y_train_lr[:,0] - y_hat_train
# construct dev residuals
y_hat_dev = best_model.predict(x_dev).reshape(1,-1)[0]
dev_residuals = y_dev['affordable'].to_numpy() - y_hat_dev
# make plot of training residuals
fig, axes = plt.subplots(1,2,figsize=(15,5))
axes[0].set_title('Histogram of Training Residuals')
axes[0].hist(training_residuals, alpha=0.8, bins=20)
axes[0].axvline(0, c='black', lw=2)
axes[0].set_xlabel(r'residuals')
# make plot of dev residuals
axes[1].set_title('Histogram of Development Residuals')
axes[1].hist(dev_residuals, alpha=0.4, bins=20)
axes[1].axvline(0, c='black', lw=2)
axes[1].set_xlabel(r'residuals')
plt.show()
print("min residual:", min(dev_residuals))
The above plots suggest that the training data is not especially conducive to being modelled by linear regression, for the residuals appear bimodal -- there isn't a single normal distribution of residual values. Also, just for fun, we plotted the errors/residuals from evaluating on the unseen development set. Doing so provides no information about whether the assumptions of the linear model are appropriate for our training data (as the unseen data could be completely dissimilar from anything we saw during training). Still, we hope that the development set residuals would be small, and it's nice that the errors roughly follow a normal distribution -- although there are some outliers on which we perform badly, but this can happen.
Part 4: Binary Logistic Regression¶
Linear regression is usually a good baseline model, but since the outcome we're trying to predict only takes values 0 and 1 we'll want to use logistic regression instead of basic linear regression.
We will use sklearn for now, but statsmodels also provides logistic regression (via its Logit model), along with nifty features like confidence intervals.
First, let's import the necessary classes:
from sklearn.linear_model import LogisticRegression
Next, let's instantiate a new LogisticRegression model:
lr = LogisticRegression()
Now, we can fit our model with just 1 line!
lr.fit(x_train, y_train['affordable'])
See the .predict() documentation here. NOTE: regularization is applied by default. Pay particular attention to the following arguments/parameters:
- C: the inverse regularization strength, which we discussed in class. Experiment with varying values from very small up to 100 million!
- max_iter: experiment with values from 5 to 5000. Do you expect more iterations to always perform better? Why or why not?
- penalty: for designating L1 (Lasso-style) or L2 (Ridge-style) regularization; the default is L2
- solver: especially relevant for the multi-class setting
(A brief illustrative sketch of these parameters follows below.)
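As a small sketch (not the lab's required solution), here is a model with these parameters set explicitly; note that the liblinear solver supports the L1 penalty, while the default lbfgs solver does not:
# illustrative: an L1-penalized logistic regression with explicit C and max_iter
lr_l1 = LogisticRegression(C=1.0, penalty='l1', solver='liblinear', max_iter=1000)
lr_l1.fit(x_train, y_train['affordable'])
print("number of nonzero coefficients:", np.sum(lr_l1.coef_ != 0))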
After fitting the model, you can print the .coef_ attribute to see its coefficients.
y_hat_dev = lr.predict(x_dev)
initial_score = accuracy_score(y_dev['affordable'].to_numpy(), y_hat_dev)
print("our initial logistic regression model yielded accuracy score of:", initial_score)
best_accuracy = -1
best_model = None
# experiment with different values
c_vals = [1, 10, 100, 1000, 10000, 100000, 1000000, 10000000]
num_iters = [5, 10, 100, 1000, 5000]
for c_val in c_vals:
    for num_iter in num_iters:
        lr = LogisticRegression(C=c_val, solver='liblinear', max_iter=num_iter)
        lr.fit(x_train, y_train['affordable'])
        y_hat_dev = lr.predict(x_dev)
        cur_accuracy = accuracy_score(y_dev['affordable'].to_numpy(), y_hat_dev)
        if cur_accuracy > best_accuracy:
            best_accuracy = cur_accuracy
            best_model = lr

print("best logistic regression model:", best_model, "yielded an accuracy score:", best_accuracy)
print("number of learned coefficients:", len(best_model.coef_[0]))
print("the coefficients align with our features:", x_dev.shape)
The results here should show that, for this dataset, logistic regression offers effectively identical performance to linear regression. There are two main takeaways from this:
- Logistic regression should not be viewed as being superior to linear regression; it should be viewed as a solution to a different type of problem -- classification (predicting categorical outputs), not regression (predicting continuous-valued outputs).
- In our situation, our two categories/classes (affordable or not) had an ordinal nature. That is, the continuum of prices directly aligned with the structure of our two classes. Alternatively, you could imagine other scenarios where our two categories are nominal and thus un-rankable (e.g., predicting cancer or not, or predicting which NYC borough an AirBnB is in based on its property features).
Part 5 (The Real Challenge): Multiclass Classification¶
Before we move on, let's consider a more common use case of logistic regression: predicting not just a binary variable, but what level a categorical variable will take. Instead of breaking the price variable into two classes (affordable being true or false), we may care for more fine-level granularity.
For this exercise, go back to the original df dataframe and construct 5 classes of pricing:
- budget: < 80
- affordable: 80 < x < 120
- average: 120 < x < 180
- expensive: 180 < x < 240
- very expensive: 240 < x
The cut function conveniently stores a lot of extra information for us. It's a very useful tool for discretizing an existing variable.
# creates multi-class labels for training
x_train_multiclass = x_train.copy()
x_train_multiclass['price_level'] = pd.cut(df_train['price'],[0,80,120,180,240,float('inf')], labels=[0,1,2,3,4])
y_train_multiclass = pd.DataFrame(data=x_train_multiclass['price_level'], columns=["price_level"])
x_train_multiclass = x_train_multiclass.drop(['price_level'], axis=1)
# creates multi-class labels for dev
x_dev_multiclass = x_dev.copy()
x_dev_multiclass['price_level'] = pd.cut(df_dev['price'],[0,80,120,180,240,float('inf')], labels=[0,1,2,3,4])
y_dev_multiclass = pd.DataFrame(data=x_dev_multiclass['price_level'], columns=["price_level"])
x_dev_multiclass = x_dev_multiclass.drop(['price_level'], axis=1)
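As a small aside (not in the original lab), if we omit the integer labels, we can see the 'extra information' that cut retains -- the bin intervals themselves, stored as ordered categories:
# illustrative: pd.cut without labels keeps the interval boundaries as ordered categories
price_bins = pd.cut(df_train['price'], [0, 80, 120, 180, 240, float('inf')])
print(price_bins.cat.categories)
print(price_bins.value_counts().sort_index())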
best_accuracy = -1
best_model = None
# experiment with different values
c_vals = [1, 10, 100, 1000, 10000]
num_iters = [10, 100, 1000, 5000]
for c_val in c_vals:
    for num_iter in num_iters:
        lr = LogisticRegression(C=c_val, solver="lbfgs", max_iter=num_iter)
        lr.fit(x_train_multiclass, y_train_multiclass['price_level'])
        y_hat_dev = lr.predict(x_dev_multiclass)
        cur_accuracy = accuracy_score(y_dev_multiclass['price_level'].to_numpy(), y_hat_dev)
        print(cur_accuracy)
        if cur_accuracy > best_accuracy:
            best_accuracy = cur_accuracy
            best_model = lr

print("best logistic regression model:", best_model, "yielded an accuracy score:", best_accuracy)
print("number of learned coefficients per class:", len(best_model.coef_[0]))
print("the coefficients align with our features:", x_dev.shape)
for i in range(len(x_dev.columns)):
    print("feature:", x_dev.columns[i], "; coef:", best_model.coef_[0][i])
Despite having 5 distinct price categories now, our performance isn't too bad! To increase performance further, we could first use cross-validation. Then, we could look at our original data and try to make better use of its features. For example, perhaps it would be useful to expand our 'neighbourhood' feature into one-hot encodings? The fine-grained information in 'neighbourhood' likely correlates well with price. The main question to ask ourselves is how much data we have for each neighbourhood (we'd aim to have plenty of representative data). Relatedly, the longitude and latitude features provide fine-grained information, but it may be hard for the model to use them since their range is so small. If we were to scale the latitude and longitude values to be between 0 and 1, the model might better distinguish between the nuanced values.
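A minimal sketch of that scaling idea (illustrative, not part of the original solution), using sklearn's MinMaxScaler fit on the training set only:
from sklearn.preprocessing import MinMaxScaler
# rescale latitude/longitude to [0, 1] using training-set statistics only
coord_cols = ['latitude', 'longitude']
scaler = MinMaxScaler().fit(x_train_multiclass[coord_cols])
x_train_scaled = x_train_multiclass.copy()
x_dev_scaled = x_dev_multiclass.copy()
x_train_scaled[coord_cols] = scaler.transform(x_train_multiclass[coord_cols])
x_dev_scaled[coord_cols] = scaler.transform(x_dev_multiclass[coord_cols])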
For this exercise, we cared about each price level (and predictions thereof) uniformly. However, in some scenarios, our classification accuracy for some categories is much more important than for others (e.g., predicting cancer or not). That is, our false negatives (misses) are far more serious and potentially deadly. For situations like this, it is better to err on the side of caution and allow more false positives (aka false alarms) than false negatives (misses). To handle this, one could weight each class and specify those weights when fitting the model. As we learned in class, we can also plot the performance as we vary the prediction threshold, while paying attention to how that affects the number of false negatives and false positives.
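Here is a hedged sketch of both ideas on the binary task (the class weights below are purely illustrative; roc_curve and auc were already imported at the top of the lab):
# illustrative: weight errors on class 0 five times as heavily as errors on class 1
weighted_lr = LogisticRegression(class_weight={0: 5, 1: 1}, solver='liblinear', max_iter=1000)
weighted_lr.fit(x_train, y_train['affordable'])
# sweep the prediction threshold via predicted probabilities and plot the ROC curve
dev_probs = weighted_lr.predict_proba(x_dev)[:, 1]
fpr, tpr, thresholds = roc_curve(y_dev['affordable'], dev_probs)
print("AUC:", auc(fpr, tpr))
plt.plot(fpr, tpr)
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
plt.title('ROC curve for class-weighted logistic regression')
plt.show()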