Key Word(s): Case Study
## RUN THIS CELL TO GET THE RIGHT FORMATTING
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2020-CS109A/master/themes/static/css/cs109.css").text
HTML(styles)
# import the necessary libraries
import re
import requests
import random
import pandas as pd
import math
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
from sklearn.neighbors import KNeighborsRegressor
from matplotlib import gridspec
from sklearn.model_selection import KFold
import numpy as np
import seaborn as sns
from time import sleep
from bs4 import BeautifulSoup
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
# global properties
data_dir = "data/" # where to save data
num_search_pages = 50 # how many search pages to cull through
# NOTE:
# if you haven't yet downloaded the data, this should be set to True
download_data = False
Disclaimer¶
Alcohol is a drug. There are state and federal laws that govern its sale, distribution, and consumption. In the United States, those who consume alcohol must be at least 21 years of age. In no way am I, or anyone else at IACS or Harvard at large, promoting or encouraging the use of alcohol. My intention is not to celebrate it. Anyone who chooses to consume alcohol should be of legal age and should do so responsibly. Abusing alcohol has serious, grave effects.
The point of this exercise is purely pedagogical, and it illustrates the wide range of tasks to which one can apply data science and machine learning. That is, I am focusing on a particular interest and demonstrating how it can be used to answer questions that one may be interested in for one's own personal life. You could easily imagine this being used in professional settings, too.
Learning Objectives¶
To help you see the big-picture process of conducting a project, and to illustrate some of the nuanced details and common pitfalls.
Predicting good bourbons¶
The point of this notebook is to address the first question that we posed in the beginning:
- Are there certain attributes of bourbons that are predictive of good (i.e., highly rated by users) bourbons?
- Find hidden gems (i.e., should be good but current reviews are absent or unsupportive of such)
- Find over-hyped whiskeys (i.e., the reviews seem high but the attributes aren't indicative)
- Are there significant results if we target experts' ratings instead of average customer ratings?
To this effect, let's first load and clean the data, identically to what we did in the previous notebook. NOTE: The only reason this notebook exists as a separate notebook, instead of one long notebook, is for readability and organization.
Load and clean the data¶
whiskey_urls = set()
if download_data:
# every search-results page shares this prefix; we only vary the page number
base_url = 'https://distiller.com/search?term=bourbon'
# visits each search result page
for page_num in range(1, num_search_pages + 1):
cur_page = requests.get(base_url + '&page=' + str(page_num))
# uses BeautifulSoup to extract all links to whiskeys
bs_page = BeautifulSoup(cur_page.content, "html.parser")
for link in bs_page.findAll('a', attrs={'href': re.compile("^/spirits/")}):
whiskey_urls.add(link.get('href'))
sleep(1)
# saves each URL to disk, so that we don't have to crawl the search results again
f = open("whiskey_urls.txt", "w")
for url in whiskey_urls:
f.write(url + "\n")
f.close()
# fetches each page and saves it to the hard drive
for url in whiskey_urls:
cur_page = requests.get('https://distiller.com' + url).content
# writes file
f = open(data_dir + url[9:], 'wb')
f.write(cur_page)
f.close()
# sleeps between 1-3 seconds, in case the site tries to detect crawling
sleep(random.randint(1,3))
else:
# if the files have already been saved to disk
# then you can just load them here, instead of crawling again
with open('whiskey_urls.txt') as f:
whiskey_urls = set(line.strip() for line in f)
whiskeys = {}
# loads whiskey webpage
for i, url in enumerate(whiskey_urls):
filename = data_dir + url[9:]
file_contents = open(filename, 'r').read()
# instantiates a new BeautifulSoup object
soup = BeautifulSoup(file_contents, "html.parser")
# extracts details about the whiskey
name = soup.find('h1', attrs={'class': re.compile("secondary-headline name")}).text.strip()
location = soup.find('h2', attrs={'class': "ultra-mini-headline location middleweight"}).text.strip()
# extracts the badge (e.g., 'RARE'), if one exists
badge = ""
if soup.find('div', attrs={'class': "spirit-badge"}) is not None:
badge = soup.find('div', attrs={'class': "spirit-badge"}).text.strip()
num_ratings = 0
rating = "N/A"
if soup.find('span', attrs={'itemprop': "ratingCount"}) != None:
num_ratings = int(soup.find('span', attrs={'itemprop': "ratingCount"}).text.strip())
rating = float(soup.find('span', attrs={'itemprop': "ratingValue"}).text.strip())
age = soup.find('li', attrs={'class': "detail age"}).find('div', attrs='value').text.strip()
price = int(re.findall(r"cost-(\d)", str(soup.find('div', attrs={'class': re.compile("spirit-cost")})))[0])
abv = ""
if soup.find('li', attrs={'class': "detail abv"}).find('div', attrs='value').text != "":
abv = float(soup.find('li', attrs={'class': "detail abv"}).find('div', attrs='value').text)
whiskey_type = soup.find('li', attrs={'class': "detail whiskey-style"}).div.text
cask_type = ""
if soup.find('li', attrs={'class': "detail cask-type"}) != None:
cask_type = soup.find('li', attrs={'class': "detail cask-type"}).find('div', attrs='value').text.strip()
review = ""
expert = ""
score = ""
flavor_summary = ""
flavor_profile = []
# check if an expert reviewed it
if soup.find('p', attrs={'itemprop': "reviewBody"}) != None:
review = soup.find('p', attrs={'itemprop': "reviewBody"}).text.replace("\"","").strip()
expert = soup.find('div', attrs={'class': 'meet-experts'}).a.text.strip()
score = int(soup.find('div', attrs={'class': "distiller-score"}).span.text.strip())
flavor_summary = soup.find('h3', attrs={'class': "secondary-headline flavors middleweight"}).text.strip()
# extracts flavor profile
flavor_profile = eval(soup.find('canvas').attrs['data-flavors'])
cur_whiskey = [name, whiskey_type, cask_type, location, age, abv, price, badge, num_ratings, \
rating, flavor_summary, expert, score]
if flavor_profile:
cur_whiskey.extend(list(flavor_profile.values()))
else:
cur_whiskey.extend(np.zeros(14))
cur_whiskey.append(review)
whiskeys[i] = cur_whiskey
df = pd.DataFrame.from_dict(whiskeys, orient='index', \
columns=['Name', 'Type', 'Cask', 'Location', 'Age', 'ABV %', 'Price', 'Badge',\
'# Ratings', "Customers' Rating", 'Flavor Summary', 'Expert', 'Expert Score',\
'Smoky', 'Peaty', 'Spicy', 'Herbal', 'Oily', 'Full-bodied', 'Rich',\
'Sweet', 'Briny', 'Salty', 'Vanilla', 'Tart', 'Fruity', 'Floral', 'Review'])
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
# filter by only those that are bourbons and reviewed by an expert
df2 = df.loc[(df['Expert'] != "")]
df2 = df2.loc[(df['Type'] == "Bourbon")]
# remove the single bourbon that was not rated by customers
df2 = df2.loc[df2['Customers\' Rating'] != "N/A"]
df2 = df2.astype({'Customers\' Rating' : 'float64'})
# let's fix the 'Age' feature
df2['Age'] = df2['Age'].replace(['NAS', 'nas', 'N/A',''],'0')
df2['Age'] = df2['Age'].replace(to_replace =' [yY]ear[sS]*', value = '', regex = True)
df2['Age'] = df2['Age'].replace(to_replace ='6.*', value = '6', regex = True)
df2['Age'] = df2['Age'].replace(to_replace ='(\d+) [Yy].*', value = '\\1', regex = True)
# convert month-denominated ages into (approximate) years
df2['Age'] = df2['Age'].replace(to_replace ='4 [Mm]onths', value = '0.33', regex = True)
df2['Age'] = df2['Age'].replace(to_replace ='9 [Mm]onths', value = '0.75', regex = True)
df2['Age'] = df2['Age'].replace(to_replace ='18 - 20 [Mm]onths', value = '1.5', regex = True)
df2['Age'] = df2['Age'].replace(to_replace ='32 [Mm]onths', value = '2.67', regex = True)
df2['Age'] = df2['Age'].replace(to_replace ='9 to 11', value = '0.75', regex = True)
df2 = df2.astype({'Age': 'float64'})
df2['Age'] = df2['Age'].replace(0,7)
# fix the 'Badge' feature by making a new 'Rare' feature
df2['Rare'] = [x == 'RARE' for x in df2['Badge']]
del df2['Badge']
# convert to appropriate data types
df2 = df2.astype({'Expert Score': 'int32', 'Customers\' Rating' : 'float64', 'ABV %': 'float64'})
df2.describe()
Solving the Problem: Predicting good bourbons¶
Goal:¶
Predict Customers' Rating
using flavors and other features
Data:¶
All of our 585 bourbons, but split into train/dev/test
Features:¶
From our EDA, it appears safe to use most of our features. There's a chance that some of the features are collinear, and if the results demonstrate such, then we can always remove them. For now, we will focus on using the following features:
- all 14 flavors
- Age
- ABV %
- Price
- Rare
- Expert score
NOTE: we will also keep track of the Names, just for manually inspecting our errors, but this will not be used as a feature to any model we use.
Accuracy Metric:¶
Mean Squared Error
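As a quick reminder of what this metric computes, here is a toy check with made-up numbers (not our data); MSE is simply the average of the squared prediction errors:
# toy illustration of the metric; both lines print (approximately) the same value, 0.03
y_true_demo = np.array([3.5, 4.0, 4.5])
y_pred_demo = np.array([3.7, 3.8, 4.6])
print(np.mean((y_true_demo - y_pred_demo) ** 2))
print(mean_squared_error(y_true_demo, y_pred_demo))
Lower is better, and the units are squared rating points.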
1. Prepare the data for all experiments¶
We have 585 bourbons. I think it's reasonable to:
- train on 400
- develop on 100
- test on 85
features =['Name', 'Age', 'ABV %', 'Price', 'Rare', 'Expert Score',\
'Smoky', 'Peaty', 'Spicy', 'Herbal', 'Oily', 'Full-bodied', 'Rich',\
'Sweet', 'Briny', 'Salty', 'Vanilla', 'Tart', 'Fruity', 'Floral']
# splitting the data into "train" and test sets
x_, xtest, y_, ytest = train_test_split(df2[features], df2["Customers\' Rating"], test_size=85, random_state=538)
# this further divides our 'train' into proper 'train' and 'dev' splits
xtrain, xdev, ytrain, ydev = train_test_split(x_, y_, test_size=100, random_state=22)
print(len(xtrain), len(xdev), len(xtest))
2. Linear Regression¶
Model choices:¶
- Scale data: no
- Polynomial features: no
Before we do anything fancy, let's start simple by throwing all of our features at it and dissecting the results from one linear model.
# fit a linear regression
lreg = LinearRegression()
lreg.fit(xtrain.drop(['Name'], axis=1), ytrain)
Let's evaluate on the train and dev sets.
# evaluate on the training set
y_train_pred = lreg.predict(xtrain.drop(['Name'], axis=1))
train_MSE = mean_squared_error(ytrain, y_train_pred)
print("train_MSE:", round(train_MSE, 5))
# evaluate on the dev set
y_dev_pred = lreg.predict(xdev.drop(['Name'], axis=1))
dev_MSE = mean_squared_error(ydev, y_dev_pred)
print("dev_MSE:", round(dev_MSE, 5))
Is this value of 0.083 for dev_MSE good? I have no idea, but it seems really low (we will soon discuss the fact that it's lower than the train_MSE)!
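As an optional sanity check (not part of the original analysis), we can compare it against a trivial baseline that always predicts the mean training rating; if our dev_MSE is well below this, the features are carrying real signal:
# OPTIONAL baseline sketch: always predict the mean of the training ratings
baseline_pred = np.full(len(ydev), ytrain.mean())
print("baseline dev_MSE:", round(mean_squared_error(ydev, baseline_pred), 5))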
Since we have 19 features, we cannot easily plot the regression line to inspect how well it fits the target labels. Instead, let's look at the residuals to see:
- if they are normally distributed
- if there are any trends with respect to the predicted y-values
# compute the residuals on the dev set
residuals = ydev - y_dev_pred
# plot the histograms of residuals
fig, ax = plt.subplots(1,2, figsize = (12, 4))
gs = gridspec.GridSpec(1, 2, width_ratios=[3, 5])
ax[0] = plt.subplot(gs[0])
ax[0].set_xlabel('Residuals')
ax[0].set_ylabel('Frequency')
ax[0].hist(residuals, alpha = 0.5, bins=8)
# plot the predictions vs residuals
ax[1] = plt.subplot(gs[1])
ax[1].plot(y_dev_pred, residuals,'o',color='#2ea2db',alpha=0.7,label = 'Residual')
ax[1].plot([y_dev_pred.min(),y_dev_pred.max()],[0,0],'k--',label = 'Line $y=0$')
ax[1].set_xlabel('Predicted rating values',fontsize=10)
ax[1].set_ylabel('Residuals',fontsize=10)
ax[1].legend(loc='upper right',fontsize=10)
fig.suptitle('Analysis of Residuals (Dev set)',fontsize=14);
Oh, that's good. That's really good! If the residuals didn't look roughly normally distributed (left graph), it would suggest that our linear model's assumptions are being violated (for example, that we're missing non-linear structure or including problematic features), and we'd need to investigate carefully. Notwithstanding, it still might be the case that some features aren't too helpful. On the right, we see very little systematic pattern in how far off our predictions are: the few extreme predictions (very low or very high scores) tend to be a bit less accurate, while the residuals for mid-range predictions are spread fairly uniformly around zero.
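If we wanted something more formal than eyeballing the left histogram (purely optional, and not something this notebook originally does), a Shapiro-Wilk test on the dev residuals is one quick check; note that it adds a scipy import:
# OPTIONAL: formal normality check of the dev-set residuals (requires scipy)
from scipy import stats
shapiro_stat, shapiro_p = stats.shapiro(residuals)
print("Shapiro-Wilk p-value:", round(shapiro_p, 4))  # a large p-value gives no evidence against normality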
Let's look at the coefficients, to gain insights:
coefs = pd.DataFrame(lreg.coef_, columns=['Coefficients'], index=features[1:])
coefs
It can be difficult to interpret the meaning of these coefficients because:
- some features have different scales (e.g., each of the 14 flavors runs from 1-100, while Price runs from 1-5)
- one feature is categorical (Rare).
It makes sense that the flavors all have low coefficients, and that some are slightly negative. For example, most bourbons have low flavor values for Herbal. So, if a bourbon has an Herbal value of 50, this feature alone will contribute a 0.0543 decrease in the overall prediction of the customer rating. Keep in mind, this rating should be between 0 and 5. Speaking of which, linear regression is entirely capable of making predictions beyond 5, which would be an unfortunate artifact. However, if we print y_dev_pred.max() and y_dev_pred.min(), we see that all predictions were between 2.72 and 4.88, which is perfectly reasonable!
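If out-of-range predictions ever did become a problem, one cheap optional fix would be to clip them onto the valid rating range, e.g.:
# OPTIONAL: clamp any out-of-range predictions onto the 0-5 rating scale
y_dev_pred_clipped = np.clip(y_dev_pred, 0, 5)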
Regarding the other coefficients, they all seem justifiable to me and nothing has caused concern for me yet. For example, the Expert Score is out of 100 and has one of the largest coefficients. As an example, if an Expert Score is 50, it alone will contribute 1.94. But, if it has a score of 100, it'll contribute twice as much (3.8897). Clearly, such a high-rated bourbon ought to receive higher than a 3.7098 from customers, and this is what we rely on the other features to contribute. So, a higher ABV %, Price, and Age all contribute toward a higher score.
The Rare feature is possibly just noise. It has a small, slightly negative coefficient (-0.06), so flagging a bourbon as rare barely moves the prediction. Possible explanations include noise in the training/dev data and the tag itself being a bit unreliable. Within the EDA, I detailed that several very hard-to-find whiskeys didn't have the rarity tag. Moreover, if a whiskey truly is rare, it can be rare because it's highly desired and in limited supply, or because it's from a very small distillery that produces very basic, low-quality whiskey.
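As an optional aside (not something the pipeline above does), one way to make the coefficients more directly comparable is to refit on standardized features, so that each coefficient reflects the effect of a one-standard-deviation change; a minimal sketch using the preprocessing module imported at the top:
# OPTIONAL sketch: refit on standardized features for more comparable coefficients
scaler = preprocessing.StandardScaler()
xtrain_scaled = scaler.fit_transform(xtrain.drop(['Name'], axis=1))
lreg_scaled = LinearRegression().fit(xtrain_scaled, ytrain)
pd.DataFrame(lreg_scaled.coef_, index=features[1:], columns=['Std. Coefficient'])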
QUESTION: how do we know if we're suffering from not having enough data?¶
Using just 1 run, as we did above, is short-sighted, and we can't place too much stock in the performance or the features' coefficients. Let's perform bootstrapping with 100 samples (each drawn with replacement and the same size as the training data), while keeping track of (a) the coefficients, so that we can see their distributions, and (b) the MSE performance on the dev set.
num_bootstraps = 100
bootstrap_coefs = []
train_MSEs = []
dev_MSEs = []
# bootstrapping will shuffle the rows, so we need to ensure
# that the y's maintain the same order as the x's
merged = pd.concat([xtrain, ytrain], axis=1, sort=False)
for i in range(num_bootstraps):
boot_df = merged.sample(frac=1, replace=True)
xtrain_boot = boot_df.drop(["Name", "Customers\' Rating"], axis=1)
ytrain_boot = boot_df["Customers\' Rating"]
boot_lr = LinearRegression()
boot_lr.fit(xtrain_boot, ytrain_boot)
# evaluate on the train set
y_train_pred_boot = boot_lr.predict(xtrain_boot)
train_MSE_boot = mean_squared_error(ytrain_boot, y_train_pred_boot)
round(train_MSE_boot, 5)
train_MSEs.append(train_MSE_boot)
# evaluate on the dev set
y_dev_pred_boot = boot_lr.predict(xdev.drop(['Name'], axis=1))
dev_MSE_boot = mean_squared_error(ydev, y_dev_pred_boot)
round(dev_MSE_boot, 5)
dev_MSEs.append(dev_MSE_boot)
bootstrap_coefs.append(boot_lr.coef_)
bootstrap_coefs = np.array(bootstrap_coefs)
## PLOTS THE HISTOGRAMS OF BETA COEFFICIENTS
fig, ax = plt.subplots(nrows=7, ncols=3, figsize=(20, 20), facecolor='w', edgecolor='k')
fig.subplots_adjust(hspace = .5, wspace=.2)
ax = ax.ravel()
fontsize = 10
for i in range(1, len(features)):
betavals = bootstrap_coefs[:,i-1]
betavals.sort()
x1 = np.percentile(betavals,2.5)
x2 = np.percentile(betavals,97.5)
x = np.linspace(x1,x2,500)
counts, bins = np.histogram(betavals)
y = counts.max()
ax[i-1].hist(bootstrap_coefs[:,i-1],bins=10, color="#FF7E79",alpha=0.3,edgecolor='black', linewidth=1)
ax[i-1].fill_between(x,y, color = '#007D66',alpha=0.2)
# prettify
ax[i-1].set_ylabel(f'Count',fontsize=10)
ax[i-1].set_xlabel(r'$\beta$ for ' + features[i],fontsize=10)
ax[19].set_axis_off()
ax[20].set_axis_off()
plt.xticks(fontsize=20)
fig.suptitle(f'95 % confidence interval for feature coefficients', fontsize=16)
fig.subplots_adjust(top=0.95)
sns.despine()
Hey, this looks pretty good! Each beta coefficient looks pretty normally distributed, and the range of values isn't too large. Part of this narrow range can be explained by the small data size.
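If we also want the interval endpoints as numbers rather than only the shaded regions (an optional aside), we can pull the 2.5th and 97.5th percentiles straight out of bootstrap_coefs:
# OPTIONAL: tabulate the bootstrap 95% intervals for each coefficient
ci_df = pd.DataFrame({'2.5%': np.percentile(bootstrap_coefs, 2.5, axis=0),
'97.5%': np.percentile(bootstrap_coefs, 97.5, axis=0)}, index=features[1:])
ci_df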
To see the variation in the actual predictions, let's plot the train and dev MSE from each bootstrap run.
plt.figure(figsize=(6,4))
plt.hist(train_MSEs, bins=10, alpha=0.5, label="Train split")
plt.hist(dev_MSEs, bins=10, alpha=0.5, label="Dev split")
plt.xlabel("MSE Values", size=10)
plt.ylabel("Count", size=10)
plt.title("Performance differences")
plt.legend(loc='upper right')
Ah, shucks, that's completely unexpected. The training MSE should be better (lower) than the dev MSE.
QUESTION: what could be causes?¶
ANSWER:¶
In general, this could mean at least one of several things likely occurred:
- our training and dev splits aren't too similar to one another
- our training and/or dev splits are too small
- there isn't enough of a signal to properly learn (e.g., bad features, or too many features)
- we didn't train our model long enough (not applicable for simple linear regression)
Our data is really small (400 training samples). So, let's try cross-validation in addition to bootstrapping!
Using just 1 split, as we did above, is short-sighted, and we can't place too much stock in the coefficients and results. So, let's use 10-fold cross-validation for each of 100 bootstrapped samples!
num_bootstraps = 100
bootstrap_coefs = []
# i want things to be random, but consistent for everyone who runs this.
# we need to generate 100 bootstraps, each of which should
# be a random sample of the data. if a single bootstrap
# sample has a `random_state` defined, then the sample will be
# the same every time. well, i want 100 of these unique ones,
# so i generate a list of 100 random seeds. this list of seeds
# needs to be fixed, which is what i do in the following 2 lines:
random.seed(a=21)
bootstrap_seeds = random.sample(range(100000), num_bootstraps)
# now we include all of the original 'train'
# which encompasses the original 'dev' but excludes the 'test'
merged = pd.concat([x_, y_], axis=1, sort=False)
train_MSEs = []
dev_MSEs = []
for i in range(num_bootstraps):
boot_df = merged.sample(frac=1, replace=True, random_state=bootstrap_seeds[i])
kf = KFold(n_splits=10, shuffle=True, random_state=270)
#kf.get_n_splits(boot_df.drop(["Customers\' Rating"], axis=1))
for train_index, dev_index in kf.split(boot_df):
xtrain_cv = boot_df.iloc[train_index].drop(["Name", "Customers\' Rating"], axis=1)
ytrain_cv = boot_df.iloc[train_index]["Customers\' Rating"]
xdev_cv = boot_df.iloc[dev_index].drop(["Name", "Customers\' Rating"], axis=1)
ydev_cv = boot_df.iloc[dev_index]["Customers\' Rating"]
cv_lr = LinearRegression()
cv_lr.fit(xtrain_cv, ytrain_cv)
# evaluate on the train set
y_train_pred_cv = cv_lr.predict(xtrain_cv)
train_MSE_cv = mean_squared_error(ytrain_cv, y_train_pred_cv)
round(train_MSE_cv, 5)
train_MSEs.append(train_MSE_cv)
# evaluate on the dev set
y_dev_pred_cv = cv_lr.predict(xdev_cv)
dev_MSE_cv = mean_squared_error(ydev_cv, y_dev_pred_cv)
round(dev_MSE_cv, 5)
dev_MSEs.append(dev_MSE_cv)
bootstrap_coefs.append(cv_lr.coef_)
print("train_MSEs:", np.mean(train_MSEs))
print("dev_MSEs:", np.mean(dev_MSEs))
bootstrap_coefs = np.array(bootstrap_coefs)
We now see the property that we'd expect: the dev MSE is worse (higher) than the training MSE, although only by a little. We could probably benefit from having:
- more data observations (i.e., whiskeys)
- possibly different features
- possibly better modelling
We can't make more data observations (whiskeys), and we've done a lot to sample our data so as to make our results more robust and meaningful. So, let's focus on trying different features, and then we can move on to other modelling choices!
## PLOTS THE HISTOGRAMS OF BETA COEFFICIENTS
fig, ax = plt.subplots(nrows=7, ncols=3, figsize=(20, 20), facecolor='w', edgecolor='k')
fig.subplots_adjust(hspace = .5, wspace=.2)
ax = ax.ravel()
fontsize = 10
for i in range(1, len(features)):
betavals = bootstrap_coefs[:,i-1]
betavals.sort()
x1 = np.percentile(betavals,2.5)
x2 = np.percentile(betavals,97.5)
x = np.linspace(x1,x2,500)
counts, bins = np.histogram(betavals)
y = counts.max()
ax[i-1].hist(bootstrap_coefs[:,i-1],bins=10, color="#FF7E79",alpha=0.3,edgecolor='black', linewidth=1)
ax[i-1].fill_between(x,y, color = '#007D66',alpha=0.2)
# prettify
ax[i-1].set_ylabel(f'Count',fontsize=10)
ax[i-1].set_xlabel(r'$\beta$ for ' + features[i],fontsize=10)
ax[19].set_axis_off()
ax[20].set_axis_off()
plt.xticks(fontsize=20)
fig.suptitle(f'95 % confidence interval for feature coefficients', fontsize=16)
fig.subplots_adjust(top=0.95)
sns.despine()
plt.figure(figsize=(6,4))
plt.hist(train_MSEs, bins=10, alpha=0.5, label="Train split")
plt.hist(dev_MSEs, bins=10, alpha=0.5, label="Dev split")
plt.xlabel("MSE Values", size=10)
plt.ylabel("Count", size=10)
plt.title("Performance differences")
plt.legend(loc='upper right')
We see that the coefficients look even more normally distributed, and the training MSE values now look more reasonable relative to the dev set. The dev set performance has quite a wide span, which makes sense because we tried many splits (100 bootstraps x 10 folds = 1,000 evaluations) and the data is small.
Above, we threw the entire kitchen sink at the problem (aka, all features). Let's now do forward stepwise feature selection to iteratively determine, in a greedy fashion, which combination of features yields the best results, on average. Keep in mind that the data is small, so there will be some noise. Specifically, let's start with the single feature that performs best by itself, then keep adding 1 more feature at a time, while keeping track of the performances.
num_bootstraps = 50
bootstrap_coefs = []
random.seed(a=21)
bootstrap_seeds = random.sample(range(100000), num_bootstraps)
# now we include all of the original 'train'
# which encompasses the original 'dev' but excludes the 'test'
merged = pd.concat([x_, y_], axis=1, sort=False)
selected_features = []
candidate_features = features[1:]
mse_across_features = []
while len(candidate_features) > 0:
best_mse = 9999
best_feature = ""
for feature in candidate_features:
dev_MSEs = []
for i in range(num_bootstraps):
boot_df = merged.sample(frac=1, replace=True, random_state=bootstrap_seeds[i])
kf = KFold(n_splits=10, shuffle=True, random_state=270)
#kf.get_n_splits(boot_df.drop(["Customers\' Rating"], axis=1))
for train_index, dev_index in kf.split(boot_df):
xtrain_cv = boot_df.iloc[train_index][selected_features + [feature]]
ytrain_cv = boot_df.iloc[train_index]["Customers\' Rating"]
xdev_cv = boot_df.iloc[dev_index][selected_features + [feature]]
ydev_cv = boot_df.iloc[dev_index]["Customers\' Rating"]
cv_lr = LinearRegression()
cv_lr.fit(xtrain_cv, ytrain_cv)
# evaluate on the dev set
y_dev_pred_cv = cv_lr.predict(xdev_cv)
dev_MSE_cv = mean_squared_error(ydev_cv, y_dev_pred_cv)
round(dev_MSE_cv, 5)
dev_MSEs.append(dev_MSE_cv)
cur_mse = np.mean(dev_MSEs)
if cur_mse < best_mse:
best_mse = cur_mse
best_feature = feature
mse_across_features.append(best_mse)
print("best feature:", best_feature)
selected_features.append(best_feature)
candidate_features.remove(best_feature)
Let's plot the performance as we increased the # of features.
ind = [_ + 1 for _ in range(len(features)-1)]
fig, ax = plt.subplots(figsize=(6,5))
ax.bar(ind, mse_across_features, color = '#007D66',alpha=0.3,edgecolor='k')
ax.set_xticks(sorted(ind));
ax.set_xlabel(r'# features used', fontsize=10)
ax.set_ylabel('Average MSE', fontsize=10)
ax.set_title(f'Feature Selection Performance (Dev set)',fontsize=14)
sns.despine()
plt.tight_layout()
best_num_features = np.argmin(mse_across_features) + 1
print("best_num_features:", best_num_features)
print("features:", selected_features[0:best_num_features])
print("mse_across_features:", np.min(mse_across_features))
The results plateau, and adding more features doesn't offer any significant change.
for i in range(len(selected_features)):
print(mse_across_features[i], selected_features[0:i+1])
An MSE of 0.09 seems pretty good to me! Let's compare it to other models.
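As a quick sense check on that number: the corresponding root-mean-squared error is sqrt(0.09) = 0.3, so a typical prediction is off by roughly 0.3 rating points on the 0-5 scale.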
3. kNN¶
Let's start simple, as we did with Linear Regression. We begin by fitting a kNN regressor on 1 particular split (the original one). The purpose of starting small is to ensure we're doing it correctly and that the results make sense.
# create and fit a kNN regressor
k = 5
knn_model = KNeighborsRegressor(n_neighbors=k)
knn_model.fit(xtrain.drop(['Name'], axis=1), ytrain)
# evaluate on the training set
y_train_pred = knn_model.predict(xtrain.drop(['Name'], axis=1))
mse_train = mean_squared_error(ytrain, y_train_pred)
# evaluate on the dev set
y_dev_pred = knn_model.predict(xdev.drop(['Name'], axis=1))
mse_dev = mean_squared_error(ydev, y_dev_pred)
print("mse_train:", mse_train, "mse_dev:", mse_dev)
Ok, it runs! However, those scores look much worse than what we started with when using linear regression. Hmm, hopefully it can improve. Let's vary our k value and see if we can do better for this original data split.
max_k = 20
train_MSEs = []
dev_MSEs = []
for k in range(1, max_k):
knn_model = KNeighborsRegressor(n_neighbors=k)
knn_model.fit(xtrain.drop(['Name'], axis=1), ytrain)
# evaluate on the training set
y_train_pred = knn_model.predict(xtrain.drop(['Name'], axis=1))
train_MSE = round(mean_squared_error(ytrain, y_train_pred), 5)
train_MSEs.append(train_MSE)
# evaluate on the dev set
y_dev_pred = knn_model.predict(xdev.drop(['Name'], axis=1))
dev_MSE = round(mean_squared_error(ydev, y_dev_pred), 5)
dev_MSEs.append(dev_MSE)
print("k:", k, "train_MSE:", train_MSE, "dev_MSE:", dev_MSE)
print("*** best result was k =", np.argmin(dev_MSEs), ":", np.min(dev_MSEs))
# let's visualize the trend
ind = [_ + 1 for _ in range(max_k-1)]
fig, ax = plt.subplots(1,1, figsize=(4,4))
ax.plot(ind, train_MSEs, 's-', label='Train split',color='#02afed',linewidth=2)
ax.plot(ind, dev_MSEs, 's--', label='Dev split',color='#355561',linewidth=2)
ax.set_xlabel(r'$K$ values', fontsize=10)
ax.set_ylabel('MSE', fontsize=10)
ax.set_title(r'kNN Performance')
ax.legend(fontsize=14)
fig.tight_layout()
Ah, that's not as good as linear regression! But this is only over 1 split. Let's do the same procedure we did with linear regression, so that it's a fair comparison: feature selection + bootstrapping + cross-validation.
best_num_features = np.argmin(mse_across_features) + 1
print("best_num_features:", best_num_features)
print("features:", selected_features[0:best_num_features])
print("mse_across_features:", np.min(mse_across_features))
min_k = 12
max_k = 18
num_bootstraps = 10
bootstrap_coefs = []
random.seed(a=21)
bootstrap_seeds = random.sample(range(100000), num_bootstraps)
# now we include all of the original 'train'
# which encompasses the original 'dev' but excludes the 'test'
merged = pd.concat([x_, y_], axis=1, sort=False)
k_to_num_features = {}
k_to_best_features = {}
for k in range(min_k, max_k):
selected_features = []
candidate_features = features[1:]
mse_across_features = []
while len(candidate_features) > 0:
best_mse = 9999
best_feature = ""
for feature in candidate_features:
dev_MSEs = []
for i in range(num_bootstraps):
boot_df = merged.sample(frac=1, replace=True, random_state=bootstrap_seeds[i])
kf = KFold(n_splits=10, shuffle=True, random_state=270)
for train_index, dev_index in kf.split(boot_df):
xtrain_cv = boot_df.iloc[train_index][selected_features + [feature]]
ytrain_cv = boot_df.iloc[train_index]["Customers\' Rating"]
xdev_cv = boot_df.iloc[dev_index][selected_features + [feature]]
ydev_cv = boot_df.iloc[dev_index]["Customers\' Rating"]
knn_model = KNeighborsRegressor(n_neighbors=k)
knn_model.fit(xtrain_cv, ytrain_cv)
# evaluate on the training set
y_train_pred = knn_model.predict(xtrain_cv)
train_MSE = round(mean_squared_error(ytrain_cv, y_train_pred), 5)
train_MSEs.append(train_MSE)
# evaluate on the dev set
y_dev_pred = knn_model.predict(xdev_cv)
dev_MSE = round(mean_squared_error(ydev_cv, y_dev_pred), 5)
dev_MSEs.append(dev_MSE)
cur_mse = np.mean(dev_MSEs)
if cur_mse < best_mse:
best_mse = cur_mse
best_feature = feature
mse_across_features.append(best_mse)
selected_features.append(best_feature)
candidate_features.remove(best_feature)
print("k:", k, "best_mse:", best_mse)
k_to_num_features[k] = mse_across_features
k_to_best_features[k] = selected_features[0:np.argmin(mse_across_features)+1]
# let's visualize the trend
ind = [_ + 1 for _ in range(len(features)-1)]
fig, ax = plt.subplots(1,1, figsize=(8,6))
for k in range(min_k, max_k):
ax.plot(ind, k_to_num_features[k], 's-', label='k='+str(k),linewidth=2)
ax.set_xlabel(r'# features', fontsize=10)
ax.set_ylabel('Avg MSE', fontsize=10)
ax.set_title(r'Performance on the Dev Set')
ax.legend(fontsize=10)
xint = range(min(ind), math.ceil(max(ind))+1)
plt.xticks(xint)
fig.tight_layout()
k_to_num_features
best_k = 0
best_mse = 99999
for k in k_to_num_features.keys():
cur_num_features = np.argmin(k_to_num_features[k]) + 1
cur_best_mse = np.min(k_to_num_features[k])
if cur_best_mse < best_mse:
best_mse = cur_best_mse
best_k = k
print("best_k:", best_k, "best_mse:", best_mse)
print("best features:", k_to_best_features[best_k])
QUESTION:¶
The MSE when using kNN seems to come close to Linear Regression. Is there anything we could/should change about this kNN experiment?
ANSWER:¶
Scale our data! The features aren't on the same scale, and the distance metric used by kNN (e.g., Euclidean distance) doesn't account for this. The most useful features still include most of what we saw with Linear Regression, which suggests that our lack of scaling isn't a huge problem, but scaling is definitely the proper thing to do. I'm surprised Price and Rare were able to make it into our list of top features, since their scales are just way smaller than the flavor features, which run from 0-100.
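A minimal sketch of what that fix could look like (an optional aside, not something we ran above), reusing the original train/dev split, the best_k value, and the feature list found by the search; the scaler is fit on the training split only:
# OPTIONAL sketch: standardize the selected features so that kNN's Euclidean
# distance weighs each of them comparably
best_feats = k_to_best_features[best_k]
scaler = preprocessing.StandardScaler()
xtrain_knn = scaler.fit_transform(xtrain[best_feats])
xdev_knn = scaler.transform(xdev[best_feats])  # reuse the training-set statistics
knn_scaled = KNeighborsRegressor(n_neighbors=best_k).fit(xtrain_knn, ytrain)
print("scaled kNN dev_MSE:", round(mean_squared_error(ydev, knn_scaled.predict(xdev_knn)), 5))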