Key Word(s): Case Study
## RUN THIS CELL TO GET THE RIGHT FORMATTING
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2020-CS109A/master/themes/static/css/cs109.css").text
HTML(styles)
# import the necessary libraries
import re
import requests
import random
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import numpy as np
from time import sleep
from bs4 import BeautifulSoup
# global properties
data_dir = "data/" # where to save data
num_search_pages = 50 # how many search result pages to comb through
# NOTE:
# if you haven't yet downloaded the data, this should be set to True
download_data = False
Disclaimer
Alcohol is a drug. There are state and federal laws that govern its sale, distribution, and consumption. In the United States, those who consume alcohol must be at least 21 years of age. In no way am I, or anyone else at IACS or Harvard at large, promoting or encouraging the use of alcohol, and my intention is not to celebrate it. Anyone who chooses to consume alcohol should be of legal age and should do so responsibly. Abusing alcohol has serious, grave effects.
The point of this exercise is purely pedagogical: it illustrates the wide range of tasks to which one can apply data science and machine learning. That is, I am focusing on a particular interest and demonstrating how data science can be used to answer questions one might care about in one's own personal life. You could easily imagine the same process being used in professional settings, too.
Learning Objectives
To see the big-picture process of conducting a project, and to illustrate some of the nuanced details and common pitfalls.
1. Problem Overview
Whiskey is a type of alcohol, and there are many different types of whiskey, including bourbon, which will be the focus of this project.
I am interested in determining:
Are there certain attributes of bourbons that are predictive of good (i.e., highly rated by users) bourbons?
- Find hidden gems (i.e., bourbons that should be good but whose current reviews are absent or unsupportive)
- Find over-hyped whiskeys (i.e., the reviews seem high but the attributes aren't indicative of quality)
- Are the results significantly different if we target experts' ratings instead of average customer ratings?
Are there certain attributes of bourbons that are predictive of expensive bourbons?
- Find under-priced whiskeys
- Find over-priced whiskeys
Which bourbons are most similar to each other?
- Which attributes are important for determining similarity? (e.g., does price play a role?)
2. Obtaining Data
We need a website that has a bunch of whiskey data. Distiller.com seems to be the most authoritative and comprehensive site.
Using distiller.com as our source, I don't see a way to display a list of all of their bourbons. However, if you search for the keyword bourbon, over 2,000 search results appear, each with a link to the particular whiskey. After manual inspection, most of these are in fact bourbons, but a few are not and merely have some association with bourbon (e.g., non-bourbon whiskeys that were casked in old bourbon barrels).
Let's crawl the search results pages to create a set of all candidate bourbons (i.e., whiskey_urls)! Then, using this set, let's download each page. Note that we use a set() instead of a list, in case the same link appears in the results more than once.
whiskey_urls = set()
if download_data:
    # every search results page begins with this prefix
    base_url = 'https://distiller.com/search?term=bourbon'
    # visits each search results page
    for page_num in range(1, num_search_pages + 1):
        cur_page = requests.get(base_url + '&page=' + str(page_num))
        # uses BeautifulSoup to extract all links to whiskeys
        bs_page = BeautifulSoup(cur_page.content, "html.parser")
        for link in bs_page.findAll('a', attrs={'href': re.compile("^/spirits/")}):
            whiskey_urls.add(link.get('href'))
        sleep(1)
    # saves each URL to disk, so that we don't have to crawl the search results again
    with open("whiskey_urls.txt", "w") as f:
        for url in whiskey_urls:
            f.write(url + "\n")
    # fetches each whiskey page and saves it to the hard drive
    for url in whiskey_urls:
        cur_page = requests.get('https://distiller.com' + url).content
        # writes the raw HTML to disk; url[9:] strips the leading "/spirits/"
        with open(data_dir + url[9:], 'wb') as f:
            f.write(cur_page)
        # sleeps between 1-3 seconds, in case the site tries to detect crawling
        sleep(random.randint(1, 3))
else:
    # if the files have already been saved to disk,
    # then we can just load the URL list here instead of crawling again
    with open('whiskey_urls.txt') as f:
        whiskey_urls = set(line.strip() for line in f)
We now have the set of all whiskey URLs in whiskey_urls, along with the actual pages downloaded to our hard drive. We downloaded them to the hard drive for convenience, so that in the future we don't have to spend the ~2 hours crawling all of the pages again.
Let's now load each of these pages!
whiskeys = {}
# loads each downloaded whiskey webpage
for i, url in enumerate(whiskey_urls):
    filename = data_dir + url[9:]  # url[9:] strips the leading "/spirits/"
    with open(filename, 'r') as f:
        file_contents = f.read()
    # instantiates a new BeautifulSoup object
    soup = BeautifulSoup(file_contents, "html.parser")
    # extracts details about the whiskey
    name = soup.find('h1', attrs={'class': re.compile("secondary-headline name")}).text.strip()
    location = soup.find('h2', attrs={'class': "ultra-mini-headline location middleweight"}).text.strip()
    badge = ""
    if soup.find('div', attrs={'class': "spirit-badge"}) is not None:
        badge = soup.find('div', attrs={'class': "spirit-badge"}).text.strip()
    num_ratings = 0
    rating = "N/A"
    if soup.find('span', attrs={'itemprop': "ratingCount"}) is not None:
        num_ratings = int(soup.find('span', attrs={'itemprop': "ratingCount"}).text.strip())
        rating = float(soup.find('span', attrs={'itemprop': "ratingValue"}).text.strip())
    age = soup.find('li', attrs={'class': "detail age"}).find('div', attrs='value').text.strip()
    price = int(re.findall(r"cost-(\d)", str(soup.find('div', attrs={'class': re.compile("spirit-cost")})))[0])
    abv = ""
    if soup.find('li', attrs={'class': "detail abv"}).find('div', attrs='value').text != "":
        abv = float(soup.find('li', attrs={'class': "detail abv"}).find('div', attrs='value').text)
    whiskey_type = soup.find('li', attrs={'class': "detail whiskey-style"}).div.text
    cask_type = ""
    if soup.find('li', attrs={'class': "detail cask-type"}) is not None:
        cask_type = soup.find('li', attrs={'class': "detail cask-type"}).find('div', attrs='value').text.strip()
    review = ""
    expert = ""
    score = ""
    flavor_summary = ""
    flavor_profile = []
    # checks whether an expert reviewed it
    if soup.find('p', attrs={'itemprop': "reviewBody"}) is not None:
        review = soup.find('p', attrs={'itemprop': "reviewBody"}).text.replace("\"", "").strip()
        expert = soup.find('div', attrs={'class': 'meet-experts'}).a.text.strip()
        score = int(soup.find('div', attrs={'class': "distiller-score"}).span.text.strip())
        flavor_summary = soup.find('h3', attrs={'class': "secondary-headline flavors middleweight"}).text.strip()
        # extracts the flavor profile; the data-flavors attribute holds a dict literal
        # (ast.literal_eval would be a safer choice than eval here)
        flavor_profile = eval(soup.find('canvas').attrs['data-flavors'])
    cur_whiskey = [name, whiskey_type, cask_type, location, age, abv, price, badge, num_ratings, \
                   rating, flavor_summary, expert, score]
    if flavor_profile:
        cur_whiskey.extend(list(flavor_profile.values()))
    else:
        # no expert review means no flavor data; pad with zeros for the 14 flavors
        cur_whiskey.extend(np.zeros(14))
    cur_whiskey.append(review)
    whiskeys[i] = cur_whiskey
df = pd.DataFrame.from_dict(whiskeys, orient='index', \
columns=['Name', 'Type', 'Cask', 'Location', 'Age', 'ABV %', 'Price', 'Badge',\
'# Ratings', "Customers' Rating", 'Flavor Summary', 'Expert', 'Expert Score',\
'Smoky', 'Peaty', 'Spicy', 'Herbal', 'Oily', 'Full-bodied', 'Rich',\
'Sweet', 'Briny', 'Salty', 'Vanilla', 'Tart', 'Fruity', 'Floral', 'Review'])
3. Data Sanity Check / Cleaning
What do our features look like? Are any features wonky, inconsistent, or useless? Are any values missing?
pd.set_option('display.max_columns', None)
df2 = df.loc[(df['Expert'] != "")]
print(len(df2))
df2['Type'].value_counts()
pd.set_option('display.max_rows', None)
df2 = df2.loc[df2['Type'] == "Bourbon"]
df2.dtypes
df2.loc[df2['Customers\' Rating'] == "N/A"]
# there still exists 1 whiskey that has no Customer Rating, so let's remove it
df2 = df2.loc[df2['Customers\' Rating'] != "N/A"]
df2 = df2.astype({'Customers\' Rating' : 'float64'})
# we can keep the 'Age' feature for now but be mindful
# that it's missing for nearly half of the whiskeys
len(df2.loc[(df2['Age'] == 'NAS') | (df2['Age'] == 'nas') | (df2['Age'] == '')])
# let's replace all missing values with a reasonable value.
# for now, let's use 0 as a placeholder so that we can later swap it out.
df2['Age'] = df2['Age'].replace(['NAS', 'nas', 'N/A',''],'0')
# removes the 'Year(s)' text and anything following it
df2['Age'] = df2['Age'].replace(to_replace=r' [yY]ear.*', value='', regex=True)
# manually cleans up values that would otherwise be nearly impossible to clean automatically
df2['Age'] = df2['Age'].replace(to_replace=r'6.*', value='6', regex=True)
df2['Age'] = df2['Age'].replace(to_replace=r'(\d+) [Yy].*', value=r'\1', regex=True)
# converts month-based age statements to (approximate) years
df2['Age'] = df2['Age'].replace(to_replace=r'4 [Mm]onths', value='0.33', regex=True)
df2['Age'] = df2['Age'].replace(to_replace=r'9 [Mm]onths', value='0.75', regex=True)
df2['Age'] = df2['Age'].replace(to_replace=r'18 - 20 [Mm]onths', value='1.5', regex=True)
df2['Age'] = df2['Age'].replace(to_replace=r'32 [Mm]onths', value='2.67', regex=True)
df2['Age'] = df2['Age'].replace(to_replace=r'9 to 11', value='0.75', regex=True)
# let's look at all of the items that had an Age statement listed
# (now that all values have been cleaned up)
# note: Age is still a string here, so this is a lexicographic comparison with '0'
df2.loc[df2['Age'] > '0']['Age']
df2 = df2.astype({'Age': 'float64'})
# how many had values?
len(df2.loc[df2['Age'] > 0])
df2['Age'].describe()
df2.loc[df2['Age'] > 0].hist(column='Age', bins='auto')
I think it's fair to impute all of the missing values (i.e., the 0 placeholders) with 7. This is based on outside research, too (Googling and personal knowledge).
df2['Age'] = df2['Age'].replace(0,7)
df2['Age'].describe()
df2.hist(column='Age', bins='auto')
df2['Flavor Summary'].value_counts()
Ok, there's a long tail of values, and it seems the Flavor Summary is just the two most prominent flavors listed for each whiskey, although some list only one flavor. Perhaps this offers no additional information/signal beyond the raw values of the flavors. Still, it might be worth experimenting with turning this feature into 2 new features: primary flavor and secondary flavor. These would need to be one-hot encoded, though, and since there are 14 distinct flavors, that would create 28 new features (or 26, if we drop the first category of each encoding to avoid redundancy). Again, these features might be redundant and not help our models. A hedged sketch of the idea follows.
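If we do try that experiment, a minimal sketch might look like the following. It assumes the two flavors in Flavor Summary are joined by ' & ', which would need to be verified against the actual values.
# hypothetical split of 'Flavor Summary' into primary/secondary flavors;
# assumes the two flavors are joined by " & " -- verify against the real data
split_flavors = df2['Flavor Summary'].str.split(' & ', expand=True)
df2['Primary Flavor'] = split_flavors[0].str.strip()
# some whiskeys list only one flavor, so the second column may be absent or NaN
df2['Secondary Flavor'] = split_flavors[1].fillna('').str.strip() if split_flavors.shape[1] > 1 else ''
# one-hot encodes both new features (up to 28 columns; drop_first=True would give 26)
flavor_dummies = pd.get_dummies(df2[['Primary Flavor', 'Secondary Flavor']], prefix=['Primary', 'Secondary'])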
df2['Badge'].value_counts()
We see that all Badge values are either 'RARE' or just requests from users for an expert to review the whiskey. So, let's replace the Badge column with a boolean 'Rare' column.
df2['Rare'] = df2['Badge'] == 'RARE'  # boolean: True only when the badge is 'RARE'
del df2['Badge']
df2['Rare'].value_counts()
df2['Expert'].value_counts()
Let's cast our features to the correct data types and view summary statistics.
df2 = df2.astype({'Expert Score': 'int32', 'Customers\' Rating' : 'float64', 'ABV %': 'float64'})
df2.describe()
df2.dtypes
4. EDA
Now that our data is cleaned, let's explore it and try to understand any patterns. This understanding will impact our modelling choices. Based on the .describe() statistics above, let's first look at the most extreme values of the features that seem a bit lopsided in their distributions.
df2.sort_values(by=['Smoky'], ascending=False)[0:15][['Name', 'Smoky', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
df2.sort_values(by=['Peaty'], ascending=False)[0:10][['Name', 'Peaty', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
df2.sort_values(by=['Spicy'], ascending=True)[0:14][['Name', 'Spicy', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
df2.sort_values(by=['Herbal'], ascending=False)[0:20][['Name', 'Herbal', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
df2.sort_values(by=['Oily'], ascending=False)[0:15][['Name', 'Oily', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
df2.sort_values(by=['Full-bodied'], ascending=True)[0:20][['Name', 'Full-bodied', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
df2.sort_values(by=['Briny'], ascending=False)[0:20][['Name', 'Briny', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
df2.sort_values(by=['Salty'], ascending=False)[0:20][['Name', 'Salty', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
df2.sort_values(by=['Tart'], ascending=False)[0:20][['Name', 'Tart', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
df2.sort_values(by=['Fruity'], ascending=False)[0:20][['Name', 'Fruity', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
df2.sort_values(by=['Age'], ascending=False)[0:20][['Name', 'Age', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
df2.sort_values(by=['# Ratings'], ascending=False)[0:20][['Name', '# Ratings', 'Rare', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
Ah, interestingly, the single most popular one, Blanton's, is actually very hard to find these days. The Rare field wrongly states that it's not rare, but it is essentially impossible to find in most US states, and bottles are commonly marked up from the $40 MSRP to $200. I'm very surprised that this has the most reviews, but I suspect it's because it has the highest allure amongst the rare ones that are still somewhat possible to find. Years ago, it was very easy to find, so maybe some reviews date from that time.
Additionally, Weller Special Reserve is also impossible to get in most places in the US, for most times of the year, but it has TONS of allure and attention. People obsess over it. Weller Antique 107 is even rarer. The rest are very common in stores and bars, so the data makes sense for these. A sketch of a manual fix for the incorrect Rare flags follows.
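If we wanted to correct those known cases by hand, a hedged, purely illustrative patch is below; the name substrings are my own guesses and would need to be checked against the actual Name values.
# hypothetical manual patch for bottles known to be rare despite the scraped badge
known_rare = ["Blanton's", "Weller Special Reserve", "Weller Antique"]
for rare_name in known_rare:
    df2.loc[df2['Name'].str.contains(rare_name, regex=False), 'Rare'] = True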
df2.sort_values(by=["Customers\' Rating"], ascending=False)[0:20][['Name', '# Ratings', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
This seems correct to me, not because I've tasted any of these, but because these are famous and highly coveted. I've never heard of the best-rated one, Parker's, though. I'd be suspicious that it's an outlier and wrong, especially considering it only has 2 reviews from users; however, the expert also gave it a high score, so it seems like a valid entry. A quick robustness check is sketched below.
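As a quick robustness check (my own addition, not part of the original ranking), we can require a minimum number of customer ratings so that entries with only a couple of reviews can't top the list; the threshold of 10 is an arbitrary illustrative choice.
# re-ranks, considering only whiskeys with at least 10 customer ratings
df2.loc[df2['# Ratings'] >= 10].sort_values(by=["Customers' Rating"], ascending=False)[0:20][['Name', '# Ratings', 'ABV %', 'Price', "Customers' Rating", 'Expert Score']]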
df2.sort_values(by=["Customers\' Rating"], ascending=True)[0:20][['Name', '# Ratings', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
I've never heard of any of these, so this list seems reasonable. Plus, the experts gave them all horrible reviews, so I don't suspect anything suspicious is going on (e.g., customers ironically rating a controversial, highly-praised whiskey horribly low, as if to troll the ratings).
df2.sort_values(by=['Expert Score'], ascending=False)[0:20][['Name', '# Ratings', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
This seems right. I've never heard of Hirsch or Parker's, but the others are speciality versions of famous/popular whiskeys, so this makes sense.
df2.sort_values(by=['Price'], ascending=False)[0:15][['Name', '# Ratings', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
We don't have high granularity (prices are just bins 1-5), which is perhaps a blessing in disguise -- most bourbons are $30-$50, but some rare ones, especially due to price gouging, can be $100-$3,000. That's a wild range, and it is largely due to rarity, allure, and sensationalism in human behavior, as opposed to actual qualities of the bourbon. So, maybe it's good that we don't have to deal with outlier whiskeys that have extraordinary prices.
df2['Location'].value_counts()
Some distilleries produce different brands of whiskey, and most come from Kentucky. You can see that some distilleries produce tons of different types, but this can be a bit misleading, because some of those different types are just slight variations (e.g., Eagle Rare 10, Eagle Rare 17), whereas others are completely different brands (e.g., Buffalo Trace, Blanton's). For now, it's probably best to just ignore the location feature, but we'll keep it in mind for modelling, if we get desperate. One idea would be to create 2 fields from this: one for the geographic state (e.g., Kentucky) and another for the distillery (e.g., Booker's); a sketch of the state extraction follows.
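As a hedged sketch of the first half of that idea: assuming the Location strings look like 'Frankfort, Kentucky, USA' (a format that would need to be confirmed), the state could be pulled out with a string split. The distillery name isn't in this field, so it would need its own scrape.
# hypothetical: grab the second-to-last comma-separated segment as the state
df2['State'] = df2['Location'].str.split(',').str[-2].str.strip()
df2['State'].value_counts()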
fig, axs = plt.subplots(nrows=5, ncols=3, figsize=(20, 20), facecolor='w', edgecolor='k')
fig.subplots_adjust(hspace = .5, wspace=.2)
axs = axs.ravel()
fontsize = 10
flavors = ['Smoky', 'Peaty', 'Spicy', 'Herbal', 'Oily', 'Full-bodied', 'Rich',\
'Sweet', 'Briny', 'Salty', 'Vanilla', 'Tart', 'Fruity', 'Floral']
# plot histograms
for i, flavor in enumerate(flavors):
axs[i].hist(df2[flavor], alpha=0.7, color='lightblue', bins='auto', density=False, histtype = 'bar', edgecolor='k')
axs[i].set_title("Distribution of " + flavor + " Flavor", fontsize=fontsize)
axs[i].set_xlabel(flavor + " Flavor", fontsize=fontsize)
axs[i].set_ylabel('Count', fontsize=fontsize)
# removes the empty one, since we only have 14 flavors, not 15
axs[14].set_axis_off()
These all seem pretty reasonable, and I'm glad that the values have a good spread. A few flavors are a bit skewed, and those are the ones that we inspected above.
For the flavors whose values are almost always 0 (e.g., Salty is usually 0), we would not be able to discern any meaningful trend, so we can throw them out of our visualization. Otherwise, our graph would just be a bunch of points overlapping one another at the 0 value.
grid_features = ['Smoky', 'Spicy', 'Herbal', 'Oily', 'Full-bodied', 'Rich',\
                 'Sweet', 'Vanilla', 'Fruity', 'Floral', \
                 'Age', 'Price', 'Customers\' Rating', 'Expert Score']
scatter = pd.plotting.scatter_matrix(df2[grid_features], alpha=0.4, figsize=(20,20));
for ax in scatter.ravel():
ax.set_xlabel(ax.get_xlabel(), rotation = 90)
ax.set_ylabel(ax.get_ylabel(), rotation = 90)
We see that:
- Customers' Rating is highly correlated with the expert's score
- The higher the Price, the more likely it is to have a high score from both customers and experts
- The higher the Rich value, the more Full-bodied and Sweet it tends to be (strong correlations)
- The higher the Oily value, the more likely it is to be Full-bodied
- No individual flavor seems correlated with the scores from customers or experts. The closest trend is from Full-bodied and Rich, as they seem slightly positively correlated with the scores.
This is an indication that predicting the score is not trivially easy; Full-bodied and Rich can play some role, but if the flavors give any indication, it'll be due to a combination of flavors instead of any one particular flavor. We can verify these impressions numerically below.
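As a quick numeric check of the trends we eyeballed above (my addition), the pairwise correlations show which features move with the two scores:
# correlations of every grid feature with the customers' and experts' scores
corr = df2[grid_features].corr()
print(corr["Customers' Rating"].sort_values(ascending=False))
print(corr['Expert Score'].sort_values(ascending=False))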