
CS109A Introduction to Data Science

Case Study: Hunting for Flavors

PARTS 1-4: Problem Statement, Obtaining Data, Cleaning Data, Exploring Data

Harvard University
Fall 2020
Instructors: Pavlos Protopapas, Kevin Rader, and Chris Tanner


In [1]:
## RUN THIS CELL TO GET THE RIGHT FORMATTING 
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2020-CS109A/master/themes/static/css/cs109.css").text
HTML(styles)
Out[1]:
In [2]:
# import the necessary libraries
import re
import requests
import random
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import numpy as np
from time import sleep
from bs4 import BeautifulSoup

# global properties
data_dir = "data/" # where to save data
num_search_pages = 50 # how many search pages to cull through

# NOTE:
# if you haven't yet downloaded the data, this should be set to True
download_data = False

Disclaimer

Alcohol is a drug. There are state and federal laws that govern its sale, distribution, and consumption. In the United States, those who consume alcohol must be at least 21 years of age. In no way am I, or anyone else at IACS or Harvard at large, promoting or encouraging the use of alcohol. My intention is not to celebrate it. Anyone who chooses to consume alcohol should be of legal age and should do so responsibly. Abusing alcohol has serious, grave effects.

The point of this exercise is purely pedagogical, and it illustrates the wide range of tasks to which one can apply data science and machine learning. That is, I am focusing on a particular interest and demonstrating how it can be used to answer questions that one may be interested in for one's own personal life. You could easily imagine this being used in professional settings, too.

Learning Objectives

Help you see the big-picture process of conducting a project, and illustrate some of the nuanced details and common pitfalls.

1. Problem Overview

Whiskey is a type of alcohol, and there are many different types of whiskey, including bourbon, which will be the focus of this project.

I am interested in determining:

  1. Are there certain attributes of bourbons that are predictive of good (i.e., highly rated by users) bourbons?

    • Find hidden gems (i.e., should be good but current reviews are absent or unsupportive of such)
    • Find over-hyped whiskeys (i.e., the reviews seem high but the attributes aren't indicative)
    • Are there significant results if we target experts' ratings instead of average customer ratings?
  2. Are there certain attributes of bourbons that are predictive of expensive bourbons?

    • Find under-priced whiskeys
    • Find over-priced whiskeys
  3. Which bourbons are more similar to each other?

    • Which attributes are important for determining similarity? (e.g., does price play a role?)

2. Obtaining Data

We need a website that has a bunch of whiskey data. Distiller.com seems to be the most authoritative and comprehensive site.

Using distiller.com as our source, I don't see a way to display a list of all of their bourbons. But if you search for the keyword bourbon, over 2,000 search results appear, each with a link to a particular whiskey. Manual inspection shows that most of these are in fact bourbons, but a few are not and merely have some association with bourbon (e.g., non-bourbon whiskeys that were casked in old bourbon barrels).

Let's crawl the search results pages to create a set of all candidate bourbons (i.e., whiskey_urls)! Then, using this set, let's download each page. Note, we use a set() instead of a list, in case there are duplicates.
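
To make the set-versus-list point concrete, here is a minimal sketch of link extraction with de-duplication. It uses the standard library's HTMLParser rather than BeautifulSoup so that it runs without extra installs, and the HTML snippet is a made-up stand-in for a search-results page (the real markup may differ).

```python
from html.parser import HTMLParser


class SpiritLinkParser(HTMLParser):
    """Collects every href that starts with /spirits/ into a set."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for key, value in attrs:
            if key == "href" and value and value.startswith("/spirits/"):
                self.links.add(value)


# Hypothetical snippet of a results page: the same whiskey is linked twice
sample_html = """
<a href="/spirits/example-bourbon">Example Bourbon</a>
<a href="/spirits/example-bourbon"><img src="x.png"></a>
<a href="/spirits/another-bourbon">Another Bourbon</a>
<a href="/about">About</a>
"""

parser = SpiritLinkParser()
parser.feed(sample_html)
print(sorted(parser.links))
```

Because the links go into a set, the repeated link is stored only once, and the unrelated /about link is ignored.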

Fetching list of webpages via Requests
In [3]:
whiskey_urls = set()

if download_data:

    # every search-results page shares this endpoint; only the query parameters change
    base_url = 'https://distiller.com/search'
    
    # visits each search result page (pages are numbered from 1)
    for page_num in range(1, num_search_pages + 1):
        cur_page = requests.get(base_url, params={'page': page_num, 'term': 'bourbon'})

        # uses BeautifulSoup to extract all links to whiskeys
        bs_page = BeautifulSoup(cur_page.content, "html.parser")
        for link in bs_page.findAll('a', attrs={'href': re.compile("^/spirits/")}):
            whiskey_urls.add(link.get('href'))

        sleep(1)
    
    # saves each URL to disk, so that we don't have to crawl the search results again
    f = open("whiskey_urls.txt", "w")
    for url in whiskey_urls:
        f.write(url + "\n")
    f.close()
    
    # fetches each page and saves it to the hard drive
    for url in whiskey_urls:
        cur_page = requests.get('https://distiller.com' + url).content

        # writes file
        f = open(data_dir + url[9:], 'wb')
        f.write(cur_page)
        f.close()

        # sleeps between 1-3 seconds, in case the site tries to detect crawling
        sleep(random.randint(1,3))
else: 
    
    # if the files have already been saved to disk
    # then you can just load them here, instead of crawling again
    with open('whiskey_urls.txt') as f:
        whiskey_urls = set(line.strip() for line in f)

We now have the set of all whiskey URLs in whiskey_urls, along with each page saved to our hard drive. Saving the pages to disk is a convenience: in the future, we don't have to spend the 2 hours crawling all pages again.
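
The download-once idea can also be wrapped in a small helper. This is only a sketch under assumptions: load_or_fetch and fake_fetch are hypothetical names, and fake_fetch stands in for a real download such as requests.get(url).content.

```python
import tempfile
from pathlib import Path


def load_or_fetch(path, fetch):
    """Return the bytes stored at `path`; call `fetch()` and save the result only if the file is missing."""
    path = Path(path)
    if path.exists():
        return path.read_bytes()
    content = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(content)
    return content


# Stand-in for a real download; it records how often it is called
calls = []


def fake_fetch():
    calls.append(1)
    return b"<html>example page</html>"


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data" / "example-bourbon"
    first = load_or_fetch(target, fake_fetch)   # downloads and writes the file
    second = load_or_fetch(target, fake_fetch)  # served from disk, no download

print(len(calls))
```

The second call never reaches the network stand-in, which is exactly the behavior we want when re-running the notebook.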

Let's now load each of these pages!

In [4]:
whiskeys = {}

# loads whiskey webpage
for i, url in enumerate(whiskey_urls):

    filename = data_dir + url[9:]
    # the context manager closes the file as soon as it has been read
    with open(filename, 'r') as f:
        file_contents = f.read()
    
    # instantiates a new BeautifulSoup object
    soup = BeautifulSoup(file_contents, "html.parser")
    
    # extracts details about the whiskey
    name = soup.find('h1', attrs={'class': re.compile("secondary-headline name")}).text.strip()
    location = soup.find('h2', attrs={'class': "ultra-mini-headline location middleweight"}).text.strip()

    # the badge is optional, so fall back to an empty string
    badge = ""
    badge_tag = soup.find('div', attrs={'class': "spirit-badge"})
    if badge_tag is not None:
        badge = badge_tag.text.strip()
        
    num_ratings = 0
    rating = "N/A"
    if soup.find('span', attrs={'itemprop': "ratingCount"}) != None:
        num_ratings = int(soup.find('span', attrs={'itemprop': "ratingCount"}).text.strip())
        rating = float(soup.find('span', attrs={'itemprop': "ratingValue"}).text.strip())
    
    age = soup.find('li', attrs={'class': "detail age"}).find('div', attrs='value').text.strip()
    price = int(re.findall(r"cost-(\d)", str(soup.find('div', attrs={'class': re.compile("spirit-cost")})))[0])
    abv = ""
    
    if soup.find('li', attrs={'class': "detail abv"}).find('div', attrs='value').text != "":
        abv = float(soup.find('li', attrs={'class': "detail abv"}).find('div', attrs='value').text)
    
    whiskey_type = soup.find('li', attrs={'class': "detail whiskey-style"}).div.text
    cask_type = ""
    if soup.find('li', attrs={'class': "detail cask-type"}) != None:
        cask_type = soup.find('li', attrs={'class': "detail cask-type"}).find('div', attrs='value').text.strip()
    
    review = ""
    expert = ""
    score = ""
    flavor_summary = ""
    flavor_profile = []
    
    # check if an expert reviewed it
    if soup.find('p', attrs={'itemprop': "reviewBody"}) != None:
        review = soup.find('p', attrs={'itemprop': "reviewBody"}).text.replace("\"","").strip()
    
        expert = soup.find('div', attrs={'class': 'meet-experts'}).a.text.strip()
        score = int(soup.find('div', attrs={'class': "distiller-score"}).span.text.strip())
        flavor_summary = soup.find('h3', attrs={'class': "secondary-headline flavors middleweight"}).text.strip()
    
        # extracts flavor profile
        flavor_profile = eval(soup.find('canvas').attrs['data-flavors'])
    
    cur_whiskey = [name, whiskey_type, cask_type, location, age, abv, price, badge, num_ratings, \
        rating, flavor_summary, expert, score]

    if flavor_profile:
        cur_whiskey.extend(list(flavor_profile.values()))
    else:
        cur_whiskey.extend(np.zeros(14))
        
    cur_whiskey.append(review)
    whiskeys[i] = cur_whiskey

# builds the DataFrame once, after every page has been parsed
df = pd.DataFrame.from_dict(whiskeys, orient='index',
    columns=['Name', 'Type', 'Cask', 'Location', 'Age', 'ABV %', 'Price', 'Badge',
             '# Ratings', "Customers' Rating", 'Flavor Summary', 'Expert', 'Expert Score',
             'Smoky', 'Peaty', 'Spicy', 'Herbal', 'Oily', 'Full-bodied', 'Rich',
             'Sweet', 'Briny', 'Salty', 'Vanilla', 'Tart', 'Fruity', 'Floral', 'Review'])
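
One caution about the parsing cell: eval() executes whatever text it is given, which is risky for content downloaded from the web. A hedged alternative is sketched below. It assumes the data-flavors attribute holds a JSON object or a Python dict literal; parse_flavors is a hypothetical helper, and the sample strings are made up.

```python
import ast
import json


def parse_flavors(raw):
    """Parse a flavor-profile attribute string into a dict without executing it."""
    try:
        return json.loads(raw)        # works when the attribute is valid JSON
    except json.JSONDecodeError:
        return ast.literal_eval(raw)  # falls back to Python literals (e.g. single quotes)


# Hypothetical attribute values; the real page markup may differ
print(parse_flavors('{"smoky": 20, "sweet": 80}'))
print(parse_flavors("{'smoky': 20, 'sweet': 80}"))
```

Neither json.loads nor ast.literal_eval will run function calls, so a malicious attribute value raises an error instead of executing.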

3. Data Sanity Check / Cleaning

What do our features look like? Are any features wonky, inconsistent, useless, or missing values?

Let's use only the whiskeys that have been reviewed by experts
In [5]:
pd.set_option('display.max_columns', None)
df2 = df.loc[(df['Expert'] != "")]
print(len(df2))
710
In [6]:
df2['Type'].value_counts()
Out[6]:
Bourbon                          586
Single Malt                       27
Blended American Whiskey          14
Aged Rum                          12
Peated Single Malt                11
Other Whiskey                     11
Flavored Whiskey                   5
Gold Rum                           4
Rhum Agricole Vieux                4
Tequila Reposado                   4
Blended                            3
American Single Malt               3
Spiced Rum                         3
Tequila Añejo                      3
Flavored Rum                       2
Barrel-Aged Gin                    2
Rye                                2
Canadian                           2
Dark Rum                           2
Cachaça                            2
Other Brandy                       1
Rhum Agricole Éléve Sous Bois      1
Rhum Agricole Blanc                1
Dairy/Egg Liqueurs                 1
Silver Rum                         1
White                              1
Old Tom Gin                        1
Other Liqueurs                     1
Name: Type, dtype: int64
Let's use only the bourbons. We have 586 bourbons, which are my primary focus. That isn't a lot of data, but the non-bourbons would likely add noise, as they are different types of alcohol.
In [7]:
pd.set_option('display.max_rows', None)
df2 = df2.loc[df2['Type'] == "Bourbon"]
Let's inspect the data types
In [8]:
df2.dtypes
Out[8]:
Name                  object
Type                  object
Cask                  object
Location              object
Age                   object
ABV %                 object
Price                  int64
Badge                 object
# Ratings              int64
Customers' Rating     object
Flavor Summary        object
Expert                object
Expert Score          object
Smoky                float64
Peaty                float64
Spicy                float64
Herbal               float64
Oily                 float64
Full-bodied          float64
Rich                 float64
Sweet                float64
Briny                float64
Salty                float64
Vanilla              float64
Tart                 float64
Fruity               float64
Floral               float64
Review                object
dtype: object
"Customers' Rating" feature should be a Float. Let's fix it.
In [9]:
df2.loc[df2['Customers\' Rating'] == "N/A"]
Out[9]:
Row 1765 (the single matching row, one field per line):
Name                 Tacoma New West Bourbon
Type                 Bourbon
Cask                 new, charred American oak
Location             Heritage Distilling Co. // Washington, USA
Age                  NAS
ABV %                46
Price                2
Badge                (empty)
# Ratings            0
Customers' Rating    N/A
Flavor Summary       Vanilla & Sweet
Expert               Brock Schulte
Expert Score         78
Smoky                0.0
Peaty                0.0
Spicy                30.0
Herbal               40.0
Oily                 0.0
Full-bodied          40.0
Rich                 30.0
Sweet                80.0
Briny                20.0
Salty                20.0
Vanilla              100.0
Tart                 30.0
Fruity               60.0
Floral               50.0
Review               Nose is full of sweet corn, fresh caramel, and...
In [10]:
# one bourbon has no customer rating, so let's remove it
# (.copy() gives us an independent DataFrame that is safe to modify later)
df2 = df2.loc[df2["Customers' Rating"] != "N/A"].copy()
df2 = df2.astype({"Customers' Rating": 'float64'})
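
As an alternative to filtering on the literal string "N/A", pandas can coerce anything non-numeric to NaN in one step. The sketch below uses a tiny made-up DataFrame (ratings) rather than the real data.

```python
import pandas as pd

ratings = pd.DataFrame({
    "Name": ["A", "B", "C"],  # made-up rows for illustration
    "Customers' Rating": [4.1, "N/A", 3.6],
})

# errors="coerce" turns anything non-numeric (here "N/A") into NaN
ratings["Customers' Rating"] = pd.to_numeric(ratings["Customers' Rating"], errors="coerce")

# drop the rows that have no rating, mirroring the removal step above
ratings = ratings.dropna(subset=["Customers' Rating"])
print(ratings.dtypes)
```

This approach also catches other unexpected placeholders (e.g., an empty string) without having to list each one.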
"Age" feature should represent years. Let's fix it.
In [11]:
# we can keep the 'Age' feature for now but be mindful
# that it's missing for well over half of the whiskeys
len(df2.loc[(df2['Age'] == 'NAS') | (df2['Age'] == 'nas') | (df2['Age'] == '')])
Out[11]:
378
In [12]:
# let's replace all missing values with a reasonable value.
# for now, let's use 0 as a placeholder so that we can later swap it out.
df2['Age'] = df2['Age'].replace(['NAS', 'nas', 'N/A',''],'0')
In [13]:
# preview the values with the 'Years' suffix removed
# (the result is only displayed here, not assigned back to df2)
df2['Age'].replace(to_replace=' [yY]ear[sS]*', value='', regex=True)
Out[13]:
0                     0
4                     0
12        7 y, 2 m,16 d
21                    0
22                    0
26                   17
27                    0
28                    0
38                    6
40                   17
49                    0
52                    0
53                    0
59                    0
60                   10
65                    0
67                    0
75                    0
81                    0
85                    0
97                   15
98                   12
99                    0
113                   0
115                   0
118                  12
119     6 YR 3 MO 10 DY
126                   0
127                   0
128                   0
129                   0
130                   0
136                   0
140                  22
143                   0
147                   0
148                   0
149                   3
151                   0
153                   0
154     6 YR 4 MO 12 DY
159                   0
162     6 YR 4 MO 21 DY
164                   0
165                   0
166                  11
171                   0
176                   0
180                   0
181                   0
185                   0
188                  13
191                   0
193                   0
195                   0
197                   0
198                   0
202                   9
207                   0
217                   0
224                   0
230                   0
231                  17
236                   6
256                   0
257                  14
258                   6
263                   0
264                   0
265                   0
269                  12
276                   0
279                   0
282                   0
283                   0
284                   0
285                   0
287                   0
291                   0
292                   0
293                   0
296                   8
303                  15
308        7 yrs, 9 mos
309                   0
313                   0
315             9 to 11
318                   7
321                   0
322                  15
325                   0
326                   0
328            9 Months
329                   0
336                   0
337                   0
343                   0
344                   8
345                  11
347                   0
355          9 8 Months
357                   0
365                   0
369                  13
371                   0
373                   9
374                   0
380                   0
391                   0
398                  12
404                   0
410                   0
412                   0
415                   0
416                   0
417                   9
419                   0
421                   0
422                   0
424                  10
429                   0
434                   0
435                   0
437                  23
439                   0
440                   4
442                   0
446                   0
448                  10
449                   0
453                   0
460                   0
465                   0
468         6, 5 months
471                   0
473                   0
474                   0
478                  12
479                   0
481                   0
486                   0
488          12 YR 5 MO
489                  15
490                   0
493                   0
496                   0
498      6 YR 3 MO 1 DY
508                   0
514                  10
516      6 YR 5 MO 1 DY
519                   0
520                   0
527                   0
531                   9
545                   0
546                   0
547                   0
548                   0
549                   0
554                   6
556                   0
559                   5
562                   0
570                   0
581                   0
590                   0
592                  14
593                  12
597      7 Y, 2 M, 28 D
605                  12
609                  10
611                   0
614                  12
618                   0
620                   0
622          12 YR 5 MO
623                   1
626      6 YR 3 MO 6 DY
629                   0
631                   0
636                   0
639                   9
641                   9
644                  12
647                   0
649                   0
653                   4
658                   0
664                  10
666                   9
670                   7
671                   0
675                   0
681                   8
683                  10
688                   0
702                  27
703                 7.2
704           32 Months
705                   0
710                   0
714                   6
719      18 - 20 months
722                  12
727                   0
736                   0
751                   0
752                   0
754                  23
755                  15
770                   0
772                   7
775                   0
776                   0
777                   0
782                   0
790                   0
791                  11
793                   0
800                   0
801                   0
824                   0
826                   0
828                   0
831                   0
832                  21
838                   0
842                   0
843                   0
850                   0
851                   0
861                   8
863                   7
865                   0
868                   0
871                  12
882                   0
885                   0
888                   0
895                   0
899                   5
904                   0
916                  11
917                  10
918                   0
928                   0
931                   0
932                   0
937                   0
939                   0
940                   0
942                   0
945                   8
953                   0
960                   0
963                  12
970                   0
971                   0
975                  12
976                   0
979                  12
980                   0
982                   0
990                   0
993                  14
996                   0
1000                  0
1002                  0
1003    6 YR 10 MO 1 DY
1004                  0
1005                  0
1017                  0
1026                  0
1027                  0
1031                  0
1032                  0
1051                  0
1055                  0
1056                  0
1059                  0
1063                  8
1064     6 YR 4 MO 6 DY
1070                  0
1073    6 YR 6 MO 19 DY
1080                  0
1082                  0
1088                  6
1089    6 YR, 2 MO, 1 D
1095                  0
1096                  0
1098                 10
1099                  0
1100                  0
1102                 11
1108                  0
1109                  9
1114                  0
1117                  0
1118                  0
1119                  8
1122                  7
1127                 20
1128                  0
1129                  0
1130                  7
1133                  0
1134                  0
1135                  0
1149                  0
1160                  0
1161                  0
1162                 12
1164                  0
1166                  0
1170                  0
1173                  9
1181                  0
1182                  9
1185                  0
1192                 22
1193                 25
1195                  0
1196                  6
1205    6 YR 2 MO 10 DY
1212                 12
1218                  4
1219                  0
1227                  0
1229                  0
1237                  0
1239                  0
1240                  0
1244           4 months
1250                  0
1252                6-8
1253                  0
1258                  0
1260                  0
1262                  5
1270                  0
1271                  0
1275                  9
1279                  0
1283                  0
1297                 17
1298                  0
1310                  0
1317                 20
1322                  0
1324                 14
1326                  0
1330                  0
1341                  0
1348                 24
1349                  0
1355                  0
1357                  4
1358                  0
1365                  0
1371                  0
1372                  0
1376                 10
1378                  0
1379                  0
1381                  0
1382                  0
1383                 18
1384                  0
1386                 17
1390                 17
1391                  0
1394                  0
1395                  0
1396        6, 5 months
1404                  0
1407                 10
1414                  0
1421                  0
1423                  0
1429                 10
1434                  0
1436                  9
1437                  0
1442                  9
1443                 20
1444                  0
1447                  3
1454                  0
1455                  0
1456                  0
1463                 15
1467                  0
1481                  0
1484                 11
1485                  8
1500                 10
1502                  0
1503                  0
1504                  0
1507                  0
1511                 20
1513                  0
1514                  0
1518                  0
1523                  0
1524                  0
1525                  3
1528                 17
1529                  0
1531                  4
1537                  0
1543                  0
1545                 17
1549                 12
1550    6 YR 3 MO 14 DY
1553                  0
1560                  0
1570                  0
1573                 12
1575                  0
1582                  3
1583                  0
1585                  9
1589                 22
1590                 15
1592                  0
1595                  0
1596                  0
1602                  0
1604                  5
1605                 12
1607                 12
1620     6 Y, 7 M, 23 D
1621                 26
1627                  0
1628                  0
1634                  0
1635                 13
1637                  0
1640                 12
1641                  0
1646                  0
1648                  0
1649                  0
1650                  0
1657                  0
1664                 10
1665                  0
1668                 10
1671                  0
1674           6, 11 mo
1675                 17
1676                  0
1679                  0
1681                  0
1682                  0
1686                  0
1689                  0
1700                 10
1708                 12
1710                  0
1721                 11
1725                  0
1727                  0
1728          8 YR 3 MO
1729                  0
1732                  0
1736                 10
1738                  0
1743                  3
1744                  0
1748                 15
1749                  8
1752                 12
1757                 23
1758                  0
1759                  0
1767                  9
1768                  0
1776                  0
1778                  0
1779                  0
1785                  0
1789                  0
1794                  0
1798                  0
1813                 10
1823                  5
1826                  0
1830                  0
1832                  0
1833                 28
1834                 12
1837                  0
1839                  0
1847                  0
1849                  0
1851                 12
1855                  0
1858                  0
1861                  0
1862                  0
1864                  0
1868                  0
1873                  0
1882                 15
1884          5 YR 6 MO
1899                  0
1903                  0
1905                  0
1909                  0
1917                 10
1920                  0
1925                  0
1926                  7
1929                 12
1932                 12
1933                 21
1936                 10
1939                  0
1945                  0
1947                  0
1952                  0
1957                  0
1959                  0
1964    6 YR 8 MO 14 DY
1965                  6
1969                  0
1972                 17
1978                  0
1980                  0
1981                  0
1983                  0
1991                 16
1992                  0
1999                 13
2000                  0
2004                 11
2011                  0
2012                  0
2020                 23
2026                  0
2027                  0
2037                 12
2044                  0
2045                  0
2055                  0
2062                  0
2063                  0
2066                  0
2069                 17
2071                  0
2073                  0
2078                 12
2079                 12
2083                  0
2087                  9
2088                 11
2092                  0
2096                  0
2098                 11
2099                  0
2106                  3
2107                 10
2108                 14
2110                  3
2120                  0
2125                  0
2128          19 months
2129                  9
2131                 10
2134                  0
2137                  0
2138                  0
2143                  0
2156          6 YR 4 MO
2159                  0
2160                  0
2164                  0
2167                  0
2173                  0
2179                  0
2184                  0
2185                 12
2195                  5
Name: Age, dtype: object
In [14]:
# manually cleaning up values that would otherwise be hard to clean up automatically
df2['Age'] = df2['Age'].replace(to_replace=r'6.*', value='6', regex=True)
df2['Age'] = df2['Age'].replace(to_replace=r'(\d+) [Yy].*', value=r'\1', regex=True)
# NOTE: the next two rules keep the number and only drop the unit,
# so "4 months" becomes 4 and "9 Months" becomes 9, which are later read as years
df2['Age'] = df2['Age'].replace(to_replace='4 [Mm]onths', value='4', regex=True)
df2['Age'] = df2['Age'].replace(to_replace='9 [Mm]onths', value='9', regex=True)
df2['Age'] = df2['Age'].replace(to_replace='18 - 20 [Mm]onths', value='1.5', regex=True)
df2['Age'] = df2['Age'].replace(to_replace='32 [Mm]onths', value='2.67', regex=True)
df2['Age'] = df2['Age'].replace(to_replace='9 to 11', value='0.75', regex=True)
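
A more general way to handle these free-form age statements is a single parsing function that understands units. The sketch below is an alternative, not the method used in this notebook: age_in_years is a hypothetical helper, it takes the lower bound of a range, and it assumes a bare number means years. Applying it would give different values than the manual rules for month-only entries.

```python
import re

MISSING = {"", "nas", "n/a"}

# number (optionally the lower bound of a range such as "18 - 20"), then a unit
PART = re.compile(
    r"(\d+(?:\.\d+)?)(?:\s*(?:-|to)\s*\d+(?:\.\d+)?)?\s*"
    r"(years?|yrs?|y|months?|mos?|m|days?|dy|d)\b",
    re.IGNORECASE,
)
LEADING_NUMBER = re.compile(r"\s*(\d+(?:\.\d+)?)")


def age_in_years(text):
    """Convert a free-form age statement to years; return None when no age is stated."""
    text = text.strip()
    if text.lower() in MISSING:
        return None
    total, found = 0.0, False
    for number, unit in PART.findall(text):
        unit = unit.lower()
        if unit.startswith("y"):
            total += float(number)
        elif unit.startswith("m"):
            total += float(number) / 12
        else:  # days
            total += float(number) / 365
        found = True
    if found:
        return round(total, 2)
    # no unit at all (e.g. "17"): assume the number is in years
    match = LEADING_NUMBER.match(text)
    return float(match.group(1)) if match else None


for raw in ["NAS", "17", "12 Years", "6 YR 3 MO 10 DY", "9 Months", "18 - 20 months"]:
    print(repr(raw), "->", age_in_years(raw))
```

With pandas, such a function could be applied via df2['Age'].map(age_in_years), leaving missing ages as None instead of a 0 placeholder.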
In [15]:
# let's look at all of the items that had an Age statement listed
# (now that all values have been cleaned up); 'Age' still holds strings here,
# so we compare against the placeholder string '0'
df2.loc[df2['Age'] != '0', 'Age']
Out[15]:
12         7
26        17
38         6
40        17
60        10
97        15
98        12
118       12
119        6
140       22
149        3
154        6
162        6
166       11
188       13
202        9
231       17
236        6
257       14
258        6
269       12
296        8
303       15
308        7
315     0.75
318        7
322       15
328        9
344        8
345       11
355        9
369       13
373        9
398       12
417        9
424       10
437       23
440        4
448       10
468        6
478       12
488       12
489       15
498        6
514       10
516        6
531        9
554        6
559        5
592       14
593       12
597        7
605       12
609       10
614       12
622       12
623        1
626        6
639        9
641        9
644       12
653        4
664       10
666        9
670        7
681        8
683       10
702       27
703      7.2
704     2.67
714        6
719      1.5
722       12
754       23
755       15
772        7
791       11
832       21
861        8
863        7
871       12
899        5
916       11
917       10
945        8
963       12
975       12
979       12
993       14
1003       6
1063       8
1064       6
1073       6
1088       6
1089       6
1098      10
1102      11
1109       9
1119       8
1122       7
1127      20
1130       7
1162      12
1173       9
1182       9
1192      22
1193      25
1196       6
1205       6
1212      12
1218       4
1244       4
1252       6
1262       5
1275       9
1297      17
1317      20
1324      14
1348      24
1357       4
1376      10
1383      18
1386      17
1390      17
1396       6
1407      10
1429      10
1436       9
1442       9
1443      20
1447       3
1463      15
1484      11
1485       8
1500      10
1511      20
1525       3
1528      17
1531       4
1545      17
1549      12
1550       6
1573      12
1582       3
1585       9
1589      22
1590      15
1604       5
1605      12
1607      12
1620       6
1621      26
1635      13
1640      12
1664      10
1668      10
1674       6
1675      17
1700      10
1708      12
1721      11
1728       8
1736      10
1743       3
1748      15
1749       8
1752      12
1757      23
1767       9
1813      10
1823       5
1833      28
1834      12
1851      12
1882      15
1884       5
1917      10
1926       7
1929      12
1932      12
1933      21
1936      10
1964       6
1965       6
1972      17
1991      16
1999      13
2004      11
2020      23
2037      12
2069      17
2078      12
2079      12
2087       9
2088      11
2098      11
2106       3
2107      10
2108      14
2110       3
2128      19
2129       9
2131      10
2156       6
2185      12
2195       5
Name: Age, dtype: object
In [16]:
df2 = df2.astype({'Age': 'float64'})
In [17]:
# how many had values?
len(df2.loc[df2['Age'] > 0])
Out[17]:
206
In [18]:
df2['Age'].describe()
Out[18]:
count    585.000000
mean       3.776274
std        6.010627
min        0.000000
25%        0.000000
50%        0.000000
75%        7.000000
max       28.000000
Name: Age, dtype: float64
In [19]:
df2.loc[df2['Age'] > 0].hist(column='Age', bins='auto')
Out[19]:
[Figure: histogram of Age for the bourbons that list an age statement]

I think it's fair to impute all missing values (i.e., the 0 placeholders) with 7 years. This choice is also informed by outside research (Googling and personal knowledge).

In [20]:
df2['Age'] = df2['Age'].replace(0,7)
In [21]:
df2['Age'].describe()
Out[21]:
count    585.000000
mean       8.311316
std        3.607727
min        0.750000
25%        7.000000
50%        7.000000
75%        7.000000
max       28.000000
Name: Age, dtype: float64
In [22]:
df2.hist(column='Age', bins='auto')
Out[22]:
[Figure: histogram of Age for all bourbons after imputation]
What's the distribution of the "Flavor Summary" feature? Is it consistent enough to use?
In [23]:
df2['Flavor Summary'].value_counts()
Out[23]:
Rich & Full Bodied       54
Sweet & Rich             40
Sweet                    36
Vanilla & Sweet          34
Spicy                    33
Vanilla & Rich           24
Full Bodied & Rich       20
Sweet & Vanilla          20
Spicy & Rich             18
Vanilla                  18
Vanilla & Full Bodied    17
Fruity & Sweet           17
Full Bodied & Spicy      17
Spicy & Vanilla          16
Rich & Vanilla           13
Sweet & Spicy            13
Rich & Spicy             13
Full Bodied              11
Vanilla & Spicy          11
Full Bodied & Vanilla    10
Spicy & Sweet            10
Spicy & Full Bodied      10
Fruity                    9
Rich                      9
Sweet & Full Bodied       9
Rich & Sweet              9
Sweet & Fruity            8
Spicy & Fruity            7
Fruity & Rich             7
Spicy & Smoky             5
Fruity & Spicy            5
Fruity & Vanilla          4
Sweet & Oily              3
Floral & Fruity           3
Floral                    3
Spicy & Herbal            3
Sweet & Herbal            3
Floral & Vanilla          2
Oily & Rich               2
Tart                      2
Full Bodied & Fruity      2
Fruity & Floral           2
Rich & Oily               2
Sweet & Smoky             2
Full Bodied & Sweet       2
Fruity & Herbal           2
Spicy & Oily              1
Smoky & Spicy             1
Sweet & Briny             1
Oily & Sweet              1
Oily                      1
Rich & Fruity             1
Vanilla & Floral          1
Sweet & Salty             1
Oily & Full Bodied        1
Spicy & Tart              1
Smoky & Sweet             1
Rich & Smoky              1
Herbal & Tart             1
Floral & Herbal           1
Full Bodied & Oily        1
Smoky & Vanilla           1
Vanilla & Fruity          1
Herbal                    1
Smoky & Rich              1
Herbal & Floral           1
Tart & Vanilla            1
Floral & Sweet            1
Spicy & Floral            1
Herbal & Fruity           1
Fruity & Full Bodied      1
Name: Flavor Summary, dtype: int64

OK, there's a long tail of values, and it seems the Flavor Summary is just the two most prominent flavors listed for each whiskey, although some entries list only one flavor. Perhaps this offers no more information or signal than the raw flavor values themselves. Still, it might be worth experimenting with this by turning the feature into two new features: primary flavor and secondary flavor. These would need to be one-hot encoded, though, and since there are 14 distinct flavors, that would create 28 new features (or 26 if one reference category is dropped from each). Again, these 26 to 28 features might be redundant and not help our models.
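
As a rough sketch of the experiment described above, the summary string can be split on its ' & ' separator and each half one-hot encoded. The toy data and the 'Primary Flavor' / 'Secondary Flavor' column names are assumptions for illustration; nothing here was run against the real df2.

```python
import pandas as pd

# toy stand-in for df2['Flavor Summary']
toy = pd.DataFrame(
    {"Flavor Summary": ["Rich & Full Bodied", "Sweet", "Spicy & Rich"]}
)

# split on the ' & ' separator; single-flavor entries get a missing secondary
parts = toy["Flavor Summary"].str.split(" & ", expand=True)
toy["Primary Flavor"] = parts[0]    # hypothetical column name
toy["Secondary Flavor"] = parts[1]  # hypothetical column name

# one-hot encode both columns; a missing secondary flavor becomes all zeros
encoded = pd.get_dummies(
    toy[["Primary Flavor", "Secondary Flavor"]],
    prefix=["Primary", "Secondary"],
)
print(encoded.columns.tolist())
```

Passing drop_first=True to pd.get_dummies would drop one reference category per feature, which is where the smaller count of columns comes from.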

What is the "Badge" feature like?

In [24]:
df2['Badge'].value_counts()
Out[24]:
                                            428
RARE                                        119
Requested By\nElw00t                          2
Requested By\njd139                           1
Requested By\ntjbriley                        1
Requested By\nBourbon_Obsessed_Lexington      1
Requested By\nCymru-and-the-Ferg              1
Requested By\ndanmeister33                    1
Requested By\nCblake34                        1
Requested By\ndjriebesell                     1
Requested By\nandrewls24                      1
Requested By\nspectorjuan                     1
Requested By\ncubfancccc                      1
Requested By\nsamueljcarlson                  1
Requested By\nJFForbes                        1
Requested By\nJamesSpears                     1
Requested By\nZonaPT                          1
Requested By\nrsbolen                         1
Requested By\nstevenblackburn7                1
Requested By\nmcoop8                          1
Requested By\nSharksfan321                    1
Requested By\nEast17                          1
Requested By\ntkezo645                        1
Requested By\nalshepherd1                     1
Requested By\nSoba45                          1
Requested By\nBourbonPizon                    1
Requested By\nhomerhomerson                   1
Requested By\nGlengoolieBlue                  1
Requested By\nAJLovesWhiskey                  1
Requested By\ncpreynolds87                    1
Requested By\nEzikiel                         1
Requested By\njimcorwin3                      1
Requested By\nTmoore8601                      1
Requested By\nbodkins                         1
Requested By\nstonetone96                     1
Requested By\nGilly                           1
Requested By\nAWhite                          1
Requested By\nJacob-Haralson                  1
Requested By\n1901                            1
Name: Badge, dtype: int64

We see that every non-empty Badge value is either 'RARE' or a request from a user for an expert to review the whiskey. So, let's replace the Badge column with a boolean 'Rare' column.

In [25]:
df2['Rare'] = df2['Badge'] == 'RARE'  # True only for whiskeys flagged as rare
del df2['Badge']
df2['Rare'].value_counts()
Out[25]:
False    466
True     119
Name: Rare, dtype: int64

What is the "Expert" feature like?

In [26]:
df2['Expert'].value_counts()
Out[26]:
Jacob Grier          92
Jake Emen            85
Amanda Schuster      76
Stephanie Moreno     66
Rob Morton           62
Keith Allison        26
Colin Howard         23
Sam Davies           21
Nicole Gilbert       17
Distiller Staff      15
Brock Schulte        14
Paul Belbusti        13
Ryan Conklin         12
Jack Robertiello     10
Tim Knittel          10
Katrina Niemisto      8
Dennis Gobis          4
Ron Bechtol           4
Blair Phillips        4
Jason Albaum          3
Matthew Sheinberg     3
Thijs Klaverstijn     3
Derek Gamlin          2
Phil Olson            2
Liza Weisstuch        2
Lucas Gamlin          2
Michael J. Neff       2
Anna Archibald        1
Perri Salka           1
Eric Abert            1
Brad Japhe            1
Name: Expert, dtype: int64

Let's cast our features to the correct data types and view summary statistics.

In [27]:
df2 = df2.astype({'Expert Score': 'int32', 'Customers\' Rating' : 'float64', 'ABV %': 'float64'})
In [28]:
df2.describe()
Out[28]:
Age ABV % Price # Ratings Customers' Rating Expert Score Smoky Peaty Spicy Herbal Oily Full-bodied Rich Sweet Briny Salty Vanilla Tart Fruity Floral
count 585.000000 585.000000 585.000000 585.000000 585.000000 585.000000 585.000000 585.000000 585.000000 585.000000 585.000000 585.000000 585.000000 585.000000 585.000000 585.000000 585.000000 585.000000 585.000000 585.000000
mean 8.311316 49.899838 2.803419 416.352137 3.756325 86.447863 21.485470 0.230769 53.726496 25.714530 30.695726 58.447863 57.798291 58.381197 4.018803 5.476923 50.340171 21.548718 37.066667 17.835897
std 3.607727 7.170325 1.055314 1097.906235 0.521737 5.737503 18.322834 2.187232 19.560069 19.547419 22.712536 18.317344 18.361300 16.824334 8.589855 9.603433 20.526266 17.517548 21.334758 20.336899
min 0.750000 40.000000 1.000000 1.000000 1.000000 65.000000 0.000000 0.000000 0.000000 0.000000 0.000000 5.000000 0.000000 10.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 7.000000 45.000000 2.000000 22.000000 3.460000 83.000000 10.000000 0.000000 40.000000 10.000000 10.000000 45.000000 45.000000 50.000000 0.000000 0.000000 35.000000 10.000000 20.000000 0.000000
50% 7.000000 47.500000 3.000000 83.000000 3.780000 87.000000 20.000000 0.000000 55.000000 25.000000 30.000000 60.000000 60.000000 60.000000 0.000000 0.000000 50.000000 20.000000 35.000000 10.000000
75% 7.000000 53.500000 4.000000 238.000000 4.170000 91.000000 30.000000 0.000000 70.000000 40.000000 45.000000 70.000000 70.000000 70.000000 5.000000 10.000000 70.000000 30.000000 50.000000 30.000000
max 28.000000 72.050000 5.000000 9072.000000 4.880000 98.000000 90.000000 40.000000 100.000000 90.000000 100.000000 100.000000 100.000000 99.000000 80.000000 80.000000 100.000000 75.000000 100.000000 95.000000
In [48]:
df2.dtypes
Out[48]:
Name                  object
Type                  object
Cask                  object
Location              object
Age                  float64
ABV %                float64
Price                  int64
# Ratings              int64
Customers' Rating    float64
Flavor Summary        object
Expert                object
Expert Score           int32
Smoky                float64
Peaty                float64
Spicy                float64
Herbal               float64
Oily                 float64
Full-bodied          float64
Rich                 float64
Sweet                float64
Briny                float64
Salty                float64
Vanilla              float64
Tart                 float64
Fruity               float64
Floral               float64
Review                object
Rare                    bool
dtype: object

4. EDA

Now that our data is cleaned, let's explore it and try to understand any patterns. This understanding will impact our modelling choices. Based on the .describe() statistics above, let's first look at the most extreme values of features that seem a bit lopsided in their distribution of values.
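
The cells below repeat the same sort-and-slice pattern for each feature. A small helper could express that pattern once; this is only a sketch, and the function name show_extremes, its parameters, and the toy data are hypothetical rather than part of the notebook.

```python
import pandas as pd

def show_extremes(df, column, n=15, ascending=False, extra=("ABV %", "Price")):
    """Return the n rows with the most extreme values of `column`.

    ascending=False gives the largest values, ascending=True the smallest.
    """
    cols = ["Name", column, *extra]
    return df.sort_values(by=column, ascending=ascending).head(n)[cols]

# toy stand-in for df2
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Smoky": [10.0, 90.0, 40.0],
    "ABV %": [45.0, 50.0, 43.0],
    "Price": [2, 2, 1],
})

top = show_extremes(toy, "Smoky", n=2)
print(top)
```

With the real data, a call such as show_extremes(df2, 'Spicy', n=14, ascending=True) would mirror the "least Spicy" cell.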

Which are the most "Smoky"? Intent is to see if there are any errors or something worth noting.

In [29]:
df2.sort_values(by=['Smoky'], ascending=False)[0:15][['Name', 'Smoky', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
Out[29]:
Name Smoky ABV % Price Customers' Rating Expert Score
0 Cleveland Bourbon 90.0 50.00 2 2.48 70
2055 Warbringer Mesquite Smoked Southwest Bourbon 80.0 49.00 3 4.16 85
1073 Booker's Bourbon Batch 2019-04 "Beaten Biscuits" 80.0 63.05 3 4.21 87
1595 Jim Beam Black Label Extra-Aged 80.0 43.00 1 3.28 84
1757 Pappy Van Winkle 23 Year 80.0 47.80 5 4.54 89
1181 Jim Beam Double Oak 75.0 43.00 2 3.21 82
1785 Rebel Yell Kentucky Straight Bourbon 100 Proof 75.0 50.00 1 3.46 86
315 Booker's 25th Anniversary Bourbon 75.0 65.40 4 4.45 96
1640 Elijah Craig Barrel Proof Bourbon 70.0 68.50 4 4.33 93
1447 Garrison Brothers Texas Straight Bourbon 70.0 47.00 4 3.51 84
1823 Barrell Bourbon Batch 001 70.0 60.80 3 4.35 88
556 Elk Rider Bourbon 70.0 46.00 2 3.02 79
2164 Old Forester Single Barrel Bourbon Barrel Stre... 70.0 65.00 3 3.80 84
365 Lexington Bourbon 70.0 43.00 2 2.77 70
1936 Parker's Heritage Heavy Char Bourbon 10 Year 65.0 60.00 4 4.88 92

Which are the most "Peaty"? Intent is to see if there are any errors or something worth noting.

In [30]:
df2.sort_values(by=['Peaty'], ascending=False)[0:10][['Name', 'Peaty', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
Out[30]:
Name Peaty ABV % Price Customers' Rating Expert Score
1529 New Riff Backsetter Peated Backset Bourbon 40.0 50.0 2 3.06 84
1117 Backbone Prime Blended Bourbon 20.0 52.0 2 3.61 82
2055 Warbringer Mesquite Smoked Southwest Bourbon 15.0 49.0 3 4.16 85
562 J. Riddle Peated Bourbon 15.0 45.5 2 3.35 87
171 Evan Williams Single Barrel 10.0 43.3 2 3.87 96
1905 Old Bardstown Black Label Kentucky Straight Bo... 10.0 45.0 1 2.90 86
2179 Knob Creek Small Batch Bourbon 10.0 50.0 2 3.47 84
963 Lux Row Distillers Double Barrel Bourbon 12 Year 5.0 59.2 4 4.03 91
1000 Still & Oak Straight Bourbon 5.0 43.0 2 3.40 83
1240 Angel's Envy Bourbon Finished in Port Wine Bar... 5.0 62.0 5 4.26 92

Which are the least "Spicy"? Intent is to see if there are any errors or something worth noting.

In [31]:
df2.sort_values(by=['Spicy'], ascending=True)[0:14][['Name', 'Spicy', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
Out[31]:
Name Spicy ABV % Price Customers' Rating Expert Score
1031 Dancing Pines Bourbon 0.0 44.00 2 3.09 82
224 Rebel Yell Kentucky Straight Bourbon 0.0 40.00 1 2.88 84
1729 Burnside Oregon Oaked Bourbon 10.0 48.00 2 3.43 84
618 Three Chord Blended Bourbon 10.0 40.50 2 3.10 72
681 Old Charter 8 Year 10.0 40.00 1 2.88 80
1381 Rough Rider Straight Bourbon 15.0 45.00 2 3.34 85
939 TX Straight Bourbon Whiskey 15.0 47.00 4 3.47 84
1778 Orange County Distillery Bourbon 15.0 45.00 3 3.38 79
671 County Seat Spirits Hidden Copper Bourbon 15.0 45.00 2 3.50 80
1991 Black Maple Hill 16 Year Bourbon 15.0 47.50 4 4.62 97
130 Early Times 354 Bourbon 15.0 40.00 1 3.03 87
256 Larceny Small Batch Kentucky Straight Bourbon 15.0 46.00 1 3.47 88
236 Wild Turkey Rare Breed 15.0 54.10 2 3.84 92
1129 George T. Stagg Bourbon (Fall 2014) 15.0 69.05 3 4.53 92

Which are the most "Herbal"? Intent is to see if there are any errors or something worth noting.

In [32]:
df2.sort_values(by=['Herbal'], ascending=False)[0:20][['Name', 'Herbal', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
Out[32]:
Name Herbal ABV % Price Customers' Rating Expert Score
592 Remus Volstead Reserve 14 Year Bottled in Bond... 90.0 50.0 5 4.25 89
1349 Temperance Trader Chinato Barrel-Finished Bourbon 90.0 45.0 2 3.61 87
1503 Redemption High Rye Bourbon 82.0 46.0 2 3.35 82
1851 Elijah Craig Barrel Proof Bourbon Batch C919 80.0 68.4 3 4.33 91
865 Treaty Oak Red Handed Bourbon (Kentucky & Virg... 80.0 47.5 2 3.68 83
373 Belle Meade Cask Strength Single Barrel Bourbo... 80.0 61.2 3 4.37 95
623 Cody Road Bourbon 78.0 45.0 2 2.72 71
895 Evan Williams White Label Bottled in Bond Bourbon 76.0 50.0 1 3.35 80
1830 Black Button Four Grain Bourbon 75.0 42.0 2 3.31 87
1442 Yellowstone Kentucky Straight Bourbon 9 Year (... 75.0 50.5 4 4.08 88
1371 Rock Hill Farms Bourbon 74.0 50.0 3 4.19 91
1637 Old Fitzgerald Bottled In Bond Bourbon 73.0 50.0 1 3.39 84
166 Jim Beam Signature Craft High Rye Bourbon 11 Year 70.0 45.0 4 3.50 84
1429 Henry McKenna 10 Year Bottled in Bond Bourbon 70.0 50.0 2 3.93 91
1095 Four Roses Small Batch Select Bourbon 70.0 52.0 4 4.00 89
343 St. Augustine Double Cask Bourbon 70.0 43.9 2 3.48 83
1748 I.W. Harper 15 Year Bourbon 70.0 43.0 3 4.06 94
38 Heaven Hill Bottled In Bond 6 Year 70.0 50.0 1 3.43 79
326 Four Roses Limited Edition Small Batch Bourbon... 70.0 56.3 4 4.48 93
258 Booker's Bourbon Batch 2015-04 "Oven Buster Ba... 70.0 63.5 3 3.66 85

Which are the most "Oily"? Intent is to see if there are any errors or something worth noting.

In [33]:
df2.sort_values(by=['Oily'], ascending=False)[0:15][['Name', 'Oily', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
Out[33]:
Name Oily ABV % Price Customers' Rating Expert Score
1757 Pappy Van Winkle 23 Year 100.0 47.80 5 4.54 89
1640 Elijah Craig Barrel Proof Bourbon 100.0 68.50 4 4.33 93
1442 Yellowstone Kentucky Straight Bourbon 9 Year (... 95.0 50.50 4 4.08 88
1851 Elijah Craig Barrel Proof Bourbon Batch C919 90.0 68.40 3 4.33 91
514 Parker's Heritage Cognac Barrel Finish 10 Year... 90.0 50.00 3 4.53 96
303 George T. Stagg Bourbon (Fall 2013) 90.0 64.10 5 4.62 97
865 Treaty Oak Red Handed Bourbon (Kentucky & Virg... 90.0 47.50 2 3.68 83
1182 Barrell Bourbon Batch 008 85.0 66.40 4 4.20 83
916 Four Roses Limited Edition Single Barrel Bourb... 80.0 54.20 4 4.25 98
1785 Rebel Yell Kentucky Straight Bourbon 100 Proof 80.0 50.00 1 3.46 86
2079 1792 Aged Twelve Years 80.0 48.30 2 3.88 88
329 Colonel E.H. Taylor, Jr. Small Batch Bottled i... 80.0 50.00 2 4.14 90
1073 Booker's Bourbon Batch 2019-04 "Beaten Biscuits" 80.0 63.05 3 4.21 87
1262 J. Henry & Sons 5 Year Wisconsin Straight Bour... 80.0 60.00 3 4.18 83
1524 Treaty Oak Ghost Hill Texas Bourbon 80.0 47.50 2 3.50 80

Which are the least "Full-bodied"? Intent is to see if there are any errors or something worth noting.

In [34]:
df2.sort_values(by=['Full-bodied'], ascending=True)[0:20][['Name', 'Full-bodied', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
Out[34]:
Name Full-bodied ABV % Price Customers' Rating Expert Score
618 Three Chord Blended Bourbon 5.0 40.50 2 3.10 72
0 Cleveland Bourbon 10.0 50.00 2 2.48 70
1031 Dancing Pines Bourbon 10.0 44.00 2 3.09 82
1641 291 Colorado Bourbon 15.0 50.00 3 3.60 78
2027 Feisty Spirits Blue Corn Bourbon 15.0 44.00 2 2.50 81
496 Bird Dog Small Batch Bourbon 15.0 43.00 2 3.14 82
2063 Ancient Age 20.0 40.00 1 2.55 78
1218 Burnside Bourbon 20.0 48.00 2 3.14 79
658 Maryland Club Straight Bourbon 20.0 47.50 2 2.00 78
1391 Central Standard Bourbon 20.0 45.00 2 2.42 80
1423 Hudson Baby Bourbon 20.0 46.00 4 3.28 83
1513 Missouri Spirits Bourbon Whiskey 20.0 40.00 2 3.32 72
1543 J.W. Overbey Bourbon 20.0 45.00 4 3.00 69
2088 Jim Beam Signature Craft Whole Rolled Oat Bour... 20.0 45.00 4 3.59 81
415 New Holland Beer Barrel Bourbon 20.0 40.00 3 3.07 65
486 Double Diamond Limited Edition Bourbon 267 20.0 40.00 2 1.00 80
1026 Barrell Bourbon New Year 2019 20.0 56.05 4 3.97 77
1729 Burnside Oregon Oaked Bourbon 20.0 48.00 2 3.43 84
180 Feisty Spirits Better Days Bourbon 20.0 44.00 2 3.33 80
22 Delaware Phoenix Bourbon 20.0 50.00 3 3.00 75

Which are the most "Briny"? Intent is to see if there are any errors or something worth noting.

In [35]:
df2.sort_values(by=['Briny'], ascending=False)[0:20][['Name', 'Briny', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
Out[35]:
Name Briny ABV % Price Customers' Rating Expert Score
1524 Treaty Oak Ghost Hill Texas Bourbon 80.0 47.50 2 3.50 80
1834 Elijah Craig 12 Year 60.0 47.00 2 3.80 93
1481 Jefferson's Ocean Aged at Sea Voyage 15 Specia... 50.0 45.00 4 3.87 82
527 Old Soul Blended Straight Bourbon 45.0 45.00 2 3.49 74
1882 Pappy Van Winkle Family Reserve 15 Year 40.0 53.50 4 4.53 89
224 Rebel Yell Kentucky Straight Bourbon 40.0 40.00 1 2.88 84
623 Cody Road Bourbon 40.0 45.00 2 2.72 71
1149 Murray Hill Club Blended Bourbon 40.0 51.00 4 3.49 82
1757 Pappy Van Winkle 23 Year 30.0 47.80 5 4.54 89
1447 Garrison Brothers Texas Straight Bourbon 30.0 47.00 4 3.51 84
664 Michter's 10 Year Single Barrel Bourbon 30.0 47.20 4 4.20 90
1117 Backbone Prime Blended Bourbon 30.0 52.00 2 3.61 82
861 1792 Ridgemont Reserve Bourbon 8 Year 30.0 46.85 2 3.61 90
171 Evan Williams Single Barrel 30.0 43.30 2 3.87 96
1905 Old Bardstown Black Label Kentucky Straight Bo... 30.0 45.00 1 2.90 86
1317 Pappy Van Winkle 20 Year 30.0 45.20 4 4.67 92
1205 Booker's Bourbon Batch 2018-02 Backyard BBQ 25.0 64.40 3 4.20 83
2055 Warbringer Mesquite Smoked Southwest Bourbon 25.0 49.00 3 4.16 85
990 Noah's Mill Bourbon 20.0 57.15 3 4.02 93
198 Old Bardstown Estate Bottled Kentucky Straight... 20.0 50.50 2 3.30 84

Which are the most "Salty"? Intent is to see if there are any errors or something worth noting.

In [36]:
df2.sort_values(by=['Salty'], ascending=False)[0:20][['Name', 'Salty', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
Out[36]:
Name Salty ABV % Price Customers' Rating Expert Score
1524 Treaty Oak Ghost Hill Texas Bourbon 80.0 47.50 2 3.50 80
1864 Blaum Bros. Galena Reserve (Series 0) 55.0 57.80 3 4.50 86
207 Remus Repeal Reserve Series III Straight Bourbon 55.0 50.00 4 3.88 86
1481 Jefferson's Ocean Aged at Sea Voyage 15 Specia... 50.0 45.00 4 3.87 82
1834 Elijah Craig 12 Year 50.0 47.00 2 3.80 93
162 Booker's Bourbon Batch 2020-01 "Granny's Batch" 40.0 63.20 4 4.07 92
1837 Coppersea Excelsior Bourbon 40.0 48.00 4 2.88 81
1785 Rebel Yell Kentucky Straight Bourbon 100 Proof 40.0 50.00 1 3.46 86
1757 Pappy Van Winkle 23 Year 40.0 47.80 5 4.54 89
1882 Pappy Van Winkle Family Reserve 15 Year 40.0 53.50 4 4.53 89
2179 Knob Creek Small Batch Bourbon 40.0 50.00 2 3.47 84
1732 Colter's Run Bourbon 35.0 44.00 2 3.57 79
373 Belle Meade Cask Strength Single Barrel Bourbo... 30.0 61.20 3 4.37 95
293 Ezra Brooks Kentucky Straight Bourbon 90 Proof 30.0 45.00 1 3.41 88
171 Evan Williams Single Barrel 30.0 43.30 2 3.87 96
1317 Pappy Van Winkle 20 Year 30.0 45.20 4 4.67 92
1728 Barrell Bourbon Batch 005 30.0 62.35 4 4.07 81
224 Rebel Yell Kentucky Straight Bourbon 30.0 40.00 1 2.88 84
1130 Virgin Bourbon 7 Year 101 30.0 50.50 1 3.35 83
126 Henry DuYore's Straight Bourbon Whiskey 30.0 45.60 2 3.32 81

Which are the most "Tart"? Intent is to see if there are any errors or something worth noting.

In [37]:
df2.sort_values(by=['Tart'], ascending=False)[0:20][['Name', 'Tart', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
Out[37]:
Name Tart ABV % Price Customers' Rating Expert Score
592 Remus Volstead Reserve 14 Year Bottled in Bond... 75.0 50.00 5 4.25 89
1925 Redemption Temptation Bourbon 74.0 41.00 2 2.91 77
895 Evan Williams White Label Bottled in Bond Bourbon 72.0 50.00 1 3.35 80
1524 Treaty Oak Ghost Hill Texas Bourbon 70.0 47.50 2 3.50 80
1768 Peach Street Colorado Straight Bourbon 70.0 46.00 3 3.59 83
782 Johnny Drum Green Label Bourbon 70.0 40.00 2 3.19 82
1933 Elijah Craig Single Barrel 21 Year 69.0 45.00 5 4.14 87
1371 Rock Hill Farms Bourbon 68.0 50.00 3 4.19 91
865 Treaty Oak Red Handed Bourbon (Kentucky & Virg... 65.0 47.50 2 3.68 83
176 Oola Waitsburg Bourbon 65.0 47.00 2 3.37 80
1503 Redemption High Rye Bourbon 65.0 46.00 2 3.35 82
1310 The Walking Dead Kentucky Straight Bourbon 65.0 47.00 2 3.09 83
448 Buffalo Trace Experimental Collection French O... 60.0 45.00 4 3.35 88
283 Calumet Farm Bourbon 60.0 43.00 2 3.10 72
1637 Old Fitzgerald Bottled In Bond Bourbon 60.0 50.00 1 3.39 84
1575 George T. Stagg Bourbon (Fall 2019) 60.0 58.45 4 4.59 98
710 Old Heaven Hill Gold Label Bottled In Bond Bou... 60.0 50.00 1 3.12 77
918 Daviess County Kentucky Straight Bourbon 60.0 48.00 2 3.32 89
851 Coopers' Craft Barrel Reserve Straight Bourbon 60.0 50.00 2 3.65 85
446 Duke Kentucky Straight Bourbon 60.0 44.00 2 3.18 83

Which are the most "Fruity"? Intent is to see if there are any errors or something worth noting.

In [38]:
df2.sort_values(by=['Fruity'], ascending=False)[0:20][['Name', 'Fruity', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
Out[38]:
Name Fruity ABV % Price Customers' Rating Expert Score
325 Heaven Hill Select Stock Bourbon 100.0 65.10 4 4.09 84
2195 Barrell Bourbon Batch 007 90.0 61.20 3 3.50 83
448 Buffalo Trace Experimental Collection French O... 90.0 45.00 4 3.35 88
453 W.H. Harrison Governor's Reserve Bourbon 90.0 56.50 3 3.36 84
1056 Parker's Heritage Master Distiller's Blend of ... 90.0 63.50 3 4.18 93
1980 1792 High Rye Bourbon 90.0 47.15 2 3.72 85
1727 Four Roses Limited Edition Small Batch Bourbon... 90.0 53.65 4 4.28 92
1525 Blaum Bros Knotter Bourbon 3 Year (Batch #6) 85.0 45.00 2 3.49 76
1959 Woodford Reserve Master’s Collection Brandy Ca... 85.0 45.20 4 4.17 91
895 Evan Williams White Label Bottled in Bond Bourbon 81.0 50.00 1 3.35 80
629 Hancock's President's Reserve Single Barrel Bo... 80.0 44.45 3 3.81 88
751 Daviess County Kentucky Straight Bourbon Frenc... 80.0 48.00 2 3.73 91
916 Four Roses Limited Edition Single Barrel Bourb... 80.0 54.20 4 4.25 98
1002 10th Mountain Bourbon 80.0 46.00 3 3.30 75
1349 Temperance Trader Chinato Barrel-Finished Bourbon 80.0 45.00 2 3.61 87
1051 Woodford Reserve Master's Collection Four Wood... 80.0 47.20 4 4.30 96
1839 Woodford Reserve Master's Collection Sonoma-Cu... 80.0 45.20 4 3.69 84
1832 Dark Corner Distillery Lewis Redmond Bourbon 80.0 43.00 3 2.90 77
1729 Burnside Oregon Oaked Bourbon 80.0 48.00 2 3.43 84
1833 Hirsch Selection 28 Year Bourbon 80.0 43.40 5 3.89 97

Which are the oldest? Intent is to see if there are any errors or something worth noting.

In [39]:
df2.sort_values(by=['Age'], ascending=False)[0:20][['Name', 'Age', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
Out[39]:
Name Age ABV % Price Customers' Rating Expert Score
1833 Hirsch Selection 28 Year Bourbon 28.0 43.40 5 3.89 97
702 Heaven Hill 27 Year Barrel Proof Kentucky Stra... 27.0 47.35 5 4.26 93
1621 Old Blowhard 26 Year Bourbon 26.0 45.35 5 3.68 81
1193 Michter's 25 Year Single Barrel Bourbon 25.0 54.30 5 4.49 87
1348 Rhetoric 24 Year Bourbon 24.0 45.40 4 4.21 89
1757 Pappy Van Winkle 23 Year 23.0 47.80 5 4.54 89
437 Rhetoric 23 Year Bourbon 23.0 45.30 4 4.21 85
2020 Evan Williams 23 Year Bourbon 23.0 53.50 5 4.37 86
754 Elijah Craig 23 Year Bourbon 23.0 45.00 5 4.19 90
140 Blade And Bow Bourbon 22 Year (2015 Release) 22.0 46.00 5 4.29 87
1192 Rhetoric 22 Year Bourbon 22.0 45.00 4 4.29 87
1589 Lost Prophet 22 Year Bourbon 22.0 45.05 4 4.34 90
1933 Elijah Craig Single Barrel 21 Year 21.0 45.00 5 4.14 87
832 Rhetoric 21 Year Bourbon 21.0 45.10 4 4.05 90
1511 Michter's 20 Year Single Barrel Bourbon 20.0 57.10 5 4.49 94
1317 Pappy Van Winkle 20 Year 20.0 45.20 4 4.67 92
1127 Barterhouse 20 Year Bourbon 20.0 45.10 4 4.06 87
1443 Rhetoric 20 Year Bourbon 20.0 45.00 4 3.92 89
2128 Batch 206 Old Log Cabin Bourbon 19.0 43.00 2 3.15 83
1383 Elijah Craig 18 Year 18.0 45.00 5 4.31 92

Which are the most popular? Intent is to see if there are any errors or something worth noting.

In [40]:
df2.sort_values(by=['# Ratings'], ascending=False)[0:20][['Name', '# Ratings', 'Rare', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
Out[40]:
Name # Ratings Rare ABV % Price Customers' Rating Expert Score
276 Blanton's Original Single Barrel 9072 False 46.5 2 4.30 89
826 Buffalo Trace Bourbon 8913 False 45.0 2 3.66 83
917 Eagle Rare 10 Year Bourbon 8656 False 45.0 2 4.02 91
113 Maker's Mark Bourbon 7209 False 45.0 2 3.46 87
85 Woodford Reserve Bourbon 7087 False 45.2 2 3.65 85
1744 Bulleit Bourbon 6712 False 45.0 2 3.48 86
1909 Four Roses Single Barrel Bourbon 5890 False 50.0 2 4.00 90
465 Basil Hayden's Bourbon 5328 False 40.0 2 3.62 80
1133 Weller Special Reserve 4931 False 45.0 2 3.89 91
285 Angel's Envy Bourbon Finished in Port Wine Bar... 4741 False 43.3 2 3.86 84
1592 Woodford Reserve Double Oaked 4590 False 45.2 2 4.11 92
329 Colonel E.H. Taylor, Jr. Small Batch Bottled i... 4444 False 50.0 2 4.14 90
2179 Knob Creek Small Batch Bourbon 4354 False 50.0 2 3.47 84
380 Elijah Craig Small Batch Bourbon 4345 False 47.0 2 3.68 85
479 Four Roses Small Batch Bourbon 4159 False 45.0 2 3.83 92
171 Evan Williams Single Barrel 3983 False 43.3 2 3.87 96
1429 Henry McKenna 10 Year Bottled in Bond Bourbon 3868 False 50.0 2 3.93 91
256 Larceny Small Batch Kentucky Straight Bourbon 3620 False 46.0 1 3.47 88
81 Maker's Mark 46 3551 False 47.0 2 3.74 90
263 Weller Antique 107 3495 False 53.5 2 4.12 92

Ah, interestingly, the single most popular whiskey, Blanton's, is actually very hard to find these days. The Rare field wrongly states it's not rare, but it is essentially impossible to find in most US states, and bottles are commonly marked up from $40 MSRP to $200. I'm very surprised that this has the most reviews, but I suspect it's because it has the highest allure among the rare ones that are somewhat possible to find. Years ago, it was very easy to find, so maybe some reviews are from that time.

Additionally, Weller Special Reserve is also impossible to get in most places in the US for most of the year, but it has TONS of allure and attention. People obsess over it. Weller Antique 107 is even rarer. The rest are very common in stores and bars, so the data makes sense for these.

Which are the best according to customers? Intent is to see if there are any errors or something worth noting.

In [41]:
df2.sort_values(by=["Customers\' Rating"], ascending=False)[0:20][['Name', '# Ratings', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
Out[41]:
Name # Ratings ABV % Price Customers' Rating Expert Score
1936 Parker's Heritage Heavy Char Bourbon 10 Year 2 60.00 4 4.88 92
1283 Abraham Bowman Sweet XVI Bourbon 15 58.00 3 4.84 93
871 William Larue Weller Bourbon (Fall 2020) 1 67.25 4 4.75 93
2078 William Larue Weller Bourbon (Fall 2016) 134 67.70 4 4.71 96
1570 William Larue Weller Bourbon (Fall 2017) 182 64.10 4 4.71 90
593 William Larue Weller Bourbon (Fall 2015) 232 67.30 3 4.70 98
1463 King of Kentucky 15 Year Kentucky Straight Bou... 23 65.50 5 4.69 90
296 Old Forester President's Choice Bourbon 3 59.30 5 4.67 90
21 Four Roses Limited Edition 50th Anniversary Sm... 158 54.30 4 4.67 94
1317 Pappy Van Winkle 20 Year 893 45.20 4 4.67 92
1957 George T. Stagg Bourbon (Fall 2017) 387 64.60 4 4.66 94
727 William Larue Weller Bourbon (Fall 2018) 183 62.85 4 4.65 96
2069 Russell's Reserve 1998 19 51.10 5 4.64 92
303 George T. Stagg Bourbon (Fall 2013) 422 64.10 5 4.62 97
97 George T. Stagg Bourbon (Fall 2015) 104 69.10 3 4.62 91
1991 Black Maple Hill 16 Year Bourbon 69 47.50 4 4.62 97
2108 King of Kentucky 14 Year Kentucky Straight Bou... 23 67.50 5 4.61 94
1596 George T. Stagg Bourbon (Fall 2018) 382 62.45 4 4.59 93
1575 George T. Stagg Bourbon (Fall 2019) 395 58.45 4 4.59 98
1932 William Larue Weller Bourbon (Fall 2013) 139 68.10 4 4.57 96

This seems correct to me, not because I've tasted any of these, but because they are famous and highly coveted. I've never heard of the top-rated one, Parker's Heritage Heavy Char, though. I'd be suspicious that it's an outlier and wrong, especially considering it only has 2 reviews from users; however, the expert also gave it a high score, so it seems like a valid entry.
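
One way to guard against ratings that rest on only a handful of reviews is to require a minimum review count before ranking. The sketch below uses toy data, and the cutoff of 20 is an arbitrary assumption, not a value taken from the notebook.

```python
import pandas as pd

# toy stand-in for df2
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "# Ratings": [2, 134, 1, 893],
    "Customers' Rating": [4.88, 4.71, 4.75, 4.67],
})

min_ratings = 20  # assumption: an arbitrary cutoff for "enough" reviews

# keep only whiskeys with enough reviews, then rank by customer rating
well_rated = (
    toy.loc[toy["# Ratings"] >= min_ratings]
       .sort_values(by="Customers' Rating", ascending=False)
)
print(well_rated)
```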

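Which are the worst according to customers? Intent is to see if there are any errors or something worth noting.

In [42]: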
df2.sort_values(by=["Customers\' Rating"], ascending=True)[0:20][['Name', '# Ratings', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
Out[42]:
Name # Ratings ABV % Price Customers' Rating Expert Score
486 Double Diamond Limited Edition Bourbon 267 1 40.0 2 1.00 80
2167 Syntax Spirits Bourbon 2 47.5 2 2.00 80
658 Maryland Club Straight Bourbon 2 47.5 2 2.00 78
1920 Winchester "Extra Smooth" Bourbon 75 45.0 1 2.21 65
1108 Detroit City Two-Faced Bourbon 8 47.0 2 2.25 77
1244 Black Button Little Barrel Bourbon 8 42.0 3 2.38 75
230 Adirondack 601 Bourbon 21 43.2 3 2.40 78
1250 Old Crow Kentucky Straight Bourbon 197 40.0 1 2.40 71
1391 Central Standard Bourbon 9 45.0 2 2.42 80
801 Evan Williams Green Label 177 40.0 1 2.45 81
0 Cleveland Bourbon 116 50.0 2 2.48 70
2027 Feisty Spirits Blue Corn Bourbon 2 44.0 2 2.50 81
2063 Ancient Age 212 40.0 1 2.55 78
611 Graveyard Sam's Baby Bourbon 5 45.0 2 2.55 72
1279 John B. Stetson Kentucky Straight Bourbon Whiskey 52 42.0 2 2.56 68
328 New Liberty Bloody Butcher Bourbon 19 47.5 2 2.57 87
1689 Kentucky Tavern Bourbon 47 40.0 1 2.59 80
1826 Cabin Still Bourbon 20 40.0 1 2.61 79
824 Yellow Rose Double Barrel Bourbon 24 43.0 2 2.66 83
195 Old Hickory Great American Straight Bourbon 13 43.0 2 2.67 80

I've never heard of most of these, so this list seems reasonable. Plus, the experts scored nearly all of them at or below 83, the 25th percentile of expert scores (the lone exception is New Liberty Bloody Butcher Bourbon at 87), so I don't suspect anything suspicious is going on (e.g., customers ironically giving a controversial, highly regarded whiskey a horribly low rating, as if to troll the ratings).
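
A quick way to look for that kind of trolling would be to put both scores on a common scale and list the whiskeys where experts and customers disagree most. This is a sketch on toy data; the zscore helper and the 'Gap' column are hypothetical names, and standardizing with z-scores is just one reasonable choice.

```python
import pandas as pd

# toy stand-in for df2
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Customers' Rating": [1.00, 2.57, 4.62, 3.50],
    "Expert Score": [80, 87, 97, 70],
})

def zscore(s):
    """Standardize a Series to mean 0 and (sample) standard deviation 1."""
    return (s - s.mean()) / s.std()

# positive gap: experts liked it more than customers did
toy["Gap"] = zscore(toy["Expert Score"]) - zscore(toy["Customers' Rating"])

disagreements = toy.sort_values(by="Gap", ascending=False)
print(disagreements[["Name", "Gap"]])
```

Rows at the top of this ordering are the ones experts favored far more than customers, and rows at the bottom are the reverse.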

Which are the best according to experts? Intent is to see if there are any errors or something worth noting.

In [43]:
df2.sort_values(by=['Expert Score'], ascending=False)[0:20][['Name', '# Ratings', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
Out[43]:
Name # Ratings ABV % Price Customers' Rating Expert Score
12 Booker's Bourbon Batch 2015-01 "Big Man, Small... 698 64.35 3 4.38 98
1635 Four Roses Limited Edition Single Barrel Bourb... 158 59.80 5 4.23 98
1575 George T. Stagg Bourbon (Fall 2019) 395 58.45 4 4.59 98
593 William Larue Weller Bourbon (Fall 2015) 232 67.30 3 4.70 98
916 Four Roses Limited Edition Single Barrel Bourb... 503 54.20 4 4.25 98
1917 Parker's Heritage Promise of Hope 224 48.00 4 4.45 98
303 George T. Stagg Bourbon (Fall 2013) 422 64.10 5 4.62 97
1671 Wild Turkey Diamond Anniversary Bourbon 278 45.50 4 4.33 97
1991 Black Maple Hill 16 Year Bourbon 69 47.50 4 4.62 97
1833 Hirsch Selection 28 Year Bourbon 11 43.40 5 3.89 97
1932 William Larue Weller Bourbon (Fall 2013) 139 68.10 4 4.57 96
1537 Four Roses Limited Edition Small Batch Bourbon... 207 54.30 5 4.42 96
489 George T. Stagg Bourbon (Fall 2020) 9 65.20 4 4.47 96
1051 Woodford Reserve Master's Collection Four Wood... 113 47.20 4 4.30 96
1004 Four Roses Limited Edition Small Batch Bourbon... 176 55.60 4 4.35 96
231 Eagle Rare 17 Year Bourbon (Fall 2014) 177 45.00 3 4.31 96
1573 Old Forester Birthday Bourbon 2018 225 50.50 4 4.42 96
514 Parker's Heritage Cognac Barrel Finish 10 Year... 11 50.00 3 4.53 96
171 Evan Williams Single Barrel 3983 43.30 2 3.87 96
2078 William Larue Weller Bourbon (Fall 2016) 134 67.70 4 4.71 96

This seems right. I've never heard of Hirsch or Parker's, but the others are specialty versions of famous/popular whiskeys, so this makes sense.

Which are the most expensive? Intent is to see if there are any errors or something worth noting.

In [44]:
df2.sort_values(by=['Price'], ascending=False)[0:15][['Name', '# Ratings', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
Out[44]:
Name # Ratings ABV % Price Customers' Rating Expert Score
592 Remus Volstead Reserve 14 Year Bottled in Bond... 31 50.00 5 4.25 89
1649 Angel's Envy Bourbon Finished in Mizunara Oak 3 48.90 5 3.58 91
702 Heaven Hill 27 Year Barrel Proof Kentucky Stra... 28 47.35 5 4.26 93
140 Blade And Bow Bourbon 22 Year (2015 Release) 56 46.00 5 4.29 87
303 George T. Stagg Bourbon (Fall 2013) 422 64.10 5 4.62 97
1779 Russell's Reserve 2002 30 57.30 5 4.43 95
296 Old Forester President's Choice Bourbon 3 59.30 5 4.67 90
2108 King of Kentucky 14 Year Kentucky Straight Bou... 23 67.50 5 4.61 94
2020 Evan Williams 23 Year Bourbon 35 53.50 5 4.37 86
1511 Michter's 20 Year Single Barrel Bourbon 110 57.10 5 4.49 94
982 Angel's Envy Bourbon Finished in Port Wine Bar... 105 61.20 5 4.09 93
1383 Elijah Craig 18 Year 689 45.00 5 4.31 92
1833 Hirsch Selection 28 Year Bourbon 11 43.40 5 3.89 97
590 Jim Beam Distiller's Masterpiece 96 50.00 5 4.24 86
264 Angel's Envy Bourbon Finished in Port Wine Bar... 4 60.20 5 4.00 94

We don't have high granularity (prices are just 1-5), which is perhaps a blessing in disguise: most bourbons are $30-$50, but some rare ones, especially due to price gouging, can be $100-$3,000. That's a wild range and is largely due to rarity, allure, and sensationalism within human behavior, as opposed to actual qualities of the bourbon. So, maybe it's good that we don't have to deal with outlier whiskeys that have extraordinary prices.

Where do they come from? Intent is to see if there are any errors or something worth noting.

In [45]:
df2['Location'].value_counts()
Out[45]:
Booker's // Kentucky, USA                                                                 22
Four Roses // Kentucky, USA                                                               18
Buffalo Trace // Kentucky, USA                                                            16
Old Forester // Kentucky, USA                                                             16
Jim Beam // Kentucky, USA                                                                 15
Heaven Hill // Kentucky, USA                                                              14
Elijah Craig // Kentucky, USA                                                             13
Woodford Reserve // Kentucky, USA                                                         12
Wild Turkey // Kentucky, USA                                                              10
Barton 1792 // Kentucky, USA                                                               7
Knob Creek // Kentucky, USA                                                                7
Michter's // Kentucky, USA                                                                 7
Redemption // Indiana, USA                                                                 6
Barrell Craft Spirits // (bottled in) Kentucky, USA                                        6
Colonel E.H. Taylor, Jr. // Kentucky, USA                                                  6
A. Smith Bowman // Virginia, USA                                                           6
Angel's Envy // Kentucky, USA                                                              6
Yellowstone  // Kentucky, USA                                                              5
Evan Williams // Kentucky, USA                                                             5
Wyoming Whiskey // Wyoming, USA                                                            5
Barrell Craft Spirits // USA                                                               5
Parker's Heritage Collection // Kentucky, USA                                              5
George Remus // Indiana, USA                                                               5
William Larue Weller // Kentucky, USA                                                      5
Maker's Mark // Kentucky, USA                                                              5
Eagle Rare // Kentucky, USA                                                                5
Belle Meade // Indiana (bottled in Tennessee), USA                                         5
George T. Stagg // Kentucky, USA                                                           5
Weller // Kentucky, USA                                                                    4
Blood Oath // Kentucky, USA                                                                4
Rebel Yell // Kentucky, USA                                                                4
Laws Whiskey House // Colorado, USA                                                        4
Garrison Brothers // Texas, USA                                                            4
Rhetoric // Kentucky, USA                                                                  4
Larceny // Kentucky, USA                                                                   4
Jefferson's // Kentucky, USA                                                               3
Barrell Craft Spirits // Tennessee, USA                                                    3
Kentucky Bourbon Distillers, Ltd. // Kentucky, USA                                         3
Bulleit // Kentucky, USA                                                                   3
St. Augustine // Florida, USA                                                              3
Orphan Barrel Whisky Co. // Kentucky, USA                                                  3
Buffalo Trace Antique Collection // Kentucky, USA                                          3
Daviess County // Kentucky, USA                                                            3
Bardstown Bourbon Company // Kentucky, USA                                                 3
Blaum Bros. // Indiana (further aged & bottled in Illinois), USA                           3
Pappy Van Winkle // Kentucky, USA                                                          3
Old Grand-Dad // Kentucky, USA                                                             3
New Riff // Kentucky, USA                                                                  3
Woodinville // Washington, USA                                                             3
Sam Houston // Kentucky, USA                                                               2
Abraham Bowman // Virginia, USA                                                            2
Baker's // Kentucky, USA                                                                   2
Feisty Spirits // Colorado, USA                                                            2
Coopers' Craft // Kentucky, USA                                                            2
Virgil Kaine // (bottled in) South Carolina, USA                                           2
HIRSCH // Kentucky, USA                                                                    2
Black Maple Hill // Kentucky, USA                                                          2
Sonoma County Distilling // California, USA                                                2
Old Rip Van Winkle // Kentucky, USA                                                        2
Belle Meade // (bottled in) Tennessee, USA                                                 2
Rock Town // Arkansas, USA                                                                 2
Old Fitzgerald // Kentucky, USA                                                            2
Black Button Distilling // New York, USA                                                   2
King of Kentucky // Kentucky, USA                                                          2
291  // Colorado, USA                                                                      2
Temperance Trader // Indiana (bottled in Oregon), USA                                      2
Backbone Bourbon // Indiana (bottled in Kentucky), USA                                     2
Ezra Brooks // Kentucky, USA                                                               2
Burnside // Oregon, USA                                                                    2
Smooth Ambler // USA                                                                       2
Penelope // Indiana, USA                                                                   2
Blanton's // Kentucky, USA                                                                 2
Jefferson's // USA                                                                         2
HIRSCH // Indiana, USA                                                                     2
Hudson Whiskey // New York, USA                                                            2
Rabbit Hole // Kentucky, USA                                                               2
Russell's Reserve // Kentucky, USA                                                         2
Chattanooga Whiskey Co. // Tennessee, USA                                                  2
FEW // Illinois, USA                                                                       2
Berkshire Mountain // Massachusettes, USA                                                  2
Bardstown Bourbon Company // Tennessee, USA                                                2
Ranger Creek // Texas, USA                                                                 2
Basil Hayden's // Kentucky, USA                                                            2
Kentucky, USA                                                                              2
Balcones // Texas, USA                                                                     2
Batch 206 // Washington, USA                                                               2
I.W. Harper // Kentucky, USA                                                               2
Ancient Age // Kentucky, USA                                                               1
Downslope Distilling // Colorado, USA                                                      1
Tatoosh // USA                                                                             1
Elk Rider // Washington , USA                                                              1
Early Times // Kentucky, USA                                                               1
Old Ezra // Kentucky, USA                                                                  1
OYO // Ohio, USA                                                                           1
PM Spirits // Indiana, USA                                                                 1
Wiggly Bridge // Maine, USA                                                                1
Pinhook  // USA                                                                            1
Dark Horse Distillery // Kansas, USA                                                       1
Eight & Sand // Indiana, USA                                                               1
J. Henry & Sons // Wisconsin, USA                                                          1
10th Mountain  // Vail, Colorado, USA                                                      1
Coppersea // New York, USA                                                                 1
Grand Traverse Distillery // Michigan, USA                                                 1
Redemption // Indiana , USA                                                                1
New Holland // Indiana (bottled in Michigan), USA                                          1
Belle Meade // Tennessee, USA                                                              1
Milam & Greene // USA                                                                      1
New Riff // Indiana (bottled in Kentucky), USA                                             1
Hillrock Estate // USA                                                                     1
Western Spirits Beverage Company // USA                                                    1
Western Spirits Company // Kentucky, USA                                                   1
Breckenridge // USA                                                                        1
Van Brunt Stillhouse // New York, USA                                                      1
Ghost Hill // Texas, USA                                                                   1
District Made // Washington D.C., USA                                                      1
Winchester // USA                                                                          1
The Family Jones // Colorado, USA                                                          1
Prohibition Distillery // New York, USA                                                    1
Elmer T. Lee // Kentucky, USA                                                              1
Stagg Jr. // Kentucky , USA                                                                1
Medley Bros. // Kentucky, USA                                                              1
Dark Corner Distillery // South Carolina, USA                                              1
Valley Shine Distillery // USA                                                             1
Duke // Kentucky, USA                                                                      1
W.H. Harrison // Indiana, USA                                                              1
Metze's // Indiana, USA                                                                    1
Noah's Mill // Kentucky, USA                                                               1
SILO // Vermont, USA                                                                       1
Illinois, USA                                                                              1
Chicken Cock // Indiana (aged & bottled in Kentucky), USA                                  1
Deadwood // Indiana, USA                                                                   1
Central Standard // Wisconsin, USA                                                         1
33 // USA                                                                                  1
Black Dirt // New York, USA                                                                1
Koval // Illinois, USA                                                                     1
Pennsylvania, USA                                                                          1
Cyrus Noble // Kentucky, USA                                                               1
Rhetoric // Kentucky (bottled in Tennessee), USA                                           1
Gristmill Distillers // New York, USA                                                      1
George Remus // Indiana , USA                                                              1
Cascade Alchemy // South Carolina (bottled in Oregon), USA                                 1
PennyPacker // Kentucky, USA                                                               1
Abraham Bowman // Virginia , USA                                                           1
Booker's // Kentucky  , USA                                                                1
Wathen's // Kentucky, USA                                                                  1
Corner Creek // Kentucky, USA                                                              1
Blaum Bros. // Illinois, USA                                                               1
Dancing Pines // Colorado, USA                                                             1
Old Soul // USA                                                                            1
Penelope // USA                                                                            1
Sazerac // USA                                                                             1
Two James Spirits // (bottled in) Michigan, USA                                            1
Orange County Distillery // New York, USA                                                  1
Spirit Works // California, USA                                                            1
Old Crow // USA                                                                            1
Kooper Family // Indiana (aged in Texas), USA                                              1
David Nicholson // Kentucky (bottled in Missouri), USA                                     1
Pinhook  // Kentucky, USA                                                                  1
Peach Street // Colorado, USA                                                              1
Short Mountain // Tennessee, USA                                                           1
Heaven's Door // Tennessee, USA                                                            1
MB Roland Distillery // Kentucky, USA                                                      1
Port Chilkoot // Alaska, USA                                                               1
Kentucky Owl // Kentucky, USA                                                              1
Steward's Whiskies // Kentucky, Tennessee and Indiana, USA                                 1
Stagg Jr. // Kentucky, USA                                                                 1
Michter's // USA                                                                           1
Yellow Rose // Texas, USA                                                                  1
Cedar Ridge // Iowa, USA                                                                   1
The Walking Dead // Kentucky, USA                                                          1
Long Island Spirits // New York, USA                                                       1
Johnny Drum // Kentucky, USA                                                               1
Jos. A. Magnus & Co. // Indiana (Finished and Bottled in Washington DC), USA               1
Sonoma County Distilling Co. // California, USA                                            1
Graveyard Sam's // Pennsylvania, USA                                                       1
Union Horse // Kansas, USA                                                                 1
HIRSCH // Indiana (bottled in Ohio), USA                                                   1
Finger Lakes Distilling // New York, USA                                                   1
Six & Twenty // South Carolina, USA                                                        1
Old Bardstown Distilling Company // Kentucky, USA                                          1
Colorado Gold // Colorado, USA                                                             1
Prichard's // Tennessee, USA                                                               1
Cooperstown Distillery // New York, USA                                                    1
Tom's Town // Missouri, USA                                                                1
Barrell Craft Spirits // Tennessee , USA                                                   1
Old Elk // (bottled in) Colorado, USA                                                      1
A. Smith Bowman // USA                                                                     1
Headframe Spirits // Montana, USA                                                          1
Treaty Oak Distilling // USA                                                               1
Rough Rider // Long Island , USA                                                           1
TahWahKaro // Texas, USA                                                                   1
Redemption Whiskey // Indiana, USA                                                         1
Fighting Cock // Kentucky, USA                                                             1
Widow Jane // USA                                                                          1
Henry McKenna // Kentucky, USA                                                             1
Dry Fly // Washington, USA                                                                 1
Rod & Rifle // Tennessee, USA                                                              1
Heaven's Door // USA                                                                       1
Jos. A. Magnus & Co. // (blended & bottled in Washington D.C.), USA                        1
Fistful of Bourbon // USA                                                                  1
New Liberty // Pennsylvania, USA                                                           1
Chicken Cock // Kentucky , USA                                                             1
Bond & Lillard // Kentucky, USA                                                            1
Oola // Washington, USA                                                                    1
Indiana (bottled in Pennsylvania), USA                                                     1
Koenig Distillery and Winery // (bottled in) Idaho, USA                                    1
Journeyman Distillery // Michigan, USA                                                     1
Copper Fiddle // Illinois, USA                                                             1
Three Chord // USA                                                                         1
Bulleit // Kentucky , USA                                                                  1
Old Bardstown // Kentucky, USA                                                             1
Deadwood // Indiana , USA                                                                  1
Filibuster // USA                                                                          1
J.W. Overbey & Co. // New York, USA                                                        1
Treaty Oak // USA                                                                          1
Indiana (bottled in California), USA                                                       1
The Clover // Indiana , USA                                                                1
Detroit City // USA                                                                        1
Old Charter // Kentucky, USA                                                               1
Smooth Ambler // West Virginia, USA                                                        1
Wigle // Pennsylvania, USA                                                                 1
Still & Oak // Wisconsin, USA                                                              1
Booker's // Kentucky , USA                                                                 1
Yellow Rose // USA                                                                         1
Breaking & Entering // Kentucky , USA                                                      1
Legent // Kentucky, USA                                                                    1
Temperance Trader // (bottled in) Oregon, USA                                              1
Deerhammer // Colorado, USA                                                                1
Stetson // Kentucky, USA                                                                   1
Ransom Spirits // Oregon, USA                                                              1
SOA Spirits // Indiana, USA                                                                1
James E. Pepper // Indiana (bottled in Kentucky), USA                                      1
Knob Creek // Kentucky , USA                                                               1
Taconic Distillery // New York, USA                                                        1
Catskill Distilling Co. // New York, USA                                                   1
Delaware Phoenix // New York, USA                                                          1
Bird Dog // Kentucky, USA                                                                  1
Big House // Indiana (bottled in Kentucky), USA                                            1
Hood River Distilling // Kentucky (bottled in Oregon), USA                                 1
Kinsey // (bottled in) Pennsylvania, USA                                                   1
Amador Whiskey Co. // Kentucky (Finished and Bottled in California), USA                   1
Five & 20 Spirits // New York, USA                                                         1
Sweetens Cove // Tennessee , USA                                                           1
Two James // Michigan, USA                                                                 1
Cabin Still // Kentucky, USA                                                               1
Old Ripy // Kentucky, USA                                                                  1
McAfee's Benchmark // Kentucky, USA                                                        1
Blade And Bow // Kentucky, USA                                                             1
Frey Ranch // Nevada, USA                                                                  1
J. Henry & Sons // Wisconsin , USA                                                         1
Charred Oak Spirits // Wisconsin, USA                                                      1
Eagle Rare // Kentucky , USA                                                               1
2bar Spirits // Washington, USA                                                            1
35 Maple Street // USA                                                                     1
Orphan Barrel Whisky Co. // Kentucky (bottled in Tennessee), USA                           1
Colter's Run // Idaho, USA                                                                 1
TX Whiskey // Texas, USA                                                                   1
Adirondack Distilling Company // New York, USA                                             1
Old Tub // Kentucky, USA                                                                   1
Barrell Craft Spirits // Kentucky, USA                                                     1
Missouri Spirits // Missouri, USA                                                          1
Very Old Barton // Kentucky, USA                                                           1
Early Times // Kentucky , USA                                                              1
Watershed // Ohio, USA                                                                     1
Kiepersol Estates // Texas, USA                                                            1
Defiance // USA                                                                            1
Still Austin // Texas, USA                                                                 1
Cody Road // Iowa, USA                                                                     1
O4D // Indiana (aged in Georgia), USA                                                      1
Old Hickory Great American // Indiana (bottled in Ohio), USA                               1
Parker's Heritage Collection // Kentucky , USA                                             1
Warbringer // (bottled in) California , USA                                                1
Tom's Town // Tennessee (bottled in Missouri), USA                                         1
Kinsey // (bottled in) Pennsylvania , USA                                                  1
Diageo // Kentucky, USA                                                                    1
The Splinter Group // Tennessee and Kentucky (Finished and Bottled in California), USA     1
Lux Row Distillers // Kentucky, USA                                                        1
Clyde May's // Alabama, USA                                                                1
Cleveland Whiskey // Ohio, USA                                                             1
Taconic Distillery // USA                                                                  1
Kings County // New York, USA                                                              1
Syntax Spirits // Colorado, USA                                                            1
Oregon Spirit Distillers // Oregon, USA                                                    1
Jim Beam // Kentucky  , USA                                                                1
Name: Location, dtype: int64

Some distilleries produce different brands of whiskey. Most come from Kentucky. You can see that some distilleries produce tons of different types, but this can be a bit misleading because some of those different types are just slight variations (e.g., Eagle Rare 10, Eagle Rare 17), whereas others are completely different brands (e.g., Buffalo Trace, Blanton's). For now, it's probably best to just ignore the location feature, but we'll keep it in mind for modelling, if we get desperate. One idea would be to create 2 fields from this: 1 for the geographic state (e.g., Kentucky), and another for the distillery (e.g., Booker's).
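The two-field idea above could be sketched roughly as follows. This is only an illustration, not something the notebook actually runs: `split_location` is a hypothetical helper, and the sample strings are copied from the `value_counts()` output above rather than read from `df2`. It assumes every entry ends in "USA" and that the distillery, when present, comes before the `//` separator. Bottling notes such as "(bottled in Tennessee)" are left inside the state field.

```python
import re
import pandas as pd

# A few Location strings copied from the value_counts() output above.
# In the notebook, df2['Location'] would take the place of this Series.
locations = pd.Series([
    "Booker's // Kentucky, USA",
    "Booker's // Kentucky  , USA",
    "Belle Meade // Indiana (bottled in Tennessee), USA",
    "Smooth Ambler // USA",
    "Kentucky, USA",
], name="Location")

def split_location(location):
    """Hypothetical helper: split 'Distillery // State, USA' into (distillery, state).

    A missing part is returned as None.
    """
    if "//" in location:
        # The distillery precedes the separator
        distillery, place = location.split("//", 1)
        distillery = distillery.strip() or None
    else:
        # Some rows list only the place
        distillery, place = None, location
    # Collapse repeated whitespace, then drop the trailing country
    place = re.sub(r"\s+", " ", place).strip()
    place = re.sub(r"\s*,?\s*USA$", "", place).strip()
    return distillery, (place or None)

parts = pd.DataFrame(
    [split_location(loc) for loc in locations],
    columns=["Distillery", "State"],
    index=locations.index,
)
print(parts)
```

Stripping whitespace also merges near-duplicates such as "Kentucky, USA" and "Kentucky  , USA", which the raw counts above treat as separate values.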

Let's look at the distribution of flavor values. Intent is to see if there are any errors or something worth noting.

In [46]:
fig, axs = plt.subplots(nrows=5, ncols=3, figsize=(20, 20), facecolor='w', edgecolor='k')
fig.subplots_adjust(hspace = .5, wspace=.2)
axs = axs.ravel()
fontsize = 10

flavors = ['Smoky', 'Peaty', 'Spicy', 'Herbal', 'Oily', 'Full-bodied', 'Rich',\
        'Sweet', 'Briny', 'Salty', 'Vanilla', 'Tart', 'Fruity', 'Floral']

# plot histograms
for i, flavor in enumerate(flavors):
    axs[i].hist(df2[flavor], alpha=0.7, color='lightblue', bins='auto', density=False, histtype = 'bar', edgecolor='k')
    axs[i].set_title("Distribution of " + flavor + " Flavor", fontsize=fontsize)
    axs[i].set_xlabel(flavor + " Flavor", fontsize=fontsize)
    axs[i].set_ylabel('Count', fontsize=fontsize)
    
# removes the empty one, since we only have 14 flavors, not 15
axs[14].set_axis_off()

These all seem pretty reasonable, and I'm glad that the values have a good spread. A few flavors are a bit skewed, and these are the ones that we inspected above.

Let's look for any patterns/correlations that may exist between our features. Since some of the above flavors are heavily skewed (e.g., Salty is usually 0), we would not be able to discern any meaningful trend from them, so we can leave them out of our visualization. Otherwise, those panels would just be a bunch of points overlapping one another at the 0 value.
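One way to make "usually 0" concrete is to measure the share of zero values per flavor. The snippet below is a minimal sketch on made-up numbers: the `toy` frame stands in for the flavor columns of `df2`, and the 0.5 cutoff is an arbitrary assumption, not a rule used in this notebook.

```python
import pandas as pd

# Hypothetical toy scores standing in for df2's flavor columns
toy = pd.DataFrame({
    "Salty": [0, 0, 0, 0, 10],
    "Sweet": [40, 55, 60, 35, 70],
    "Smoky": [20, 0, 30, 25, 15],
})

# Fraction of rows where each flavor is exactly 0
zero_share = (toy == 0).mean().sort_values(ascending=False)
print(zero_share)

# Assumed cutoff: flag a flavor when more than half of its values are 0
mostly_zero = zero_share[zero_share > 0.5].index.tolist()
print(mostly_zero)
```

On the real data, the same two lines applied to `df2[flavors]` would list the candidates to leave out of the scatter matrix.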

In [47]:
grid_features = ['Smoky', 'Spicy', 'Herbal', 'Oily', 'Full-bodied', 'Rich',\
        'Sweet', 'Vanilla', 'Fruity', 'Floral', \
        'Age', 'Price', 'Customers\' Rating', 'Expert Score']

scatter = pd.plotting.scatter_matrix(df2[grid_features], alpha=0.4, figsize=(20,20));
for ax in scatter.ravel():
    ax.set_xlabel(ax.get_xlabel(), rotation = 90)
    ax.set_ylabel(ax.get_ylabel(), rotation = 90)

We see that:

  • Customers' Rating is highly correlated with Expert Score
  • The higher the Price, the more likely it is to have a high score from both customers and experts
  • The higher the Richness, the more Full-bodied and Sweet it tends to be (strong correlations)
  • The higher the Oiliness, the more likely it is to be Full-bodied
  • No individual flavor seems correlated with the scores from customers or experts. The closest trends come from Full-bodied and Rich, which seem weakly positively correlated with the scores.

This indicates that predicting the score is not trivially easy: Full-bodied and Rich may play some role, but if the flavors carry any signal, it will come from a combination of flavors rather than any one particular flavor.
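The impressions read off the scatter matrix could also be checked numerically with a correlation matrix. The following is a small sketch on made-up numbers, not results from the bourbon data: the `toy` frame is hypothetical and only borrows a few column names from `grid_features`.

```python
import pandas as pd

# Hypothetical toy values; the real input would be df2[grid_features]
toy = pd.DataFrame({
    "Rich": [10, 20, 30, 40, 50],
    "Full-bodied": [12, 18, 33, 41, 48],
    "Floral": [30, 10, 40, 20, 25],
    "Expert Score": [80, 85, 84, 90, 92],
})

# Pairwise Pearson correlations (the pandas default)
corr = toy.corr()

# Correlation of every other feature with the expert score, strongest first
score_corr = corr["Expert Score"].drop("Expert Score").sort_values(ascending=False)
print(score_corr)
```

Because Price is an ordinal 1-5 value, `toy.corr(method="spearman")` might be the safer choice for rank-like columns.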