Key Word(s): Case Study
## RUN THIS CELL TO GET THE RIGHT FORMATTING
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2020-CS109A/master/themes/static/css/cs109.css").text
HTML(styles)
# import the necessary libraries
import re
import requests
import random
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import numpy as np
from time import sleep
from bs4 import BeautifulSoup
# global properties
data_dir = "data/" # where to save data
num_search_pages = 50 # how many search result pages to comb through
# NOTE:
# if you haven't yet downloaded the data, this should be set to True
download_data = False
Disclaimer
Alcohol is a drug. There are state and federal laws that govern its sale, distribution, and consumption. In the United States, those who consume alcohol must be at least 21 years of age. In no way am I, or anyone else at IACS or Harvard at large, promoting or encouraging the use of alcohol, and my intention is not to celebrate it. Anyone who chooses to consume alcohol should be of legal age and should do so responsibly. Abusing alcohol has serious, grave effects.
The point of this exercise is purely pedagogical: it illustrates the wide range of tasks to which one can apply data science and machine learning. That is, I am focusing on a particular interest and demonstrating how data science can be used to answer questions one might care about in one's own personal life. You could easily imagine the same process being used in professional settings, too.
Learning Objectives
To see the big-picture process of conducting a project, and to illustrate some of the nuanced details and common pitfalls.
1. Problem Overview
Whiskey is a type of alcohol, and there are many different types of whiskey, including bourbon, which will be the focus of this project.
I am interested in determining:
Are there certain attributes of bourbons that are predictive of good (i.e., highly rated by users) bourbons?
- Find hidden gems (i.e., bourbons that should be good but whose current reviews are absent or unsupportive)
- Find over-hyped whiskeys (i.e., the reviews seem high but the attributes aren't indicative of quality)
- Are the results significantly different if we target experts' ratings instead of average customer ratings?
Are there certain attributes of bourbons that are predictive of expensive bourbons?
- Find under-priced whiskeys
- Find over-priced whiskeys
Which bourbons are most similar to each other?
- Which attributes are important for determining similarity? (e.g., does price play a role?)
2. Obtaining Data
We need a website that has a bunch of whiskey data. Distiller.com seems to be the most authoritative and comprehensive site.
Using distiller.com as our source, I don't see a way to display a list of all of their bourbons. However, if you search for the keyword bourbon, over 2,000 search results appear, each with a link to the particular whiskey. After manual inspection, most of these are in fact bourbons, but a few are not and merely have some association with bourbon (e.g., non-bourbon whiskeys that were casked in old bourbon barrels).
Let's crawl the search results pages to create a set of all candidate bourbons (i.e., whiskey_urls)! Then, using this set, let's download each page. Note that we use a set() instead of a list, in case the same link appears in the results more than once.
whiskey_urls = set()
if download_data:
    # every search results page begins with this prefix
    base_url = 'https://distiller.com/search?term=bourbon'
    # visits each search results page
    for page_num in range(1, num_search_pages + 1):
        cur_page = requests.get(base_url + '&page=' + str(page_num))
        # uses BeautifulSoup to extract all links to whiskeys
        bs_page = BeautifulSoup(cur_page.content, "html.parser")
        for link in bs_page.findAll('a', attrs={'href': re.compile("^/spirits/")}):
            whiskey_urls.add(link.get('href'))
        sleep(1)
    # saves each URL to disk, so that we don't have to crawl the search results again
    with open("whiskey_urls.txt", "w") as f:
        for url in whiskey_urls:
            f.write(url + "\n")
    # fetches each whiskey page and saves it to the hard drive
    for url in whiskey_urls:
        cur_page = requests.get('https://distiller.com' + url).content
        # writes the raw HTML to disk; url[9:] strips the leading "/spirits/"
        with open(data_dir + url[9:], 'wb') as f:
            f.write(cur_page)
        # sleeps between 1-3 seconds, in case the site tries to detect crawling
        sleep(random.randint(1, 3))
else:
    # if the files have already been saved to disk,
    # then we can just load the URL list here instead of crawling again
    with open('whiskey_urls.txt') as f:
        whiskey_urls = set(line.strip() for line in f)
We now have the set of all whiskey URLs in whiskey_urls, along with the actual pages downloaded to our hard drive. We downloaded them to the hard drive for convenience, so that in the future we don't have to spend the ~2 hours crawling all of the pages again.
Let's now load each of these pages!
whiskeys = {}
# loads each downloaded whiskey webpage
for i, url in enumerate(whiskey_urls):
    filename = data_dir + url[9:]  # url[9:] strips the leading "/spirits/"
    with open(filename, 'r') as f:
        file_contents = f.read()
    # instantiates a new BeautifulSoup object
    soup = BeautifulSoup(file_contents, "html.parser")
    # extracts details about the whiskey
    name = soup.find('h1', attrs={'class': re.compile("secondary-headline name")}).text.strip()
    location = soup.find('h2', attrs={'class': "ultra-mini-headline location middleweight"}).text.strip()
    badge = ""
    if soup.find('div', attrs={'class': "spirit-badge"}) is not None:
        badge = soup.find('div', attrs={'class': "spirit-badge"}).text.strip()
    num_ratings = 0
    rating = "N/A"
    if soup.find('span', attrs={'itemprop': "ratingCount"}) is not None:
        num_ratings = int(soup.find('span', attrs={'itemprop': "ratingCount"}).text.strip())
        rating = float(soup.find('span', attrs={'itemprop': "ratingValue"}).text.strip())
    age = soup.find('li', attrs={'class': "detail age"}).find('div', attrs='value').text.strip()
    price = int(re.findall(r"cost-(\d)", str(soup.find('div', attrs={'class': re.compile("spirit-cost")})))[0])
    abv = ""
    if soup.find('li', attrs={'class': "detail abv"}).find('div', attrs='value').text != "":
        abv = float(soup.find('li', attrs={'class': "detail abv"}).find('div', attrs='value').text)
    whiskey_type = soup.find('li', attrs={'class': "detail whiskey-style"}).div.text
    cask_type = ""
    if soup.find('li', attrs={'class': "detail cask-type"}) is not None:
        cask_type = soup.find('li', attrs={'class': "detail cask-type"}).find('div', attrs='value').text.strip()
    review = ""
    expert = ""
    score = ""
    flavor_summary = ""
    flavor_profile = []
    # checks whether an expert reviewed it
    if soup.find('p', attrs={'itemprop': "reviewBody"}) is not None:
        review = soup.find('p', attrs={'itemprop': "reviewBody"}).text.replace("\"", "").strip()
        expert = soup.find('div', attrs={'class': 'meet-experts'}).a.text.strip()
        score = int(soup.find('div', attrs={'class': "distiller-score"}).span.text.strip())
        flavor_summary = soup.find('h3', attrs={'class': "secondary-headline flavors middleweight"}).text.strip()
        # extracts the flavor profile; the data-flavors attribute holds a dict literal
        # (ast.literal_eval would be a safer choice than eval here)
        flavor_profile = eval(soup.find('canvas').attrs['data-flavors'])
    cur_whiskey = [name, whiskey_type, cask_type, location, age, abv, price, badge, num_ratings, \
                   rating, flavor_summary, expert, score]
    if flavor_profile:
        cur_whiskey.extend(list(flavor_profile.values()))
    else:
        # no expert review means no flavor data; pad with zeros for the 14 flavors
        cur_whiskey.extend(np.zeros(14))
    cur_whiskey.append(review)
    whiskeys[i] = cur_whiskey
df = pd.DataFrame.from_dict(whiskeys, orient='index', \
columns=['Name', 'Type', 'Cask', 'Location', 'Age', 'ABV %', 'Price', 'Badge',\
'# Ratings', "Customers' Rating", 'Flavor Summary', 'Expert', 'Expert Score',\
'Smoky', 'Peaty', 'Spicy', 'Herbal', 'Oily', 'Full-bodied', 'Rich',\
'Sweet', 'Briny', 'Salty', 'Vanilla', 'Tart', 'Fruity', 'Floral', 'Review'])
3. Data Sanity Check / Cleaning
What do our features look like? Are any features wonky, inconsistent, or useless? Are any values missing?
pd.set_option('display.max_columns', None)
df2 = df.loc[(df['Expert'] != "")]
print(len(df2))
df2['Type'].value_counts()
pd.set_option('display.max_rows', None)
df2 = df2.loc[df2['Type'] == "Bourbon"]
df2.dtypes
df2.loc[df2['Customers\' Rating'] == "N/A"]
# there still exists 1 whiskey that has no Customer Rating, so let's remove it
df2 = df2.loc[df2['Customers\' Rating'] != "N/A"]
df2 = df2.astype({'Customers\' Rating' : 'float64'})
# we can keep the 'Age' feature for now but be mindful
# that it's missing for nearly half of the whiskeys
len(df2.loc[(df2['Age'] == 'NAS') | (df2['Age'] == 'nas') | (df2['Age'] == '')])
# let's replace all missing values with a reasonable value.
# for now, let's use 0 as a placeholder so that we can later swap it out.
df2['Age'] = df2['Age'].replace(['NAS', 'nas', 'N/A',''],'0')
# removes the 'Year(s)' text and anything following it
df2['Age'] = df2['Age'].replace(to_replace=r' [yY]ear.*', value='', regex=True)
# manually cleans up values that would otherwise be nearly impossible to clean automatically
df2['Age'] = df2['Age'].replace(to_replace=r'6.*', value='6', regex=True)
df2['Age'] = df2['Age'].replace(to_replace=r'(\d+) [Yy].*', value=r'\1', regex=True)
# converts month-based age statements to (approximate) years
df2['Age'] = df2['Age'].replace(to_replace=r'4 [Mm]onths', value='0.33', regex=True)
df2['Age'] = df2['Age'].replace(to_replace=r'9 [Mm]onths', value='0.75', regex=True)
df2['Age'] = df2['Age'].replace(to_replace=r'18 - 20 [Mm]onths', value='1.5', regex=True)
df2['Age'] = df2['Age'].replace(to_replace=r'32 [Mm]onths', value='2.67', regex=True)
df2['Age'] = df2['Age'].replace(to_replace=r'9 to 11', value='0.75', regex=True)
# let's look at all of the items that had an Age statement listed
# (now that all values have been cleaned up)
# note: Age is still a string here, so this is a lexicographic comparison with '0'
df2.loc[df2['Age'] > '0']['Age']
df2 = df2.astype({'Age': 'float64'})
# how many had values?
len(df2.loc[df2['Age'] > 0])
df2['Age'].describe()
df2.loc[df2['Age'] > 0].hist(column='Age', bins='auto')
I think it's fair to impute all of the missing values (i.e., the 0 placeholders) with 7. This is based on outside research, too (Googling and personal knowledge).
df2['Age'] = df2['Age'].replace(0,7)
df2['Age'].describe()
df2.hist(column='Age', bins='auto')
df2['Flavor Summary'].value_counts()
Ok, there's a long tail of values, and it seems the Flavor Summary is just the two most prominent flavors listed for each whiskey, although some list only one flavor. Perhaps this offers no additional information/signal beyond the raw values of the flavors. Still, it might be worth experimenting with turning this feature into 2 new features: primary flavor and secondary flavor. These would need to be one-hot encoded, though, and since there are 14 distinct flavors, that would create 28 new features (or 26, if we drop the first category of each encoding to avoid redundancy). Again, these features might be redundant and not help our models. A hedged sketch of the idea follows.
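If we do try that experiment, a minimal sketch might look like the following. It assumes the two flavors in Flavor Summary are joined by ' & ', which would need to be verified against the actual values.
# hypothetical split of 'Flavor Summary' into primary/secondary flavors;
# assumes the two flavors are joined by " & " -- verify against the real data
split_flavors = df2['Flavor Summary'].str.split(' & ', expand=True)
df2['Primary Flavor'] = split_flavors[0].str.strip()
# some whiskeys list only one flavor, so the second column may be absent or NaN
df2['Secondary Flavor'] = split_flavors[1].fillna('').str.strip() if split_flavors.shape[1] > 1 else ''
# one-hot encodes both new features (up to 28 columns; drop_first=True would give 26)
flavor_dummies = pd.get_dummies(df2[['Primary Flavor', 'Secondary Flavor']], prefix=['Primary', 'Secondary'])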
df2['Badge'].value_counts()
We see that all Badge values are either 'RARE' or just requests from users for an expert to review the whiskey. So, let's replace the Badge column with a boolean 'Rare' column.
df2['Rare'] = df2['Badge'] == 'RARE'  # boolean: True only when the badge is 'RARE'
del df2['Badge']
df2['Rare'].value_counts()
df2['Expert'].value_counts()
Let's cast our features to the correct data types and view summary statistics.
df2 = df2.astype({'Expert Score': 'int32', 'Customers\' Rating' : 'float64', 'ABV %': 'float64'})
df2.describe()
df2.dtypes
4. EDA
Now that our data is cleaned, let's explore it and try to understand any patterns. This understanding will impact our modelling choices. Based on the .describe() statistics above, let's first look at the most extreme values of the features that seem a bit lopsided in their distributions.
df2.sort_values(by=['Smoky'], ascending=False)[0:15][['Name', 'Smoky', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
df2.sort_values(by=['Peaty'], ascending=False)[0:10][['Name', 'Peaty', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
df2.sort_values(by=['Spicy'], ascending=True)[0:14][['Name', 'Spicy', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
df2.sort_values(by=['Herbal'], ascending=False)[0:20][['Name', 'Herbal', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
df2.sort_values(by=['Oily'], ascending=False)[0:15][['Name', 'Oily', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
df2.sort_values(by=['Full-bodied'], ascending=True)[0:20][['Name', 'Full-bodied', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
df2.sort_values(by=['Briny'], ascending=False)[0:20][['Name', 'Briny', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
df2.sort_values(by=['Salty'], ascending=False)[0:20][['Name', 'Salty', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
df2.sort_values(by=['Tart'], ascending=False)[0:20][['Name', 'Tart', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
df2.sort_values(by=['Fruity'], ascending=False)[0:20][['Name', 'Fruity', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
df2.sort_values(by=['Age'], ascending=False)[0:20][['Name', 'Age', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
df2.sort_values(by=['# Ratings'], ascending=False)[0:20][['Name', '# Ratings', 'Rare', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
Ah, interestingly, the single most popular one, Blanton's, is actually very hard to find these days. The Rare field wrongly states that it's not rare, but it is essentially impossible to find in most US states, and bottles are commonly marked up from the $40 MSRP to $200. I'm very surprised that this has the most reviews, but I suspect it's because it has the highest allure amongst the rare ones that are still somewhat possible to find. Years ago, it was very easy to find, so maybe some reviews date from that time.
Additionally, Weller Special Reserve is also impossible to get in most places in the US, for most times of the year, but it has TONS of allure and attention. People obsess over it. Weller Antique 107 is even rarer. The rest are very common in stores and bars, so the data makes sense for these. A sketch of a manual fix for the incorrect Rare flags follows.
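If we wanted to correct those known cases by hand, a hedged, purely illustrative patch is below; the name substrings are my own guesses and would need to be checked against the actual Name values.
# hypothetical manual patch for bottles known to be rare despite the scraped badge
known_rare = ["Blanton's", "Weller Special Reserve", "Weller Antique"]
for rare_name in known_rare:
    df2.loc[df2['Name'].str.contains(rare_name, regex=False), 'Rare'] = True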
df2.sort_values(by=["Customers\' Rating"], ascending=False)[0:20][['Name', '# Ratings', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
This seems correct to me, not because I've tasted any of these, but because these are famous and highly coveted. I've never heard of the best-rated one, Parker's, though. I'd be suspicious that it's an outlier and wrong, especially considering it only has 2 reviews from users; however, the expert also gave it a high score, so it seems like a valid entry. A quick robustness check is sketched below.
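As a quick robustness check (my own addition, not part of the original ranking), we can require a minimum number of customer ratings so that entries with only a couple of reviews can't top the list; the threshold of 10 is an arbitrary illustrative choice.
# re-ranks, considering only whiskeys with at least 10 customer ratings
df2.loc[df2['# Ratings'] >= 10].sort_values(by=["Customers' Rating"], ascending=False)[0:20][['Name', '# Ratings', 'ABV %', 'Price', "Customers' Rating", 'Expert Score']]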
df2.sort_values(by=["Customers\' Rating"], ascending=True)[0:20][['Name', '# Ratings', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
I've never heard of any of these, so this list seems reasonable. Plus, the experts gave them all horrible reviews, so I don't suspect anything suspicious is going on (e.g., customers ironically rating a controversial, highly-praised whiskey horribly low, as if to troll the ratings).
df2.sort_values(by=['Expert Score'], ascending=False)[0:20][['Name', '# Ratings', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
This seems right. I've never heard of Hirsch or Parker's, but the others are speciality versions of famous/popular whiskeys, so this makes sense.
df2.sort_values(by=['Price'], ascending=False)[0:15][['Name', '# Ratings', 'ABV %', 'Price', 'Customers\' Rating', 'Expert Score']]
We don't have high granularity (prices are just bins 1-5), which is perhaps a blessing in disguise -- most bourbons are $30-$50, but some rare ones, especially due to price gouging, can be $100-$3,000. That's a wild range, and it is largely due to rarity, allure, and sensationalism in human behavior, as opposed to actual qualities of the bourbon. So, maybe it's good that we don't have to deal with outlier whiskeys that have extraordinary prices.
df2['Location'].value_counts()
Some distilleries produce different brands of whiskey, and most come from Kentucky. You can see that some distilleries produce tons of different types, but this can be a bit misleading, because some of those different types are just slight variations (e.g., Eagle Rare 10, Eagle Rare 17), whereas others are completely different brands (e.g., Buffalo Trace, Blanton's). For now, it's probably best to just ignore the location feature, but we'll keep it in mind for modelling, if we get desperate. One idea would be to create 2 fields from this: one for the geographic state (e.g., Kentucky) and another for the distillery (e.g., Booker's); a sketch of the state extraction follows.
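As a hedged sketch of the first half of that idea: assuming the Location strings look like 'Frankfort, Kentucky, USA' (a format that would need to be confirmed), the state could be pulled out with a string split. The distillery name isn't in this field, so it would need its own scrape.
# hypothetical: grab the second-to-last comma-separated segment as the state
df2['State'] = df2['Location'].str.split(',').str[-2].str.strip()
df2['State'].value_counts()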
fig, axs = plt.subplots(nrows=5, ncols=3, figsize=(20, 20), facecolor='w', edgecolor='k')
fig.subplots_adjust(hspace = .5, wspace=.2)
axs = axs.ravel()
fontsize = 10
flavors = ['Smoky', 'Peaty', 'Spicy', 'Herbal', 'Oily', 'Full-bodied', 'Rich',\
'Sweet', 'Briny', 'Salty', 'Vanilla', 'Tart', 'Fruity', 'Floral']
# plot histograms
for i, flavor in enumerate(flavors):
axs[i].hist(df2[flavor], alpha=0.7, color='lightblue', bins='auto', density=False, histtype = 'bar', edgecolor='k')
axs[i].set_title("Distribution of " + flavor + " Flavor", fontsize=fontsize)
axs[i].set_xlabel(flavor + " Flavor", fontsize=fontsize)
axs[i].set_ylabel('Count', fontsize=fontsize)
# removes the empty one, since we only have 14 flavors, not 15
axs[14].set_axis_off()
These all seem pretty reasonable, and I'm glad that the values have a good spread. A few flavors are a bit skewed, and those are the ones that we inspected above.
For the flavors whose values are almost always 0 (e.g., Salty is usually 0), we would not be able to discern any meaningful trend, so we can throw them out of our visualization. Otherwise, our graph would just be a bunch of points overlapping one another at the 0 value.
grid_features = ['Smoky', 'Spicy', 'Herbal', 'Oily', 'Full-bodied', 'Rich',\
                 'Sweet', 'Vanilla', 'Fruity', 'Floral', \
                 'Age', 'Price', 'Customers\' Rating', 'Expert Score']
scatter = pd.plotting.scatter_matrix(df2[grid_features], alpha=0.4, figsize=(20,20));
for ax in scatter.ravel():
ax.set_xlabel(ax.get_xlabel(), rotation = 90)
ax.set_ylabel(ax.get_ylabel(), rotation = 90)
We see that:
- Customers' Rating is highly correlated with the expert's score
- The higher the Price, the more likely it is to have a high score from both customers and experts
- The higher the Rich value, the more Full-bodied and Sweet it tends to be (strong correlations)
- The higher the Oily value, the more likely it is to be Full-bodied
- No individual flavor seems correlated with the scores from customers or experts. The closest trend is from Full-bodied and Rich, as they seem slightly positively correlated with the scores.
This is an indication that predicting the score is not trivially easy; Full-bodied and Rich can play some role, but if the flavors give any indication, it'll be due to a combination of flavors instead of any one particular flavor. We can verify these impressions numerically below.
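As a quick numeric check of the trends we eyeballed above (my addition), the pairwise correlations show which features move with the two scores:
# correlations of every grid feature with the customers' and experts' scores
corr = df2[grid_features].corr()
print(corr["Customers' Rating"].sort_values(ascending=False))
print(corr['Expert Score'].sort_values(ascending=False))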