CS109A Introduction to Data Science

Standard Section 1: Introduction to Web Scraping

Harvard University
Fall 2019
Instructors: Pavlos Protopapas, Kevin Rader, and Chris Tanner
Section Leaders: Marios Mattheakis, Abhimanyu (Abhi) Vasishth, Robbert (Rob) Struyven


In [ ]:
## RUN THIS CELL TO GET THE RIGHT FORMATTING 
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)

When we're done today, you will approach messy real-world data with confidence that you can get it into a format that you can manipulate.

Specifically, our learning objectives are:

  • Understand the structure of an HTML document and use that structure to extract desired information
  • Use Python data structures such as lists, dictionaries, and Pandas DataFrames to store and manipulate information
  • Identify some other (semi-)structured formats commonly used for storing and transferring data, such as JSON and CSV
  • Practice using Python packages such as BeautifulSoup and Pandas, including how to navigate their documentation to find functionality
In [ ]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("notebook")
import json

import requests
from bs4 import BeautifulSoup
from IPython.display import HTML
In [ ]:
# Setting up 'requests' to make HTTPS requests properly takes some extra steps... we'll skip them for now.
requests.packages.urllib3.disable_warnings()

import warnings
warnings.filterwarnings("ignore")

Goals

Is science becoming more collaborative over time? How about literature? Are there a few "geniuses" or lots of hard workers? One way we might answer those questions is by looking at Nobel Prizes. We could ask questions like:

  • Has anyone won a prize more than once?
  • How has the total number of recipients changed over time?
  • How has the number of recipients per award changed over time?

To answer these questions, we'll need data: who received what award when.

Before we dive into acquiring this data the way we've been teaching in class, let's pause to ask: what are 5 different approaches we could take to acquiring Nobel Prize data?

When possible: find a structured dataset (.csv, .json, .xls)

After a Google search we stumble upon this dataset on GitHub. It is also in the section folder, named github-nobel-prize-winners.csv.

We use pandas to read it:

In [ ]:
df = pd.read_csv("../data/github-nobel-prize-winners.csv")
df.head()

Or you may want to read an xlsx file:

(Potentially missing package: you may need to install xlrd first, by running conda install xlrd in your terminal, or !conda install xlrd from a notebook cell.)

In [ ]:
df = pd.read_excel("../data/github-nobel-prize-winners.xlsx")
df.head()

QUIZ: Did anyone receive the Nobel Prize more than once?

How would you check whether anyone received more than one Nobel Prize?

In [ ]:
# list storing all the names 
name_winners = []
for name in df.winner:
    # Check if we already encountered this name: 
    if name in name_winners:
        # if so, print the name
        print(name)
    else:
        # otherwise, add the name to the list
        name_winners.append(name)

We don't want to print "No Prize was Awarded" all the time.

In [ ]:
# Your code here
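
One possible solution, as a sketch building on the loop above: keep the same logic, but skip the placeholder entry.

In [ ]:
# Same loop as above, but don't print the "No Prize was Awarded" placeholder
name_winners = []
for name in df.winner:
    if name in name_winners and name != "No Prize was Awarded":
        # a repeated real name: print it
        print(name)
    else:
        # otherwise, add the name to the list
        name_winners.append(name)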

How can we make this into a one-liner?

In [ ]:
winners = []
[print(name) if (name in winners and name != "No Prize was Awarded") 
 else winners.append(name) for name in df.winner];

Otherwise: Web Scraping

Turns out that https://www.nobelprize.org/prizes/lists/all-nobel-prizes/ has the data we want.

Let's take a look at the website and at the HTML under the hood: right-click on the page and choose Inspect. Try to find structure in the tree-structured HTML.


But the nobelprize.org server is a little slow sometimes. Fortunately, the Internet Archive periodically crawls most of the Internet and saves what it finds. (That's a lot of data!) So let's grab the data from the Archive's "Wayback Machine" (great name!).

We'll just give you the direct URL, but at the very end you'll see how we can get it out of a JSON response from the Wayback Machine API.

In [ ]:
snapshot_url = 'http://web.archive.org/web/20180820111639/https://www.nobelprize.org/prizes/lists/all-nobel-prizes/'
In [ ]:
snapshot = requests.get(snapshot_url)
snapshot

What is this Response [200]? Let's google: response 200 meaning. All possible codes are listed here.

In [ ]:
type(snapshot)

Try requesting "www.xoogle.be". What happens?

In [ ]:
snapshot_url2 = 'http://web.archive.org/web/20180820111639/https://www.xoogle.be'
snapshot = requests.get(snapshot_url2)
snapshot

Always remember not to be evil when scraping with requests! If downloading multiple pages (like you will be on HW1), always put a delay between requests (e.g., time.sleep(1), using the time library) so you don't unwittingly hammer someone's web server and/or get blocked.
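
For example, a minimal sketch of a polite download loop (the list of URLs here is hypothetical):

In [ ]:
import time

urls_to_fetch = [snapshot_url]  # hypothetical: imagine many URLs here
for url in urls_to_fetch:
    response = requests.get(url)
    time.sleep(1)  # pause between requests so we don't hammer the server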

In [ ]:
snapshot = requests.get(snapshot_url)
raw_html = snapshot.text
print(raw_html[:500])

Regular Expressions

You can find specific patterns or strings in text by using regular expressions: a pattern-matching mechanism used throughout computer science and programming (it's not just specific to Python). The official Python re documentation is a great reference if you want to dig deeper (and could be very useful for a homework problem).

Specify a sequence of characters with the help of regex special characters. Some examples:

  • \S : Matches any character which is not a Unicode whitespace character
  • \d : Matches any Unicode decimal digit
  • * : Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible.
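
For instance, combining these special characters on a toy string (a quick sketch):

In [ ]:
import re
# \d\d\d\d matches four digits in a row -- handy for matching years
re.findall(r'\d\d\d\d', 'Awarded in 1903 and again in 1911')  # ['1903', '1911']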

Let's find all the occurrences of 'Marie' in our raw_html:

In [ ]:
import re 
In [ ]:
re.findall(r'Marie',raw_html)

Using \S to match 'Marie' + ' ' + 'any character which is not a Unicode whitespace character':

In [ ]:
re.findall(r'Marie \S',raw_html)

How would we find the last names that come after 'Marie'?

In [ ]:
# Your code here
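
One possible pattern, as a sketch: since \S matches a single non-whitespace character, \S+ grabs the whole next word.

In [ ]:
# \S+ matches one or more non-whitespace characters: the full last name
re.findall(r'Marie \S+', raw_html)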

Now we have all our data in the notebook. Unfortunately, it is in the form of one really long string, which is hard to work with directly. This is where BeautifulSoup comes in.

Parse the HTML with BeautifulSoup

In [ ]:
soup = BeautifulSoup(raw_html, 'html.parser')

Key BeautifulSoup functions we’ll be using in this section:

  • tag.prettify(): Returns a cleaned-up version of the raw HTML, useful for printing
  • tag.select(selector): Returns a list of nodes matching a CSS selector
  • tag.select_one(selector): Returns the first node matching a CSS selector
  • tag.text / tag.get_text(): Returns the visible text of a node (e.g., "<p>Some text</p>" -> "Some text")
  • tag.contents: A list of the immediate children of this node

You can also use these functions to find nodes.

  • tag.find_all(tag_name, attrs=attributes_dict): Returns a list of matching nodes
  • tag.find(tag_name, attrs=attributes_dict): Returns first matching node
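
For instance, these two calls find the same nodes (a quick sketch):

In [ ]:
# Two equivalent ways to grab every <a> tag in the document
links_css = soup.select('a')       # CSS selector
links_find = soup.find_all('a')    # tag name
len(links_css) == len(links_find)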

BeautifulSoup is a very powerful library -- much more info here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Let's practice some BeautifulSoup commands...

Print a cleaned-up version of the raw HTML

In [ ]:
# Your code here

Find the first “title” object

In [ ]:
# Your code here

Extract the text of first “title” object

In [ ]:
# Your code here

Extracting award data

Let's use the structure of the HTML document to extract the data we want.

From inspecting the page in DevTools, we found that each award is in a div with a by_year class. Let's get all of them.

In [ ]:
award_nodes = soup.select('.by_year')
len(award_nodes)

Let's pull out an example.

In [ ]:
award_node = award_nodes[200]
In [ ]:
HTML(award_node.prettify())

Let's practice getting data out of a BS Node

The prize title

In [ ]:
award_node.select_one('h3').text

How do we separate the title from the year?

In [ ]:
# Your code here

How do we separate the year from the title?

In [ ]:
# Your code here

Let's put them into functions:

In [ ]:
def get_award_title(award_node):
    return award_node.select_one('h3').text[:-4].strip()
In [ ]:
def get_award_year(award_node):
    return int(award_node.select_one('h3').text[-4:])

Make a list of titles for all awards

In [ ]:
list_awards = []
for award_node in award_nodes:
    list_awards.append(get_award_title(award_node))
list_awards

Let's use list comprehension:

In [ ]:
# Your code here
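
One way to write it, as a sketch:

In [ ]:
list_awards = [get_award_title(award_node) for award_node in award_nodes]
list_awards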

The recipients

How do we handle there being more than one?

In [ ]:
[node.text for node in award_node.select('h6 a')]

We'll leave them as a list for now; we'll return to this later.

The prize "motivation"

How would you get the 'motivation'/reason of the prize from the following award_node?

In [ ]:
award_node = award_nodes[200]
award_node
In [ ]:
# Your code here

Putting everything into functions:

In [ ]:
def get_award_motivation(award_node):
    award_node = award_node.select_one('p')
    if not award_node:
        return None
    return award_node.text #.lstrip('\u201c').rstrip('\u201d')
In [ ]:
def get_recipients(award_node):
    return [node.text for node in award_node.select('h6 a')]

Let's create a Pandas DataFrame

Now let's get all of the awards.

In [ ]:
awards = []
for award_node in soup.select('.by_year'):
    recipients = get_recipients(award_node)
    award = {}
    award['title'] = get_award_title(award_node)
    award['year'] = get_award_year(award_node)
    award['recipients'] = recipients
    award['num_recipients'] = len(recipients)
    award['motivation'] = get_award_motivation(award_node)    
    awards.append(award)
In [ ]:
df_awards_raw = pd.DataFrame(awards)
In [ ]:
df_awards_raw

Some quick EDA.

In [ ]:
df_awards_raw.info()
In [ ]:
df_awards_raw.year.min()

Hm, that's suspiciously close to a round number. Are we missing some?

How about recipients?

In [ ]:
df_awards_raw.head()
In [ ]:
df_awards_raw.num_recipients.value_counts()

Why do some have no recipients?

In [ ]:
df_awards_raw[df_awards_raw.num_recipients == 0]

OK: the 2018 awards have no recipients because this is a 2018 archived version of the Nobel Prize webpage, captured before that year's winners were announced. Some past years lack recipients because there actually were no awards those years. Let's keep only meaningful data:

In [ ]:
df_awards_past = df_awards_raw[df_awards_raw.year != 2018]
df_awards_past.info()

Hm, motivation has a different number of items... why?

In [ ]:
df_awards_past[df_awards_past.motivation.isnull()]

Looks like it's fine that those motivations were missing.

Sort the awards by year.

In [ ]:
df_awards_past.sort_values('year').head()

How many awards of each type were given?

In [ ]:
df_awards_past.title.value_counts()

But wait, that includes the years the awards weren't offered.

In [ ]:
df_awards_actually_offered = df_awards_past[df_awards_past.num_recipients > 0]
df_awards_actually_offered.title.value_counts()

When was each award first given?

In [ ]:
df_awards_actually_offered.groupby('title').year.describe()
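
If we only want the first year of each award, a more direct alternative (a sketch):

In [ ]:
# The min of each group is the first year the award appears in our data
df_awards_actually_offered.groupby('title').year.min()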

How many recipients per year?

For now, let's include the years with missing awards; if we were to analyze further, we'd have to decide whether to keep them.

In [ ]:
df_awards_past.plot.scatter(x='year', y='num_recipients');

It's hard to see a trend when there are multiple observations per year (why?).

Let's try looking at total num recipients by year.

In [ ]:
plt.figure(figsize=[16, 6])
# plt.plot(df_awards_past.groupby('year').num_recipients.sum(), color='b', linewidth=2)
plt.plot(df_awards_past.groupby('year').num_recipients.sum(), '-ob', linewidth=2, alpha=0.75)

plt.title('Total Nobel Awards per year')
plt.xlabel('Year')
plt.ylabel('Number of recipients')
plt.grid(True)
plt.show()

Check out the years 1940-43. Any comment?

Any trends over the last 25 years?

In [ ]:
plt.figure(figsize=[16, 6])
for i, award in enumerate(set(df_awards_past.title), start=1):
    year = df_awards_past[df_awards_past['title'] == award].year
    recips = df_awards_past[df_awards_past['title'] == award].num_recipients
    index = year > 2019 - 25  # keep only the last 25 years
    years_filtered = year[index].values
    recips_filtered = recips[index].values
    plt.subplot(2, 3, i)
    plt.bar(years_filtered, recips_filtered, color='b', alpha=0.7)
    plt.title(award)
    plt.xlabel('Year')
    plt.ylabel('Number of Recipients')
    plt.ylim(0, 3)
plt.tight_layout()

End of Standard Section


Extra: Did anyone receive the Nobel Prize more than once (based upon scraped data)?

Here's where it bites us that our original DataFrame isn't "tidy". Let's make a tidy one.

In [ ]:
tidy_awards = []
for idx, row in df_awards_past.iterrows():
    for recipient in row['recipients']:
        tidy_awards.append(dict(
            recipient=recipient,
            year=row['year']))
tidy_awards_df = pd.DataFrame(tidy_awards)
tidy_awards_df.info()
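
Aside: if your pandas version is 0.25 or newer, DataFrame.explode offers a one-line alternative (a sketch; note that, unlike the loop above, empty recipient lists become NaN rows):

In [ ]:
# explode() turns each element of a list-valued column into its own row
tidy_alt = (df_awards_past[['year', 'recipients']]
            .explode('recipients')
            .rename(columns={'recipients': 'recipient'}))
tidy_alt.info()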

Now we can look at each recipient individually.

In [ ]:
tidy_awards_df.recipient.value_counts()

Extra: Other structured data formats: JSON and CSV

CSV

CSV is a lowest-common-denominator format for tabular data.

In [ ]:
df_awards_past.to_csv('../data/awards.csv', index=False)
with open('../data/awards.csv', 'r') as f:
    print(f.read()[:1000])

It loses some info, though: the recipients list became a plain string, and the reader needs to guess whether each column is numeric or not.

In [ ]:
pd.read_csv('../data/awards.csv').recipients.iloc[20]
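
If we needed the list back, one option (a sketch) is ast.literal_eval, since the CSV stored the list's string repr:

In [ ]:
import ast
# Parse the stringified list back into a real Python list
recipients_str = pd.read_csv('../data/awards.csv').recipients.iloc[20]
ast.literal_eval(recipients_str)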

JSON

JSON preserves structured data, but fewer data-science tools speak it.

In [ ]:
df_awards_past.to_json('../data/awards.json', orient='records')

with open('../data/awards.json', 'r') as f:
    print(f.read()[:1000])

Lists and other basic data types are preserved. (Custom data types aren't supported, but at least you'll get an error when saving rather than silently losing data.)

In [ ]:
pd.read_json('../data/awards.json').recipients.iloc[20]

Extra: Pickle: handy for storing data

For temporary data storage in a single version of Python, pickles will preserve your data even more faithfully, even many custom data types. But don't count on it for exchanging data or long-term storage. (In fact, don't try to load untrusted pickles -- they can run arbitrary code!)

In [ ]:
df_awards_past.to_pickle('../data/awards.pkl')
# Pickle files are binary; latin1 maps every byte to a character, so we can peek inside.
with open('../data/awards.pkl', 'r', encoding='latin1') as f:
    print(f.read()[:200])

Yup, lots of internal Python and Pandas stuff...

In [ ]:
pd.read_pickle('../data/awards.pkl').recipients.iloc[20]

Extra: Formatted data output

Let's make a textual table of Physics laureates by year, earliest first:

In [ ]:
for idx, row in df_awards_past.sort_values('year').iterrows():
    if 'Physics' in row['title']:
        print('{}: {}'.format(
            row['year'],
            ', '.join(row['recipients'])))

Extra: Parsing JSON to get the Wayback Machine URL

We could go to http://archive.org, search for our URL, and get the URL for the archived version there. But since you'll often need to talk with APIs, let's take this opportunity to use the Wayback Machine's API. This will also give us a chance to practice working with JSON.

In [ ]:
url = "https://www.nobelprize.org/prizes/lists/all-nobel-prizes/"
# All 3 of these do the same thing. The third is my (KCA's) favorite new feature of Python 3.6.
wayback_query_url = 'http://archive.org/wayback/available?url={}'.format(url)
wayback_query_url = 'http://archive.org/wayback/available?url={url}'.format(url=url)
wayback_query_url = f'http://archive.org/wayback/available?url={url}'
r = requests.get(wayback_query_url)

We got some kind of response... what is it?

In [ ]:
r.text

Yay, JSON! It's usually pretty easy to work with JSON, once we parse it.

In [ ]:
json.loads(r.text)

Loading responses as JSON is so common that requests has a convenience method for it:

In [ ]:
response_json = r.json()
response_json

What kind of object is this?

A little Python syntax review: How can we get the snapshot URL?

In [ ]:
snapshot_url = response_json['archived_snapshots']['closest']['url']
snapshot_url