CS109A Introduction to Data Science

Standard Section 1: Introduction to Web Scraping

Harvard University
Fall 2018
Instructors: Pavlos Protopapas and Kevin Rader
Section Leaders: Cecilia Garraffo, Mehul Smriti Raje, Ken Arnold, Karan Motwani


In [5]:
#RUN THIS CELL 
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
Out[5]:

When we're done today, you will approach messy real-world data with confidence that you can get it into a format that you can manipulate.

Specifically, our learning objectives are:

  • Understand the structure of an HTML document and use that structure to extract desired information
  • Use Python data structures such as lists, dictionaries, and Pandas DataFrames to store and manipulate information
  • Identify some other (semi-)structured formats commonly used for storing and transferring data, such as JSON and CSV
  • Practice using Python packages such as BeautifulSoup and Pandas, including how to navigate their documentation to find functionality
In [ ]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("notebook")

import requests
from bs4 import BeautifulSoup
from IPython.display import HTML
In [ ]:
# Setting up 'requests' to make HTTPS requests properly takes some extra steps... we'll skip them for now.
requests.packages.urllib3.disable_warnings()

Our goals today

Is science becoming more collaborative over time? How about literature? Are there a few "geniuses" or lots of hard workers? One way we might answer those questions is by looking at Nobel Prizes. We could ask questions like:

  • How has the number of recipients per award changed over time?
  • Has anyone won a prize more than once?

To answer these questions, we'll need data: who received what award when.

Before we dive into acquiring this data the way we've been teaching in class, let's pause to ask: what are 5 different approaches we could take to acquiring Nobel Prize data?

Ingesting data (and what is JSON?)

Turns out that https://www.nobelprize.org/prizes/lists/all-nobel-prizes/ has the data we want. But the nobelprize.org server is a little slow sometimes. Fortunately, the Internet Archive periodically crawls most of the Internet and saves what it finds. (That's a lot of data!) So let's grab the data from the Archive's "Wayback Machine" (great name!).

We could go to http://archive.org, search for our URL, and copy the URL of the archived version from there. But since you'll often need to talk with APIs, let's take this opportunity to use the Wayback Machine's API instead. This will also give us a chance to practice working with JSON.

In [ ]:
url = "https://www.nobelprize.org/prizes/lists/all-nobel-prizes/"
# All 3 of these do the same thing. The third is my (KCA's) favorite new feature of Python 3.6.
wayback_query_url = 'http://archive.org/wayback/available?url={}'.format(url)
wayback_query_url = 'http://archive.org/wayback/available?url={url}'.format(url=url)
wayback_query_url = f'http://archive.org/wayback/available?url={url}'
r = requests.get(wayback_query_url)

We got some kind of response... what is it?

In [ ]:
r.text

Yay, JSON! It's usually pretty easy to work with JSON, once we parse it.

In [ ]:
import json
json.loads(r.text)

Loading responses as JSON is so common that requests has a convenience method for it:

In [ ]:
response_json = r.json()
response_json

What kind of object is this?
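A quick way to check:

In [ ]:
type(response_json)  # JSON objects parse to plain Python dicts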

A little Python syntax review: How can we get the snapshot URL?

In [ ]:
snapshot_url = response_json['archived_snapshots']['closest']['url']
snapshot = requests.get(snapshot_url)

Always remember not to be evil when scraping with requests! If you're downloading multiple pages (like you will be on HW1), always put a delay between requests (e.g., time.sleep(1), using the time library) so you don't unwittingly hammer someone's webserver and/or get blocked.
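For instance, if we were fetching several snapshots in a row, a polite loop might look like this (just a sketch; the one-item URL list is a stand-in for whatever pages you actually need):

In [ ]:
import time

urls_to_fetch = [snapshot_url]  # stand-in for a longer list of pages
pages = []
for u in urls_to_fetch:
    pages.append(requests.get(u).text)
    time.sleep(1)  # wait a second between requests to be polite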

In [ ]:
raw_html = snapshot.text
print(raw_html[:500])

Now, we have all our data in the notebook. Unfortunately, it is the form of one really long string, which is hard to work with directly. This is where BeautifulSoup comes in.

Parse the HTML with BeautifulSoup

In [ ]:
soup = BeautifulSoup(raw_html, 'html.parser')

Key BeautifulSoup functions we’ll be using in this section:

  • node.prettify(): Returns cleaned-up version of raw HTML, useful for printing
  • node.select(selector): Return a list of nodes matching a CSS selector
  • node.select_one(selector): Return the first node matching a CSS selector
  • node.text / node.get_text(): Returns the visible text of a node (e.g., "<p>Some text</p>" -> "Some text")
  • node.contents: A list of the immediate children of this node

You can also use these functions to find nodes.

  • node.find_all(tag_name, attrs=attributes_dict): Returns a list of matching objects
  • node.find(tag_name, attrs=attributes_dict): Returns the first matching object

BeautifulSoup is a very powerful library -- much more info here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
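Since the list above mentions find and find_all but we'll mostly use select, here's a quick sanity check that they agree (a minimal illustration):

In [ ]:
# find_all mirrors select -- these two should find the same number of nodes:
len(soup.find_all('a')), len(soup.select('a'))
# and attrs= filters by attribute, much like a CSS class selector:
# soup.find_all('div', attrs={'class': 'by_year'}) matches the same divs as soup.select('div.by_year')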

Let's practice some BeautifulSoup commands...

Print a cleaned-up version of raw HTML

In [ ]:
print(soup.prettify()[:500])

Find the first “title” object

In [ ]:
soup.select_one('title')

Extract the text of first “title” object

In [ ]:
soup.select_one('title').text

Extracting award data

Let's use the structure of the HTML document to extract the data we want.

From inspecting the page in DevTools, we found that each award is in a div with a by_year class. Let's get all of them.

In [ ]:
award_nodes = soup.select('.by_year')
len(award_nodes)

Let's pull out a few examples.

In [ ]:
award_nodes[200:202]

How about just a single example?

In [ ]:
award_node = award_nodes[200]
In [ ]:
HTML(award_node.prettify())

Individual Activity 1

Let's practice getting data out of a BS Node

Extract the prize title

Start by getting the full title including the year.

In [ ]:
# Your code here

Now try to separate the title from the year

In [ ]:
def get_award_title(award_node):
    # Your code here
    pass
In [ ]:
def get_award_year(award_node):
    # Your code here
    pass

Make a list of titles for all awards

In [ ]:
# Your code here

Make a list of dictionaries of the title and year for all awards.

In [ ]:
# Your code here

Back together...
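First, so that the cells below run end-to-end, here's one way to implement the activity functions (a sketch: it assumes each award's full title, ending in the four-digit year, is the text of the first h3 inside the award node -- check the markup in DevTools to confirm):

In [ ]:
def get_award_title(award_node):
    # Everything in the h3 heading except the trailing 4-digit year.
    return award_node.select_one('h3').text[:-4].strip()

def get_award_year(award_node):
    # The year is the last 4 characters of the heading text.
    return int(award_node.select_one('h3').text[-4:])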

The recipients

An award can have more than one recipient. How do we handle that?

In [ ]:
[node.text for node in award_node.select('h6 a')]

We'll leave them as a list for now, to return to this later.

The prize "motivation"

In [ ]:
award_node.select_one('p').text

What are those weird quotes at either end?

In [ ]:
print(json.dumps(award_node.select_one('p').text))

Ah, they're "smart quotes". Let's strip them off.

In [ ]:
award_node.select_one('p').text.lstrip('\u201c').rstrip('\u201d')
In [ ]:
def get_award_motivation(award_node):
    motivation_node = award_node.select_one('p')
    if not motivation_node:
        return None
    return motivation_node.text.lstrip('\u201c').rstrip('\u201d')
In [ ]:
def get_recipients(award_node):
    return [node.text for node in award_node.select('h6 a')]

Now let's get all of the awards.

In [ ]:
awards = []
for award_node in soup.select('.by_year'):
    recipients = get_recipients(award_node)
    awards.append(dict(
        title=get_award_title(award_node),
        year=get_award_year(award_node),
        recipients=recipients,
        num_recipients=len(recipients),
        motivation=get_award_motivation(award_node)
    ))
In [ ]:
df_awards_raw = pd.DataFrame(awards)

Some quick EDA.

In [ ]:
df_awards_raw.info()
In [ ]:
df_awards_raw.year.min()

Hm, that's suspiciously close to a round number. Are we missing some?
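One way to check for gaps (a sketch, assuming year is an integer column):

In [ ]:
# Which years in the scraped range have no awards at all?
all_years = set(range(int(df_awards_raw.year.min()), int(df_awards_raw.year.max()) + 1))
sorted(all_years - set(df_awards_raw.year))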

How about recipients?

In [ ]:
df_awards_raw.num_recipients.value_counts()

Why do some have no recipients?

In [ ]:
df_awards_raw[df_awards_raw.num_recipients == 0]

Ok: 2018 awards have no recipients because they haven't been awarded yet. Some past years lack awards because there actually were none that year. Let's keep only meaningful data:

In [ ]:
df_awards_past = df_awards_raw[df_awards_raw.year != 2018]
df_awards_past.info()

Hm, motivation has a different number of non-null entries than the other columns... why?

In [ ]:
df_awards_past[df_awards_past.motivation.isnull()].head()

Looks like it's fine that those motivations were missing.

Individual Activity 2

Sort the awards by year.

In [ ]:
# Your code here

How many awards of each type were given?

In [ ]:
# Your code here

When was each award first given?

In [ ]:
# Your code here

Back together...

How many recipients per year?

For now, we'll keep the years with missing awards; if we were to analyze further, we'd have to decide how to handle them.

In [ ]:
df_awards_past.plot.scatter(x='year', y='num_recipients')

It's hard to see a trend when there are multiple observations per year (why?).

Let's try looking at mean num recipients by year.

In [ ]:
df_awards_past.groupby('year').num_recipients.mean().plot.line()

A complete answer to our question would involve fitting regression models, breaking down by kind of award, etc... here's a quick preview.

In [ ]:
sns.lmplot(x='year', y='num_recipients', hue='title', data=df_awards_past)#, scatter_kws=dict(alpha=.1))
plt.xlim(1900, 2018);

Did anyone receive the Nobel Prize more than once?

Here's where it bites us that our original DataFrame isn't "tidy". Let's make a tidy one.

In [ ]:
tidy_awards = []
for idx, row in df_awards_past.iterrows():
    for recipient in row['recipients']:
        tidy_awards.append(dict(
            recipient=recipient,
            year=row['year']))
tidy_awards_df = pd.DataFrame(tidy_awards)
tidy_awards_df.info()
In [ ]:
tidy_awards_df.recipient.value_counts()
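That's a long list to scan by eye. To answer the question directly, keep only the names that appear more than once:

In [ ]:
recipient_counts = tidy_awards_df.recipient.value_counts()
recipient_counts[recipient_counts > 1]  # repeat laureates (and repeat organizations)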

Other structured data formats: JSON and CSV

CSV

CSV is the lowest-common-denominator data format.

In [ ]:
df_awards_past.to_csv('awards.csv', index=False)
with open('awards.csv', 'r') as f:
    print(f.read()[:1000])

It loses some info, though: the recipients list became a plain string.

In [ ]:
pd.read_csv('awards.csv').recipients.iloc[20]
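If you did need the lists back, one workaround is to parse Python's string representation of them (a sketch using ast.literal_eval; it assumes the stringified list survived the CSV round-trip intact):

In [ ]:
import ast

# Turn the stringified list back into a real Python list.
ast.literal_eval(pd.read_csv('awards.csv').recipients.iloc[20])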

JSON

JSON preserves structured data, but fewer data-science tools speak it.

In [ ]:
df_awards_past.to_json('awards.json', orient='records')

with open('awards.json', 'r') as f:
    print(f.read()[:1000])

Lists are preserved.

In [ ]:
pd.read_json('awards.json').recipients.iloc[20]

Pickle

For temporary data storage in a single version of Python, pickles will preserve your data even more faithfully. But don't count on it for exchanging data (in fact, don't try to load untrusted pickles -- they can run arbitrary code!).

In [ ]:
df_awards_past.to_pickle('awards.pkl', protocol=0)  # protocol=0 is the oldest, ASCII-based pickle format
with open('awards.pkl', 'r', encoding='latin1') as f:
    print(f.read()[:200])  # just peeking at the raw file contents

Formatted data output

Let's make a textual table of Physics laureates by year, earliest first:

In [ ]:
for idx, row in df_awards_past.sort_values('year').iterrows():
    if 'Physics' in row['title']:
        print('{}: {}'.format(
            row['year'],
            ', '.join(row['recipients'])))