CS109A Introduction to Data Science

Standard Section 1: Introduction to Web Scraping

Harvard University
Fall 2020
Instructors: Pavlos Protopapas, Kevin Rader, and Chris Tanner
Section Leaders: Marios Mattheakis, Hayden Joy


In [1]:
## RUN THIS CELL TO GET THE RIGHT FORMATTING 
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
Out[1]:

Section Learning Objectives

When we're done today, you will approach messy real-world data with confidence that you can get it into a format that you can manipulate.

Specifically, our learning objectives are:

  • Understand the tree-like structure of an HTML document and use that structure to extract desired information
  • Use Python data structures such as lists, dictionaries, and Pandas DataFrames to store and manipulate information

  • Practice using Python packages such as BeautifulSoup and Pandas, including how to navigate their documentation to find functionality

  • Identify some other (semi-)structured formats commonly used for storing and transferring data, such as JSON and CSV

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from bs4 import BeautifulSoup
import requests


import json
from IPython.display import HTML
In [3]:
# Setting up 'requests' to make HTTPS requests properly takes some extra steps... we'll skip them for now.
%matplotlib inline 

requests.packages.urllib3.disable_warnings()

import warnings
warnings.filterwarnings("ignore")

Section Data Analysis Questions

Is science becoming more collaborative over time? How about literature? Are there a few "geniuses" or lots of hard workers? One way we might answer those questions is by looking at Nobel Prizes. We could ask questions like:

  • 1) Has anyone won a prize more than once?
  • 2) How has the total number of recipients changed over time?
  • 3) How has the number of recipients per award changed over time?

To answer these questions, we'll need data: who received what award and when.

Before we dive into acquiring this data the way we've been teaching in class, let's pause to ask: what are 5 different approaches we could take to acquiring Nobel Prize data?

When possible: find a structured dataset (.csv, .json, .xls)

After a Google search we stumble upon this dataset on GitHub. It is also in the section folder, named github-nobel-prize-winners.csv.

We use pandas to read it:

In [4]:
df = pd.read_csv("../data/github-nobel-prize-winners.csv")
df.head() #pandas is a very useful package
Out[4]:
year discipline winner desc
0 1901 chemistry Jacobus H. van 't Hoff in recognition of the extraordinary services h...
1 1901 literature Sully Prudhomme in special recognition of his poetic compositi...
2 1901 medicine Emil von Behring for his work on serum therapy, especially its ...
3 1901 peace Henry Dunant NaN
4 1901 peace Frédéric Passy NaN

Or you may want to read an xlsx file:

(Potentially missing package: you may need to install xlrd first, either from your terminal with conda install xlrd or from within the notebook as in the next cell.)

In [7]:
!conda install --yes xlrd 
Collecting package metadata (current_repodata.json): done
Solving environment: done


==> WARNING: A newer version of conda exists. <==
  current version: 4.7.10
  latest version: 4.8.4

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /home/chris/anaconda3/envs/cs109a

  added / updated specs:
    - xlrd


The following NEW packages will be INSTALLED:

  xlrd               pkgs/main/linux-64::xlrd-1.2.0-py37_0


Preparing transaction: done
Verifying transaction: done
Executing transaction: done
In [8]:
df = pd.read_excel("../data/github-nobel-prize-winners.xlsx")
df.tail()
Out[8]:
year discipline winner desc
848 2007 medicine Oliver Smithies for their discoveries of principles for introd...
849 2007 peace Intergovernmental Panel on Climate Change (IPCC) for their efforts to build up and disseminate ...
850 2007 peace Albert Arnold (Al) Gore Jr. for their efforts to build up and disseminate ...
851 2007 physics Albert Fert for the discovery of Giant Magnetoresistance
852 2007 physics Peter Grünberg for the discovery of Giant Magnetoresistance

Introducing types

In [ ]:
#type(df.winner)
#type(df)

Research Question 1: Did anyone receive the Nobel Prize more than once?

How would you check whether anyone received more than one Nobel Prize?

In [ ]:
# initialize the list storing all the names 
name_winners = []

for name in df.winner:
    
    # Check if we already encountered this name: 
    if name in name_winners:
        
        # if so, print the name
        print(name)
    else:
        # otherwise append the name to the list
        name_winners.append(name)

We don't want to print "No Prize was Awarded" all the time.

In [ ]:
# Your code here
# list storing all the names 
name_winners = []

for name in df.winner:
    
    # Check if we already encountered this name: 
    if name in name_winners and name != "No Prize was Awarded": 
        # if so, print the name
        print(name)
        
    else:
        # otherwise append the name to the list
        name_winners.append(name)

We can use .split() on a string to separate the words into individual strings and store them in a list.

In [ ]:
UN_string = "Office of the United Nations"
print(UN_string.split())
#n_words = len(UN_string.split())
#print("Number of words: " + str(n_words));

Even better:

In [ ]:
name_winners = []

for name in df.winner:
    
    # Check if we already encountered this name: 
    if name in name_winners and len(name.split()) <= 2: 
        # if so, print the name
        print(name)
        
    else:
        # otherwise append the name to the list
        name_winners.append(name)

How can we turn this into a one-liner?

List comprehension form: [f(x) for x in list]

In [ ]:
winners = []
[print(name) if (name in winners and len(name.split()) <= 2) 
 else winners.append(name) for name in df.winner];
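
For comparison, here is a sketch (not from the original section) that uses collections.Counter to find the repeat winners more directly:

In [ ]:
# Count how many times each winner appears, then keep the repeated real names.
from collections import Counter

counts = Counter(df.winner.dropna())
[name for name, c in counts.items() if c > 1 and len(name.split()) <= 2]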
In [ ]:
HTML('<div>'
     'Marie Curie received the Nobel Prize in Physics in 1903 and in Chemistry in 1911. '
     'She is one of only four people to receive two Nobel Prizes.'
     '</div>')

Part 2: WEB SCRAPING

The first step in web scraping is to look for structure in the HTML. Let's look at a real website:

The official Nobel website has the data we want, but in 2018 and 2019 the Physics prize was awarded to multiple groups, so we will use an archived version of the web page for an easier introduction to web scraping.

The Internet Archive periodically crawls most of the Internet and saves what it finds. (That's a lot of data!) So let's grab the data from the Archive's "Wayback Machine" (great name!). We've just given you the direct URL, but at the very end you'll see how we can get it out of a JSON response from the Wayback Machine API.

Let's take a look at the 2018 version of the Nobel website and inspect the underlying HTML: right-click and choose Inspect. Try to find structure in the tree-structured HTML.

Play around! (give floor to the students)

In [ ]:
###################################################

The first step of web scraping is to write down the structure of the web page.

Here is a quick recap of HTML tags and what they do in the context of this notebook:

HTML tags are opened and closed as follows: <h3> some text </h3>.

Here is a list of a few tags, their definitions, and what information they contain in our problem today:

  • <h3>: header 3 tag. A size-3 header (header 1 is the largest). This tag contains the title and year of the Nobel Prize, which we will parse out.
  • <h6>: header 6 tag. A smaller header (than header 3) that contains the prize recipients.
  • <p>: paragraph tag. Used for text; it contains the prize motivation.
  • <div>: "The Content Division element (<div>) is the generic container for flow content." What we care about here is the class attribute, which we will use with BeautifulSoup to quickly parse the information we want. A class attribute can be attached to any tag.

Paying attention to tags with class attributes is key to the homework.

In [ ]:
# here is what we will get after selecting using the by_year class.

einstein = HTML('<div class="by_year">'
                '<h3><a>The Nobel Prize in Physics 1921</a></h3>'
                '<h6><a>Albert Einstein</a></h6>'
                '<p>“for his services to Theoretical Physics, and especially for his '
                'discovery of the law of the photoelectric effect”</p>'
                '</div>')
display(einstein)
In [ ]:
snapshot_url = 'http://web.archive.org/web/20180820111639/https://www.nobelprize.org/prizes/lists/all-nobel-prizes/'
In [ ]:
snapshot = requests.get(snapshot_url)
snapshot

Response [200] is a success status code. Let's google: "response 200 meaning". A full list of HTTP status codes is easy to find online.
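
You can also check the status code programmatically:

In [ ]:
snapshot.status_code                        # 200 on success
snapshot.status_code == requests.codes.ok   # True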

In [ ]:
type(snapshot)

Try to request "www.xoogle.be". What happens?

In [ ]:
snapshot_url2 = 'http://web.archive.org/web/20180820111639/https://www.xoogle.be'
snapshot = requests.get(snapshot_url2)
snapshot

Always remember not to "be evil" when scraping with requests! If downloading multiple pages (as you will be on HW1), always put a delay between requests (e.g., time.sleep(1), using the time library) so you don't unwittingly hammer someone's webserver and/or get blocked.
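
For instance, here is a minimal polite-scraping sketch (the URL list is just a stand-in for whatever pages you need to fetch):

In [ ]:
import time

urls = [snapshot_url]  # imagine a longer list of pages here
pages = []
for url in urls:
    pages.append(requests.get(url).text)
    time.sleep(1)  # pause between requests so we don't hammer the server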

In [ ]:
snapshot = requests.get(snapshot_url)
raw_html = snapshot.text
print(raw_html[:500])

Regular Expressions

You can find specific patterns or strings in text by using Regular Expressions: a pattern-matching mechanism used throughout computer science and programming (it's not specific to Python). There are many great resources on regex if you are interested; they could be very useful for a homework problem.

Specify a specific sequence with the help of regex special characters. Some examples:

  • \S : Matches any character which is not a Unicode whitespace character
  • \d : Matches any Unicode decimal digit
  • * : Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible.
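
A quick illustration of these special characters (the strings below are made up for the example):

In [ ]:
import re

re.findall(r'\d\d\d\d', "Prizes in 1901, 1911, and 2018")  # ['1901', '1911', '2018']
re.findall(r'No\S*', "Nobel No. Nope")                     # ['Nobel', 'No.', 'Nope']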

Let's find all the occurrences of 'Marie' in our raw_html:

In [ ]:
import re 
In [ ]:
re.findall(r'Marie', raw_html)

Using \S to match 'Marie' + ' ' + 'any character which is not a Unicode whitespace character':

In [ ]:
re.findall(r'Marie \S',raw_html)

How would we find the last names that come after Marie?

ANSWER: the \w character matches any word character (letters, digits, and underscore). \w* greedily matches a run of such characters, stopping at the next non-word character (such as a space).

In [ ]:
# Your code here
last_names = re.findall(r'Marie \w*', raw_html)
display(last_names)

Now, we have all our data in the notebook. Unfortunately, it is in the form of one really long string, which is hard to work with directly. This is where BeautifulSoup comes in.

This is an example of code that grabs the first title. Regex can quickly become complex, which motivates BeautifulSoup.

In [ ]:
first_title = re.findall(r'<h3.*<\/a><\/h3>', raw_html)[0]
print(first_title)
# you can do this via regex, but it gets complicated fast! This motivates Beautiful Soup.

Parse the HTML with BeautifulSoup

In [ ]:
soup = BeautifulSoup(raw_html, 'html.parser')

Key BeautifulSoup functions we’ll be using in this section:

  • tag.prettify(): Returns cleaned-up version of raw HTML, useful for printing
  • tag.select(selector): Return a list of nodes matching a CSS selector
  • tag.select_one(selector): Return the first node matching a CSS selector
  • tag.text / soup.get_text(): Returns the visible text of a node (e.g., "<p>Some text</p>" -> "Some text")
  • tag.contents: A list of the immediate children of this node

You can also use these functions to find nodes.

  • tag.find_all(tag_name, attrs=attributes_dict): Returns a list of matching nodes
  • tag.find(tag_name, attrs=attributes_dict): Returns first matching node

BeautifulSoup is a very powerful library -- much more info here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
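
Here is a tiny self-contained warm-up using a made-up HTML snippet:

In [ ]:
demo = BeautifulSoup('<div class="by_year"><h3>Prize 1901</h3>'
                     '<h6><a href="#">Someone</a></h6></div>', 'html.parser')
print(demo.select_one('h3').text)             # 'Prize 1901'
print([n.text for n in demo.select('h6 a')])  # ['Someone']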

Let's practice some BeautifulSoup commands...

Print a cleaned-up version of the raw HTML. Which function from above should we use?

In [ ]:
pretty_soup = soup.prettify()
print(pretty_soup[:500]) #what about negative indices?

Find the first “title” object

In [ ]:
# Your code here
soup.select("h3 a")

Extract the text of first “title” object

In [ ]:
#Your code here

Extracting award data

Let's use the structure of the HTML document to extract the data we want.

From inspecting the page in DevTools, we found that each award is in a div with a by_year class. Let's get all of them.

In [ ]:
award_nodes = soup.select('.by_year')
len(award_nodes)

Let's pull out an example.

In [ ]:
award_node = award_nodes[200]
In [ ]:
HTML(award_node.prettify())

Magic commands:

In [ ]:
# show ls, tree, mkdir

Let's practice getting data out of a BS Node

The prize title

In [ ]:
award_node.select_one('h3').text

How do we separate the year from the selected prize title?

In [ ]:
# %load solutions/sol2.py
award_node.select_one('h3').text[-4:]

How do we drop the year from the title?

In [ ]:
award_node.select_one('h3').text[:-4].strip()

Let's put them into functions:

In [ ]:
# %load solutions/sol_functions.py
def get_award_title(award_node):
    return award_node.select_one('h3').text[:-4].strip()

def get_award_year(award_node):
    return int(award_node.select_one('h3').text[-4:])
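
A quick sanity check on the example node from above (the exact output depends on which award node index 200 happens to be):

In [ ]:
print(get_award_title(award_node))
print(get_award_year(award_node))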

Make a list of titles for all awards

In [ ]:
#original code:
list_awards = []
for award_node in award_nodes:
    list_awards.append(get_award_title(award_node))
list_awards

Let's use list comprehension:

In [ ]:
# Your code here
[get_award_title(award_node) for award_node in award_nodes]

The recipients

How do we handle there being more than one?

In [ ]:
award_node.select('h6 a')
In [ ]:
[node.text for node in award_node.select('h6 a')]

We'll leave them as a list for now; we'll return to this later.

This is how you would get the links (relevant for the homework):

In [ ]:
[node.get("href") for node in award_node.select('h6 a')]

The prize "motivation"

How would you get the 'motivation'/reason of the prize from the following award_node?

In [ ]:
award_node = award_nodes[200]
award_node
In [ ]:
# Your code here
print(award_node.select('p')[0].text)

Putting everything into functions:

In [ ]:
def get_award_motivation(award_node):
    award_node = award_node.select_one('p')
    if not award_node: #0, [], None, and {} all default to False in a python conditional statement.
        return None
    return award_node.text 

Break Out Room 1: Practice with CSS selectors, functions, and list comprehensions

In [ ]:
print(award_nodes[200])

Exercise 1.1: complete the following function by assigning the proper CSS selector so that it returns a list of Nobel Prize recipients.

Hint: you can specify multiple selectors separated by a space.

To load the first exercise, delete the "#" and press Shift-Enter to run the cell.

Clicking on "Cell" -> "Run All Above" is also very helpful for running many cells of the notebook at once.

In [ ]:
# %load exercises/exercise1.py

Exercise 1.2: Change the above function so it uses list comprehension.

To load the exercise, simply delete the '#' in the code below and run the cell.

In [ ]:
# %load exercises/exercise2.py

Don't look at this cell until you've given the exercise a go! It loads the correct solution.

Exercise 1.2 solution (1.1 solution is contained herein as well)

In [ ]:
# %load solutions/breakoutsol1.py
In [ ]:
%run ./solutions/breakoutsol1.py

Let's create a Pandas DataFrame

Now let's get all of the awards.

In [ ]:
awards = []
for award_node in soup.select('.by_year'):
    recipients = get_recipients(award_node)
    
    #initialize the dictionary
    award = {} #{key: value}
    
    award['title'] = get_award_title(award_node)
    award['year'] = get_award_year(award_node)
    award['recipients'] = recipients
    award['num_recipients'] = len(recipients)
    award['motivation'] = get_award_motivation(award_node)    
    awards.append(award)
awards[0:2]
In [ ]:
df_awards_raw = pd.DataFrame(awards)
In [ ]:
#explain open brackets
df_awards_raw

Some quick EDA.

In [ ]:
df_awards_raw.info()
In [ ]:
df_awards_raw.year.min()

What is going on with the recipients column?

In [ ]:
df_awards_raw.head()
In [ ]:
df_awards_raw.num_recipients.value_counts()

Now let's take a look at num_recipients

In [ ]:
df_awards_raw.num_recipients == 0
In [ ]:
df_awards_raw[df_awards_raw.num_recipients == 0]

OK: the 2018 awards have no recipients because this snapshot of the Nobel Prize webpage was archived before the 2018 announcements. Some past years lack recipients because no prize was actually awarded that year. Let's keep only meaningful data:

In [ ]:
df_awards_past = df_awards_raw[df_awards_raw.year != 2018]
df_awards_past.info()

Hm, motivation has a different number of items... why?

In [ ]:
df_awards_past[df_awards_past.motivation.isnull()]

Looks like it's fine that those motivations were missing.

Sort the awards by year.

In [ ]:
df_awards_past.sort_values('year').head()

How many awards of each type were given?

In [ ]:
df_awards_past.title.value_counts()

But wait, that includes the years the awards weren't offered.

In [ ]:
df_awards_actually_offered = df_awards_past[df_awards_past.num_recipients > 0]
df_awards_actually_offered.title.value_counts()

When was each award first given?

In [ ]:
df_awards_actually_offered.groupby('title').year
In [ ]:
df_awards_actually_offered.groupby('title').year.describe() # we will use this information later!
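
Or, more directly, the first year for each award:

In [ ]:
df_awards_actually_offered.groupby('title').year.min()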

How many recipients per year?

Let's include the years with missing awards; if we were to analyze further, we'd have to decide whether to include them.

A good plot that clearly reveals patterns in the data is very important. Is this a good plot or not?

In [ ]:
df_awards_past.plot.scatter(x='year', y='num_recipients') #explain scatterplot

It's hard to see a trend when there are multiple observations per year (why?).

Let's try looking at total num recipients by year.

Let's explore how important a good plot can be
In [ ]:
df_awards_past.groupby('year').num_recipients.sum()
In [ ]:
plt.figure(figsize=[16,6])
plt.plot(df_awards_past.groupby('year').num_recipients.sum(), 'b', linewidth=1)

plt.title('Total Nobel laureates per year')
plt.xlabel('Year')
plt.ylabel('Total recipients across all prizes')
plt.grid('on')
plt.show()

Check out the years 1940-43. Any comment?

Any trends over the last 25 years?

In [ ]:
set(df_awards_past.title)
In [ ]:
plt.figure(figsize=[16,6])
i = 0
for award in set(df_awards_past.title):
    i += 1
    year = df_awards_past[df_awards_past['title']==award].year
    recips = df_awards_past[df_awards_past['title']==award].num_recipients
    index = year > 2020 - 25
    years_filtered  = year[index].values
    recips_filtered = recips[index].values
    
    plt.subplot(2,3,i)
    plt.bar(years_filtered, recips_filtered, color='b', alpha = 0.7)
    plt.title(award)
    plt.xlabel('Year')
    plt.ylabel('Number of Recipients')
    plt.ylim(0, 3)
plt.tight_layout()

A cleaner way to iterate and keep tabs: the enumerate() function

'How has the number of recipients per award changed over time?'

In [ ]:
# The enumerate function allows us to delete two lines of code 
# The number of years shown is increased to 75 so we can see the trend.
plt.figure(figsize=[16,6])

for i, award in enumerate(set(df_awards_past.title), 1): ################### <--- enumerate
    year = df_awards_past[ df_awards_past['title'] == award].year
    recips = df_awards_past[ df_awards_past['title'] == award].num_recipients
    index = year > 2019 - 75                   ########################### <--- extend the range
    years_filtered = year[index].values
    recips_filtered = recips[index].values
    
    #plot:
    plt.subplot(2, 3, i) #arguments (nrows, ncols, index)
    plt.bar(years_filtered, recips_filtered, color='b', alpha = 0.7)
    plt.title(award)
    plt.xlabel('Year')
    plt.ylabel('Number of Recipients')
    plt.ylim(0, 3)

plt.tight_layout()

End of Standard Section


Break Out Room 2: Dictionaries, DataFrames, and pyplot

Exercise 2.1 (practice creating a dataframe): Build a dataframe of famous physicists from the following lists.

Your dataframe should have the following columns: "name", "year_prize_awarded" and "famous_for".

In [ ]:
famous_award_winners = ["Marie Curie", "Albert Einstein", "James Chadwick", "Werner Karl Heisenberg"] 
nobel_prize_dates    = [1903, 1921, 1935, 1932]
famous_for           = ["spontaneous radioactivity", "general relativity", "discovery of the neutron",
                        "uncertainty principle"]
In [ ]:
#initialize dictionary
famous_physicists = {}
#TODO: build Pandas Dataframe

Exercise 2.2: Make a bar plot of the total number of Nobel prizes awarded per field. Make sure to use the 'group by' function to achieve this.

In [ ]:
#create the figure:
plt.figure(figsize=[16,6])
#group by command:
#TODO

Solutions:

Exercise 2.1 Solutions

In [ ]:
# %load solutions/exercise2.1sol

Exercise 2.2 Solutions

In [ ]:
# %load solutions/exercise2.2sol_vanilla
In [ ]:
# %load solutions/exercise2.2sol_improved

Food for thought: Is the prize in Economics more collaborative, or just more modern?

Extra: Did anyone receive the Nobel Prize more than once (based upon scraped data)?

Here's where it bites us that our original DataFrame isn't "tidy". Let's make a tidy one.

A great scientific article describing tidy data by Hadley Wickham: https://vita.had.co.nz/papers/tidy-data.pdf

In [ ]:
tidy_awards = []
for idx, row in df_awards_past.iterrows():
    for recipient in row['recipients']:
        tidy_awards.append(dict(
            recipient = recipient,
            year = row['year']))
tidy_awards_df = pd.DataFrame(tidy_awards)
tidy_awards_df

Now we can look at each recipient individually.

In [ ]:
tidy_awards_df.recipient.value_counts()
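
To keep only the repeat laureates:

In [ ]:
counts = tidy_awards_df.recipient.value_counts()
counts[counts > 1]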

End of Normal Section

Optional Further Readings

Harvard Professor Sean Eddy, in the Molecular and Cellular Biology department, teaches a great course called MCB-112: Biological Data Science. His course is difficult but a great complement to CS109A, and it is also taught in Python.

Here are a couple of resources that he referenced early in his course that helped solidify my understanding of data science.

50 Years of Data Science by David Donoho (2017)

Tidy Data by Hadley Wickham (2014)

Extra Material: Other structured data formats (JSON and CSV)

CSV

CSV is a lowest-common-denominator format for tabular data.

In [ ]:
df_awards_past.to_csv('../data/awards.csv', index=False)
with open('../data/awards.csv', 'r') as f:
    print(f.read()[:1000])

It loses some info, though: the recipients list became a plain string, and the reader needs to guess whether each column is numeric or not.

In [ ]:
pd.read_csv('../data/awards.csv').recipients.iloc[20]
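
One possible workaround (a sketch, assuming each stored string is a valid Python literal): parse the stringified lists back with ast.literal_eval.

In [ ]:
import ast

recip_str = pd.read_csv('../data/awards.csv').recipients.iloc[20]
ast.literal_eval(recip_str)  # back to a Python list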

JSON

JSON preserves structured data, but fewer data-science tools speak it.

In [ ]:
df_awards_past.to_json('../data/awards.json', orient='records')

with open('../data/awards.json', 'r') as f:
    print(f.read()[:1000])

Lists and other basic data types are preserved. (Custom data types aren't supported; you'll get an error when saving them.)

In [ ]:
pd.read_json('../data/awards.json').recipients.iloc[20]

Extra: Pickle: handy for storing data

For temporary data storage in a single version of Python, pickles will preserve your data even more faithfully, even many custom data types. But don't count on it for exchanging data or long-term storage. (In fact, don't try to load untrusted pickles -- they can run arbitrary code!)

In [ ]:
df_awards_past.to_pickle('../data/awards.pkl')
with open('../data/awards.pkl', 'rb') as f:
    print(f.read()[:200])

Yup, lots of internal Python and Pandas stuff...

In [ ]:
pd.read_pickle('../data/awards.pkl').recipients.iloc[20]

Extra: Formatted data output

Let's make a textual table of Physics laureates by year, earliest first:

In [ ]:
for idx, row in df_awards_past.sort_values('year').iterrows():
    if 'Physics' in row['title']:
        print('{}: {}'.format(
            row['year'],
            ', '.join(row['recipients'])))

Extra: Parsing JSON to get the Wayback Machine URL

We could go to http://archive.org, search for our URL, and get the URL for the archived version there. But since you'll often need to talk with APIs, let's take this opportunity to use the Wayback Machine's API. This will also give us a chance to practice working with JSON.

In [ ]:
url = "https://www.nobelprize.org/prizes/lists/all-nobel-prizes/"
# All 3 of these do the same thing. The third is my (KCA's) favorite new feature of Python 3.6.
wayback_query_url = 'http://archive.org/wayback/available?url={}'.format(url)
wayback_query_url = 'http://archive.org/wayback/available?url={url}'.format(url=url)
wayback_query_url = f'http://archive.org/wayback/available?url={url}'
r = requests.get(wayback_query_url)

We got some kind of response... what is it?

In [ ]:
r.text

Yay, JSON! It's usually pretty easy to work with JSON, once we parse it.

In [ ]:
json.loads(r.text)

Loading responses as JSON is so common that requests has a convenience method for it:

In [ ]:
response_json = r.json()
response_json

What kind of object is this?

A little Python syntax review: How can we get the snapshot URL?

In [ ]:
snapshot_url = response_json['archived_snapshots']['closest']['url']
snapshot_url