Key Word(s): data scraping, beautifulsoup, pandas, matplotlib
CS109A Introduction to Data Science
Standard Section 1: Introduction to Web Scraping¶
Harvard University
Fall 2020
Instructors: Pavlos Protopapas, Kevin Rader, and Chris Tanner
Section Leaders: Marios Mattheakis, Hayden Joy
## RUN THIS CELL TO GET THE RIGHT FORMATTING
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
Section Learning Objectives¶
When we're done today, you will approach messy real-world data with confidence that you can get it into a format that you can manipulate.
Specifically, our learning objectives are:
- Understand the tree-like structure of an HTML document and use that structure to extract desired information
- Use Python data structures such as lists, dictionaries, and Pandas DataFrames to store and manipulate information
- Practice using Python packages such as BeautifulSoup and Pandas, including how to navigate their documentation to find functionality
- Identify some other (semi-)structured formats commonly used for storing and transferring data, such as JSON and CSV
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from bs4 import BeautifulSoup
import requests
import json
from IPython.display import HTML
# Setting up 'requests' to make HTTPS requests properly takes some extra steps... we'll skip them for now.
%matplotlib inline
requests.packages.urllib3.disable_warnings()
import warnings
warnings.filterwarnings("ignore")
Section Data Analysis Questions¶
Is science becoming more collaborative over time? How about literature? Are there a few "geniuses" or lots of hard workers? One way we might answer those questions is by looking at Nobel Prizes. We could ask questions like:
- 1) Has anyone won a prize more than once?
- 2) How has the total number of recipients changed over time?
- 3) How has the number of recipients per award changed over time?
To answer these questions, we'll need data: who received what award and when.
Before we dive into acquiring this data the way we've been teaching in class, let's pause to ask: what are 5 different approaches we could take to acquiring Nobel Prize data?
When possible: find a structured dataset (.csv, .json, .xls)¶
After a Google search we stumble upon this dataset on GitHub. It is also in the section folder, named github-nobel-prize-winners.csv.
We use pandas to read it:
df = pd.read_csv("../data/github-nobel-prize-winners.csv")
df.head() #pandas is a very useful package
Or you may want to read an xlsx file:
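(Potentially missing package: you may need to install an Excel reader first. From a terminal, run conda install xlrd; inside the notebook, prefix the command with "!" as in the cell below. Recent pandas versions read .xlsx files with openpyxl instead, since xlrd 2.0 and later only supports .xls.)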
!conda install --yes xlrd
df = pd.read_excel("../data/github-nobel-prize-winners.xlsx")
df.tail()
Introducing types¶
#type(df.winner)
#type(df)
Research Question 1: Did anyone receive the Nobel Prize more than once?¶
How would you check whether anyone received more than one Nobel Prize?
# initialize the list storing all the names
name_winners = []
for name in df.winner:
    # Check if we already encountered this name:
    if name in name_winners:
        # if so, print the name
        print(name)
    else:
        # otherwise append the name to the list
        name_winners.append(name)
We don't want to print "No Prize was Awarded" every time it repeats.
# Your code here
# list storing all the names
name_winners = []
for name in df.winner:
    # Check if we already encountered this name, skipping the placeholder entry:
    if name in name_winners and name != "No Prize was Awarded":
        # if so, print the name
        print(name)
    else:
        # otherwise append the name to the list
        name_winners.append(name)
We can use .split() on a string to separate the words into individual strings and store them in a list.¶
UN_string = "Office of the United Nations"
print(UN_string.split())
#n_words = len(UN_string.split())
#print("Number of words: " + str(n_words));
Even better:
name_winners = []
for name in df.winner:
    # Check if we already encountered this name:
    if name in name_winners and len(name.split()) <= 2:
        # if so, print the name
        print(name)
    else:
        # otherwise append the name to the list
        name_winners.append(name)
How can we make this into a one-liner?
List comprehension form: [f(x) for x in list]
winners = []
[print(name) if (name in winners and len(name.split()) <= 2)
else winners.append(name) for name in df.winner];
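The comprehension above is used only for its side effects (printing and appending). A minimal alternative sketch, assuming a plain list of names in place of df.winner (the sample names below are illustrative, not read from the dataset), collects the repeat winners into a list and tracks the names seen so far in a set:

```python
# Illustrative stand-in for df.winner; these names are sample data only.
sample_winners = [
    "Marie Curie",
    "Linus Pauling",
    "No Prize was Awarded",
    "Marie Curie",
    "No Prize was Awarded",
    "Linus Pauling",
]

seen = set()          # names encountered so far (set lookups are fast)
repeat_winners = []   # names seen more than once, in order of first repeat

for name in sample_winners:
    is_person = len(name.split()) <= 2  # same two-word heuristic as above
    if name in seen and is_person and name not in repeat_winners:
        repeat_winners.append(name)
    seen.add(name)

print(repeat_winners)
```

Returning a list instead of printing makes the result reusable later, and the set avoids rescanning a growing list for every name.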
HTML('<p>Marie Curie received the Nobel Prize in Physics in 1903 and in Chemistry in 1911.</p>'
     '<p>As of 2020, she was one of only four people to have received two Nobel Prizes.</p>')
Part 2: WEB SCRAPING¶
The first step in web scraping is to look for structure in the HTML. Let's look at a real website:¶
The official Nobel website has the data we want, but in 2018 and 2019 the physics prize was awarded to multiple groups so we will use an archived version of the web-page for an easier introduction to web scraping.
The Internet Archive periodically crawls most of the Internet and saves what it finds. (That's a lot of data!) So let's grab the data from the Archive's "Wayback Machine" (great name!). We've just given you the direct URL, but at the very end you'll see how we can get it out of a JSON response from the Wayback Machine API.
Let's take a look at the 2018 version of the Nobel website and at the underlying HTML: right-click and click on Inspect. Try to find structure in the tree-structured HTML.
Play around! (give floor to the students)
###################################################
The first step of web scraping is to write down the structure of the web page¶
Here is a quick recap of HTML tags and what they do in the context of this notebook:¶
HTML tags are opened and closed as follows: <h3> some text </h3>.
Here is a list of a few tags, their definitions, and what information they contain in our problem today:
- <p>: paragraph tag
- <h3>: header 3 tag. This is a header of size 3 (header 1 is the largest). It will contain the title and year of the Nobel Prize, which we will parse out.
- <h6>: header 6 tag. This header (smaller than header 3) will contain the prize recipients.
- <div>: "Content Division element", a generic container used to group other elements.
Paying attention to tags with class attributes is key to the homework.
# Here is what we get after selecting nodes by the by_year class.
einstein = HTML('\
snapshot_url = 'http://web.archive.org/web/20180820111639/https://www.nobelprize.org/prizes/lists/all-nobel-prizes/'
snapshot = requests.get(snapshot_url)
snapshot
Response [200] is a success status code. Let's google: "response 200 meaning". All possible codes here.
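Status codes can also be looked up without leaving Python. A small sketch using the standard library's http.HTTPStatus; the helper name is_success is hypothetical and introduced only for this illustration. With requests, the integer code of a response is available as its status_code attribute.

```python
from http import HTTPStatus

# Look up the standard reason phrase for a few common status codes.
phrases = {}
for code in (200, 404, 500):
    phrases[code] = HTTPStatus(code).phrase
    print(code, phrases[code])

def is_success(code):
    """Return True for 2xx codes, which indicate a successful request."""
    return 200 <= code < 300

print(is_success(200), is_success(404))
```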
type(snapshot)
Try to request "www.xoogle.be". What happens?
snapshot_url2 = 'http://web.archive.org/web/20180820111639/https://www.xoogle.be'
snapshot = requests.get(snapshot_url2)
snapshot
Always remember "not to be evil" when scraping with requests! If you are downloading multiple pages (as you will on HW1), always put a delay between requests (e.g., time.sleep(1), with the time library) so you don't unwittingly hammer someone's web server and/or get blocked.
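The delay advice can be sketched as a small helper. This is one possible shape, not a prescribed homework solution: the names fetch_all and fake_fetch are hypothetical, and the demo uses a stand-in fetch function so that it runs without network access. In the notebook you would pass requests.get as the fetch argument.

```python
import time

def fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # be polite: wait before the next request
        results.append(fetch(url))
    return results

# Stand-in for requests.get so the demo needs no network connection.
def fake_fetch(url):
    return "contents of " + url

pages = fetch_all(["page1", "page2", "page3"], fake_fetch, delay_seconds=0.01)
print(pages)
```

Sleeping only between requests (not before the first one) keeps a single-page download as fast as a plain call.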
snapshot = requests.get(snapshot_url)
raw_html = snapshot.text
print(raw_html[:500])  # print only the first 500 characters
Regular Expressions¶
You can find specific patterns or strings in text by using regular expressions, a pattern-matching mechanism used throughout computer science and programming (it's not specific to Python). Some great resources that we recommend, if you are interested (they could be very useful for a homework problem):
- https://docs.python.org/3.3/library/re.html
- https://regexone.com
- https://docs.python.org/3/howto/regex.html.
Specify a specific sequence with the help of regex special characters. Some examples:
- \S : Matches any character which is not a Unicode whitespace character
- \d : Matches any Unicode decimal digit
- * : Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible.
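A quick sketch of these three special characters on a short, hand-written string (an illustrative sentence, not the scraped HTML):

```python
import re

text = "Marie Curie 1903, Marie Curie 1911"  # illustrative string only

digits = re.findall(r"\d", "In 1903")    # \d matches one digit at a time
years = re.findall(r"\d\d\d\d", text)    # four digits in a row
surnames = re.findall(r"C\S*", text)     # 'C' followed by any non-whitespace run
greedy = re.findall(r"19\S*", text)      # \S also matches punctuation

print(digits)
print(years)
print(surnames)
print(greedy)
```

Note that the last pattern picks up the trailing comma after 1903, because a comma is not whitespace.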
Let's find all the occurrences of 'Marie' in our raw_html:
import re
re.findall(r'Marie', raw_html)
Using \S to match 'Marie' + ' ' + 'any character which is not a Unicode whitespace character':
re.findall(r'Marie \S',raw_html)
How would we find the last names that come after Marie?
ANSWER: \w matches any alphanumeric character (or an underscore). \w* is greedy: it keeps matching such characters until it reaches one that is not, such as whitespace or punctuation.
# Your code here
last_names = re.findall(r'Marie \w*', raw_html)
display(last_names)
Now, we have all our data in the notebook. Unfortunately, it is in the form of one really long string, which is hard to work with directly. This is where BeautifulSoup comes in.
This is an example of code that grabs the first title. Regex can quickly become complex, which motivates BeautifulSoup.¶
first_title = re.findall(r'.*<\/a><\/h3>', raw_html)[0]
print(first_title)
#you can do this via regex, but it gets complicated fast! This motivates Beautiful Soup.
Parse the HTML with BeautifulSoup¶
soup = BeautifulSoup(raw_html, 'html.parser')
Key BeautifulSoup functions we’ll be using in this section:
- tag.prettify(): Returns cleaned-up version of raw HTML, useful for printing
- tag.select(selector): Return a list of nodes matching a CSS selector
- tag.select_one(selector): Return the first node matching a CSS selector
- tag.text/soup.get_text(): Returns visible text of a node (e.g., "<p>Some text</p>" -> "Some text")
- tag.contents: A list of the immediate children of this node
You can also use these functions to find nodes.
- tag.find_all(tag_name, attrs=attributes_dict): Returns a list of matching nodes
- tag.find(tag_name, attrs=attributes_dict): Returns first matching node
BeautifulSoup is a very powerful library -- much more info here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
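To see several of these functions side by side, here is a minimal sketch on a tiny hand-written snippet. The HTML, the href values, and the toy_ names are illustrative assumptions; only the tag layout (a by_year div with h3, h6 and p children) mirrors what this section describes.

```python
from bs4 import BeautifulSoup

# A tiny hand-written HTML snippet (illustrative; not the real Nobel page).
toy_html = """
<div class="by_year">
  <h3><a href="/physics-1921">The Nobel Prize in Physics 1921</a></h3>
  <h6><a href="/einstein">Albert Einstein</a></h6>
  <p>for his services to Theoretical Physics</p>
</div>
"""

toy_soup = BeautifulSoup(toy_html, "html.parser")

first_div = toy_soup.select_one(".by_year")               # first node with class by_year
title_text = first_div.select_one("h3 a").text            # visible text of the link
names = [a.text for a in first_div.select("h6 a")]        # select() returns a list
link = first_div.find("a", attrs={"href": "/einstein"})   # find by attribute

print(title_text)
print(names)
print(link.get("href"))
```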
Let's practice some BeautifulSoup commands...¶
Print a cleaned-up version of the raw HTML. Which function should we use from above?
pretty_soup = soup.prettify()
print(pretty_soup[:500]) #what about negative indices?
Find the first “title” object
# Your code here
soup.select_one("h3 a")  # select_one returns only the first match
Extract the text of first “title” object
#Your code here
Extracting award data¶
Let's use the structure of the HTML document to extract the data we want.
From inspecting the page in DevTools, we found that each award is in a div with a by_year class. Let's get all of them.
award_nodes = soup.select('.by_year') #
len(award_nodes)
Let's pull out an example.
award_node = award_nodes[200]
HTML(award_node.prettify())
Magic commands:
# show ls, tree, mkdir
Let's practice getting data out of a BS Node¶
The prize title¶
award_node.select_one('h3').text
How do we separate the year from the selected prize title?
# %load solutions/sol2.py
award_node.select_one('h3').text[:]
How do we drop the year from the title?
award_node.select_one('h3').text[:].strip()
Let's put them into functions:
# %load solutions/sol_functions.py
def get_award_title(award_node):
    return award_node.select_one('h3').text[:-4].strip()
def get_award_year(award_node):
    return int(award_node.select_one('h3').text[-4:])
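The slicing used in these two functions can be tried on a plain string first. A minimal sketch, assuming every heading ends with a four-digit year; split_title_and_year is a hypothetical helper name, and unlike the functions above it strips surrounding whitespace before slicing:

```python
def split_title_and_year(heading):
    """Split a heading like 'The Nobel Prize in Physics 1921' into (title, year)."""
    heading = heading.strip()
    title = heading[:-4].strip()   # everything before the last four characters
    year = int(heading[-4:])       # the last four characters, as an integer
    return title, year

title, year = split_title_and_year("  The Nobel Prize in Physics 1921 ")
print(title)
print(year)
```

Negative indices count from the end of the string, which is why [-4:] isolates the year regardless of how long the title is.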
Make a list of titles for all awards
#original code:
list_awards = []
for award_node in award_nodes:
    list_awards.append(get_award_title(award_node))
list_awards
Let's use list comprehension:
# Your code here
[get_award_title(award_node) for award_node in award_nodes ]
The recipients¶
How do we handle there being more than one?
award_node.select('h6 a')
[node.text for node in award_node.select('h6 a')]
We'll leave them as a list for now, to return to this later.
This is how you would get the links: (Relevant for the homework)
[state_node.get("href") for state_node in award_node.select('h6 a')]
The prize "motivation"¶
How would you get the 'motivation'/reason of the prize from the following award_node?
award_node = award_nodes[200]
award_node
# Your code here
print(award_node.select('p')[0].text);
Putting everything into functions:
def get_award_motivation(award_node):
    award_node = award_node.select_one('p')
    if not award_node:  # 0, [], None, and {} all evaluate to False in a Python conditional
        return None
    return award_node.text
Break Out Room 1: Practice with CSS selectors, Functions and list comprehension¶
print(award_nodes[200])
Exercise 1.1: complete the following function by assigning the proper CSS-selector so that it returns a list of nobel prize award recipients.¶
Hint: you can specify multiple selectors separated by a space.
To load the first exercise, delete the "#" and press Shift-Enter to run the cell.¶
Clicking on "Cell" -> "Run All Above" is also a handy way to run many cells of the notebook at once.
# %load exercises/exercise1.py
Exercise 1.2: Change the above function so it uses list comprehension.¶
To load the exercise, simply delete the '#' in the code below and run the cell.
# %load exercises/exercise2.py
Don't look at this cell until you've given the exercise a go! It loads the correct solution.
Exercise 1.2 solution (1.1 solution is contained herein as well)¶
# %load solutions/breakoutsol1.py
%run ./solutions/breakoutsol1.py
Let's create a Pandas dataframe¶
Now let's get all of the awards.
awards = []
for award_node in soup.select('.by_year'):
    recipients = get_recipients(award_node)
    #initialize the dictionary
    award = {} #{key: value}
    award['title'] = get_award_title(award_node)
    award['year'] = get_award_year(award_node)
    award['recipients'] = recipients
    award['num_recipients'] = len(recipients)
    award['motivation'] = get_award_motivation(award_node)
    awards.append(award)
awards[0:2]
df_awards_raw = pd.DataFrame(awards)
#explain open brackets
df_awards_raw
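To see how a list of dictionaries maps onto a DataFrame, here is a minimal sketch with two hand-written records. The values and the toy_ names are illustrative, not scraped output; each dictionary becomes a row and each key becomes a column.

```python
import pandas as pd

# Two hand-written records shaped like the dictionaries built in the loop above
# (illustrative values, not scraped output).
toy_awards = [
    {"title": "The Nobel Prize in Physics", "year": 1903,
     "recipients": ["Henri Becquerel", "Pierre Curie", "Marie Curie"],
     "num_recipients": 3},
    {"title": "The Nobel Prize in Chemistry", "year": 1911,
     "recipients": ["Marie Curie"],
     "num_recipients": 1},
]

toy_df = pd.DataFrame(toy_awards)   # one row per dict, one column per key
print(toy_df.columns.tolist())
print(toy_df.shape)
```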
Some quick EDA.¶
df_awards_raw.info()
df_awards_raw.year.min()
What is going on with the recipients column?
df_awards_raw.head()
df_awards_raw.num_recipients.value_counts()
Now let's take a look at num_recipients
df_awards_raw.num_recipients == 0
df_awards_raw[df_awards_raw.num_recipients == 0]
Ok: 2018 awards have no recipients because this is an archived version of the Nobel Prize webpage from August 2018, saved before that year's prizes were announced. Some past years lack awards because none were actually awarded that year. Let's keep only meaningful data:
df_awards_past = df_awards_raw[df_awards_raw.year != 2018]
df_awards_past.info()
Hm, motivation has a different number of items... why?
df_awards_past[df_awards_past.motivation.isnull()]
Looks like it's fine that those motivations were missing.
Sort the awards by year.
df_awards_past.sort_values('year').head()
How many awards of each type were given?¶
df_awards_past.title.value_counts()
But wait, that includes the years the awards weren't offered.
df_awards_actually_offered = df_awards_past[df_awards_past.num_recipients > 0]
df_awards_actually_offered.title.value_counts()
When was each award first given?¶
df_awards_actually_offered.groupby('title').year
df_awards_actually_offered.groupby('title').year.describe() # we will use this information later!
How many recipients per year?¶
Let's include the years with missing awards; if we were to analyze further, we'd have to decide whether to include them.
A good plot that clearly reveals patterns in the data is very important. Is this a good plot or not?
df_awards_past.plot.scatter(x='year', y='num_recipients') #explain scatterplot
It's hard to see a trend when there are multiple observations per year (why?).
Let's try looking at total num recipients by year.
Let's explore how important a good plot can be.
df_awards_past.groupby('year').num_recipients.sum()
plt.figure(figsize=[16,6])
plt.plot(df_awards_past.groupby('year').num_recipients.sum(), 'b', linewidth=1)  # plot the yearly total
plt.title('Total Nobel Prize recipients per year')
plt.xlabel('Year')
plt.ylabel('Total recipients')
plt.grid(True)
plt.show()
Check out the years 1940-43? Any comment?
Any trends the last 25 years?
set(df_awards_past.title)
plt.figure(figsize=[16,6])
i = 0
for award in set(df_awards_past.title):
    i += 1
    year = df_awards_past[df_awards_past['title']==award].year
    recips = df_awards_past[df_awards_past['title']==award].num_recipients
    index = year > 2020 - 25
    years_filtered = year[index].values
    recips_filtered = recips[index].values
    plt.subplot(2,3,i)
    plt.bar(years_filtered, recips_filtered, color='b', alpha = 0.7)
    plt.title(award)
    plt.xlabel('Year')
    plt.ylabel('Number of Recipients')
    plt.ylim(0, 3)
plt.tight_layout()
A cleaner way to iterate and keep tabs: the enumerate( ) function¶
'How has the number of recipients per award changed over time?'¶
# The enumerate function allows us to delete two lines of code
# The number of years shown is increased to 75 so we can see the trend.
plt.figure(figsize=[16,6])
for i, award in enumerate(set(df_awards_past.title), 1): ################### <--- enumerate
    year = df_awards_past[ df_awards_past['title'] == award].year
    recips = df_awards_past[ df_awards_past['title'] == award].num_recipients
    index = year > 2019 - 75 ########################### <--- extend the range
    years_filtered = year[index].values
    recips_filtered = recips[index].values
    #plot:
    plt.subplot(2, 3, i) #arguments (nrows, ncols, index)
    plt.bar(years_filtered, recips_filtered, color='b', alpha = 0.7)
    plt.title(award)
    plt.xlabel('Year')
    plt.ylabel('Number of Recipients')
    plt.ylim(0, 3)
plt.tight_layout()
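Stripped of the plotting, the difference between the two loops comes down to the following sketch (the list of fields is an illustrative stand-in for the award titles):

```python
fields = ["Physics", "Chemistry", "Peace"]  # illustrative stand-in for the award titles

# Manual counter, as in the first version of the loop.
manual = []
i = 0
for field in fields:
    i += 1
    manual.append((i, field))

# enumerate(..., start=1) yields the same (index, item) pairs.
with_enumerate = list(enumerate(fields, start=1))

print(manual)
print(with_enumerate)
```

One design note: iterating over a set of strings does not guarantee the same order from one Python session to the next, so wrapping the set in sorted(...) is a simple way to keep each award in the same subplot position every time.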
Break Out Room II: Dictionaries, dataframes, and Pyplot¶
Exercise 2.1 (practice creating a dataframe): Build a dataframe of famous physicists from the following lists. ¶
Your dataframe should have the following columns: "name", "year_prize_awarded" and "famous_for".
famous_award_winners = ["Marie Curie", "Albert Einstein", "James Chadwick", "Werner Karl Heisenberg"]
nobel_prize_dates = [1903, 1921, 1935, 1932]  # year of each laureate's Physics prize
famous_for = ["spontaneous radioactivity", "general relativity", "discovery of the neutron",
              "uncertainty principle"]
#initialize dictionary
famous_physicists = {}
#TODO: build Pandas Dataframe
Exercise 2.2: Make a bar plot of the total number of Nobel prizes awarded per field. Make sure to use the 'group by' function to achieve this.¶
#create the figure:
plt.figure(figsize=[16,6])
#group by command:
#TODO
# %load solutions/exercise2.1sol
Exercise 2.2 Solutions¶
# %load solutions/exercise2.2sol_vanilla
# %load solutions/exercise2.2sol_improved
Food for thought: Is the prize in Economics more collaborative, or just more modern?
Extra: Did anyone receive the Nobel Prize more than once (based upon scraped data)?¶
Here's where it bites us that our original DataFrame isn't "tidy". Let's make a tidy one.
A great scientific article describing tidy data by Hadley Wickham: https://vita.had.co.nz/papers/tidy-data.pdf
tidy_awards = []
for idx, row in df_awards_past.iterrows():
    for recipient in row['recipients']:
        tidy_awards.append(dict(
            recipient = recipient,
            year = row['year']))
tidy_awards_df = pd.DataFrame(tidy_awards)
tidy_awards_df
Now we can look at each recipient individually.
tidy_awards_df.recipient.value_counts()
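As an alternative to the explicit loop, pandas (version 0.25 or later) offers DataFrame.explode, which turns each list element into its own row. A minimal sketch on a hand-written frame; the toy data is illustrative, and the dropna step is needed because explode turns an empty list into a missing value rather than dropping the row:

```python
import pandas as pd

# Illustrative frame with a list-valued column, including one empty list.
toy = pd.DataFrame({
    "year": [1903, 1911, 1940],
    "recipients": [["Pierre Curie", "Marie Curie"], ["Marie Curie"], []],
})

tidy = (
    toy.explode("recipients")                      # one row per list element
       .dropna(subset=["recipients"])              # empty lists become NaN rows; drop them
       .rename(columns={"recipients": "recipient"})
       .reset_index(drop=True)
)
counts = tidy.recipient.value_counts()

print(tidy)
print(counts)
```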
End of Normal Section¶
Optional Further Readings¶
Harvard Professor Sean Eddy, in the Molecular and Cellular Biology department, teaches a great course called MCB-112: Biological Data Science. His course is difficult but a great complement to CS109A and is also taught in Python.
Here are a couple resources that he referenced early in his course that helped solidify my understanding of data science.
- 50 Years of Data Science by Dave Donoho (2017)
- Tidy data by Hadley Wickham (2014)
Extra Material: Other structured data formats (JSON and CSV)¶
CSV¶
CSV is a lowest-common-denominator format for tabular data.
df_awards_past.to_csv('../data/awards.csv', index=False)
with open('../data/awards.csv', 'r') as f:
    print(f.read()[:1000])
It loses some info, though: the recipients list became a plain string, and the reader needs to guess whether each column is numeric or not.
pd.read_csv('../data/awards.csv').recipients.iloc[20]
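The lost list can be recovered by parsing the string back into a Python object. A minimal sketch using only the standard library, with an in-memory CSV in place of awards.csv (the row contents are illustrative); ast.literal_eval evaluates literals such as lists and strings without running arbitrary code:

```python
import ast
import csv
import io

# Write one row to an in-memory CSV, then read it back.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["year", "recipients"])
writer.writerow([1903, ["Pierre Curie", "Marie Curie"]])

buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw = rows[0]["recipients"]       # a plain string after the round trip
parsed = ast.literal_eval(raw)    # safely evaluate the list literal

print(type(raw).__name__, raw)
print(type(parsed).__name__, parsed)
```

With pandas, the same idea can be applied while reading, for example by passing converters={'recipients': ast.literal_eval} to pd.read_csv.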
JSON¶
JSON preserves structured data, but fewer data-science tools speak it.
df_awards_past.to_json('../data/awards.json', orient='records')
with open('../data/awards.json', 'r') as f:
    print(f.read()[:1000])
Lists and other basic data types are preserved. (Custom data types aren't preserved, but you'll get an error when saving.)
pd.read_json('../data/awards.json').recipients.iloc[20]
Extra: Pickle: handy for storing data¶
For temporary data storage in a single version of Python, pickles will preserve your data even more faithfully, even many custom data types. But don't count on it for exchanging data or long-term storage. (In fact, don't try to load untrusted pickles -- they can run arbitrary code!)
df_awards_past.to_pickle('../data/awards.pkl')
with open('../data/awards.pkl', 'rb') as f:  # pickle files are binary, so open in 'rb' mode
    print(f.read()[:200])
Yup, lots of internal Python and Pandas stuff...
pd.read_pickle('../data/awards.pkl').recipients.iloc[20]
Extra: Formatted data output¶
Let's make a textual table of Physics laureates by year, earliest first:
for idx, row in df_awards_past.sort_values('year').iterrows():
    if 'Physics' in row['title']:
        print('{}: {}'.format(
            row['year'],
            ', '.join(row['recipients'])))
Extra: Parsing JSON to get the Wayback Machine URL¶
We could go to http://archive.org, search for our URL, and get the URL for the archived version there. But since you'll often need to talk with APIs, let's take this opportunity to use the Wayback Machine's API. This will also give us a chance to practice working with JSON.
url = "https://www.nobelprize.org/prizes/lists/all-nobel-prizes/"
# All 3 of these do the same thing. The third is my (KCA's) favorite new feature of Python 3.6.
wayback_query_url = 'http://archive.org/wayback/available?url={}'.format(url)
wayback_query_url = 'http://archive.org/wayback/available?url={url}'.format(url=url)
wayback_query_url = f'http://archive.org/wayback/available?url={url}'
r = requests.get(wayback_query_url)
We got some kind of response... what is it?
r.text
Yay, JSON! It's usually pretty easy to work with JSON, once we parse it.
json.loads(r.text)
Loading responses as JSON is so common that requests has a convenience method for it:
response_json = r.json()
response_json
What kind of object is this?
A little Python syntax review: How can we get the snapshot URL?
snapshot_url = response_json['archived_snapshots']['closest']['url']
snapshot_url
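Chained square-bracket lookups raise a KeyError if any level is missing, for example when no snapshot exists for a URL. A hedged sketch of a more defensive lookup: the sample text below is hand-written to mirror the nesting used above (its values are modeled on the snapshot URL from earlier in this section, not a captured API response), and closest_snapshot_url is a hypothetical helper name.

```python
import json

# Hand-written sample with the same nesting as the lookup above.
sample_text = """
{
  "url": "https://www.nobelprize.org/prizes/lists/all-nobel-prizes/",
  "archived_snapshots": {
    "closest": {
      "available": true,
      "status": "200",
      "timestamp": "20180820111639",
      "url": "http://web.archive.org/web/20180820111639/https://www.nobelprize.org/prizes/lists/all-nobel-prizes/"
    }
  }
}
"""

def closest_snapshot_url(response_dict):
    """Return the closest snapshot URL, or None if no snapshot is listed."""
    closest = response_dict.get("archived_snapshots", {}).get("closest")
    if not closest:
        return None
    return closest.get("url")

parsed = json.loads(sample_text)   # JSON objects become Python dicts
print(type(parsed).__name__)
print(closest_snapshot_url(parsed))
print(closest_snapshot_url({"archived_snapshots": {}}))
```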