Key Word(s): pandas
CS109A Introduction to Data Science
Lab 01: Introduction to Web Scraping¶
Harvard University
Fall 2021
Instructors: Pavlos Protopapas and Natesh Pillai
Lab Team: Marios Mattheakis, Hayden Joy, Chris Gumb, and Eleni Kaxiras
Authors: Varshini Reddy, Marios Mattheakis and Pavlos Protopapas
## RUN THIS CELL TO GET THE RIGHT FORMATTING
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
Lab Learning Objectives¶
When we're done today, you will approach messy real-world data with confidence that you can get it into a format that you can manipulate.
Specifically, our learning objectives are:
- Understand the tree-like structure of an HTML document and use that structure to extract desired information.
- Use Python data structures such as lists and dictionaries to store and manipulate information.
- Practice using Python packages such as BeautifulSoup, including how to navigate their documentation to find functionality.
- Identify other (semi-)structured formats commonly used for storing and transferring data, such as CSV.
Pre-Requisites¶
Before you start working on the lab, we expect you to be familiar with Python programming. The following is the list of topics you need to brush up on before attending the lab session; we have provided some quick-start references as well.
# Importing necessary libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from bs4 import BeautifulSoup
import requests
import json
from IPython.display import HTML
%matplotlib inline
# Setting up 'requests' to make HTTPS requests properly takes some
# extra steps.
requests.packages.urllib3.disable_warnings()
import warnings
warnings.filterwarnings("ignore")
Lab Data Analysis Questions¶
Is science becoming more collaborative over time? How about literature? Are there a few "geniuses" or lots of hard workers? One way we might answer those questions is by looking at Nobel Prize winners. We could ask questions like:
1. Has anyone won a prize more than once?
2. How has the total number of recipients changed over time?
3. How has the number of recipients per award changed over time?
To answer these questions, we will need data: who received what award and when.
When possible: find a structured dataset (.csv, .json, .xls)¶
After a Google search we stumble upon this dataset on GitHub. It is also in the lab folder, named github-nobel-prize-winners.csv.
We use Pandas to read it. Pandas will be covered in more detail next week.
df = pd.read_csv("data/github-nobel-prize-winners.csv")
df.head()
| | year | discipline | winner | desc |
|---|---|---|---|---|
| 0 | 1901 | chemistry | Jacobus H. van 't Hoff | in recognition of the extraordinary services h... |
| 1 | 1901 | literature | Sully Prudhomme | in special recognition of his poetic compositi... |
| 2 | 1901 | medicine | Emil von Behring | for his work on serum therapy, especially its ... |
| 3 | 1901 | peace | Henry Dunant | NaN |
| 4 | 1901 | peace | Frédéric Passy | NaN |
Research Question 1: Did anyone receive the Nobel Prize more than once?¶
How would you check if anyone received more than one Nobel Prize?
We will be using Python lists for this, which is a pre-requisite for this lab as mentioned earlier. If you have any questions about lists or list comprehensions, refer to the slides from us here.
# Initialize the list storing all the names
name_winners = []
for name in df.winner:
# Check if we already encountered this name:
if name in name_winners:
# (TODO) If so, print the name
print(___)
else:
# (TODO) Otherwise append the name to the list
name_winners.append(___)
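One possible completion of the loop above (a sketch, not the only valid answer): print the repeated name, otherwise remember it.
name_winners = []
for name in df.winner:
    if name in name_winners:
        print(name)                # we have already encountered this winner
    else:
        name_winners.append(name)  # first time we see this name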
We don't want to print "No Prize was Awarded" all the time.
# List storing all the names
name_winners = []
for name in df.winner:
# (TODO) Check if we already encountered this name and the name is not "No Prize was Awarded":
if name in name_winners and name != ___ :
# (TODO) If so, print the name
print(___)
else:
# (TODO) Otherwise append the name to the list
name_winners.append(___)
We can use .split() on a string to separate the words into individual strings and store them in a list.¶
Experiment with the .split() below before using it.
UN_string = "Office of the United Nations"
print(UN_string.split())
n_words = len(UN_string.split())
print("Number of words: " + str(n_words));
['Office', 'of', 'the', 'United', 'Nations']
Number of words: 5
Let us print only those repeated winners whose names consist of no more than two words:
name_winners = []
for name in df.winner:
# (TODO) Check if we already encountered this name and the name consists of no more than 2 words:
if name in name_winners and len(___) <= 2:
# (TODO) If so, print the name
print(___)
else:
# (TODO) Otherwise append the name to the list
name_winners.append(___)
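One possible completion of this final version (in the previous version, the excluded name was the string "No Prize was Awarded"):
name_winners = []
for name in df.winner:
    # A repeated name of at most two words is likely a real repeat laureate,
    # not the 5-word placeholder "No Prize was Awarded".
    if name in name_winners and len(name.split()) <= 2:
        print(name)
    else:
        name_winners.append(name)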
Marie Curie received the Nobel Prize in Physics in 1903 and in Chemistry in 1911. She is one of only four people to receive two Nobel Prizes.
All questions, such as "did anyone receive the Nobel Prize more than once?", are easy to answer when the data is present in such a clean tabular form. However, often (if not most of the time) we do not find the data we need in such a format.
In such cases, we need to perform web scraping and cleaning to get the data we desire. The end result of this lab is to create a pandas dataframe after web scraping and cleaning.
WEB SCRAPING¶
HTML stands for Hyper Text Markup Language. It is the standard markup language for documents designed to be displayed in a web browser. It can be assisted by technologies such as Cascading Style Sheets and scripting languages such as JavaScript.
Standard HTML documents¶
HTML documents generally have the following structure:
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h1>Page Heading</h1>
<p>The first paragraph of page</p>
. . . .
</body>
</html>
What does each of these tags indicate?¶
- The <!DOCTYPE html> declaration defines that this document is an HTML5 document
- The <html> element is the root element of an HTML page
- The <head> element contains meta information about the HTML page
- The <title> element specifies a title for the HTML page (which is shown in the browser's title bar or in the page's tab)
- The <body> element defines the document's body, and is a container for all the visible contents, such as headings, paragraphs, images, hyperlinks, tables, lists, etc.
- The <h1> element defines a large heading. There are other heading tags in HTML: <h2>, <h3>, <h4>, <h5>, <h6>
- The <p> element defines a paragraph
What is an HTML Element?¶
An HTML element is defined by a start tag, some content, and an end tag:
<tagname> Content goes here... </tagname>
An example of an HTML element is as follows:
<h1> The Page Heading </h1>
WEB SCRAPING¶
The official Nobel website has the data we want, but in 2018 and 2019 the physics prize was awarded to multiple groups, so we will use an archived version of the webpage for an easier introduction to web scraping.
The Internet Archive periodically crawls most of the Internet and saves what it finds. (That's a lot of data!) So let's grab the data from the Archive's "Wayback Machine" (great name!). We've just given you the direct URL, but at the very end you'll see how we can get it out of a JSON response from the Wayback Machine API.
Let's take a look at the 2018 version of the Nobel website and inspect the HTML under the hood: right-click on the page and select Inspect. You should see something like this.
Mapping the HTML tags to the webpage¶
When you inspect, try to map each element on the webpage to its HTML.
# Here is what we get after selecting an award using the by_year class.
# We use the HTML display class to render the raw HTML.
einstein = HTML('''
<div class="by_year">
<h3>
<a href="http://web.archive.org/web/20180820111639/https://www.nobelprize.org/nobel_prizes/physics/laureates/1921/">
The Nobel Prize in Physics 1921
</a>
</h3>
<h6>
<a href="http://web.archive.org/web/20180820111639/https://www.nobelprize.org/nobel_prizes/physics/laureates/1921/einstein-facts.html">
Albert Einstein
</a>
</h6>
<p>
“for his services to Theoretical Physics, and especially for his discovery of the law of the photoelectric effect”
</p>
</div>
''')
display(einstein)
The Nobel Prize in Physics 1921
Albert Einstein
“for his services to Theoretical Physics, and especially for his discovery of the law of the photoelectric effect”
snapshot_url = 'http://web.archive.org/web/20180820111639/https://www.nobelprize.org/prizes/lists/all-nobel-prizes/'
# (TODO) make a GET request to snapshot_url
snapshot = requests.get(___)
snapshot
Response [200] is a success status code. Let's google: response 200 meaning. All possible codes here.
type(snapshot)
Try to request "www.xoogle.be". What happens?
snapshot_url2 = 'http://web.archive.org/web/20180820111639/https://www.xoogle.be'
# (TODO) make a GET request to snapshot_url2
snapshot = requests.get(___)
snapshot
Always remember to “not be evil” when scraping with requests! If downloading multiple pages (like you will be doing on HW1), always put a delay between requests (e.g., time.sleep(1), with the time library), so you do not unwittingly hammer someone's webserver and/or get blocked. A minimal sketch of this pattern follows below.
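Here is a minimal sketch of a polite multi-page download loop; the URLs below are hypothetical placeholders, not pages from this lab.
import time
import requests

# Hypothetical list of pages to fetch (placeholders, not real lab URLs)
urls = ['http://example.com/page1', 'http://example.com/page2']

pages = []
for url in urls:
    pages.append(requests.get(url).text)  # fetch one page and keep the raw HTML
    time.sleep(1)                         # wait a second so we don't hammer the server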
Let's look at the content we just scraped!
snapshot = requests.get(snapshot_url)
raw_html = snapshot.text
print(raw_html[:5000])
What makes Python special?¶
import this
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
Regular Expressions¶
You can find specific patterns or strings in text by using Regular Expressions (or re, regex, regexp): This is a pattern matching mechanism used throughout Computer Science and programming (it's not just specific to Python).
A short summary of regular expressions from us can be found here.
Some great resources that we recommend, if you are interested in them (could be very useful for a homework problem):
- https://docs.python.org/3.3/library/re.html
- https://regexone.com
- https://docs.python.org/3/howto/regex.html
Specify a specific sequence with the help of regex special characters. Some examples:
- \S : Matches any character which is not a Unicode whitespace character (spaces, tabs, newlines)
- \d : Matches any Unicode decimal digit: 0, 1, ..., 9
- * : Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible.
Let's find all the occurrences of 'Marie' in our raw_html:
import re
re.findall(r'Marie', raw_html)
Note we use an r before the string to mark it as a raw string, so backslashes are not treated as escape characters.
Using \S to match 'Marie' + ' ' + 'any character which is not a Unicode whitespace character':
re.findall(r'Marie \S',raw_html)
How would we find the last names that come after Marie?
# Your code here
Hint: The \w character represents any alphanumeric character. \w* is greedy and matches a run of such characters until the next bit of whitespace.
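One possible answer using the hint (a sketch, not the only solution):
re.findall(r'Marie \w*', raw_html)  # 'Marie ' followed by a run of alphanumeric characters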
Now, we have all our data in the notebook. Unfortunately, it is in the form of one really long string, which is hard to work with directly. This is where BeautifulSoup comes in.
Here is an example of code that grabs the first title. Regex can quickly become complex, which motivates Beautiful Soup.¶
first_title = re.findall(r'<h3><a.*>.*<\/a><\/h3>', raw_html)[0]
print(first_title)
#you can do this via regex, but it gets complicated fast! This motivates Beautiful Soup.
Parse the HTML with BeautifulSoup¶
BeautifulSoup works by parsing the raw html text into a tree. Every tag in the raw html becomes a node in the tree. We can then navigate the tree by selecting a node and querying its parent, children, siblings, etc.
soup = BeautifulSoup(raw_html, 'html.parser')
Key BeautifulSoup functions we’ll be using in this lab:
- tag.prettify(): Returns a cleaned-up version of the raw HTML, useful for printing
- tag.select(selector): Returns a list of nodes matching a CSS selector
- tag.select_one(selector): Returns the first node matching a CSS selector
- tag.text / soup.get_text(): Returns the visible text of a node (e.g., "<p>Some text</p>" -> "Some text")
- tag.contents: A list of the immediate children of this node
You can also use these functions to find nodes.
- tag.find_all(tag_name, attrs=attributes_dict): Returns a list of matching nodes
- tag.find(tag_name, attrs=attributes_dict): Returns the first matching node
BeautifulSoup is a very powerful library -- much more info here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
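As a quick illustration of these functions on a tiny, made-up HTML fragment (not the Nobel page):
from bs4 import BeautifulSoup

demo_html = '<div class="by_year"><h3><a href="#">Prize 1921</a></h3><p>motivation</p></div>'
demo = BeautifulSoup(demo_html, 'html.parser')

print(demo.select('h3'))             # list of all <h3> nodes
print(demo.select_one('h3 a').text)  # text of the first <a> inside an <h3>: 'Prize 1921'
print(demo.find('div', attrs={'class': 'by_year'}).contents)  # immediate children of the <div>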
Let's practice some BeautifulSoup commands¶
Print a cleaned-up version of the raw HTML
Which function should we use from above?
pretty_soup = soup.prettify()
print(pretty_soup[:500]) #what about negative indices?
Find the first “title” object
soup.select("title")[:50]
Extract the text of the first “heading” object given by <h3>:
soup.select_one('h3 a').text
Extracting award data¶
Let's use the structure of the HTML document to extract the data we want.
From inspecting the page in DevTools, we found that each award is in a div with a by_year class. Let's get all of them.
award_nodes = soup.select('.by_year')  # <div class="by_year">
len(award_nodes)
Let's pull out an example.
award_node = award_nodes[200]
award_node.prettify()
We use IPython's HTML display class to render the HTML below.
HTML(award_node.prettify())
Let's practice getting data out of a BS node (award_node)¶
The prize title¶
Check the html from above and note that the prize title is in the h3 tag.
award_node.select_one('h3').text
How do we separate the year from the selected prize title?
award_node.select_one('h3').text[-4:]
How do we drop the year from the title?
award_node.select_one('h3').text[:-4].strip()
Let's put them into functions:
# wrap the above code inside a function
def get_award_title(award_node):
return award_node.select_one('h3').text[___].strip()
def get_award_year(award_node):
return int(award_node.select_one('h3').text[___])
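One possible completion of these functions, filling the blanks with the slices worked out above:
def get_award_title(award_node):
    # Drop the 4-character year from the end of the heading text.
    return award_node.select_one('h3').text[:-4].strip()

def get_award_year(award_node):
    # The last 4 characters of the heading text are the year.
    return int(award_node.select_one('h3').text[-4:])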
Make a list of titles for all awards
#original code:
list_awards = []
for award_node in award_nodes:
list_awards.append(get_award_title(___))
list_awards[:50]
How can we make this into a one-liner?
We can use list comprehension
l = [f(x) for x in some_list]
which is equivalent to
l = []
for x in some_list:
element = f(x)
l.append(element)
List comprehensions are explained in the slides from us linked above.
# (TODO) use list comprehension to get a list of titles
[get_award_title(___) for award_node in award_nodes ]
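One possible completion of both versions (each blank is filled with award_node):
# Loop version:
list_awards = []
for award_node in award_nodes:
    list_awards.append(get_award_title(award_node))

# List-comprehension version:
list_awards = [get_award_title(award_node) for award_node in award_nodes]
list_awards[:50]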
The recipients¶
Check the HTML from above and note that the recipient names are inside a tags within h6 tags, i.e., the h6 a selector.
award_node.select('h6 a')
How do we handle there being more than one?
[node.text for node in award_node.select('h6 a')]
Let's encapsulate this process into a function.
def get_recipients(award_node):
return [node.text for node in award_node.select('h6 a')]
We'll leave them as a list for now, to return to this later.
This is how you would get the links: (Relevant for the homework)
[node.get("href") for node in award_node.select('h6 a')]
The prize "motivation"¶
How would you get the 'motivation'/reason of the prize from the following award_node?
award_node = award_nodes[200]
award_node
print(award_node.select('p')[0].text);
Putting everything into functions:
def get_award_motivation(award_node):
    motivation_node = award_node.select_one('p')
    if not motivation_node:  # 0, [], None, and {} all evaluate to False in a Python conditional
        return None
    return motivation_node.text
Let's create a Pandas dataframe¶
Next, we parse the collected data and create a pandas.DataFrame. A DataFrame is like a table, where each row corresponds to a data entry and each column corresponds to a feature. Once we have a DataFrame, we can easily export it to our disk in CSV, JSON, or other formats.
The easiest way to create a DataFrame is to build a list of dictionaries. Dictionaries are a pre-requisite for this lab. Refer to the slides from us here for a better understanding.
Each entry (dict) in the list is a data point, where keys are column names in the table. Let's see it in action.
awards = []
for award_node in soup.select('.by_year'):
recipients = get_recipients(award_node)
# Initialize the dictionary
award = {} #{key: value}
# Call `get_award_title` to get the title of award_node
award['title'] = get_award_title(award_node)
# Call `get_award_year` to get the year of award_node
award['year'] = get_award_year(award_node)
# Call `get_recipients` to get the list of recipients of award_node
award['recipients'] = recipients
# Count number of recipients using the built-in `len()` function
award['num_recipients'] = len(recipients)
# (TODO) call `get_award_motivation` to get the motivation of award_node
award['motivation'] = get_award_motivation(award_node)
awards.append(award)
awards[0:2]
# (TODO) convert the list of dictionaries to a pandas DataFrame
df_awards_raw = pd.DataFrame(awards)
df_awards_raw
To export the data to a local CSV file, let's use the .to_csv() method. After you run the following code, you will find scraped_awards.csv in the same directory as this notebook. You can open the file using Microsoft Excel or Numbers, but make sure you are using the UTF-8 codec.
df_awards_raw.to_csv('scraped_awards.csv')
Some quick EDA.¶
df_awards_raw.info()
df_awards_raw.year.min()
What is going on with the recipients column?
df_awards_raw.head()
Visualizing Number of Recipients by Year¶
Finally, we visualize the number of recipients for each Nobel Prize by year. Don't worry about the syntax for the moment; you'll get used to it in future exercises.
titles = set(df_awards_raw.title)
fig = plt.figure(figsize=(20, 44), dpi=100)
axes = fig.subplots(len(titles), 1)
for title, ax in zip(titles, axes):
# (TODO) select entries whose titles match `title`
plot_df = df_awards_raw[df_awards_raw.title == title]
# (TODO) plot the selected entries using bar-plot, where x-axis is year and y-axis is number of recipeints
ax.bar(___, ___, color="#97CFC4")
ax.set_title(___)
ax.set_xlabel(___)
ax.set_ylabel(___)
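One possible way to fill in the blanks in the loop above (the axis labels are our choice):
for title, ax in zip(titles, axes):
    # Select entries whose titles match `title`
    plot_df = df_awards_raw[df_awards_raw.title == title]
    # Bar-plot: x-axis is year, y-axis is number of recipients
    ax.bar(plot_df.year, plot_df.num_recipients, color="#97CFC4")
    ax.set_title(title)
    ax.set_xlabel('year')
    ax.set_ylabel('#Recipients')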
# `counter` is used to save the number of nobel prize winners every year
counter = {}
for year in range(min(df_awards_raw.year), max(df_awards_raw.year) + 1):
# (TODO) compute total number of recipients that year
count = df_awards_raw[df_awards_raw.year == year].num_recipients.sum()
counter[year] = count
fig = plt.figure(figsize=(20, 6), dpi=100)
ax = fig.add_subplot(1, 1, 1)
# (TODO) make another bar-plot, where x-axis is year and y-axis is total number of recipeints
ax.bar(___, ___, color="#97CFC4")
ax.set_title('Total Amount of Nobel Prize')
ax.set_xlabel('year')
ax.set_ylabel('#Recipients');
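One possible way to fill in the bar-plot blank above, using the years and totals stored in counter:
ax.bar(list(counter.keys()), list(counter.values()), color="#97CFC4")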
End of Normal Lab¶
Optional Further Readings¶
Here are a couple of resources that can help solidify your understanding of data science.
50 Years of Data Science by David Donoho (2017)
Tidy Data by Hadley Wickham (2014)