Harvard University Fall 2020 Instructors: Pavlos Protopapas, Kevin Rader, and Chris Tanner Section Leaders: Marios Mattheakis, Hayden Joy
In [1]:
## RUN THIS CELL TO GET THE RIGHT FORMATTING
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
# Setting up 'requests' to make HTTPS requests properly takes some extra steps... we'll skip them for now.
%matplotlib inline

requests.packages.urllib3.disable_warnings()
import warnings
warnings.filterwarnings("ignore")
Is science becoming more collaborative over time? How about literature? Are there a few "geniuses" or lots of hard workers? One way we might answer those questions is by looking at Nobel Prizes. We could ask questions like:
1) Has anyone won a prize more than once?
2) How has the total number of recipients changed over time?
3) How has the number of recipients per award changed over time?
To answer these questions, we'll need data: who received what award and when.
Before we dive into acquiring this data the way we've been teaching in class, let's pause to ask: what are 5 different approaches we could take to acquiring Nobel Prize data?
When possible: find a structured dataset (.csv, .json, .xls)¶
After a Google search we stumble upon this dataset on GitHub. It is also in the section folder, named github-nobel-prize-winners.csv.
We use pandas to read it:
In [4]:
import pandas as pd  # pandas is a very useful package

df = pd.read_csv("../data/github-nobel-prize-winners.csv")
df.head()
Out[4]:
   year  discipline                  winner                                               desc
0  1901   chemistry  Jacobus H. van 't Hoff  in recognition of the extraordinary services h...
1  1901  literature         Sully Prudhomme  in special recognition of his poetic compositi...
2  1901    medicine        Emil von Behring  for his work on serum therapy, especially its ...
3  1901       peace            Henry Dunant                                                NaN
4  1901       peace          Frédéric Passy                                                NaN
Or you may want to read an xlsx file:
(Potentially missing package: you may need to install xlrd first, either by running conda install xlrd in your terminal or by running the cell below.)
In [7]:
!conda install --yes xlrd
Collecting package metadata (current_repodata.json): done
Solving environment: done
==> WARNING: A newer version of conda exists. <==
current version: 4.7.10
latest version: 4.8.4
Please update conda by running
$ conda update -n base -c defaults conda
## Package Plan ##
environment location: /home/chris/anaconda3/envs/cs109a
added / updated specs:
- xlrd
The following NEW packages will be INSTALLED:
xlrd pkgs/main/linux-64::xlrd-1.2.0-py37_0
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
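The cell that actually reads an Excel file isn't shown in this extract; here is a minimal sketch, assuming a hypothetical file nobel-prize-winners.xlsx in the same data folder (the filename is made up for illustration):

In [ ]:
import pandas as pd

# pd.read_excel relies on the xlrd (or openpyxl) engine installed above
df_xlsx = pd.read_excel("../data/nobel-prize-winners.xlsx")
df_xlsx.head()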
Research Question 1: Did anyone receive the Nobel Prize more than once?¶
How would you check if anyone received more than one Nobel Prize?
In [ ]:
# initialize the list storing all the names
name_winners = []
for name in df.winner:
    # Check if we already encountered this name:
    if name in name_winners:
        # if so, print the name
        print(name)
    else:
        # otherwise append the name to the list
        name_winners.append(name)
We don't want to print "No Prize was Awarded" all the time.
In [ ]:
# Your code here
# list storing all the names
name_winners = []
for name in df.winner:
    # Check if we already encountered this name (and skip the "no prize" placeholder):
    if name in name_winners and name != "No Prize was Awarded":
        # if so, print the name
        print(name)
    else:
        # otherwise append the name to the list
        name_winners.append(name)
We can use .split() on a string to separate the words into individual strings and store them in a list.¶
In [ ]:
UN_string="Office of the United Nations"print(UN_string.split())#n_words = len(UN_string.split())#print("Number of words: " + str(n_words));
Even better:
In [ ]:
name_winners = []
for name in df.winner:
    # Check if we already encountered this name:
    if name in name_winners and len(name.split()) <= 2:
        # if so, print the name
        print(name)
    else:
        # otherwise append the name to the list
        name_winners.append(name)
HTML('Marie Curie received the Nobel Prize in Physics in 1903 and Chemistry in 1911. '
     'She is one of only four people to receive two Nobel Prizes.')
The first step in web scraping is to look for structure in the HTML. Let's look at a real website:¶
The official Nobel website has the data we want, but in 2018 and 2019 the physics prize was awarded to multiple groups, so we will use an archived version of the web page for an easier introduction to web scraping.
The Internet Archive periodically crawls most of the Internet and saves what it finds. (That's a lot of data!) So let's grab the data from the Archive's "Wayback Machine" (great name!). We've just given you the direct URL, but at the very end you'll see how we can get it out of a JSON response from the Wayback Machine API.
Let's take a look at the 2018 version of the Nobel website and at the HTML under the hood: right-click and choose Inspect. Try to find structure in the tree-structured HTML.
The first step of web scraping is to write down the structure of the web page¶
Here is a quick recap of HTML tags and what they do in the context of this notebook:¶
HTML tags are opened and closed as follows: <h3> some text </h3>.
Here is a list of a few tags, their definitions, and what information they contain in our problem today:
<h3> : header 3 tag. The h3 tag is a header of size 3 (header 1 is the largest). This tag will contain the title and year of the Nobel Prize, which we will parse out.
<h6> : header 6 tag. The h6 tag (smaller than header 3) will contain the prize recipients.
<p> : paragraph tag. The p tags are used for text and contain the prize motivation.
<div> : "The Content Division element (<div>) is the generic container for flow content." What we care about here is the class attribute, which we will use with Beautiful Soup to quickly parse the information we want. The class attribute can be attached to any tag.
Paying attention to tags with class attributes is key to the homework.
In [ ]:
# here is what we will get after selecting using the "by_year" class tag.
einstein = HTML('''
<div class="by_year">
  <h3>The Nobel Prize in Physics 1921</h3>
  <h6>Albert Einstein</h6>
  <p>“for his services to Theoretical Physics, and especially for his discovery
  of the law of the photoelectric effect”</p>
</div>
''')
display(einstein)
Always remember to "not be evil" when scraping with requests! If downloading multiple pages (like you will be on HW1), always put a delay between requests (e.g., time.sleep(1), using the time library) so you don't unwittingly hammer someone's webserver and/or get blocked.
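As a rough sketch of what a polite download loop could look like (the archived URL below is only a placeholder, and raw_html is simply the variable name the later cells expect):

In [ ]:
import time
import requests

# placeholder list of pages to fetch -- substitute the archived pages you actually need
urls = [
    "https://web.archive.org/web/2018/https://www.nobelprize.org/prizes/lists/all-nobel-prizes/",
]

pages = []
for page_url in urls:
    response = requests.get(page_url)
    pages.append(response.text)  # keep the raw HTML for parsing later
    time.sleep(1)                # pause between requests so we don't hammer the server

raw_html = pages[0]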
You can find specific patterns or strings in text by using regular expressions: a pattern-matching mechanism used throughout computer science and programming (it's not just specific to Python). There are some great regex resources that we recommend if you are interested; they could be very useful for a homework problem.
Specify a specific sequence with the help of regex special characters. Some examples:
\S : Matches any character which is not a Unicode whitespace character
\d : Matches any Unicode decimal digit
* : Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible.
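For instance, \d combined with a repetition operator can pull the numbers out of a string. A tiny illustration on a made-up string (the string below is just for demonstration; + means "one or more", a close cousin of *):

In [ ]:
import re

example = "The Nobel Prize in Physics 1921"
print(re.findall(r'\d+', example))  # ['1921'] -- one or more digits in a row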
Let's find all the occurrences of 'Marie' in our raw_html:
In [ ]:
import re
In [ ]:
re.findall(r'Marie', raw_html)
Using \S to match 'Marie' + ' ' + 'any character which is not a Unicode whitespace character':
In [ ]:
re.findall(r'Marie \S', raw_html)
How would we find the last names that come after Marie?
ANSWER: the \w character matches any alphanumeric character (plus the underscore). \w* greedily matches as many of those characters as possible, so 'Marie \w*' grabs everything up to the next whitespace or punctuation, i.e. the last name.
In [ ]:
# Your code here
last_names = re.findall(r'Marie \w*', raw_html)
display(last_names)
Now we have all our data in the notebook. Unfortunately, it is in the form of one really long string, which is hard to work with directly. This is where BeautifulSoup comes in.
This is an example of code that grabs the first title. Regex can quickly become complex, which motivates Beautiful Soup.¶
In [ ]:
first_title = re.findall(r'<h3.*?>.*?<\/a><\/h3>', raw_html)[0]
print(first_title)
# you can do this via regex, but it gets complicated fast! This motivates Beautiful Soup.
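The cell that builds the Beautiful Soup objects isn't included in this extract; a minimal sketch of how soup and award_nodes (used below) are presumably created, assuming raw_html holds the page source and each award sits in an element with the by_year class:

In [ ]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(raw_html, 'html.parser')

# each award lives in a node with class "by_year" (see the einstein example above)
award_nodes = soup.select('.by_year')
len(award_nodes)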
How would you get the 'motivation'/reason of the prize from the following award_node?
In [ ]:
award_node = award_nodes[200]
award_node
In [ ]:
# Your code here
print(award_node.select('p')[0].text);
Putting everything into functions:
In [ ]:
def get_award_motivation(award_node):
    award_node = award_node.select_one('p')
    if not award_node:
        # 0, [], None, and {} all evaluate to False in a Python conditional
        return None
    return award_node.text
Break Out Room 1: Practice with CSS selectors, functions, and list comprehensions¶
In [ ]:
print(award_nodes[200])
Exercise 1.1: complete the following function by assigning the proper CSS selector so that it returns a list of Nobel Prize award recipients.¶
Hint: you can specify multiple selectors separated by a space.
To load the first exercise, delete the "#" below and press Shift-Enter to run the cell¶
Clicking on "Cell" -> "Run All Above" is also very helpful for running many cells of the notebook at once.
In [ ]:
# %load exercises/exercise1.py
Exercise 1.2: Change the above function so it uses list comprehension.¶
To load the exercise, simply delete the '#' in the code below and run the cell.
In [ ]:
# %load exercises/exercise2.py
Don't look at this cell until you've given the exercise a go! It loads the correct solution.
Exercise 1.2 solution (1.1 solution is contained herein as well)¶
awards = []
for award_node in soup.select('.by_year'):
    recipients = get_recipients(award_node)
    # initialize the dictionary
    award = {}  # {key: value}
    award['title'] = get_award_title(award_node)
    award['year'] = get_award_year(award_node)
    award['recipients'] = recipients
    award['num_recipients'] = len(recipients)
    award['motivation'] = get_award_motivation(award_node)
    awards.append(award)
awards[0:2]
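Note that the helper functions get_recipients, get_award_title, and get_award_year used above are not included in this extract. A plausible sketch, assuming the h3/h6 structure described earlier (these implementations are a guess, not the official solution):

In [ ]:
def get_award_title(award_node):
    # the h3 header holds e.g. "The Nobel Prize in Physics 1921"; drop the trailing year
    return ' '.join(award_node.select_one('h3').text.split()[:-1])

def get_award_year(award_node):
    # the year is the last word of the h3 header
    return int(award_node.select_one('h3').text.split()[-1])

def get_recipients(award_node):
    # each recipient sits in its own h6 header (Exercise 1.2 asks for a list comprehension)
    return [node.text.strip() for node in award_node.select('h6')]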
OK: the 2018 awards have no recipients because this is a 2018 archived version of the Nobel Prize webpage. Some past years lack awards because none were actually awarded that year. Let's keep only the meaningful data:
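The cell that builds df_awards_past isn't included here; a minimal sketch of what it might look like (the exact filtering conditions are an assumption based on the description above):

In [ ]:
import pandas as pd

df_awards = pd.DataFrame(awards)

# keep only rows where a prize was actually awarded, and drop the empty 2018 entries
df_awards_past = df_awards[(df_awards.num_recipients > 0) & (df_awards.year < 2018)]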
import matplotlib.pyplot as plt

plt.figure(figsize=[16, 6])
plt.plot(df_awards_past.groupby('year').num_recipients.mean(), 'b', linewidth=1)
plt.title('Average number of recipients per award, by year')
plt.xlabel('Year')
plt.ylabel('Average recipients per prize')
plt.grid('on')
plt.show()
Check out the years 1940-43. Any comment?
Any trends over the last 25 years?
In [ ]:
set(df_awards_past.title)
In [ ]:
plt.figure(figsize=[16, 6])
i = 0
for award in set(df_awards_past.title):
    i += 1
    year = df_awards_past[df_awards_past['title'] == award].year
    recips = df_awards_past[df_awards_past['title'] == award].num_recipients
    index = year > 2020 - 25
    years_filtered = year[index].values
    recips_filtered = recips[index].values
    plt.subplot(2, 3, i)
    plt.bar(years_filtered, recips_filtered, color='b', alpha=0.7)
    plt.title(award)
    plt.xlabel('Year')
    plt.ylabel('Number of Recipients')
    plt.ylim(0, 3)
    plt.tight_layout()
A cleaner way to iterate and keep tabs: the enumerate() function¶
'How has the number of recipients per award changed over time?'¶
In [ ]:
# The enumerate function allows us to delete two lines of code.
# The number of years shown is increased to 75 so we can see the trend.
plt.figure(figsize=[16, 6])
for i, award in enumerate(set(df_awards_past.title), 1):  # <--- enumerate
    year = df_awards_past[df_awards_past['title'] == award].year
    recips = df_awards_past[df_awards_past['title'] == award].num_recipients
    index = year > 2019 - 75  # <--- extend the range
    years_filtered = year[index].values
    recips_filtered = recips[index].values
    # plot:
    plt.subplot(2, 3, i)  # arguments (nrows, ncols, index)
    plt.bar(years_filtered, recips_filtered, color='b', alpha=0.7)
    plt.title(award)
    plt.xlabel('Year')
    plt.ylabel('Number of Recipients')
    plt.ylim(0, 3)
    plt.tight_layout()
Harvard Professor Sean Eddy, of the Molecular and Cellular Biology department, teaches a great course called MCB-112: Biological Data Science. His course is difficult, but it is a great complement to CS109A and is also taught in Python.
Here are a couple of resources that he referenced early in his course that helped solidify my understanding of data science.
For temporary data storage in a single version of Python, pickle will preserve your data even more faithfully, including many custom data types. But don't count on it for exchanging data or long-term storage. (In fact, don't load untrusted pickles -- they can run arbitrary code!)
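A minimal sketch of saving and re-loading an object with pickle (the filename is arbitrary):

In [ ]:
import pickle

# save the list of award dictionaries to disk
with open('awards.pkl', 'wb') as f:
    pickle.dump(awards, f)

# ...and load it back later (only do this with files you created yourself!)
with open('awards.pkl', 'rb') as f:
    awards_restored = pickle.load(f)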
Extra: Parsing JSON to get the Wayback Machine URL¶
We could go to http://archive.org, search for our URL, and get the URL for the archived version there. But since you'll often need to talk with APIs, let's take this opportunity to use the Wayback Machine's API. This will also give us a chance to practice working with JSON.
In [ ]:
url="https://www.nobelprize.org/prizes/lists/all-nobel-prizes/"# All 3 of these do the same thing. The third is my (KCA's) favorite new feature of Python 3.6.wayback_query_url='http://archive.org/wayback/available?url={}'.format(url)wayback_query_url='http://archive.org/wayback/available?url={url}'.format(url=url)wayback_query_url=f'http://archive.org/wayback/available?url={url}'r=requests.get(wayback_query_url)
We got some kind of response... what is it?
In [ ]:
r.text
Yay, JSON! It's usually pretty easy to work with JSON, once we parse it.
In [ ]:
import json

json.loads(r.text)
Loading responses as JSON is so common that requests has a convenience method for it:
In [ ]:
response_json = r.json()
response_json
What kind of object is this?
A little Python syntax review: How can we get the snapshot URL?
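The availability API nests the snapshot under archived_snapshots -> closest (you can confirm the key names in the JSON printed above); a short sketch of drilling down to it:

In [ ]:
# drill down through the nested dictionaries to reach the snapshot URL
snapshot_url = response_json['archived_snapshots']['closest']['url']
print(snapshot_url)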