{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#
CS109A Introduction to Data Science \n",
"\n",
"## Standard Section 1: Introduction to Web Scraping\n",
"\n",
"**Harvard University**
\n",
"**Fall 2019**
\n",
"**Instructors**: Pavlos Protopapas, Kevin Rader, and Chris Tanner
\n",
"**Section Leaders**: Marios Mattheakis, Abhimanyu (Abhi) Vasishth, Robbert (Rob) Struyven
\n",
"\n",
"\n",
"\n",
"
"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## RUN THIS CELL TO GET THE RIGHT FORMATTING \n",
"import requests\n",
"from IPython.core.display import HTML\n",
"styles = requests.get(\"https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css\").text\n",
"HTML(styles)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When we're done today, you will approach messy real-world data with confidence that you can get it into a format that you can manipulate.\n",
"\n",
"Specifically, our learning objectives are:\n",
"* Understand the structure of an HTML document and use that structure to extract desired information\n",
"* Use Python data structures such as lists, dictionaries, and Pandas DataFrames to store and manipulate information\n",
"* Identify some other (semi-)structured formats commonly used for storing and transferring data, such as JSON and CSV\n",
"* Practice using [Python](https://docs.python.org/3.6/) packages such as [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) and [Pandas](https://pandas.pydata.org/pandas-docs/stable/), including how to navigate their documentation to find functionality."
]
},
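{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a tiny preview of the parsing we'll do below, here is a sketch of extracting text from a toy HTML snippet with BeautifulSoup (the snippet is made up for illustration):\n",
"\n",
"```python\n",
"from bs4 import BeautifulSoup\n",
"\n",
"# A toy HTML snippet, just for illustration\n",
"html = '<html><body><h3>1901</h3><p>Sully Prudhomme</p></body></html>'\n",
"soup = BeautifulSoup(html, 'html.parser')\n",
"print(soup.find('h3').text)  # -> 1901\n",
"```"
]
},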
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"sns.set_style(\"whitegrid\")\n",
"sns.set_context(\"notebook\")\n",
"import json\n",
"\n",
"import requests\n",
"from bs4 import BeautifulSoup\n",
"from IPython.display import HTML"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Setting up 'requests' to make HTTPS requests properly takes some extra steps... we'll skip them for now.\n",
"requests.packages.urllib3.disable_warnings()\n",
"\n",
"import warnings\n",
"warnings.filterwarnings(\"ignore\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Goals"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Is science becoming more collaborative over time? How about literature? Are there a few \"geniuses\" or lots of hard workers? One way we might answer those questions is by looking at Nobel Prizes. We could ask questions like:\n",
"\n",
"* Has anyone won a prize more than once?\n",
"* How has the total number of recipients changed over time?\n",
"* How has the number of recipients per award changed over time?\n",
"\n",
"\n",
"To answer these questions, we'll need data: *who* received *what* award *when*. \n",
"\n",
"Before we dive into acquiring this data the way we've been teaching in class, let's pause to ask: **what are 5 different approaches we could take to acquiring Nobel Prize data**?\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## When possible: find a structured dataset (.csv,.json,.xls)\n",
"\n",
"After a google search we stumble upon this [dataset on github](https://github.com/OpenRefine/OpenRefine/blob/master/main/tests/data/nobel-prize-winners.csv). It is also in the section folder named `github-nobel-prize-winners.csv`.\n",
"\n",
"We use pandas to read it: "
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" year | \n",
" discipline | \n",
" winner | \n",
" desc | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1901 | \n",
" chemistry | \n",
" Jacobus H. van 't Hoff | \n",
" in recognition of the extraordinary services h... | \n",
"
\n",
" \n",
" 1 | \n",
" 1901 | \n",
" literature | \n",
" Sully Prudhomme | \n",
" in special recognition of his poetic compositi... | \n",
"
\n",
" \n",
" 2 | \n",
" 1901 | \n",
" medicine | \n",
" Emil von Behring | \n",
" for his work on serum therapy, especially its ... | \n",
"
\n",
" \n",
" 3 | \n",
" 1901 | \n",
" peace | \n",
" Henry Dunant | \n",
" NaN | \n",
"
\n",
" \n",
" 4 | \n",
" 1901 | \n",
" peace | \n",
" Frédéric Passy | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" year discipline winner \\\n",
"0 1901 chemistry Jacobus H. van 't Hoff \n",
"1 1901 literature Sully Prudhomme \n",
"2 1901 medicine Emil von Behring \n",
"3 1901 peace Henry Dunant \n",
"4 1901 peace Frédéric Passy \n",
"\n",
" desc \n",
"0 in recognition of the extraordinary services h... \n",
"1 in special recognition of his poetic compositi... \n",
"2 for his work on serum therapy, especially its ... \n",
"3 NaN \n",
"4 NaN "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv(\"../data/github-nobel-prize-winners.csv\")\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Or you may want to read an xlsx file:\n",
"\n",
"(Potential missing package; you might need to run the following command in your terminal first: ```!conda install xlrd```)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" year | \n",
" discipline | \n",
" winner | \n",
" desc | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1901 | \n",
" chemistry | \n",
" Jacobus H. van 't Hoff | \n",
" in recognition of the extraordinary services h... | \n",
"
\n",
" \n",
" 1 | \n",
" 1901 | \n",
" literature | \n",
" Sully Prudhomme | \n",
" in special recognition of his poetic compositi... | \n",
"
\n",
" \n",
" 2 | \n",
" 1901 | \n",
" medicine | \n",
" Emil von Behring | \n",
" for his work on serum therapy, especially its ... | \n",
"
\n",
" \n",
" 3 | \n",
" 1901 | \n",
" peace | \n",
" Henry Dunant | \n",
" NaN | \n",
"
\n",
" \n",
" 4 | \n",
" 1901 | \n",
" peace | \n",
" Frédéric Passy | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" year discipline winner \\\n",
"0 1901 chemistry Jacobus H. van 't Hoff \n",
"1 1901 literature Sully Prudhomme \n",
"2 1901 medicine Emil von Behring \n",
"3 1901 peace Henry Dunant \n",
"4 1901 peace Frédéric Passy \n",
"\n",
" desc \n",
"0 in recognition of the extraordinary services h... \n",
"1 in special recognition of his poetic compositi... \n",
"2 for his work on serum therapy, especially its ... \n",
"3 NaN \n",
"4 NaN "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_excel(\"../data/github-nobel-prize-winners.xlsx\")\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## QUIZ: Did anyone recieve the Nobel Prize more than once?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**How would you check if anyone recieved more than one nobel prize?**"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Marie Curie\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"International Committee of the Red Cross\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"Linus Pauling\n",
"International Committee of the Red Cross\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"No Prize was Awarded\n",
"John Bardeen\n",
"Frederick Sanger\n",
"Office of the United Nations High Commissioner for Refugees\n"
]
}
],
"source": [
"# list storing all the names \n",
"name_winners = []\n",
"for name in df.winner:\n",
" # Check if we already encountered this name: \n",
" if name in name_winners:\n",
" # if so, print the name\n",
" print(name)\n",
" else:\n",
" # otherwhise the name to the list\n",
" name_winners.append(name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**We don't want to print \"No Prize was Awarded\" all the time.**"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Marie Curie\n",
"International Committee of the Red Cross\n",
"Linus Pauling\n",
"International Committee of the Red Cross\n",
"John Bardeen\n",
"Frederick Sanger\n",
"Office of the United Nations High Commissioner for Refugees\n"
]
}
],
"source": [
"# Your code here\n",
"winners = []\n",
"for name in df.winner:\n",
" if name in winners and name != \"No Prize was Awarded\":\n",
" print(name)\n",
" else:\n",
" winners.append(name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**How can we make this into a oneligner?**"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Marie Curie\n",
"International Committee of the Red Cross\n",
"Linus Pauling\n",
"International Committee of the Red Cross\n",
"John Bardeen\n",
"Frederick Sanger\n",
"Office of the United Nations High Commissioner for Refugees\n"
]
}
],
"source": [
"# Your code here\n",
"winners = []\n",
"[print(name) if (name in winners and name != \"No Prize was Awarded\") \n",
" else winners.append(name) for name in df.winner];"
]
},
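{
"cell_type": "markdown",
"metadata": {},
"source": [
"A pandas-native alternative is `value_counts`: drop the placeholder rows, count each name, and keep the names that appear more than once. A sketch on toy data (the toy DataFrame stands in for `df`):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy data standing in for the real dataset\n",
"toy = pd.DataFrame({'winner': ['Marie Curie', 'No Prize was Awarded',\n",
"                               'Marie Curie', 'Sully Prudhomme']})\n",
"counts = toy.winner[toy.winner != 'No Prize was Awarded'].value_counts()\n",
"print(counts[counts > 1].index.tolist())  # -> ['Marie Curie']\n",
"```"
]
},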
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Otherwhise: WEB SCRAPING"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Turns out that https://www.nobelprize.org/prizes/lists/all-nobel-prizes/ has the data we want. \n",
"\n",
"Let's take a look at the [website](https://www.nobelprize.org/prizes/lists/all-nobel-prizes/) and to look at the underhood HTML: right-click and click on `inspect` . Try to find structure in the tree-structured HTML.\n",
"\n",
"------\n",
"\n",
"But the `nobelprize.org` server is a little slow sometimes. Fortunately, the Internet Archive periodically crawls most of the Internet and saves what it finds. (That's a lot of data!) So let's grab the data from the Archive's \"Wayback Machine\" (great name!).\n",
"\n",
"We'll just give you the direct URL, but at the very end you'll see how we can get it out of a JSON response from the Wayback Machine API."
]
},
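{
"cell_type": "markdown",
"metadata": {},
"source": [
"(We return to the Wayback Machine API at the end. As a sketch of what that involves: the availability API at `http://archive.org/wayback/available` returns JSON, and the snapshot URL can be pulled out of it with a small helper. The helper name is our own; the live query is commented out since it needs the network.)\n",
"\n",
"```python\n",
"import requests\n",
"\n",
"def closest_snapshot(payload):\n",
"    # Pull the closest snapshot URL out of an availability-API response\n",
"    return payload.get('archived_snapshots', {}).get('closest', {}).get('url')\n",
"\n",
"# Live query (requires network):\n",
"# params = {'url': 'nobelprize.org/prizes/lists/all-nobel-prizes/',\n",
"#           'timestamp': '20180820'}\n",
"# payload = requests.get('http://archive.org/wayback/available', params=params).json()\n",
"# print(closest_snapshot(payload))\n",
"```"
]
},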
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"snapshot_url = 'http://web.archive.org/web/20180820111639/https://www.nobelprize.org/prizes/lists/all-nobel-prizes/'"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"snapshot = requests.get(snapshot_url)\n",
"snapshot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What is a this Response [200]? Let's google: [`response 200 meaning`](https://www.google.com/search?q=response+200+meaning&oq=response+%5B200%5D+m&aqs=chrome.1.69i57j0l5.6184j0j7&sourceid=chrome&ie=UTF-8). All possible codes [here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status)."
]
},
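{
"cell_type": "markdown",
"metadata": {},
"source": [
"Status codes are grouped by their first digit (2xx success, 4xx client error, and so on). A small helper (ours, for illustration) makes the classes explicit:\n",
"\n",
"```python\n",
"def status_class(code):\n",
"    # Broad meaning of each HTTP status-code class\n",
"    classes = {1: 'informational', 2: 'success', 3: 'redirect',\n",
"               4: 'client error', 5: 'server error'}\n",
"    return classes.get(code // 100, 'unknown')\n",
"\n",
"print(status_class(200))  # -> success\n",
"print(status_class(404))  # -> client error\n",
"```"
]
},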
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"requests.models.Response"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(snapshot)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Try to request \"www.xoogle.be\"? What happens?"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"snapshot_url2 = 'http://web.archive.org/web/20180820111639/https://www.xoogle.be'\n",
"\n",
"snapshot = requests.get(snapshot_url2)\n",
"snapshot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Always remember to “not to be evil” when scraping with requests! If downloading multiple pages (like you will be on HW1), always put a delay between requests (e.g, `time.sleep(1)`, with the `time` library) so you don’t unwittingly hammer someone’s webserver and/or get blocked."
]
},
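{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to bake that delay in is a small wrapper (a sketch; the helper and its signature are our own, with the fetching function passed in so it works with `requests.get` or any stand-in):\n",
"\n",
"```python\n",
"import time\n",
"\n",
"def polite_get(urls, fetch, delay=1.0):\n",
"    # Fetch each URL via fetch (e.g. requests.get), pausing between requests\n",
"    pages = []\n",
"    for url in urls:\n",
"        pages.append(fetch(url))\n",
"        time.sleep(delay)  # be kind to the server\n",
"    return pages\n",
"```\n",
"\n",
"Usage would look like `polite_get(urls, requests.get)`."
]
},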
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\t\n",
"\n",
"\t\n",
"\n",
"\t\n",
"\n",
" \n",
"