{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# CS109A Introduction to Data Science \n", "\n", "## Standard Section 1: Introduction to Web Scraping\n", "\n", "**Harvard University**
\n", "**Fall 2020**
\n", "**Instructors**: Pavlos Protopapas, Kevin Rader, and Chris Tanner
\n", "**Section Leaders**: Marios Mattheakis, Hayden Joy
\n", "\n", "\n", "\n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## RUN THIS CELL TO GET THE RIGHT FORMATTING \n", "import requests\n", "from IPython.core.display import HTML\n", "styles = requests.get(\"https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css\").text\n", "HTML(styles)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Section Learning Objectives\n", "\n", "When we're done today, you will approach messy real-world data with confidence that you can get it into a format that you can manipulate.\n", "\n", "Specifically, our learning objectives are:\n", "* Understand the tree-like structure of an HTML document and use that structure to extract desired information\n", "* Use Python data structures such as lists, dictionaries, and Pandas DataFrames to store and manipulate information\n", "\n", "* Practice using [Python](https://docs.python.org/3.6/) packages such as [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) and [Pandas](https://pandas.pydata.org/pandas-docs/stable/), including how to navigate their documentation to find functionality.\n", "\n", "* Identify some other (semi-)structured formats commonly used for storing and transferring data, such as JSON and CSV" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "from bs4 import BeautifulSoup\n", "import requests\n", "\n", "\n", "import json\n", "from IPython.display import HTML" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Setting up 'requests' to make HTTPS requests properly takes some extra steps... we'll skip them for now.\n", "%matplotlib inline \n", "\n", "requests.packages.urllib3.disable_warnings()\n", "\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Section Data Analysis Questions\n", "\n", "Is science becoming more collaborative over time? How about literature? Are there a few \"geniuses\" or lots of hard workers? One way we might answer those questions is by looking at Nobel Prizes. We could ask questions like:\n", "\n", "* 1) Has anyone won a prize more than once?\n", "* 2) How has the total number of recipients changed over time?\n", "* 3) How has the number of recipients per award changed over time?\n", "\n", "\n", "To answer these questions, we'll need data: *who* received *what* award and *when*. \n", "\n", "Before we dive into acquiring this data the way we've been teaching in class, let's pause to ask: **what are 5 different approaches we could take to acquiring Nobel Prize data**?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## When possible: find a structured dataset (.csv, .json, .xls)\n", "\n", "After a google search we stumble upon this [dataset on github](https://github.com/OpenRefine/OpenRefine/blob/master/main/tests/data/nobel-prize-winners.csv). It is also in the section folder named `github-nobel-prize-winners.csv`.\n", "\n", "We use pandas to read it: " ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yeardisciplinewinnerdesc
01901chemistryJacobus H. van 't Hoffin recognition of the extraordinary services h...
11901literatureSully Prudhommein special recognition of his poetic compositi...
21901medicineEmil von Behringfor his work on serum therapy, especially its ...
31901peaceHenry DunantNaN
41901peaceFrédéric PassyNaN
\n", "
" ], "text/plain": [ " year discipline winner \\\n", "0 1901 chemistry Jacobus H. van 't Hoff \n", "1 1901 literature Sully Prudhomme \n", "2 1901 medicine Emil von Behring \n", "3 1901 peace Henry Dunant \n", "4 1901 peace Frédéric Passy \n", "\n", " desc \n", "0 in recognition of the extraordinary services h... \n", "1 in special recognition of his poetic compositi... \n", "2 for his work on serum therapy, especially its ... \n", "3 NaN \n", "4 NaN " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(\"../data/github-nobel-prize-winners.csv\")\n", "df.head() #pandas is a very useful package" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or you may want to read an xlsx file:\n", "\n", "(Potential missing package; you might need to run the following command in your terminal first: ```!conda install xlrd```)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting package metadata (current_repodata.json): done\n", "Solving environment: done\n", "\n", "\n", "==> WARNING: A newer version of conda exists. <==\n", " current version: 4.7.10\n", " latest version: 4.8.4\n", "\n", "Please update conda by running\n", "\n", " $ conda update -n base -c defaults conda\n", "\n", "\n", "\n", "## Package Plan ##\n", "\n", " environment location: /home/chris/anaconda3/envs/cs109a\n", "\n", " added / updated specs:\n", " - xlrd\n", "\n", "\n", "The following NEW packages will be INSTALLED:\n", "\n", " xlrd pkgs/main/linux-64::xlrd-1.2.0-py37_0\n", "\n", "\n", "Preparing transaction: done\n", "Verifying transaction: done\n", "Executing transaction: done\n" ] } ], "source": [ "!conda install --yes xlrd " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yeardisciplinewinnerdesc
8482007medicineOliver Smithiesfor their discoveries of principles for introd...
8492007peaceIntergovernmental Panel on Climate Change (IPCC)for their efforts to build up and disseminate ...
8502007peaceAlbert Arnold (Al) Gore Jr.for their efforts to build up and disseminate ...
8512007physicsAlbert Fertfor the discovery of Giant Magnetoresistance
8522007physicsPeter Gr&Atilde;&frac14;nbergfor the discovery of Giant Magnetoresistance
\n", "
" ], "text/plain": [ " year discipline winner \\\n", "848 2007 medicine Oliver Smithies \n", "849 2007 peace Intergovernmental Panel on Climate Change (IPCC) \n", "850 2007 peace Albert Arnold (Al) Gore Jr. \n", "851 2007 physics Albert Fert \n", "852 2007 physics Peter Grünberg \n", "\n", " desc \n", "848 for their discoveries of principles for introd... \n", "849 for their efforts to build up and disseminate ... \n", "850 for their efforts to build up and disseminate ... \n", "851 for the discovery of Giant Magnetoresistance \n", "852 for the discovery of Giant Magnetoresistance " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_excel(\"../data/github-nobel-prize-winners.xlsx\")\n", "df.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### introducing types" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#type(df.winner)\n", "#type(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Research Question 1: Did anyone recieve the Nobel Prize more than once?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**How would you check if anyone recieved more than one nobel prize?**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# initialize the list storing all the names \n", "name_winners = []\n", "\n", "for name in df.winner:\n", " \n", " # Check if we already encountered this name: \n", " if name in name_winners:\n", " \n", " # if so, print the name\n", " print(name)\n", " else:\n", " # otherwise append the name to the list\n", " name_winners.append(name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**We don't want to print \"No Prize was Awarded\" all the time.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here\n", "# list storing all the names \n", "name_winners = []\n", "\n", "for name in df.winner:\n", " \n", " # Check if we already encountered this name: \n", " if name in name_winners and name: \n", " # if so, print the name\n", " print(name)\n", " \n", " else:\n", " # otherwise append the name to the list\n", " name_winners.append(name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### we can use .split() on a string to separate the words into individual strings and store them in a list.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "UN_string = \"Office of the United Nations\"\n", "print(UN_string.split())\n", "#n_words = len(UN_string.split())\n", "#print(\"Number of words: \" + str(n_words));" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Even better:**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "name_winners = []\n", "\n", "for name in df.winner:\n", " \n", " # Check if we already encountered this name: \n", " if name in name_winners and len(name.split()) <= 2: \n", " # if so, print the name\n", " print(name)\n", " \n", " else:\n", " # otherwise append the name to the list\n", " name_winners.append(name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**How can we make this into a oneligner?**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "List comprehension form: [f(x) for x in list]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "winners = []\n", "[print(name) if (name in winners and 
len(name.split()) <= 2) \n", " else winners.append(name) for name in df.winner];" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "HTML('
\\\n",
"Marie Curie received the Nobel Prize in Physics in 1903 and in Chemistry in 1911. \\\n",
"She is one of only four people to receive two Nobel Prizes. \\\n",
"
')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 2: WEB SCRAPING\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The first step in web scraping is to look for structure in the html. Lets look at a real website: " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The official Nobel website has the data we want, but in 2018 and 2019 the physics prize was awarded to multiple groups so we will use an archived version of the web-page for an easier introduction to web scraping.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Internet Archive periodically crawls most of the Internet and saves what it finds. (That's a lot of data!) So let's grab the data from the Archive's \"Wayback Machine\" (great name!). We've just given you the direct URL, but at the very end you'll see how we can get it out of a JSON response from the Wayback Machine API.\n", "\n", "\n", "Let's take a look at the [2018 version of the Nobel website](http://web.archive.org/web/20180820111639/https://www.nobelprize.org/prizes/lists/all-nobel-prizes/) and to look at the underhood HTML: right-click and click on `inspect` . Try to find structure in the tree-structured HTML.\n", "\n", "Play around! (give floor to the students)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "###################################################" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# The first step of web scraping is to write down the structure of the web page\n", "### Here some quick recap of HTML tags and what they do in the context of this notebook:\n", "\n", "HTML tags are opened and closed as follows: \\

<h3\> some text \</h3\>. \n", "Here is a list of a few tags, their definitions, and what information they contain in our problem today:\n", "\n", "\n", "**\

<h3\>: header 3 tag** is a header tag of size 3 (header 1 is the largest). This tag will contain the title and year of the Nobel Prize, which we will parse out.
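 For example: \<h3\>The Nobel Prize in Physics 1921\</h3\>.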
\n", "**\\

: header 6 tag** tag (smaller than header 3) will contain the prize recipients
\n", "**\\

: paragraph tag** tags used for text, contains the prize motivation
\n", "**\\

** \"Content Division element ( \\
) is the generic container for flow content.\" What we care about here is the class attribute, which we will use with beautiful soup to quickly parse information which we want. The class attribute could be attatched to any tag.\n", "\n", "***Paying attention to tags with class attributes is key to the homework.***" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# here is what we will get after selecting using the class by year tag.\n", "\n", "einstein = HTML('\\\n", "
\\\n", "

\\\n", " \\\n", " The Nobel Prize in Physics 1921 \\\n", " \\\n", "

\\\n", "
\\\n", " \\\n", " Albert Einstein \\\n", "
\\\n", "

\\\n", " “for his services to Theoretical Physics, and especially for his discovery of the law of the photoelectric effect” \\\n", "

\\\n", " ')\n", "display(einstein)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "snapshot_url = 'http://web.archive.org/web/20180820111639/https://www.nobelprize.org/prizes/lists/all-nobel-prizes/'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "snapshot = requests.get(snapshot_url)\n", "snapshot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Response [200] is a success status code. Let's google: [`response 200 meaning`](https://www.google.com/search?q=response+200+meaning&oq=response+%5B200%5D+m&aqs=chrome.1.69i57j0l5.6184j0j7&sourceid=chrome&ie=UTF-8). All possible codes [here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type(snapshot)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Try to request \"www.xoogle.be\". What happens?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "snapshot_url2 = 'http://web.archive.org/web/20180820111639/https://www.xoogle.be'\n", "snapshot = requests.get(snapshot_url2)\n", "snapshot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Always remember to “not to be evil” when scraping with requests! If downloading multiple pages (like you will be on HW1), always put a delay between requests (e.g, `time.sleep(1)`, with the `time` library) so you don’t unwittingly hammer someone’s webserver and/or get blocked." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "snapshot = requests.get(snapshot_url)\n", "raw_html = snapshot.text\n", "print(raw_html[500:])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Regular Expressions\n", "You can find specific patterns or strings in text by using Regular Expressions: This is a pattern matching mechanism used throughout Computer Science and programming (it's not just specific to Python). Some great resources that we recommend, if you are interested in them (could be very useful for a homework problem):\n", "- https://docs.python.org/3.3/library/re.html\n", "- https://regexone.com\n", "- https://docs.python.org/3/howto/regex.html.\n", "\n", "Specify a specific sequence with the help of regex special characters. Some examples: \n", "- ```\\S``` : Matches any character which is not a Unicode whitespace character\n", "- ```\\d``` : Matches any Unicode decimal digit \n", "- ```*``` : Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible.\n", "\n", "**Let's find all the occurances of 'Marie' in our raw_html:**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "re.findall(r'Marie', raw_html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Using ```\\S``` to match 'Marie' + ' ' + 'any character which is not a Unicode whitespace character':**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "re.findall(r'Marie \\S',raw_html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**How would we find the lastnames that come after Marie?**\n", "\n", "ANSWER: the \\w character represents any alpha-numeric character. 
\\w* is greedy and gets a repeat of the characters until the next bit of whitespace." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here\n", "last_names = re.findall(r'Marie \\w*', raw_html)\n", "display(last_names)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we have all our data in the notebook. Unfortunately, it is the form of one really long string, which is hard to work with directly. This is where BeautifulSoup comes in. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### This is an example of code that grabs the first title. Regex can quickly become complex, which motivates beautiful soup." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "first_title = re.findall(r'
<h3
.*<\\/a><\\/h3>', raw_html)[0]\n", "print(first_title)\n", "\n", "#you can do this via regex, but it gets complicated fast! This motivates Beautiful Soup." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Parse the HTML with BeautifulSoup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "soup = BeautifulSoup(raw_html, 'html.parser')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Key BeautifulSoup functions we’ll be using in this section:\n", "- **`tag.prettify()`**: Returns cleaned-up version of raw HTML, useful for printing\n", "- **`tag.select(selector)`**: Return a list of nodes matching a [CSS selector](https://developer.mozilla.org/en-US/docs/Learn/CSS/Introduction_to_CSS/Simple_selectors)\n", "- **`tag.select_one(selector)`**: Return the first node matching a CSS selector\n", "- **`tag.text/soup.get_text()`**: Returns visible text of a node (e.g.,\"`

<p>Some text</p>
`\" -> \"Some text\")\n", "- **`tag.contents`**: A list of the immediate children of this node\n", "\n", "You can also use these functions to find nodes.\n", "- **`tag.find_all(tag_name, attrs=attributes_dict)`**: Returns a list of matching nodes\n", "- **`tag.find(tag_name, attrs=attributes_dict)`**: Returns first matching node\n", "\n", "BeautifulSoup is a very powerful library -- much more info here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Let's practice some BeautifulSoup commands..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Print a cleaned-up version of the raw HTML** Which function should we use from above?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pretty_soup = soup.prettify()\n", "print(pretty_soup[:500]) #what about negative indices?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Find the first “title” object** " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here\n", "soup.select(\"h3 a\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Extract the text of first “title” object** " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extracting award data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's use the structure of the HTML document to extract the data we want.\n", "\n", "From inspecting the page in DevTools, we found that each award is in a `div` with a `by_year` class. Let's get all of them." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "award_nodes = soup.select('.by_year') #
\"run all above\" is also very helpful to run many cells of the notebook at once." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# %load exercises/exercise1.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 1.2: Change the above function so it uses list comprehension.\n", "To load the execise simply delete the '#' in the code below and run the cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# %load exercises/exercise2.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Don't look at this cell until you've given the exercise a go! It loads the correct solution." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 1.2 solution (1.1 solution is contained herein as well)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# %load solutions/breakoutsol1.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%run ./solutions/breakoutsol1.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Let's create a Pandas dataframe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's get all of the awards." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "awards = []\n", "for award_node in soup.select('.by_year'):\n", " recipients = get_recipients(award_node)\n", " \n", " #initialize the dictionary\n", " award = {} #{key: value}\n", " \n", " award['title'] = get_award_title(award_node)\n", " award['year'] = get_award_year(award_node)\n", " award['recipients'] = recipients\n", " award['num_recipients'] = len(recipients)\n", " award['motivation'] = get_award_motivation(award_node) \n", " awards.append(award)\n", "awards[0:2]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_awards_raw = pd.DataFrame(awards)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#explain open brackets\n", "df_awards_raw" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Some quick EDA." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_awards_raw.info()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_awards_raw.year.min()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**What is going on with the recipients column?**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_awards_raw.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_awards_raw.num_recipients.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Now lets take a look at num_recipients**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_awards_raw.num_recipients == 0" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_awards_raw[df_awards_raw.num_recipients == 0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ok: 2018 awards have no recipients because this is a 2018 archived version of nobel prize webpage. Some past years lack awards because none were actually awarded that year. 
Let's keep only meaningful data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_awards_past = df_awards_raw[df_awards_raw.year != 2018]\n", "df_awards_past.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hm, `motivation` has a different number of items... why?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_awards_past[df_awards_past.motivation.isnull()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looks like it's fine that those motivations were missing." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Sort the awards by year.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_awards_past.sort_values('year').head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How many awards of each type were given?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_awards_past.title.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But wait, that includes the years the awards weren't offered." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_awards_actually_offered = df_awards_past[df_awards_past.num_recipients > 0]\n", "df_awards_actually_offered.title.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### When was each award first given?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_awards_actually_offered.groupby('title').year" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_awards_actually_offered.groupby('title').year.describe() # we will use this information later!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How many recipients per year?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's include the years with missing awards; if we were to analyze further, we'd have to decide whether to include them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A good plot that clearly reveals patterns in the data is very important. Is this a good plot or not?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_awards_past.plot.scatter(x='year', y='num_recipients') #explain scatterplot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's hard to see a trend when there are multiple observations per year (**why?**).\n", "\n", "Let's try looking at *total* num recipients by year." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Let's explore how important a good plot can be
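.\n",
"\n",
"One way to make a trend visible is to aggregate before plotting; a sketch (assuming the `df_awards_past` frame defined above):\n",
"\n",
"```python\n",
"per_year = df_awards_past.groupby('year').num_recipients.sum()\n",
"per_year.rolling(5).mean().plot(figsize=(16, 6))  # a 5-year moving average smooths year-to-year noise\n",
"```\n",
"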
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_awards_past.groupby('year').num_recipients.sum()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=[16,6])\n", "plt.plot(df_awards_past.groupby('year').num_recipients.mean(), 'b', linewidth='1')\n", "\n", "\n", "plt.title('Total Nobel Awards per year')\n", "plt.xlabel('Year')\n", "plt.ylabel('Total recipients per prize')\n", "plt.grid('on')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check out the years 1940-43? Any comment? " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Any trends the last 25 years?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set(df_awards_past.title)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=[16,6])\n", "i = 0\n", "for award in set(df_awards_past.title):\n", " i += 1\n", " year = df_awards_past[df_awards_past['title']==award].year\n", " recips = df_awards_past[df_awards_past['title']==award].num_recipients\n", " index = year > 2020 - 25\n", " years_filtered = year[index].values\n", " recips_filtered = recips[index].values\n", " \n", " plt.subplot(2,3,i)\n", " plt.bar(years_filtered, recips_filtered, color='b', alpha = 0.7)\n", " plt.title(award)\n", " plt.xlabel('Year')\n", " plt.ylabel('Number of Recipients')\n", " plt.ylim(0, 3)\n", "plt.tight_layout()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A cleaner way to iterate and keep tabs: the ***enumerate( )*** function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 'How has the number of recipients per award changed over time?'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The enumerate function allows us to delete two lines of code \n", "# The number of years shown is increased to 75 so we can see the trend.\n", "plt.figure(figsize=[16,6])\n", "\n", "for i, award in enumerate(set(df_awards_past.title), 1): ################### <--- enumerate\n", " year = df_awards_past[ df_awards_past['title'] == award].year\n", " recips = df_awards_past[ df_awards_past['title'] == award].num_recipients\n", " index = year > 2019 - 75 ########################### <--- extend the range\n", " years_filtered = year[index].values\n", " recips_filtered = recips[index].values\n", " \n", " #plot:\n", " plt.subplot(2, 3, i) #arguments (nrows, ncols, index)\n", " plt.bar(years_filtered, recips_filtered, color='b', alpha = 0.7)\n", " plt.title(award)\n", " plt.xlabel('Year')\n", " plt.ylabel('Number of Recipients')\n", " plt.ylim(0, 3)\n", "\n", "plt.tight_layout()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----------------\n", "### End of Standard Section\n", "---------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Break Out Room II: Dictionaries, dataframes, and Pyplot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Exercise 2.1 (practice creating a dataframe): Build a dataframe of famous physicists from the following lists. **\n", "Your dataframe should have the following columns: \"name\", \"year_prize_awarded\" and \"famous_for\"." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "famous_award_winners = [\"Marie Curie\", \"Albert Einstein\", \"James Chadwick\", \"Werner Karl Heisenberg\"] \n", "nobel_prize_dates = [1923, 1937, 1940, 1934]\n", "famous_for = [\"spontaneous radioactivity\", \"general relativity\", \"strong nuclear force\",\n", " \"uncertainty principle\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#initialize dictionary\n", "famous_physicists = {}\n", "#TODO: build Pandas Dataframe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Exercise 2.2:** Make a bar plot of the total number of Nobel prizes awarded per field. Make sure to use the 'group by' function to achieve this.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#create the figure:\n", "plt.figure(figsize=[16,6])\n", "#group by command:\n", "#TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Solutions:\n", "## Exercise 2.1 Solutions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# %load solutions/exercise2.1sol" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 2.2 Solutions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# %load solutions/exercise2.2sol_vanilla" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# %load solutions/exercise2.2sol_improved" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***Food for thought: Is the prize in Economics more collaborative, or just more modern?***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extra: Did anyone recieve the Nobel Prize more than once (based upon scraped data)?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's where it bites us that our original DataFrame isn't \"tidy\". Let's make a tidy one." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A great scientific article describing tidy data by Hadley Wickam: https://vita.had.co.nz/papers/tidy-data.pdf" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tidy_awards = []\n", "for idx, row in df_awards_past.iterrows():\n", " for recipient in row['recipients']:\n", " tidy_awards.append(dict(\n", " recipient = recipient,\n", " year = row['year']))\n", "tidy_awards_df = pd.DataFrame(tidy_awards)\n", "tidy_awards_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can look at each recipient individually." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tidy_awards_df.recipient.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## End of Normal Section" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Optional Further Readings\n", "\n", "Harvard Professor Sean Eddy in the micro and chemical Biology department at Harvard teaches a great course called MCB-112: Biological Data Science. 
His course is difficult but a great complement to CS109a and is also taught in python.\n", "\n", "Here are a couple resources that he referenced early in his course that helped solidify my understanding of data science.\n", "\n", "50 Years of Data Science by Dave Donoho (2017)\n", "\n", " Tidy data by Hadley Wickam (2014)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extra Material: Other structured data formats (JSON and CSV)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### CSV\n", "CSV is a lowest-common-denominator format for tabular data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_awards_past.to_csv('../data/awards.csv', index=False)\n", "with open('../data/awards.csv', 'r') as f:\n", " print(f.read()[:1000])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It loses some info, though: the recipients list became a plain string, and the reader needs to guess whether each column is numeric or not." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.read_csv('../data/awards.csv').recipients.iloc[20]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### JSON" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "JSON preserves structured data, but fewer data-science tools speak it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_awards_past.to_json('../data/awards.json', orient='records')\n", "\n", "with open('../data/awards.json', 'r') as f:\n", " print(f.read()[:1000])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lists and other basic data types are preserved. (Custom data types aren't preserved, but you'll get an error when saving.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.read_json('../data/awards.json').recipients.iloc[20]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extra: Pickle: handy for storing data\n", "For temporary data storage in a single version of Python, `pickle`s will preserve your data even more faithfully, even many custom data types. But don't count on it for exchanging data or long-term storage. (In fact, don't try to load untrusted `pickle`s -- they can run arbitrary code!)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_awards_past.to_pickle('../data/awards.pkl')\n", "with open('../data/awards.pkl', 'r', encoding='latin1') as f:\n", " print(f.read()[:200])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Yup, lots of internal Python and Pandas stuff..." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.read_pickle('../data/awards.pkl').recipients.iloc[20]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extra: Formatted data output" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's make a textual table of Physics laureates by year, earliest first:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for idx, row in df_awards_past.sort_values('year').iterrows():\n", " if 'Physics' in row['title']:\n", " print('{}: {}'.format(\n", " row['year'],\n", " ', '.join(row['recipients'])))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extra: Parsing JSON to get the Wayback Machine URL" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could go to http://archive.org, search for our URL, and get the URL for the archived version there. But since you'll often need to talk with APIs, let's take this opportunity to use the Wayback Machine's [API](https://archive.org/help/wayback_api.php). This will also give us a chance to practice working with JSON." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "url = \"https://www.nobelprize.org/prizes/lists/all-nobel-prizes/\"\n", "# All 3 of these do the same thing. The third is my (KCA's) favorite new feature of Python 3.6.\n", "wayback_query_url = 'http://archive.org/wayback/available?url={}'.format(url)\n", "wayback_query_url = 'http://archive.org/wayback/available?url={url}'.format(url=url)\n", "wayback_query_url = f'http://archive.org/wayback/available?url={url}'\n", "r = requests.get(wayback_query_url)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We got some kind of response... what is it?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "r.text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Yay, [JSON](https://en.wikipedia.org/wiki/JSON)! It's usually pretty easy to work with JSON, once we parse it." 
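, "\n",
"\n",
"A tiny sketch of what parsing does (toy JSON, not the API response):\n",
"\n",
"```python\n",
"json.loads('{\"a\": [1, 2]}')  # JSON text in, Python dict out: {'a': [1, 2]}\n",
"```"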
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "json.loads(r.text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Loading responses as JSON is so common that `requests` has a convenience method for it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response_json = r.json()\n", "response_json" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**What kind of object is this?**\n", "\n", "A little Python syntax review: **How can we get the snapshot URL?**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "snapshot_url = response_json['archived_snapshots']['closest']['url']\n", "snapshot_url" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }