Key Word(s): BeautifulSoup



CS 109A/STAT 121A/AC 209A/CSCI E-109A

Standard Section 1: Introduction to Webscraping with BeautifulSoup

Harvard University
Fall 2017
Section Leaders: Albert Wu, Matthew Holman, Nathaniel Burbank
Instructors: Pavlos Protopapas, Kevin Rader, Rahul Dave, Margo Levine

Sometimes the most interesting data sets don’t come with an API or pre-packaged plain-text CSVs. In these situations, web scraping can be a powerful tool, enabling us to extract and convert data from almost any format found on the internet into a tabular form we can conduct further analysis on.

For this section we will be working with historic population estimates for US states, listed on Wikipedia here: https://en.wikipedia.org/wiki/List_of_U.S._states_by_historical_population. You'll learn how to import HTML tables from the web into something you can manipulate with pandas.

Specifically, we will:

1. Download the relevant webpage as raw html with the requests library  
2. Parse the html into a tree-like python-object with the BeautifulSoup library
3. Use BeautifulSoup to select and extract just the tables we’re interested in
4. Combine the tables, clean the text, and convert them into a single python dictionary 
5. Make a pandas dataframe from the dictionary 
6. Do some basic analysis and plotting (with matplotlib)

As usual, our first step is to import the necessary libraries.

In [ ]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

# Two new libraries
import requests 
# Requests enables us to download raw html as text
requests.packages.urllib3.disable_warnings()
# Silences the SSL warnings we'd otherwise get from using verify=False below
from bs4 import BeautifulSoup 
# BeautifulSoup enables us to navigate html in python with a dom-like tree structure 

Step 1: Download the relevant webpage as raw html

In [ ]:
url = "http://en.wikipedia.org/wiki/List_of_U.S._states_by_historical_population"
r = requests.get(url, timeout=20,verify=False)

Always remember to “not be evil” when scraping with requests! If downloading multiple pages (as you will be on HW1), always put a delay between requests (e.g., time.sleep(1), with the time library) so you don’t unwittingly hammer someone’s webserver and/or get blocked.
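
For example, if we had a list of pages to download (the urls list below is hypothetical), a polite loop might look like this:

In [ ]:
import time

urls = ["http://example.com/page1", "http://example.com/page2"]  # hypothetical
pages = []
for u in urls:
    pages.append(requests.get(u, timeout=20).text)
    time.sleep(1)  # wait a second between requests to be polite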

In [ ]:
raw_html = r.text
raw_html

Now we have all our data in the notebook. Unfortunately, it is in the form of one really long string, which is hard to work with directly. This is where BeautifulSoup comes in.
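
To get a sense of what we're dealing with, we can check the string's length and peek at the first few hundred characters:

In [ ]:
print(len(raw_html))   # the whole page as one long string
print(raw_html[:500])  # just the first 500 characters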

Step 2: Parse the html with BeautifulSoup

In [ ]:
soup = BeautifulSoup(raw_html, 'html.parser')

Key BeautifulSoup functions we’ll be using in this section:

  • soup.prettify(): Returns a cleaned-up version of the raw html for printing
  • soup.find_all(name, attrs={}): Returns a python list of matching objects
  • soup.find(name, attrs={}): Returns the first matching object
  • soup.text / soup.get_text(): Returns the visible text of an object (e.g., "<p>Some text</p>" -> "Some text")

BeautifulSoup is a very powerful library -- much more info here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
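
For instance, here's how these behave on a small hand-written snippet (this toy html is just for illustration):

In [ ]:
toy = BeautifulSoup("<div><p class='a'>Some text</p><p>More text</p></div>", 'html.parser')
print(toy.find_all("p"))                    # both <p> tags, as a list
print(toy.find("p", attrs={"class": "a"}))  # just the first matching <p>
print(toy.find("p").get_text())             # 'Some text'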

Let's practice some BeautifulSoup commands.

Print a cleaned-up version of the raw html

In [ ]:
# Your code here; one possible solution:
print(soup.prettify())

Find the first “title” object

In [ ]:
# Your code here; one possible solution:
soup.find("title")

Find the text of the first “title” object

In [ ]:
# Your code here; one possible solution:
soup.find("title").text

Step 3: Extract just the tables we’re interested in

In [ ]:
tables = soup.find_all("table",attrs={"class":"wikitable"})
# This says return a list of all table objects that include 
# the css class “wikitable” within the soup object.  
len(tables)

As it turns out, the data we’re interested in is in the 1st, 3rd, and 4th wikitable on the page.
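
If you want to check this yourself, one quick way is to print the header row of each table:

In [ ]:
for i, t in enumerate(tables):
    print(i, t.find("tr").get_text(" ", strip=True)[:80])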

In [ ]:
tables = [tables[0], tables[2], tables[3]]
len(tables)

Step 4: Convert the html tables into a python dictionary

Before we can think about how to extract what we need, we need to understand how tables are constructed in HTML...

In [ ]:
from IPython.core.display import display, HTML
display(HTML("""
<table>
  <tr> <th>Firstname</th> <th>Lastname</th> <th>Age</th> </tr>
  <tr> <td>Jill</td> <td>Smith</td> <td>50</td> </tr>
  <tr> <td>Eve</td> <td>Jackson</td> <td>94</td> </tr>
</table>
"""
))

Now let’s look at the first table in the list of tables we’re trying to import.

In [ ]:
# Let's render the first table as HTML
from IPython.core.display import display, HTML
display(HTML(tables[0].prettify()))

Our goal is to associate a series of population values with each state, so that we can build a single time series across all three tables. Things to keep in mind while building our extraction routine:

- We don’t care about the “admitted” column 
- We want to remove the commas from the numbers so python interprets them as ints rather than strings 
- We want to remove the footnotes and links 

Now, let’s try to extract out our data…

In [ ]:
def rm_footnotes(s):
    """
    Removes any text after first '[' in string
    District of Columbia[1] -> District of Columbia"""
    # One possible implementation: split on '[' and keep the first piece
    return s.split('[')[0].strip()

def clean_int(s):
    """Removes any commas or footnotes from string and converts to int.
       Returns zero for blank strings"""
    # Strip footnotes and commas, then handle blank (or non-breaking-space) cells
    s = rm_footnotes(s).replace(',', '').strip()
    return int(s) if s else 0
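
Assuming implementations along those lines, a quick sanity check (sample values for illustration):

In [ ]:
print(rm_footnotes("District of Columbia[1]"))  # District of Columbia
print(clean_int("9,046"))                       # 9046
print(clean_int("\xa0"))                        # 0 (blank cell)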
In [ ]:
d = dict()

for table in tables:
    """Extracts population data for states from all tables in tables 
      and store it in single dictionary d """
    
    headrow = table.find("tr")
    
    col_names = [(idx,th.text) for idx,th in enumerate(headrow.find_all('th')) if th.text.isnumeric()]
    # Makes list of tuples like this with idx and name for cols with years.
    # By using isnumeric, we only include columns that are years. 
    # Result looks like this: 
    # [(2, '1790'), (3, '1800'), (4, '1810')]                     
    
    rows = table.find_all("tr")[1:]
    # List of rows in table excluding the header row 
    
    for row in rows:
        state_name = rm_footnotes(row.find('td').text)
        # String of state name, with any footnotes removed 
        
        all_cells = [c.text for c in row.find_all('td')]
        # List of cell values for row, e.g.: 
        # ['Alabama', '1819', '\xa0', '1,250', '9,046' .. ] 
        
        existing_values = d.get(state_name,{})
        # Existing dict of values for given state 
        
        new_values = {year:clean_int(all_cells[idx]) for (idx,year) in col_names}
        # For cols listed in col_names, return dict of cleaned int values 
        # {'1790': 0, '1800': 1250, '1810': 9046...}
        
        existing_values.update(new_values)
        # Merge with existing dict for state 
        d[state_name] = existing_values
d

Step 5: Make a pandas dataframe from the dictionary

Now that we have all three tables in cleaned dictionary form, with each state as a key and the time series for each state (as another dict) as its value, it's simple to convert to a pandas DataFrame.

In [ ]:
df = pd.DataFrame.from_dict(d,orient='index')
df

This looks pretty good! However, we’ve accidentally imported the total row (United States). We could address that in our function above, or just drop it from our dataframe.

In [ ]:
df = df.drop(["United States"])  
df

An alternative approach -- using pd.read_html() instead

In this particular situation, using pandas’ built-in read_html function actually works pretty well, and would have saved us a lot of code above. However, not all datasets come in nicely formatted html tables, so it’s important to develop the skills to write a scraping routine from scratch for any type of html page (as we did above).

More info here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_html.html

In [ ]:
pd.read_html(tables[0].prettify(),header=0, index_col=0)[0]

(Note – if we really wanted to take this approach, we’d have to handle removing the footnotes and merging/joining the tables in Pandas.)
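
For instance, a minimal sketch of that route, reusing rm_footnotes from above (the merge itself is left unfinished):

In [ ]:
dfs = [pd.read_html(t.prettify(), header=0, index_col=0)[0] for t in tables]
for d_ in dfs:
    d_.index = d_.index.map(rm_footnotes)  # strip the footnotes from state names
# The three frames could then be combined on the state index (e.g., with join)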

Step 6: Let's plot our data!

In [ ]:
df.T.plot(legend=None)
plt.ylabel("Population (Millions)")
plt.xlabel('Year \n\n (Every line represents the population for an individual state over time)')
plt.title("US Population by State, 1790-2010")
In [ ]:
# Let's look at just the 10 largest states (by 2010 population)
largest = df['2010'].nlargest(10).index
df[df.index.isin(largest)].T.plot()
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
plt.title("Population by State for 10 largest states, 1790-2010")
plt.ylabel("Population (Millions)")
plt.xlabel('Year \n\n (Every line represents the population for an individual state over time)')

Perhaps it will be more interesting to view this as a percentage of the total US population, so we can see relative change over time…

In [ ]:
import matplotlib.ticker as mtick
cols = df.columns
by_percent = df[cols].div(df[cols].sum(axis=0), axis=1).multiply(100)
by_percent[by_percent.index.isin(largest)].T.plot()
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
plt.ylabel("Percent of total US population")
plt.xlabel("Year")
plt.title("Percentage of total US Population by State, 1790-2010")

# Make y-axis percentages 
ax = plt.gca()
fmt = '%.0f%%' # Format you want the ticks, e.g. '40%'
yticks = mtick.FormatStrFormatter(fmt)
ax.yaxis.set_major_formatter(yticks)
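
(As an aside, matplotlib 2.1+ also provides mtick.PercentFormatter, so ax.yaxis.set_major_formatter(mtick.PercentFormatter()) would be an equivalent one-liner here.)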
In [ ]:
by_percent[by_percent.index.isin(largest)].T.plot(kind="bar",stacked=True)
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
plt.ylabel("Percent of total population")
plt.xlabel("Year")
plt.title("Percentage of total US Population by State, 1790-2010")

ax = plt.gca()
ax.yaxis.set_major_formatter(yticks)