Key Word(s): BeautifulSoup



CS 109A/STAT 121A/AC 209A/CSCI E-109A

Standard Section 2: Webscraping with BeautifulSoup, Continued

Harvard University
Fall 2017
Section Leaders: Nathaniel Burbank, Albert Wu, Matthew Holman
Instructors: Pavlos Protopapas, Kevin Rader, Rahul Dave, Margo Levine

**Download this notebook from the CS109 repo or here: http://bit.ly/109_s2**

During the first section we worked with tables of historic population estimates of the different US states, as listed on the Wikipedia site: https://en.wikipedia.org/wiki/List_of_U.S._states_by_historical_population. We learned how to download a webpage via its URL, to parse the resulting html into a python object with BeautifulSoup, to extract the necessary tables, to convert the results into a pandas dataframe, and finally to make a few plots.

In this section, we will extend those results to include the median household income and other information per state. Specifically, we will:

1. Review the process of downloading a webpage with the requests library 
2. Revisit how to parse the html into a python object using the BeautifulSoup library
3. Again use BeautifulSoup to select and extract the relevant tables
4. Clean the text and convert the table into a python dictionary 
5. Save and restore the results in files with JSON and csv (pandas)
6. Make a pandas dataframe from the dictionary
7. Join the new dataframe with the one we made last week
8. Compare inner and outer joins
9. Do more analysis and plotting (with matplotlib) 

As usual, our first step is to import the necessary libraries.

In [ ]:
%matplotlib inline
import sys
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

import requests 
requests.packages.urllib3.disable_warnings()
# Requests enables us to download raw html as text
from bs4 import BeautifulSoup 
# BeautifulSoup enables us to navigate html in python with a DOM-like tree structure 

# Two new ones
import time
# So that we can sleep between downloads
import json
# We'll write and read JSON files later

assert sys.version_info.major == 3, sys.version
# Python 3 is required

Download, extract, and parse the relevant webpage as raw html

Next, let's quickly reconstruct the pandas dataframe we made in the last section. Recall that we

1. used the requests module to download the page corresponding to a given URL,
2. extracted the raw HTML of the result, and
3. used BeautifulSoup to parse the raw HTML.

We could repeat each of these steps, but you might have noticed that we only used the final result and ignored the intermediate results. This calls out for a function definition.

In [ ]:
# Dictionary to store already downloaded BeautifulSoup objects 
html_cache = dict()
In [ ]:
def dl_parsed_html(url, timeout=20, verify=False, sleep_time=1):
    """Downloads the content of the page specified by url.
    Returns the parsed HTML or None upon error.
    Notice that timeout, verify, and sleep_time have
    default values."""
    
    # Download the content at url, passing through our keyword arguments
    r = requests.get(url, timeout=timeout, verify=verify)
    
    # Check if HTTP status code is anything other than "ok"
    if r.status_code != 200:
        return None
    
    # Grab the raw HTML of the result
    raw_html = r.text
    
    # Parse the raw HTML with BeautifulSoup
    soup = BeautifulSoup(raw_html, 'html.parser')
    
    # Sleep (1 sec by default) so we don't hammer the server with requests
    time.sleep(sleep_time)
    
    # Return the parsed results
    return soup

def get_parsed_html(url, timeout=20, verify=False, sleep_time=1):
    """If we’ve already downloaded a webpage, return from cache. 
    Otherwise, download and parse using dl_parsed_html above."""
    if url not in html_cache:
        parsed_html = dl_parsed_html(url, timeout, verify, sleep_time)
        if parsed_html:
            html_cache[url] = parsed_html 
            
    return html_cache.get(url,None)

Now let's use our new function to download and parse the US historical population information again.

In [ ]:
soup = get_parsed_html("http://en.wikipedia.org/wiki/List_of_U.S._states_by_historical_population")
soup.find('title').text

We can use find_all to get the tables of interest again.

In [ ]:
tables = soup.find_all("table",attrs={"class":"wikitable"})
# This says return a list of all table objects that include 
# the css class “wikitable” within the soup object.  

# We only need the first, third, and fourth tables.
tables = [tables[0], tables[2], tables[3]]

We can render the first table, to inspect it, as we did in the previous section.

In [ ]:
from IPython.core.display import display, HTML
display(HTML(tables[0].prettify()))

Clean data and build dictionary

Here is the code we used in the previous section to clean the data and build the dictionary, row by row, from a list of tables.

In [ ]:
def rm_footnotes(s):
    """Removes any text after first '[' in string
    District of Columbia[1] -> District of Columbia"""
    return s.split('[')[0]

def clean_int(s):
    """Removes any commas, '$' symbols, or footnotes from string and converts to int.
    Returns zero for blank strings"""
    s = s.strip().replace(',','').replace('$', '')
    s = rm_footnotes(s)
    return int(s) if s else 0
In [ ]:
d = dict()

# Extract population data for the states from every table in `tables`
# and store it all in the single dictionary d.
for table in tables:
    
    headrow = table.find("tr")
    
    col_names = [(idx,th.text) for idx,th in enumerate(headrow.find_all('th')) if th.text.isnumeric()]
    # Makes list of tuples like this with idx and name for cols with years.
    # By using isnumeric, we only include columns that are years. 
    # Result looks like this: 
    # [(2, '1790'), (3, '1800'), (4, '1810')]                     
    
    rows = table.find_all("tr")[1:]
    # List of rows in table excluding the header row 
    
    for row in rows:
        state_name = rm_footnotes(row.find('td').text)
        # String of state name, with any footnotes removed 
        
        all_cells = [c.text for c in row.find_all('td')]
        # List of cell values for row, e.g.: 
        # ['Alabama', '1819', '\xa0', '1,250', '9,046' .. ] 
        
        existing_values = d.get(state_name,{})
        # Existing dict of values for given state 
        
        new_values = {year:clean_int(all_cells[idx]) for (idx,year) in col_names}
        # For cols listed in col_names, return dict of cleaned int values 
        # {'1790': 0, '1800': 1250, '1810': 9046...}
        
        existing_values.update(new_values)
        # Merge with existing dict for state 
        d[state_name] = existing_values

Use JSON to save and restore data

JSON (JavaScript Object Notation) is a simple data exchange format that can be used to serialize many data objects. To serialize an object means to convert it to a format that can be transmitted in a series, either to or from files or across a network. This is particularly easy in python. (See pickle for an alternative means of serializing python objects; it supports a larger range of python objects, but it is neither human readable nor a universal format like JSON.)
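
As a quick illustration of the pickle alternative mentioned above, here is a minimal sketch (not used elsewhere in this notebook; the example dictionary and filename are made up) that round-trips a small dictionary through a binary pickle file:

In [ ]:
# Sketch: serialize and restore a small dictionary with pickle.
# Unlike JSON, the resulting file is binary and not human readable.
import pickle

example = {'Alabama': {'1790': 0, '1800': 1250}}

with open("example.pkl", "wb") as fp:   # pickle files are binary, hence 'wb'
    pickle.dump(example, fp)

with open("example.pkl", "rb") as fp:
    restored = pickle.load(fp)

restored == example   # should evaluate to True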

Here we'll use JSON to save our dictionary so that we can simply load it from the file if we need it in the future.

In [ ]:
filename = "historical_population.json"

# Open a file descriptor for writing, with the filename 'historical_population.json'.
# Convert our dictionary 'd' to JSON and output it to the file.
# Close the file.

# Your code here 

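If you get stuck, here is a minimal sketch of one possible solution, using json.dump inside a with-block so that the file is closed automatically:

In [ ]:
# One possible solution (a sketch): serialize the dictionary d to the JSON file
with open(filename, 'w') as fp:
    json.dump(d, fp)
# The with-block closes the file automatically when it exits
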
A convenient feature of JSON is that the resulting files are ASCII and thus simple to read (although they are larger than a corresponding binary format would be).

Let's use a shell command to look at the actual file.

In [ ]:
!cat historical_population.json
# The exclamation point tells IPython to run this line 
# as a shell command in the terminal

It looks a lot like a python dictionary!

Let's demonstrate the use of JSON files by deleting d and then reloading it.

In [ ]:
del d
In [ ]:
d
In [ ]:
# Now open the json file back up and save the results to d

# your code here
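
Again, a minimal sketch of one possible solution, using json.load:

In [ ]:
# One possible solution (a sketch): read the JSON file back into d
with open(filename, 'r') as fp:
    d = json.load(fp)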
In [ ]:
d

We successfully restored the dictionary from its JSON file!

Generate the dataframe

Now that we have the dictionary, it's straightforward to generate a pandas dataframe from it.

In [ ]:
# Generate the dataframe from the dictionary.
df = pd.DataFrame.from_dict(d,orient='index')

# Drop unneeded rows.
df = df.drop(["United States"])  

# Sort the columns in increasing (alphabetical) order.
df = df.reindex_axis(sorted(df.columns), axis=1)

df.head()

Alternative to saving JSON – save dataframe object as csv

In [ ]:
df.to_csv("historical_population.csv")
In [ ]:
!cat historical_population.csv
In [ ]:
pd.read_csv("historical_population.csv", index_col=0).head()

Exporting as csv is another useful way of saving a dataframe offline, so that you can easily jump back to your dataset without needing to re-scrape Wikipedia. Note, however, that when we export as csv, no metadata about the dataframe is saved, which can cause subtle differences on re-import in the datatypes of your columns and in which column is used as the index.
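
For example (a quick sketch), we can re-import the file with index_col=0 so the state names become the index again, and then compare the column dtypes before and after the round trip (df_reloaded is just an illustrative name):

In [ ]:
# Sketch: check whether a round trip through csv preserved the column dtypes
df_reloaded = pd.read_csv("historical_population.csv", index_col=0)
(df.dtypes == df_reloaded.dtypes).all()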

Download, extract, and parse another webpage

Now that we have regenerated the dictionary and dataframe from last section, let's demonstrate a slightly different approach to downloading, extracting, and parsing. This time we will look at a webpage of the US states by income.

We'll re-use our previous function to download and parse the page.

In [ ]:
soup = get_parsed_html("https://en.wikipedia.org/wiki/List_of_U.S._states_by_income")
soup.find('title').text
In [ ]:
tables = soup.find_all("table",attrs={"class":"wikitable"})
# This says return a list of all table objects that include 
# the css class “wikitable” within the soup object.  
len(tables)

This page has two wikitables, and we are interested in the second one, which looks like this.

In [ ]:
# Let's render the second table as HTML
from IPython.core.display import display, HTML
display(HTML(tables[1].prettify()))

Convert the html table directly into a pandas dataframe

For well-formatted HTML tables, the pandas read_html method works like a charm, and it can be much easier than building a dataframe row by row.

With the read_html method, we can specify which row has the column names and which column is the index. We'll start by using the first row and first column.

In [ ]:
df1 = pd.read_html(tables[1].prettify(),header=0, index_col=0)[0]
In [ ]:
df1.head()

Clean up the dataframe

There are a few things we should clean up before proceeding. First, let's set the index to be the 'State' column.

In [ ]:
df1 = df1.set_index('State')

Note that by overwriting the index, we’re discarding the 'Rank' column.

In [ ]:
df1.head()

Next, let's drop the 'United States' row -- but keep your eye on the District of Columbia and Puerto Rico rows.

In [ ]:
df1 = df1.drop(["United States"])

Also, the column names have one or more spaces embedded in them. That can make the columns difficult to select. Fortunately, a pandas dataframe allows easy access to the column names as follows:

In [ ]:
df1.columns.values

Notice that this is a numpy array of type 'object' with a list of column names. We can write a quick reformatting function and apply it to all the column names.

Here we use both 'strip', which is a built-in method of string objects, and the 're' (regular expression) substitution method. Regular expressions are a powerful, concise way of representing an enormous range of string transformations.

To learn more about regular expressions in python, see: https://docs.python.org/3/library/re.html

In [ ]:
import re
def clean_column_name(s):
    """This removes any leading or trailing spaces and
    replaces multiple embedded spaces with a single
    underscore character
    
    'Per capita  income' -> 'Per_capita_income'
    """
    # Your code here
    
    return s

new_column_names = list(map(clean_column_name, list(df1.columns.values)))

# Here's our set of reformatted column names
new_column_names
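
If you get stuck, here is a minimal sketch of one way to implement the body of clean_column_name using re.sub (the name clean_column_name_sketch is just illustrative):

In [ ]:
# One possible implementation (a sketch)
def clean_column_name_sketch(s):
    """'Per capita  income' -> 'Per_capita_income'"""
    s = s.strip()               # remove leading/trailing whitespace
    s = re.sub(r'\s+', '_', s)  # collapse runs of whitespace into a single underscore
    return s

list(map(clean_column_name_sketch, df1.columns.values))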

Now we can reset the column names.

In [ ]:
df1.columns = new_column_names
df1.head()

That's better!

Let’s say we were interested in identifying states that had a large divergence between Median household income and Median family income.

Per the Census Bureau’s definitions, a family is “a group of two people or more (one of whom is the householder) related by birth, marriage, or adoption and residing together; all such people (including related subfamily members) are considered as members of one family.” Conversely, a household “consists of all the people who occupy a housing unit.”

In [ ]:
df1.Median_family_income - df1.Median_household_income

Uh oh, that did not work. What's wrong?

In [ ]:
df1.info()

This does not work because these two columns are represented as strings rather than int or float objects. On import, pandas interpreted them as strings rather than floats or ints because of the '\$' symbols in them. Let's use our clean_int function from last time to solve this issue. (Notice that clean_int now removes '\$' symbols too.) If we 'apply' clean_int to the Per_capita_income column, we get the following:

In [ ]:
df1['Per_capita_income'].apply(clean_int)

Since we have three such columns to deal with, let's clean them in bulk.

In [ ]:
for colname in ['Per_capita_income', 'Median_household_income', 'Median_family_income']:
    df1[colname] = df1[colname].apply(clean_int)
In [ ]:
df1.head()

Why does this work? Aren't we changing the columns as we go?

Now let’s get back to looking at the difference between median household and family by state.

In [ ]:
df1['Household_family_diff'] = df1.Median_family_income - df1.Median_household_income
df1.head()

Let’s identify the states with the largest differential...

In [ ]:
key_cols = ['Median_family_income','Median_household_income','Household_family_diff']
df1.sort_values(['Household_family_diff'],ascending=False)[key_cols].head(5)

And the smallest...

In [ ]:
df1.sort_values(['Household_family_diff'])[key_cols].head(5)

Plot our data!

In [ ]:
y_values = (df1['Household_family_diff']).sort_values(ascending=False)
x_values = list(y_values.index)
x_pos = np.arange(len(y_values))

plt.bar(x_pos,y_values, align='center', alpha=0.5)
plt.xticks(x_pos, x_values, rotation=90, fontsize=12)
plt.ylabel('Difference between family income and household income (USD)')
plt.title('Difference between Median family income and median household income by state')

# Turn off the grid for axes created after this point
new_style = {'grid': False}
plt.rc('axes', **new_style)

We can make almost exactly the same chart using pandas' built-in "plot" command, with much less code.

In [ ]:
df1['Household_family_diff'].sort_values(ascending=False).plot(kind="bar", alpha=.5, fontsize=12)
plt.ylabel('Difference between family income and household income (USD)')
plt.title('Difference between Median family income and median household income by state')

Combine dataframes

What if we want to combine our two dataframes? There are a number of ways to do this.

One way is to use the Pandas concat function. It stitches the two data frames together based on their index values.

In [ ]:
result = pd.concat([df, df1], axis=1, join='inner')
print(len(result))
result.head()

Let's check the results by comparing the 'Connecticut' row from each of the constituent dataframes.

In [ ]:
df.loc['Connecticut']
In [ ]:
df1.loc['Connecticut']

Notice that we can also do an 'outer' join. What's the difference? An inner join keeps only the index values that appear in both dataframes, while an outer join keeps the union of the two indexes.

In [ ]:
result = pd.concat([df, df1], axis=1, join='outer')
print(len(result))
result

Notice that “District of Columbia” is listed twice above. What’s going on? Why didn’t it match?

In [ ]:
df.index[8]
In [ ]:
df1.index[0]
In [ ]:
df.index[8] == df1.index[0]

Note that even though the strings look the same, slight differences in the way the strings are encoded keep pandas from matching them up. If we wanted to fix this, we'd probably want to go back to the import stage and do it there. Adjusting indexes once they've been assigned is messy in pandas.
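
A common culprit in cases like this is a stray non-breaking space ('\xa0') or other invisible character in one of the strings. Here is a minimal sketch of how we might normalize label strings at the import stage (an illustration of the idea, not a verified fix for this particular mismatch; normalize_label is just an illustrative name):

In [ ]:
# Sketch: normalize whitespace and Unicode form in a label string.
# Diagnosing the actual difference first, e.g. with repr(df.index[8]) and
# repr(df1.index[0]), would tell us exactly what needs to be cleaned.
import unicodedata

def normalize_label(s):
    s = unicodedata.normalize('NFKC', s)  # fold compatibility characters such as '\xa0'
    return ' '.join(s.split())            # collapse any remaining whitespace runs

normalize_label(df.index[8]) == normalize_label(df1.index[0])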

In [ ]:
result = pd.concat([df, df1], axis=1, join='inner')

Another approach is to use the merge method on the shared values in two specified columns. Let's explore this by copying the index into a new column (here named 'state') in each dataframe. (If we had some other shared key available, we could use that instead.)

In [ ]:
df['state'] = df.index
df1['state'] = df1.index
In [ ]:
combined_df= df1.merge(df, on='state', how='inner')
len(combined_df)
In [ ]:
combined_df.head()

Note here that while the resulting table is correct, we've abandoned our index, which makes the table harder to work with.
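
If we wanted the state names back as the index, a quick sketch (without modifying combined_df):

In [ ]:
# Sketch: make 'state' the index again
combined_df.set_index('state').head()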

Analysis on joined frame – what drives household income?

Finally, let’s explore a couple of possible drivers of median household income.

In [ ]:
plt.scatter(combined_df['2010']/1e6, combined_df['Median_household_income']/1e3)

plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
plt.title("Median income vs population by State")
plt.xlabel("Population (Millions)")
plt.ylabel("Median household income (Thousands)")

Perhaps we think population growth, rather than raw size, is what matters…

In [ ]:
population_growth = result['2010']/result['1980']
population_growth
In [ ]:
plt.title("Population growth by state, 1980-2010")
population_growth.sort_values(ascending=False).plot(kind='bar')
plt.axhline(y=1, color='r', linestyle='-')
# Draw a horizontal red line at y=1
plt.ylabel("30 year population growth multiplier")
In [ ]:
# Use `result` for the y-values so they share the same index order as population_growth
plt.scatter(population_growth, result.Median_household_income/1e3)
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
plt.title("Median income vs population growth by State")
plt.xlabel("30 year population growth multiplier")
plt.ylabel("Median household income (Thousands)")