CS109A Introduction to Data Science

Lecture 3, Exercise 1: Web Scraping and Parsing Intro

Harvard University
Fall 2020
Instructors: Pavlos Protopapas, Kevin Rader, and Chris Tanner


Title

Exercise 1: Web Scraping and Parsing Intro

Description

OVERVIEW

As we learned in class, the three most common sources of data used for Data Science are:

  • files (e.g., .csv, .txt) that already contain the dataset
  • APIs (e.g., Twitter or Facebook)
  • web scraping (e.g., Requests)

Here, we get practice with web scraping by using Requests. Once we fetch the page contents, we will need to extract the information that we actually care about. We rely on BeautifulSoup to help with this.

In [27]:
import re
import requests
from bs4 import BeautifulSoup

NOTE: After running every cell, be sure to auto-grade your work by clicking 'Mark' in the lower-right corner. Otherwise, no credit will be given.

For this exercise, we will be grabbing data (the Top News stories) from AP News, a not-for-profit news agency.

In [ ]:
# the URL of the webpage that has the desired info
url = "https://apnews.com/hub/ap-top-news"

Web Scraping (Graded)

Let's use requests to fetch the contents. Specifically, the requests library has a .get() function that returns a Response object. A Response object contains the server's response to the HTTP request, and thus contains all the information that we could want from the page.
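Here is a minimal sketch of that pattern on a neutral page (example.com is just a stand-in URL, not part of the exercise):

In [ ]:
# sketch: fetch a stand-in page and inspect the Response object
demo_page = requests.get("https://example.com")
print(type(demo_page))        # <class 'requests.models.Response'>
print(demo_page.status_code)  # 200 when the request succeeds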

Below, fill in the blank to fetch AP News' Top News website.

In [ ]:
### edTest(test_a) ###
home_page = requests.get(____)
home_page.status_code

You should have received a status code of 200, which means the page was successfully found on the server and sent to the receiver (aka the client/user/you). Again, a full list of status codes is easy to find online. Recall that sometimes, while browsing the Internet, a webpage will report a 404 error, possibly with an entertaining graphic to ease your pain. That 404 is a status code, just like the 200 we are seeing here!
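If you want your code to fail gracefully, one common pattern (a sketch, not required for this exercise) is to branch on the status code before doing anything else with the response:

In [ ]:
# sketch: only proceed when the fetch succeeded
if home_page.status_code == 200:
    print("Success: the page was fetched")
else:
    print("Request failed with status code", home_page.status_code)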

home_page is now a Response object. It contains many attributes, including .text. Run the cell below and note that its output is identical to what you would see by visiting the webpage in your browser and clicking 'View Page Source'.

In [ ]:
home_page.text

Data Parsing Intro (Graded)

The .text attribute above is atrocious to view and make sense of. Sure, we could write Regular Expressions to extract the contents we're interested in. Instead, let's first use BeautifulSoup to parse the content into more manageable chunks.
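To see what BeautifulSoup's constructor does, here is a self-contained sketch on a tiny, made-up HTML string (the string and variable names are just for illustration):

In [ ]:
# sketch: parse a toy HTML string with Python's built-in "html.parser"
toy_html = "<html><head><title>Toy Page</title></head><body><p>Hello!</p></body></html>"
toy_soup = BeautifulSoup(toy_html, "html.parser")
print(toy_soup.p.text)  # prints: Hello!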

Below, fill in the blank to construct an HTML-parsed BeautifulSoup object from our website.

In [ ]:
### edTest(test_b) ###
soup = BeautifulSoup(____, ____)
soup

You'll notice that the soup object is better formatted than the raw text. It's still dense, but it helps.
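If you want an even more readable view, BeautifulSoup objects have a .prettify() method that re-indents the parse tree, one tag per line. A quick sketch (printing only the first 500 characters to keep the output short):

In [ ]:
# sketch: prettify() re-indents the parsed HTML for easier reading
print(soup.prettify()[:500])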

Below, fill in the blank to set webpage_title equal to the text of the webpage's title (no HTML tags included).

In [ ]:
### edTest(test_c) ###
webpage_title = ____

Again, our BeautifulSoup object allows for quick, convenient searching and access to the web page contents.
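For instance, tag names work as attributes on the soup, and .find_all() searches the whole tree. A self-contained sketch on a small, made-up snippet:

In [ ]:
# sketch: common access patterns, shown on a made-up two-link snippet
snippet = BeautifulSoup('<div><a href="/one">First</a><a href="/two">Second</a></div>', "html.parser")
print(snippet.a)              # the first <a> tag in the document
print(snippet.a["href"])      # that tag's href attribute: /one
print(snippet.find_all("a"))  # every <a> tag in the document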

Data Parsing Examples (Not Graded)

Anytime you wish to extract specific contents from a webpage, it is necessary to:

  • Step 1. While viewing the page in your browser, identify which contents of the page you're interested in.
  • Step 2. Look at the HTML returned by the BeautifulSoup object, and pinpoint the specific context that surrounds each item you're interested in.
  • Step 3. Devise a pattern using BeautifulSoup and/or Regular Expressions to extract said contents.

For example:

Step 1:

Let's say that, for every news article found on AP's Top News page, you want to extract the link and the associated title. In a screenshot of the page, we can see one news article (there are many more below it on the page). Its title is "California fires bring more chopper rescues, power shutoffs" and its link is /c0aa17fff978e9c4768ee32679b8555c. Since the current page is served from apnews.com, the article link's full address is apnews.com/c0aa17fff978e9c4768ee32679b8555c.
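That href is relative (it begins with a slash instead of a full address). One way to turn it into a full address, sketched here with urljoin from Python's standard library, is:

In [ ]:
# sketch: join a relative href onto the site's base address
from urllib.parse import urljoin

full_url = urljoin("https://apnews.com", "/c0aa17fff978e9c4768ee32679b8555c")
print(full_url)  # https://apnews.com/c0aa17fff978e9c4768ee32679b8555c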

Step 2:

After printing the soup object, we still saw a huge mess of HTML. So, let's drill down into specific sections. As illustrated in the official BeautifulSoup documentation, we can retrieve all the links by running the cell below:

In [ ]:
soup.find_all("a")

Step 3:

The pattern is that we want the value of the href attribute, along with the text of the link. There are many ways to get at this information. Below, I show just a few:

In [19]:
# EXAMPLE 1

# returns all `a` tags whose class includes `Component-headline-0-2-110`
soup.find_all("a", "Component-headline-0-2-110")

# iterates over each such link and extracts the href and title;
# note we avoid reusing the name `url`, which already stores the page address
for link in soup.find_all("a", "Component-headline-0-2-110"):
    article_url = "https://apnews.com" + link["href"]
    article_title = link.text
    print(article_url, article_title)

As mentioned in the official BeautifulSoup documentation, a tag (such as a) may have many attributes, and you can search them by putting your terms in a dictionary.

In [ ]:
# EXAMPLE 2
# this returns the same exact subset of links as the example above
# so, we could iterate through the list just like above
soup.find_all("a", attrs={"data-key": "card-headline"})

Alternatively, we could use Regular Expressions, provided we were confident that our Regex pattern matched only the relevant links.

In [ ]:
# EXAMPLE 3
# instead of using BeautifulSoup, we handle all of the parsing ourselves,
# working directly with the original Requests text. each match is an
# (href, title) tuple; the closing </a> anchors the non-greedy title group,
# which would otherwise stop after a single character
re.findall(r'"Component-headline.*?href="(.+?)">(.+?)</a>', home_page.text)