Key Word(s): requests, web scraping, pandas, beautiful soup, parsing, eda
CS109A Introduction to Data Science
Lecture 3, Exercise 1: Web Scraping and Parsing Intro¶
Harvard University
Fall 2020
Instructors: Pavlos Protopapas, Kevin Rader, and Chris Tanner
Title¶
Exercise 1: Web Scraping and Parsing Intro
Description¶
OVERVIEW
As we learned in class, the three most common sources of data used for Data Science are:
- files (e.g., .csv, .txt) that already contain the dataset
- APIs (e.g., Twitter or Facebook)
- web scraping (e.g., with the Requests library)
Here, we get practice with web scraping by using Requests. Once we fetch the page contents, we will need to extract the information that we actually care about. We rely on BeautifulSoup to help with this.
import re
import requests
from bs4 import BeautifulSoup
# the URL of the webpage that has the desired info
url = "https://apnews.com/hub/ap-top-news"
Web Scraping (Graded)¶
Let's use requests to fetch the contents. Specifically, the requests library has a .get() function that returns a Response object. A Response object contains the server's response to the HTTP request, and thus contains all the information that we could want from the page.
Below, fill in the blank to fetch AP News' Top News website.
### edTest(test_a) ###
home_page = requests.get(____)
home_page.status_code
You should have received a status code of 200, which means the page was successfully found on the server and sent to the receiver (aka the client/user/you). A full list of status codes is available online. Recall that sometimes, while browsing the Internet, webpages will report a 404 error, possibly with an entertaining graphic to ease your pain. That 404 is the status code, just like the 200 we received here!
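As an aside, a common defensive pattern is to check the status before parsing. The sketch below (not part of the graded exercise) uses requests' built-in raise_for_status(), which raises an HTTPError for any 4xx/5xx response:
# raise an HTTPError if the fetch failed (any 4xx or 5xx status);
# does nothing if the request succeeded
home_page.raise_for_status()

# or, branch manually on the status code
if home_page.status_code == 200:
    print("Page fetched successfully")
else:
    print(f"Request failed with status {home_page.status_code}")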
home_page is now a Response object. It contains many attributes, including .text. Run the cell below and note that the output is identical to what you would see if you visited the webpage in your browser and clicked 'View Page Source'.
home_page.text
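Since the full page source is thousands of characters long, one quick trick (just a sketch, not required here) is to print only the first chunk of it:
# peek at the first 500 characters of the raw HTML
print(home_page.text[:500])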
Data Parsing Intro (Graded)¶
The above .text property is atrocious to view and make sense of. Sure, we could write Regular Expressions to extract all of the contents that we're interested in. Instead, let's first use BeautifulSoup to parse the content into more manageable chunks.
Below, fill in the blanks to construct an HTML-parsed BeautifulSoup object from our website.
### edTest(test_b) ###
soup = BeautifulSoup(____, ____)
soup
You'll notice that the soup object is better formatted than just looking at the entire text. It's still dense, but it helps.
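If you'd like an even more readable view, BeautifulSoup's .prettify() method re-indents the parse tree so each tag sits on its own line; a minimal sketch:
# print the first 1,000 characters of the re-indented HTML tree
print(soup.prettify()[:1000])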
Below, fill in the blank to set webpage_title equal to the text of the webpage's title (no HTML tags included).
### edTest(test_c) ###
webpage_title = ____
Again, our BeautifulSoup object allows for quick, convenient searching and access to the web page contents.
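To see that convenience in isolation, here is a small self-contained sketch on a toy HTML snippet (the snippet and its contents are made up purely for illustration):
# build a BeautifulSoup object from a tiny, hand-written HTML string
toy = BeautifulSoup(
    "<html><head><title>My Toy Page</title></head>"
    "<body><p>Hello, world!</p></body></html>",
    "html.parser",
)
print(toy.title)           # the full <title> tag, markup included
print(toy.title.text)      # just the text inside the tag
print(toy.find("p").text)  # the text of the first <p> tag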
Data Parsing Examples (Not Graded)¶
Anytime you wish to extract specific contents from a webpage, it is necessary to:
- Step 1. While viewing the page in your browser, identify what contents of the page you're interested in.
- Step 2. Look at the HTML returned from the BeautifulSoup object, and pinpoint the specific context that surrounds each of the items you're interested in.
- Step 3. Devise a pattern using BeautifulSoup and/or Regular Expressions to extract said contents.
For example:
Step 1:¶
Let's say that, for every news article found on AP's Top News page, you want to extract the link and its associated title. Consider one such article near the top of the page (many more appear below it). Its title is "California fires bring more chopper rescues, power shutoffs" and its link is /c0aa17fff978e9c4768ee32679b8555c. Since the current page is hosted at apnews.com, the article link's full address is apnews.com/c0aa17fff978e9c4768ee32679b8555c.
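One standard way to build that full address in code (a sketch using Python's standard library; the href below is the example article from above) is urllib.parse.urljoin:
from urllib.parse import urljoin

# join the article's relative href onto the site's base URL
full_url = urljoin("https://apnews.com", "/c0aa17fff978e9c4768ee32679b8555c")
print(full_url)  # https://apnews.com/c0aa17fff978e9c4768ee32679b8555c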
Step 2:¶
After printing the soup object, we still saw a huge mess of HTML. So, let's drill down into specific sections. As illustrated in the official BeautifulSoup documentation, we can retrieve all of the page's links by running the cell below:
soup.find_all("a")
This is still a ton of text (links). So, let's get more specific. I now search for the title text California fires bring more chopper rescues, power shutoffs within the output of the previous cell (the HTML of all links). I notice the following:
<a class="Component-headline-0-2-110" data-key="card-headline" href="/c0aa17fff978e9c4768ee32679b8555c">California fires bring more chopper rescues, power shutoffs</a>
I also see that this is repeatable; every news article on the Top News page has such text! Great!
Step 3:¶
The pattern is that we want the value of the href attribute, along with the text of the link. There are many ways to get at this information. Below, I show just a few:
# EXAMPLE 1
# returns all `a` tags whose class contains `Component-headline-0-2-110`
soup.find_all("a", "Component-headline-0-2-110")
# iterates over each headline link and extracts the href and title
for link in soup.find_all("a", "Component-headline-0-2-110"):
    full_url = "https://apnews.com" + link['href']
    title = link.text
    print(title, "->", full_url)
# EXAMPLE 2
# this returns the same exact subset of links as the example above
# so, we could iterate through the list just like above
soup.find_all("a", attrs={"data-key": "card-headline"})
Alternatively, we could use Regular Expressions if we were confident that our Regex pattern only matched on the relevant links.
# EXAMPLE 3
# instead of using BeautifulSoup, we handle all of the parsing
# ourselves, working directly with the original Requests text;
# each match captures the href and the link text up to the closing tag
re.findall(r'"Component-headline.*?href="(.+?)">(.+?)<', home_page.text)