{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#
CS109A Introduction to Data Science \n",
"\n",
"## Lecture 3, Exercise 1: Web Scraping and Parsing Intro\n",
"\n",
"\n",
"**Harvard University**
\n",
"**Fall 2020**
\n",
"**Instructors**: Pavlos Protopapas, Kevin Rader, and Chris Tanner\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Title\n",
"\n",
"**Exercise 1: Web Scraping and Parsing Intro**\n",
"\n",
"# Description\n",
"\n",
"**OVERVIEW**\n",
"\n",
"As we learned in class, the three most common sources of data used for Data Science are:\n",
"\n",
"- files (e.g, `.csv`, `.txt`) that already contain the dataset\n",
"- APIs (e.g., Twitter or Facebook)\n",
"- web scraping (e.g., Requests)\n",
"\n",
"Here, we get practice with web scraping by using **Requests**. Once we fetch the page contents, we will need to extract the information that we actually care about. We rely on BeautifulSoup to help with this."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"import requests\n",
"from bs4 import BeautifulSoup"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **NOTE**: After running every cell, be sure to auto-grade your work by clicking 'Mark' in the lower-right corner. Otherwise, no credit will be given.\n",
"\n",
"For this exercise, we will be grabbing data (the Top News stories) from [AP News](apnews.com), a not-for-profit news agency."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# the URL of the webpage that has the desired info\n",
"url = \"https://apnews.com/hub/ap-top-news\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Web Scraping (Graded)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's use [`requests`](https://requests.readthedocs.io/en/master/user/quickstart/) to fetch the contents. Specifically, the [`requests`](https://requests.readthedocs.io/en/master/user/quickstart/) library has a `.get()` function that returns a [Response object](https://www.w3schools.com/python/ref_requests_response.asp). A Response object contains the server's _response_ to the HTTP request, and thus contains all the information that we could want from the page.\n",
"\n",
"Below, fill in the blank to fetch AP News' Top News website."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_a) ###\n",
"home_page = requests.get(____)\n",
"home_page.status_code"
]
},
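{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an ungraded aside, the same pattern works for any URL (the address below is just an arbitrary illustration, not part of the exercise), and the returned Response object carries more than the status code:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# ungraded sketch: fetch an arbitrary page and inspect the Response object\n",
"example_page = requests.get(\"https://example.com\")\n",
"print(example_page.status_code)  # 200 on success\n",
"print(example_page.headers[\"Content-Type\"])  # e.g., 'text/html; charset=UTF-8'"
]
},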
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You should have received a status code of 200, which means the page was successfully found on the server and sent to receiver (aka client/user/you). [Again, you can click here](https://www.restapitutorial.com/httpstatuscodes.html) for a full list of status codes. Recall that sometimes, while browsing the Internet, webpages will report a 404 error, possibly with an entertaining graphic to ease your pain. That 404 is the status code, just like we are using here!"
]
},
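{
"cell_type": "markdown",
"metadata": {},
"source": [
"An ungraded defensive pattern worth knowing (a sketch, not required for this exercise): check the status code before parsing, or let `requests` raise an exception for you via `raise_for_status()`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# ungraded sketch: two common ways to guard against a failed request\n",
"if home_page.status_code == 200:\n",
"    print(\"page fetched successfully\")\n",
"else:\n",
"    print(\"request failed with status\", home_page.status_code)\n",
"\n",
"# equivalently, raise_for_status() throws requests.HTTPError on 4xx/5xx codes\n",
"home_page.raise_for_status()"
]
},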
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`home_page` is now a [Response object](https://www.w3schools.com/python/ref_requests_response.asp). It contains many attributes, including the `.text`. Run the cell below and note that it's identical to if we were to visit the webpage in our browser and clicked 'View Page Source'."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"home_page.text"
]
},
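{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since the full `.text` string is enormous, a handy (ungraded) trick is to check its length and peek at just the first few hundred characters; it is an ordinary Python string, so slicing works:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# ungraded sketch: measure and preview the raw HTML string\n",
"print(len(home_page.text), \"characters of HTML\")\n",
"print(home_page.text[:500])"
]
},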
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Parsing Intro (Graded)\n",
"The above `.text` property is atrocious to view and make sense of. Sure, we could write Regular Expressions to extract all of the contents that we're interested in. Instead, let's first use [`BeautifulSoup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to parse the content into more manageable chunks.\n",
"\n",
"Below, fill in the blank to construct an HTML-parsed [`BeautifulSoup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) object from our website."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_b) ###\n",
"soup = BeautifulSoup(____, ____)\n",
"soup"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You'll notice that the `soup` object is better formatted than just looking at the entire text. It's still dense, but it helps.\n",
"\n",
"Below, fill in the blank to set `webpage_title` equal to the text of the webpage's title (no HTML tags included)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_c) ###\n",
"webpage_title = ____"
]
},
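{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the pattern above felt opaque, here is a self-contained, ungraded toy example (the HTML string below is made up purely for illustration) showing how a parsed document exposes its title:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# ungraded sketch: parse a tiny, hard-coded HTML document\n",
"toy_html = \"<html><head><title>My Toy Page</title></head><body><p>Hi!</p></body></html>\"\n",
"toy_soup = BeautifulSoup(toy_html, \"html.parser\")\n",
"print(toy_soup.title)       # the full <title> tag\n",
"print(toy_soup.title.text)  # just the text, no HTML tags"
]
},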
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again, our BeautifulSoup object allows for quick, convenient searching and access to the web page contents.\n",
"\n",
"# Data Parsing Examples (Not Graded)\n",
"\n",
"Anytime you wish to extract specific contents from a webpage, it is necessary to:\n",
"- **Step 1**. While viewing the page in your browser, identify what contents of the page you're interested in.\n",
"- **Step 2**. Look at the HTML returned from the BeautifulSoup object, and pinpoint the specific context that surrounds each of these items that you're interested in\n",
"- **Step 3.** Devise a pattern using BeautifulSoup and/or RegularExpressions to extract said contents."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example:\n",
"### **Step 1:**\n",
"Let's say, for every news article found on the AP's Top News page, you want to extract the link and associated title. In this screenshot\n",
"
\n",
"\n",
"we can see one news article (there are many more below on the page). Its title is `\"California fires bring more chopper rescues, power shutoffs\"` and its link is to [/c0aa17fff978e9c4768ee32679b8555c](/c0aa17fff978e9c4768ee32679b8555c). Since the current page is stored at apnews.com, the article link's full address is [apnews.com/c0aa17fff978e9c4768ee32679b8555c](https://apnews.com/c0aa17fff978e9c4768ee32679b8555c).\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### **Step 2:**\n",
"\n",
"After printing the `soup` object, we saw a huge mess of all of the HTML still. So, let's drill down into certain sections. As illustrated in the [official documentation here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-using-tag-names), we can retrieve all `` links by running the cell below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"soup.find_all(\"a\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is still a ton of text (links). So, let's get more specific. I now search for the title text `California fires bring more chopper rescues, power shutoffs` within the output of the previous cell (the HTML of all links). I notice the following:\n",
"\n",
"`California fires bring more chopper rescues, power shutoffs
`\n",
"\n",
"I also see that this is repeatable; every news article on the Top News page has such text! Great!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### **Step 3:**\n",
"\n",
"The pattern is that we want the value of the `href` attribute, along with the text of the link. There are many ways to get at this information. Below, I show just a few:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"# EXAMPLE 1\n",
"\n",
"# returns all `a` links that also contain `Component-headline-0-2-110`\n",
"soup.find_all(\"a\", \"Component-headline-0-2-110\")\n",
"\n",
"# iterates over each link and extracts the href and title\n",
"for link in soup.find_all(\"a\", \"Component-headline-0-2-110\"):\n",
" url = \"www.apnews.com\" + link['href']\n",
" title = link.text"
]
},
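{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the loop above overwrites `url` and `title` on every iteration, so only the last article's values survive. A small ungraded variation (assuming the same class name) collects all of them:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# ungraded sketch: accumulate (url, title) pairs instead of overwriting them\n",
"articles = []\n",
"for link in soup.find_all(\"a\", \"Component-headline-0-2-110\"):\n",
"    articles.append((\"https://apnews.com\" + link[\"href\"], link.text))\n",
"articles[:3]  # peek at the first few articles"
]
},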
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As mentioned in the official documentation [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#attributes) and [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-keyword-arguments), a tag (such as `a`) may have many attributes, and you can search them by putting your terms in a dictionary."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# EXAMPLE 2\n",
"# this returns the same exact subset of links as the example above\n",
"# so, we could iterate through the list just like above\n",
"soup.find_all(\"a\", attrs={\"data-key\": \"card-headline\"})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, we could use Regular Expressions if we were confident that our Regex pattern only matched on the relevant links."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# EXAMPLE 3\n",
"# instead of using the BeautifulSoup, we are handling all of the parsing\n",
"# ourselves, and working directly with the original Requests text\n",
"re.findall(\"(.+?)\", home_page.text)"
]
}
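,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a final ungraded sketch (assuming the same class name and attribute layout as above), the regex's two capture groups can be unpacked directly into full URLs and titles:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# ungraded sketch: unpack the two capture groups (href, title) from each match\n",
"pattern = r'<a class=\"Component-headline-0-2-110\" data-key=\"card-headline\" href=\"(.+?)\">(.+?)</a>'\n",
"for href, title in re.findall(pattern, home_page.text):\n",
"    print(\"https://apnews.com\" + href, \"--\", title)"
]
}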
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}