CS109A Introduction to Data Science
Detailed Examples: Data Collection, Parsing, and Quick Analyses
Harvard University
Fall 2020
Instructors: Pavlos Protopapas, Kevin Rader, and Chris Tanner
Title
Extra Practice + Solutions!
Description
This exercise will not be graded; in fact, it's not even submittable, and it's definitely not mandatory to work on it.
But if you would like extra practice, we crafted this notebook, which is very similar in nature to the homework. More importantly, it closely mirrors real-world scenarios in which one explores and analyzes data -- before modelling is involved.
We have not included an auto-grader, so you cannot test your solutions. However, we provide the solutions, so you can manually check if your outputs are on par with ours. The solutions are visible from the tab up top (right-side) in this window.
## RUN THIS CELL TO GET THE RIGHT FORMATTING
import requests
from IPython.display import HTML  # IPython.core.display is a deprecated import path
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2020-CS109A/master/themes/static/css/cs109.css").text
HTML(styles)
Overview
In this notebook, your goal is to gain further practice with acquiring, parsing, cleaning, and analyzing data. Since real-world problems often require gathering information from a variety of sources, including the Internet, web scraping is a highly useful skill to have. To do this, we will scrape IMDb data on the highest-paid actors and actresses, extracting various key data points and using PANDAS to learn how to aggregate the data in useful ways.
Learning Objectives
- Get started using Jupyter Notebooks, which are incredibly popular and powerful, and which will be our medium of programming for the duration of CS109A and CS109B.
- Become familiar with how to scrape and use data from online sources.
- Gain experience with data exploration and simple analysis.
- Become comfortable with PANDAS as a means of storing and working with data.
- Feel well-prepared to complete HW1.
Notes
- Exercise responsible scraping. Web servers can become slow or unresponsive if they receive too many requests from the same source in a short amount of time. In your code, use a delay of 2 seconds between requests. This helps you avoid being blocked by the target website -- imagine how frustrating that would be. Section 1 of this homework involves saving the scraped web pages to your local machine.
- Web scraping requests can take several minutes. Depending on one's project, it could even take hours, days, or last indefinitely (Google crawling the entire Web).
- As you run a Jupyter Notebook, it maintains a running state of memory. Thus, the order in which you run cells matters and plays a crucial role; it can be easy to make mistakes based on when you run different cells as you develop and test your code. Before submitting every Jupyter Notebook homework assignment, be sure to restart your Jupyter Notebook and run the entire notebook from scratch, all at once (i.e., "Kernel -> Restart & Run All").
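As a rough illustration of the 2-second delay mentioned in the notes, here is a minimal sketch of a fetch loop that pauses between requests. The helper name `fetch_politely`, the stand-in `fake_fetch`, and the example paths are all hypothetical; in the notebook you would pass something like `requests.get` as the `fetch` argument and leave the delay at 2 seconds.

```python
from time import sleep


def fetch_politely(urls, fetch, delay_seconds=2.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # pause before every request after the first
        pages[url] = fetch(url)
    return pages


# Hypothetical stand-in for requests.get so the sketch runs offline.
def fake_fetch(url):
    return f"<html>{url}</html>"


# delay_seconds=0 only to keep the offline demo fast; use 2 for real requests.
demo_pages = fetch_politely(["/page/a/", "/page/b/"], fake_fetch, delay_seconds=0)
print(list(demo_pages))
```

Passing the fetch function in as an argument is just one design choice; it keeps the delay logic separate from the HTTP library and makes the loop easy to try without touching a real server.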
# import the necessary libraries
import re
import requests
import pandas as pd
import numpy as np
from time import sleep
from bs4 import BeautifulSoup
Table of Contents
- Practice with regex
- Obtaining IMDb Data
- Fetching website data via requests
- BeautifulSoup
- Obtain actor url + salary
- Scrape rest of data
- Loading and Exploring Data
- Saving & Loading Data with Pandas
- Cleaning data (rename columns + change types)
- Slicing & sorting data
- Calculating summary statistics (min, max, mean, etc)
- pd.cut, df.groupby, and bar plots
- Exploring age vs salary
- Exploring salary vs sex
- Exploring awards vs sex
- Exploring awards vs sex part II
- Exploring composer credits
0. Practice with regex
Being able to scrape, parse, and analyze simple website data is very useful in a variety of settings. Here, we look at a U.S. Senate vote on confirming a nominee to be a U.S. District Judge: https://www.senate.gov/legislative/LIS/roll_call_lists/roll_call_vote_cfm.cfm?congress=116&session=2&vote=00157
We provide the scraping. Your task is:
- Write the BeautifulSoup code to grab the 'vote by position' section for both "Yea" and "Nay".
- Write a regex to extract each senator’s name for the Yeas and Nays.
url = "https://www.senate.gov/legislative/LIS/roll_call_lists/roll_call_vote_cfm.cfm?congress=116&session=2&vote=00157"
s = requests.Session()
page = s.get(url)
page
# YOUR CODE HERE
# END OF YOUR CODE HERE
Explanation of regex: '\)(.*?)\s\('
I noticed the names were listed between parentheses, i.e. ...)Barrasso (R... So I decided to match the text that sits between a closing parenthesis and the next opening parenthesis, with a space before the opening parenthesis, i.e. )abc (.
The regex search returns a list of all matches to the following condition: match any string of any length (as short as possible, since .*? is lazy) that comes after ")" but before " (".
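To see the pattern in action without fetching the page, here is a small sketch that applies it to a made-up string. The sample text is an assumption that only mimics the ")Name (" layout described above; it is not copied from the Senate page.

```python
import re

# Made-up sample mimicking the ")Name (" layout; not real page content.
sample = "Alexander (R-TN)Barrasso (R-WY)Blackburn (R-TN)"

pattern = r"\)(.*?)\s\("
names = re.findall(pattern, sample)
print(names)
```

Note that in this sample the first name ("Alexander") is not captured, because no ")" precedes it. A name at the very start of the text would need to be handled separately.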
1. Obtaining IMDb Data
Here, we are interested in analyzing several data points for famous actors and actresses on IMDb. IMDb provides relevant data that includes the names, sexes, and various awards of actors and actresses. Visit https://www.imdb.com/list/ls026028927/ to find a list of the highest-paid actors and actresses. Each actor
In this exercise, we will focus on automating the downloading of each actor's data (via Requests). First, as we will do for every Jupyter Notebook, let's import necessary packages that we will use throughout the notebook (i.e., run the cell below).
# we define this for convenience, as every actor's url begins with this prefix
base_url = 'https://www.imdb.com'
extension = '/list/ls026028927/'
Here, we fetch the webpage and construct a BeautifulSoup object (HTML parser) from it.
actors_page = requests.get(base_url + extension)
bs_page = BeautifulSoup(actors_page.content, "html.parser")
bs_page
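To illustrate what working with a parsed BeautifulSoup object can look like, here is a minimal sketch on a small, made-up HTML snippet. The tag layout, the class name `item`, and the paths are assumptions for illustration only and do not describe IMDb's actual markup; `find_all`, `find`, and `get_text` are standard BeautifulSoup calls.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML; the real page's structure must be inspected separately.
html = """
<div class="item"><a href="/name/example-a/">Actor A</a></div>
<div class="item"><a href="/name/example-b/">Actor B</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect (link text, href) pairs from every matching <div>.
links = []
for div in soup.find_all("div", class_="item"):
    anchor = div.find("a")
    links.append((anchor.get_text(strip=True), anchor["href"]))

print(links)
```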
# YOUR CODE HERE
# END OF YOUR CODE HERE
# YOUR CODE HERE
# END OF YOUR CODE HERE
2. Loading and Exploring Data
Now, let's actually use the data! Here, we ask you to perform a few operations using PANDAS on our new dataset.
# YOUR CODE HERE
# END OF YOUR CODE HERE
The newly loaded dataframe turned the index of df into a new Unnamed column.
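The behavior described above can be reproduced on a toy dataframe. This is a sketch with made-up rows and an in-memory buffer standing in for a CSV file on disk; `index_col=0` is one common way to avoid the extra column, and writing with `index=False` is another.

```python
import io

import pandas as pd

# Toy data; the names and ages are placeholders.
toy = pd.DataFrame({"name": ["Actor A", "Actor B"], "age": [30, 55]})

buffer = io.StringIO()
toy.to_csv(buffer)  # the index is written as an unnamed first column

buffer.seek(0)
reloaded = pd.read_csv(buffer)  # the index comes back as "Unnamed: 0"
print(list(reloaded.columns))

buffer.seek(0)
fixed = pd.read_csv(buffer, index_col=0)  # treat the first column as the index
print(list(fixed.columns))
```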
# YOUR CODE HERE
# END OF YOUR CODE HERE
# YOUR CODE HERE
# END OF YOUR CODE HERE
# YOUR CODE HERE
# END OF YOUR CODE HERE
# YOUR CODE HERE
# END OF YOUR CODE HERE
# YOUR CODE HERE
# END OF YOUR CODE HERE
Now we explore the age statistics.
# YOUR CODE HERE
# END OF YOUR CODE HERE
# find the actor/actress associated with oldest age
# YOUR CODE HERE
# END OF YOUR CODE HERE
Observe the results. What do you notice about the two youngest actors/actresses? What do you notice about the oldest?
The youngest are both female, and have significantly fewer credits than the oldest, who happens to be male. But notice that Jennifer Lawrence does have more wins than Emma Watson and Samuel Jackson. Do you think it makes sense for more wins/awards to correlate with a higher salary?
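The kind of lookup behind these observations can be sketched on toy data. The rows below are placeholders rather than the scraped values, and the column names `name` and `age` are assumptions about how the dataframe is labeled.

```python
import pandas as pd

# Placeholder rows; not the real scraped data.
toy = pd.DataFrame({
    "name": ["Actor A", "Actor B", "Actor C", "Actor D"],
    "age": [29, 71, 30, 55],
})

# Row belonging to the largest age.
oldest = toy.loc[toy["age"].idxmax()]

# The two rows with the smallest ages, youngest first.
youngest_two = toy.nsmallest(2, "age")

print(oldest["name"])
print(youngest_two["name"].tolist())
```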
Let's look a little further into age by splitting it into quartiles according to the data. We can do this with pandas' built-in .describe() function.
df["age"].describe()
We then bin the age groups based on these quartile summary statistics.
quartile_1 = df[df.age <= 41]
quartile_2 = df[(df.age > 41) & (df.age <= 49)]
quartile_3 = df[(df.age > 49) & (df.age <= 53)]
quartile_4 = df[df.age > 53]
# look at mean salary within each age quartile
# YOUR CODE HERE
# END OF YOUR CODE HERE
We can see that quartile 2 (ages 42 - 49) has a higher average salary than quartile 3 (ages 50 - 53), but quartile 4 (ages 54 - 71) still has the highest average salary. Why might this be?
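An alternative to building four separate dataframes is to label each row with its age bin and then group. The sketch below uses `pd.cut` with the same cut points (41, 49, 53) on made-up data; the `salary` column name, the outer bin edges 0 and 200, and all values are assumptions for illustration.

```python
import pandas as pd

# Made-up ages and salaries, not the scraped values.
toy = pd.DataFrame({
    "age": [30, 41, 45, 52, 60, 71],
    "salary": [10.0, 20.0, 30.0, 40.0, 50.0, 70.0],
})

# Right-inclusive bins: (0, 41], (41, 49], (49, 53], (53, 200]
bins = [0, 41, 49, 53, 200]
labels = ["<=41", "42-49", "50-53", ">53"]
toy["age_bin"] = pd.cut(toy["age"], bins=bins, labels=labels)

# Mean salary within each age bin, in bin order.
mean_salary = toy.groupby("age_bin", observed=True)["salary"].mean()
print(mean_salary)
```

Because `pd.cut` bins are right-inclusive by default, an age of exactly 41 falls in the first bin, matching the `df.age <= 41` filter above.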
# salary vs sex
# look at mean salary among males and females
# YOUR CODE HERE
# END OF YOUR CODE HERE
# look at min and max salary among males and females
# YOUR CODE HERE
# END OF YOUR CODE HERE
What do you notice from these statistics? It looks like females have a lower salary in all three summary statistics. Why do you think this is? Do you think there are other factors that could be affecting this? If so, what else in the data could be indicative?
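One compact way to get all three summary statistics at once is a single `groupby` followed by `agg`. This is a sketch on made-up numbers; the column names `sex` and `salary` and the "F"/"M" labels are assumptions about the cleaned dataframe.

```python
import pandas as pd

# Made-up salaries; only the pattern matters here.
toy = pd.DataFrame({
    "sex": ["F", "F", "M", "M", "M"],
    "salary": [10.0, 20.0, 15.0, 30.0, 45.0],
})

# One row per group, one column per statistic.
salary_stats = toy.groupby("sex")["salary"].agg(["mean", "min", "max"])
print(salary_stats)
```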
It could also be helpful to look at the average total number of credits per gender.
# look at mean number of credits among males and females
# YOUR CODE HERE
# END OF YOUR CODE HERE
# look at min and max number of credits among males and females
# YOUR CODE HERE
# END OF YOUR CODE HERE
Notice that although females have lower average salary than males, they also tend to have fewer credits. Does this tell us something about how the number of credits correlates with salary? How could you explore this further?
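One possible way to explore this further is to compare group means side by side and then compute a correlation coefficient between credits and salary. The sketch below uses made-up numbers and assumed column names (`sex`, `credits`, `salary`); a correlation on the real data could look quite different, and correlation alone does not explain why salaries differ.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "sex": ["F", "F", "M", "M"],
    "credits": [20, 40, 60, 80],
    "salary": [10.0, 25.0, 30.0, 40.0],
})

# Average credits and salary per group.
group_means = toy.groupby("sex")[["credits", "salary"]].mean()
print(group_means)

# Pearson correlation between credits and salary across all rows.
r = toy["salary"].corr(toy["credits"])
print(round(r, 3))
```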
# YOUR CODE HERE
# END OF YOUR CODE HERE
# YOUR CODE HERE
# END OF YOUR CODE HERE
# YOUR CODE HERE
# END OF YOUR CODE HERE