Key Word(s): eda, pandas


CS109A Introduction to Data Science

Detailed Examples: Data Collection, Parsing, and Quick Analyses

Harvard University
Fall 2020
Instructors: Pavlos Protopapas, Kevin Rader, and Chris Tanner


Title

Extra Practice + Solutions!

Description

This exercise will not be graded; in fact, it's not even submittable, and it's definitely not mandatory to work on it.

But, if you would like extra practice, we crafted this notebook which is very similar in nature to the homework. More importantly, it's very realistic to real-world scenarios whereby one would explore and analyze data -- before modelling is involved.

We have not included an auto-grader, so you cannot test your solutions. However, we provide the solutions, so you can manually check if your outputs are on par with ours. The solutions are visible from the tab up top (right-side) in this window.

In [ ]:
## RUN THIS CELL TO GET THE RIGHT FORMATTING 
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2020-CS109A/master/themes/static/css/cs109.css").text
HTML(styles)

Overview

In this notebook, your goal is to gain further practice with acquiring, parsing, cleaning, and analyzing data. Since real-world problems often require gathering information from a variety of sources, including the Internet, web scraping is a highly useful skill to have. To do this, we will scrape IMDb data on the highest-paid actors and actresses, extracting various key data points and using PANDAS to learn how to aggregate the data in useful ways.

Learning Objectives

  • Get started using Jupyter Notebooks, which are incredibly popular, powerful, and will be our medium of programming for the duration of CS109A and CS109B.
  • Become familiar with how to scrape and use data from online sources.
  • Gain experience with data exploration and simple analysis.
  • Become comfortable with PANDAS as a means of storing and working with data.
  • Feel well-prepared to complete HW1.

Notes

  • Exercise responsible scraping. Web servers can become slow or unresponsive if they receive too many requests from the same source in a short amount of time. In your code, use a delay of 2 seconds between requests. This helps to not get blocked by the target website -- imagine how frustrating it would be to have this occur. Section 1 of this homework involves saving the scraped web pages to your local machine.

  • Web scraping requests can take several minutes. Depending on one's project, it could even take hours, days, or last indefinitely (Google crawling the entire Web).

  • As you run a Jupyter Notebook, it maintains a running state of memory. Thus, the order in which you run cells matters and plays a crucial role; it can be easy to make mistakes based on when you run different cells as you develop and test your code. Before submitting every Jupyter Notebook homework assignment, be sure to restart your Jupyter Notebook and run the entire notebook from scratch, all at once (i.e., "Kernel -> Restart & Run All")
In [ ]:
# import the necessary libraries
import re
import requests
import pandas as pd
import numpy as np
from time import sleep
from bs4 import BeautifulSoup

Table of Contents

  1. Practice with regex
  2. Obtaining IMDb Data
    1. Fetching website data via requests
    2. BeautifulSoup
    3. Obtain actor url + salary
    4. Scrape rest of data
  3. Loading and Exploring Data
    1. Saving & Loading Data with Pandas
    2. Cleaning data (rename columns + change types)
    3. Slicing & sorting data
    4. Calculating summary statistics (min, max, mean, etc)
    5. pd.cut, df.groupby, and bar plots
    6. Exploring age vs salary
    7. Exploring salary vs sex
    8. Exploring awards vs sex
    9. Exploring awards vs sex part II
    10. Exploring composer credits

0. Practice with regex

Being able to scrape, parse, and analyze simple website data is very useful in a variety of settings. Here, we look at a U.S. Senate vote on confirming a nominee to be a U.S. District Judge: https://www.senate.gov/legislative/LIS/roll_call_lists/roll_call_vote_cfm.cfm?congress=116&session;=2&vote;=00157

We provide the scraping. Your task is:

  1. Write the BeautifulSoup to grab the ‘vote by positon' section for both "Yea" and "Nay".
  2. Write a regex to extract each senator’s name for the Yeas and Nays.
In [ ]:
url = "https://www.senate.gov/legislative/LIS/roll_call_lists/roll_call_vote_cfm.cfm?congress=116&session;=2&vote;=00157"
s = requests.Session()
page = s.get(url)
In [ ]:
page
In [ ]:
# YOUR CODE HERE
# END OF YOUR CODE HERE

Explanation of regex: '\)(.*?)\s\('

I noticed the names were listed between parentheses i.e. ...)Barrasso (R.... So I decided to match for the text in between parentheses with a space before the open parentheses, i.e. )abc (.

The regex searches returns a list of all matches to the following condition: match any string of any length that comes after ")" but before " (".

1. Obtaining IMDb Data

Here, we are interested in analyzing several data points for famous actors and actresses on IMDb. IMDb provides relevant data that includes the names, sexes, and various awards of actors and actresses. Visit https://www.imdb.com/list/ls026028927/ to find a list of the highest-paid actors and actresses. Each actor

In this exercise, we will focus on automating the downloading of each actor's data (via Requests). First, as we will do for every Jupyter Notebook, let's import necessary packages that we will use throughout the notebook (i.e., run the cell below).

In [ ]:
# we define this for convenience, as every actor's url begins with this prefix
base_url = 'https://www.imdb.com'
extension = '/list/ls026028927/'

Here, we fetch the webpage and construct a BeautifulSoup object (HTML parser) from it.

In [ ]:
actors_page = requests.get(base_url + extension)
bs_page = BeautifulSoup(actors_page.content, "html.parser")
bs_page
Exercise 1.1: In the cell below, write code that uses the BeautifulSoup object to parse through the home page in order to extract the **link and salary** for every actor. Specifically, populate the `info` dictionary by setting each key to be the actor name and the value to be **a dictionary of data**, with keys `url` and `salary`. When complete, there will be 30 keys to the `info` dictionary. As an example, within `info`, one of your pairs should be: ``'Adam Sandler': {'salary': '$50.5 million', 'url': 'https://www.imdb.com/name/nm0001191'}`` **HINT:** There are _many_ solutions, but you may find it easiest to use Regular Expression(s)
In [ ]:
# YOUR CODE HERE
# END OF YOUR CODE HERE
Exercise 1.2: In the cell below, write code that uses the BeautifulSoup object to parse through each actor's url to extract these additional variables and add them to `info`: - birth date - sex - height - producer credits - actor credits - soundtrack credits - composer credits - award wins - award nominations Save this as a Pandas dataframe called `df` where each row represents a specific actor. Check `df.shape`: you should have 30 rows and 11 columns. **HINT:** To get award wins and nominations, you can append `'/awards'` to your url and request that link. Remember to scrape responsibly! **HINT:** Actor credits are listed under "Actor" for males and "Actress" for females. Make sure you scrape that correctly.
In [ ]:
# YOUR CODE HERE
# END OF YOUR CODE HERE

2. Loading and Exploring Data

Now, let's actually use the data! Here, we ask you to perform a few operations using PANDAS on our new dataset.

Exercise 2.1: Saving & Loading Data Save the dataframe as a csv, and then load it back again, saving it under a different name. Compare the number of columns of your new dataframe to `df` above. If it is different, why?
In [ ]:
# YOUR CODE HERE
# END OF YOUR CODE HERE

The newly loaded dataframe turned the index of df into a new Unnamed column.

Exercise 2.2: Cleaning Data First, to clean up the data frame, we will rename column headers that contain a space, i.e. changing "actor credits" to "actor_credits". This is standard practice and will make references to column names easier in the future. Also, change all numerical columns to be of type `"int"` or `"float"`, whichever is more appropriate, by using the `astype` function.
In [ ]:
# YOUR CODE HERE
# END OF YOUR CODE HERE
Exercise 2.3: Slicing & Sorting Data Use `df` for all future exercises. Create a subset of `df` of female actresses with more than 50 wins. Sort it by actor credits, then producer credits. So, if there is a tie in actor credits, the person with more producer credits should come 1st. Sort in decreasing order (most actor credits goes first).
In [ ]:
# YOUR CODE HERE
# END OF YOUR CODE HERE
Exercise 2.4: Data Summary Statistics Calculate summary statistics for the actors' salaries and total credits (actor + producer + composer + soundtrack).
In [ ]:
# YOUR CODE HERE
# END OF YOUR CODE HERE
Exercise 2.5: Group By Use the pandas `cut` function to group the actors by total credits into bins of 50 (i.e. 1-50, 51-100, etc.). Then use the `groupby` function to find the mean number of nominations for each bin and report it in a plot.
In [ ]:
# YOUR CODE HERE
# END OF YOUR CODE HERE
Exercise 2.6: Exploring Age vs Salary One interesting question is looking at age vs salary : Is there a peak age that acquires the highest salary?; Are older actors paid less? In order to answer these questions, we must first extract age from the birthdate. We can do this by converting the birthdate to a datetime object. More info on datetime objects: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html .
In [ ]:
# YOUR CODE HERE
# END OF YOUR CODE HERE

Now we explore the age statistics.

In [ ]:
# YOUR CODE HERE
# END OF YOUR CODE HERE
In [ ]:
# find the actor/actress associated with oldest age
# YOUR CODE HERE
# END OF YOUR CODE HERE

Observe the results. What do you notice about the two youngest actors/actresses? What do you notice about the oldest?

The youngest are both female, and have significantly less credits than the oldest, who happens to be male. But notice that Jennifer Lawrence does have more wins than Emma Watson and Samuel Jackson. Do you think that it makes sense that more wins/awards correlates with a higher salary?

Let's look a little further in depth in the age range by splitting up age by quartiles according to the data. We can do this with pandas' built in .describe() function.

In [ ]:
df["age"].describe()

We the bin age groups based on these quartile summary statistics.

In [ ]:
quartile_1 = df[df.age <= 41]
quartile_2 = df[(df.age > 41) & (df.age <=49)]
quartile_3 = df[(df.age >49) & (df.age <=53)]
quartile_4 = df[df.age >53]
In [ ]:
# look at mean salary within each age quartile
# YOUR CODE HERE
# END OF YOUR CODE HERE

We can see that quartile 2 (ages 42 - 49) has higher average salary than quartile 3 (ages 50 - 53), but quartile 4 still has highest average salary (ages 54 - 71). Why might this be?

Exercise 2.7: Exploring Salary vs sex Now, lets look at salary vs sex. This could be useful to look into any possible gender bias.
In [ ]:
# salary vs sex
# look at mean salary among male and females
# YOUR CODE HERE
# END OF YOUR CODE HERE
In [ ]:
# look at min and max salary among male and females
# YOUR CODE HERE
# END OF YOUR CODE HERE

What do you notice from these statistics? It looks like females have a lower salary in all three summary statistics. Why do you think this is? Do you think there are other factors that could be affecting this? If so, what else in the data could be indicative?

It could also be helpful to additionally look the average total number of credits per gender.

In [ ]:
# look at mean salary among male and females
# YOUR CODE HERE
# END OF YOUR CODE HERE
In [ ]:
# look at min and max salary among male and females
# YOUR CODE HERE
# END OF YOUR CODE HERE

Notice that although females have lower average salary than males, they also tend to have fewer credits. Does this tell us something about how the number of credits correlates with salary? How could you explore this further?

Exercise 2.8: Exploring awards by sex Which actress has the most wins? How many wins does she have? Which actor has the most wins? How many wins does he have?
In [ ]:
# YOUR CODE HERE
# END OF YOUR CODE HERE
Exercise 2.9: Exploring awards by sex, Part II What is the mean number of wins for an actor/actress with less than 65 actor credits? With greater than or equal to 65 actor credits?
In [ ]:
# YOUR CODE HERE
# END OF YOUR CODE HERE
Exercise 2.10: Exploring composer credits How many actors/actresses have at least 1 composer credit?
In [ ]:
# YOUR CODE HERE
# END OF YOUR CODE HERE