Key Word(s): Pandas, Data Cleaning
CS109A Introduction to Data Science
Lecture 3 (Pandas + Beautiful Soup)¶
Harvard University
Fall 2019
Instructors: Pavlos Protopapas, Kevin Rader, and Chris Tanner
Authors: Rahul Dave, David Sondak, Will Claybaugh, Pavlos Protopapas, Chris Tanner
## RUN THIS CELL TO GET THE RIGHT FORMATTING
from IPython.core.display import HTML
from IPython.display import Image
def css_styling():
styles = open("../../../styles/cs109.css", "r").read()
return HTML(styles)
css_styling()
Table of Contents¶
- Learning Goals
- Loading and Cleaning with Pandas
- Parsing and Completing the Dataframe
- Grouping
Learning Goals¶
This Jupyter notebook accompanies Lecture 3. By the end of this lecture, you should be able to:
- Understand why and how Pandas can be useful.
- Use Pandas to:
- Load data into a DataFrame
- Access subsets of data based on column and row values
- Address missing values (e.g., NaN)
- Use groupby() to select sections of data
- Plot DataFrames (e.g., barplot())
- Use Beautiful Soup to download a webpage and all of its links
Part 1: Processing Data without Pandas¶
../data/top50.csv is a dataset found online (Kaggle.com) that contains information about the 50 most popular songs on Spotify in 2019.
Each row represents a distinct song. The columns (in order) are:
ID: a unique ID (i.e., 1-50)
TrackName: Name of the Track
ArtistName: Name of the Artist
Genre: the genre of the track
BeatsPerMinute: The tempo of the song.
Energy: The energy of the song - the higher the value, the more energetic the song.
Danceability: The higher the value, the easier it is to dance to this song.
Loudness: The higher the value, the louder the song.
Liveness: The higher the value, the more likely the song is a live recording.
Valence: The higher the value, the more positive mood for the song.
Length: The duration of the song (in seconds).
Acousticness: The higher the value, the more acoustic the song is.
Speechiness: The higher the value, the more spoken words the song contains.
Popularity: The higher the value, the more popular the song is.
Image("fig/top50_screenshot.png") # sample of the data
Read and store ../data/top50.csv¶
Q1: Read in the ../data/top50.csv file and store all of its contents into any data structure(s) that make the most sense to you, keeping in mind that you'd want to easily access any row or column. What data structure(s) make the most sense to use?
f = open("../data/top50.csv")
column_names = f.readline().strip().split(",")[1:] # skips the first header field and puts the remaining names in a list
cleaned_column_names = [name.strip('"') for name in column_names] # removes the extraneous quotes
cleaned_column_names.insert(0, "ID")
dataset = []
# iterates through each line of the .csv file
for line in f:
attributes = line.strip().split(",")
# constructs a new dictionary for each line, and
# appends this dictionary to the `dataset`;
# thus, the dataset is a list of dictionaries (1 dictionary per song)
dataset.append(dict(zip(cleaned_column_names, attributes)))
Q2: Write code to print all songs (Artist and Track name) that are longer than 4 minutes (240 seconds):
for song in dataset:
if int(song["Length"]) > 240:
print(song["ArtistName"], "-", song["TrackName"], "is", song["Length"],"seconds long")
Q3: Write code to print the most popular song (or song(s) if there is a tie):
max_score = -1
most_populars = set()
for song in dataset:
if int(song["Popularity"]) > max_score:
most_populars = set([str(song["ArtistName"] + "-" + song["TrackName"])])
max_score = int(song["Popularity"])
elif int(song["Popularity"]) == max_score:
most_populars.add(str(song["ArtistName"] + "-" + song["TrackName"]))
#print(most_populars)
Q4: Write code to print the songs (and their attributes), if we sorted by their popularity (highest scoring ones first).
# This gets tricky. How would you do it?
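One way to do it without Pandas (a minimal sketch using Python's built-in sorted()):
# sort the list of dictionaries by Popularity, highest first;
# the values were read in as strings, so convert to int for a numeric sort
for song in sorted(dataset, key=lambda s: int(s["Popularity"]), reverse=True):
    print(song)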
Q5: How could you check for null/empty entries? This is only 50 entries. Imagine if we had 500,000.
# you would need to iterate through every single value and check
# that's O(N*M), where N is the # of rows, and M is the # of columns.
# NOTE: what if you wished to impute any given non-value with the column's mean?
# you would need another N checks
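A minimal sketch of that O(N*M) scan over our list of dictionaries:
# walk every value of every song, flagging empty or missing entries
for song in dataset:
    for col, value in song.items():
        if value is None or value == "":
            print("Missing value in column", col, "for song", song.get("TrackName"))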
Oftentimes, one dataset doesn't contain all of the information you are interested in -- in which case, you need to combine data from multiple files.
Q6: Imagine we had another table (i.e., .csv file) below. How could we combine its data with our already-existing dataset?
Part 2: Processing Data with Pandas¶
Pandas is an open-source Python library designed for data analysis and processing. Being open source means that anyone can contribute to it (don't worry, a team of people vets all official updates to the library). Pandas allows for high-performance, easy-to-use data structures. Namely, instead of using N-dimensional arrays like NumPy (which are extremely fast, though), Pandas provides a 2D table object called a DataFrame.
As a very gross simplification: NumPy is great for performing math operations with matrices, whereas Pandas is excellent for wrangling, processing, and understanding 2D data like spreadsheets (which is very common).
Let's get started with simple examples of how to use Pandas. We will continue with our top50.csv Spotify music data.
First, we need to import pandas so that we have access to it. For typing convenience, we choose to rename it as pd, which is common practice.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
Reading in the data¶
Pandas allows us to read in various structured files (e.g., .csv, .json, .html, etc.) with just one line:
# we don't always need to specify the encoding, but this particular
# file has special characters that we need to handle
top50 = pd.read_csv("../data/top50.csv", encoding="ISO-8859-1") # encoding assumed; adjust if your copy of the file differs
High-level view of the data¶
We can view the data frame by simply printing it:
top50
Instead of printing the entire dataframe (e.g., print(top50)), we can also inspect the file by looking at just the first N rows or last N rows.
#top50.head(3) # first 3 rows
#top50.tail(3) # last 3 rows
That's cool, but we can't see all of the columns too well. To get a list of all columns:
#top50.columns
Remember, Pandas is in Python, so you can write regular Python code with it. For example, if you want to know how many columns there are, you can use Python's len() function.
#len(top50.columns)
Fortunately, many of the features in our dataset are numeric. Conveniently, Pandas' describe() function calculates basic statistics for our columns. It's pretty amazing, as it gives us a coarse-grained way to understand our data and check for errors. That is, if we notice any summary statistics that are drastically different from what we deem reasonable, we should dive deeper and figure out why the values are what they are.
#top50.describe()
Notice, it calculated statistics only for the columns that are of numeric data types. What about the textual ones (e.g., Track name and Artist)? Pandas is smart enough to infer the data types. Don't forget to inspect the columns that are text-based though, as we need to ensure they are sound, too.
To view the data type of each column:
#top50.dtypes
Exploring the data¶
I agree with Pandas' handling of the data types. If any column contained floating-point numbers, we would expect to see a float type here, too.
Now that we've viewed our dataset at a high-level, let's actually use and explore it.
We can access a column of data the same way we access a dictionary by its keys:
#top50["Length"]
We could have also used this syntax (identical results):
#top50.Length
If we want just the highest or lowest value of a given column, we can use the functions max() and min(), respectively.
#top50['Length'].max()
#top50['Length'].min()
If we want the row index that corresponds to a column's max or min value, we can use idxmax() and idxmin(), respectively.
#top50['Length'].idxmax()
#top50['Length'].idxmin()
We can also add conditional statements (e.g., >, <, ==) for columns, which yields a boolean vector:
#top50['Length'] > 240
This is useful, as it allows us to process only the rows with the True values.
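For example, a minimal sketch that uses the boolean vector as a mask:
# indexing a DataFrame with a boolean vector keeps only the True rows
#top50[top50['Length'] > 240]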
The .loc indexer allows us to access data via labels:
- A single scalar label
- A list of labels
- A slice object
- A Boolean array
A single scalar:
# single scalar label
#top50.loc[0] # prints the (unnamed) row that has a label of 0 (the 1st row)
# list of labels
#top50.loc[[0,2]] # prints the (unnamed) rows that have the labels of 0 and 2 (the 1st and 3rd rows)
# a slice of the dataframe, based on the passed-in booleans;
# picture it's like a filter overlaying the DataFrame, and the filter
# dictates which values will be emitted/make it through to us
#top50.loc[top50['Length'] > 240] # prints all rows that have Length > 240
Note, this returns a DataFrame. Everything we've learned so far concerns how to use DataFrames, so we can tack on additional syntax to this command if we wish to do further processing.
For example, if we want to index just select columns (e.g., ArtistName, TrackName, and Length) of this returned DataFrame:
#top50.loc[top50['Length'] > 240][['ArtistName', 'TrackName', 'Length']]
Note, the above solves our original Q2: (Write code to print all songs (Artist and Track name) that are longer than 4 minutes (240 seconds))
Q3: Write code to print the most popular song (or song(s) if there is a tie):
## [TODO] TO BE FILLED IN DURING CLASS PARTICIPATION
# FEEL FREE TO WORK WITH A NEIGHBOR
# handle ties by keeping every row whose Popularity equals the max
top50[top50['Popularity'] == top50['Popularity'].max()]
We can also sort our data by a single column! This pertains to our original Q4!
Q4: Write code to print the songs (and their attributes), if we sorted by their popularity (highest scoring ones first).
#top50.sort_values("Popularity", ascending=False)
While .loc allows us to index based on passed-in labels, .iloc allows us to access data based on 0-based integer positions.
The syntax is .iloc[<row selection>, <column selection>], where each selection can be a single integer, a list of integers, or a slice:
#top50.iloc[5:6] # prints all columns for the 6th row
#top50.iloc[:,2] # prints all rows for the 3rd column
#top50.iloc[[0,2,3], [2,1]] # prints the 1st, 3rd, and 4th rows of the 3rd and 2nd columns (artist and track)
Inspecting/cleaning the data¶
As mentioned, it is imperative to ensure the data is sound before using it:
- Did it come from a trustworthy, authoritative source?
- Is the data a complete sample?
- Does the data seem correct?
- (optional) Is the data stored efficiently or does it have redundancies?
Let's walk through each of these points now:
- Did it come from a trustworthy, authoritative source?
The data came from Kaggle.com, which anyone can publish to. However, the author claims that he/she used Spotify.com's official API to query songs in 2019. There are no public comments for it so far. It's potentially credible.
- Is the data a complete sample?
Pandas has functions named isnull() and notnull(), which return boolean masks marking the null and non-null entries, respectively.
For example:
#top50[top50.ArtistName.isnull()] # returns an empty DataFrame
#top50[top50.ArtistName.notnull()] # returns the complete DataFrame since there are no null Artists
If we run this for all of our features/columns, we will see there are no nulls. Since this dataset is manageable in size, you can also just scroll through it and notice no nulls.
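A compact way to check every column at once (a sketch combining isnull() with sum()):
# counts the null entries per column; all zeros means no missing data
#top50.isnull().sum()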
This answers our original Q5:
Q5: How could you check for null/empty entries?
Continuing with our data sanity check list:
- Does the data seem correct?
A quick scroll through the data, and we see a song by Maluma titled 0.95833333. This is possibly a song about probability, but I think the chances are slim. The song is 176 seconds long (2m56s). Looking on Spotify, we see Maluma's most popular song is currently 11PM, which is 2m56s in length! Somehow, during the creation of the dataset, 11PM became 0.95833333. Bonus points if you can figure out where this floating-point number could have come from.
#Image("fig/maluma.png") # sample of the data
Since only one song seems obviously wrong, we can manually fix it. And it's worth noting the fix to ourselves and to whoever else would see our results or receive a copy of our data. If there were many more wrong values, we'd potentially not fix them by hand; we'd explore other options instead.
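A manual fix could look like the following sketch (assuming 11PM is indeed the correct title, per the Spotify lookup above):
# overwrite the garbled track name with the title found on Spotify
top50.loc[top50['TrackName'] == '0.95833333', 'TrackName'] = '11PM'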
- (optional) Is the data stored efficiently or does it have redundancies?
Everything seems fine. No repeated data.
Combining multiple DataFrames¶
As mentioned, oftentimes one dataset doesn't contain all of the information you are interested in -- in which case, you need to combine data from multiple files. This also means you need to verify the accuracy (per above) of each dataset.
Pandas' groupby() function splits the DataFrame into different groups, depending on the passed-in variable. For example, we can group our data by the genres:
grouped_df = top50.groupby('Genre')
#for key, item in grouped_df:
# print("Genre:", key, "(", len(grouped_df.get_group(key)), "items):", grouped_df.get_group(key), "\n\n")
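Beyond iterating over the groups, groupby() pairs naturally with aggregation. For example, a minimal sketch of the average popularity per genre:
# mean Popularity within each Genre, sorted with the highest first
#top50.groupby('Genre')['Popularity'].mean().sort_values(ascending=False)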
../data/spotify_aux.csv contains the same 50 songs as top50.csv; however, it only contains 3 columns:
- Track Name
- Artist Name
- Explicit Language (boolean valued)
Note, that 3rd column is just random values, but pretend as if it's correct. The point of this section is to demonstrate how to merge columns together.
Let's load ../data/spotify_aux.csv into a DataFrame:
explicit_lyrics = pd.read_csv("../data/spotify_aux.csv")
#explicit_lyrics
Let's merge it with our top50 DataFrame.
.merge() is a Pandas function that stitches together DataFrames by their columns.
.concat() is a Pandas function that stitches together DataFrames by their rows (if you pass axis=1 as a flag, it will stitch by columns instead).
# on= specifies the column used as the shared key
df_combined = pd.merge(explicit_lyrics, top50, on='TrackName')
#df_combined
We see that all columns from both DataFrames have been added. That's nice, but having duplicate ArtistName and TrackName columns is unnecessary. Since merge() takes DataFrames as its inputs, we can simply pass it a stripped-down copy of explicit_lyrics, which prevents merge() from adding any redundant fields.
df_combined = pd.merge(explicit_lyrics[['TrackName', 'ExplicitLanguage']], top50, on='TrackName')
#df_combined
This answers our original Q6:
Q6: Imagine we had another table (i.e., .csv file) below. How could we combine its data with our already-existing dataset?
While we do not exhaustively illustrate Pandas' joining/splitting functionality, you may find the following functions useful:
merge()
concat()
aggregate()
append()
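For instance, a minimal concat() sketch (row-wise stacking by default; pass axis=1 to stitch column-wise):
# stacks the first and last 5 rows back into a single 10-row DataFrame
#pd.concat([top50.head(5), top50.tail(5)])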
Plotting DataFrames¶
As a very simple example of how one can plot elements of a DataFrame, we turn to Pandas' built-in plotting:
scatter_plot = top50.plot.scatter(x='Danceability', y='Popularity', c='DarkBlue')
This shows the lack of a correlation between the Danceability of a song and its popularity, based on just the top 50 songs, of course.
Please feel free to experiment with plotting other items of interest, and we recommend using Seaborn.
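For instance, a minimal Seaborn sketch of the same relationship:
# same scatter as above, with a fitted regression line overlaid
#sns.regplot(x='Danceability', y='Popularity', data=top50)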
Practice Problems with a Partner (or individually if you prefer)¶
P1. Print the shortest song (all features):
# [TODO] TO BE FILLED IN DURING CLASS PARTICIPATION
top50.iloc[top50['Length'].idxmin()]
P2. Print the 5 shortest songs (all features):
# [TODO] TO BE FILLED IN DURING CLASS PARTICIPATION
top50.sort_values("Length", ascending=True)[0:5]
P3. What is the average length of the 5 shortest songs?
# [TODO] TO BE FILLED IN DURING CLASS PARTICIPATION
top50.sort_values("Length", ascending=True)[0:5]['Length'].mean()
P4. Write a function that accepts a DataFrame as input and returns True if there are any null values in the DataFrame; otherwise, it returns False. Pass top50 to the function in order to test it.
# [TODO] TO BE FILLED IN DURING CLASS PARTICIPATION
#print(top50['Genre'].isnull())
def contains_nulls(df):
for col in df.columns:
if np.sum(df[col].isnull()) > 0:
return True
return False
contains_nulls(top50)
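A vectorized one-liner (a sketch) accomplishes the same without the explicit loop:
def contains_nulls_vectorized(df):
    # isnull() yields a boolean DataFrame; .values.any() is True if any entry is null
    return df.isnull().values.any()
contains_nulls_vectorized(top50)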
P5. How many distinct genres are present in the top 50 songs?
# [TODO] TO BE FILLED IN DURING CLASS PARTICIPATION
len(top50['Genre'].unique())
P6. Print the songs that have a Danceability score above 80 and a Popularity above 86. HINT: you can combine conditional statements with the & operator, and each condition must be surrounded with parentheses.
# [TODO] TO BE FILLED IN DURING CLASS PARTICIPATION
top50[(top50['Danceability'] > 80) & (top50['Popularity'] > 86)]
P7. Print the songs that are faster than the average Top 50 song and more popular than the average Top 50 song.
# [TODO] TO BE FILLED IN DURING CLASS PARTICIPATION
avg_speed = top50['BeatsPerMinute'].mean()
avg_popularity = top50['Popularity'].mean()
top50[(top50['BeatsPerMinute'] > avg_speed) & (top50['Popularity'] > avg_popularity)]
P8. Plot a histogram of the Genre counts (x-axis is the Genres, y-axis is the # of songs with that Genre)
# [TODO] TO BE FILLED IN DURING CLASS PARTICIPATION
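One possible solution (a sketch using value_counts() with Pandas' built-in bar plot):
# count the songs per genre, then plot the counts as a bar chart
genre_counts = top50['Genre'].value_counts()
ax = genre_counts.plot.bar()
ax.set_xlabel('Genre')
ax.set_ylabel('# of songs')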
P9. (open ended) Think of a subset of the data that you're interested in. Think of an interesting plot that could be shown to illustrate that data. With a partner, discuss whose would be easier to create. Together, create that plot. Then, try to create the harder plot.
# [TODO] TO BE FILLED IN DURING CLASS PARTICIPATION