CS109A Introduction to Data Science
Lecture 4, Exercise 1: EDA with PANDAS¶
Harvard University
Fall 2020
Instructors: Pavlos Protopapas, Kevin Rader, and Chris Tanner
import pandas as pd
NOTE: After running every cell, be sure to auto-grade your work by clicking 'Mark' in the lower-right corner. Otherwise, no credit will be given.¶
EDA with PANDAS¶
top50.csv
is a dataset found online (Kaggle.com) that contains information about the 50 most popular songs on Spotify in 2019.
NOTE: This music dataset is used purely for illustrative, educational purposes. The data, including song titles, may include explicit language. Harvard, including myself and the rest of the CS109 staff, does not endorse any of the entailed contents or the songs themselves, and we apologize if it is offensive to anyone in anyway.
Each row represents a distinct song. The columns (in order) are:
ID: a unique ID (i.e., 1-50)
TrackName: Name of the Track
ArtistName: Name of the Artist
Genre: the genre of the track
BeatsPerMinute: The tempo of the song.
Energy: The energy of a song - the higher the value, the more energetic. song
Danceability: The higher the value, the easier it is to dance to this song.
Loudness: The higher the value, the louder the song.
Liveness: The higher the value, the more likely the song is a live recording.
Valence: The higher the value, the more positive mood for the song.
Length: The duration of the song (in seconds).
Acousticness: The higher the value, the more acoustic the song is.
Speechiness: The higher the value, the more spoken words the song contains.
Popularity: The higher the value, the more popular the song is.
Below, fill in the blank to create a new PANDAS DataFrame from the top50.csv
file.
### edTest(test_a) ###
df = _____
Run .describe()
to display summary statistics.
df.describe()
Q1: Do you notice any suspicious features (aka columns)? Anything that is worth investigating for correctness? Any features of the data seem unclear or foreign to you?
your answer here
Another broad sanity check is to inspect the data types of each column, to ensure it's as expected.
Below, fill in the blank to display the data types of each column. Specifically, when printing var_types
to the screen, you should see:
Unnamed: 0 int64 \ TrackName object \ ArtistName object \ Genre object \ BeatsPerMinute int64 \ Energy int64 \ Danceability int64 \ Loudness int64 \ Liveness int64 \ Valence int64 \ Length int64 \ Acousticness int64 \ Speechiness int64 \ Popularity int64
### edTest(test_b) ###
var_types = _____
var_types # displays it to the screen
Hmm, do we have any missing values, though?
Write code that determines if any value within the entire DataFrame is Null. This should be represented by the has_null
variable. To be clear, has_null
should have a value of True if there exists a Null value within the DataFrame. Otherwise, has_null
should be False.
### edTest(test_c) ###
has_null = False # initializes to False
# your code here
# (that checks all cells and potentially updates `has_null`)
# end of your code
After exploring the values within the TrackName
, ArtistName
, and Genre
columns, I noticed something looked peculiar. Specifically, one TrackName
value isn't like any of the others. This was truly the case; I did not alter the name of this track, it's just how the dataset was created. Identify this strange TrackName
value, either by manual inspection or writing code to assist you. Regardless of your approach, set the variable mystery_track
to have that value. For example, if you thought the very first Track, Senorita
was the unusual value, then mystery_track
should be equal to the String Senorita
.
### edTest(test_d) ###
mystery_track = ______ # must be a String data type
While TrackNames and ArtistNames can be anything one desires, Genres have pre-defined categories/names, and there is a finite number of them. So, let's inspect the data to ensure that there is no spurious Genre. One indication that there is an invalid Genre is if only one song is of that Genre. Moreover, if there exists some Genre that is too all-encompassing, then we would expect to see a large percentage of the songs have that Genre. To this effect, use value_counts()
to return the # of songs that fall within each Genre. Save this to the variable named genre_counts
. Printing genre_counts
should yield the following:
dance pop 8 \ pop 7 \ latin 5 \ edm 3 \ canadian hip hop 3 \ country rap 2 \ brostep 2 \ reggaeton flow 2 \ reggaeton 2 \ electropop 2 \ dfw rap 2 \ canadian pop 2 \ panamanian pop 2 \ atl hip hop 1 \ australian pop 1 \ escape room 1 \ boy band 1 \ r&b; en espanol 1 \ trap music 1 \ pop house 1 \ big room 1 \ Name: Genre, dtype: int64
### edTest(test_e) ###
genre_counts = ______
Notice there are several Genres with a count of only 1. Given that our dataset is of only 50 songs, this isn't too surprising. However, I've never heard of some of these Genres. If any of these Genres are also unfamiliar to you, use the Internet to verify that they're actually valid Genres. Below, write what you did to verify such, along with your thoughts on if they are all valid or not.
your answer here
Think of other steps you could take to explore the data, not only to ensure the data's legitimacy but perhaps elements that you're interested in and what you would do to address such. Discuss this with your group and be prepared to discuss this with the rest of the class in the main Zoom room.