CS109A Introduction to Data Science

Lecture 4, Exercise 1: EDA with PANDAS¶

Harvard University
Fall 2020
Instructors: Pavlos Protopapas, Kevin Rader, and Chris Tanner

Title¶

Exercise 1: EDA

Description¶

In this exercise, we get hands-on practice with EDA.

In [1]:

import pandas as pd

NOTE: After running every cell, be sure to auto-grade your work by clicking 'Mark' in the lower-right corner. Otherwise, no credit will be given.¶

EDA with PANDAS¶

top50.csv is a dataset found online (Kaggle.com) that contains information about the 50 most popular songs on Spotify in 2019.

NOTE: This music dataset is used purely for illustrative, educational purposes. The data, including song titles, may include explicit language. Harvard, including myself and the rest of the CS109 staff, does not endorse any of the entailed contents or the songs themselves, and we apologize if it is offensive to anyone in anyway.

Each row represents a distinct song. The columns (in order) are:

ID: a unique ID (i.e., 1-50)
TrackName: Name of the Track
ArtistName: Name of the Artist
Genre: the genre of the track
BeatsPerMinute: The tempo of the song.
Energy: The energy of a song - the higher the value, the more energetic. song
Danceability: The higher the value, the easier it is to dance to this song.
Loudness: The higher the value, the louder the song.
Liveness: The higher the value, the more likely the song is a live recording.
Valence: The higher the value, the more positive mood for the song.
Length: The duration of the song (in seconds).
Acousticness: The higher the value, the more acoustic the song is.
Speechiness: The higher the value, the more spoken words the song contains.
Popularity: The higher the value, the more popular the song is.

Below, fill in the blank to create a new PANDAS DataFrame from the top50.csv file.

In [36]:

### edTest(test_a) ###
df = _____

Run .describe() to display summary statistics.

In [31]:

df.describe()

Q1: Do you notice any suspicious features (aka columns)? Anything that is worth investigating for correctness? Any features of the data seem unclear or foreign to you?

your answer here

Another broad sanity check is to inspect the data types of each column, to ensure it's as expected.

Below, fill in the blank to display the data types of each column. Specifically, when printing var_types to the screen, you should see:

Unnamed: 0 int64 \ TrackName object \ ArtistName object \ Genre object \ BeatsPerMinute int64 \ Energy int64 \ Danceability int64 \ Loudness int64 \ Liveness int64 \ Valence int64 \ Length int64 \ Acousticness int64 \ Speechiness int64 \ Popularity int64

In [32]:

### edTest(test_b) ###
var_types = _____
var_types # displays it to the screen

Hmm, do we have any missing values, though?

Write code that determines if any value within the entire DataFrame is Null. This should be represented by the has_null variable. To be clear, has_null should have a value of True if there exists a Null value within the DataFrame. Otherwise, has_null should be False.

In [35]:

### edTest(test_c) ###
has_null = False # initializes to False

# your code here
# (that checks all cells and potentially updates `has_null`)

# end of your code

After exploring the values within the TrackName, ArtistName, and Genre columns, I noticed something looked peculiar. Specifically, one TrackName value isn't like any of the others. This was truly the case; I did not alter the name of this track, it's just how the dataset was created. Identify this strange TrackName value, either by manual inspection or writing code to assist you. Regardless of your approach, set the variable mystery_track to have that value. For example, if you thought the very first Track, Senorita was the unusual value, then mystery_track should be equal to the String Senorita.

In [54]:

### edTest(test_d) ###
mystery_track = ______ # must be a String data type

While TrackNames and ArtistNames can be anything one desires, Genres have pre-defined categories/names, and there is a finite number of them. So, let's inspect the data to ensure that there is no spurious Genre. One indication that there is an invalid Genre is if only one song is of that Genre. Moreover, if there exists some Genre that is too all-encompassing, then we would expect to see a large percentage of the songs have that Genre. To this effect, use value_counts() to return the # of songs that fall within each Genre. Save this to the variable named genre_counts. Printing genre_counts should yield the following:

dance pop 8 \ pop 7 \ latin 5 \ edm 3 \ canadian hip hop 3 \ country rap 2 \ brostep 2 \ reggaeton flow 2 \ reggaeton 2 \ electropop 2 \ dfw rap 2 \ canadian pop 2 \ panamanian pop 2 \ atl hip hop 1 \ australian pop 1 \ escape room 1 \ boy band 1 \ r&b; en espanol 1 \ trap music 1 \ pop house 1 \ big room 1 \ Name: Genre, dtype: int64

In [55]:

### edTest(test_e) ###
genre_counts = ______

Notice there are several Genres with a count of only 1. Given that our dataset is of only 50 songs, this isn't too surprising. However, I've never heard of some of these Genres. If any of these Genres are also unfamiliar to you, use the Internet to verify that they're actually valid Genres. Below, write what you did to verify such, along with your thoughts on if they are all valid or not.

your answer here

Think of other steps you could take to explore the data, not only to ensure the data's legitimacy but perhaps elements that you're interested in and what you would do to address such. Discuss this with your group and be prepared to discuss this with the rest of the class in the main Zoom room.