Key Word(s): EDA



CS 109A/STAT 121A/AC 209A

Lab 2: Cleaning and EDA of Goodreads

Harvard University
Fall 2017
Instructors: Pavlos Protopapas, Kevin Rader, Rahul Dave, Margo Levine


Table of Contents

  1. Learning Goals
  2. Loading and Cleaning with Pandas
  3. Asking Questions
  4. Parsing and Completing the Dataframe
  5. EDA
  6. Determining the Best Books
  7. Trends in Popularity of Genres

Learning Goals

About 6,000 "best books" were fetched and parsed from Goodreads. The "bestness" of these books came from a proprietary formula used by Goodreads and published as a list on their web site.

We parsed the page for each book and saved data from all these pages in a tabular format as a CSV file. In this lab we'll clean and further parse the data. We'll then do some exploratory data analysis to answer questions about these best books and popular genres.

By the end of this lab, you should be able to:

  • Load and systematically address missing values, encoded as NaN values in our data set (for example, by removing observations associated with these values).
  • Parse columns in the dataframe to create new dataframe columns.
  • Create and interpret visualizations to explore the data set.

This lab corresponds to lectures 2 and 3 and maps onto homework 1 and beyond.

Basic EDA workflow

(From the lecture, repeated here for convenience).

The basic workflow is as follows:

  1. Build a DataFrame from the data (ideally, put all data in this object)
  2. Clean the DataFrame. It should have the following properties:
    • Each row describes a single object
    • Each column describes a property of that object
    • Columns are numeric whenever appropriate
    • Columns contain atomic properties that cannot be further decomposed
  3. Explore global properties. Use histograms, scatter plots, and aggregation functions to summarize the data.
  4. Explore group properties. Use groupby and small multiples to compare subsets of the data.

This process transforms your data into a format that is easier to work with, gives you a basic overview of the data's properties, and likely generates several questions for you to follow up on in subsequent analysis.

Part 1: Loading and Cleaning with Pandas

Read in the goodreads.csv file, examine the data, and do any necessary data cleaning.

Here is a description of the columns (in order) present in this csv file:

rating: the average rating on a 1-5 scale achieved by the book
review_count: the number of Goodreads users who reviewed this book
isbn: the ISBN code for the book
booktype: an internal Goodreads identifier for the book
author_url: the Goodreads (relative) URL for the author of the book
year: the year the book was published
genre_urls: a string of '|'-separated relative URLs of Goodreads genre pages
dir: a directory identifier internal to the scraping code
rating_count: the number of ratings for this book (this is different from the number of reviews)
name: the name of the book

Report all the issues you found with the data and how you resolved them.

[15 minutes]


Load the appropriate libraries

In [2]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)

Cleaning: Reading in the data

We read in and clean the data from goodreads.csv. Try directly loading the data from the file and see what the dataframe looks like. What's the problem with naively loading the data as is? You might want to open the CSV file in Excel or your favorite text editor to see how this dataset is formatted.

In [3]:
#Read the data into a dataframe
df = pd.read_csv("../data/goodreads.csv")

#Examine the first couple of rows of the dataframe
####### 
#   Insert your code
####### 
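A quick look at the first rows makes the problem visible. A sketch:

#Peek at the first few rows
df.head()

Notice that what should be column names look like data values: the CSV file has no header row, so pandas promoted the first book's row to the header.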

Let's read the CSV file with the custom column names specified in the problem statement.

In [ ]:
df=pd.read_csv("../data/goodreads.csv", header=None,
               names=["rating", 'review_count', 'isbn', 'booktype',
                      'author_url', 'year', 'genre_urls', 'dir','rating_count', 'name'])

#Examine the first couple of rows of the dataframe
####### 
#   Insert your code
####### 

Cleaning: Examining the dataframe - quick checks

We should examine the dataframe to get an overall sense of the content.

In [5]:
#Start by checking the column data types
####### 
#   Insert your code
####### 
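One quick way to do this, as a sketch:

#Inspect the data type of each column
df.dtypes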

There are a couple more quick sanity checks to perform on the dataframe.

In [6]:
#Come up with a few other important properties of the dataframe to check
####### 
#   Insert your code
####### 
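A few checks one might run here, as a sketch:

df.shape          #how many rows and columns?
df.columns        #are the column names what we set?
df.describe()     #summary statistics for the numeric columns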

Was the data read correctly, with values represented as we expected?

Cleaning: Examining the dataframe - a deeper look

Beyond checking some quick general properties of the dataframe and looking at the first $n$ rows, we can dig a bit deeper into the values being stored. If you haven't already, check to see if there are any missing values in the dataframe.

In [18]:
#Get a sense of how many missing values there are in the dataframe.
####### 
#   Insert your code
####### 
In [19]:
#Try to locate where the missing values occur
####### 
#   Insert your code
####### 
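A sketch of both checks:

#Count the missing values in each column
df.isnull().sum()

#Locate the rows where, e.g., year is missing (repeat for other columns of interest)
df[df.year.isnull()]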

How does pandas or numpy handle missing values when we try to compute with data sets that include them?

Cleaning: Dealing with Missing Values

How should we interpret 'missing' or 'invalid' values in the data (hint: look at where these values occur)? One approach is to simply exclude them from the dataframe. Is this appropriate for all 'missing' or 'invalid' values? How would you drop these values from the dataframe (hint: is it possible to eliminate just a single entry in your dataframe? Should you eliminate an entire row? Or column?)?

In [1]:
#Treat the missing or invalid values in your dataframe
####### 
#   Insert your code
####### 
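One defensible strategy, as a sketch: a book with no publication year cannot participate in the year-based analyses later in the lab, so drop those rows entirely (dropping a whole column would discard good data for every other book).

#Drop rows whose year is missing
df = df[df.year.notnull()]
df.shape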

OK, so we have done some cleaning. Is it enough?

In [8]:
#Check the column data types again
####### 
#   Insert your code
####### 

Notice the float types have not yet changed. Let's fix those now. If the type conversion fails, we know we have further problems.

In [10]:
#Convert rating_count, review_count and year to int 
#######
#   Insert your code
#######
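A sketch, assuming the rows with missing values were dropped above:

#Cast the numeric columns now that the NaN rows are gone
df.rating_count = df.rating_count.astype(int)
df.review_count = df.review_count.astype(int)
df.year = df.year.astype(int)

If any of these conversions raises an error, some non-numeric values survived the cleaning and need another look.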

Final check

Some of the other columns that should be strings have NaN.

In [ ]:
df.loc[df.genre_urls.isnull(), 'genre_urls']=""
df.loc[df.isbn.isnull(), 'isbn']=""

Part 2: Asking Questions

Think of a few questions we want to ask, then examine the data and decide if the dataframe contains what you need to address these questions.

Example: Which are the highest rated books? To determine this, you'll only need the data in two columns: name and rating. The task will be to sort these two columns by the value in rating. [5 min]
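A sketch of that example question:

#Sort by rating to find the highest rated books
df.sort_values(by='rating', ascending=False)[['name', 'rating']].head(10)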


Part 3: Parsing and Completing the Data Frame

We will need author and genre to proceed! Parse the author column from author_url and the genres column from genre_urls. Keep the genres column as a '|'-separated string.

Hint: Use pandas' map to assign new columns to the dataframe.

[10 minutes]


Examine an example author_url and reason about which sequence of string operations must be performed in order to isolate the author's name.

In [ ]:
#Get the first author_url
test_string = df.author_url[0]
test_string
In [ ]:
#Test out some string operations to isolate the author name
####### 
#   Insert your code
####### 
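For instance, assuming the URL's last path segment looks like '3389.Stephen_King' (verify this against your own test_string), a sketch:

#Keep the last path segment, drop the numeric id, restore spaces
test_string.split('/')[-1].split('.')[-1].replace('_', ' ')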

Now parse out the genres from genre_urls. As with the authors, we'll assign a new genres column to the dataframe.

This is a little more complicated because there may be more than one genre per book.

In [ ]:
#Write a function that accepts an author url and returns the author's name based on your experimentation above
def get_author(url):
    ####### 
    #   Insert your code
    ####### 
    return name

#Apply the get_author function to the 'author_url' column using '.map' 
#and add a new column 'author' to store the names
df['author'] = df.author_url.map(get_author)
df.author[0:5]
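A sketch of the function body, under the same URL-format assumption as above:

def get_author(url):
    #Last path segment, minus the numeric id, with underscores as spaces
    name = url.split('/')[-1].split('.')[-1].replace('_', ' ')
    return name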
In [ ]:
#Examine some examples of genre_urls

#Test out some string operations to isolate the genre name
####### 
#   Insert your code
####### 
In [ ]:
#Write a function that accepts a genre url and returns the genre name based on your experimentation above
def split_and_join_genres(url):
    ####### 
    #   Insert your code
    ####### 
    return genre

df['genres']=df.genre_urls.map(split_and_join_genres)
df.head()
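A sketch, assuming each entry looks like '/genres/young-adult|/genres/fiction' (check a few values of df.genre_urls first):

def split_and_join_genres(url):
    #Split the '|'-separated list, keep the last path segment of each URL
    genre_list = url.strip().split('|')
    genre_list = [g.split('/')[-1] for g in genre_list]
    return "|".join(genre_list)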

Part 4: EDA

Before proceeding any further, get to know the dataset using a few "global property" visualizations; try histograms with both linear and log scales. Do you find anything interesting or strange?

[10 minutes]


Choose some columns to generate histograms.

In [ ]:
#Generate histograms using the format df.YOUR_CHOICE_OF_COLUMN_NAME.hist(bins=YOUR_CHOICE_OF_BIN_SIZE)

plt.xlabel('Label the x-axis appropriately')
plt.ylabel('Label the y-axis appropriately')
plt.title('Title the plot appropriately')

plt.show()
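As a sketch, using review_count (any numeric column works):

df.review_count.hist(bins=50)
plt.xlabel('Review count')
plt.ylabel('Number of books')
plt.title('Distribution of review counts')
plt.show()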

If your histograms appear strange or counter-intuitive, make appropriate adjustments in the data and re-visualize.

In [ ]:
#Generate histograms using the format df.YOUR_CHOICE_OF_COLUMN_NAME.hist(bins=YOUR_CHOICE_OF_BIN_SIZE)

plt.xlabel('Label the x-axis appropriately')
plt.ylabel('Label the y-axis appropriately')
plt.title('Title the plot appropriately')

plt.show()
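For heavily right-skewed quantities such as review counts, a log scale is often more revealing. A sketch:

df.review_count.hist(bins=50, log=True)
plt.xlabel('Review count')
plt.ylabel('Number of books (log scale)')
plt.title('Distribution of review counts, log-scaled y-axis')
plt.show()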

Part 5: Determining the Best Books

This is an example of an analysis of the "grouped property" type.

Think of some reasonable definitions of what it could mean to be a "best book." (After all, these are all the best books according to Goodreads)

[5 minutes]


For example, we can determine the "best book" by year! Determine the best books in each year.

In [ ]:
#Using .groupby, we can divide the dataframe into subsets by the values of 'year'.
#We can then iterate over these subsets
for year, subset in df.groupby('year'):
    #Find the best book of the year
    ####### 
    #   Insert your code
    ####### 
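A sketch, taking "best" to mean highest rated within the year:

for year, subset in df.groupby('year'):
    #Keep the row(s) with the maximum rating in this year's subset
    best = subset[subset.rating == subset.rating.max()]
    print(year, best['name'].values)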

Try this for a few other definitions of the "best book" using .groupby.

Part 6: Trends in Popularity of Genres

This is another example of an analysis of the "grouped property" type.

There are a lot of questions you could ask about genres.

  • Which genre is currently the most popular?
  • Better: based on our data, what conclusions can you draw about the time evolution of the popularity of each genre?

[15 minutes]


First we need to find the distinct genres in the dataframe.

To do this, notice that each string is a pipe ('|')-separated list of genres. For each string, we ask if a given genre is in that pipe-separated list. If it is, we return True; otherwise, False.

Hint: remember that python sets have unique (non-repeating) items.

In [ ]:
#Get the unique genres contained in the dataframe.
####### 
#   Insert your code
####### 
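A sketch using a set, assuming the genres column built in Part 3:

genres = set()
for genre_string in df.genres:
    genres.update(genre_string.split('|'))
genres.discard('')   #books with no genre contribute an empty string
len(genres)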

What happens if we add a column to the dataframe for each genre?

Is this way of representing genre efficient? Does it allow for easy computation and visualization?

Are there other ways to represent genre information in the dataframe that allow for easier visualization?

In [ ]:
#Add a column for each genre
####### 
#   Insert your code
####### 
df.shape
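A sketch that implements the membership test described above, assuming the set of unique genres from the previous step is stored in a variable named genres:

#One boolean indicator column per genre
for genre in genres:
    df['genre:' + genre] = df.genres.map(lambda s: genre in s.split('|'))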

Now explore some ways to visualize the genres represented in the dataframe.

For example, you might ask which is the most represented genre.

In [ ]:
#Explore different ways to visualize information about the genres in the dataframe
####### 
#   Insert your code
####### 
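One possibility, as a sketch, counting how many books carry each genre (assuming the 'genre:' column prefix used above):

genre_cols = [c for c in df.columns if c.startswith('genre:')]
genre_counts = df[genre_cols].sum().sort_values(ascending=False)
genre_counts.head(10).plot(kind='bar')
plt.ylabel('Number of books')
plt.title('Most represented genres')
plt.show()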

Part 6.1: What can you conclude from the above visualizations?

Pick two or three genres and describe how the popularity of these genres fluctuates with time.
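A sketch of one way to see this, plotting the fraction of each year's books that carry a genre (the genre names below are illustrative; substitute ones present in your data):

for genre in ['genre:fantasy', 'genre:romance']:   #hypothetical choices
    if genre in df.columns:
        df.groupby('year')[genre].mean().plot(label=genre)
plt.xlabel('Year')
plt.ylabel('Fraction of books in genre')
plt.legend()
plt.show()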

There is a category called Paranormal Romance!!!!!