Key Word(s): Pandas



CS109A Introduction to Data Science

Lab 2: Pandas and Web Scraping with Beautiful Soup

Harvard University
Fall 2019
Instructors: Pavlos Protopapas, Kevin Rader, and Chris Tanner
Lab Instructors: Chris Tanner and Eleni Kaxiras
Authors: Rahul Dave, David Sondak, Will Claybaugh, Pavlos Protopapas, Chris Tanner, Eleni Kaxiras


In [1]:
## RUN THIS CELL TO GET THE RIGHT FORMATTING 
from IPython.core.display import HTML
def css_styling():
    styles = open("../../styles/cs109.css", "r").read()
    return HTML(styles)
css_styling()
Out[1]:

Table of Contents

  1. Learning Goals
  2. Loading and Cleaning with Pandas
  3. Parsing and Completing the Dataframe
  4. Grouping

Learning Goals

About 6,000 odd "best books" were fetched and parsed from Goodreads. The "bestness" of these books came from a proprietary formula used by Goodreads and published as a list on their web site.

We parsed the page for each book and saved data from all these pages in a tabular format as a CSV file. In this lab we'll clean and further parse the data. We'll then do some exploratory data analysis to answer questions about these best books and popular genres.

By the end of this lab, you should be able to:

  • Load and systematically address missing values, ancoded as NaN values in our data set, for example, by removing observations associated with these values.
  • Parse columns in the dataframe to create new dataframe columns.
  • Use groupby to aggregate data on a particular feature column, such as author.

This lab corresponds to lectures #1, #2, and #3 and maps on to homework #1 and further.

Basic EDA workflow

(From the lecture, repeated here for convenience).

The basic workflow is as follows:

  1. Build a DataFrame from the data (ideally, put all data in this object)
  2. Clean the DataFrame. It should have the following properties:
    • Each row describes a single object
    • Each column describes a property of that object
    • Columns are numeric whenever appropriate
    • Columns contain atomic properties that cannot be further decomposed
  3. Explore global properties. Use histograms, scatter plots, and aggregation functions to summarize the data.
  4. Explore group properties. Use groupby and small multiples to compare subsets of the data.

This process transforms your data into a format which is easier to work with, gives you a basic overview of the data's properties, and likely generates several questions for you to followup in subsequent analysis.

Part 1: Loading and Cleaning with Pandas

Read in the goodreads.csv file, examine the data, and do any necessary data cleaning.

Here is a description of the columns (in order) present in this csv file:

rating: the average rating on a 1-5 scale achieved by the book
review_count: the number of Goodreads users who reviewed this book
isbn: the ISBN code for the book
booktype: an internal Goodreads identifier for the book
author_url: the Goodreads (relative) URL for the author of the book
year: the year the book was published
genre_urls: a string with '|' separated relative URLS of Goodreads genre pages
dir: a directory identifier internal to the scraping code
rating_count: the number of ratings for this book (this is different from the number of reviews)
name: the name of the book

Let us see what issues we find with the data and resolve them.


After loading appropriate libraries

In [2]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)

Cleaning: Reading in the data

We read in and clean the data from goodreads.csv.

In [3]:
#Read the data into a dataframe
df = pd.read_csv("data/goodreads.csv", encoding='utf-8')

#Examine the first few rows of the dataframe
df
Out[3]:
4.40 136455 0439023483 good_reads:book https://www.goodreads.com/author/show/153394.Suzanne_Collins 2008 /genres/young-adult|/genres/science-fiction|/genres/dystopia|/genres/fantasy|/genres/science-fiction|/genres/romance|/genres/adventure|/genres/book-club|/genres/young-adult|/genres/teen|/genres/apocalyptic|/genres/post-apocalyptic|/genres/action dir01/2767052-the-hunger-games.html 2958974 The Hunger Games (The Hunger Games, #1)
0 4.41 16648 0439358078 good_reads:book https://www.goodreads.com/author/show/1077326.... 2003.0 /genres/fantasy|/genres/young-adult|/genres/fi... dir01/2.Harry_Potter_and_the_Order_of_the_Phoe... 1284478 Harry Potter and the Order of the Phoenix (Har...
1 3.56 85746 0316015849 good_reads:book https://www.goodreads.com/author/show/941441.S... 2005.0 /genres/young-adult|/genres/fantasy|/genres/ro... dir01/41865.Twilight.html 2579564 Twilight (Twilight, #1)
2 4.23 47906 0061120081 good_reads:book https://www.goodreads.com/author/show/1825.Har... 1960.0 /genres/classics|/genres/fiction|/genres/histo... dir01/2657.To_Kill_a_Mockingbird.html 2078123 To Kill a Mockingbird
3 4.23 34772 0679783261 good_reads:book https://www.goodreads.com/author/show/1265.Jan... 1813.0 /genres/classics|/genres/fiction|/genres/roman... dir01/1885.Pride_and_Prejudice.html 1388992 Pride and Prejudice
4 4.25 12363 0446675539 good_reads:book https://www.goodreads.com/author/show/11081.Ma... 1936.0 /genres/classics|/genres/historical-fiction|/g... dir01/18405.Gone_with_the_Wind.html 645470 Gone with the Wind
5 4.22 7205 0066238501 good_reads:book https://www.goodreads.com/author/show/1069006.... 1949.0 /genres/classics|/genres/young-adult|/genres/c... dir01/11127.The_Chronicles_of_Narnia.html 286677 The Chronicles of Narnia (Chronicles of Narnia...
6 4.38 10902 0060256656 good_reads:book https://www.goodreads.com/author/show/435477.S... 1964.0 /genres/childrens|/genres/young-adult|/genres/... dir01/370493.The_Giving_Tree.html 502891 The Giving Tree
7 3.79 20670 0452284244 good_reads:book https://www.goodreads.com/author/show/3706.Geo... 1945.0 /genres/classics|/genres/fiction|/genres/scien... dir01/7613.Animal_Farm.html 1364879 Animal Farm
8 4.18 12302 0345391802 good_reads:book https://www.goodreads.com/author/show/4.Dougla... 1979.0 /genres/science-fiction|/genres/humor|/genres/... dir01/11.The_Hitchhiker_s_Guide_to_the_Galaxy.... 724713 The Hitchhiker's Guide to the Galaxy (Hitchhik...
9 4.03 20937 0739326228 good_reads:book https://www.goodreads.com/author/show/614.Arth... 1997.0 /genres/fiction|/genres/historical-fiction|/ge... dir01/930.Memoirs_of_a_Geisha.html 1042679 Memoirs of a Geisha
10 3.72 34959 0307277674 good_reads:book https://www.goodreads.com/author/show/630.Dan_... 2003.0 /genres/mystery|/genres/thriller|/genres/suspe... dir01/968.The_Da_Vinci_Code.html 1220657 The Da Vinci Code (Robert Langdon, #2)
11 4.36 69524 0375831002 good_reads:book https://www.goodreads.com/author/show/11466.Ma... 2005.0 /genres/historical-fiction|/genres/young-adult... dir01/19063.The_Book_Thief.html 675431 The Book Thief
12 4.05 5516 0451527747 good_reads:book https://www.goodreads.com/author/show/8164.Lew... 1865.0 /genres/classics|/genres/childrens|/genres/you... dir01/24213.Alice_s_Adventures_in_Wonderland_T... 301702 Alice's Adventures in Wonderland & Through the...
13 3.72 10156 0743477111 good_reads:book https://www.goodreads.com/author/show/947.Will... 1597.0 /genres/classics|/genres/plays|/genres/fiction... dir01/18135.Romeo_and_Juliet.html 1211146 Romeo and Juliet
14 4.09 10082 0451525264 good_reads:book https://www.goodreads.com/author/show/13661.Vi... 1862.0 /genres/classics|/genres/historical-fiction|/g... dir01/24280.Les_Mis_rables.html 418004 Les Misérables
15 3.92 38061 NaN good_reads:book https://www.goodreads.com/author/show/498072.A... 2003.0 /genres/fiction|/genres/romance|/genres/fantas... dir01/18619684-the-time-traveler-s-wife.html 927254 The Time Traveler's Wife
16 4.58 1314 0345538374 good_reads:book https://www.goodreads.com/author/show/656983.J... 1973.0 /genres/fantasy|/genres/classics|/genres/scien... dir01/30.J_R_R_Tolkien_4_Book_Boxed_Set.html 68495 J.R.R. Tolkien 4-Book Boxed Set
17 3.60 18039 0140283331 good_reads:book https://www.goodreads.com/author/show/306.Will... 1954.0 /genres/classics|/genres/academic|/genres/scho... dir01/7624.Lord_of_the_Flies.html 1232126 Lord of the Flies
18 4.28 30815 0812550706 good_reads:book https://www.goodreads.com/author/show/589.Orso... 1985.0 /genres/science-fiction|/genres/young-adult|/g... dir01/375802.Ender_s_Game.html 624730 Ender's Game (The Ender Quintet, #1)
19 4.02 11942 0375751513 good_reads:book https://www.goodreads.com/author/show/3565.Osc... 1890.0 /genres/classics|/genres/fiction|/genres/horro... dir01/5297.The_Picture_of_Dorian_Gray.html 409478 The Picture of Dorian Gray
20 4.14 8681 0143058142 good_reads:book https://www.goodreads.com/author/show/3137322.... 1866.0 /genres/classics|/genres/cultural|/genres/russ... dir01/7144.Crime_and_Punishment.html 294297 Crime and Punishment
21 4.11 8897 0064410935 good_reads:book https://www.goodreads.com/author/show/988142.E... 1952.0 /genres/childrens|/genres/fiction|/genres/clas... dir01/24178.Charlotte_s_Web.html 662707 Charlotte's Web
22 4.20 8678 0451528824 good_reads:book https://www.goodreads.com/author/show/5350.L_M... 1908.0 /genres/fiction|/genres/young-adult|/genres/cl... dir01/8127.Anne_of_Green_Gables.html 393594 Anne of Green Gables (Anne of Green Gables, #1)
23 3.75 36955 0061122416 good_reads:book https://www.goodreads.com/author/show/566.Paul... 1988.0 /genres/fiction|/genres/classics|/genres/fanta... dir01/865.The_Alchemist.html 876518 The Alchemist
24 3.94 18581 0007491565 good_reads:book https://www.goodreads.com/author/show/1630.Ray... 1953.0 /genres/classics|/genres/fiction|/genres/scien... dir01/17470674-fahrenheit-451.html 783133 Fahrenheit 451
25 4.43 112279 0525478817 good_reads:book https://www.goodreads.com/author/show/1406384.... 2012.0 /genres/young-adult|/genres/book-club|/genres/... dir01/11870085-the-fault-in-our-stars.html 1150626 The Fault in Our Stars
26 3.79 15833 0142000671 good_reads:book https://www.goodreads.com/author/show/585.John... 1937.0 /genres/fiction|/genres/classics|/genres/acade... dir01/890.Of_Mice_and_Men.html 1070755 Of Mice and Men
27 4.04 13214 0440498058 good_reads:book https://www.goodreads.com/author/show/106.Made... 1962.0 /genres/fantasy|/genres/young-adult|/genres/cl... dir01/18131.A_Wrinkle_in_Time.html 420001 A Wrinkle in Time (A Wrinkle in Time Quintet, #1)
28 3.94 11736 0393970124 good_reads:book https://www.goodreads.com/author/show/6988.Bra... 1897.0 /genres/classics|/genres/horror|/genres/fictio... dir01/17245.Dracula.html 429079 Dracula
29 4.24 10614 0345418263 good_reads:book https://www.goodreads.com/author/show/12521.Wi... 1973.0 /genres/fantasy|/genres/classics|/genres/roman... dir01/21787.The_Princess_Bride.html 457219 The Princess Bride
... ... ... ... ... ... ... ... ... ... ...
5969 3.97 182 0399151311 good_reads:book https://www.goodreads.com/author/show/33987.La... 1985.0 /genres/romance|/genres/romance|/genres/contem... dir60/572626.Separate_Beds.html 3544 Separate Beds
5970 4.24 72 0413748308 good_reads:book https://www.goodreads.com/author/show/29185.Sa... 2000.0 /genres/plays|/genres/drama|/genres/plays|/gen... dir60/146548.4_48_Psychosis.html 1016 4.48 Psychosis
5971 4.19 1670 NaN good_reads:book https://www.goodreads.com/author/show/4586597.... 2011.0 /genres/romance|/genres/romance|/genres/contem... dir60/12351649-perfection.html 35197 Perfection (Neighbor from Hell, #2)
5972 4.17 789 1401324290 good_reads:book https://www.goodreads.com/author/show/4627059.... 2011.0 /genres/biography|/genres/animals|/genres/auto... dir60/10393675-until-tuesday.html 4685 Until Tuesday
5973 3.99 2944 0425267040 good_reads:book https://www.goodreads.com/author/show/24978.Ma... 2013.0 /genres/romance|/genres/adult-fiction|/genres/... dir60/16033902-rush.html 41287 Rush (Breathless, #1)
5974 4.07 10585 0061950726 good_reads:book https://www.goodreads.com/author/show/157146.C... 2013.0 /genres/historical-fiction|/genres/book-club|/... dir60/15818107-orphan-train.html 76606 Orphan Train
5975 4.23 1185 NaN good_reads:book https://www.goodreads.com/author/show/5160667.... 2014.0 /genres/romance|/genres/science-fiction|/genre... dir60/20504754-transcendence.html 4942 Transcendence
5976 4.03 218 NaN good_reads:book https://www.goodreads.com/author/show/5769580.... 1987.0 /genres/fiction|/genres/novels|/genres/literat... dir60/5948927.html 1607 التيه
5977 3.99 27 1853408360 good_reads:book https://www.goodreads.com/author/show/851161.K... 2005.0 /genres/young-adult|/genres/romance|/genres/co... dir60/2274992.Tessa_in_Love.html 294 Tessa in Love
5978 2.77 800 0060988649 good_reads:book https://www.goodreads.com/author/show/7025.Gre... 2001.0 /genres/fantasy|/genres/fiction|/genres/myster... dir60/24929.Lost.html 11128 Lost
5979 3.84 165 0571207995 good_reads:book https://www.goodreads.com/author/show/16865.Ti... 1977.0 /genres/fiction|/genres/cultural|/genres/canad... dir60/29898.The_Wars.html 4160 The Wars
5980 3.36 1693 0312424442 good_reads:book https://www.goodreads.com/author/show/3083854.... 2003.0 /genres/fiction|/genres/novels|/genres/contemp... dir60/231.I_am_Charlotte_Simmons.html 17743 I am Charlotte Simmons
5981 4.09 362 1407103946 good_reads:book https://www.goodreads.com/author/show/81096.Ch... 2009.0 /genres/fantasy|/genres/horror|/genres/young-a... dir60/6364017-malice.html 2013 Malice (Malice, #1)
5982 4.23 137 1582430438 good_reads:book https://www.goodreads.com/author/show/8567.Wen... 1974.0 /genres/fiction|/genres/novels|/genres/literat... dir60/227274.The_Memory_of_Old_Jack.html 1085 The Memory of Old Jack
5983 4.02 531 0575085150 good_reads:book https://www.goodreads.com/author/show/81096.Ch... 2009.0 /genres/science-fiction|/genres/steampunk|/gen... dir60/6285903-retribution-falls.html 3878 Retribution Falls (Tales of the Ketty Jay, #1)
5984 3.61 109 1401360106 good_reads:book https://www.goodreads.com/author/show/183537.K... 2005.0 /genres/fiction|/genres/young-adult|/genres/bo... dir60/319403.Pigtopia.html 529 Pigtopia
5985 4.06 954 1606840584 good_reads:book https://www.goodreads.com/author/show/2891503.... 2010.0 /genres/young-adult|/genres/fantasy|/genres/pa... dir60/7831742-the-lost-saint.html 12690 The Lost Saint (The Dark Divine, #2)
5986 4.26 477 0517548233 good_reads:book https://www.goodreads.com/author/show/2062.Hen... 1946.0 /genres/economics|/genres/non-fiction|/genres/... dir60/3028.Economics_in_One_Lesson.html 5767 Economics in One Lesson
5987 4.34 93 0575070706 good_reads:book https://www.goodreads.com/author/show/58.Frank... 1977.0 /genres/science-fiction|/genres/fantasy|/genre... dir60/53764.The_Great_Dune_Trilogy.html 41378 The Great Dune Trilogy
5988 3.36 192 842534607X good_reads:book https://www.goodreads.com/author/show/3493970.... 2011.0 /genres/european-literature|/genres/spanish-li... dir60/10832326-si-t-me-dices-ven-lo-dejo-todo-... 1914 Si tú me dices ven lo dejo todo... pero dime ven
5989 4.12 1150 0140143459 good_reads:book https://www.goodreads.com/author/show/776.Mich... 1989.0 /genres/non-fiction|/genres/economics|/genres/... dir60/1171.Liar_s_Poker.html 32637 Liar's Poker
5990 4.20 650 NaN good_reads:book https://www.goodreads.com/author/show/1112683._ 2009.0 /genres/novels|/genres/fiction|/genres/religio... dir60/6976667.html 2899 ألواح ودسر
5991 3.89 132 1400303400 good_reads:book https://www.goodreads.com/author/show/5544.Fra... 2002.0 /genres/christian-fiction|/genres/christian|/g... dir60/65686.Nightmare_Academy.html 3531 Nightmare Academy (Veritas Project, #2)
5992 4.09 1256 0345515501 good_reads:book https://www.goodreads.com/author/show/18149.Te... 2011.0 /genres/mystery|/genres/mystery|/genres/crime|... dir60/9578677-the-silent-girl.html 16312 The Silent Girl (Rizzoli & Isles, #9)
5993 4.37 28 0393062260 good_reads:book https://www.goodreads.com/author/show/62157.Ro... 2007.0 /genres/poetry|/genres/religion|/genres/christ... dir60/1251125.The_Book_of_Psalms.html 242 The Book of Psalms
5994 4.17 2226 0767913736 good_reads:book https://www.goodreads.com/author/show/44565.Ca... 2005.0 /genres/history|/genres/non-fiction|/genres/bi... dir60/78508.The_River_of_Doubt.html 16618 The River of Doubt
5995 3.99 775 1416909427 good_reads:book https://www.goodreads.com/author/show/151371.J... 2006.0 /genres/young-adult|/genres/realistic-fiction|... dir60/259068.Shug.html 6179 Shug
5996 3.78 540 1620612321 good_reads:book https://www.goodreads.com/author/show/5761314.... 2012.0 /genres/contemporary|/genres/romance|/genres/y... dir60/13503247-flawed.html 2971 Flawed
5997 3.91 281 NaN good_reads:book https://www.goodreads.com/author/show/1201952.... 2006.0 /genres/religion|/genres/islam|/genres/religio... dir60/2750008.html 3083 أسعد امرأة في العالÙ
5998 4.35 61 0786929081 good_reads:book https://www.goodreads.com/author/show/1023510.... 2001.0 /genres/fiction|/genres/fantasy|/genres/magic|... dir60/66677.Legacy_of_the_Drow_Collector_s_Edi... 3982 Legacy of the Drow Collector's Edition (Legacy...

5999 rows × 10 columns

Oh dear. That does not quite seem to be right. We are missing the column names. We need to add these in! But what are they?

Here is a list of them in order:

["rating", 'review_count', 'isbn', 'booktype','author_url', 'year', 'genre_urls', 'dir','rating_count', 'name']

Exercise
Use these to load the dataframe properly! And then "head" the dataframe... (you will need to look at the read_csv docs)

In [4]:
# your code here
df=pd.read_csv("data/goodreads.csv", header=None,
               names=["rating", 'review_count', 'isbn', 'booktype','author_url', 'year', 'genre_urls', 'dir','rating_count', 'name'],
)

#Examine the first few rows of the dataframe
df.head()
Out[4]:
rating review_count isbn booktype author_url year genre_urls dir rating_count name
0 4.40 136455 0439023483 good_reads:book https://www.goodreads.com/author/show/153394.S... 2008.0 /genres/young-adult|/genres/science-fiction|/g... dir01/2767052-the-hunger-games.html 2958974 The Hunger Games (The Hunger Games, #1)
1 4.41 16648 0439358078 good_reads:book https://www.goodreads.com/author/show/1077326.... 2003.0 /genres/fantasy|/genres/young-adult|/genres/fi... dir01/2.Harry_Potter_and_the_Order_of_the_Phoe... 1284478 Harry Potter and the Order of the Phoenix (Har...
2 3.56 85746 0316015849 good_reads:book https://www.goodreads.com/author/show/941441.S... 2005.0 /genres/young-adult|/genres/fantasy|/genres/ro... dir01/41865.Twilight.html 2579564 Twilight (Twilight, #1)
3 4.23 47906 0061120081 good_reads:book https://www.goodreads.com/author/show/1825.Har... 1960.0 /genres/classics|/genres/fiction|/genres/histo... dir01/2657.To_Kill_a_Mockingbird.html 2078123 To Kill a Mockingbird
4 4.23 34772 0679783261 good_reads:book https://www.goodreads.com/author/show/1265.Jan... 1813.0 /genres/classics|/genres/fiction|/genres/roman... dir01/1885.Pride_and_Prejudice.html 1388992 Pride and Prejudice

Cleaning: Examing the dataframe - quick checks

We should examine the dataframe to get a overall sense of the content.

Exercise
Lets check the types of the columns. What do you find?

In [5]:
# your code here
####### 
df.dtypes
####### 
Out[5]:
rating          float64
review_count     object
isbn             object
booktype         object
author_url       object
year            float64
genre_urls       object
dir              object
rating_count     object
name             object
dtype: object

your answer here

Notice that review_count and rating_counts are objects instead of ints, and the year is a float!

There are a couple more quick sanity checks to perform on the dataframe.

In [6]:
print(df.shape)
df.columns
(6000, 10)
Out[6]:
Index(['rating', 'review_count', 'isbn', 'booktype', 'author_url', 'year', 'genre_urls', 'dir', 'rating_count', 'name'], dtype='object')

Cleaning: Examining the dataframe - a deeper look

Beyond performing checking some quick general properties of the data frame and looking at the first $n$ rows, we can dig a bit deeper into the values being stored. If you haven't already, check to see if there are any missing values in the data frame.

Let's see for a column which seemed OK to us.

In [7]:
#Get a sense of how many missing values there are in the dataframe.
print(np.sum([df.rating.isnull()]))
print(np.sum([df.review_count.isnull()]))
print(np.sum([df.isbn.isnull()]))
print(np.sum([df.booktype.isnull()]))
print(np.sum([df.author_url.isnull()]))
print(np.sum([df.year.isnull()]))
print(np.sum([df.genre_urls.isnull()]))
print(np.sum([df.dir.isnull()]))
print(np.sum([df.rating_count.isnull()]))
print(np.sum([df.name.isnull()]))
2
0
475
0
0
7
62
0
0
0
In [8]:
#Try to locate where the missing values occur
df[df.rating.isnull()]
Out[8]:
rating review_count isbn booktype author_url year genre_urls dir rating_count name
3643 NaN None None None None NaN NaN dir37/9658936-harry-potter.html None None
5282 NaN None None None None NaN NaN dir53/113138.The_Winner.html None None

How does pandas or numpy handle missing values when we try to compute with data sets that include them?

We'll now check if any of the other suspicious columns have missing values. Let's look at year and review_count first.

One thing you can do is to try and convert to the type you expect the column to be. If something goes wrong, it likely means your data are bad.

Lets test for missing data:

In [9]:
df[df.year.isnull()]

df.year.isnull()
df.shape
Out[9]:
(6000, 10)

Cleaning: Dealing with Missing Values

How should we interpret 'missing' or 'invalid' values in the data (hint: look at where these values occur)? One approach is to simply exclude them from the dataframe. Is this appropriate for all 'missing' or 'invalid' values?

In [10]:
#Treat the missing or invalid values in your dataframe
####### 

df = df[df.year.notnull()]

Ok so we have done some cleaning. What do things look like now? Notice the float has not yet changed.

In [11]:
df.dtypes
Out[11]:
rating          float64
review_count     object
isbn             object
booktype         object
author_url       object
year            float64
genre_urls       object
dir              object
rating_count     object
name             object
dtype: object
In [12]:
print(np.sum(df.year.isnull()))
df.shape # We removed seven rows
0
Out[12]:
(5993, 10)
Exercise

Ok so lets fix those types. Convert them to ints. If the type conversion fails, we now know we have further problems.

In [13]:
# your code here
df.rating_count=df.rating_count.astype(int)
df.review_count=df.review_count.astype(int)
df.year=df.year.astype(int)

Once you do this, we seem to be good on these columns (no errors in conversion). Lets look:

In [14]:
df.dtypes
Out[14]:
rating          float64
review_count      int64
isbn             object
booktype         object
author_url       object
year              int64
genre_urls       object
dir              object
rating_count      int64
name             object
dtype: object

Sweet!

Some of the other colums that should be strings have NaN.

In [15]:
df.loc[df.genre_urls.isnull(), 'genre_urls']=""
df.loc[df.isbn.isnull(), 'isbn']=""

Part 2: Parsing and Completing the Data Frame

We will parse the author column from the author_url and genres column from the genre_urls. Keep the genres column as a string separated by '|'.

We will use panda's map to assign new columns to the dataframe.

Examine an example author_url and reason about which sequence of string operations must be performed in order to isolate the author's name.

In [16]:
#Get the first author_url
test_string = df.author_url[0]
test_string
Out[16]:
'https://www.goodreads.com/author/show/153394.Suzanne_Collins'
In [17]:
#Test out some string operations to isolate the author name

test_string.split('/')[-1].split('.')[1:][0]
Out[17]:
'Suzanne_Collins'
Exercise

Lets wrap the above code into a function which we will then use

In [18]:
# Write a function that accepts an author url and returns the author's name based on your experimentation above
def get_author(url):
    # your code here
    name = url.split('/')[-1].split('.')[1:][0]
    ####### 
    return name
In [19]:
#Apply the get_author function to the 'author_url' column using '.map' 
#and add a new column 'author' to store the names
df['author'] = df.author_url.map(get_author)
df.author[0:5]
Out[19]:
0    Suzanne_Collins
1        J_K_Rowling
2    Stephenie_Meyer
3         Harper_Lee
4        Jane_Austen
Name: author, dtype: object
Exercise

Now parse out the genres from genre_url.

This is a little more complicated because there be more than one genre.

In [20]:
df.genre_urls.head()
Out[20]:
0    /genres/young-adult|/genres/science-fiction|/g...
1    /genres/fantasy|/genres/young-adult|/genres/fi...
2    /genres/young-adult|/genres/fantasy|/genres/ro...
3    /genres/classics|/genres/fiction|/genres/histo...
4    /genres/classics|/genres/fiction|/genres/roman...
Name: genre_urls, dtype: object
In [21]:
# your code here
#Examine some examples of genre_urls

#Test out some string operations to isolate the genre name
test_genre_string=df.genre_urls[0]
genres=test_genre_string.strip().split('|')
for e in genres:
    print(e.split('/')[-1])
    "|".join(genres)
young-adult
science-fiction
dystopia
fantasy
science-fiction
romance
adventure
book-club
young-adult
teen
apocalyptic
post-apocalyptic
action
Exercise

Write a function that accepts a genre url and returns the genre name based on your experimentation above

In [22]:
def split_and_join_genres(url):
    # your code here
    genres=url.strip().split('|')
    genres=[e.split('/')[-1] for e in genres]
    return "|".join(genres)

Test your function

In [23]:
split_and_join_genres("/genres/young-adult|/genres/science-fiction")
Out[23]:
'young-adult|science-fiction'
In [24]:
split_and_join_genres("")
Out[24]:
''
Exercise

Use map again to create a new "genres" column

In [25]:
df['genres']=df.genre_urls.map(split_and_join_genres)
df.head()
Out[25]:
rating review_count isbn booktype author_url year genre_urls dir rating_count name author genres
0 4.40 136455 0439023483 good_reads:book https://www.goodreads.com/author/show/153394.S... 2008 /genres/young-adult|/genres/science-fiction|/g... dir01/2767052-the-hunger-games.html 2958974 The Hunger Games (The Hunger Games, #1) Suzanne_Collins young-adult|science-fiction|dystopia|fantasy|s...
1 4.41 16648 0439358078 good_reads:book https://www.goodreads.com/author/show/1077326.... 2003 /genres/fantasy|/genres/young-adult|/genres/fi... dir01/2.Harry_Potter_and_the_Order_of_the_Phoe... 1284478 Harry Potter and the Order of the Phoenix (Har... J_K_Rowling fantasy|young-adult|fiction|fantasy|magic|chil...
2 3.56 85746 0316015849 good_reads:book https://www.goodreads.com/author/show/941441.S... 2005 /genres/young-adult|/genres/fantasy|/genres/ro... dir01/41865.Twilight.html 2579564 Twilight (Twilight, #1) Stephenie_Meyer young-adult|fantasy|romance|paranormal|vampire...
3 4.23 47906 0061120081 good_reads:book https://www.goodreads.com/author/show/1825.Har... 1960 /genres/classics|/genres/fiction|/genres/histo... dir01/2657.To_Kill_a_Mockingbird.html 2078123 To Kill a Mockingbird Harper_Lee classics|fiction|historical-fiction|academic|s...
4 4.23 34772 0679783261 good_reads:book https://www.goodreads.com/author/show/1265.Jan... 1813 /genres/classics|/genres/fiction|/genres/roman... dir01/1885.Pride_and_Prejudice.html 1388992 Pride and Prejudice Jane_Austen classics|fiction|romance|historical-fiction|li...

Finally, let's pick an author at random so we can see the results of the transformations. Scroll to see the author and genre columns that we added to the dataframe.

In [26]:
df[df.author == "Marguerite_Yourcenar"]
Out[26]:
rating review_count isbn booktype author_url year genre_urls dir rating_count name author genres
1014 4.23 483 0374529264 good_reads:book https://www.goodreads.com/author/show/7732.Mar... 1951 /genres/historical-fiction|/genres/fiction|/ge... dir11/12172.Memoirs_of_Hadrian.html 6258 Memoirs of Hadrian Marguerite_Yourcenar historical-fiction|fiction|cultural|france|cla...
5620 4.11 74 2070367983 good_reads:book https://www.goodreads.com/author/show/7732.Mar... 1968 /genres/fiction|/genres/historical-fiction|/ge... dir57/953435.L_uvre_au_noir.html 1601 L'Œuvre au noir Marguerite_Yourcenar fiction|historical-fiction|cultural|france|eur...

Let us delete the genre_urls column.

In [28]:
del df['genre_urls']

And then save the dataframe out!

In [29]:
df.to_csv("data/cleaned-goodreads.csv", index=False, header=True)

Part 3: Grouping

It appears that some books were written in negative years! Print out the observations that correspond to negative years. What do you notice about these books?

In [30]:
# your code here
df[df.year < 0].name
#These are books written before the Common Era (BCE, equivalent to BC).
Out[30]:
47                               The Odyssey
246                    The Iliad/The Odyssey
455                             The Republic
596                               The Aeneid
629                              Oedipus Rex
674                           The Art of War
746                        The Bhagavad Gita
777                                 Antigone
1233                       The Oedipus Cycle
1397                          Aesop's Fables
1398                   The Epic of Gilgamesh
1428                                   Medea
1815                            The Oresteia
1882         The Trial and Death of Socrates
2078    The History of the Peloponnesian War
2527                           The Histories
3133                          Complete Works
3274                  The Nicomachean Ethics
3757                              Lysistrata
4402                           The Symposium
4475                                 Apology
5367                          Five Dialogues
Name: name, dtype: object

We can determine the "best book" by year! For this we use Panda's groupby(). Groupby() allows grouping a dataframe by any (usually categorical) variable. Would it make sense to ever groupby integer variables? Floating point variables?

In [31]:
dfgb_author = df.groupby('author')
type(dfgb_author)
Out[31]:
pandas.core.groupby.generic.DataFrameGroupBy

Perhaps we want the number of books each author wrote

In [32]:
dfgb_author.count()
Out[32]:
rating review_count isbn booktype author_url year dir rating_count name genres
author
A_A_Milne 6 6 6 6 6 6 6 6 6 6
A_G_Howard 1 1 1 1 1 1 1 1 1 1
A_J_Cronin 1 1 1 1 1 1 1 1 1 1
A_J_Jacobs 1 1 1 1 1 1 1 1 1 1
A_J_Salt 1 1 1 1 1 1 1 1 1 1
A_Meredith_Walters 2 2 2 2 2 2 2 2 2 2
A_N_Roquelaure 2 2 2 2 2 2 2 2 2 2
A_S_Byatt 1 1 1 1 1 1 1 1 1 1
A_S_King 1 1 1 1 1 1 1 1 1 1
A_id_al_Qarni 2 2 2 2 2 2 2 2 2 2
Abbi_Glines 14 14 14 14 14 14 14 14 14 14
Abdul_Rahman_Munif 1 1 1 1 1 1 1 1 1 1
Abigail_Gibbs 1 1 1 1 1 1 1 1 1 1
Abigail_Roux 4 4 4 4 4 4 4 4 4 4
Abigail_Thomas 1 1 1 1 1 1 1 1 1 1
Abolqasem_Ferdowsi 1 1 1 1 1 1 1 1 1 1
Abraham_Verghese 1 1 1 1 1 1 1 1 1 1
Abul_Hasan_Ali_Nadwi 1 1 1 1 1 1 1 1 1 1
Adam_Hochschild 1 1 1 1 1 1 1 1 1 1
Adam_Johnson 1 1 1 1 1 1 1 1 1 1
Adam_Levin 1 1 1 1 1 1 1 1 1 1
Adam_Rex 1 1 1 1 1 1 1 1 1 1
Adam_Smith 1 1 1 1 1 1 1 1 1 1
Addison_Moore 1 1 1 1 1 1 1 1 1 1
Adeline_Yen_Mah 1 1 1 1 1 1 1 1 1 1
Adolf_Hitler 1 1 1 1 1 1 1 1 1 1
Adolfo_Bioy_Casares 1 1 1 1 1 1 1 1 1 1
Aeschylus 1 1 1 1 1 1 1 1 1 1
Aesop 1 1 1 1 1 1 1 1 1 1
Agatha_Christie 11 11 11 11 11 11 11 11 11 11
... ... ... ... ... ... ... ... ... ... ...
William_Strunk_Jr_ 1 1 1 1 1 1 1 1 1 1
William_Styron 3 3 3 3 3 3 3 3 3 3
William_Wharton 2 2 2 2 2 2 2 2 2 2
William_Wordsworth 1 1 1 1 1 1 1 1 1 1
Willow_Aster 1 1 1 1 1 1 1 1 1 1
Wilson_Rawls 2 2 2 2 2 2 2 2 2 2
Winston_Groom 1 1 1 1 1 1 1 1 1 1
Witold_Gombrowicz 2 2 2 2 2 2 2 2 2 2
Wm_Paul_Young 1 1 1 1 1 1 1 1 1 1
Woody_Allen 1 1 1 1 1 1 1 1 1 1
Wu_Cheng_en 1 1 1 1 1 1 1 1 1 1
Yamamoto_Tsunetomo 1 1 1 1 1 1 1 1 1 1
Yana_Toboso 1 1 1 1 1 1 1 1 1 1
Yann_Martel 2 2 2 2 2 2 2 2 2 2
Yasunari_Kawabata 1 1 1 1 1 1 1 1 1 1
Yevgeny_Zamyatin 1 1 1 1 1 1 1 1 1 1
Young_Kim 1 1 1 1 1 1 1 1 1 1
Yuehai_Xiao 1 1 1 1 1 1 1 1 1 1
Yukio_Mishima 2 2 2 2 2 2 2 2 2 2
Yukito_Kishiro 1 1 1 1 1 1 1 1 1 1
Yvonne_Woon 1 1 1 1 1 1 1 1 1 1
Zack_Love 1 1 1 1 1 1 1 1 1 1
Zadie_Smith 2 2 2 2 2 2 2 2 2 2
Zilpha_Keatley_Snyder 1 1 1 1 1 1 1 1 1 1
Zora_Neale_Hurston 1 1 1 1 1 1 1 1 1 1
_ 42 42 42 42 42 42 42 42 42 42
_gota_Krist_f 1 1 1 1 1 1 1 1 1 1
_mile_Zola 4 4 4 4 4 4 4 4 4 4
_ric_Emmanuel_Schmitt 1 1 1 1 1 1 1 1 1 1
_sne_Seierstad 1 1 1 1 1 1 1 1 1 1

2645 rows × 10 columns

Lots of useless info there. One column should suffice

Exercise:

  • Group the dataframe by author. Include the following columns: rating, name, author. For the aggregation of the name column which includes the names of the books create a list with the strings containing the name of each book. Make sure that the way you aggregate the rest of the columns make sense!

  • Create a new column with number of books for each author and find the most prolific author!

In [33]:
###### Before we start : what do we do about these titles where 'name' is unreadable? Try different encodings?
auth_name = 'A_id_al_Qarni'
df[df.author == auth_name].head()
Out[33]:
rating review_count isbn booktype author_url year dir rating_count name author genres
2213 4.19 1169 good_reads:book https://www.goodreads.com/author/show/1201952.... 2003 dir23/2750180.html 15781 لا تحزن A_id_al_Qarni religion|religion|islam|self-help|non-fiction|...
5998 3.91 281 good_reads:book https://www.goodreads.com/author/show/1201952.... 2006 dir60/2750008.html 3083 أسعد امرأة في العال٠A_id_al_Qarni religion|islam|religion|self-help|spirituality...
In [34]:
df[df.author == auth_name].iat[0,8].encode('UTF-16')
Out[34]:
b'\xff\xfe\xd9\x00\x84\x00\xd8\x00\xa7\x00 \x00\xd8\x00\xaa\x00\xd8\x00\xad\x00\xd8\x00\xb2\x00\xd9\x00\x86\x00'
In [35]:
# let's examine the columns we have
df.columns
Out[35]:
Index(['rating', 'review_count', 'isbn', 'booktype', 'author_url', 'year', 'dir', 'rating_count', 'name', 'author', 'genres'], dtype='object')

Create the GroupBy table

In [36]:
authors = df.copy()
authors = authors[['rating','name','author']].groupby('author').agg({'rating' : np.mean,
                                                                    'name' : '|'.join})
In [37]:
authors = authors.reset_index()
authors.head()
Out[37]:
author rating name
0 A_A_Milne 4.365 Winnie-the-Pooh|The House at Pooh Corner|The H...
1 A_G_Howard 4.020 Splintered (Splintered, #1)
2 A_J_Cronin 4.220 The Keys of the Kingdom
3 A_J_Jacobs 3.750 The Year of Living Biblically
4 A_J_Salt 4.940 Nik Nassa & the Mark of Destiny
In [38]:
# split the column string and make a list of string book names
authors['name'] = authors.name.str.split('|')
authors.head()
Out[38]:
author rating name
0 A_A_Milne 4.365 [Winnie-the-Pooh, The House at Pooh Corner, Th...
1 A_G_Howard 4.020 [Splintered (Splintered, #1)]
2 A_J_Cronin 4.220 [The Keys of the Kingdom]
3 A_J_Jacobs 3.750 [The Year of Living Biblically]
4 A_J_Salt 4.940 [Nik Nassa & the Mark of Destiny]
In [39]:
# count the books - create new column
len(authors.name[0])
Out[39]:
6
In [40]:
authors['num_books'] = authors['name'].str.len()
authors
Out[40]:
author rating name num_books
0 A_A_Milne 4.365000 [Winnie-the-Pooh, The House at Pooh Corner, Th... 6
1 A_G_Howard 4.020000 [Splintered (Splintered, #1)] 1
2 A_J_Cronin 4.220000 [The Keys of the Kingdom] 1
3 A_J_Jacobs 3.750000 [The Year of Living Biblically] 1
4 A_J_Salt 4.940000 [Nik Nassa & the Mark of Destiny] 1
5 A_Meredith_Walters 4.150000 [Find You in the Dark (Find You in the Dark, #... 2
6 A_N_Roquelaure 3.450000 [Beauty's Punishment (Sleeping Beauty, #2), Th... 2
7 A_S_Byatt 3.860000 [Possession] 1
8 A_S_King 3.930000 [Please Ignore Vera Dietz] 1
9 A_id_al_Qarni 4.050000 [لا تحزن, أسعد امرأة في الØ... 2
10 Abbi_Glines 4.179286 [Fallen Too Far (Too Far, #1), Existence (Exis... 14
11 Abdul_Rahman_Munif 4.030000 [التيه] 1
12 Abigail_Gibbs 3.820000 [Dinner With a Vampire (Dark Heroine, #1)] 1
13 Abigail_Roux 4.470000 [Cut & Run (Cut & Run, #1), Fish & Chips (Cut ... 4
14 Abigail_Thomas 3.680000 [A Three Dog Life] 1
15 Abolqasem_Ferdowsi 4.520000 [Shahnameh] 1
16 Abraham_Verghese 4.260000 [Cutting for Stone] 1
17 Abul_Hasan_Ali_Nadwi 4.150000 [ماذا خسر العالم بانحطاط Ø... 1
18 Adam_Hochschild 4.140000 [King Leopold's Ghost] 1
19 Adam_Johnson 4.030000 [The Orphan Master's Son] 1
20 Adam_Levin 4.040000 [The Instructions] 1
21 Adam_Rex 4.140000 [The True Meaning of Smekday] 1
22 Adam_Smith 3.820000 [The Wealth of Nations] 1
23 Addison_Moore 3.780000 [Ethereal (Celestra, #1)] 1
24 Adeline_Yen_Mah 4.010000 [Chinese Cinderella] 1
25 Adolf_Hitler 2.970000 [Mein Kampf] 1
26 Adolfo_Bioy_Casares 4.060000 [The Invention of Morel] 1
27 Aeschylus 3.960000 [The Oresteia] 1
28 Aesop 4.030000 [Aesop's Fables] 1
29 Agatha_Christie 3.977273 [And Then There Were None, Murder on the Orien... 11
... ... ... ... ...
2615 William_Strunk_Jr_ 4.170000 [The Elements of Style] 1
2616 William_Styron 4.043333 [Sophie's Choice, The Confessions of Nat Turne... 3
2617 William_Wharton 4.090000 [Birdy, A Midnight Clear] 2
2618 William_Wordsworth 3.920000 [Lyrical Ballads] 1
2619 Willow_Aster 4.140000 [True Love Story] 1
2620 Wilson_Rawls 3.990000 [Where the Red Fern Grows, Summer of the Monkeys] 2
2621 Winston_Groom 3.970000 [Forrest Gump (Forrest Gump, #1)] 1
2622 Witold_Gombrowicz 3.975000 [Ferdydurke, Cosmos] 2
2623 Wm_Paul_Young 3.660000 [The Shack] 1
2624 Woody_Allen 4.030000 [Without Feathers] 1
2625 Wu_Cheng_en 4.040000 [Monkey] 1
2626 Yamamoto_Tsunetomo 4.090000 [Hagakure] 1
2627 Yana_Toboso 4.340000 [Black Butler, Vol. 01 (Black Butler, #1)] 1
2628 Yann_Martel 3.470000 [Life of Pi, Beatrice and Virgil] 2
2629 Yasunari_Kawabata 3.740000 [Snow Country] 1
2630 Yevgeny_Zamyatin 3.970000 [We] 1
2631 Young_Kim 3.660000 [Twilight (Twilight: The Graphic Novel, #1)] 1
2632 Yuehai_Xiao 4.560000 [Crossing the Seas] 1
2633 Yukio_Mishima 4.030000 [The Sailor Who Fell from Grace with the Sea, ... 2
2634 Yukito_Kishiro 4.150000 [Battle Angel Alita, Volume 01] 1
2635 Yvonne_Woon 3.950000 [Dead Beautiful (Dead Beautiful, #1)] 1
2636 Zack_Love 3.550000 [Sex in the Title] 1
2637 Zadie_Smith 3.655000 [White Teeth, On Beauty] 2
2638 Zilpha_Keatley_Snyder 4.110000 [The Changeling] 1
2639 Zora_Neale_Hurston 3.820000 [Their Eyes Were Watching God] 1
2640 _ 3.988095 [عزازيل, ثلاثية غرناطة, تر... 42
2641 _gota_Krist_f 4.340000 [The Notebook, The Proof, The Third Lie] 1
2642 _mile_Zola 3.990000 [Germinal (Les Rougon-Macquart, #13), L'Assomm... 4
2643 _ric_Emmanuel_Schmitt 4.160000 [Oscar et la dame rose] 1
2644 _sne_Seierstad 3.740000 [The Bookseller of Kabul] 1

2645 rows × 4 columns

In [41]:
# sort for more prolific
authors.sort_values(by='num_books', ascending=False).iloc[0]
Out[41]:
author                                            Stephen_King
rating                                                 3.91875
name         [The Stand, The Shining (The Shining #1), It, ...
num_books                                                   56
Name: 2349, dtype: object

Winner is Stephen King with 56 books! OMG!!!

Perhaps you want more detailed info...

In [ ]:
dfgb_author[['rating', 'rating_count', 'review_count', 'year']].describe()

You can also access a groupby dictionary style.

In [ ]:
ratingdict = {}
for author, subset in dfgb_author:
    ratingdict[author] = (subset['rating'].mean(), subset['rating'].std())
ratingdict
Exercise

Lets get the best-rated book(s) for every year in our dataframe.

In [ ]:
#Using .groupby, we can divide the dataframe into subsets by the values of 'year'.
#We can then iterate over these subsets
# your code here
for year, subset in df.groupby('year'):
    #Find the best book of the year

    bestbook = subset[subset.rating == subset.rating.max()]
    if bestbook.shape[0] > 1:
        print(year, bestbook.name.values, bestbook.rating.values)
    else:
        print(year, bestbook.name.values[0], bestbook.rating.values[0])