Key Word(s): Pandas
CS109A Introduction to Data Science
Lab 2: Pandas and Web Scraping with Beautiful Soup¶
Harvard University
Fall 2019
Instructors: Pavlos Protopapas, Kevin Rader, and Chris Tanner
Lab Instructors: Chris Tanner and Eleni Kaxiras
Authors: Rahul Dave, David Sondak, Will Claybaugh, Pavlos Protopapas, Chris Tanner, Eleni Kaxiras
## RUN THIS CELL TO GET THE RIGHT FORMATTING
from IPython.core.display import HTML
def css_styling():
    styles = open("../../styles/cs109.css", "r").read()
    return HTML(styles)
css_styling()
Table of Contents¶
- Learning Goals
- Loading and Cleaning with Pandas
- Parsing and Completing the Dataframe
- Grouping
Learning Goals¶
About 6,000 "best books" were fetched and parsed from Goodreads. The "bestness" of these books came from a proprietary formula used by Goodreads and published as a list on their website.
We parsed the page for each book and saved data from all these pages in a tabular format as a CSV file. In this lab we'll clean and further parse the data. We'll then do some exploratory data analysis to answer questions about these best books and popular genres.
By the end of this lab, you should be able to:
- Load and systematically address missing values, encoded as NaN values in our data set, for example, by removing observations associated with these values.
- Parse columns in the dataframe to create new dataframe columns.
- Use groupby to aggregate data on a particular feature column, such as author.
This lab corresponds to lectures #1, #2, and #3 and maps onto homework #1 and beyond.
Basic EDA workflow¶
(From the lecture, repeated here for convenience).
The basic workflow is as follows:
- Build a DataFrame from the data (ideally, put all data in this object)
- Clean the DataFrame. It should have the following properties:
- Each row describes a single object
- Each column describes a property of that object
- Columns are numeric whenever appropriate
- Columns contain atomic properties that cannot be further decomposed
- Explore global properties. Use histograms, scatter plots, and aggregation functions to summarize the data.
- Explore group properties. Use groupby and small multiples to compare subsets of the data.
This process transforms your data into a format which is easier to work with, gives you a basic overview of the data's properties, and likely generates several questions for you to follow up on in subsequent analysis.
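A minimal sketch of this workflow in pandas, with a hypothetical file and hypothetical column names (some_data.csv, some_column, and some_category are placeholders, not part of this lab's data):
import pandas as pd

# 1. Build a DataFrame from the data
df = pd.read_csv("data/some_data.csv")
# 2. Clean: drop rows missing a key value
df = df[df.some_column.notnull()]
# 3. Explore global properties with an aggregation
print(df.some_column.describe())
# 4. Explore group properties with groupby
print(df.groupby('some_category').some_column.mean())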
Part 1: Loading and Cleaning with Pandas¶
Read in the goodreads.csv file, examine the data, and do any necessary data cleaning.
Here is a description of the columns (in order) present in this csv file:
- rating: the average rating on a 1-5 scale achieved by the book
- review_count: the number of Goodreads users who reviewed this book
- isbn: the ISBN code for the book
- booktype: an internal Goodreads identifier for the book
- author_url: the Goodreads (relative) URL for the author of the book
- year: the year the book was published
- genre_urls: a string with '|' separated relative URLs of Goodreads genre pages
- dir: a directory identifier internal to the scraping code
- rating_count: the number of ratings for this book (this is different from the number of reviews)
- name: the name of the book
Let us see what issues we find with the data and resolve them.
After loading the appropriate libraries:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
Cleaning: Reading in the data¶
We read in and clean the data from goodreads.csv.
#Read the data into a dataframe
df = pd.read_csv("data/goodreads.csv", encoding='utf-8')
#Examine the dataframe
df
Oh dear. That does not quite seem to be right. We are missing the column names. We need to add these in! But what are they?
Here is a list of them in order:
["rating", 'review_count', 'isbn', 'booktype','author_url', 'year', 'genre_urls', 'dir','rating_count', 'name']
# your code here
df = pd.read_csv("data/goodreads.csv", header=None,
                 names=["rating", 'review_count', 'isbn', 'booktype', 'author_url',
                        'year', 'genre_urls', 'dir', 'rating_count', 'name'])
#Examine the first few rows of the dataframe
df.head()
Cleaning: Examining the dataframe - quick checks¶
We should examine the dataframe to get an overall sense of the content.
# your code here
#######
df.dtypes
#######
your answer here
Notice that review_count and rating_count are objects instead of ints, and year is a float!
There are a couple more quick sanity checks to perform on the dataframe.
print(df.shape)
df.columns
Cleaning: Examining the dataframe - a deeper look¶
Beyond checking some quick general properties of the data frame and looking at the first $n$ rows, we can dig a bit deeper into the values being stored. If you haven't already, check to see if there are any missing values in the data frame.
Let's check a column that seemed OK to us.
#Get a sense of how many missing values there are in the dataframe.
print(np.sum(df.rating.isnull()))
print(np.sum(df.review_count.isnull()))
print(np.sum(df.isbn.isnull()))
print(np.sum(df.booktype.isnull()))
print(np.sum(df.author_url.isnull()))
print(np.sum(df.year.isnull()))
print(np.sum(df.genre_urls.isnull()))
print(np.sum(df.dir.isnull()))
print(np.sum(df.rating_count.isnull()))
print(np.sum(df.name.isnull()))
#Try to locate where the missing values occur
df[df.rating.isnull()]
How do pandas and numpy handle missing values when we try to compute with data sets that include them?
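As a quick illustration (a toy Series, not our lab data): numpy propagates NaN through arithmetic, while pandas aggregations skip NaN by default:
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0])
print(np.mean(s.values))       # nan -- numpy propagates the missing value
print(s.mean())                # 2.333... -- pandas skips NaN by default
print(s.mean(skipna=False))    # nan -- unless we ask it not to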
We'll now check if any of the other suspicious columns have missing values. Let's look at year and review_count first.
One thing you can do is try to convert the column to the type you expect it to be. If something goes wrong, it likely means your data are bad.
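For instance, a sketch of this probe on the year column (the conversion either succeeds or the error message points us at the bad values):
# Probe: attempt the conversion; a failure suggests missing or malformed values
try:
    df.year.astype(int)
except (ValueError, TypeError) as e:
    print("Conversion failed:", e)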
Let's test for missing data:
df[df.year.isnull()]
df.year.isnull()
df.shape
Cleaning: Dealing with Missing Values¶
How should we interpret 'missing' or 'invalid' values in the data (hint: look at where these values occur)? One approach is to simply exclude them from the dataframe. Is this appropriate for all 'missing' or 'invalid' values?
#Treat the missing or invalid values in your dataframe
#######
df = df[df.year.notnull()]
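Dropping rows is what we do here, but it is not the only option. As an illustrative sketch (not used in the rest of this lab), we could instead impute a value such as the median year; whether that is appropriate depends on the analysis:
# Alternative sketch: impute missing years with the median instead of dropping rows
df_imputed = df.copy()
df_imputed['year'] = df_imputed['year'].fillna(df_imputed['year'].median())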
OK, so we have done some cleaning. What do things look like now? Notice that year is still a float.
df.dtypes
print(np.sum(df.year.isnull()))
df.shape # We removed seven rows
OK, so let's fix those types by converting them to ints. If the type conversion fails, we will know we have further problems.
# your code here
df.rating_count=df.rating_count.astype(int)
df.review_count=df.review_count.astype(int)
df.year=df.year.astype(int)
Once you do this, we seem to be good on these columns (no errors in conversion). Let's look:
df.dtypes
Sweet!
Some of the other columns that should be strings have NaN; we replace those with empty strings.
df.loc[df.genre_urls.isnull(), 'genre_urls']=""
df.loc[df.isbn.isnull(), 'isbn']=""
Part 2: Parsing and Completing the Data Frame¶
We will parse the author column from the author_url and the genres column from the genre_urls. Keep the genres column as a string separated by '|'.
We will use pandas' map to assign new columns to the dataframe.
Examine an example author_url and reason about which sequence of string operations must be performed in order to isolate the author's name.
#Get the first author_url
test_string = df.author_url[0]
test_string
#Test out some string operations to isolate the author name
test_string.split('/')[-1].split('.')[1:][0]
Let's wrap the above code into a function, which we will then use.
# Write a function that accepts an author url and returns the author's name based on your experimentation above
def get_author(url):
    # your code here
    name = url.split('/')[-1].split('.')[1:][0]
    #######
    return name
#Apply the get_author function to the 'author_url' column using '.map'
#and add a new column 'author' to store the names
df['author'] = df.author_url.map(get_author)
df.author[0:5]
Now parse out the genres from genre_urls. This is a little more complicated because there may be more than one genre.
df.genre_urls.head()
# your code here
#Examine some examples of genre_urls
#Test out some string operations to isolate the genre name
test_genre_string=df.genre_urls[0]
genres=test_genre_string.strip().split('|')
for e in genres:
    print(e.split('/')[-1])
"|".join(genres)
Write a function that accepts a genre url and returns the genre names, based on your experimentation above.
def split_and_join_genres(url):
    # your code here
    genres = url.strip().split('|')
    genres = [e.split('/')[-1] for e in genres]
    return "|".join(genres)
Test your function
split_and_join_genres("/genres/young-adult|/genres/science-fiction")
split_and_join_genres("")
Use map again to create a new "genres" column
df['genres']=df.genre_urls.map(split_and_join_genres)
df.head()
Finally, let's pick an author at random so we can see the results of the transformations. Scroll to see the author and genres columns that we added to the dataframe.
df[df.author == "Marguerite_Yourcenar"]
Let us delete the genre_urls column.
del df['genre_urls']
And then save the dataframe out!
df.to_csv("data/cleaned-goodreads.csv", index=False, header=True)
Part 3: Grouping¶
It appears that some books were written in negative years! Print out the observations that correspond to negative years. What do you notice about these books?
# your code here
df[df.year < 0].name
#These are books written before the Common Era (BCE, equivalent to BC).
We can determine the "best book" by year! For this we use pandas' groupby(). groupby() allows grouping a dataframe by any (usually categorical) variable. Would it ever make sense to group by integer variables? Floating point variables?
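Grouping by an integer such as year can make sense, since there are relatively few distinct values; grouping directly on a float such as rating would give roughly one group per distinct value. A sketch of binning the float first with pd.cut (the bin edges here are arbitrary):
# Bin the continuous 'rating' column so the groups are meaningful
rating_bins = pd.cut(df.rating, bins=[0, 3.0, 3.5, 4.0, 4.5, 5.0])
print(df.groupby(rating_bins).name.count())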
dfgb_author = df.groupby('author')
type(dfgb_author)
Perhaps we want the number of books each author wrote
dfgb_author.count()
Lots of useless info there. One column should suffice
Exercise:¶
- Group the dataframe by author. Include the following columns: rating, name, author.
- For the aggregation of the name column, which contains the names of the books, create a list of strings with the name of each book. Make sure that the way you aggregate the rest of the columns makes sense!
- Create a new column with the number of books for each author and find the most prolific author!
# Before we start: what do we do about these titles where 'name' is unreadable? Try different encodings?
auth_name = 'A_id_al_Qarni'
df[df.author == auth_name].head()
df[df.author == auth_name].iat[0,8].encode('UTF-16')
# let's examine the columns we have
df.columns
Create the GroupBy table
authors = df.copy()
authors = authors[['rating', 'name', 'author']].groupby('author').agg(
    {'rating': np.mean, 'name': '|'.join})
authors = authors.reset_index()
authors.head()
# split the column string and make a list of string book names
authors['name'] = authors.name.str.split('|')
authors.head()
# count the books - create new column
len(authors.name[0])
authors['num_books'] = authors['name'].str.len()
authors
# sort for more prolific
authors.sort_values(by='num_books', ascending=False).iloc[0]
The winner is Stephen King with 56 books! OMG!!!¶
Perhaps you want more detailed info...
dfgb_author[['rating', 'rating_count', 'review_count', 'year']].describe()
You can also access a groupby object dictionary-style.
ratingdict = {}
for author, subset in dfgb_author:
    ratingdict[author] = (subset['rating'].mean(), subset['rating'].std())
ratingdict
Let's get the best-rated book(s) for every year in our dataframe.
#Using .groupby, we can divide the dataframe into subsets by the values of 'year'.
#We can then iterate over these subsets
# your code here
for year, subset in df.groupby('year'):
    # Find the best book of the year
    bestbook = subset[subset.rating == subset.rating.max()]
    if bestbook.shape[0] > 1:
        print(year, bestbook.name.values, bestbook.rating.values)
    else:
        print(year, bestbook.name.values[0], bestbook.rating.values[0])