CS109A Introduction to Data Science

Lab 2 Scraping

Harvard University
Summer 2018
Instructors: Pavlos Protopapas and Kevin Rader
Lab Instructors: Rahul Dave
Authors: Rahul Dave, David Sondak, Will Claybaugh and Pavlos Protopapas

In [1]:
from IPython.core.display import HTML
def css_styling():
    styles = open("../../styles/cs109.css", "r").read()
    return HTML(styles)
In [2]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns  # seaborn.apionly is deprecated; plain seaborn no longer sets a style on import
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
In [3]:
import time, requests

In this lab, we'll scrape Goodreads' Best Books list:

https://www.goodreads.com/list/show/1.Best_Books_Ever?page=1

We'll walk through scraping the list pages for the book names/urls.

Table of Contents

  1. Learning Goals
  2. Exploring the Web pages and downloading them
  3. Parse the page, extract book urls
  4. Parse a book page, extract book properties
  5. Set up a pipeline for fetching and parsing

Learning Goals

Understand the structure of a web page. Use Beautiful Soup to scrape content from these web pages.

This lab corresponds to lectures 2, 3 and 4 and maps on to homework 1 and further.

1. Exploring the web pages and downloading them

We're going to look at the structure of Goodreads' best books list. We'll use the Developer Tools in Chrome; Safari and Firefox have similar tools available.

To fetch this page we use the requests module. But are we allowed to do this? Let's check:

Yes we are.
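
One way to make the "are we allowed?" check concrete is to read a site's robots.txt rules programmatically. The sketch below uses Python's standard urllib.robotparser. The rules in SAMPLE_ROBOTS_TXT are made up for illustration and are an assumption, not a copy of the real Goodreads file; is_allowed is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, invented for this example.
# They are NOT the real Goodreads rules.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Allow: /list/
"""

def is_allowed(robots_text, url, user_agent="*"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)

list_url = "https://www.goodreads.com/list/show/1.Best_Books_Ever?page=1"
print(is_allowed(SAMPLE_ROBOTS_TXT, list_url))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://www.goodreads.com/admin/settings"))
```

Against a live site, the same parser can load the real file with its set_url and read methods instead of parse. Either way, robots.txt is only part of the picture; a site's terms of service may also apply.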

In [4]:
url = "https://www.goodreads.com/list/show/1.Best_Books_Ever?page=1"  # the list page given above
page = requests.get(url)

We can see properties of the page. Most relevant are status_code and text. The former tells us whether the web page was found and, if so, whether the request succeeded. (See lecture notes.)

In [5]:
page.status_code # 200 is good
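
To see what a status code means without making a request, the standard library's http.HTTPStatus maps codes to their reason phrases. This is a small sketch; describe_status is a hypothetical helper, and treating the 2xx range as "good" is the usual convention rather than something specific to this lab.

```python
from http import HTTPStatus

def describe_status(code):
    """Return (is_success, phrase) for an HTTP status code."""
    phrase = HTTPStatus(code).phrase
    is_success = 200 <= code < 300  # 2xx codes indicate success
    return is_success, phrase

for code in (200, 404, 503):
    ok, phrase = describe_status(code)
    print(code, phrase, "-> ok" if ok else "-> not ok")
```

With requests, calling raise_for_status() on the response raises an exception for 4xx and 5xx codes, which is a convenient guard before parsing page.text.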
Let us write a loop to fetch 2 pages of "best-books" from goodreads. Notice the use of a format string. This is an example of old-style python format strings.

In [7]:
URLSTART="https://www.goodreads.com"
BESTBOOKS="/list/show/1.Best_Books_Ever?page="
for i in range(1,3):
    bookpage=str(i)
    stuff=requests.get(URLSTART+BESTBOOKS+bookpage)
    filetowrite="files/page"+ '%02d' % i + ".html"
    print("FTW", filetowrite)
    fd=open(filetowrite,"w")
    fd.write(stuff.text)
    fd.close()
    time.sleep(2)
FTW files/page01.html
FTW files/page02.html
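
The '%02d' % i expression in the loop above is old-style (printf-style) formatting. Here is a short standalone sketch of how it behaves; page_filename is a hypothetical helper that mirrors the file naming used in the loop.

```python
def page_filename(i):
    """Build the zero-padded file name used for list page i."""
    # %02d formats an integer, padded with zeros to a width of at least 2
    return "files/page" + "%02d" % i + ".html"

print(page_filename(1))
print(page_filename(12))

# Several values are passed as a tuple on the right of the % operator
print("%s has %d pages, mean rating %.2f" % ("list", 2, 4.3333))
```

The width is a minimum, not a maximum, so a three-digit page number is written in full rather than truncated.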

2. Parse the page, extract book urls

Notice how we do file input-output, and use beautiful soup in the code below. The with construct ensures that the file being read is closed, something we do explicitly for the file being written. We look for the elements with class bookTitle, extract the urls, and write them into a file.

In [8]:
from bs4 import BeautifulSoup
In [9]:
bookdict={}
for i in range(1,3):
    books=[]
    stri = '%02d' % i
    filetoread="files/page"+ stri + '.html'
    print("FTW", filetoread)
    with open(filetoread) as fdr:
        data = fdr.read()
    soup = BeautifulSoup(data, 'html.parser')
    for e in soup.select('.bookTitle'):
        books.append(e['href'])
    print(books[:10])
    bookdict[stri]=books
    fd=open("files/list"+stri+".txt","w")
    fd.write("\n".join(books))
    fd.close()
FTW files/page01.html
['/book/show/2767052-the-hunger-games', '/book/show/2.Harry_Potter_and_the_Order_of_the_Phoenix', '/book/show/2657.To_Kill_a_Mockingbird', '/book/show/1885.Pride_and_Prejudice', '/book/show/41865.Twilight', '/book/show/19063.The_Book_Thief', '/book/show/11127.The_Chronicles_of_Narnia', '/book/show/7613.Animal_Farm', '/book/show/18405.Gone_with_the_Wind', '/book/show/30.J_R_R_Tolkien_4_Book_Boxed_Set']
FTW files/page02.html
['/book/show/5470.1984', '/book/show/4989.The_Red_Tent', '/book/show/37435.The_Secret_Life_of_Bees', '/book/show/5.Harry_Potter_and_the_Prisoner_of_Azkaban', '/book/show/7171637-clockwork-angel', '/book/show/2187.Middlesex', '/book/show/2623.Great_Expectations', '/book/show/24583.The_Adventures_of_Tom_Sawyer', '/book/show/49552.The_Stranger', '/book/show/16299.And_Then_There_Were_None']
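
To get a feel for what a class selector such as '.bookTitle' picks out, here is a standard-library-only sketch that collects the href of every link carrying that class. It deliberately swaps Beautiful Soup for html.parser.HTMLParser so that it runs without extra packages. The HTML snippet is made up to resemble a list page and is an assumption, and BookTitleLinks is a hypothetical class name.

```python
from html.parser import HTMLParser

class BookTitleLinks(HTMLParser):
    """Collect href values of <a> tags whose class list includes bookTitle."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "bookTitle" in classes and "href" in attrs:
            self.links.append(attrs["href"])

# Made-up markup that mimics the shape of a list page (an assumption)
SAMPLE_HTML = """
<table>
  <tr><td><a class="bookTitle" href="/book/show/1.First_Book"><span>First Book</span></a></td></tr>
  <tr><td><a class="authorName" href="/author/show/9.Someone">Someone</a></td></tr>
  <tr><td><a class="bookTitle" href="/book/show/2.Second_Book"><span>Second Book</span></a></td></tr>
</table>
"""

parser = BookTitleLinks()
parser.feed(SAMPLE_HTML)
print(parser.links)
```

Unlike this sketch, a '.bookTitle' CSS selector matches any element with that class, not only a tags, and Beautiful Soup builds a full tree you can keep querying. The sketch only shows the matching idea.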

Here is George Orwell's 1984

In [10]:
bookdict['02'][0]
Out[10]:
'/book/show/5470.1984'

Let's go look at the first URLs on both pages

[Image: images/goodreads2.png]

3. Parse a book page, extract book properties

Ok so now let's dive in and get one of these files and parse it.

In [11]:
furl=URLSTART+bookdict['02'][0]
furl
Out[11]:
'https://www.goodreads.com/book/show/5470.1984'
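
The book links scraped from the list pages are relative (they start with /book/show/), so the site root has to be prepended, as the string concatenation above does. A slightly more defensive sketch uses urllib.parse.urljoin, which also copes with hrefs that are already absolute. BASE and absolute_url are hypothetical names for this illustration.

```python
from urllib.parse import urljoin

BASE = "https://www.goodreads.com"  # same value as URLSTART in the lab

def absolute_url(href, base=BASE):
    """Resolve a possibly relative href against the site root."""
    return urljoin(base, href)

print(absolute_url("/book/show/5470.1984"))
print(absolute_url("https://example.com/already/absolute"))
```

For the relative paths in this lab the result is the same as URLSTART + href. The difference only shows up when an href is already a full URL.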

[Image: images/goodreads3.png]

In [12]:
fstuff=requests.get(furl)
print(fstuff.status_code)
In [13]:
d=BeautifulSoup(fstuff.text, 'html.parser')
In [15]:
d.prettify()

So that we don't overwhelm their servers, we will only fetch 5 from each page, but you get the idea...

We'll segue off a bit to explore new-style format strings. See https://pyformat.info for more info.

In [16]:
"list{:0>2}.txt".format(3)
Out[16]:
'list03.txt'
In [17]:
a = "4"
b = 4
class Four:
    def __str__(self):
        return "Fourteen"
c=Four()
In [18]:
"The hazy cat jumped over the {} and {} and {}".format(a, b, c)
Out[18]:
'The hazy cat jumped over the 4 and 4 and Fourteen'

4. Set up a pipeline for fetching and parsing

Ok let's get back to the fetching...

In [19]:
fetched=[]
for i in range(1,3):
    with open("files/list{:0>2}.txt".format(i)) as fd:
        counter=0
        for bookurl_line in fd:
            if counter > 4:
                break
            bookurl=bookurl_line.strip()
            stuff=requests.get(URLSTART+bookurl)
            filetowrite=bookurl.split('/')[-1]
            filetowrite="files/"+str(i)+"_"+filetowrite+".html"
            print("FTW", filetowrite)
            # use a separate name for the output handle so it does not shadow the list file fd
            fdw=open(filetowrite,"w", encoding='utf-8')
            fdw.write(stuff.text)
            fdw.close()
            fetched.append(filetowrite)
            time.sleep(2)
            counter=counter+1
print(fetched)
FTW files/1_2767052-the-hunger-games.html
FTW files/1_2.Harry_Potter_and_the_Order_of_the_Phoenix.html
FTW files/1_2657.To_Kill_a_Mockingbird.html
FTW files/1_1885.Pride_and_Prejudice.html
FTW files/1_41865.Twilight.html
FTW files/2_5470.1984.html
FTW files/2_4989.The_Red_Tent.html
FTW files/2_37435.The_Secret_Life_of_Bees.html
FTW files/2_5.Harry_Potter_and_the_Prisoner_of_Azkaban.html
FTW files/2_7171637-clockwork-angel.html
['files/1_2767052-the-hunger-games.html', 'files/1_2.Harry_Potter_and_the_Order_of_the_Phoenix.html', 'files/1_2657.To_Kill_a_Mockingbird.html', 'files/1_1885.Pride_and_Prejudice.html', 'files/1_41865.Twilight.html', 'files/2_5470.1984.html', 'files/2_4989.The_Red_Tent.html', 'files/2_37435.The_Secret_Life_of_Bees.html', 'files/2_5.Harry_Potter_and_the_Prisoner_of_Azkaban.html', 'files/2_7171637-clockwork-angel.html']

Ok we are off to parse each one of the html pages we fetched.
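
The counter-and-break pattern in the fetch loop above limits each list file to its first five URLs. The same idea can be sketched with itertools.islice, which stops after a fixed number of items. first_urls is a hypothetical helper, and unlike the lab code it also skips blank lines, which is an added assumption.

```python
import io
from itertools import islice

def first_urls(lines, limit=5):
    """Return up to `limit` stripped, non-empty lines from an iterable of lines."""
    stripped = (line.strip() for line in lines)
    non_empty = (s for s in stripped if s)
    return list(islice(non_empty, limit))

# io.StringIO stands in for an open list file such as files/list01.txt
sample = io.StringIO("/book/show/a\n/book/show/b\n\n/book/show/c\n")
print(first_urls(sample, limit=2))
```

Because islice is lazy, the rest of the file is never read once the limit is reached, just as with the explicit break.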
We have provided the skeleton of the code and the code to parse the year, since it is a bit more complex... see the difference in the screenshots above.

In [20]:
import re
yearre = r'\d{4}'
def get_year(d):
    if d.select_one("nobr.greyText"):
        return d.select_one("nobr.greyText").text.strip().split()[-1][:-1]
    else:
        thetext=d.select("div#details div.row")[1].text.strip()
        rowmatch=re.findall(yearre, thetext)
        if len(rowmatch) > 0:
            rowtext=rowmatch[0].strip()
        else:
            rowtext="NA"
        return rowtext

Exercise

Your job is to fill in the code to get the genres.

In [21]:
def get_genres(d):
    # your code here
    genres=d.select("div.elementList div.left a")
    glist=[]
    for g in genres:
        glist.append(g['href'])
    return glist
In [22]:
listofdicts=[]
for filetoread in fetched:
    print(filetoread)
    td={}
    # read with the same encoding the files were written with
    with open(filetoread, encoding='utf-8') as fd:
        datext = fd.read()
    d=BeautifulSoup(datext, 'html.parser')
    td['title']=d.select_one("meta[property='og:title']")['content']
    td['isbn']=d.select_one("meta[property='books:isbn']")['content']
    td['booktype']=d.select_one("meta[property='og:type']")['content']
    td['author']=d.select_one("meta[property='books:author']")['content']
    td['rating']=d.select_one("span.average").text
    td['ratingCount']=d.select_one("meta[itemprop='ratingCount']")["content"]
    td['reviewCount']=d.select_one("span.count")["title"]
    td['year'] = get_year(d)
    td['file']=filetoread
    glist = get_genres(d)
    td['genres']="|".join(glist)
    listofdicts.append(td)
files/1_2767052-the-hunger-games.html
files/1_2.Harry_Potter_and_the_Order_of_the_Phoenix.html
files/1_2657.To_Kill_a_Mockingbird.html
files/1_1885.Pride_and_Prejudice.html
files/1_41865.Twilight.html
files/2_5470.1984.html
files/2_4989.The_Red_Tent.html
files/2_37435.The_Secret_Life_of_Bees.html
files/2_5.Harry_Potter_and_the_Prisoner_of_Azkaban.html
files/2_7171637-clockwork-angel.html
In [23]:
listofdicts[0]
Out[23]:
{'title': 'The Hunger Games (The Hunger Games, #1)',
 'isbn': '9780439023481',
 'booktype': 'books.book',
 'author': 'https://www.goodreads.com/author/show/153394.Suzanne_Collins',
 'rating': '4.33',
 'ratingCount': '5491176',
 'reviewCount': '160373',
 'year': '2008',
 'file': 'files/1_2767052-the-hunger-games.html',
 'genres': '/genres/young-adult|/genres/fiction|/genres/science-fiction|/genres/dystopia|/genres/fantasy|/genres/science-fiction'}

Finally let's write all this stuff into a csv file which we will use to do analysis.

In [24]:
df = pd.DataFrame.from_records(listofdicts)
df.head()
Out[24]:
   author                                             booktype    file                                               genres                                             isbn           rating  ratingCount  reviewCount  title                                              year
0  https://www.goodreads.com/author/show/153394.S...  books.book  files/1_2767052-the-hunger-games.html              /genres/young-adult|/genres/fiction|/genres/sc...  9780439023481  4.33    5491176      160373       The Hunger Games (The Hunger Games, #1)            2008
1  https://www.goodreads.com/author/show/1077326....  books.book  files/1_2.Harry_Potter_and_the_Order_of_the_Ph...  /genres/fantasy|/genres/young-adult|/genres/fi...  9780439358071  4.48    2030257      33033        Harry Potter and the Order of the Phoenix (Har...  2003
2  https://www.goodreads.com/author/show/1825.Har...  books.book  files/1_2657.To_Kill_a_Mockingbird.html            /genres/classics|/genres/fiction|/genres/histo...  9780061120084  4.27    3722962      79058        To Kill a Mockingbird (To Kill a Mockingbird, #1)  1960
3  https://www.goodreads.com/author/show/1265.Jan...  books.book  files/1_1885.Pride_and_Prejudice.html              /genres/classics|/genres/fiction|/genres/romance   9780679783268  4.25    2438138      54013        Pride and Prejudice                                1813
4  https://www.goodreads.com/author/show/941441.S...  books.book  files/1_41865.Twilight.html                        /genres/young-adult|/genres/fantasy|/genres/ro...  9780316015844  3.58    4262416      97797        Twilight (Twilight, #1)                            2005
In [25]:
df.to_csv("files/meta.csv", index=False, header=True)
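
The genres field stored above is a "|"-joined string of hrefs, and in the Hunger Games record /genres/science-fiction appears twice. As a possible follow-up step before analysis, the sketch below turns such a string into a de-duplicated list of genre names. clean_genres is a hypothetical helper, not part of the lab's provided code.

```python
def clean_genres(genre_field):
    """Split a '|'-joined string of genre hrefs into unique genre names, keeping order."""
    names = []
    for href in genre_field.split("|"):
        name = href.rsplit("/", 1)[-1]  # '/genres/fiction' -> 'fiction'
        if name and name not in names:
            names.append(name)
    return names

raw = ("/genres/young-adult|/genres/fiction|/genres/science-fiction|"
       "/genres/dystopia|/genres/fantasy|/genres/science-fiction")
print(clean_genres(raw))
```

After reading files/meta.csv back with pandas, a function like this could be applied to the genres column, for example with df['genres'].apply(clean_genres).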