{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# CS109A Introduction to Data Science \n",
"\n",
"## Lab 2: Web Scraping with Beautiful Soup\n",
"\n",
"**Harvard University** \n",
"**Fall 2019** \n",
"**Instructors:** Pavlos Protopapas, Kevin Rader, and Chris Tanner \n",
"**Lab Instructors:** Chris Tanner and Eleni Kaxiras \n",
"\n",
"**Authors:** Rahul Dave, David Sondak, Will Claybaugh, Pavlos Protopapas, Chris Tanner, and Eleni Kaxiras\n",
"\n",
"---"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## RUN THIS CELL TO GET THE RIGHT FORMATTING \n",
"from IPython.core.display import HTML\n",
"def css_styling():\n",
"    styles = open(\"../../styles/cs109.css\", \"r\").read()\n",
"    return HTML(styles)\n",
"css_styling()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"import time"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Table of Contents \n",
"\n",
"- Learning Goals\n",
"- Introduction to Web Servers and HTTP\n",
"- Download webpages and get basic properties\n",
"- Parse the page with Beautiful Soup\n",
"- String formatting\n",
"- Additional Python/Homework Comment\n",
"- Walkthrough Example"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Learning Goals\n",
"\n",
"- Understand the structure of a web page.\n",
"- Understand how to use Beautiful Soup to scrape content from web pages.\n",
"- Feel comfortable storing and manipulating the content in various formats.\n",
"- Understand how to convert a structured format into a Pandas DataFrame.\n",
"\n",
"In this lab, we'll scrape Goodreads' Best Books list:\n",
"\n",
"https://www.goodreads.com/list/show/1.Best_Books_Ever?page=1\n",
"\n",
"We'll walk through scraping the list pages for the book names and URLs. First, we start with an even simpler example.\n",
"\n",
"*This lab corresponds to lectures #2 and #3 and maps onto Homework #1 and beyond.*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. Introduction to Web Servers and HTTP\n",
"\n",
"A web server is just a computer -- usually a powerful one, but ultimately it's just another computer -- that runs a long-lived process listening for requests on a pre-specified (Internet) _port_ on that machine. It responds to those requests via a protocol called HTTP (HyperText Transfer Protocol); HTTPS is the secure version. When we use a web browser and navigate to a web page, our browser is actually sending a request on our behalf to a specific web server. The request is essentially saying \"hey, please give me the web page contents\", and it's up to the browser to render the raw content it receives coherently, depending on the format of the file. For example, HTML is one format, XML is another format, and so on.\n",
"\n",
"Ideally (and usually), the web server complies with the request and all is fine. As part of this communication exchange with web servers, the server also sends a status code.\n",
"- If the code starts with a **2**, it means the request was successful.\n",
"- If the code starts with a **4**, it means there was a client error (you, as the user, are the client). For example, ever receive a 404 File Not Found error because a web page doesn't exist? This is an example of a client error, because you are requesting a bogus item.\n",
"- If the code starts with a **5**, it means there was a server error: the server failed to fulfill a request that appeared valid. (An incorrectly formed request is a client error, typically a 400.)\n",
"\n",
"[Click here](https://www.restapitutorial.com/httpstatuscodes.html) for a full list of status codes.\n",
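"\n",
"The status-code families above can be sketched in a few lines of Python. This is only an illustration: ``describe_status`` is a hypothetical helper name, not part of any library, and it distinguishes only the three families listed above.\n",
"\n",
"```python\n",
"def describe_status(code):\n",
"    # The hundreds digit of an HTTP status code identifies its family.\n",
"    family = code // 100\n",
"    if family == 2:\n",
"        return 'success'\n",
"    if family == 4:\n",
"        return 'client error'\n",
"    if family == 5:\n",
"        return 'server error'\n",
"    return 'other'  # e.g. informational (1xx) or redirection (3xx)\n",
"\n",
"for code in (200, 404, 503):\n",
"    print(code, describe_status(code))\n",
"```\n",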
"\n",
"As an analogy, you can think of a web server as being like a server at a restaurant; its goal is to _serve_ your requests. When you try to order something not on the menu (i.e., ask for a web page at a wrong location), the server says 'sorry, we don't have that' (i.e., 404, client error; your mistake).\n",
"\n",
"**IMPORTANT:**\n",
"As humans, we visit pages at a sane, reasonable rate. However, as we start to scrape web pages with our computers, we will be sending requests with our code, and thus we can make requests at an incredible rate. This is potentially dangerous because it's akin to going to a restaurant and bombarding the server(s) with thousands of food orders. Very often, the restaurant will ban you (i.e., Harvard's network gets banned from the website, and you may be held responsible in some capacity). It is imperative to be responsible and careful. In fact, flooding a website with requests is one of the oldest and most common methods for maliciously attacking websites and other computers with Internet connections (a denial-of-service attack). In short, be respectful and careful with your decisions and code. It is better to err on the side of caution, which includes using the **``time.sleep()`` function** to pause your code's execution between subsequent requests. ``time.sleep(2)`` should be fine when making just a few dozen requests. Each site has its own rules, which are often visible via the site's ``robots.txt`` file.\n",
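"\n",
"As a rough sketch of this advice (not a complete scraping policy), the loop below pauses between consecutive requests. The names ``fetch_politely``, ``fetch``, and ``delay_seconds`` are hypothetical; ``fetch`` stands for whatever function actually downloads a page, such as the ``requests.get`` function used later in this lab.\n",
"\n",
"```python\n",
"import time\n",
"\n",
"def fetch_politely(urls, fetch, delay_seconds=2.0):\n",
"    # Call fetch(url) for each URL, sleeping between consecutive calls.\n",
"    results = []\n",
"    for i, url in enumerate(urls):\n",
"        if i > 0:\n",
"            time.sleep(delay_seconds)  # no pause is needed before the first request\n",
"        results.append(fetch(url))\n",
"    return results\n",
"\n",
"# Stand-in fetch function so the sketch runs without touching the network.\n",
"pages = fetch_politely(['page-1', 'page-2'], fetch=str.upper, delay_seconds=0.1)\n",
"print(pages)  # ['PAGE-1', 'PAGE-2']\n",
"```\n",
"\n",
"With real pages, one would pass a list of URLs and ``fetch=requests.get``, keeping the default two-second delay.\n",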
"\n",
"### Additional Resources\n",
"\n",
"**HTML:** if you are not familiar with HTML, see https://www.w3schools.com/html/ or one of the many tutorials on the internet.\n",
"\n",
"**Document Object Model (DOM):** for more on this programming interface for HTML and XML documents see https://www.w3schools.com/js/js_htmldom.asp."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2. Download webpages and get basic properties\n",
"\n",
"``Requests`` is a highly useful Python library that allows us to fetch web pages.\n",
"``BeautifulSoup`` is a phenomenal Python library that allows us to easily parse web content and perform basic extraction.\n",
"\n",
"If one wishes to scrape webpages, one usually uses ``requests`` to fetch the page and ``BeautifulSoup`` to parse the page's meaningful components. Webpages can be messy, despite having a structured format, which is why BeautifulSoup is so handy.\n",
"\n",
"Let's get started:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from bs4 import BeautifulSoup\n",
"import requests"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To fetch a webpage's content, we can simply use the ``get()`` function within the requests library:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"url = \"https://www.npr.org/2018/11/05/664395755/what-if-the-polls-are-wrong-again-4-scenarios-for-what-might-happen-in-the-elect\"\n",
"response = requests.get(url) # you can use any URL that you wish"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The response variable has many highly useful attributes, such as:\n",
"- status_code\n",
"- text\n",
"- content\n",
"\n",
"Let's try each of them!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### response.status_code"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"200"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"response.status_code"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You should have received a status code of 200, which means the page was successfully found on the server and sent to the receiver (aka client/user/you). [Again, you can click here](https://www.restapitutorial.com/httpstatuscodes.html) for a full list of status codes.\n",
"\n",
"### response.text\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'\\nWhat If The Polls Are Wrong Again? 4 Scenarios For What Might Happen In The Elections : NPR\\n\\n\\n\\n\\n\\n
\\n What If The Polls Are Wrong Again? 4 Scenarios For What Might Happen In The ElectionsThe polls show a Democratic advantage in the House and a Republican one in the Senate. But be ready for anything because surprises in politics always happen.\\n
\\n Supporters of Missouri Democratic Sen. Claire McCaskill wait for her to arrive at a campaign stop in St. Louis on Monday.\\n \\n \\n \\n Scott Olson/Getty Images\\n \\n \\n hide caption\\n
\\n
\\n\\n\\n toggle caption\\n
\\n\\n \\n \\n Scott Olson/Getty Images\\n \\n \\n
\\n
\\n
\\n \\n
\\n
\\n
\\n
\\n
Supporters of Missouri Democratic Sen. Claire McCaskill wait for her to arrive at a campaign stop in St. Louis on Monday.
\\n \\n \\n Scott Olson/Getty Images\\n \\n \\n
\\n
\\n
\\n
There\\'s a lot that can happen Tuesday, the culmination of a long midterm election campaign that will provide the first nationwide measure of the U.S. electorate since Donald Trump was elected president.
One narrative has become dominant: that Democrats are likely to gain control of the House and Republicans hold the Senate, if not expand their majority there. That narrative is based largely on national polls, and caution should be urged. Pollsters have made a lot of adjustments to hopefully correct what they got wrong in 2016, but they can\\'t tell you precisely who is going to show up to vote.
What\\'s more, there have been far fewer statewide and district-specific surveys than in past midterm elections. And, as it is, there are data both parties can take solace in that buoy their respective cases. So everyone should be prepared for surprises — because there always are some. That\\'s the beauty of campaigns and voting.
Here are four scenarios for how election night might play out and what each could mean.
1. Democrats win the House, and Republicans hold the Senate
This is the most likely outcome, based not just on the polls but also on conversations with strategists in both parties. But they urge caution, because the races in the many districts across the country that are up for grabs are still very close.
How it would happen: Forget the polls; Democrats are favored to take back the House for more reasons than that. There have been a record number of retirements, reducing the built-in advantage incumbents tend to have; record numbers of candidates, especially Democratic women, have run for public office; Democrats won the off-year elections in Virginia and New Jersey; they won or fared better than expected in special elections across the country; there was high primary turnout for Democrats in many states; and there is very high early voting turnout.
And just look at how wide the playing field is — Democrats need to pick up 23 seats to take back the House,and they are targeting some 80 Republican-held seats. Republicans are competing in just eight held by Democrats. That right there is and has been a huge flashing red light for the GOP. So many of those races are running through the suburbs, where independents and wealthy, college-educated women live, both of which have consistently in polling said they disapproved of the job the president is doing and prefer to vote for a Democrat in their district.
One other overlooked number from the last NPR/PBS NewsHour/Marist poll: Just 54 percent of Republican women who are registered voters said they were very enthusiastic about voting in this election. Compare that with 78 percent of Republican men who are registered voters. And where do a lot of those women live? The suburbs. If GOP women, an important group that Republicans need to bolster them, stay home, that\\'s one way Democrats clean up in the House.
In the Senate, on the other hand, Republicans have a very favorable landscape and are competing in conservative states held by Democrats. The fundamentals favor the GOP in these states, and if Republicans win where they should win, they will hold the Senate.
Democrats feel they need to limit the losses in the Senate. If they can hold Republicans to net even, keeping the Senate at 51-49, or maybe lose a net of one seat, then they will be very happy. They have a much more favorable Senate landscape in 2020 and believe they will be able to take back the Senate then.
How it would happen: A record turnout is expected Tuesday — perhaps higher than any time in the past 50 years for a midterm — but, as in 2016, Trump voters would have to dominate. Rural voters would have to turn out at higher-than-expected rates, causing the polls to be wrong (again). Meanwhile, young voters and Latinos would have to stay home. (It is supposed to rain on the East Coast Tuesday, which could depress low-propensity-voter turnout.)
All of those close House races would have to tip Republicans\\' way, something that\\'s very possible given the conservative lean of those districts and the distrust of the media, purposefully stoked by the president. And who pays for polls for the most part? Big media organizations.
\\n President Trump acknowledges supporters during a campaign rally for Rep. Marsha Blackburn, R-Tenn., and other Tennessee Republican candidates on Monday in Chattanooga, Tenn.\\n \\n \\n \\n Alex Wong/Getty Images\\n \\n \\n hide caption\\n
\\n
\\n\\n\\n toggle caption\\n
\\n\\n \\n \\n Alex Wong/Getty Images\\n \\n \\n
\\n
\\n
\\n \\n
\\n
\\n
\\n
\\n
President Trump acknowledges supporters during a campaign rally for Rep. Marsha Blackburn, R-Tenn., and other Tennessee Republican candidates on Monday in Chattanooga, Tenn.
\\n \\n \\n Alex Wong/Getty Images\\n \\n \\n
\\n
\\n
\\n
What it would mean: President Trump and Republicans would step on the gas, validated by an election cycle dominated by negative news coverage and polling that said the GOP had its back against the wall. The Affordable Care Act (aka Obamacare) would very likely be repealed once and for all. And Trump could set his sights on ousting Attorney General Jeff Sessions and other key figures at the Justice Department, possibly ending the department\\'s investigation of Russia\\'s attack on the 2016 election.
What\\'s more, Trump\\'s strategy of demonizing immigrants would have worked — again. That was rewarded, and what message would that send? He is only going to do more of it between Wednesday and November 2020 when he stands for re-election.
It would also be yet another reckoning for pollsters and media organizations that pay for the surveys. The polls currently show Democrats with a razor-thin, but consistent advantage heading into Election Day. But if the polls are wrong, it should induce more than a shoulder shrug from outlets that conduct them and the news media organizations that report on them.
3. Democrats win both the House and Senate
This is not seen as the likeliest of scenarios, but it\\'s not out of the realm of possibility either. It would very likely mean a massive wave and a massive shift against Trump and Republicans tied to him nationally.
A lot would have to happen, especially in the Senate, for this to happen.
How it would happen: The path for Democrats in the House is through the suburbs, as in Scenario 1. That doesn\\'t change. But for Democrats to pull this off in the Senate, not only would voters have to side with Democratic incumbents in conservative states, but Democratic challengers would have to win in places like Nevada and Arizona, and possibly Tennessee and Texas.
What it would mean: It would be a repudiation of Trump and the Republicans tied to him nationwide. It would have to trigger a degree of soul-searching — in at least some Republican corners.
Trump would be faced with the choice of moderating and working with Democrats or being a lame-duck president starting in January 2019 when a new Democratic Congress is sworn in — as talk ramps up about Democratic 2020 challengers.
4. Overtime
It\\'s very possible control of both the House and Senate will not be clear on election night.
\\n A number of races are so close that it may not be possible to declare a winner on election night, leaving control of the House and Senate up in the air.\\n \\n \\n \\n Joe Sohm/Visions of America/UIG via Getty Images\\n \\n \\n hide caption\\n
\\n
\\n\\n\\n toggle caption\\n
\\n\\n \\n \\n Joe Sohm/Visions of America/UIG via Getty Images\\n \\n \\n
\\n
\\n
\\n \\n
\\n
\\n
\\n
\\n
A number of races are so close that it may not be possible to declare a winner on election night, leaving control of the House and Senate up in the air.
\\n \\n \\n Joe Sohm/Visions of America/UIG via Getty Images\\n \\n \\n
\\n
\\n
\\n
How it would happen: There are a half-dozen congressional races in California, for example, that are very close heading into Election Day. It\\'s possible those races are so close they will not be called on election night. They might not be called for days and possibly weeks later, especially because the vote there is counted slowly.
Additionally, early and absentee ballots can get counted slowly and there is growing concern that many voters\\' absentee and mailed ballots could be rejected. In 2016, to the surprise of many, 319,000 absentee ballots were rejected for one reason or another.
In the Senate, depending on how results from other races shake out, there is the possibility that control is not known on election night or for weeks after. Specifically, it could all come down to Mississippi. There, no candidate is polling above 50 percent heading into Election Day, and if no one gets at least 50 percent, the race heads to a runoff three weeks later.
What it would mean: Imagine a scenario in which Democrats lead 50-49 on Election Day in the Senate, and the eyes of the country — and the deep pockets of out-of-state money — descend on Mississippi. The consequences would be enormous, the rancor pitched and the tension thick.
\\n\\n\\n\\n'"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"response.text"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Holy moly! That looks awful. If we use our browser to visit the URL, then right-click the page and click 'View Page Source', we see that it is identical to this chunk of glorious text.\n",
"\n",
"### response.content"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"b'\\nWhat If The Polls Are Wrong Again? 4 Scenarios For What Might Happen In The Elections : NPR\\n\\n\\n\\n\\n\\n
\\n What If The Polls Are Wrong Again? 4 Scenarios For What Might Happen In The ElectionsThe polls show a Democratic advantage in the House and a Republican one in the Senate. But be ready for anything because surprises in politics always happen.\\n
\\n Supporters of Missouri Democratic Sen. Claire McCaskill wait for her to arrive at a campaign stop in St. Louis on Monday.\\n \\n \\n \\n Scott Olson/Getty Images\\n \\n \\n hide caption\\n
\\n
\\n\\n\\n toggle caption\\n
\\n\\n \\n \\n Scott Olson/Getty Images\\n \\n \\n
\\n
\\n
\\n \\n
\\n
\\n
\\n
\\n
Supporters of Missouri Democratic Sen. Claire McCaskill wait for her to arrive at a campaign stop in St. Louis on Monday.
\\n \\n \\n Scott Olson/Getty Images\\n \\n \\n
\\n
\\n
\\n
There\\'s a lot that can happen Tuesday, the culmination of a long midterm election campaign that will provide the first nationwide measure of the U.S. electorate since Donald Trump was elected president.
One narrative has become dominant: that Democrats are likely to gain control of the House and Republicans hold the Senate, if not expand their majority there. That narrative is based largely on national polls, and caution should be urged. Pollsters have made a lot of adjustments to hopefully correct what they got wrong in 2016, but they can\\'t tell you precisely who is going to show up to vote.
What\\'s more, there have been far fewer statewide and district-specific surveys than in past midterm elections. And, as it is, there are data both parties can take solace in that buoy their respective cases. So everyone should be prepared for surprises \\xe2\\x80\\x94 because there always are some. That\\'s the beauty of campaigns and voting.
Here are four scenarios for how election night might play out and what each could mean.
1. Democrats win the House, and Republicans hold the Senate
This is the most likely outcome, based not just on the polls but also on conversations with strategists in both parties. But they urge caution, because the races in the many districts across the country that are up for grabs are still very close.
How it would happen: Forget the polls; Democrats are favored to take back the House for more reasons than that. There have been a record number of retirements, reducing the built-in advantage incumbents tend to have; record numbers of candidates, especially Democratic women, have run for public office; Democrats won the off-year elections in Virginia and New Jersey; they won or fared better than expected in special elections across the country; there was high primary turnout for Democrats in many states; and there is very high early voting turnout.
And just look at how wide the playing field is \\xe2\\x80\\x94 Democrats need to pick up 23 seats to take back the House,and they are targeting some 80 Republican-held seats. Republicans are competing in just eight held by Democrats. That right there is and has been a huge flashing red light for the GOP. So many of those races are running through the suburbs, where independents and wealthy, college-educated women live, both of which have consistently in polling said they disapproved of the job the president is doing and prefer to vote for a Democrat in their district.
One other overlooked number from the last NPR/PBS NewsHour/Marist poll: Just 54 percent of Republican women who are registered voters said they were very enthusiastic about voting in this election. Compare that with 78 percent of Republican men who are registered voters. And where do a lot of those women live? The suburbs. If GOP women, an important group that Republicans need to bolster them, stay home, that\\'s one way Democrats clean up in the House.
In the Senate, on the other hand, Republicans have a very favorable landscape and are competing in conservative states held by Democrats. The fundamentals favor the GOP in these states, and if Republicans win where they should win, they will hold the Senate.
Democrats feel they need to limit the losses in the Senate. If they can hold Republicans to net even, keeping the Senate at 51-49, or maybe lose a net of one seat, then they will be very happy. They have a much more favorable Senate landscape in 2020 and believe they will be able to take back the Senate then.
How it would happen: A record turnout is expected Tuesday \\xe2\\x80\\x94 perhaps higher than any time in the past 50 years for a midterm \\xe2\\x80\\x94 but, as in 2016, Trump voters would have to dominate. Rural voters would have to turn out at higher-than-expected rates, causing the polls to be wrong (again). Meanwhile, young voters and Latinos would have to stay home. (It is supposed to rain on the East Coast Tuesday, which could depress low-propensity-voter turnout.)
All of those close House races would have to tip Republicans\\' way, something that\\'s very possible given the conservative lean of those districts and the distrust of the media, purposefully stoked by the president. And who pays for polls for the most part? Big media organizations.
\\n President Trump acknowledges supporters during a campaign rally for Rep. Marsha Blackburn, R-Tenn., and other Tennessee Republican candidates on Monday in Chattanooga, Tenn.\\n \\n \\n \\n Alex Wong/Getty Images\\n \\n \\n hide caption\\n
\\n
\\n\\n\\n toggle caption\\n
\\n\\n \\n \\n Alex Wong/Getty Images\\n \\n \\n
\\n
\\n
\\n \\n
\\n
\\n
\\n
\\n
President Trump acknowledges supporters during a campaign rally for Rep. Marsha Blackburn, R-Tenn., and other Tennessee Republican candidates on Monday in Chattanooga, Tenn.
\\n \\n \\n Alex Wong/Getty Images\\n \\n \\n
\\n
\\n
\\n
What it would mean: President Trump and Republicans would step on the gas, validated by an election cycle dominated by negative news coverage and polling that said the GOP had its back against the wall. The Affordable Care Act (aka Obamacare) would very likely be repealed once and for all. And Trump could set his sights on ousting Attorney General Jeff Sessions and other key figures at the Justice Department, possibly ending the department\\'s investigation of Russia\\'s attack on the 2016 election.
What\\'s more, Trump\\'s strategy of demonizing immigrants would have worked \\xe2\\x80\\x94 again. That was rewarded, and what message would that send? He is only going to do more of it between Wednesday and November 2020 when he stands for re-election.
It would also be yet another reckoning for pollsters and media organizations that pay for the surveys. The polls currently show Democrats with a razor-thin, but consistent advantage heading into Election Day. But if the polls are wrong, it should induce more than a shoulder shrug from outlets that conduct them and the news media organizations that report on them.
3. Democrats win both the House and Senate
This is not seen as the likeliest of scenarios, but it\\'s not out of the realm of possibility either. It would very likely mean a massive wave and a massive shift against Trump and Republicans tied to him nationally.
A lot would have to happen, especially in the Senate, for this to happen.
How it would happen: The path for Democrats in the House is through the suburbs, as in Scenario 1. That doesn\\'t change. But for Democrats to pull this off in the Senate, not only would voters have to side with Democratic incumbents in conservative states, but Democratic challengers would have to win in places like Nevada and Arizona, and possibly Tennessee and Texas.
What it would mean: It would be a repudiation of Trump and the Republicans tied to him nationwide. It would have to trigger a degree of soul-searching \\xe2\\x80\\x94 in at least some Republican corners.
Trump would be faced with the choice of moderating and working with Democrats or being a lame-duck president starting in January 2019 when a new Democratic Congress is sworn in \\xe2\\x80\\x94 as talk ramps up about Democratic 2020 challengers.
4. Overtime
It\\'s very possible control of both the House and Senate will not be clear on election night.
\\n A number of races are so close that it may not be possible to declare a winner on election night, leaving control of the House and Senate up in the air.\\n \\n \\n \\n Joe Sohm/Visions of America/UIG via Getty Images\\n \\n \\n hide caption\\n
\\n
\\n\\n\\n toggle caption\\n
\\n\\n \\n \\n Joe Sohm/Visions of America/UIG via Getty Images\\n \\n \\n
\\n
\\n
\\n \\n
\\n
\\n
\\n
\\n
A number of races are so close that it may not be possible to declare a winner on election night, leaving control of the House and Senate up in the air.
\\n \\n \\n Joe Sohm/Visions of America/UIG via Getty Images\\n \\n \\n
\\n
\\n
\\n
How it would happen: There are a half-dozen congressional races in California, for example, that are very close heading into Election Day. It\\'s possible those races are so close they will not be called on election night. They might not be called for days and possibly weeks later, especially because the vote there is counted slowly.
Additionally, early and absentee ballots can get counted slowly and there is growing concern that many voters\\' absentee and mailed ballots could be rejected. In 2016, to the surprise of many, 319,000 absentee ballots were rejected for one reason or another.
In the Senate, depending on how results from other races shake out, there is the possibility that control is not known on election night or for weeks after. Specifically, it could all come down to Mississippi. There, no candidate is polling above 50 percent heading into Election Day, and if no one gets at least 50 percent, the race heads to a runoff three weeks later.
What it would mean: Imagine a scenario in which Democrats lead 50-49 on Election Day in the Senate, and the eyes of the country \\xe2\\x80\\x94 and the deep pockets of out-of-state money \\xe2\\x80\\x94 descend on Mississippi. The consequences would be enormous, the rancor pitched and the tension thick.
\\n\\n\\n\\n'"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"response.content"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What?! This seems identical to the ``.text`` field. However, a careful eye will notice that the very first characters differ: ``.content`` begins with a *b'* prefix, which in Python syntax denotes that the data type is ``bytes``, whereas ``.text`` has no prefix and is a regular string (``str``). This is also why non-ASCII characters such as long dashes appear in ``.content`` as escaped byte sequences like ``\\xe2\\x80\\x94``.\n",
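"\n",
"The difference between the two types can be seen with a small, self-contained sketch. The sample string is made up for illustration; it includes the long dash (U+2014) that shows up as ``\\xe2\\x80\\x94`` in the ``.content`` output above, and it assumes UTF-8 encoding.\n",
"\n",
"```python\n",
"text = 'surprises \\u2014 because'   # a regular str, like response.text\n",
"raw = text.encode('utf-8')          # bytes, like response.content\n",
"\n",
"print(type(text).__name__, type(raw).__name__)  # str bytes\n",
"print(raw)                                      # b'surprises \\xe2\\x80\\x94 because'\n",
"print(raw.decode('utf-8') == text)              # True\n",
"```\n",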
"\n",
"Ok, so that's great, but how do we make sense of this text? We could manually parse it, but that's tedious and difficult. As mentioned, BeautifulSoup is specifically designed to parse this exact content (any webpage content).\n",
"\n",
"## BEAUTIFUL SOUP\n",
"\n",
"\n",
"The [documentation for BeautifulSoup is found here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).\n",
"\n",
"A BeautifulSoup object can be initialized with the ``.content`` from the response and a flag denoting the type of parser that we should use. For example, we could specify ``html.parser``, ``lxml``, etc. ([documentation here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers)). Since we are interested in standard webpages that use HTML, let's specify ``html.parser``:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\n",
"\n",
"What If The Polls Are Wrong Again? 4 Scenarios For What Might Happen In The Elections : NPR\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"
\n",
"What If The Polls Are Wrong Again? 4 Scenarios For What Might Happen In The ElectionsThe polls show a Democratic advantage in the House and a Republican one in the Senate. But be ready for anything because surprises in politics always happen.\n",
"
\n",
" Supporters of Missouri Democratic Sen. Claire McCaskill wait for her to arrive at a campaign stop in St. Louis on Monday.\n",
" \n",
" \n",
" \n",
" Scott Olson/Getty Images\n",
" \n",
" \n",
"hide caption\n",
"
There's a lot that can happen Tuesday, the culmination of a long midterm election campaign that will provide the first nationwide measure of the U.S. electorate since Donald Trump was elected president.
One narrative has become dominant: that Democrats are likely to gain control of the House and Republicans hold the Senate, if not expand their majority there. That narrative is based largely on national polls, and caution should be urged. Pollsters have made a lot of adjustments to hopefully correct what they got wrong in 2016, but they can't tell you precisely who is going to show up to vote.
What's more, there have been far fewer statewide and district-specific surveys than in past midterm elections. And, as it is, there are data both parties can take solace in that buoy their respective cases. So everyone should be prepared for surprises — because there always are some. That's the beauty of campaigns and voting.
Here are four scenarios for how election night might play out and what each could mean.
1. Democrats win the House, and Republicans hold the Senate
This is the most likely outcome, based not just on the polls but also on conversations with strategists in both parties. But they urge caution, because the races in the many districts across the country that are up for grabs are still very close.
How it would happen: Forget the polls; Democrats are favored to take back the House for more reasons than that. There have been a record number of retirements, reducing the built-in advantage incumbents tend to have; record numbers of candidates, especially Democratic women, have run for public office; Democrats won the off-year elections in Virginia and New Jersey; they won or fared better than expected in special elections across the country; there was high primary turnout for Democrats in many states; and there is very high early voting turnout.
And just look at how wide the playing field is — Democrats need to pick up 23 seats to take back the House,and they are targeting some 80 Republican-held seats. Republicans are competing in just eight held by Democrats. That right there is and has been a huge flashing red light for the GOP. So many of those races are running through the suburbs, where independents and wealthy, college-educated women live, both of which have consistently in polling said they disapproved of the job the president is doing and prefer to vote for a Democrat in their district.
One other overlooked number from the last NPR/PBS NewsHour/Marist poll: Just 54 percent of Republican women who are registered voters said they were very enthusiastic about voting in this election. Compare that with 78 percent of Republican men who are registered voters. And where do a lot of those women live? The suburbs. If GOP women, an important group that Republicans need to bolster them, stay home, that's one way Democrats clean up in the House.
In the Senate, on the other hand, Republicans have a very favorable landscape and are competing in conservative states held by Democrats. The fundamentals favor the GOP in these states, and if Republicans win where they should win, they will hold the Senate.
Democrats feel they need to limit the losses in the Senate. If they can hold Republicans to net even, keeping the Senate at 51-49, or maybe lose a net of one seat, then they will be very happy. They have a much more favorable Senate landscape in 2020 and believe they will be able to take back the Senate then.
How it would happen: A record turnout is expected Tuesday — perhaps higher than any time in the past 50 years for a midterm — but, as in 2016, Trump voters would have to dominate. Rural voters would have to turn out at higher-than-expected rates, causing the polls to be wrong (again). Meanwhile, young voters and Latinos would have to stay home. (It is supposed to rain on the East Coast Tuesday, which could depress low-propensity-voter turnout.)
All of those close House races would have to tip Republicans' way, something that's very possible given the conservative lean of those districts and the distrust of the media, purposefully stoked by the president. And who pays for polls for the most part? Big media organizations.
\n",
" President Trump acknowledges supporters during a campaign rally for Rep. Marsha Blackburn, R-Tenn., and other Tennessee Republican candidates on Monday in Chattanooga, Tenn.\n",
" \n",
" \n",
" \n",
" Alex Wong/Getty Images\n",
" \n",
" \n",
"hide caption\n",
"
President Trump acknowledges supporters during a campaign rally for Rep. Marsha Blackburn, R-Tenn., and other Tennessee Republican candidates on Monday in Chattanooga, Tenn.
What it would mean: President Trump and Republicans would step on the gas, validated by an election cycle dominated by negative news coverage and polling that said the GOP had its back against the wall. The Affordable Care Act (aka Obamacare) would very likely be repealed once and for all. And Trump could set his sights on ousting Attorney General Jeff Sessions and other key figures at the Justice Department, possibly ending the department's investigation of Russia's attack on the 2016 election.
What's more, Trump's strategy of demonizing immigrants would have worked — again. That was rewarded, and what message would that send? He is only going to do more of it between Wednesday and November 2020 when he stands for re-election.
It would also be yet another reckoning for pollsters and media organizations that pay for the surveys. The polls currently show Democrats with a razor-thin, but consistent advantage heading into Election Day. But if the polls are wrong, it should induce more than a shoulder shrug from outlets that conduct them and the news media organizations that report on them.
3. Democrats win both the House and Senate
This is not seen as the likeliest of scenarios, but it's not out of the realm of possibility either. It would very likely mean a massive wave and a massive shift against Trump and Republicans tied to him nationally.
A lot would have to happen, especially in the Senate, for this to happen.
How it would happen: The path for Democrats in the House is through the suburbs, as in Scenario 1. That doesn't change. But for Democrats to pull this off in the Senate, not only would voters have to side with Democratic incumbents in conservative states, but Democratic challengers would have to win in places like Nevada and Arizona, and possibly Tennessee and Texas.
What it would mean: It would be a repudiation of Trump and the Republicans tied to him nationwide. It would have to trigger a degree of soul-searching — in at least some Republican corners.
Trump would be faced with the choice of moderating and working with Democrats or being a lame-duck president starting in January 2019 when a new Democratic Congress is sworn in — as talk ramps up about Democratic 2020 challengers.
4. Overtime
It's very possible control of both the House and Senate will not be clear on election night.
\n",
" A number of races are so close that it may not be possible to declare a winner on election night, leaving control of the House and Senate up in the air.\n",
" \n",
" \n",
" \n",
" Joe Sohm/Visions of America/UIG via Getty Images\n",
" \n",
" \n",
"hide caption\n",
"
\n",
"
\n",
"toggle caption\n",
"
\n",
"\n",
" \n",
" Joe Sohm/Visions of America/UIG via Getty Images\n",
" \n",
" \n",
"
\n",
"
\n",
"
\n",
"\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
A number of races are so close that it may not be possible to declare a winner on election night, leaving control of the House and Senate up in the air.
\n",
"\n",
" \n",
" Joe Sohm/Visions of America/UIG via Getty Images\n",
" \n",
" \n",
"
\n",
"
\n",
"
\n",
"
How it would happen: There are a half-dozen congressional races in California, for example, that are very close heading into Election Day. It's possible those races are so close they will not be called on election night. They might not be called for days and possibly weeks later, especially because the vote there is counted slowly.
Additionally, early and absentee ballots can get counted slowly and there is growing concern that many voters' absentee and mailed ballots could be rejected. In 2016, to the surprise of many, 319,000 absentee ballots were rejected for one reason or another.
In the Senate, depending on how results from other races shake out, there is the possibility that control is not known on election night or for weeks after. Specifically, it could all come down to Mississippi. There, no candidate is polling above 50 percent heading into Election Day, and if no one gets at least 50 percent, the race heads to a runoff three weeks later.
What it would mean: Imagine a scenario in which Democrats lead 50-49 on Election Day in the Senate, and the eyes of the country — and the deep pockets of out-of-state money — descend on Mississippi. The consequences would be enormous, the rancor pitched and the tension thick.
\n",
"\n",
"\n",
""
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"soup = BeautifulSoup(response.content, \"html.parser\")\n",
"soup"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alright! That looks a little better; the whitespace formatting adds some structure to our content. HTML code is structured by ``<tags>``. Most tags have an opening and a closing portion, denoted by ``< >`` and ``</ >``, respectively. If we want just the text (not the tags), we can use:"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'\\n try {var _sf_startpt=(new Date()).getTime();} catch(e){}\\n\\nWhat If The Polls Are Wrong Again? 4 Scenarios For What Might Happen In The Elections : NPR\\n window.NPR = window.NPR || {};\\nNPR.ServerConstants = {\"cbHost\":\"npr.org\",\"webHost\":\"https:\\\\/\\\\/www.npr.org\",\"embedHost\":\"https:\\\\/\\\\/www.npr.org\",\"webHostSecure\":\"https:\\\\/\\\\/secure.npr.org\",\"apiHost\":\"https:\\\\/\\\\/api.npr.org\",\"serverMediaCache\":\"https:\\\\/\\\\/media.npr.org\",\"googleAnalyticsAccount\":\"UA-5828686-4\",\"nielsenSFCode\":\"dcr\",\"nielsenAPN\":\"NPR-dcr\",\"shouldShowHPLocalContent\":true,\"readingServiceHostname\":\"https:\\\\/\\\\/reading.api.npr.org\"};\\nNPR.serverVars = {\"storyId\":\"664395755\",\"facebookAppId\":\"138837436154588\",\"webpackPublicPath\":\"https:\\\\/\\\\/s.npr.org\\\\/templates\\\\/javascript\\\\/dist\\\\/bundles\\\\/\",\"persistenceVersion\":\"e2193dbd58d7e71fdaffbd399767e8dc\",\"isBuildOut\":true,\"topicIds\":[\"P139482413\",\"1001\",\"1002\",\"1003\",\"1014\",\"1059\"],\"primaryTopic\":\"Elections\",\"topics\":[\"Elections\",\"News\",\"Home Page Top Stories\",\"National\",\"Politics\",\"Analysis\"],\"theme\":\"139482413\",\"aggIds\":[\"1001\",\"1002\",\"1003\",\"1014\",\"1059\",\"125950998\",\"125951073\",\"126931907\",\"126944326\",\"126953005\",\"127115490\",\"139482413\",\"162174434\",\"191676894\",\"219323468\",\"312150170\",\"360452518\",\"428799323\",\"432805936\",\"434975886\",\"497806639\",\"520216945\"],\"tagIds\":[\"2016\",\"2018\",\"2020\",\"Democrats\",\"GOP\",\"House\",\"Republican\",\"Senate\",\"election\",\"election day\",\"election night\",\"polling\",\"trump\"],\"byline\":[\"Domenico Montanaro\"],\"pubDate\":\"2018110516\",\"pageTypeId\":\"1\",\"title\":\"What If The Polls Are Wrong Again? 
4 Scenarios For What Might Happen In The Elections\",\"publisherOrgId\":\"1\",\"rocketfuelCode\":20501671};\\n\\n\\n\\n\\n\\n\\n !function(a){function e(d){if(c[d])return c[d].exports;var f=c[d]={exports:{},id:d,loaded:!1};return a[d].call(f.exports,f,f.exports,e),f.loaded=!0,f.exports}var d=window.webpackJsonp;window.webpackJsonp=function(b,t){for(var n,r,o=0,i=[];o\n",
"What If The Polls Are Wrong Again? 4 Scenarios For What Might Happen In The Elections : NPR\n",
"\n",
"\n",
"\n",
"\n",
"\n",
""
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"soup.head # fetches the head tag, which encompasses the title tag"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Usually head tags are small and contain only the most important content; here, however, there's some JavaScript code. The ``title`` tag resides within the head tag."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"What If The Polls Are Wrong Again? 4 Scenarios For What Might Happen In The Elections : NPR"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"soup.title # we can specifically call for the title tag"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This result includes the tag itself. To get just the text within the tags, we can use the ``.string`` property."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'What If The Polls Are Wrong Again? 4 Scenarios For What Might Happen In The Elections : NPR'"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"soup.title.string"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can navigate to the parent tag (the tag that encompasses the current tag) via the ``.parent`` attribute:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'head'"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"soup.title.parent.name"
]
},
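The navigation attributes used above can be tried on a tiny page without downloading anything. The following is a minimal sketch: the HTML string and the `demo_` variable names are made up for illustration and are not part of the NPR page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, small enough to read at a glance.
demo_html = (
    "<html><head><title>Sample Page</title></head>"
    "<body><p>Hello</p></body></html>"
)
demo_soup = BeautifulSoup(demo_html, "html.parser")

demo_title = demo_soup.title              # the whole <title> tag
demo_tag_name = demo_title.name           # the tag's name, not its text
demo_title_text = demo_title.string       # the text inside the tag
demo_parent_name = demo_title.parent.name # name of the enclosing tag

print(demo_tag_name, "|", demo_title_text, "|", demo_parent_name)
```

Note the difference between ``.name`` (what kind of tag this is) and ``.string`` (the text the tag contains); mixing the two up is a common source of confusion.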
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3. Parse the page with Beautiful Soup\n",
"In HTML code, paragraphs are often denoted with a ``<p>`` tag.
\n",
" Supporters of Missouri Democratic Sen. Claire McCaskill wait for her to arrive at a campaign stop in St. Louis on Monday.\n",
" \n",
" \n",
" \n",
" Scott Olson/Getty Images\n",
" \n",
" \n",
" hide caption\n",
"
,
Supporters of Missouri Democratic Sen. Claire McCaskill wait for her to arrive at a campaign stop in St. Louis on Monday.
,
There's a lot that can happen Tuesday, the culmination of a long midterm election campaign that will provide the first nationwide measure of the U.S. electorate since Donald Trump was elected president.
,
One narrative has become dominant: that Democrats are likely to gain control of the House and Republicans hold the Senate, if not expand their majority there. That narrative is based largely on national polls, and caution should be urged. Pollsters have made a lot of adjustments to hopefully correct what they got wrong in 2016, but they can't tell you precisely who is going to show up to vote.
,
What's more, there have been far fewer statewide and district-specific surveys than in past midterm elections. And, as it is, there are data both parties can take solace in that buoy their respective cases. So everyone should be prepared for surprises — because there always are some. That's the beauty of campaigns and voting.
,
Here are four scenarios for how election night might play out and what each could mean.
,
This is the most likely outcome, based not just on the polls but also on conversations with strategists in both parties. But they urge caution, because the races in the many districts across the country that are up for grabs are still very close.
,
How it would happen: Forget the polls; Democrats are favored to take back the House for more reasons than that. There have been a record number of retirements, reducing the built-in advantage incumbents tend to have; record numbers of candidates, especially Democratic women, have run for public office; Democrats won the off-year elections in Virginia and New Jersey; they won or fared better than expected in special elections across the country; there was high primary turnout for Democrats in many states; and there is very high early voting turnout.
,
And just look at how wide the playing field is — Democrats need to pick up 23 seats to take back the House,and they are targeting some 80 Republican-held seats. Republicans are competing in just eight held by Democrats. That right there is and has been a huge flashing red light for the GOP. So many of those races are running through the suburbs, where independents and wealthy, college-educated women live, both of which have consistently in polling said they disapproved of the job the president is doing and prefer to vote for a Democrat in their district.
,
One other overlooked number from the last NPR/PBS NewsHour/Marist poll: Just 54 percent of Republican women who are registered voters said they were very enthusiastic about voting in this election. Compare that with 78 percent of Republican men who are registered voters. And where do a lot of those women live? The suburbs. If GOP women, an important group that Republicans need to bolster them, stay home, that's one way Democrats clean up in the House.
,
In the Senate, on the other hand, Republicans have a very favorable landscape and are competing in conservative states held by Democrats. The fundamentals favor the GOP in these states, and if Republicans win where they should win, they will hold the Senate.
Democrats feel they need to limit the losses in the Senate. If they can hold Republicans to net even, keeping the Senate at 51-49, or maybe lose a net of one seat, then they will be very happy. They have a much more favorable Senate landscape in 2020 and believe they will be able to take back the Senate then.
,
This would be a huge win for the GOP.
,
How it would happen: A record turnout is expected Tuesday — perhaps higher than any time in the past 50 years for a midterm — but, as in 2016, Trump voters would have to dominate. Rural voters would have to turn out at higher-than-expected rates, causing the polls to be wrong (again). Meanwhile, young voters and Latinos would have to stay home. (It is supposed to rain on the East Coast Tuesday, which could depress low-propensity-voter turnout.)
,
All of those close House races would have to tip Republicans' way, something that's very possible given the conservative lean of those districts and the distrust of the media, purposefully stoked by the president. And who pays for polls for the most part? Big media organizations.
,
\n",
" President Trump acknowledges supporters during a campaign rally for Rep. Marsha Blackburn, R-Tenn., and other Tennessee Republican candidates on Monday in Chattanooga, Tenn.\n",
" \n",
" \n",
" \n",
" Alex Wong/Getty Images\n",
" \n",
" \n",
" hide caption\n",
"
,
President Trump acknowledges supporters during a campaign rally for Rep. Marsha Blackburn, R-Tenn., and other Tennessee Republican candidates on Monday in Chattanooga, Tenn.
,
What it would mean: President Trump and Republicans would step on the gas, validated by an election cycle dominated by negative news coverage and polling that said the GOP had its back against the wall. The Affordable Care Act (aka Obamacare) would very likely be repealed once and for all. And Trump could set his sights on ousting Attorney General Jeff Sessions and other key figures at the Justice Department, possibly ending the department's investigation of Russia's attack on the 2016 election.
,
What's more, Trump's strategy of demonizing immigrants would have worked — again. That was rewarded, and what message would that send? He is only going to do more of it between Wednesday and November 2020 when he stands for re-election.
,
It would also be yet another reckoning for pollsters and media organizations that pay for the surveys. The polls currently show Democrats with a razor-thin, but consistent advantage heading into Election Day. But if the polls are wrong, it should induce more than a shoulder shrug from outlets that conduct them and the news media organizations that report on them.
,
This is not seen as the likeliest of scenarios, but it's not out of the realm of possibility either. It would very likely mean a massive wave and a massive shift against Trump and Republicans tied to him nationally.
,
A lot would have to happen, especially in the Senate, for this to happen.
,
How it would happen: The path for Democrats in the House is through the suburbs, as in Scenario 1. That doesn't change. But for Democrats to pull this off in the Senate, not only would voters have to side with Democratic incumbents in conservative states, but Democratic challengers would have to win in places like Nevada and Arizona, and possibly Tennessee and Texas.
,
What it would mean: It would be a repudiation of Trump and the Republicans tied to him nationwide. It would have to trigger a degree of soul-searching — in at least some Republican corners.
,
Trump would be faced with the choice of moderating and working with Democrats or being a lame-duck president starting in January 2019 when a new Democratic Congress is sworn in — as talk ramps up about Democratic 2020 challengers.
,
It's very possible control of both the House and Senate will not be clear on election night.
,
\n",
" A number of races are so close that it may not be possible to declare a winner on election night, leaving control of the House and Senate up in the air.\n",
" \n",
" \n",
" \n",
" Joe Sohm/Visions of America/UIG via Getty Images\n",
" \n",
" \n",
" hide caption\n",
"
,
A number of races are so close that it may not be possible to declare a winner on election night, leaving control of the House and Senate up in the air.
,
How it would happen: There are a half-dozen congressional races in California, for example, that are very close heading into Election Day. It's possible those races are so close they will not be called on election night. They might not be called for days and possibly weeks later, especially because the vote there is counted slowly.
,
Additionally, early and absentee ballots can get counted slowly and there is growing concern that many voters' absentee and mailed ballots could be rejected. In 2016, to the surprise of many, 319,000 absentee ballots were rejected for one reason or another.
,
In the Senate, depending on how results from other races shake out, there is the possibility that control is not known on election night or for weeks after. Specifically, it could all come down to Mississippi. There, no candidate is polling above 50 percent heading into Election Day, and if no one gets at least 50 percent, the race heads to a runoff three weeks later.
,
What it would mean: Imagine a scenario in which Democrats lead 50-49 on Election Day in the Senate, and the eyes of the country — and the deep pockets of out-of-state money — descend on Mississippi. The consequences would be enormous, the rancor pitched and the tension thick.
]"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"paragraphs = soup.find_all('p')\n",
"paragraphs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we want just the paragraph text:"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
" Domenico Montanaro\n",
" \n",
"\n",
"\n",
" Supporters of Missouri Democratic Sen. Claire McCaskill wait for her to arrive at a campaign stop in St. Louis on Monday.\n",
" \n",
" \n",
" \n",
" Scott Olson/Getty Images\n",
" \n",
" \n",
"hide caption\n",
"\n",
"Supporters of Missouri Democratic Sen. Claire McCaskill wait for her to arrive at a campaign stop in St. Louis on Monday.\n",
"There's a lot that can happen Tuesday, the culmination of a long midterm election campaign that will provide the first nationwide measure of the U.S. electorate since Donald Trump was elected president.\n",
"One narrative has become dominant: that Democrats are likely to gain control of the House and Republicans hold the Senate, if not expand their majority there. That narrative is based largely on national polls, and caution should be urged. Pollsters have made a lot of adjustments to hopefully correct what they got wrong in 2016, but they can't tell you precisely who is going to show up to vote.\n",
"What's more, there have been far fewer statewide and district-specific surveys than in past midterm elections. And, as it is, there are data both parties can take solace in that buoy their respective cases. So everyone should be prepared for surprises — because there always are some. That's the beauty of campaigns and voting.\n",
"Here are four scenarios for how election night might play out and what each could mean.\n",
"This is the most likely outcome, based not just on the polls but also on conversations with strategists in both parties. But they urge caution, because the races in the many districts across the country that are up for grabs are still very close.\n",
"How it would happen: Forget the polls; Democrats are favored to take back the House for more reasons than that. There have been a record number of retirements, reducing the built-in advantage incumbents tend to have; record numbers of candidates, especially Democratic women, have run for public office; Democrats won the off-year elections in Virginia and New Jersey; they won or fared better than expected in special elections across the country; there was high primary turnout for Democrats in many states; and there is very high early voting turnout.\n",
"And just look at how wide the playing field is — Democrats need to pick up 23 seats to take back the House, and they are targeting some 80 Republican-held seats. Republicans are competing in just eight held by Democrats. That right there is and has been a huge flashing red light for the GOP. So many of those races are running through the suburbs, where independents and wealthy, college-educated women live, both of which have consistently in polling said they disapproved of the job the president is doing and prefer to vote for a Democrat in their district.\n",
"One other overlooked number from the last NPR/PBS NewsHour/Marist poll: Just 54 percent of Republican women who are registered voters said they were very enthusiastic about voting in this election. Compare that with 78 percent of Republican men who are registered voters. And where do a lot of those women live? The suburbs. If GOP women, an important group that Republicans need to bolster them, stay home, that's one way Democrats clean up in the House.\n",
"In the Senate, on the other hand, Republicans have a very favorable landscape and are competing in conservative states held by Democrats. The fundamentals favor the GOP in these states, and if Republicans win where they should win, they will hold the Senate.\n",
"What it would mean: This would be a huge win for Democrats, as they'd be able to gum up Trump's agenda and begin to investigate his administration, something the GOP has not done very much of. In the Senate, Republicans could still approve federal judges and Trump Supreme Court nominees, but if they want to get any big legislation done they're going to have to negotiate with Democrats in the House, and possibly a Speaker Nancy Pelosi.\n",
"Democrats feel they need to limit the losses in the Senate. If they can hold Republicans to net even, keeping the Senate at 51-49, or maybe lose a net of one seat, then they will be very happy. They have a much more favorable Senate landscape in 2020 and believe they will be able to take back the Senate then.\n",
"This would be a huge win for the GOP.\n",
"How it would happen: A record turnout is expected Tuesday — perhaps higher than any time in the past 50 years for a midterm — but, as in 2016, Trump voters would have to dominate. Rural voters would have to turn out at higher-than-expected rates, causing the polls to be wrong (again). Meanwhile, young voters and Latinos would have to stay home. (It is supposed to rain on the East Coast Tuesday, which could depress low-propensity-voter turnout.)\n",
"All of those close House races would have to tip Republicans' way, something that's very possible given the conservative lean of those districts and the distrust of the media, purposefully stoked by the president. And who pays for polls for the most part? Big media organizations.\n",
"\n",
" President Trump acknowledges supporters during a campaign rally for Rep. Marsha Blackburn, R-Tenn., and other Tennessee Republican candidates on Monday in Chattanooga, Tenn.\n",
" \n",
" \n",
" \n",
" Alex Wong/Getty Images\n",
" \n",
" \n",
"hide caption\n",
"\n",
"President Trump acknowledges supporters during a campaign rally for Rep. Marsha Blackburn, R-Tenn., and other Tennessee Republican candidates on Monday in Chattanooga, Tenn.\n",
"What it would mean: President Trump and Republicans would step on the gas, validated by an election cycle dominated by negative news coverage and polling that said the GOP had its back against the wall. The Affordable Care Act (aka Obamacare) would very likely be repealed once and for all. And Trump could set his sights on ousting Attorney General Jeff Sessions and other key figures at the Justice Department, possibly ending the department's investigation of Russia's attack on the 2016 election.\n",
"What's more, Trump's strategy of demonizing immigrants would have worked — again. That was rewarded, and what message would that send? He is only going to do more of it between Wednesday and November 2020 when he stands for re-election.\n",
"It would also be yet another reckoning for pollsters and media organizations that pay for the surveys. The polls currently show Democrats with a razor-thin, but consistent advantage heading into Election Day. But if the polls are wrong, it should induce more than a shoulder shrug from outlets that conduct them and the news media organizations that report on them.\n",
"This is not seen as the likeliest of scenarios, but it's not out of the realm of possibility either. It would very likely mean a massive wave and a massive shift against Trump and Republicans tied to him nationally.\n",
"A lot would have to happen, especially in the Senate, for this to happen.\n",
"How it would happen: The path for Democrats in the House is through the suburbs, as in Scenario 1. That doesn't change. But for Democrats to pull this off in the Senate, not only would voters have to side with Democratic incumbents in conservative states, but Democratic challengers would have to win in places like Nevada and Arizona, and possibly Tennessee and Texas.\n",
"What it would mean: It would be a repudiation of Trump and the Republicans tied to him nationwide. It would have to trigger a degree of soul-searching — in at least some Republican corners.\n",
"Trump would be faced with the choice of moderating and working with Democrats or being a lame-duck president starting in January 2019 when a new Democratic Congress is sworn in — as talk ramps up about Democratic 2020 challengers.\n",
"It's very possible control of both the House and Senate will not be clear on election night.\n",
"\n",
" A number of races are so close that it may not be possible to declare a winner on election night, leaving control of the House and Senate up in the air.\n",
" \n",
" \n",
" \n",
" Joe Sohm/Visions of America/UIG via Getty Images\n",
" \n",
" \n",
"hide caption\n",
"\n",
"A number of races are so close that it may not be possible to declare a winner on election night, leaving control of the House and Senate up in the air.\n",
"How it would happen: There are a half-dozen congressional races in California, for example, that are very close heading into Election Day. It's possible those races are so close they will not be called on election night. They might not be called for days and possibly weeks later, especially because the vote there is counted slowly.\n",
"Additionally, early and absentee ballots can get counted slowly and there is growing concern that many voters' absentee and mailed ballots could be rejected. In 2016, to the surprise of many, 319,000 absentee ballots were rejected for one reason or another.\n",
"In the Senate, depending on how results from other races shake out, there is the possibility that control is not known on election night or for weeks after. Specifically, it could all come down to Mississippi. There, no candidate is polling above 50 percent heading into Election Day, and if no one gets at least 50 percent, the race heads to a runoff three weeks later.\n",
"What it would mean: Imagine a scenario in which Democrats lead 50-49 on Election Day in the Senate, and the eyes of the country — and the deep pockets of out-of-state money — descend on Mississippi. The consequences would be enormous, the rancor pitched and the tension thick.\n",
"NPR thanks our sponsors\n",
"Become an NPR sponsor\n"
]
}
],
"source": [
"for pa in paragraphs:\n",
" print(pa.get_text())"
]
},
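As the NPR output shows, paragraph text often carries stray whitespace and blank lines. Here is a minimal sketch of one way to collect cleaned paragraph text into a list; the HTML snippet and the `sample_` names are hypothetical stand-ins for a downloaded page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded article page.
sample_html = """
<div>
  <p>First paragraph.</p>
  <p>   Second paragraph, with stray whitespace.   </p>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns every matching tag; find returns only the first (or None).
sample_paragraphs = sample_soup.find_all("p")

# get_text() drops the tags; strip() trims leading and trailing whitespace.
paragraph_texts = [p.get_text().strip() for p in sample_paragraphs]

# Checking the type tells us which methods are available on the result.
first_type = type(sample_soup.find("p")).__name__

print(paragraph_texts)
print(first_type)
```

Keeping the text in a list, rather than only printing it, makes it easier to filter out empty entries or load the result into a DataFrame later.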
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since there are multiple tags and various attributes, it is useful to check the data type of BeautifulSoup objects:"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"bs4.element.Tag"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(soup.find('p'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since the ``.find()`` function returns a BeautifulSoup element, we can tack on multiple calls that continue to return elements:"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"
]"
]
},
"execution_count": 77,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"soup.find('header', attrs={'class':'npr-header'}).find_all(\"li\") # li stands for list items"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This returns all of our list items, and since it's within a particular header section of the page, it appears they are links to menu items for navigating the webpage. If we wanted to grab just the links within these:"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{About NPR,\n",
" More Shows & Podcasts,\n",
" All Songs Considered,\n",
" All Things Considered,\n",
" Art & Design,\n",
" Arts & Life,\n",
" Best Music Of 2019,\n",
" Books,\n",
" Business,\n",
" Careers,\n",
" Connect,\n",
" Ethics,\n",
" Food,\n",
" Fresh Air,\n",
" Health,\n",
" Hidden Brain,\n",
" How I Built This with Guy Raz,\n",
" Morning Edition,\n",
" Movies,\n",
" Music News,\n",
" Music,\n",
" National,\n",
" New Music,\n",
" News,\n",
" Home,\n",
" NPR Shop,\n",
" Overview,\n",
" Performing Arts,\n",
" Planet Money,\n",
" Politics,\n",
" Pop Culture,\n",
" Press,\n",
" Race & Culture,\n",
" Science,\n",
" Search,\n",
" Shows & Podcasts,\n",
" Support,\n",
" Technology,\n",
" Television,\n",
" Tiny Desk,\n",
" Turning The Tables,\n",
" Up First,\n",
" Wait Wait...Don't Tell Me!,\n",
" Weekend Edition Saturday,\n",
" Weekend Edition Sunday,\n",
" World}"
]
},
"execution_count": 82,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"menu_links = set()\n",
"for list_item in soup.find('header', attrs={'class':'npr-header'}).find_all(\"li\"):\n",
"    for link in list_item.find_all('a', href=True):\n",
"        menu_links.add(link)\n",
"menu_links # a unique set of all the seemingly important links in the header"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## TAKEAWAY LESSON\n",
"The above tutorial isn't meant to be a study guide to memorize; its point is to show you the most important functionality that exists within BeautifulSoup and to illustrate how one can access different pieces of content. No two web scraping tasks are identical, so it's useful to play around with code and try different things, using the above as examples of how you may navigate between the different tags and properties of a page. Don't worry; we are always here to help when you get stuck!\n",
"\n",
"# String formatting\n",
"As we parse webpages, we will often want to adjust and format the text in a particular way.\n",
"\n",
"For example, say we wanted to scrape a political website that lists every US Senator's name and office phone number. We may want to store the information for each senator in a dictionary, and all of the senators' dictionaries in a list. Thus, we'd have a list of dictionaries. Below, we initialize such a list of dictionaries (it has only 3 senators for illustrative purposes, but imagine it contains many more)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[{'name': 'Lamar Alexander', 'number': '555-229-2812'}, {'name': 'Tammy Baldwin', 'number': '555-922-8393'}, {'name': 'John Barrasso', 'number': '555-827-2281'}]\n"
]
}
],
"source": [
"# this is a bit clumsy of an initialization, but we spell it out this way for clarity purposes\n",
"# NOTE: imagine the dictionary were constructed in a more organic manner\n",
"senator1 = {\"name\":\"Lamar Alexander\", \"number\":\"555-229-2812\"}\n",
"senator2 = {\"name\":\"Tammy Baldwin\", \"number\":\"555-922-8393\"}\n",
"senator3 = {\"name\":\"John Barrasso\", \"number\":\"555-827-2281\"}\n",
"senators = [senator1, senator2, senator3]\n",
"print(senators)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the real world, we may not want the final form of our information to be a Python dictionary; rather, we may need to send an email to the people on our mailing list, urging them to call their senators. If we have a templated format in mind, we can do the following:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Please call Lamar Alexander at 555-229-2812\n",
"Please call Tammy Baldwin at 555-922-8393\n",
"Please call John Barrasso at 555-827-2281\n"
]
}
],
"source": [
"email_template = \"\"\"Please call {name} at {number}\"\"\"\n",
"for senator in senators:\n",
"    print(email_template.format(**senator))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Please [visit here](https://docs.python.org/3/library/stdtypes.html#str.format)** for further documentation.\n",
" \n",
"Alternatively, one can format text with ``f-strings``. [See documentation here](https://docs.python.org/3/reference/lexical_analysis.html#f-strings). For example, using the above data structure and goal, one could get identical results via:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Please call Lamar Alexander at 555-229-2812\n",
"Please call Tammy Baldwin at 555-922-8393\n",
"Please call John Barrasso at 555-827-2281\n"
]
}
],
"source": [
"for senator in senators:\n",
"    print(f\"Please call {senator['name']} at {senator['number']}\")"
]
},
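{
"cell_type": "markdown",
"metadata": {},
"source": [
"Both styles also accept a format specification after a colon. As a small sketch (the variable `page` and the values in the loop are illustrative assumptions), the specification `0>2` right-aligns a value in a field of width 2 and fills the gap with `0`, which is handy for zero-padded file names:\n",
"\n",
"```python\n",
"# '{:0>2}' means: fill with '0', right-align ('>'), minimum width 2\n",
"for i in [1, 2, 12]:\n",
"    print('list{:0>2}.txt'.format(i))\n",
"\n",
"# The same specification works inside an f-string\n",
"page = 3\n",
"padded = f'list{page:0>2}.txt'\n",
"print(padded)\n",
"```"
]
},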
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Additionally, sometimes we wish to search large strings of text. If we wish to find all occurrences of a substring within a given string, a very mechanical, procedural way of doing it would be to use Python's ``.find()`` string method and repeatedly update the starting index from which we are looking.\n",
"\n",
"## Regular Expressions\n",
"A more suitable and powerful way is to use Regular Expressions, a pattern-matching mechanism used throughout Computer Science and programming (it is not specific to Python). A tutorial on Regular Expressions (aka regex) is beyond the scope of this lab, but below are several great resources that we recommend if you are interested (they could be very useful for a homework problem):\n",
"- https://docs.python.org/3.3/library/re.html\n",
"- https://regexone.com\n",
"- https://docs.python.org/3/howto/regex.html\n",
"\n",
"# Additonal Python/Homework Comment\n",
"In Homework #1, we ask you to complete functions that have signatures with a syntax you may not have seen before:\n",
"\n",
"``def create_star_table(starlist: list) -> list:``\n",
"\n",
"To be clear, this syntax merely means that the input parameter must be a list, and the output must be a list. It's no different than any other function, it just puts a requirement on the behavior of the function.\n",
"\n",
"It is **typing** our function. Please [see this documention if you have more questions.](https://docs.python.org/3/library/typing.html)\n",
"\n",
"# Walkthrough Example (of Web Scraping)\n",
"We're going to see the structure of Goodread's best books list (**NOTE: Goodreads is described a little more within the other Lab2_More_Pandas.ipynb notebook)**. We'll use the Developer tools in chrome, safari and firefox have similar tools available. To get this page we use the `requests` module. But first we should check if the company's policy allows scraping. Check the [robots.txt](https://www.goodreads.com/robots.txt) to find what sites/elements are not accessible. Please read and verify.\n",
"\n",
""
]
},
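{
"cell_type": "markdown",
"metadata": {},
"source": [
"Besides reading robots.txt by eye, the rules can be checked programmatically with the standard library's `urllib.robotparser`. The sketch below is only illustrative: the two rules and the `example.com` URLs are made-up assumptions, not the real Goodreads policy.\n",
"\n",
"```python\n",
"from urllib.robotparser import RobotFileParser\n",
"\n",
"# Made-up rules standing in for a real robots.txt (an assumption);\n",
"# for a live site, call set_url(...) and read() to download its own file\n",
"sample_rules = [\n",
"    'User-agent: *',\n",
"    'Disallow: /private/',\n",
"]\n",
"\n",
"parser = RobotFileParser()\n",
"parser.parse(sample_rules)\n",
"\n",
"# can_fetch(useragent, url) reports whether the rules permit the request\n",
"allowed = parser.can_fetch('*', 'https://www.example.com/list/show/1')\n",
"blocked = parser.can_fetch('*', 'https://www.example.com/private/data')\n",
"print(allowed, blocked)\n",
"```"
]
},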
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"url=\"https://www.npr.org/2018/11/05/664395755/what-if-the-polls-are-wrong-again-4-scenarios-for-what-might-happen-in-the-elect\"\n",
"response = requests.get(url)\n",
"# response.status_code\n",
"# response.content\n",
"\n",
"# Beautiful Soup (library) time!\n",
"soup = BeautifulSoup(response.content, \"html.parser\")\n",
"# print(soup)\n",
"# soup.prettify()\n",
"soup.find(\"title\")\n",
"\n",
"# Q1: how do we get the title's text?\n",
"\n",
"# Q2: how do we get the webpage's entire content?"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"https://www.goodreads.com/list/show/1.Best_Books_Ever?page=1\n"
]
}
],
"source": [
"URLSTART=\"https://www.goodreads.com\"\n",
"BESTBOOKS=\"/list/show/1.Best_Books_Ever?page=\"\n",
"url = URLSTART+BESTBOOKS+'1'\n",
"print(url)\n",
"page = requests.get(url)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see properties of the page. Most relevant are `status_code` and `text`. The former tells us whether the request succeeded: a code of 200 means the page was found and returned OK. (See lecture notes.)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"200"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"page.status_code # 200 is good"
]
},
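{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the standard library's `http.HTTPStatus` maps each numeric code to its standard phrase. A small sketch follows; `describe_status` is a hypothetical helper written for this illustration and is not part of `requests`:\n",
"\n",
"```python\n",
"from http import HTTPStatus\n",
"\n",
"def describe_status(code):\n",
"    # Hypothetical helper: look up the standard phrase for a status code\n",
"    status = HTTPStatus(code)\n",
"    outcome = 'success' if 200 <= code < 300 else 'not a success'\n",
"    return f'{code} {status.phrase} ({outcome})'\n",
"\n",
"for code in [200, 404, 500]:\n",
"    print(describe_status(code))\n",
"```"
]
},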
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'\\n\\n\\n Best Books Ever (56897 books)\\n\\n\\n\\n\\n\\n\\n\\n \\n \\n\\n\\n \\n\\n\\n \\n\\n\\n\\n \\n\\n \\n\\n\\n \\n\\n \\n \\n\\n \\n2}.txt\".format(3)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"a = \"4\"\n",
"b = 4\n",
"class Four:\n",
" def __str__(self):\n",
" return \"Fourteen\"\n",
"c=Four()"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'The hazy cat jumped over the 4 and 4 and Fourteen'"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\"The hazy cat jumped over the {} and {} and {}\".format(a, b, c)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Set up a pipeline for fetching and parsing\n",
"\n",
"OK, let's get back to the fetching..."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"FTW files/1_2767052-the-hunger-games.html\n",
"FTW files/1_2.Harry_Potter_and_the_Order_of_the_Phoenix.html\n",
"FTW files/1_2657.To_Kill_a_Mockingbird.html\n",
"FTW files/1_1885.Pride_and_Prejudice.html\n",
"FTW files/1_41865.Twilight.html\n",
"FTW files/2_43763.Interview_with_the_Vampire.html\n",
"FTW files/2_153747.Moby_Dick_or_the_Whale.html\n",
"FTW files/2_5.Harry_Potter_and_the_Prisoner_of_Azkaban.html\n",
"FTW files/2_4989.The_Red_Tent.html\n",
"FTW files/2_37435.The_Secret_Life_of_Bees.html\n",
"['files/1_2767052-the-hunger-games.html', 'files/1_2.Harry_Potter_and_the_Order_of_the_Phoenix.html', 'files/1_2657.To_Kill_a_Mockingbird.html', 'files/1_1885.Pride_and_Prejudice.html', 'files/1_41865.Twilight.html', 'files/2_43763.Interview_with_the_Vampire.html', 'files/2_153747.Moby_Dick_or_the_Whale.html', 'files/2_5.Harry_Potter_and_the_Prisoner_of_Azkaban.html', 'files/2_4989.The_Red_Tent.html', 'files/2_37435.The_Secret_Life_of_Bees.html']\n"
]
}
],
"source": [
"fetched=[]\n",
"for i in range(1,3):\n",
"    with open(\"files/list{:0>2}.txt\".format(i)) as fd:\n",
"        counter=0\n",
"        for bookurl_line in fd:\n",
"            if counter > 4:  # only fetch the first 5 books from each list page\n",
"                break\n",
"            bookurl=bookurl_line.strip()\n",
"            stuff=requests.get(URLSTART+bookurl)\n",
"            filetowrite=bookurl.split('/')[-1]\n",
"            filetowrite=\"files/\"+str(i)+\"_\"+filetowrite+\".html\"\n",
"            print(\"FTW\", filetowrite)\n",
"            # write with a separate handle so the input file `fd` is not overwritten\n",
"            with open(filetowrite,\"w\") as fout:\n",
"                fout.write(stuff.text)\n",
"            fetched.append(filetowrite)\n",
"            time.sleep(2)  # pause between requests to avoid hammering the server\n",
"            counter=counter+1\n",
"\n",
"print(fetched)"
]
},
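{
"cell_type": "markdown",
"metadata": {},
"source": [
"The file-naming step in the loop above can be looked at in isolation. In this sketch, `local_filename` is a hypothetical helper and the book path is an assumed example of one line in `files/list01.txt`:\n",
"\n",
"```python\n",
"def local_filename(list_page, bookurl):\n",
"    # Hypothetical helper: strip() drops surrounding whitespace such as the\n",
"    # trailing newline of a line read from a file, then the last path\n",
"    # segment is kept and prefixed with the list page number\n",
"    slug = bookurl.strip().split('/')[-1]\n",
"    return f'files/{list_page}_{slug}.html'\n",
"\n",
"# Assumed example path for one book\n",
"name = local_filename(1, '/book/show/2767052-the-hunger-games')\n",
"print(name)\n",
"```"
]
},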
{
"cell_type": "markdown",
"metadata": {},
"source": [
"OK, we are off to parse each of the HTML pages we fetched. We have provided the skeleton of the code and the code to parse the year, since it is a bit more complex... see the difference in the screenshots above. "
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import re\n",
"yearre = r'\\d{4}'\n",
"def get_year(d):\n",
"    if d.select_one(\"nobr.greyText\"):\n",
"        return d.select_one(\"nobr.greyText\").text.strip().split()[-1][:-1]\n",
"    else:\n",
"        thetext=d.select(\"div#details div.row\")[1].text.strip()\n",
"        rowmatch=re.findall(yearre, thetext)\n",
"        if len(rowmatch) > 0:\n",
"            rowtext=rowmatch[0].strip()\n",
"        else:\n",
"            rowtext=\"NA\"\n",
"        return rowtext"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise**\n",
"\n",
"Your job is to fill in the code to get the genres."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def get_genres(d):\n",
"    # your code here\n",
"    genres=d.select(\"div.elementList div.left a\")\n",
"    glist=[]\n",
"    for g in genres:\n",
"        glist.append(g['href'])\n",
"    return glist"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"files/1_2767052-the-hunger-games.html\n",
"files/1_2.Harry_Potter_and_the_Order_of_the_Phoenix.html\n",
"files/1_2657.To_Kill_a_Mockingbird.html\n",
"files/1_1885.Pride_and_Prejudice.html\n",
"files/1_41865.Twilight.html\n",
"files/2_43763.Interview_with_the_Vampire.html\n",
"files/2_153747.Moby_Dick_or_the_Whale.html\n",
"files/2_5.Harry_Potter_and_the_Prisoner_of_Azkaban.html\n",
"files/2_4989.The_Red_Tent.html\n",
"files/2_37435.The_Secret_Life_of_Bees.html\n"
]
}
],
"source": [
"\n",
"listofdicts=[]\n",
"for filetoread in fetched:\n",
"    print(filetoread)\n",
"    td={}\n",
"    with open(filetoread) as fd:\n",
"        datext = fd.read()\n",
"    d=BeautifulSoup(datext, 'html.parser')\n",
"    td['title']=d.select_one(\"meta[property='og:title']\")['content']\n",
"    td['isbn']=d.select_one(\"meta[property='books:isbn']\")['content']\n",
"    td['booktype']=d.select_one(\"meta[property='og:type']\")['content']\n",
"    td['author']=d.select_one(\"meta[property='books:author']\")['content']\n",
"    #td['rating']=d.select_one(\"span.average\").text\n",
"    td['year'] = get_year(d)\n",
"    td['file']=filetoread\n",
"    glist = get_genres(d)\n",
"    td['genres']=\"|\".join(glist)\n",
"    listofdicts.append(td)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'title': 'The Hunger Games (The Hunger Games, #1)',\n",
" 'isbn': '9780439023481',\n",
" 'booktype': 'books.book',\n",
" 'author': 'https://www.goodreads.com/author/show/153394.Suzanne_Collins',\n",
" 'year': '2008',\n",
" 'file': 'files/1_2767052-the-hunger-games.html',\n",
" 'genres': '/genres/young-adult|/genres/fiction|/genres/science-fiction|/genres/dystopia|/genres/fantasy|/genres/science-fiction|/genres/romance|/genres/adventure|/genres/young-adult|/genres/teen|/genres/apocalyptic|/genres/post-apocalyptic|/genres/action'}"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"listofdicts[0]"
]
},
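{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that the `genres` string above repeats some entries (for example `/genres/young-adult`). One possible clean-up, sketched here on a shortened copy of that value, splits on the `|` separator, drops the common prefix, and removes duplicates while keeping the original order. The variable names are illustrative assumptions:\n",
"\n",
"```python\n",
"# Shortened copy of the 'genres' value shown above\n",
"genres = '/genres/young-adult|/genres/fiction|/genres/science-fiction|/genres/young-adult'\n",
"\n",
"# Split on the separator and drop the leading '/genres/' prefix\n",
"names = [g.replace('/genres/', '', 1) for g in genres.split('|')]\n",
"\n",
"# dict.fromkeys keeps the first occurrence of each name, in order\n",
"unique_names = list(dict.fromkeys(names))\n",
"print(unique_names)\n",
"```"
]
},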
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, let's write all of this into a CSV file, which we will use for analysis."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" author booktype file genres isbn title year\n",
"0 https://www.goodreads.com/author/show/153394.S... books.book files/1_2767052-the-hunger-games.html /genres/young-adult|/genres/fiction|/genres/sc... 9780439023481 The Hunger Games (The Hunger Games, #1) 2008\n",
"1 https://www.goodreads.com/author/show/1077326.... books.book files/1_2.Harry_Potter_and_the_Order_of_the_Ph... /genres/fantasy|/genres/young-adult|/genres/fi... 9780439358071 Harry Potter and the Order of the Phoenix (Har... 2003\n",
"2 https://www.goodreads.com/author/show/1825.Har... books.book files/1_2657.To_Kill_a_Mockingbird.html /genres/classics|/genres/fiction|/genres/histo... null To Kill a Mockingbird (To Kill a Mockingbird, #1) 1960\n",
"3 https://www.goodreads.com/author/show/1265.Jan... books.book files/1_1885.Pride_and_Prejudice.html /genres/classics|/genres/fiction|/genres/roman... null Pride and Prejudice 1813\n",
"4 https://www.goodreads.com/author/show/941441.S... books.book files/1_41865.Twilight.html /genres/young-adult|/genres/fantasy|/genres/ro... 9780316015844 Twilight (Twilight, #1) 2005\n",
"5 https://www.goodreads.com/author/show/7577.Ann... books.book files/2_43763.Interview_with_the_Vampire.html /genres/horror|/genres/fantasy|/genres/fiction... 9780345476876 Interview with the Vampire (The Vampire Chroni... 1976\n",
"6 https://www.goodreads.com/author/show/1624.Her... books.book files/2_153747.Moby_Dick_or_the_Whale.html /genres/classics|/genres/fiction|/genres/liter... 9780142437247 Moby-Dick, or, the Whale 1851\n",
"7 https://www.goodreads.com/author/show/1077326.... books.book files/2_5.Harry_Potter_and_the_Prisoner_of_Azk... /genres/fantasy|/genres/young-adult|/genres/fi... 9780439655484 Harry Potter and the Prisoner of Azkaban (Harr... 1999\n",
"8 https://www.goodreads.com/author/show/626222.A... books.book files/2_4989.The_Red_Tent.html /genres/historical|/genres/historical-fiction|... 9780312353766 The Red Tent 1997\n",
"9 https://www.goodreads.com/author/show/4711.Sue... books.book files/2_37435.The_Secret_Life_of_Bees.html /genres/fiction|/genres/historical|/genres/his... 9780142001745 The Secret Life of Bees 2001"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame.from_records(listofdicts)\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df.to_csv(\"files/meta_utf8_EK.csv\", index=False, header=True)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 1
}