{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# CS109A Introduction to Data Science \n", "\n", "## Lab 2: Web Scraping with Beautiful Soup\n", "\n", "**Harvard University**
\n", "**Fall 2019**
\n", "**Instructors:** Pavlos Protopapas, Kevin Rader, and Chris Tanner
\n", "**Lab Instructors:** Chris Tanner and Eleni Kaxiras
\n", "\n", "**Authors:** Rahul Dave, David Sondak, Will Claybaugh, Pavlos Protopapas, Chris Tanner, and Eleni Kaxiras\n", "\n", "---" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## RUN THIS CELL TO GET THE RIGHT FORMATTING \n", "from IPython.display import HTML\n", "def css_styling():\n", "    # Read the course stylesheet and close the file when done\n", "    with open(\"../../styles/cs109.css\", \"r\") as f:\n", "        styles = f.read()\n", "    return HTML(styles)\n", "css_styling()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%matplotlib inline\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "import seaborn as sns  # seaborn.apionly was deprecated and later removed\n", "import time" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Table of Contents \n", "
\n", "0. Learning Goals\n", "1. Introduction to Web Servers and HTTP\n", "2. Download webpages and get basic properties\n", "3. Parse the page with Beautiful Soup\n", "4. String formatting\n", "5. Additional Python/Homework Comment\n", "6. Walkthrough Example"
 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Learning Goals\n", "\n", "- Understand the structure of a web page.\n", "- Understand how to use Beautiful Soup to scrape content from web pages.\n", "- Feel comfortable storing and manipulating the content in various formats.\n", "- Understand how to convert structured content into a Pandas DataFrame.\n", "\n", "In this lab, we'll scrape Goodreads' Best Books list:\n", "\n", "https://www.goodreads.com/list/show/1.Best_Books_Ever?page=1 .\n", "\n", "We'll walk through scraping the list pages for the book names/URLs. First, we start with an even simpler example.\n", "\n", "*This lab corresponds to lectures #2 and #3 and maps onto Homework #1 and beyond.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. Introduction to Web Servers and HTTP\n", "\n", "A web server is just a computer -- usually a powerful one, but ultimately it's just another computer -- that runs a long/continuous process that listens for requests on a pre-specified (Internet) _port_ on that computer. It responds to those requests via a protocol called HTTP (HyperText Transfer Protocol). HTTPS is the secure version. When we use a web browser and navigate to a web page, our browser is actually sending a request on our behalf to a specific web server. The browser request is essentially saying \"hey, please give me the web page contents\", and it's up to the browser to correctly render that raw content in a coherent manner, depending on the format of the file. For example, HTML is one format, XML is another format, and so on.\n", "\n", "Ideally (and usually), the web server complies with the request and all is fine. As part of this communication exchange, the server also sends a status code.\n", "- If the code starts with a **2**, it means the request was successful.\n", "- If the code starts with a **4**, it means there was a client error (you, as the user, are the client). 
For example, have you ever received a 404 Not Found error because a web page doesn't exist? This is an example of a client error, because you are requesting a bogus item.\n", "- If the code starts with a **5**, it means there was a server error: the server failed to fulfill a request that appeared valid. (A malformed request is a client error and typically yields a 400.)\n", "\n", "[Click here](https://www.restapitutorial.com/httpstatuscodes.html) for a full list of status codes.\n", "\n", "As an analogy, you can think of a web server as being like a server at a restaurant; its goal is to _serve_ your requests. When you try to order something not on the menu (i.e., ask for a web page at a wrong location), the server says 'sorry, we don't have that' (i.e., 404, client error; your mistake).\n", "\n", "**IMPORTANT:**\n", "As humans, we visit pages at a sane, reasonable rate. However, as we start to scrape web pages with our computers, we will be sending requests with our code, and thus we can make requests at an incredible rate. This is potentially dangerous because it's akin to going to a restaurant and bombarding the server(s) with thousands of food orders. Very often, the restaurant will ban you (i.e., Harvard's network gets banned from the website, and you are potentially held responsible in some capacity). It is imperative to be responsible and careful. In fact, flooding a website with requests (a denial-of-service attack) is one of the most common, yet archaic, methods for maliciously attacking websites and other computers with Internet connections. In short, be respectful and careful with your decisions and code. It is better to err on the side of caution, which includes using the **``time.sleep()`` function** to pause your code's execution between subsequent requests. ``time.sleep(2)`` should be fine when making just a few dozen requests. 
Each site has its own rules, which are often published in its ``robots.txt`` file.\n", "\n", "### Additional Resources\n", "\n", "**HTML:** if you are not familiar with HTML, see https://www.w3schools.com/html/ or one of the many tutorials on the internet.\n", "\n", "**Document Object Model (DOM):** for more on this programming interface for HTML and XML documents, see https://www.w3schools.com/js/js_htmldom.asp." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2. Download webpages and get basic properties\n", "\n", "``Requests`` is a highly useful Python library that allows us to fetch web pages.\n", "``BeautifulSoup`` is a phenomenal Python library that allows us to easily parse web content and perform basic extraction.\n", "\n", "To scrape webpages, one usually uses ``requests`` to fetch the page and ``BeautifulSoup`` to parse the page's meaningful components. Webpages can be messy, despite having a structured format, which is why BeautifulSoup is so handy.\n", "\n", "Let's get started:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "import requests" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To fetch a webpage's content, we can simply use the ``get()`` function within the requests library:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "url = \"https://www.npr.org/2018/11/05/664395755/what-if-the-polls-are-wrong-again-4-scenarios-for-what-might-happen-in-the-elect\"\n", "response = requests.get(url) # you can use any URL that you wish" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The response variable has many highly useful attributes, such as:\n", "- ``status_code``\n", "- ``text``\n", "- ``content``\n", "\n", "Let's try each of them!" 
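] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an aside before trying each attribute: the sketch below is one possible way, not part of the original lab, to combine the status-code families from Section 1 with the advice to pause between requests. The helper names ``status_family`` and ``polite_fetch`` and the ``fetch`` and ``delay`` parameters are made up for illustration; ``fetch`` is assumed to be any callable that takes a URL, such as ``requests.get``.

```python
import time

# Hypothetical helpers for illustration only; they are not part of
# requests or BeautifulSoup.

def status_family(code):
    '''Return the broad family of an HTTP status code, as described in Section 1.'''
    if 200 <= code < 300:
        return 'success'
    if 400 <= code < 500:
        return 'client error'
    if 500 <= code < 600:
        return 'server error'
    return 'other'  # 1xx informational and 3xx redirection codes land here


def polite_fetch(urls, fetch, delay=2.0):
    '''Call fetch(url) for each URL in order, sleeping delay seconds between calls.'''
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # pause before every request except the first
        results.append(fetch(url))
    return results
```

For example, ``responses = polite_fetch(list_of_urls, requests.get)`` (with ``list_of_urls`` standing in for your own list) would issue the requests one at a time with a two-second pause between them, and ``status_family(r.status_code)`` could then be checked for each response ``r`` before parsing its ``.text``. Passing the fetching function in as an argument also makes the loop easy to try out without touching the network."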
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### response.status_code" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "200" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "response.status_code" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should have received a status code of 200, which means the page was successfully found on the server and sent to the receiver (aka client/user/you). [Again, you can click here](https://www.restapitutorial.com/httpstatuscodes.html) for a full list of status codes.\n", "\n", "### response.text\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'\\nWhat If The Polls Are Wrong Again? 4 Scenarios For What Might Happen In The Elections : NPR\\n\\n\\n\\n\\n\\n
\\n Accessibility links \\n
\\n
\\n What If The Polls Are Wrong Again? 4 Scenarios For What Might Happen In The Elections The polls show a Democratic advantage in the House and a Republican one in the Senate. But be ready for anything because surprises in politics always happen.\\n
\\n
\\n
\\n\\n

\\n Elections\\n

\\n
\\n
\\n

What If The Polls Are Wrong Again? 4 Scenarios For What Might Happen In The Elections

\\n \\n \\n \\n \\n
\\n\\n\\n
\\n
\\n
    \\n
  • \\n
  • \\n
  • \\n
  • \\n
\\n
\\n \\n\\n
\\n\\n\\n
\\n
\\n
\\n \\n
\\n
\\n \\n\\n
\\n
\\n
\\n
\\n
\\n
\\n \\n \"Domenico\\n \\n
\\n\\n

\\n \\n Domenico Montanaro\\n \\n

\\n\\n \\n
\\n
\\n
\\n \\n\\n
\\n
\\n \\n\\n
\\n\\n\\n
\\n
\\n
\\n \"\"\\n \\n
\\n
\\n
\\n
\\n

\\n Supporters of Missouri Democratic Sen. Claire McCaskill wait for her to arrive at a campaign stop in St. Louis on Monday.\\n \\n \\n \\n Scott Olson/Getty Images\\n \\n \\n hide caption\\n

\\n
\\n\\n\\n toggle caption\\n
\\n\\n \\n \\n Scott Olson/Getty Images\\n \\n \\n
\\n
\\n
\\n \"\"\\n
\\n
\\n
\\n
\\n

Supporters of Missouri Democratic Sen. Claire McCaskill wait for her to arrive at a campaign stop in St. Louis on Monday.

\\n \\n \\n Scott Olson/Getty Images\\n \\n \\n
\\n
\\n
\\n

There\\'s a lot that can happen Tuesday, the culmination of a long midterm election campaign that will provide the first nationwide measure of the U.S. electorate since Donald Trump was elected president.

One narrative has become dominant: that Democrats are likely to gain control of the House and Republicans hold the Senate, if not expand their majority there. That narrative is based largely on national polls, and caution should be urged. Pollsters have made a lot of adjustments to hopefully correct what they got wrong in 2016, but they can\\'t tell you precisely who is going to show up to vote.

\\n \\n \\n\\n
\\n \\n\\n

What\\'s more, there have been far fewer statewide and district-specific surveys than in past midterm elections. And, as it is, there are data both parties can take solace in that buoy their respective cases. So everyone should be prepared for surprises — because there always are some. That\\'s the beauty of campaigns and voting.

Here are four scenarios for how election night might play out and what each could mean.

1. Democrats win the House, and Republicans hold the Senate

This is the most likely outcome, based not just on the polls but also on conversations with strategists in both parties. But they urge caution, because the races in the many districts across the country that are up for grabs are still very close.

\\n \\n\\n

How it would happen: Forget the polls; Democrats are favored to take back the House for more reasons than that. There have been a record number of retirements, reducing the built-in advantage incumbents tend to have; record numbers of candidates, especially Democratic women, have run for public office; Democrats won the off-year elections in Virginia and New Jersey; they won or fared better than expected in special elections across the country; there was high primary turnout for Democrats in many states; and there is very high early voting turnout.

\\n \\n\\n

And just look at how wide the playing field is — Democrats need to pick up 23 seats to take back the House, and they are targeting some 80 Republican-held seats. Republicans are competing in just eight held by Democrats. That right there is and has been a huge flashing red light for the GOP. So many of those races are running through the suburbs, where independents and wealthy, college-educated women live, both of which have consistently in polling said they disapproved of the job the president is doing and prefer to vote for a Democrat in their district.

\\n \\n\\n

One other overlooked number from the last NPR/PBS NewsHour/Marist poll: Just 54 percent of Republican women who are registered voters said they were very enthusiastic about voting in this election. Compare that with 78 percent of Republican men who are registered voters. And where do a lot of those women live? The suburbs. If GOP women, an important group that Republicans need to bolster them, stay home, that\\'s one way Democrats clean up in the House.

\\n \\n\\n

In the Senate, on the other hand, Republicans have a very favorable landscape and are competing in conservative states held by Democrats. The fundamentals favor the GOP in these states, and if Republicans win where they should win, they will hold the Senate.

\\n \\n\\n

What it would mean: This would be a huge win for Democrats, as they\\'d be able to gum up Trump\\'s agenda and begin to investigate his administration, something the GOP has not done very much of. In the Senate, Republicans could still approve federal judges and Trump Supreme Court nominees, but if they want to get any big legislation done they\\'re going to have to negotiate with Democrats in the House, and possibly a Speaker Nancy Pelosi.

Democrats feel they need to limit the losses in the Senate. If they can hold Republicans to net even, keeping the Senate at 51-49, or maybe lose a net of one seat, then they will be very happy. They have a much more favorable Senate landscape in 2020 and believe they will be able to take back the Senate then.

2. Republicans hold the House and Senate

This would be a huge win for the GOP.

\\n \\n\\n

How it would happen: A record turnout is expected Tuesday — perhaps higher than any time in the past 50 years for a midterm — but, as in 2016, Trump voters would have to dominate. Rural voters would have to turn out at higher-than-expected rates, causing the polls to be wrong (again). Meanwhile, young voters and Latinos would have to stay home. (It is supposed to rain on the East Coast Tuesday, which could depress low-propensity-voter turnout.)

\\n \\n\\n

All of those close House races would have to tip Republicans\\' way, something that\\'s very possible given the conservative lean of those districts and the distrust of the media, purposefully stoked by the president. And who pays for polls for the most part? Big media organizations.

\\n
\\n \"\"\\n \\n
\\n
\\n
\\n
\\n

\\n President Trump acknowledges supporters during a campaign rally for Rep. Marsha Blackburn, R-Tenn., and other Tennessee Republican candidates on Monday in Chattanooga, Tenn.\\n \\n \\n \\n Alex Wong/Getty Images\\n \\n \\n hide caption\\n

\\n
\\n\\n\\n toggle caption\\n
\\n\\n \\n \\n Alex Wong/Getty Images\\n \\n \\n
\\n
\\n
\\n \"\"\\n
\\n
\\n
\\n
\\n

President Trump acknowledges supporters during a campaign rally for Rep. Marsha Blackburn, R-Tenn., and other Tennessee Republican candidates on Monday in Chattanooga, Tenn.

\\n \\n \\n Alex Wong/Getty Images\\n \\n \\n
\\n
\\n
\\n

What it would mean: President Trump and Republicans would step on the gas, validated by an election cycle dominated by negative news coverage and polling that said the GOP had its back against the wall. The Affordable Care Act (aka Obamacare) would very likely be repealed once and for all. And Trump could set his sights on ousting Attorney General Jeff Sessions and other key figures at the Justice Department, possibly ending the department\\'s investigation of Russia\\'s attack on the 2016 election.

\\n \\n\\n

What\\'s more, Trump\\'s strategy of demonizing immigrants would have worked — again. That was rewarded, and what message would that send? He is only going to do more of it between Wednesday and November 2020 when he stands for re-election.

\\n \\n \\n\\n
\\n \\n\\n

It would also be yet another reckoning for pollsters and media organizations that pay for the surveys. The polls currently show Democrats with a razor-thin, but consistent advantage heading into Election Day. But if the polls are wrong, it should induce more than a shoulder shrug from outlets that conduct them and the news media organizations that report on them.

3. Democrats win both the House and Senate

This is not seen as the likeliest of scenarios, but it\\'s not out of the realm of possibility either. It would very likely mean a massive wave and a massive shift against Trump and Republicans tied to him nationally.

A lot would have to happen, especially in the Senate, for this to happen.

\\n \\n\\n

How it would happen: The path for Democrats in the House is through the suburbs, as in Scenario 1. That doesn\\'t change. But for Democrats to pull this off in the Senate, not only would voters have to side with Democratic incumbents in conservative states, but Democratic challengers would have to win in places like Nevada and Arizona, and possibly Tennessee and Texas.

What it would mean: It would be a repudiation of Trump and the Republicans tied to him nationwide. It would have to trigger a degree of soul-searching — in at least some Republican corners.

Trump would be faced with the choice of moderating and working with Democrats or being a lame-duck president starting in January 2019 when a new Democratic Congress is sworn in — as talk ramps up about Democratic 2020 challengers.

4. Overtime

It\\'s very possible control of both the House and Senate will not be clear on election night.

\\n
\\n \"\"\\n \\n
\\n
\\n
\\n
\\n

\\n A number of races are so close that it may not be possible to declare a winner on election night, leaving control of the House and Senate up in the air.\\n \\n \\n \\n Joe Sohm/Visions of America/UIG via Getty Images\\n \\n \\n hide caption\\n

\\n
\\n\\n\\n toggle caption\\n
\\n\\n \\n \\n Joe Sohm/Visions of America/UIG via Getty Images\\n \\n \\n
\\n
\\n
\\n \"\"\\n
\\n
\\n
\\n
\\n

A number of races are so close that it may not be possible to declare a winner on election night, leaving control of the House and Senate up in the air.

\\n \\n \\n Joe Sohm/Visions of America/UIG via Getty Images\\n \\n \\n
\\n
\\n
\\n

How it would happen: There are a half-dozen congressional races in California, for example, that are very close heading into Election Day. It\\'s possible those races are so close they will not be called on election night. They might not be called for days and possibly weeks later, especially because the vote there is counted slowly.

\\n \\n\\n

Additionally, early and absentee ballots can get counted slowly and there is growing concern that many voters\\' absentee and mailed ballots could be rejected. In 2016, to the surprise of many, 319,000 absentee ballots were rejected for one reason or another.

In the Senate, depending on how results from other races shake out, there is the possibility that control is not known on election night or for weeks after. Specifically, it could all come down to Mississippi. There, no candidate is polling above 50 percent heading into Election Day, and if no one gets at least 50 percent, the race heads to a runoff three weeks later.

What it would mean: Imagine a scenario in which Democrats lead 50-49 on Election Day in the Senate, and the eyes of the country — and the deep pockets of out-of-state money — descend on Mississippi. The consequences would be enormous, the rancor pitched and the tension thick.

\\n
\\n
\\n \\n
\\n\\n\\n
\\n
    \\n
  • \\n
  • \\n
  • \\n
  • \\n
\\n
\\n\\n\\n
\\n
\\n\\n\\n
\\n
\\n\\n\\n
\\n
\\n\\n\\n\\n\\n
\\n \\n
\\n \\n\\n
\\n\\n\\n\\n
'" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "response.text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Holy moly! That looks awful. If we use our browser to visit the URL, then right-click the page and click 'View Page Source', we see that it is identical to this chunk of glorious text.\n", "\n", "### response.content" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "b'\\nWhat If The Polls Are Wrong Again? 4 Scenarios For What Might Happen In The Elections : NPR\\n\\n\\n\\n\\n\\n
\\n Accessibility links \\n
\\n
\\n What If The Polls Are Wrong Again? 4 Scenarios For What Might Happen In The Elections The polls show a Democratic advantage in the House and a Republican one in the Senate. But be ready for anything because surprises in politics always happen.\\n
\\n
\\n
\\n\\n

\\n Elections\\n

\\n
\\n
\\n

What If The Polls Are Wrong Again? 4 Scenarios For What Might Happen In The Elections

\\n \\n \\n \\n \\n
\\n\\n\\n
\\n
\\n
    \\n
  • \\n
  • \\n
  • \\n
  • \\n
\\n
\\n \\n\\n
\\n\\n\\n
\\n
\\n
\\n \\n
\\n
\\n \\n\\n
\\n
\\n
\\n
\\n
\\n
\\n \\n \"Domenico\\n \\n
\\n\\n

\\n \\n Domenico Montanaro\\n \\n

\\n\\n \\n
\\n
\\n
\\n \\n\\n
\\n
\\n \\n\\n
\\n\\n\\n
\\n
\\n
\\n \"\"\\n \\n
\\n
\\n
\\n
\\n

\\n Supporters of Missouri Democratic Sen. Claire McCaskill wait for her to arrive at a campaign stop in St. Louis on Monday.\\n \\n \\n \\n Scott Olson/Getty Images\\n \\n \\n hide caption\\n

\\n
\\n\\n\\n toggle caption\\n
\\n\\n \\n \\n Scott Olson/Getty Images\\n \\n \\n
\\n
\\n
\\n \"\"\\n
\\n
\\n
\\n
\\n

Supporters of Missouri Democratic Sen. Claire McCaskill wait for her to arrive at a campaign stop in St. Louis on Monday.

\\n \\n \\n Scott Olson/Getty Images\\n \\n \\n
\\n
\\n
\\n

There\\'s a lot that can happen Tuesday, the culmination of a long midterm election campaign that will provide the first nationwide measure of the U.S. electorate since Donald Trump was elected president.

One narrative has become dominant: that Democrats are likely to gain control of the House and Republicans hold the Senate, if not expand their majority there. That narrative is based largely on national polls, and caution should be urged. Pollsters have made a lot of adjustments to hopefully correct what they got wrong in 2016, but they can\\'t tell you precisely who is going to show up to vote.

\\n \\n \\n\\n
\\n \\n\\n

What\\'s more, there have been far fewer statewide and district-specific surveys than in past midterm elections. And, as it is, there are data both parties can take solace in that buoy their respective cases. So everyone should be prepared for surprises \\xe2\\x80\\x94 because there always are some. That\\'s the beauty of campaigns and voting.

Here are four scenarios for how election night might play out and what each could mean.

1. Democrats win the House, and Republicans hold the Senate

This is the most likely outcome, based not just on the polls but also on conversations with strategists in both parties. But they urge caution, because the races in the many districts across the country that are up for grabs are still very close.

\\n \\n\\n

How it would happen: Forget the polls; Democrats are favored to take back the House for more reasons than that. There have been a record number of retirements, reducing the built-in advantage incumbents tend to have; record numbers of candidates, especially Democratic women, have run for public office; Democrats won the off-year elections in Virginia and New Jersey; they won or fared better than expected in special elections across the country; there was high primary turnout for Democrats in many states; and there is very high early voting turnout.

\\n \\n\\n

And just look at how wide the playing field is \\xe2\\x80\\x94 Democrats need to pick up 23 seats to take back the House, and they are targeting some 80 Republican-held seats. Republicans are competing in just eight held by Democrats. That right there is and has been a huge flashing red light for the GOP. So many of those races are running through the suburbs, where independents and wealthy, college-educated women live, both of which have consistently in polling said they disapproved of the job the president is doing and prefer to vote for a Democrat in their district.

\\n \\n\\n

One other overlooked number from the last NPR/PBS NewsHour/Marist poll: Just 54 percent of Republican women who are registered voters said they were very enthusiastic about voting in this election. Compare that with 78 percent of Republican men who are registered voters. And where do a lot of those women live? The suburbs. If GOP women, an important group that Republicans need to bolster them, stay home, that\\'s one way Democrats clean up in the House.

\\n \\n\\n

In the Senate, on the other hand, Republicans have a very favorable landscape and are competing in conservative states held by Democrats. The fundamentals favor the GOP in these states, and if Republicans win where they should win, they will hold the Senate.

\\n \\n\\n

What it would mean: This would be a huge win for Democrats, as they\\'d be able to gum up Trump\\'s agenda and begin to investigate his administration, something the GOP has not done very much of. In the Senate, Republicans could still approve federal judges and Trump Supreme Court nominees, but if they want to get any big legislation done they\\'re going to have to negotiate with Democrats in the House, and possibly a Speaker Nancy Pelosi.

Democrats feel they need to limit the losses in the Senate. If they can hold Republicans to net even, keeping the Senate at 51-49, or maybe lose a net of one seat, then they will be very happy. They have a much more favorable Senate landscape in 2020 and believe they will be able to take back the Senate then.

2. Republicans hold the House and Senate

This would be a huge win for the GOP.

\\n \\n\\n

How it would happen: A record turnout is expected Tuesday \\xe2\\x80\\x94 perhaps higher than any time in the past 50 years for a midterm \\xe2\\x80\\x94 but, as in 2016, Trump voters would have to dominate. Rural voters would have to turn out at higher-than-expected rates, causing the polls to be wrong (again). Meanwhile, young voters and Latinos would have to stay home. (It is supposed to rain on the East Coast Tuesday, which could depress low-propensity-voter turnout.)

\\n \\n\\n

All of those close House races would have to tip Republicans\\' way, something that\\'s very possible given the conservative lean of those districts and the distrust of the media, purposefully stoked by the president. And who pays for polls for the most part? Big media organizations.

\\n
\\n \"\"\\n \\n
\\n
\\n
\\n
\\n

\\n President Trump acknowledges supporters during a campaign rally for Rep. Marsha Blackburn, R-Tenn., and other Tennessee Republican candidates on Monday in Chattanooga, Tenn.\\n \\n \\n \\n Alex Wong/Getty Images\\n \\n \\n hide caption\\n

\\n
\\n\\n\\n toggle caption\\n
\\n\\n \\n \\n Alex Wong/Getty Images\\n \\n \\n
\\n
\\n
\\n \"\"\\n
\\n
\\n
\\n
\\n

President Trump acknowledges supporters during a campaign rally for Rep. Marsha Blackburn, R-Tenn., and other Tennessee Republican candidates on Monday in Chattanooga, Tenn.

\\n \\n \\n Alex Wong/Getty Images\\n \\n \\n
\\n
\\n
\\n

What it would mean: President Trump and Republicans would step on the gas, validated by an election cycle dominated by negative news coverage and polling that said the GOP had its back against the wall. The Affordable Care Act (aka Obamacare) would very likely be repealed once and for all. And Trump could set his sights on ousting Attorney General Jeff Sessions and other key figures at the Justice Department, possibly ending the department\\'s investigation of Russia\\'s attack on the 2016 election.

\\n \\n\\n

What\\'s more, Trump\\'s strategy of demonizing immigrants would have worked \\xe2\\x80\\x94 again. That was rewarded, and what message would that send? He is only going to do more of it between Wednesday and November 2020 when he stands for re-election.

\\n \\n \\n\\n
\\n \\n\\n

It would also be yet another reckoning for pollsters and media organizations that pay for the surveys. The polls currently show Democrats with a razor-thin, but consistent advantage heading into Election Day. But if the polls are wrong, it should induce more than a shoulder shrug from outlets that conduct them and the news media organizations that report on them.

3. Democrats win both the House and Senate

This is not seen as the likeliest of scenarios, but it\\'s not out of the realm of possibility either. It would very likely mean a massive wave and a massive shift against Trump and Republicans tied to him nationally.

A lot would have to happen, especially in the Senate, for this to happen.

\\n \\n\\n

How it would happen: The path for Democrats in the House is through the suburbs, as in Scenario 1. That doesn\\'t change. But for Democrats to pull this off in the Senate, not only would voters have to side with Democratic incumbents in conservative states, but Democratic challengers would have to win in places like Nevada and Arizona, and possibly Tennessee and Texas.

What it would mean: It would be a repudiation of Trump and the Republicans tied to him nationwide. It would have to trigger a degree of soul-searching \\xe2\\x80\\x94 in at least some Republican corners.

Trump would be faced with the choice of moderating and working with Democrats or being a lame-duck president starting in January 2019 when a new Democratic Congress is sworn in \\xe2\\x80\\x94 as talk ramps up about Democratic 2020 challengers.

4. Overtime

It\\'s very possible control of both the House and Senate will not be clear on election night.

\\n
\\n \"\"\\n \\n
\\n
\\n
\\n
\\n

\\n A number of races are so close that it may not be possible to declare a winner on election night, leaving control of the House and Senate up in the air.\\n \\n \\n \\n Joe Sohm/Visions of America/UIG via Getty Images\\n \\n \\n hide caption\\n

\\n
\\n\\n\\n toggle caption\\n
\\n\\n \\n \\n Joe Sohm/Visions of America/UIG via Getty Images\\n \\n \\n
\\n
\\n
\\n \"\"\\n
\\n
\\n
\\n
\\n

A number of races are so close that it may not be possible to declare a winner on election night, leaving control of the House and Senate up in the air.

\\n \\n \\n Joe Sohm/Visions of America/UIG via Getty Images\\n \\n \\n
\\n
\\n
\\n

How it would happen: There are a half-dozen congressional races in California, for example, that are very close heading into Election Day. It\\'s possible those races are so close they will not be called on election night. They might not be called for days and possibly weeks later, especially because the vote there is counted slowly.

\\n \\n\\n

Additionally, early and absentee ballots can get counted slowly and there is growing concern that many voters\\' absentee and mailed ballots could be rejected. In 2016, to the surprise of many, 319,000 absentee ballots were rejected for one reason or another.

In the Senate, depending on how results from other races shake out, there is the possibility that control is not known on election night or for weeks after. Specifically, it could all come down to Mississippi. There, no candidate is polling above 50 percent heading into Election Day, and if no one gets at least 50 percent, the race heads to a runoff three weeks later.

What it would mean: Imagine a scenario in which Democrats lead 50-49 on Election Day in the Senate, and the eyes of the country \\xe2\\x80\\x94 and the deep pockets of out-of-state money \\xe2\\x80\\x94 descend on Mississippi. The consequences would be enormous, the rancor pitched and the tension thick.

\\n
\\n
\\n \\n
\\n\\n\\n
\\n
    \\n
  • \\n
  • \\n
  • \\n
  • \\n
\\n
\\n\\n\\n
\\n
\\n\\n\\n
\\n
\\n\\n\\n
\\n
\\n\\n\\n\\n\\n
\\n \\n
\\n \\n\\n
\\n\\n\\n\\n
'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "response.content" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What?! This seems identical to the ``.text`` field. However, the careful eye would notice that the very first characters differ; that is, ``.content`` begins with *b'*, which in Python syntax denotes a bytes literal, so its data type is ``bytes``, whereas ``.text`` has no such prefix and is a regular string (``str``): the same data decoded into text.\n", "\n", "OK, so that's great, but how do we make sense of this text? We could manually parse it, but that's tedious and difficult. As mentioned, BeautifulSoup is specifically designed to parse this exact content (any webpage content).\n", "\n", "## BEAUTIFUL SOUP\n", "![title](images/soup_for_you.jpg) (property of NBC)\n", "\n", "\n", "The [documentation for BeautifulSoup is found here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).\n", "\n", "A BeautifulSoup object can be initialized with the ``.content`` from the response and a flag denoting the type of parser that we should use. For example, we could specify ``html.parser``, ``lxml``, etc. (see the [documentation here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers)). Since we are interested in standard webpages that use HTML, let's specify ``html.parser``:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", "\n", "What If The Polls Are Wrong Again? 4 Scenarios For What Might Happen In The Elections : NPR\n", "\n", "\n", "\n", "\n", "\n", "
\n", "Accessibility links \n", "
\n", "
\n", "What If The Polls Are Wrong Again? 4 Scenarios For What Might Happen In The Elections The polls show a Democratic advantage in the House and a Republican one in the Senate. But be ready for anything because surprises in politics always happen.\n", "
\n", "
\n", "
\n", "

\n", "Elections\n", "

\n", "
\n", "
\n", "

What If The Polls Are Wrong Again? 4 Scenarios For What Might Happen In The Elections

\n", "\n", "\n", "\n", "\n", "
\n", "\n", "
\n", "
\n", "
    \n", "
  • \n", "
  • \n", "
  • \n", "
  • \n", "
\n", "
\n", "\n", "
\n", "\n", "
\n", "
\n", "
\n", "\n", "
\n", "
\n", "\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "\n", "\"Domenico\n", "\n", "
\n", "

\n", "\n", " Domenico Montanaro\n", " \n", "

\n", "\n", "
\n", "
\n", "
\n", "\n", "
\n", "
\n", "\n", "
\n", "\n", "
\n", "
\n", "
\n", "\"\"\n", "
\n", "Enlarge this image\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " Supporters of Missouri Democratic Sen. Claire McCaskill wait for her to arrive at a campaign stop in St. Louis on Monday.\n", " \n", " \n", " \n", " Scott Olson/Getty Images\n", " \n", " \n", "hide caption\n", "

\n", "
\n", "toggle caption\n", "
\n", "\n", " \n", " Scott Olson/Getty Images\n", " \n", " \n", "
\n", "
\n", "
\n", "\"\"\n", "
\n", "
\n", "
\n", "
\n", "

Supporters of Missouri Democratic Sen. Claire McCaskill wait for her to arrive at a campaign stop in St. Louis on Monday.

\n", "\n", " \n", " Scott Olson/Getty Images\n", " \n", " \n", "
\n", "
\n", "
\n", "

There's a lot that can happen Tuesday, the culmination of a long midterm election campaign that will provide the first nationwide measure of the U.S. electorate since Donald Trump was elected president.

One narrative has become dominant: that Democrats are likely to gain control of the House and Republicans hold the Senate, if not expand their majority there. That narrative is based largely on national polls, and caution should be urged. Pollsters have made a lot of adjustments to hopefully correct what they got wrong in 2016, but they can't tell you precisely who is going to show up to vote.

\n", "
\n", "\"Key \n", "\n", "
\n", "\n", "
\n", "\n", "

What's more, there have been far fewer statewide and district-specific surveys than in past midterm elections. And, as it is, there are data both parties can take solace in that buoy their respective cases. So everyone should be prepared for surprises — because there always are some. That's the beauty of campaigns and voting.

Here are four scenarios for how election night might play out and what each could mean.

1. Democrats win the House, and Republicans hold the Senate

This is the most likely outcome, based not just on the polls but also on conversations with strategists in both parties. But they urge caution, because the races in the many districts across the country that are up for grabs are still very close.

\n", "\n", "\n", "
\n", "\n", "

How it would happen: Forget the polls; Democrats are favored to take back the House for more reasons than that. There have been a record number of retirements, reducing the built-in advantage incumbents tend to have; record numbers of candidates, especially Democratic women, have run for public office; Democrats won the off-year elections in Virginia and New Jersey; they won or fared better than expected in special elections across the country; there was high primary turnout for Democrats in many states; and there is very high early voting turnout.

\n", "\n", "

And just look at how wide the playing field is — Democrats need to pick up 23 seats to take back the House, and they are targeting some 80 Republican-held seats. Republicans are competing in just eight held by Democrats. That right there is and has been a huge flashing red light for the GOP. So many of those races are running through the suburbs, where independents and wealthy, college-educated women live, both of which have consistently in polling said they disapproved of the job the president is doing and prefer to vote for a Democrat in their district.

\n", "\n", "\n", "
\n", "\n", "

One other overlooked number from the last NPR/PBS NewsHour/Marist poll: Just 54 percent of Republican women who are registered voters said they were very enthusiastic about voting in this election. Compare that with 78 percent of Republican men who are registered voters. And where do a lot of those women live? The suburbs. If GOP women, an important group that Republicans need to bolster them, stay home, that's one way Democrats clean up in the House.

\n", "\n", "\n", "
\n", "\n", "

In the Senate, on the other hand, Republicans have a very favorable landscape and are competing in conservative states held by Democrats. The fundamentals favor the GOP in these states, and if Republicans win where they should win, they will hold the Senate.

\n", "\n", "

What it would mean: This would be a huge win for Democrats, as they'd be able to gum up Trump's agenda and begin to investigate his administration, something the GOP has not done very much of. In the Senate, Republicans could still approve federal judges and Trump Supreme Court nominees, but if they want to get any big legislation done they're going to have to negotiate with Democrats in the House, and possibly a Speaker Nancy Pelosi.

Democrats feel they need to limit the losses in the Senate. If they can hold Republicans to net even, keeping the Senate at 51-49, or maybe lose a net of one seat, then they will be very happy. They have a much more favorable Senate landscape in 2020 and believe they will be able to take back the Senate then.

2. Republicans hold the House and Senate

This would be a huge win for the GOP.

\n", "\n", "

How it would happen: A record turnout is expected Tuesday — perhaps higher than any time in the past 50 years for a midterm — but, as in 2016, Trump voters would have to dominate. Rural voters would have to turn out at higher-than-expected rates, causing the polls to be wrong (again). Meanwhile, young voters and Latinos would have to stay home. (It is supposed to rain on the East Coast Tuesday, which could depress low-propensity-voter turnout.)

\n", "\n", "\n", "
\n", "\n", "

All of those close House races would have to tip Republicans' way, something that's very possible given the conservative lean of those districts and the distrust of the media, purposefully stoked by the president. And who pays for polls for the most part? Big media organizations.

\n", "
\n", "\"\"\n", "
\n", "Enlarge this image\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " President Trump acknowledges supporters during a campaign rally for Rep. Marsha Blackburn, R-Tenn., and other Tennessee Republican candidates on Monday in Chattanooga, Tenn.\n", " \n", " \n", " \n", " Alex Wong/Getty Images\n", " \n", " \n", "hide caption\n", "

\n", "
\n", "toggle caption\n", "
\n", "\n", " \n", " Alex Wong/Getty Images\n", " \n", " \n", "
\n", "
\n", "
\n", "\"\"\n", "
\n", "
\n", "
\n", "
\n", "

President Trump acknowledges supporters during a campaign rally for Rep. Marsha Blackburn, R-Tenn., and other Tennessee Republican candidates on Monday in Chattanooga, Tenn.

\n", "\n", " \n", " Alex Wong/Getty Images\n", " \n", " \n", "
\n", "
\n", "
\n", "

What it would mean: President Trump and Republicans would step on the gas, validated by an election cycle dominated by negative news coverage and polling that said the GOP had its back against the wall. The Affordable Care Act (aka Obamacare) would very likely be repealed once and for all. And Trump could set his sights on ousting Attorney General Jeff Sessions and other key figures at the Justice Department, possibly ending the department's investigation of Russia's attack on the 2016 election.

\n", "\n", "\n", "
\n", "\n", "

What's more, Trump's strategy of demonizing immigrants would have worked — again. That was rewarded, and what message would that send? He is only going to do more of it between Wednesday and November 2020 when he stands for re-election.

\n", "\n", "\n", "
\n", "\n", "

It would also be yet another reckoning for pollsters and media organizations that pay for the surveys. The polls currently show Democrats with a razor-thin, but consistent advantage heading into Election Day. But if the polls are wrong, it should induce more than a shoulder shrug from outlets that conduct them and the news media organizations that report on them.

3. Democrats win both the House and Senate

This is not seen as the likeliest of scenarios, but it's not out of the realm of possibility either. It would very likely mean a massive wave and a massive shift against Trump and Republicans tied to him nationally.

A lot would have to happen, especially in the Senate, for this to happen.

\n", "\n", "

How it would happen: The path for Democrats in the House is through the suburbs, as in Scenario 1. That doesn't change. But for Democrats to pull this off in the Senate, not only would voters have to side with Democratic incumbents in conservative states, but Democratic challengers would have to win in places like Nevada and Arizona, and possibly Tennessee and Texas.

What it would mean: It would be a repudiation of Trump and the Republicans tied to him nationwide. It would have to trigger a degree of soul-searching — in at least some Republican corners.

Trump would be faced with the choice of moderating and working with Democrats or being a lame-duck president starting in January 2019 when a new Democratic Congress is sworn in — as talk ramps up about Democratic 2020 challengers.

4. Overtime

It's very possible control of both the House and Senate will not be clear on election night.

\n", "
\n", "\"\"\n", "
\n", "Enlarge this image\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " A number of races are so close that it may not be possible to declare a winner on election night, leaving control of the House and Senate up in the air.\n", " \n", " \n", " \n", " Joe Sohm/Visions of America/UIG via Getty Images\n", " \n", " \n", "hide caption\n", "

\n", "
\n", "toggle caption\n", "
\n", "\n", " \n", " Joe Sohm/Visions of America/UIG via Getty Images\n", " \n", " \n", "
\n", "
\n", "
\n", "\"\"\n", "
\n", "
\n", "
\n", "
\n", "

A number of races are so close that it may not be possible to declare a winner on election night, leaving control of the House and Senate up in the air.

\n", "\n", " \n", " Joe Sohm/Visions of America/UIG via Getty Images\n", " \n", " \n", "
\n", "
\n", "
\n", "

How it would happen: There are a half-dozen congressional races in California, for example, that are very close heading into Election Day. It's possible those races are so close they will not be called on election night. They might not be called for days and possibly weeks later, especially because the vote there is counted slowly.

\n", "\n", "\n", "
\n", "\n", "

Additionally, early and absentee ballots can get counted slowly and there is growing concern that many voters' absentee and mailed ballots could be rejected. In 2016, to the surprise of many, 319,000 absentee ballots were rejected for one reason or another.

In the Senate, depending on how results from other races shake out, there is the possibility that control is not known on election night or for weeks after. Specifically, it could all come down to Mississippi. There, no candidate is polling above 50 percent heading into Election Day, and if no one gets at least 50 percent, the race heads to a runoff three weeks later.

What it would mean: Imagine a scenario in which Democrats lead 50-49 on Election Day in the Senate, and the eyes of the country — and the deep pockets of out-of-state money — descend on Mississippi. The consequences would be enormous, the rancor pitched and the tension thick.

\n", "
\n", "
\n", "\n", "
\n", "\n", "
\n", "
    \n", "
  • \n", "
  • \n", "
  • \n", "
  • \n", "
\n", "
\n", "\n", "
\n", "
\n", "\n", "
\n", "
\n", "\n", "
\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "
" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup = BeautifulSoup(response.content, \"html.parser\")\n", "soup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alright! That looks a little better; there's some whitespace formatting, adding some structure to our content! HTML code is structured by ``. Every tag has an opening and closing portion, denoted by ``< >`` and ````, respectively. If we want just the text (not the tags), we can use:" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'\\n try {var _sf_startpt=(new Date()).getTime();} catch(e){}\\n\\nWhat If The Polls Are Wrong Again? 4 Scenarios For What Might Happen In The Elections : NPR\\n window.NPR = window.NPR || {};\\nNPR.ServerConstants = {\"cbHost\":\"npr.org\",\"webHost\":\"https:\\\\/\\\\/www.npr.org\",\"embedHost\":\"https:\\\\/\\\\/www.npr.org\",\"webHostSecure\":\"https:\\\\/\\\\/secure.npr.org\",\"apiHost\":\"https:\\\\/\\\\/api.npr.org\",\"serverMediaCache\":\"https:\\\\/\\\\/media.npr.org\",\"googleAnalyticsAccount\":\"UA-5828686-4\",\"nielsenSFCode\":\"dcr\",\"nielsenAPN\":\"NPR-dcr\",\"shouldShowHPLocalContent\":true,\"readingServiceHostname\":\"https:\\\\/\\\\/reading.api.npr.org\"};\\nNPR.serverVars = {\"storyId\":\"664395755\",\"facebookAppId\":\"138837436154588\",\"webpackPublicPath\":\"https:\\\\/\\\\/s.npr.org\\\\/templates\\\\/javascript\\\\/dist\\\\/bundles\\\\/\",\"persistenceVersion\":\"e2193dbd58d7e71fdaffbd399767e8dc\",\"isBuildOut\":true,\"topicIds\":[\"P139482413\",\"1001\",\"1002\",\"1003\",\"1014\",\"1059\"],\"primaryTopic\":\"Elections\",\"topics\":[\"Elections\",\"News\",\"Home Page Top 
Stories\",\"National\",\"Politics\",\"Analysis\"],\"theme\":\"139482413\",\"aggIds\":[\"1001\",\"1002\",\"1003\",\"1014\",\"1059\",\"125950998\",\"125951073\",\"126931907\",\"126944326\",\"126953005\",\"127115490\",\"139482413\",\"162174434\",\"191676894\",\"219323468\",\"312150170\",\"360452518\",\"428799323\",\"432805936\",\"434975886\",\"497806639\",\"520216945\"],\"tagIds\":[\"2016\",\"2018\",\"2020\",\"Democrats\",\"GOP\",\"House\",\"Republican\",\"Senate\",\"election\",\"election day\",\"election night\",\"polling\",\"trump\"],\"byline\":[\"Domenico Montanaro\"],\"pubDate\":\"2018110516\",\"pageTypeId\":\"1\",\"title\":\"What If The Polls Are Wrong Again? 4 Scenarios For What Might Happen In The Elections\",\"publisherOrgId\":\"1\",\"rocketfuelCode\":20501671};\\n\\n\\n\\n\\n\\n\\n !function(a){function e(d){if(c[d])return c[d].exports;var f=c[d]={exports:{},id:d,loaded:!1};return a[d].call(f.exports,f,f.exports,e),f.loaded=!0,f.exports}var d=window.webpackJsonp;window.webpackJsonp=function(b,t){for(var n,r,o=0,i=[];o