Starting Up¶
You can connect to the saved database from last time if you want. Alternatively, for extra practice, you can just recreate it from the datasets provided in the .txt
files. That's what I'll do.
Comments¶
You should make a copy of this notebook and call it Lecture22_Exercises.ipynb
. Do all exercises in a single cell immediately after the Exercise statement, as you did in Lecture 21.
import sqlite3
import numpy as np
import pandas as pd
import time
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
db = sqlite3.connect('L22DB_ex.sqlite')
cursor = db.cursor()
cursor.execute("DROP TABLE IF EXISTS candidates")
cursor.execute("DROP TABLE IF EXISTS contributors")
cursor.execute("PRAGMA foreign_keys=1")
cursor.execute('''CREATE TABLE candidates (
id INTEGER PRIMARY KEY NOT NULL,
first_name TEXT,
last_name TEXT,
middle_init TEXT,
party TEXT NOT NULL)''')
db.commit() # Commit changes to the database
cursor.execute('''CREATE TABLE contributors (
id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
last_name TEXT,
first_name TEXT,
middle_name TEXT,
street_1 TEXT,
street_2 TEXT,
city TEXT,
state TEXT,
zip TEXT,
amount REAL,
date DATETIME,
candidate_id INTEGER NOT NULL,
FOREIGN KEY(candidate_id) REFERENCES candidates(id))''')
db.commit()
with open("candidates.txt") as candidates:
    next(candidates)  # skip the header line
    for line in candidates:
        cid, first_name, last_name, middle_name, party = line.strip().split('|')
        vals_to_insert = (int(cid), first_name, last_name, middle_name, party)
        cursor.execute('''INSERT INTO candidates
                          (id, first_name, last_name, middle_init, party)
                          VALUES (?, ?, ?, ?, ?)''', vals_to_insert)
db.commit()  # commit once after all the inserts
with open("contributors.txt") as contributors:
    next(contributors)  # skip the header line
    for line in contributors:
        (cid, last_name, first_name, middle_name, street_1, street_2,
         city, state, zip_code, amount, date, candidate_id) = line.strip().split('|')
        # Keep zip_code as a string: the zip column is TEXT, and converting to
        # int would strip leading zeros from some zip codes
        vals_to_insert = (last_name, first_name, middle_name, street_1, street_2,
                          city, state, zip_code, float(amount), date, int(candidate_id))
        cursor.execute('''INSERT INTO contributors (last_name, first_name, middle_name,
                          street_1, street_2, city, state, zip, amount, date, candidate_id)
                          VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)''', vals_to_insert)
db.commit()
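As an aside, the row-by-row `INSERT` loop can also be batched: `sqlite3`'s `cursor.executemany` takes an iterable of parameter tuples and inserts them in one call. A minimal self-contained sketch (the toy table and rows here are stand-ins, not the lecture data files):

```python
import sqlite3

# Hypothetical toy table standing in for the lecture data; the point is
# executemany, which runs the same INSERT once per parameter tuple.
demo_db = sqlite3.connect(":memory:")
demo_cur = demo_db.cursor()
demo_cur.execute("""CREATE TABLE candidates (
                        id INTEGER PRIMARY KEY NOT NULL,
                        last_name TEXT,
                        party TEXT NOT NULL)""")

rows = [(16, "Huckabee", "R"), (20, "Obama", "D"), (34, "Clinton", "D")]
demo_cur.executemany("INSERT INTO candidates (id, last_name, party) VALUES (?, ?, ?)", rows)
demo_db.commit()  # one commit for the whole batch

print(demo_cur.execute("SELECT COUNT(*) FROM candidates").fetchone()[0])  # 3
```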
candidate_cols = [col[1] for col in cursor.execute("PRAGMA table_info(candidates)")]
contributor_cols = [col[1] for col in cursor.execute("PRAGMA table_info(contributors)")]
def viz_tables(cols, query):
    q = cursor.execute(query).fetchall()
    framelist = dict()
    for i, col_name in enumerate(cols):
        framelist[col_name] = [col[i] for col in q]
    return pd.DataFrame.from_dict(framelist)
from IPython.display import display
Recap¶
Last time, you played with a bunch of SQLite
commands to query and update the tables in the database.
One thing we didn't get to was how to query the contributors table based on a query of the candidates table. For example, suppose you want to find which contributors donated to Obama. You could use a nested SELECT
statement to accomplish that.
query = '''SELECT * FROM contributors WHERE candidate_id = (SELECT id FROM candidates WHERE last_name = 'Obama')'''
viz_tables(contributor_cols, query)
id | last_name | first_name | middle_name | street_1 | street_2 | city | state | zip | amount | date | candidate_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 26 | Buckler | Steve | 24351 Armada Dr | Dana Point | CA | 926291306 | 50.00 | 2007-07-30 | 20 | ||
1 | 27 | Buckler | Steve | 24351 Armada Dr | Dana Point | CA | 926291306 | 25.00 | 2007-08-16 | 20 | ||
2 | 28 | Buckheit | Bruce | 8904 KAREN DR | FAIRFAX | VA | 220312731 | 100.00 | 2007-09-19 | 20 | ||
3 | 29 | Buckel | Linda | PO Box 683130 | Park City | UT | 840683130 | 2300.00 | 2007-08-14 | 20 | ||
4 | 30 | Buckel | Linda | PO Box 683130 | Park City | UT | 840683130 | -2300.00 | 2007-08-14 | 20 | ||
5 | 31 | Buckel | Linda | PO Box 683130 | Park City | UT | 840683130 | 4600.00 | 2007-08-14 | 20 | ||
6 | 32 | Buck | Thomas | 4206 Terrace Street | Kansas City | MO | 64111 | 100.00 | 2007-09-25 | 20 | ||
7 | 33 | Buck | Jay | K. | 1855 Old Willow Rd Unit 322 | Northfield | IL | 600932918 | 200.00 | 2007-09-12 | 20 | |
8 | 34 | Buck | Blaine | M | 45 Eaton Ave | Camden | ME | 48431752 | 2300.00 | 2007-09-30 | 20 | |
9 | 35 | Buck | Barbara | 1780 NE 138th St | North Miami | FL | 331811316 | 50.00 | 2007-09-13 | 20 | ||
10 | 36 | Buck | Barbara | 1780 NE 138th St | North Miami | FL | 331811316 | 50.00 | 2007-07-19 | 20 | ||
11 | 37 | Buchman | Mark M | 2530 Lawton Ave | San Luis Obispo | CA | 934015622 | 460.80 | 2007-07-18 | 20 | ||
12 | 38 | Bucher | Ida | M | 1400 Warnall Ave | Los Angeles | CA | 900245333 | 100.00 | 2007-07-10 | 20 | |
13 | 39 | Buchanek | Elizabeth | 7917 Kentbury Dr | Bethesda | MD | 208144615 | 50.00 | 2007-09-30 | 20 | ||
14 | 40 | Buchanan | John | 2025 NW 29th Rd | Boca Raton | FL | 334316303 | 500.00 | 2007-09-24 | 20 | ||
15 | 41 | Buchanan | John | 2025 NW 29th Rd | Boca Raton | FL | 334316303 | -500.00 | 2007-09-24 | 20 | ||
16 | 42 | Buchanan | John | 2025 NW 29th Rd | Boca Raton | FL | 334316303 | 500.00 | 2007-09-24 | 20 | ||
17 | 43 | Buchanan | John | 2025 NW 29th Rd | Boca Raton | FL | 334316303 | 700.00 | 2007-08-28 | 20 | ||
18 | 44 | Buchanan | John | 2025 NW 29th Rd | Boca Raton | FL | 334316303 | -700.00 | 2007-08-28 | 20 | ||
19 | 45 | Buchanan | John | 2025 NW 29th Rd | Boca Raton | FL | 334316303 | 1000.00 | 2007-08-28 | 20 | ||
20 | 46 | Buchanan | John | 2025 NW 29th Rd | Boca Raton | FL | 334316303 | 1300.00 | 2007-08-09 | 20 | ||
21 | 47 | Buchanan | John | 2025 NW 29th Rd | Boca Raton | FL | 334316303 | 200.00 | 2007-08-14 | 20 | ||
22 | 48 | Buchanan | John | 2025 NW 29th Rd | Boca Raton | FL | 334316303 | 500.00 | 2007-07-25 | 20 | ||
23 | 49 | Buchanan | John | 4635 49th St NW | Washington | DC | 200164320 | 200.09 | 2007-09-23 | 20 | ||
24 | 50 | Harrison | Ryan | 2247 3rd St | La Verne | CA | 917504918 | 25.00 | 2007-07-26 | 20 |
Joins¶
The last example involved querying data from multiple tables.
In particular, we combined columns from the two related tables (related through the FOREIGN KEY
).
This leads to the idea of joining multiple tables together. SQL
has a set of commands to handle different types of joins. SQLite
does not support the full suite of join commands offered by SQL
but you should still be able to get the main ideas from the limited command set.
We'll begin with the INNER JOIN
.
INNER JOIN¶
The idea here is that you combine the tables wherever the values of certain columns match between the two tables. In our example, we will join the two tables on the candidate id. The result of the INNER JOIN
will be a new table consisting of the columns we requested and containing the common data. Since every contributor in our data has a matching candidate id, no rows will be excluded in this particular join.
Example¶
Here are two tables. Table A has the form:
nA | attr | idA |
---|---|---|
s1 | 23 | 0 |
s2 | 7 | 2 |
and table B has the form:
nB | attr | idB |
---|---|---|
t1 | 60 | 0 |
t2 | 14 | 7 |
t3 | 22 | 2 |
Table A is associated with Table B through a foreign key on the id column.
If we join the two tables by comparing the id columns and selecting the nA, nB, and attr columns then we'll get
nA | A.attr | nB | B.attr |
---|---|---|---|
s1 | 23 | t1 | 60 |
s2 | 7 | t3 | 22 |
The SQLite
code to do this join would be
SELECT nA, A.attr, nB, B.attr FROM A INNER JOIN B ON B.idB = A.idA
Notice that the second row of table B is gone because its id (7) has no match in table A.
Thoughts¶
What is SQL
doing with this operation? It may help to visualize this with a Venn diagram. Table A has rows with values corresponding to the idA
attribute. Table B has rows with values corresponding to the idB
attribute. The INNER JOIN
will combine the two tables such that rows with common entries in the id
attributes are included. We essentially have the following Venn diagram.
Exercises¶
- Using an
INNER JOIN
, join the candidates and contributors tables by comparing the contributors.candidate_id
and candidates.id
columns. Display your joined table with the columns contributors.last_name
, contributors.first_name
, and candidates.last_name
. - Do the same inner join as in the last part, but this time append a
WHERE
clause to select a specific candidate's last name.
# Inner Join
cols = ["contributors.last_name", "contributors.first_name", "contributors.amount", "candidates.last_name"]
query = '''SELECT contributors.last_name, contributors.first_name, amount, candidates.last_name
FROM contributors
INNER JOIN candidates
ON candidates.id = contributors.candidate_id'''
display(viz_tables(cols, query))
# Inner Join and select last name
cols = ["contributors.last_name", "contributors.first_name", "contributors.amount", "candidates.last_name"]
query = '''SELECT contributors.last_name, contributors.first_name, amount, candidates.last_name
FROM contributors
INNER JOIN candidates
ON candidates.id = contributors.candidate_id
WHERE candidates.last_name = 'Obama' '''
display(viz_tables(cols, query))
contributors.last_name | contributors.first_name | contributors.amount | candidates.last_name | |
---|---|---|---|---|
0 | Agee | Steven | 500.0 | Huckabee |
1 | Ahrens | Don | 250.0 | Huckabee |
2 | Ahrens | Don | 50.0 | Huckabee |
3 | Ahrens | Don | 100.0 | Huckabee |
4 | Akin | Charles | 100.0 | Huckabee |
... | ... | ... | ... | ... |
170 | ABESHAUS | MERRILL | 120.0 | McCain |
171 | ABRAHAM | GEORGE | 800.0 | McCain |
172 | ABRAHAMSON | PETER | 50.0 | McCain |
173 | ABRAHAM | SALEM | 1000.0 | McCain |
174 | ABRAHAM | SALEM | 1300.0 | McCain |
175 rows × 4 columns
contributors.last_name | contributors.first_name | contributors.amount | candidates.last_name | |
---|---|---|---|---|
0 | Buckler | Steve | 50.00 | Obama |
1 | Buckler | Steve | 25.00 | Obama |
2 | Buckheit | Bruce | 100.00 | Obama |
3 | Buckel | Linda | 2300.00 | Obama |
4 | Buckel | Linda | -2300.00 | Obama |
5 | Buckel | Linda | 4600.00 | Obama |
6 | Buck | Thomas | 100.00 | Obama |
7 | Buck | Jay | 200.00 | Obama |
8 | Buck | Blaine | 2300.00 | Obama |
9 | Buck | Barbara | 50.00 | Obama |
10 | Buck | Barbara | 50.00 | Obama |
11 | Buchman | Mark M | 460.80 | Obama |
12 | Bucher | Ida | 100.00 | Obama |
13 | Buchanek | Elizabeth | 50.00 | Obama |
14 | Buchanan | John | 500.00 | Obama |
15 | Buchanan | John | -500.00 | Obama |
16 | Buchanan | John | 500.00 | Obama |
17 | Buchanan | John | 700.00 | Obama |
18 | Buchanan | John | -700.00 | Obama |
19 | Buchanan | John | 1000.00 | Obama |
20 | Buchanan | John | 1300.00 | Obama |
21 | Buchanan | John | 200.00 | Obama |
22 | Buchanan | John | 500.00 | Obama |
23 | Buchanan | John | 200.09 | Obama |
24 | Harrison | Ryan | 25.00 | Obama |
LEFT JOIN
or LEFT OUTER JOIN
¶
There are many ways to combine two tables. We just explored one possibility in which we combined the tables based upon the intersection of the two tables (the INNER JOIN
).
Now we'll talk about the LEFT JOIN
or LEFT OUTER JOIN
.
In words, the LEFT JOIN
keeps every row of the "reference" (left) table and fills in matching rows from the other table where they exist.
We can consider our toy example in two guises:
Example A¶
Let's do a LEFT JOIN
with table A as the reference. That is, we'd like to make a new table by bringing the rows of table B into table A. We're comparing by the id
columns again. We know that these two tables share ids 0 and 2 and table A doesn't have anything else in it. The resulting table is:
nA | A.attr | nB | B.attr |
---|---|---|---|
s1 | 23 | t1 | 60 |
s2 | 7 | t3 | 22 |
That's not very exciting. It's the same result as from the INNER JOIN
. We can do another example that may be more enlightening.
Example B¶
Let's do a LEFT JOIN
with table B as the reference. That is, we'd like to make a new table by bringing the rows of table A into table B. Again, we use the id
columns for comparison. We know that these two tables share ids 0 and 2. This time, table B also contains the id 7, which is not shared by table A. The resulting table is:
nA | A.attr | nB | B.attr |
---|---|---|---|
s1 | 23 | t1 | 60 |
None | NaN | t2 | 14 |
s2 | 7 | t3 | 22 |
Notice that SQLite
filled in the missing entries with NULL (rendered here as None/NaN). This is necessary to complete the requested join.
The SQLite
commands to accomplish all of this are:
SELECT nA, A.attr, nB, B.attr FROM A LEFT JOIN B ON B.idB = A.idA
and
SELECT nA, A.attr, nB, B.attr FROM B LEFT JOIN A ON A.idA = B.idB
Here is a visualization using Venn diagrams of the LEFT JOIN
.
Exercises¶
Use the following two tables to do the first two exercises in this section. Table A has the form:
nA | attr | idA |
---|---|---|
s1 | 23 | 0 |
s2 | 7 | 2 |
s3 | 15 | 2 |
s4 | 31 | 0 |
and table B has the form:
nB | attr | idB |
---|---|---|
t1 | 60 | 0 |
t2 | 14 | 7 |
t3 | 22 | 2 |
- Draw the table that would result from a
LEFT JOIN
using table A as the reference and theid
columns for comparison.
A.nA | A.attr | B.nB | B.attr |
---|---|---|---|
s1 | 23 | t1 | 60 |
s2 | 7 | t3 | 22 |
s3 | 15 | t3 | 22 |
s4 | 31 | t1 | 60 |
- Draw the table that would result from a
LEFT JOIN
using table B as the reference and theid
columns for comparison.
B.nB | B.attr | A.nA | A.attr |
---|---|---|---|
t1 | 60 | s1 | 23 |
t1 | 60 | s4 | 31 |
t2 | 14 | None | NaN
t3 | 22 | s2 | 7 |
t3 | 22 | s3 | 15 |
- Now back to the candidates and their contributors. Create a new table with the following form:
average contribution | candidate last name |
---|---|
... | ... |
The table should be created using the LEFT JOIN
clause, with the contributors table as the reference, joined to the candidates table on the id
column. The average contribution
column should be obtained using the AVG
SQL
function. Use the GROUP BY
clause on the candidates last name.
q = '''SELECT AVG(contributors.amount), candidates.last_name
FROM contributors LEFT JOIN candidates
ON contributors.candidate_id = candidates.id
WHERE contributors.amount > 0
GROUP BY candidates.last_name'''
display(viz_tables(["Average Contribution", "candidates.last_name"], q))
Average Contribution | candidates.last_name | |
---|---|---|
0 | 170.217391 | Clinton |
1 | 772.000000 | Huckabee |
2 | 452.600000 | McCain |
3 | 695.949545 | Obama |
4 | 290.909091 | Romney |
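The two toy-table LEFT JOIN answers above can be checked directly in SQLite. A minimal sketch using in-memory tables named as in the example:

```python
import sqlite3

# Build the toy tables A and B in an in-memory database and run both
# LEFT JOINs to confirm the hand-drawn answers.
conn = sqlite3.connect(":memory:")
cu = conn.cursor()
cu.execute("CREATE TABLE A (nA TEXT, attr INTEGER, idA INTEGER)")
cu.execute("CREATE TABLE B (nB TEXT, attr INTEGER, idB INTEGER)")
cu.executemany("INSERT INTO A VALUES (?, ?, ?)",
               [("s1", 23, 0), ("s2", 7, 2), ("s3", 15, 2), ("s4", 31, 0)])
cu.executemany("INSERT INTO B VALUES (?, ?, ?)",
               [("t1", 60, 0), ("t2", 14, 7), ("t3", 22, 2)])

# A as the reference table: every row of A appears in the result.
left_a = cu.execute("""SELECT nA, A.attr, nB, B.attr
                       FROM A LEFT JOIN B ON B.idB = A.idA""").fetchall()
# B as the reference table: t2 has no match, so its A columns come back NULL.
left_b = cu.execute("""SELECT nB, B.attr, nA, A.attr
                       FROM B LEFT JOIN A ON A.idA = B.idB""").fetchall()
print(left_a)  # 4 rows
print(left_b)  # 5 rows, including ('t2', 14, None, None)
```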
pandas
¶
We've been working with databases for the last few lectures and learning SQLite
commands to work with and manipulate the databases. There is a Python
package called pandas
that provides rich support for working with tabular data. It can interact with relational databases through its own methods and even through SQL
commands.
In the last part of this lecture, you will get to redo a bunch of the database exercises using pandas
.
We won't be able to cover pandas
from the ground up, but it's a well-documented library and is fairly easy to get up and running. Here's the website: pandas
.
Reading a datafile into pandas
¶
# Using pandas naming convention
dfcand = pd.read_csv("candidates.txt", sep="|")
dfcand
id | first_name | last_name | middle_name | party | |
---|---|---|---|---|---|
0 | 33 | Joseph | Biden | NaN | D |
1 | 36 | Samuel | Brownback | NaN | R |
2 | 34 | Hillary | Clinton | R. | D |
3 | 39 | Christopher | Dodd | J. | D |
4 | 26 | John | Edwards | NaN | D |
5 | 22 | Rudolph | Giuliani | NaN | R |
6 | 24 | Mike | Gravel | NaN | D |
7 | 16 | Mike | Huckabee | NaN | R |
8 | 30 | Duncan | Hunter | NaN | R |
9 | 31 | Dennis | Kucinich | NaN | D |
10 | 37 | John | McCain | NaN | R |
11 | 20 | Barack | Obama | NaN | D |
12 | 32 | Ron | Paul | NaN | R |
13 | 29 | Bill | Richardson | NaN | D |
14 | 35 | Mitt | Romney | NaN | R |
15 | 38 | Tom | Tancredo | NaN | R |
16 | 41 | Fred | Thompson | D. | R |
dfcontr = pd.read_csv("contributors.txt", sep="|")
dfcontr
id | last_name | first_name | middle_name | street_1 | street_2 | city | state | zip | amount | date | candidate_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | Agee | Steven | NaN | 549 Laurel Branch Road | NaN | Floyd | VA | 24091 | 500.0 | 2007-06-30 | 16 |
1 | NaN | Ahrens | Don | NaN | 4034 Rennellwood Way | NaN | Pleasanton | CA | 94566 | 250.0 | 2007-05-16 | 16 |
2 | NaN | Ahrens | Don | NaN | 4034 Rennellwood Way | NaN | Pleasanton | CA | 94566 | 50.0 | 2007-06-18 | 16 |
3 | NaN | Ahrens | Don | NaN | 4034 Rennellwood Way | NaN | Pleasanton | CA | 94566 | 100.0 | 2007-06-21 | 16 |
4 | NaN | Akin | Charles | NaN | 10187 Sugar Creek Road | NaN | Bentonville | AR | 72712 | 100.0 | 2007-06-16 | 16 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
170 | NaN | ABESHAUS | MERRILL | M. | 1801 N. HEREFORD DRIVE | NaN | FLAGSTAFF | AZ | 860011121 | 120.0 | 2008-01-16 | 37 |
171 | NaN | ABRAHAM | GEORGE | NaN | P.O. BOX 1504 | NaN | LAKE CHARLES | LA | 706021504 | 800.0 | 2008-01-17 | 37 |
172 | NaN | ABRAHAMSON | PETER | J. | 1030 W. ROSCOE STREET | NaN | CHICAGO | IL | 606572207 | 50.0 | 2008-01-25 | 37 |
173 | NaN | ABRAHAM | SALEM | A. | P.O. BOX 7 | NaN | CANADIAN | TX | 790140007 | 1000.0 | 2008-01-17 | 37 |
174 | NaN | ABRAHAM | SALEM | A. | P.O. BOX 7 | NaN | CANADIAN | TX | 790140007 | 1300.0 | 2008-01-30 | 37 |
175 rows × 12 columns
Reading things in is quite easy with pandas
.
Notice that pandas
populates empty fields with NaN
values.
The id
column in the contributors dataset is superfluous. Let's delete it.
del dfcontr['id']
dfcontr.head()
last_name | first_name | middle_name | street_1 | street_2 | city | state | zip | amount | date | candidate_id | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Agee | Steven | NaN | 549 Laurel Branch Road | NaN | Floyd | VA | 24091 | 500.0 | 2007-06-30 | 16 |
1 | Ahrens | Don | NaN | 4034 Rennellwood Way | NaN | Pleasanton | CA | 94566 | 250.0 | 2007-05-16 | 16 |
2 | Ahrens | Don | NaN | 4034 Rennellwood Way | NaN | Pleasanton | CA | 94566 | 50.0 | 2007-06-18 | 16 |
3 | Ahrens | Don | NaN | 4034 Rennellwood Way | NaN | Pleasanton | CA | 94566 | 100.0 | 2007-06-21 | 16 |
4 | Akin | Charles | NaN | 10187 Sugar Creek Road | NaN | Bentonville | AR | 72712 | 100.0 | 2007-06-16 | 16 |
Very nice! And we used the head
method to print out the first five rows.
Creating a Table with pandas
¶
We can use pandas
to create tables in a database.
First, let's create a new database since we've already done a lot on our test database.
dbp = sqlite3.connect('L22_pandas_DB.sqlite')
csr = dbp.cursor()
csr.execute("DROP TABLE IF EXISTS candidates")
csr.execute("DROP TABLE IF EXISTS contributors")
csr.execute("PRAGMA foreign_keys=1")
csr.execute('''CREATE TABLE candidates (
id INTEGER PRIMARY KEY NOT NULL,
first_name TEXT,
last_name TEXT,
middle_name TEXT,
party TEXT NOT NULL)''')
dbp.commit() # Commit changes to the database
csr.execute('''CREATE TABLE contributors (
id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
last_name TEXT,
first_name TEXT,
middle_name TEXT,
street_1 TEXT,
street_2 TEXT,
city TEXT,
state TEXT,
zip TEXT,
amount REAL,
date DATETIME,
candidate_id INTEGER NOT NULL,
FOREIGN KEY(candidate_id) REFERENCES candidates(id))''')
dbp.commit()
Last time, we opened the data files with Python
and then manually used SQLite
commands to populate the individual tables. We can use pandas
instead like so.
dfcand.to_sql("candidates", dbp, if_exists="append", index=False)
How big is our table?
dfcand.shape
(17, 5)
We can visualize the data in our pandas
-populated table. No surprises here except that pandas
did everything for us.
query = '''SELECT * FROM candidates'''
csr.execute(query).fetchall()
[(16, 'Mike', 'Huckabee', None, 'R'), (20, 'Barack', 'Obama', None, 'D'), (22, 'Rudolph', 'Giuliani', None, 'R'), (24, 'Mike', 'Gravel', None, 'D'), (26, 'John', 'Edwards', None, 'D'), (29, 'Bill', 'Richardson', None, 'D'), (30, 'Duncan', 'Hunter', None, 'R'), (31, 'Dennis', 'Kucinich', None, 'D'), (32, 'Ron', 'Paul', None, 'R'), (33, 'Joseph', 'Biden', None, 'D'), (34, 'Hillary', 'Clinton', 'R.', 'D'), (35, 'Mitt', 'Romney', None, 'R'), (36, 'Samuel', 'Brownback', None, 'R'), (37, 'John', 'McCain', None, 'R'), (38, 'Tom', 'Tancredo', None, 'R'), (39, 'Christopher', 'Dodd', 'J.', 'D'), (41, 'Fred', 'Thompson', 'D.', 'R')]
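Instead of going through the cursor and fetchall, pandas can also pull a query result straight into a DataFrame with pd.read_sql_query. A small self-contained sketch (a hypothetical in-memory database stands in for the one created above):

```python
import sqlite3
import pandas as pd

# A toy in-memory database stands in for L22_pandas_DB.sqlite; the technique
# is pd.read_sql_query, which returns the query result as a DataFrame.
demo_db = sqlite3.connect(":memory:")
demo_db.execute("CREATE TABLE candidates (id INTEGER PRIMARY KEY, last_name TEXT, party TEXT)")
demo_db.executemany("INSERT INTO candidates VALUES (?, ?, ?)",
                    [(16, "Huckabee", "R"), (20, "Obama", "D")])

df = pd.read_sql_query("SELECT * FROM candidates ORDER BY id", demo_db)
print(df)
```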
Querying a table with pandas
¶
One Way¶
dfcand.query("first_name=='Mike' & party=='D'")
id | first_name | last_name | middle_name | party | |
---|---|---|---|---|---|
6 | 24 | Mike | Gravel | NaN | D |
Another Way¶
dfcand[(dfcand.first_name=="Mike") & (dfcand.party=="D")]
id | first_name | last_name | middle_name | party | |
---|---|---|---|---|---|
6 | 24 | Mike | Gravel | NaN | D |
More Queries¶
dfcand[dfcand.middle_name.notnull()]
id | first_name | last_name | middle_name | party | |
---|---|---|---|---|---|
2 | 34 | Hillary | Clinton | R. | D |
3 | 39 | Christopher | Dodd | J. | D |
16 | 41 | Fred | Thompson | D. | R |
dfcand[dfcand.first_name.isin(['Mike', 'Hillary'])]
id | first_name | last_name | middle_name | party | |
---|---|---|---|---|---|
2 | 34 | Hillary | Clinton | R. | D |
6 | 24 | Mike | Gravel | NaN | D |
7 | 16 | Mike | Huckabee | NaN | R |
Exercises¶
- Use
pandas
to populate the contributors table. - Query the contributors tables with the following:
- List entries where the state is "VA" and the amount is less than $\$400.00$.
- List entries where the state is NULL.
- List entries for the states of Texas and Pennsylvania.
- List entries where the amount contributed is between $\$10.00$ and $\$50.00$.
# Use pandas to populate contributors table
dfcontr.to_sql("contributors", dbp, if_exists="append", index=False)
# List entries where state is "VA" and amount is less than $400.00
display(dfcontr.query("state=='VA' & amount<400.00"))
# List entries with NULL
display(dfcontr[dfcontr.state.isnull()])
# TX and PA
display(dfcontr[dfcontr.state.isin(['TX', 'PA'])])
# Contributions between $10 and $50
display(dfcontr.query("10 < amount < 50"))
last_name | first_name | middle_name | street_1 | street_2 | city | state | zip | amount | date | candidate_id | |
---|---|---|---|---|---|---|---|---|---|---|---|
27 | Buckheit | Bruce | NaN | 8904 KAREN DR | NaN | FAIRFAX | VA | 220312731 | 100.00 | 2007-09-19 | 20 |
77 | Ranganath | Anoop | NaN | 2507 Willard Drive | NaN | Charlottesville | VA | 22903 | -100.00 | 2008-04-21 | 32 |
88 | Perreault | Louise | NaN | 503 Brockridge Hunt Drive | NaN | Hampton | VA | 23666 | -34.08 | 2008-04-21 | 32 |
145 | ABDELLA | THOMAS | M. | 4231 MONUMENT WALL WAY #340 | NaN | FAIRFAX | VA | 220308440 | 50.00 | 2007-09-30 | 35 |
last_name | first_name | middle_name | street_1 | street_2 | city | state | zip | amount | date | candidate_id | |
---|---|---|---|---|---|---|---|---|---|---|---|
125 | BOURNE | TRAVIS | NaN | LAGE KAART 77 | NaN | BRASSCHATT | NaN | 2930 | -500.0 | 2008-11-20 | 35 |
last_name | first_name | middle_name | street_1 | street_2 | city | state | zip | amount | date | candidate_id | |
---|---|---|---|---|---|---|---|---|---|---|---|
65 | BURGERT | RONALD | L. | 5723 PLUMTREE DRIVE | NaN | DALLAS | TX | 752524926 | -1000.00 | 2008-03-05 | 22 |
70 | BUCKLEY | WALTER | W. | 1635 COUNTRY ROAD | NaN | BETHLEHEM | PA | 180155718 | -100.00 | 2008-03-05 | 22 |
71 | BUCKLEY | MARJORIE | B. | 1635 COUNTRY ROAD | NaN | BETHLEHEM | PA | 180155718 | -100.00 | 2008-03-05 | 22 |
79 | Muse | Nina | Jo | 2915 Toro Canyon Rd | NaN | Austin | TX | 78746 | -50.00 | 2008-04-21 | 32 |
93 | Raught | Philip | M | 4714 Plum Way | NaN | Pittsburgh | PA | 15201 | -1046.00 | 2008-04-21 | 32 |
94 | Ferrara | Judith | D | 1508 Waterford Road | NaN | Yardley | PA | 19067 | -1100.00 | 2008-04-21 | 32 |
95 | Johnson | Cathleen | E. | 1003 Justin Ln Apt 2016 | NaN | Austin | TX | 787572648 | -14.76 | 2008-04-21 | 32 |
132 | ABREU | KEVIN | M. | 1305 GARDEN GLEN LANE | NaN | PEARLAND | TX | 775816547 | 50.00 | 2007-09-30 | 35 |
133 | ABREU | KEVIN | M. | 1305 GARDEN GLEN LANE | NaN | PEARLAND | TX | 775816547 | 150.00 | 2007-08-09 | 35 |
134 | ABREU | KEVIN | M. | 1305 GARDEN GLEN LANE | NaN | PEARLAND | TX | 775816547 | 50.00 | 2007-07-19 | 35 |
165 | ABEL | JOHN | H. | 422 THOMAS STREET | NaN | BETHLEHEM | PA | 180153316 | 200.00 | 2008-01-22 | 37 |
169 | ABERCROMBIE | DENIS | NaN | 11811 WATER OAK CT | NaN | MAGNOLIA | TX | 773546270 | 500.00 | 2008-01-30 | 37 |
173 | ABRAHAM | SALEM | A. | P.O. BOX 7 | NaN | CANADIAN | TX | 790140007 | 1000.00 | 2008-01-17 | 37 |
174 | ABRAHAM | SALEM | A. | P.O. BOX 7 | NaN | CANADIAN | TX | 790140007 | 1300.00 | 2008-01-30 | 37 |
last_name | first_name | middle_name | street_1 | street_2 | city | state | zip | amount | date | candidate_id | |
---|---|---|---|---|---|---|---|---|---|---|---|
26 | Buckler | Steve | NaN | 24351 Armada Dr | NaN | Dana Point | CA | 926291306 | 25.0 | 2007-08-16 | 20 |
49 | Harrison | Ryan | NaN | 2247 3rd St | NaN | La Verne | CA | 917504918 | 25.0 | 2007-07-26 | 20 |
101 | Aarons | Elaine | NaN | 481 Buck Island Rd Apt 17A | APT 17A | West Yarmouth | MA | 26733300 | 25.0 | 2008-02-26 | 34 |
140 | ABEGG | PATRICIA | T. | 1862 E. 5150 S. | NaN | SALT LAKE CITY | UT | 841176911 | 25.0 | 2007-09-17 | 35 |
143 | ABEGG | PATRICIA | T. | 1862 E. 5150 S. | NaN | SALT LAKE CITY | UT | 841176911 | 25.0 | 2007-08-06 | 35 |
144 | ABEGG | PATRICIA | T. | 1862 E. 5150 S. | NaN | SALT LAKE CITY | UT | 841176911 | 25.0 | 2007-07-10 | 35 |
158 | ABBO | PAULINE | MORENCY | 10720 JACOB LANE | NaN | WHITE LAKE | MI | 483862274 | 35.0 | 2008-01-07 | 37 |
160 | ABAIR | PETER | NaN | 40 EVANS STREET | NaN | WATERTOWN | MA | 24722150 | 25.0 | 2008-01-09 | 37 |
Sorting¶
dfcand.sort_values(by='party')
id | first_name | last_name | middle_name | party | |
---|---|---|---|---|---|
0 | 33 | Joseph | Biden | NaN | D |
2 | 34 | Hillary | Clinton | R. | D |
3 | 39 | Christopher | Dodd | J. | D |
4 | 26 | John | Edwards | NaN | D |
13 | 29 | Bill | Richardson | NaN | D |
6 | 24 | Mike | Gravel | NaN | D |
11 | 20 | Barack | Obama | NaN | D |
9 | 31 | Dennis | Kucinich | NaN | D |
14 | 35 | Mitt | Romney | NaN | R |
12 | 32 | Ron | Paul | NaN | R |
8 | 30 | Duncan | Hunter | NaN | R |
15 | 38 | Tom | Tancredo | NaN | R |
7 | 16 | Mike | Huckabee | NaN | R |
5 | 22 | Rudolph | Giuliani | NaN | R |
1 | 36 | Samuel | Brownback | NaN | R |
10 | 37 | John | McCain | NaN | R |
16 | 41 | Fred | Thompson | D. | R |
dfcand.sort_values(by='party', ascending=False)
id | first_name | last_name | middle_name | party | |
---|---|---|---|---|---|
8 | 30 | Duncan | Hunter | NaN | R |
7 | 16 | Mike | Huckabee | NaN | R |
15 | 38 | Tom | Tancredo | NaN | R |
14 | 35 | Mitt | Romney | NaN | R |
12 | 32 | Ron | Paul | NaN | R |
10 | 37 | John | McCain | NaN | R |
1 | 36 | Samuel | Brownback | NaN | R |
16 | 41 | Fred | Thompson | D. | R |
5 | 22 | Rudolph | Giuliani | NaN | R |
6 | 24 | Mike | Gravel | NaN | D |
9 | 31 | Dennis | Kucinich | NaN | D |
11 | 20 | Barack | Obama | NaN | D |
4 | 26 | John | Edwards | NaN | D |
13 | 29 | Bill | Richardson | NaN | D |
3 | 39 | Christopher | Dodd | J. | D |
2 | 34 | Hillary | Clinton | R. | D |
0 | 33 | Joseph | Biden | NaN | D |
Selecting Columns¶
dfcand[['last_name', 'party']]
last_name | party | |
---|---|---|
0 | Biden | D |
1 | Brownback | R |
2 | Clinton | D |
3 | Dodd | D |
4 | Edwards | D |
5 | Giuliani | R |
6 | Gravel | D |
7 | Huckabee | R |
8 | Hunter | R |
9 | Kucinich | D |
10 | McCain | R |
11 | Obama | D |
12 | Paul | R |
13 | Richardson | D |
14 | Romney | R |
15 | Tancredo | R |
16 | Thompson | R |
dfcand[['last_name', 'party']].count()
last_name 17 party 17 dtype: int64
dfcand[['first_name']].drop_duplicates()
first_name | |
---|---|
0 | Joseph |
1 | Samuel |
2 | Hillary |
3 | Christopher |
4 | John |
5 | Rudolph |
6 | Mike |
8 | Duncan |
9 | Dennis |
11 | Barack |
12 | Ron |
13 | Bill |
14 | Mitt |
15 | Tom |
16 | Fred |
dfcand[['first_name']].drop_duplicates().count()
first_name 15 dtype: int64
Exercises¶
- Sort the contributors table by
amount
and order in descending order. - Select the
first_name
andamount
columns. - Select the
last_name
andfirst_name
columns and drop duplicates. - Count how many there are after the duplicates have been dropped.
Altering Tables¶
Creating a new column is quite easy with pandas
.
dfcand['name'] = dfcand['last_name'] + ", " + dfcand['first_name']
dfcand
id | first_name | last_name | middle_name | party | name | |
---|---|---|---|---|---|---|
0 | 33 | Joseph | Biden | NaN | D | Biden, Joseph |
1 | 36 | Samuel | Brownback | NaN | R | Brownback, Samuel |
2 | 34 | Hillary | Clinton | R. | D | Clinton, Hillary |
3 | 39 | Christopher | Dodd | J. | D | Dodd, Christopher |
4 | 26 | John | Edwards | NaN | D | Edwards, John |
5 | 22 | Rudolph | Giuliani | NaN | R | Giuliani, Rudolph |
6 | 24 | Mike | Gravel | NaN | D | Gravel, Mike |
7 | 16 | Mike | Huckabee | NaN | R | Huckabee, Mike |
8 | 30 | Duncan | Hunter | NaN | R | Hunter, Duncan |
9 | 31 | Dennis | Kucinich | NaN | D | Kucinich, Dennis |
10 | 37 | John | McCain | NaN | R | McCain, John |
11 | 20 | Barack | Obama | NaN | D | Obama, Barack |
12 | 32 | Ron | Paul | NaN | R | Paul, Ron |
13 | 29 | Bill | Richardson | NaN | D | Richardson, Bill |
14 | 35 | Mitt | Romney | NaN | R | Romney, Mitt |
15 | 38 | Tom | Tancredo | NaN | R | Tancredo, Tom |
16 | 41 | Fred | Thompson | D. | R | Thompson, Fred |
We can change an existing field as well.
dfcand.loc[dfcand.first_name == "Mike", "name"]
6 Gravel, Mike 7 Huckabee, Mike Name: name, dtype: object
dfcand.loc[dfcand.first_name == "Mike", "name"] = "Mikey"
dfcand.query("first_name == 'Mike'")
id | first_name | last_name | middle_name | party | name | |
---|---|---|---|---|---|---|
6 | 24 | Mike | Gravel | NaN | D | Mikey |
7 | 16 | Mike | Huckabee | NaN | R | Mikey |
You may recall that SQLite
historically had no way to drop a column (ALTER TABLE ... DROP COLUMN was only added in SQLite 3.35). It's a one-liner with pandas
.
del dfcand['name']
dfcand
id | first_name | last_name | middle_name | party | |
---|---|---|---|---|---|
0 | 33 | Joseph | Biden | NaN | D |
1 | 36 | Samuel | Brownback | NaN | R |
2 | 34 | Hillary | Clinton | R. | D |
3 | 39 | Christopher | Dodd | J. | D |
4 | 26 | John | Edwards | NaN | D |
5 | 22 | Rudolph | Giuliani | NaN | R |
6 | 24 | Mike | Gravel | NaN | D |
7 | 16 | Mike | Huckabee | NaN | R |
8 | 30 | Duncan | Hunter | NaN | R |
9 | 31 | Dennis | Kucinich | NaN | D |
10 | 37 | John | McCain | NaN | R |
11 | 20 | Barack | Obama | NaN | D |
12 | 32 | Ron | Paul | NaN | R |
13 | 29 | Bill | Richardson | NaN | D |
14 | 35 | Mitt | Romney | NaN | R |
15 | 38 | Tom | Tancredo | NaN | R |
16 | 41 | Fred | Thompson | D. | R |
Exercises¶
- Create a name column for the contributors table with field entries of the form "last name, first name"
- For contributors from the state of "PA", change the name to "X".
- Delete the newly created name column.
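A possible solution sketch, again on a small hypothetical stand-in for the contributors DataFrame:

```python
import pandas as pd

# Small stand-in for dfcontr.
toy = pd.DataFrame({
    "last_name": ["Abel", "Agee"],
    "first_name": ["John", "Steven"],
    "state": ["PA", "VA"],
})

# 1. Create a name column of the form "last name, first name".
toy["name"] = toy["last_name"] + ", " + toy["first_name"]
# 2. For contributors from PA, change the name to "X".
toy.loc[toy.state == "PA", "name"] = "X"
names = toy["name"].tolist()
print(names)  # ['X', 'Agee, Steven']
# 3. Delete the newly created column.
del toy["name"]
```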
Aggregation¶
We'd like to get information about the tables such as the maximum amount contributed to the candidates. Here are a bunch of ways to describe the tables.
dfcand.describe()
id | |
---|---|
count | 17.000000 |
mean | 30.764706 |
std | 7.110990 |
min | 16.000000 |
25% | 26.000000 |
50% | 32.000000 |
75% | 36.000000 |
max | 41.000000 |
It's not very interesting for the candidates table because it has only one numeric column.
Exercise¶
Use the describe()
method on the contributors
table.
I'll use the contributors table to do some demos now.
dfcontr.amount.max()
4600.0
dfcontr[dfcontr.amount==dfcontr.amount.max()]
last_name | first_name | middle_name | street_1 | street_2 | city | state | zip | amount | date | candidate_id | |
---|---|---|---|---|---|---|---|---|---|---|---|
30 | Buckel | Linda | NaN | PO Box 683130 | NaN | Park City | UT | 840683130 | 4600.0 | 2007-08-14 | 20 |
dfcontr.groupby("state").sum()
zip | amount | candidate_id | |
---|---|---|---|
state | |||
AK | 2985459621 | 1210.00 | 111 |
AR | 864790 | 14200.00 | 192 |
AZ | 860011121 | 120.00 | 37 |
CA | 14736360720 | -5013.73 | 600 |
CO | 2405477834 | -5823.00 | 111 |
CT | 68901376 | 2300.00 | 35 |
DC | 800341853 | -1549.91 | 102 |
FL | 8970626520 | -4050.00 | 803 |
IA | 50266 | 250.00 | 16 |
ID | 83648 | -261.00 | 32 |
IL | 3042068689 | -5586.80 | 175 |
KS | 66215 | -330.00 | 32 |
KY | 402597029 | -200.00 | 22 |
LA | 1406043327 | 1300.00 | 74 |
MA | 123026638 | -83.00 | 208 |
MD | 416287617 | 300.00 | 55 |
ME | 165647170 | 2520.00 | 122 |
MI | 2426973485 | -1265.00 | 164 |
MN | 1102338918 | 322.00 | 100 |
MO | 64111 | 100.00 | 20 |
NC | 27502 | 500.00 | 16 |
NH | 32564424 | -24.60 | 32 |
NJ | 70254993 | -817.45 | 64 |
NV | 3575889763 | 725.00 | 144 |
NY | 606129991 | -6474.50 | 233 |
OH | 176071 | 450.00 | 80 |
OK | 2202499044 | 800.00 | 102 |
PA | 540499020 | -2146.00 | 145 |
RI | 58065892 | 200.00 | 70 |
SC | 296214789 | 2400.00 | 69 |
TN | 37188 | -25.00 | 32 |
TX | 6221452245 | 1985.24 | 302 |
UT | 9251153394 | 5050.00 | 340 |
VA | 440691831 | 515.92 | 135 |
WA | 2941290251 | -500.00 | 90 |
dfcontr.groupby("state")["amount"].sum()
state AK 1210.00 AR 14200.00 AZ 120.00 CA -5013.73 CO -5823.00 CT 2300.00 DC -1549.91 FL -4050.00 IA 250.00 ID -261.00 IL -5586.80 KS -330.00 KY -200.00 LA 1300.00 MA -83.00 MD 300.00 ME 2520.00 MI -1265.00 MN 322.00 MO 100.00 NC 500.00 NH -24.60 NJ -817.45 NV 725.00 NY -6474.50 OH 450.00 OK 800.00 PA -2146.00 RI 200.00 SC 2400.00 TN -25.00 TX 1985.24 UT 5050.00 VA 515.92 WA -500.00 Name: amount, dtype: float64
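The grouped sums above come back ordered by state name. To rank states by total contribution instead, chain sort_values onto the result. A small sketch with made-up data:

```python
import pandas as pd

# Hypothetical mini contributors frame (not the real data)
df = pd.DataFrame({
    "state": ["UT", "CA", "UT", "NY"],
    "amount": [4600.0, -100.0, 450.0, 200.0],
})

by_state = df.groupby("state")["amount"].sum()

# Largest totals first
ranked = by_state.sort_values(ascending=False)
```

This mirrors an SQL GROUP BY ... ORDER BY SUM(amount) DESC.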
dfcontr.state.unique()
array(['VA', 'CA', 'AR', 'DC', 'SC', 'IA', 'OH', 'NC', 'UT', 'MO', 'IL', 'ME', 'FL', 'MD', 'MI', 'CO', 'WA', 'NY', 'TX', 'KY', 'PA', 'TN', 'MA', 'MN', 'KS', 'NJ', 'NH', 'ID', 'OK', nan, 'NV', 'CT', 'RI', 'AK', 'LA', 'AZ'], dtype=object)
There is also a version of the LIMIT clause. It's very intuitive with pandas.
dfcand[0:3]
id | first_name | last_name | middle_name | party | |
---|---|---|---|---|---|
0 | 33 | Joseph | Biden | NaN | D |
1 | 36 | Samuel | Brownback | NaN | R |
2 | 34 | Hillary | Clinton | R. | D |
The usual Python slicing works just fine!
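For the record, the positional slice is interchangeable with head() and iloc. A quick sketch on a toy frame:

```python
import pandas as pd

# Hypothetical mini candidates frame
df = pd.DataFrame({
    "id": [33, 36, 34, 26],
    "last_name": ["Biden", "Brownback", "Clinton", "Edwards"],
})

first_three = df[0:3]        # positional slice, like LIMIT 3 in SQL
same_head = df.head(3)       # equivalent convenience method
same_iloc = df.iloc[0:3]     # explicit positional indexing
```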
Joins with pandas¶
pandas has some documentation on joins: Merge, join, and concatenate. If you want some more reinforcement on the concepts from earlier regarding JOIN, then the pandas documentation may be a good place to get it. You may also be interested in a comparison with SQL.
To do joins with pandas, we use the merge command.
Here's an example of an explicit inner join:
cols_wanted = ['last_name_x', 'first_name_x', 'candidate_id', 'id', 'last_name_y']
dfcontr.merge(dfcand, left_on="candidate_id", right_on="id")[cols_wanted]
last_name_x | first_name_x | candidate_id | id | last_name_y | |
---|---|---|---|---|---|
0 | Agee | Steven | 16 | 16 | Huckabee |
1 | Ahrens | Don | 16 | 16 | Huckabee |
2 | Ahrens | Don | 16 | 16 | Huckabee |
3 | Ahrens | Don | 16 | 16 | Huckabee |
4 | Akin | Charles | 16 | 16 | Huckabee |
... | ... | ... | ... | ... | ... |
170 | ABESHAUS | MERRILL | 37 | 37 | McCain |
171 | ABRAHAM | GEORGE | 37 | 37 | McCain |
172 | ABRAHAMSON | PETER | 37 | 37 | McCain |
173 | ABRAHAM | SALEM | 37 | 37 | McCain |
174 | ABRAHAM | SALEM | 37 | 37 | McCain |
175 rows × 5 columns
Somewhat organized example¶
dfcontr.merge(dfcand, left_on="candidate_id", right_on="id")[cols_wanted].groupby('last_name_y').describe()
candidate_id | id | |||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | |
last_name_y | ||||||||||||||||
Clinton | 25.0 | 34.0 | 0.0 | 34.0 | 34.0 | 34.0 | 34.0 | 34.0 | 25.0 | 34.0 | 0.0 | 34.0 | 34.0 | 34.0 | 34.0 | 34.0 |
Giuliani | 25.0 | 22.0 | 0.0 | 22.0 | 22.0 | 22.0 | 22.0 | 22.0 | 25.0 | 22.0 | 0.0 | 22.0 | 22.0 | 22.0 | 22.0 | 22.0 |
Huckabee | 25.0 | 16.0 | 0.0 | 16.0 | 16.0 | 16.0 | 16.0 | 16.0 | 25.0 | 16.0 | 0.0 | 16.0 | 16.0 | 16.0 | 16.0 | 16.0 |
McCain | 25.0 | 37.0 | 0.0 | 37.0 | 37.0 | 37.0 | 37.0 | 37.0 | 25.0 | 37.0 | 0.0 | 37.0 | 37.0 | 37.0 | 37.0 | 37.0 |
Obama | 25.0 | 20.0 | 0.0 | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 | 25.0 | 20.0 | 0.0 | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 |
Paul | 25.0 | 32.0 | 0.0 | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 | 25.0 | 32.0 | 0.0 | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 |
Romney | 25.0 | 35.0 | 0.0 | 35.0 | 35.0 | 35.0 | 35.0 | 35.0 | 25.0 | 35.0 | 0.0 | 35.0 | 35.0 | 35.0 | 35.0 | 35.0 |
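groupby().describe() reports every summary statistic for every numeric column, which is often more than you want. Named aggregation via agg lets you pick exactly the statistics to compute. A sketch on a hypothetical merged frame:

```python
import pandas as pd

# Hypothetical merged frame: one row per contribution (made-up data)
df = pd.DataFrame({
    "last_name_y": ["Huckabee", "Huckabee", "Obama"],
    "amount": [500.0, 250.0, 4600.0],
})

# One named output column per (input column, aggregation) pair
summary = df.groupby("last_name_y").agg(
    n_contribs=("amount", "size"),
    total=("amount", "sum"),
    largest=("amount", "max"),
)
```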
Other Joins with pandas¶
We didn't cover all possible joins because SQLite can only handle the few that we did discuss. As mentioned, there are workarounds for some things in SQLite, but not everything. Fortunately, pandas can handle pretty much everything. Here are a few joins that pandas can handle:
- LEFT OUTER (already discussed)
- RIGHT OUTER - Think of the "opposite" of a LEFT OUTER join (shade the intersection and right set in the Venn diagram).
- FULL OUTER - Combine everything from both tables (shade the entire Venn diagram).
Left Outer Join with pandas¶
dfcontr.merge(dfcand, left_on="candidate_id", right_on="id", how="left")[cols_wanted]
last_name_x | first_name_x | candidate_id | id | last_name_y | |
---|---|---|---|---|---|
0 | Agee | Steven | 16 | 16 | Huckabee |
1 | Ahrens | Don | 16 | 16 | Huckabee |
2 | Ahrens | Don | 16 | 16 | Huckabee |
3 | Ahrens | Don | 16 | 16 | Huckabee |
4 | Akin | Charles | 16 | 16 | Huckabee |
... | ... | ... | ... | ... | ... |
170 | ABESHAUS | MERRILL | 37 | 37 | McCain |
171 | ABRAHAM | GEORGE | 37 | 37 | McCain |
172 | ABRAHAMSON | PETER | 37 | 37 | McCain |
173 | ABRAHAM | SALEM | 37 | 37 | McCain |
174 | ABRAHAM | SALEM | 37 | 37 | McCain |
175 rows × 5 columns
Right Outer Join with pandas¶
dfcontr.merge(dfcand, left_on="candidate_id", right_on="id", how="right")[cols_wanted]
last_name_x | first_name_x | candidate_id | id | last_name_y | |
---|---|---|---|---|---|
0 | Agee | Steven | 16.0 | 16 | Huckabee |
1 | Ahrens | Don | 16.0 | 16 | Huckabee |
2 | Ahrens | Don | 16.0 | 16 | Huckabee |
3 | Ahrens | Don | 16.0 | 16 | Huckabee |
4 | Akin | Charles | 16.0 | 16 | Huckabee |
... | ... | ... | ... | ... | ... |
180 | NaN | NaN | NaN | 30 | Hunter |
181 | NaN | NaN | NaN | 31 | Kucinich |
182 | NaN | NaN | NaN | 29 | Richardson |
183 | NaN | NaN | NaN | 38 | Tancredo |
184 | NaN | NaN | NaN | 41 | Thompson |
185 rows × 5 columns
Full Outer Join with pandas¶
dfcontr.merge(dfcand, left_on="candidate_id", right_on="id", how="outer")[cols_wanted]
last_name_x | first_name_x | candidate_id | id | last_name_y | |
---|---|---|---|---|---|
0 | Agee | Steven | 16.0 | 16 | Huckabee |
1 | Ahrens | Don | 16.0 | 16 | Huckabee |
2 | Ahrens | Don | 16.0 | 16 | Huckabee |
3 | Ahrens | Don | 16.0 | 16 | Huckabee |
4 | Akin | Charles | 16.0 | 16 | Huckabee |
... | ... | ... | ... | ... | ... |
180 | NaN | NaN | NaN | 30 | Hunter |
181 | NaN | NaN | NaN | 31 | Kucinich |
182 | NaN | NaN | NaN | 29 | Richardson |
183 | NaN | NaN | NaN | 38 | Tancredo |
184 | NaN | NaN | NaN | 41 | Thompson |
185 rows × 5 columns
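When inspecting an outer join, merge's indicator=True option adds a _merge column that records whether each row matched in the left table, the right table, or both. A small sketch with made-up frames:

```python
import pandas as pd

# Hypothetical mini versions of the two tables
cand = pd.DataFrame({"id": [16, 30], "last_name": ["Huckabee", "Hunter"]})
contr = pd.DataFrame({"candidate_id": [16], "first_name": ["Steven"]})

merged = contr.merge(cand, left_on="candidate_id", right_on="id",
                     how="outer", indicator=True)
# The _merge column holds 'left_only', 'right_only', or 'both' per row,
# so candidates with no contributions show up as 'right_only'.
```

This makes it easy to spot, for example, the candidates above (Hunter, Kucinich, ...) who received no contributions in our data.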