Key Word(s): SQL
SQLite
Exercises¶
Today you will work with the candidates
and contributors
datasets to create a database in Python
using SQLite
. This is a hands-on lecture. For once, you may do these tasks in a Jupyter notebook.
The exercises will consist of a sequence of steps to help illustrate basic commands.
Exercise Deliverables¶
- Make a copy of this notebook and call it
L21_Exercises.ipynb.
- For each step in this lecture notebook, there are instructions labeled "Do the following:" (except for Step 1 and Interlude). Put all the code from those instructions in a single cell immediately following the instructions. It should look like a
Python
script. You must comment where appropriate to demonstrate that you understand what you are doing.- Note: To get the pandas tables to display in a cell, use
display()
.
- Note: To get the pandas tables to display in a cell, use
- Save and close your database. Be sure to upload your database with the lecture exercises. You must name your database
L21DB.sqlite
.
import sqlite3
We will also use a basic a pandas
feature to display tables in the database. Although this lecture isn't on pandas
, I will still have you use it a little bit.
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
from IPython.display import display
Now we create the tables in the database (just like last time).
db = sqlite3.connect('L21DB_demo.sqlite')
cursor = db.cursor()
cursor.execute("DROP TABLE IF EXISTS candidates")
cursor.execute("DROP TABLE IF EXISTS contributors")
cursor.execute("PRAGMA foreign_keys=1")
cursor.execute('''CREATE TABLE candidates (
id INTEGER PRIMARY KEY NOT NULL,
first_name TEXT,
last_name TEXT,
middle_init TEXT,
party TEXT NOT NULL)''')
db.commit() # Commit changes to the database
cursor.execute('''CREATE TABLE contributors (
id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
last_name TEXT,
first_name TEXT,
middle_name TEXT,
street_1 TEXT,
street_2 TEXT,
city TEXT,
state TEXT,
zip TEXT,
amount REAL,
date DATETIME,
candidate_id INTEGER NOT NULL,
FOREIGN KEY(candidate_id) REFERENCES candidates(id))''')
db.commit()
with open ("candidates.txt") as candidates:
next(candidates) # jump over the header
for line in candidates.readlines():
cid, first_name, last_name, middle_name, party = line.strip().split('|')
vals_to_insert = (int(cid), first_name, last_name, middle_name, party)
cursor.execute('''INSERT INTO candidates
(id, first_name, last_name, middle_init, party)
VALUES (?, ?, ?, ?, ?)''', vals_to_insert)
with open ("contributors.txt") as contributors:
next(contributors)
for line in contributors.readlines():
cid, last_name, first_name, middle_name, street_1, street_2, \
city, state, zip_code, amount, date, candidate_id = line.strip().split('|')
vals_to_insert = (last_name, first_name, middle_name, street_1, street_2,
city, state, int(zip_code), amount, date, candidate_id)
cursor.execute('''INSERT INTO contributors (last_name, first_name, middle_name,
street_1, street_2, city, state, zip, amount, date, candidate_id)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)''', vals_to_insert)
Interlude¶
Now that you have values in the tables of the database, it would be convenient to be able to visualize those tables in some way. We'll write a little helper function to accomplish this.
def viz_tables(cols, query):
q = cursor.execute(query).fetchall()
framelist = dict()
for i, col_name in enumerate(cols):
framelist[col_name] = [row[i] for row in q]
return pd.DataFrame.from_dict(framelist)
Here's how we can use our helper function. It gives a pretty nice visualization of our table. You should do the same thing with the contributors
table.
candidate_cols = [col[1] for col in cursor.execute("PRAGMA table_info(candidates)")]
query = '''SELECT * FROM candidates'''
viz_tables(candidate_cols, query)
id | first_name | last_name | middle_init | party | |
---|---|---|---|---|---|
0 | 16 | Mike | Huckabee | R | |
1 | 20 | Barack | Obama | D | |
2 | 22 | Rudolph | Giuliani | R | |
3 | 24 | Mike | Gravel | D | |
4 | 26 | John | Edwards | D | |
5 | 29 | Bill | Richardson | D | |
6 | 30 | Duncan | Hunter | R | |
7 | 31 | Dennis | Kucinich | D | |
8 | 32 | Ron | Paul | R | |
9 | 33 | Joseph | Biden | D | |
10 | 34 | Hillary | Clinton | R. | D |
11 | 35 | Mitt | Romney | R | |
12 | 36 | Samuel | Brownback | R | |
13 | 37 | John | McCain | R | |
14 | 38 | Tom | Tancredo | R | |
15 | 39 | Christopher | Dodd | J. | D |
16 | 41 | Fred | Thompson | D. | R |
Do the following:¶
Display the contributors table.
Step 2: Various Queries¶
We can query our database for entries with certain characteristics. For example, we can query the candidates
table for entries whose middle name fields are not empty.
query = '''SELECT * FROM candidates WHERE middle_init <> ""'''
viz_tables(candidate_cols, query)
We can also see how many entries satisfy the query:
print("{} candidates have a middle initial.".format(viz_tables(candidate_cols, query).shape[0]))
This used a trick on the Pandas table. You'll get to practice counting in SQL later.
Do the following queries:¶
- Display the contributors where the state is "PA"
- Display the contributors where the amount contributed is greater than $\$1000.00$.
- Display the contributors from the state "UT" where the amount contributed is greater than $\$1000.00$.
- Display the contributors who didn't list their state
- Hint: Match
state
to the empty string
- Hint: Match
- Display the contributors from "WA" or "PA"
- Hint: You will need to use
IN ("WA", "PA")
in yourSELECT
statement.
- Hint: You will need to use
- Display the contributors who contributed between $\$100.00$ and $\$200.00$.
- Hint: You can use the
BETWEEN 100.00 and 200.00
clause.
- Hint: You can use the
Step 3: Sorting¶
It could be beneficial to sort by one of the attributes in the database. The following cell contains a basic sorting demo. Run it and try to understand what happened.
query = '''SELECT * FROM candidates ORDER BY id DESC'''
viz_tables(candidate_cols, query)
Do the following sorts on the contributors
table:¶
- Sort the
candidates
table bylast_name
. - Sort the contributors table by the
amount
in decending order whereamount
is restricted to be between $\$1000.00$ and $\$5000.00$.- Hint: In your SQL command, start with getting the amount between the specified range followed by the sort. This will all be done in one line.
- Sort the contributors who donted between $\$1000.00$ and $\$5000.00$ by
candidate_id
and then byamount
in descending order.- Hint: Multiple orderings can be accomplished by separating requests after
ORDER BY
with commas. - e.g.
ORDER BY amount ASC, last_name DESC
- Hint: Multiple orderings can be accomplished by separating requests after
Step 4: Selecting Columns¶
So far, we've been selecting all columns from a table (i.e. SELECT * FROM
). Often, we just want to select specific columns (e.g. SELECT amount FROM
).
query = '''SELECT last_name, party FROM candidates'''
viz_tables(['last_name', 'party'], query)
Using the DISTINCT
clause, you remove duplicate rows.
query = '''SELECT DISTINCT party FROM candidates'''
viz_tables(['party'], query)
Do the following:¶
- Get the first and last name of contributors. Make sure each row has distinct values.
Step 5: Altering Tables¶
The ALTER
clause allows us to modify tables in our database. Here, we add a new column to our candidates table called full_name
.
cursor.execute('''ALTER TABLE candidates ADD COLUMN full_name TEXT''')
candidate_cols = [col[1] for col in cursor.execute("PRAGMA table_info(candidates)")]
viz_tables(candidate_cols, '''SELECT * FROM candidates''')
What if we want to rename or delete a columm? It can't be done with SQLite
with a single command. We need to follow some roundabout steps (see SQLite
ALTER TABLE). We won't consider this case at the moment.
For now, let's put a few commands together to populate the full_name
column.
candidate_cols = [col[1] for col in cursor.execute("PRAGMA table_info(candidates)")] # regenerate columns with full_name
query = '''SELECT id, last_name, first_name FROM candidates''' # Select a few columns
full_name_and_id = [(attr[1] + ", " + attr[2], attr[0]) for attr in cursor.execute(query).fetchall()] # List of tuples: (full_name, id)
update = '''UPDATE candidates SET full_name = ? WHERE id = ?''' # Update the table
for rows in full_name_and_id:
cursor.execute(update, rows)
query = '''SELECT * FROM candidates'''
viz_tables(candidate_cols, query)
Here's another update, this time on an existing column.
update = '''UPDATE candidates SET full_name = "WINNER" WHERE last_name = "Obama"'''
cursor.execute(update)
update = '''UPDATE candidates SET full_name = "RUNNER-UP" WHERE last_name = "McCain"'''
cursor.execute(update)
viz_tables(candidate_cols, query)
Do the following:¶
- Add a new column to the contributors table called
full_name
. The value in that column should be in the formlast_name, first_name
. - Change the value in the
full_name
column to the string"Too Much"
if someone donated more than $\$1000.00$.
Step 6: Aggregation¶
You can perform some nice operations on the values in the database. For example, you can compute the maximum, minimum, and sum of a set. You can also count the number of items in a given set. Here's a little example. You can do the rest.
contributor_cols = [col[1] for col in cursor.execute("PRAGMA table_info(contributors)")] # You've already done this part. I just need to do it here b/c I haven't yet.
function = '''SELECT *, MAX(amount) AS max_amount FROM contributors'''
viz_tables(contributor_cols, function)
Do the following:¶
- Modify the demo to only output the max amount.
- Hints:
- Instead of using
SELECT *, MAX(amount) ...
you can trySELECT MAX(amount) ...
- You will want to use
cursor.execute()
here and thenfetchall()
. - Do not display your results in a table. It is sufficient to write the answer out to the screen as a single number.
- Instead of using
- Hints:
- Count how many donations there were above $\$1000.00$.
- Hint: There is a
COUNT
function.
- Hint: There is a
- Calculate the average positive donation.
- Hint: There is an
AVG
function.
- Hint: There is an
Calculate the average contribution from each state and display in a table. Restrict to positive values again.
- Hint: Use code that looks like:
"SELECT state,SUM(amount) FROM contributors GROUP BY state"
Step 8: LIMIT¶
The LIMIT
clause offers convenient functionality. It allows you to constrain the number of rows returned by your query. It shows up in many guises.
query = '''SELECT * FROM candidates LIMIT 3'''
viz_tables(candidate_cols, query)
query = '''SELECT * FROM candidates LIMIT 4 OFFSET 5'''
viz_tables(candidate_cols, query)
query = '''SELECT * FROM candidates ORDER BY last_name LIMIT 4 OFFSET 5'''
viz_tables(candidate_cols, query)
Do the following:¶
- Query and display the ten most generous donors.
- Query and display the ten least generous donors who donated a positive amount of money (since the data we have has some negative numbers in it...).
Save¶
Don't forget to save all of these changes to your database using db.commit()
. Before closing shop, be sure to close the database connection with db.close()
.