Introduction¶
Why Learn Databases¶
- You will see many databases in your career.
SQL
(Structured Query Language) is still very popular and will remain so for a long time. Hence, you will need to codeSQL
.SQL
is the language used to query a relational database.
- You will have to deal with systematic storage of structured and unstructured data at some point.
More Database Motivation¶
- It is very hard to implement a database well, but you must understand how they work.
- Data storage/wrangling are not just database concerns; packages such as
dplyr
andpandas
require a similar knowledge-base.
- It is very important to make an informed decision on a storage engine that is sufficient for your program.
- Important to understand query performance.
- Transaction processing is not optimal for analytics.
What kind of data access do you need?¶
The answer depends on your problem and the resources you have available.
Database Genre | Examples |
---|---|
relational | SQL and its derivatives |
document oriented | MongoDB, CouchDB |
key-value | Riak, Memcached, leveldb |
graph oriented | Neo4J |
columnar | HBase |
A Sampling of Database Genres¶
Relational Model¶
- A relation (table) is a collection of tuples. Each tuple is called a row.
- A collection of tables related to each other through common data values.
- Items in a column are values of one attribute.
- A cell is expected to be atomic
- Tables are related to each other if they have columns (called keys) which represent the same values.
- SQL (Structured Query Language) is a declarative model: a query optimizer decides how to execute the query.
Example from CS109:¶
Working with (Relational) Databases¶
Components to a database¶
- Client connection manager: what to do with incoming data
- Transactional storage
- storage data structures and the log
- transactions and ACID: atomicity, consistency, isolation, durability
- Process model: coroutines, threads, processes
- Query model and language: query optimization
Relational Grammar of Data¶
- We want a language to help us easily query items in the database.
- Provide simple verbs for simple things.
Pandas
is a library forPython
that allows users to work with data structures and relational databases.- The
dplyr
package offers a bunch of data manipulation tools including those for working with relational databases with theR
programming language.
The dplyr_pandas
notebook by Tom Augspurger contains a table comparing dplyr
and pandas
. The following table is a modification to that table:
VERB | dplyr | pandas | SQL |
---|---|---|---|
QUERY/SELECTION | filter() (and slice()) | query() (and loc[], iloc[]) | SELECT WHERE |
SORT | arrange() | sort() | ORDER BY |
SELECT-COLUMNS/PROJECTION | select() (and rename()) | [](__getitem__) (and rename()) | SELECT COLUMN |
SELECT-DISTINCT | distinct() | unique(),drop_duplicates() | SELECT DISTINCT COLUMN |
ASSIGN | mutate() (and transmute()) | assign | ALTER/UPDATE |
AGGREGATE | summarise() | describe(), mean(), max() | None, AVG(),MAX() |
SAMPLE | sample_n() and sample_frac() | sample() | implementation dep, use RAND() |
GROUP-AGG | group_by/summarize | groupby/agg, count, mean | GROUP BY |
DELETE | ? | drop/masking | DELETE/WHERE |
SQLite
¶
NoSQL
databases are gaining in popularity. However, we will stick with traditional relational databases in this course.
- We need a way of querying a given relational database. There are several languages for such a purpose. We will focus on
SQL
(Structured Query Language).
SQL
has a long history. Because of this (or in spite of it), there are many version ofSQL
available today.
- We'll use
SQLite
. Here are some great references:
- Note: There is an even higher level API available, called SQLAlchemy.
- You can install
SQLite
if you need to: https://www.sqlite.org/download.html. - You may find the
SQLite
browser useful: http://sqlitebrowser.org. - You can access the command line interface by downloading the
SQLite
CLI:SQLite
CLI
SQLite
Basics¶
The Plan¶
- We're going to work with the
sqlite3
package inPython
. - This package will allow us to execute basic
SQLite
commands inPython
to build and manipulate our database. - We'll start by creating a
SQL
database and work up from there. - Ultimately, we'd like to work with
pandas
to make our lives easier. - At least in the beginning, we'll just work directly with the
SQLite
commands to get the basics down.
Structural Commands¶
These commands are used to modify the structure of the table in a database.
CREATE
--- Create a table in a databaseDROP
--- Delete a table in the databaseALTER
--- Add, delete, or modify columns in an existing table
Starting a Database¶
We'll start by using the simplest of the commands. Ultimately, we will create a database with tables of presidential candidates (from 2012) and their contributors.
We begin by importing sqlite3
into Python
.
import sqlite3
- Next, we create a connection to the database. A new one will be created if it doesn't already exist.
- Then we'll place the "cursor" on the database.
db = sqlite3.connect('test_db.sqlite')
cursor = db.cursor() # https://www.python.org/dev/peps/pep-0249/#cursor-objects
cursor.execute("DROP TABLE IF EXISTS candidates") # Convenient in case you want to start over
cursor.execute("DROP TABLE IF EXISTS contributors") # Convenient in case you want to start over
# This strange command turns out FOREIGN KEY support in SQLite.
# By default, this support is off.
# We'll talk about foreign keys later on. Just setting things up here.
cursor.execute("PRAGMA foreign_keys=1")
Creating a First Table in the Database¶
We have a connection to a database, but that database doesn't have a table in it.
We'll create a table for the candidates with the following columns: id
, first_name
, last_name
, middle_initial
, and party
.
We require id
to be of type integer and the rest to be a string.
cursor.execute('''CREATE TABLE candidates (
id INTEGER PRIMARY KEY NOT NULL,
first_name TEXT,
last_name TEXT,
middle_init TEXT,
party TEXT NOT NULL)''')
db.commit() # Commit changes to the database
Discussion¶
What did we just do? Some of the commands are obvious (like CREATE TABLE
) but others are not (like PRIMARY KEY
).
cursor.execute()
runs theSQLite
command, which we pass in as a string.
- The
id
column:- Has integer values.
- Is the
PRIMARY KEY
. This means that those field values are unique and cannot haveNULL
values. A table can only have onePRIMARY KEY
. - Prohibit
NULL
values.
first_name
,last_name
,middle_init
are allTEXT
fields of unlimited length.- Note that
SQL
has other types such asVARCHAR(N)
andCHAR(N)
.VARCHAR(N)
allows variable text lengths up toN
characters in length andCHAR(N)
expects text of exactlyN
characters.
- Note that
party
is also entered asTEXT
and cannot haveNULL
values.
We have followed a convention wherein SQL
commands are issued in capital letters and table fields are written in lowercase text.
NOTE: Always commit changes to your database! If you don't, you will lose them when you close the database.
Adding Values¶
cursor.execute('''INSERT INTO candidates
(id, first_name, last_name, middle_init, party)
VALUES (?, ?, ?, ?, ?)''',
(16, "Mike", "Huckabee", "", "R"))
db.commit()
Discussion¶
- We executed the
INSERT
command in order to insert some values into thecandidates
table. - This could be read as: "Insert into the
candidates
table into the columns (...) the values(?,?,?,?,?)
"
- Why all the question marks?
- The question marks are not necessary, but they are important. You can use
Python's
string formatting to insert parameters into theexecute()
method. - However, there are two pretty good reasons for using the ? placeholders instead:
- The question marks are not necessary, but they are important. You can use
- They leave the burden of correctly encoding and escaping data items to the database module
- More importantly, they improve security because they prevent arbitrary
SQL
from being injected into a query.
See Bobby Tables: A guide to preventing SQL injection for a nice overview.
Some Queries¶
First, let's add another entry or two.
cursor.execute('''INSERT INTO candidates
(id, first_name, last_name, middle_init, party)
VALUES (?, ?, ?, ?, ?)''',
(34, "Hillary", "Clinton", "R.", "D"))
db.commit()
Getting All Rows from a Table¶
cursor.execute("SELECT * FROM candidates")
all_rows = cursor.fetchall()
print(all_rows)
Selecting Specifics¶
cursor.execute("SELECT * FROM candidates WHERE first_name = 'Joseph'")
all_rows = cursor.fetchall()
print(all_rows)
cursor.execute("SELECT first_name FROM candidates")
all_rows = cursor.fetchall()
print(all_rows)
Another Table¶
We'll add another table to our database and introduce a few new SQL
commands and ideas.
The new table will contain data on contributors and their contributions to each candidate.
cursor.execute('''CREATE TABLE contributors (
id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
last_name TEXT,
first_name TEXT,
middle_name TEXT,
street_1 TEXT,
street_2 TEXT,
city TEXT,
state TEXT,
zip TEXT,
amount FLOAT(7,3),
date DATETIME,
candidate_id INTEGER NOT NULL,
FOREIGN KEY(candidate_id) REFERENCES candidates(id))''')
db.commit()
Discussion¶
The creation of our new table includes a few new commands:
AUTOINCREMENT
: Theid
for a new entry to the table will be automatically generated. No need to enter this explicitly. The increment starts at 1.DATETIME
: Gives the date in the format YYYY-MM-DD HH:MM:SSFOREIGN KEY
: This allows us to link the two tables (candidates
andcontributors
). More below.
Discussion: FOREIGN KEY¶
- We'd like to be able to keep track of the contributions to each candidate.
- In order to accomplish this, we need to link the candidates and contributors tables somehow.
- A reference between the two tables is created by a common field in each table.
- We define the
FOREIGN KEY
in the second table and it references a unique key in the first table. Often, this unique key is thePRIMARY KEY
.
So, in our example, we create a new field in the contributors table called candidate_id
. We declare it to be a FOREIGN KEY
that references the id
field in the candidates table.
Let's add a contributor.
Adding Some Data to Contributors Table¶
contributors = [("Agee", "Steven", "", "549 Laurel Branch Road", "", "Floyd", "VA", int(24091), 500.0, 2007-6-30, 16),
("Aaron", "Carole", "", "PO Box 1806", "", "Ogunqui", "ME", int(3907), 70, 2008-2-7, 34)]
cursor.executemany('INSERT INTO contributors (last_name, first_name, middle_name, street_1, street_2, city, state, zip, amount, date, candidate_id) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)', contributors)
db.commit()
cursor.execute("SELECT last_name FROM contributors where amount < 100")
for c in cursor.fetchall():
print(c)
Now let's try to add a contributor who contributed to a candidate whose id is not in the candidates table.
cursor.execute('INSERT INTO contributors (last_name, first_name, middle_name, street_1, street_2, city, state, zip, amount, date, candidate_id) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)', ("Buckler", "Steve", "", "24351 Armada Dr.", "", "Dana Point", "CA", int(926291), 50, 2007-7-30, 20))
Observations¶
We got an error!
This is because the FOREIGN KEY
constraint was violated.
There is no candidate with candidate_id
equal to 20.
Another way: Invalid data was prevented from being entered into the foreign key column.
It's a good idea to close the database when we're done.
db.close()