Key Word(s): Databases, SQL



Lecture 21

Databases I

Thursday, November 14th 2019

Introduction

Why Learn Databases

  • You will see many databases in your career.
  • SQL (Structured Query Language) is still very popular and will remain so for a long time. Hence, you will need to code SQL.
    • SQL is the language used to query a relational database.
  • You will have to deal with systematic storage of structured and unstructured data at some point.

More Database Motivation

  • It is very hard to implement a database well, but you must understand how they work.
  • Data storage/wrangling are not just database concerns; packages such as dplyr and pandas require a similar knowledge-base.
  • It is very important to make an informed decision on a storage engine that is sufficient for your program.
  • Important to understand query performance.

What kind of data access do you need?

The answer depends on your problem and the resources you have available.

Database Genre Examples
relational SQL and its derivatives
document oriented MongoDB, CouchDB
key-value Riak, Memcached, leveldb
graph oriented Neo4J
columnar HBase

A Sampling of Database Genres

Relational Model

  • A relation (table) is a collection of tuples. Each tuple is called a row.
  • A collection of tables related to each other through common data values.
  • Items in a column are values of one attribute.
  • A cell is expected to be atomic
  • Tables are related to each other if they have columns (called keys) which represent the same values.
  • SQL (Structured Query Language) is a declarative model: a query optimizer decides how to execute the query.

Example from CS109:

Key-Value Model

  • Like a dictionary
  • Each key is mapped to a single value
  • The database is the index
  • Can be more flexible than relational databases
  • May use much less memory
  • Redis is a popular key-value database

Document Model

  • Stores nested records
  • Typically have one document/record per row in a relational DB table
  • Bad for many-to-many
  • Storage locality good for access, bad for writing
  • XML databases are optimized for XML documents
  • MongoDB is a popular example

Working with (Relational) Databases

Components to a database

  1. Client connection manager: what to do with incoming data
  2. Transactional storage
    • storage data structures and the log
    • transactions and ACID: atomicity, consistency, isolation, durability
  3. Process model: coroutines, threads, processes
  4. Query model and language: query optimization

Relational Grammar of Data

  • We want a language to help us easily query items in the database.
  • Provide simple verbs for simple things.
  • Pandas is a library for Python that allows users to work with data structures and relational databases.
  • The dplyr package offers a bunch of data manipulation tools including those for working with relational databases with the R programming language.

The dplyr_pandas notebook by Tom Augspurger contains a table comparing dplyr and pandas. The following table is a modification to that table:

VERB dplyr pandas SQL
QUERY/SELECTION filter() (and slice()) query() (and loc[], iloc[]) SELECT WHERE
SORT arrange() sort() ORDER BY
SELECT-COLUMNS/PROJECTION select() (and rename()) [](__getitem__) (and rename()) SELECT COLUMN
SELECT-DISTINCT distinct() unique(),drop_duplicates() SELECT DISTINCT COLUMN
ASSIGN mutate() (and transmute()) assign ALTER/UPDATE
AGGREGATE summarise() describe(), mean(), max() None, AVG(),MAX()
SAMPLE sample_n() and sample_frac() sample() implementation dep, use RAND()
GROUP-AGG group_by/summarize groupby/agg, count, mean GROUP BY
DELETE ? drop/masking DELETE/WHERE

SQLite

NoSQL databases are gaining in popularity. However, we will stick with traditional relational databases in this course.

  • We need a way of querying a given relational database. There are several languages for such a purpose. We will focus on SQL (Structured Query Language).
  • SQL has a long history. Because of this (or in spite of it), there are many version of SQL available today.
  • SQLite is built into Python.
    • Python implements a standard database API for all databases called DBAPI2. Some references: DBAPI2 and PEP 249.
  • Note: There is an even higher level API available, called SQLAlchemy.

SQLite Basics

The Plan

  • We're going to work with the sqlite3 package in Python.
  • This package will allow us to execute basic SQLite commands in Python to build and manipulate our database.
  • We'll start by creating a SQL database and work up from there.
  • Ultimately, we'd like to work with pandas to make our lives easier.
  • At least in the beginning, we'll just work directly with the SQLite commands to get the basics down.

The Essentials

Core Commands

  • SELECT --- Select a table name
  • INSERT --- Insert data into the table
  • UPDATE --- Change data values in the table
  • DELETE --- Delete data in the table

We can string these commands together to perform our basic operations on the database.

Structural Commands

These commands are used to modify the structure of the table in a database.

  • CREATE --- Create a table in a database
  • DROP --- Delete a table in the database
  • ALTER --- Add, delete, or modify columns in an existing table

Starting a Database

We'll start by using the simplest of the commands. Ultimately, we will create a database with tables of presidential candidates (from 2012) and their contributors.

We begin by importing sqlite3 into Python.

In [1]:
import sqlite3
  • Next, we create a connection to the database. A new one will be created if it doesn't already exist.
  • Then we'll place the "cursor" on the database.
In [2]:
db = sqlite3.connect('test_db.sqlite')
cursor = db.cursor() # https://www.python.org/dev/peps/pep-0249/#cursor-objects
cursor.execute("DROP TABLE IF EXISTS candidates") # Convenient in case you want to start over
cursor.execute("DROP TABLE IF EXISTS contributors") # Convenient in case you want to start over
Out[2]:
In [3]:
# This strange command turns out FOREIGN KEY support in SQLite.
# By default, this support is off.
# We'll talk about foreign keys later on.  Just setting things up here.
cursor.execute("PRAGMA foreign_keys=1")
Out[3]:

Creating a First Table in the Database

We have a connection to a database, but that database doesn't have a table in it.

We'll create a table for the candidates with the following columns: id, first_name, last_name, middle_initial, and party.

We require id to be of type integer and the rest to be a string.

In [4]:
cursor.execute('''CREATE TABLE candidates (
               id INTEGER PRIMARY KEY NOT NULL, 
               first_name TEXT, 
               last_name TEXT, 
               middle_init TEXT, 
               party TEXT NOT NULL)''')

db.commit() # Commit changes to the database

Discussion

What did we just do? Some of the commands are obvious (like CREATE TABLE) but others are not (like PRIMARY KEY).

  1. cursor.execute() runs the SQLite command, which we pass in as a string.
  1. The id column:
    • Has integer values.
    • Is the PRIMARY KEY. This means that those field values are unique and cannot have NULL values. A table can only have one PRIMARY KEY.
    • Prohibit NULL values.
  1. first_name, last_name, middle_init are all TEXT fields of unlimited length.
    • Note that SQL has other types such as VARCHAR(N) and CHAR(N). VARCHAR(N) allows variable text lengths up to N characters in length and CHAR(N) expects text of exactly N characters.
  1. party is also entered as TEXT and cannot have NULL values.

We have followed a convention wherein SQL commands are issued in capital letters and table fields are written in lowercase text.

NOTE: Always commit changes to your database! If you don't, you will lose them when you close the database.

Adding Values

In [5]:
cursor.execute('''INSERT INTO candidates
               (id, first_name, last_name, middle_init, party)
               VALUES (?, ?, ?, ?, ?)''', 
                (16, "Mike", "Huckabee", "", "R"))
db.commit()

Discussion

  • We executed the INSERT command in order to insert some values into the candidates table.
  • This could be read as: "Insert into the candidates table into the columns (...) the values (?,?,?,?,?)"
  • Why all the question marks?
    • The question marks are not necessary, but they are important. You can use Python's string formatting to insert parameters into the execute() method.
    • However, there are two pretty good reasons for using the ? placeholders instead:
  1. They leave the burden of correctly encoding and escaping data items to the database module
  2. More importantly, they improve security because they prevent arbitrary SQL from being injected into a query.

See Bobby Tables: A guide to preventing SQL injection for a nice overview.

Some Queries

First, let's add another entry or two.

In [6]:
cursor.execute('''INSERT INTO candidates
               (id, first_name, last_name, middle_init, party)
               VALUES (?, ?, ?, ?, ?)''', 
                (34, "Hillary", "Clinton", "R.", "D"))
db.commit()

Getting All Rows from a Table

In [7]:
cursor.execute("SELECT * FROM candidates")
all_rows = cursor.fetchall()
print(all_rows)
[(16, 'Mike', 'Huckabee', '', 'R'), (34, 'Hillary', 'Clinton', 'R.', 'D')]

Selecting Specifics

In [8]:
cursor.execute("SELECT * FROM candidates WHERE first_name = 'Joseph'")
all_rows = cursor.fetchall()
print(all_rows)
[]
In [9]:
cursor.execute("SELECT first_name FROM candidates")
all_rows = cursor.fetchall()
print(all_rows)
[('Mike',), ('Hillary',)]

Another Table

We'll add another table to our database and introduce a few new SQL commands and ideas.

The new table will contain data on contributors and their contributions to each candidate.

In [10]:
cursor.execute('''CREATE TABLE contributors (
          id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, 
          last_name TEXT, 
          first_name TEXT, 
          middle_name TEXT, 
          street_1 TEXT, 
          street_2 TEXT, 
          city TEXT, 
          state TEXT, 
          zip TEXT, 
          amount FLOAT(7,3), 
          date DATETIME, 
          candidate_id INTEGER NOT NULL, 
          FOREIGN KEY(candidate_id) REFERENCES candidates(id))''')

db.commit()

Discussion

The creation of our new table includes a few new commands:

  • AUTOINCREMENT: The id for a new entry to the table will be automatically generated. No need to enter this explicitly. The increment starts at 1.
  • DATETIME: Gives the date in the format YYYY-MM-DD HH:MM:SS
  • FOREIGN KEY: This allows us to link the two tables (candidates and contributors). More below.

Discussion: FOREIGN KEY

  • We'd like to be able to keep track of the contributions to each candidate.
  • In order to accomplish this, we need to link the candidates and contributors tables somehow.
  • A reference between the two tables is created by a common field in each table.
  • We define the FOREIGN KEY in the second table and it references a unique key in the first table. Often, this unique key is the PRIMARY KEY.

So, in our example, we create a new field in the contributors table called candidate_id. We declare it to be a FOREIGN KEY that references the id field in the candidates table.

Let's add a contributor.

Adding Some Data to Contributors Table

In [11]:
contributors = [("Agee", "Steven", "", "549 Laurel Branch Road", "", "Floyd", "VA", int(24091), 500.0, 2007-6-30, 16),
                ("Aaron", "Carole", "", "PO Box 1806", "", "Ogunqui", "ME", int(3907), 70, 2008-2-7, 34)]

cursor.executemany('INSERT INTO contributors (last_name, first_name, middle_name, street_1, street_2, city, state, zip, amount, date, candidate_id) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)', contributors)
db.commit()
In [12]:
cursor.execute("SELECT last_name FROM contributors where amount < 100")
for c in cursor.fetchall():
    print(c)
('Aaron',)

Now let's try to add a contributor who contributed to a candidate whose id is not in the candidates table.

In [13]:
cursor.execute('INSERT INTO contributors (last_name, first_name, middle_name, street_1, street_2, city, state, zip, amount, date, candidate_id) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)', ("Buckler", "Steve", "", "24351 Armada Dr.", "", "Dana Point", "CA", int(926291), 50, 2007-7-30, 20))
---------------------------------------------------------------------------
IntegrityError                            Traceback (most recent call last)
 in 
----> 1 cursor.execute('INSERT INTO contributors (last_name, first_name, middle_name, street_1, street_2, city, state, zip, amount, date, candidate_id) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)', ("Buckler", "Steve", "", "24351 Armada Dr.", "", "Dana Point", "CA", int(926291), int(50), 2007-7-30, 20))

IntegrityError: FOREIGN KEY constraint failed

Observations

We got an error!

This is because the FOREIGN KEY constraint was violated.

There is no candidate with candidate_id equal to 20.

Another way: Invalid data was prevented from being entered into the foreign key column.

It's a good idea to close the database when we're done.

In [14]:
db.close()