EDA, Pandas, and the Grammar of Data


We'd like a data structure that can represent the columns in the data above by their name. In particular, we want a structure that can easily store variables of different types, that stores column names, and that we can reference by column name as well as by indexed position. And it would be nice this data structure came with built-in functions that we can use to manipulate it.

Pandas is a package/library that does all of this! The library is built on top of numpy. There are two basic pandas objects, series and dataframes, which can be thought of as enhanced versions of 1D and 2D numpy arrays, respectively. Indeed Pandas attempts to keep all the efficiencies that numpy gives us.

For reference, here is a useful pandas cheat sheet and the pandas documentation.

The basic EDA workflow

Below is a basic checklist for the early stages of exploratory data analysis in Python. While not universally applicable, the rubric covers patterns which recur in several data analysis contexts, so useful to keep it in mind when encountering a new dataset.

The basic workflow (enunciated in this form by Chris Beaumont, the first Head TF of cs109 ever) is as follows:

  1. Build a Dataframe from the data (ideally, put all data in this object)
  2. Clean the Dataframe. It should have the following properties:

    • Each row describes a single object
    • Each column describes a property of that object
    • Columns are numeric whenever appropriate
    • Columns contain atomic properties that cannot be further decomposed
  3. Explore global properties. Use histograms, scatter plots, and aggregation functions to summarize the data.

  4. Explore group properties. Use groupby, queries, and small multiples to compare subsets of the data.

This process transforms the data into a format which is easier to work with, gives you a basic overview of the data's properties, and likely generates several questions for you to follow-up on in subsequent analysis.

In [1]:
# The %... is an iPython thing, and is not part of the Python language.
# In this case we're just telling the plotting library to draw things on
# the notebook, instead of on a separate window.
%matplotlib inline
# See all the "as ..." contructs? They're just aliasing the package names.
# That way we can call methods like plt.plot() instead of matplotlib.pyplot.plot().
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
import time
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

Building a dataframe

The easiest way to build a dataframe is simply to read in a CSV file. We saw an example of this in lab, and we shall see 2 examples here. We'll also see how we may combine multiple data sources into a larger dataframe.

This example is adapted from: https://github.com/tthibo/SQL-Tutorial

The first 3 lines of the file (!head -3 data/candidates.txt on mac/unix) look like this.

id|first_name|last_name|middle_name|party
33|Joseph|Biden||D
36|Samuel|Brownback||R
In [2]:
dfcand=pd.read_csv("./data/candidates.txt", sep='|')
dfcand.head()
Out[2]:
id first_name last_name middle_name party
0 33 Joseph Biden NaN D
1 36 Samuel Brownback NaN R
2 34 Hillary Clinton R. D
3 39 Christopher Dodd J. D
4 26 John Edwards NaN D

A pandas dataframe is a set of columns pasted together into a spreadsheet, as shown in the schematic below, which is taken from the cheatsheet above. The columns in pandas are called series objects.

All the columns in this dataframe:

In [3]:
dfcand.columns
Out[3]:
Index(['id', 'first_name', 'last_name', 'middle_name', 'party'], dtype='object')

And the types of these columns:

In [4]:
dfcand.dtypes
Out[4]:
id              int64
first_name     object
last_name      object
middle_name    object
party          object
dtype: object

Access to a particular column can be obtained by treating the column name as an "attribute" of the dataframe:

In [5]:
dfcand.first_name
Out[5]:
0          Joseph
1          Samuel
2         Hillary
3     Christopher
4            John
5         Rudolph
6            Mike
7            Mike
8          Duncan
9          Dennis
10           John
11         Barack
12            Ron
13           Bill
14           Mitt
15            Tom
16           Fred
Name: first_name, dtype: object

But Pandas supports a dictionary like access to columns. This is very useful when column names have spaces: Python variables cannot have spaces in them.

In [6]:
dfcand['first_name']
Out[6]:
0          Joseph
1          Samuel
2         Hillary
3     Christopher
4            John
5         Rudolph
6            Mike
7            Mike
8          Duncan
9          Dennis
10           John
11         Barack
12            Ron
13           Bill
14           Mitt
15            Tom
16           Fred
Name: first_name, dtype: object

We can also get sub-dataframes by choosing a set of series. We pass a list of the columns we want as "dictionary keys" to the dataframe.

In [7]:
columns_i_want=['first_name', 'last_name']
dfcand[columns_i_want]
Out[7]:
first_name last_name
0 Joseph Biden
1 Samuel Brownback
2 Hillary Clinton
3 Christopher Dodd
4 John Edwards
5 Rudolph Giuliani
6 Mike Gravel
7 Mike Huckabee
8 Duncan Hunter
9 Dennis Kucinich
10 John McCain
11 Barack Obama
12 Ron Paul
13 Bill Richardson
14 Mitt Romney
15 Tom Tancredo
16 Fred Thompson

Categoricals

Even though party is a string, it takes on only a finite set of values, 'D', and 'R'. We can model this:

In [8]:
dfcand.party.unique()
Out[8]:
array(['D', 'R'], dtype=object)
In [9]:
dfcand['party'] = dfcand['party'].astype("category")
In [10]:
dfcand.dtypes
Out[10]:
id                int64
first_name       object
last_name        object
middle_name      object
party          category
dtype: object
In [11]:
dfcand.head()
Out[11]:
id first_name last_name middle_name party
0 33 Joseph Biden NaN D
1 36 Samuel Brownback NaN R
2 34 Hillary Clinton R. D
3 39 Christopher Dodd J. D
4 26 John Edwards NaN D
In [12]:
dfcand.party.describe()
Out[12]:
count     17
unique     2
top        R
freq       9
Name: party, dtype: object
In [13]:
dfcand.party.cat.categories
Out[13]:
Index(['D', 'R'], dtype='object')
In [14]:
dfcand.party.cat.ordered
Out[14]:
False

Keep in mind that this is a relatively new feature of Pandas. You dont need to do this, but might find it useful to keep your types straight. (Using categoricals in machine learning algorithms is more complex and usually involves a process called One Hot Encoding)

Another piece of data

This is a file of people who have contributed money to candidates: (!head -3 data/contributors_with_candidate_id.txt)

id|last_name|first_name|middle_name|street_1|street_2|city|state|zip|amount|date|candidate_id
|Agee|Steven||549 Laurel Branch Road||Floyd|VA|24091|500.00|2007-06-30|16
|Ahrens|Don||4034 Rennellwood Way||Pleasanton|CA|94566|250.00|2007-05-16|16
In [15]:
dfcwci=pd.read_csv("./data/contributors_with_candidate_id.txt", sep="|")
dfcwci.head()
Out[15]:
id last_name first_name middle_name street_1 street_2 city state zip amount date candidate_id
0 NaN Agee Steven NaN 549 Laurel Branch Road NaN Floyd VA 24091 500.0 2007-06-30 16
1 NaN Ahrens Don NaN 4034 Rennellwood Way NaN Pleasanton CA 94566 250.0 2007-05-16 16
2 NaN Ahrens Don NaN 4034 Rennellwood Way NaN Pleasanton CA 94566 50.0 2007-06-18 16
3 NaN Ahrens Don NaN 4034 Rennellwood Way NaN Pleasanton CA 94566 100.0 2007-06-21 16
4 NaN Akin Charles NaN 10187 Sugar Creek Road NaN Bentonville AR 72712 100.0 2007-06-16 16

Cleaning Data

most of the techniques you will learn about in Pandas are all about getting data in a form that can be used for further analysis. Cleaning usually means dealing with missing values, transforming types appropriately, and taking care of data integrity. But we'll lump everything required to transform data to a form appropriate for analysis cleaning, even if what we are doing is for example, combining multiple data sets, or producing processed data from raw data.

Lets start with some regular cleaning:

In [16]:
del dfcwci['id']
dfcwci.head()
Out[16]:
last_name first_name middle_name street_1 street_2 city state zip amount date candidate_id
0 Agee Steven NaN 549 Laurel Branch Road NaN Floyd VA 24091 500.0 2007-06-30 16
1 Ahrens Don NaN 4034 Rennellwood Way NaN Pleasanton CA 94566 250.0 2007-05-16 16
2 Ahrens Don NaN 4034 Rennellwood Way NaN Pleasanton CA 94566 50.0 2007-06-18 16
3 Ahrens Don NaN 4034 Rennellwood Way NaN Pleasanton CA 94566 100.0 2007-06-21 16
4 Akin Charles NaN 10187 Sugar Creek Road NaN Bentonville AR 72712 100.0 2007-06-16 16

We can see the size of our data

In [17]:
dfcwci.shape, dfcand.shape
Out[17]:
((175, 11), (17, 5))

We will do some more cleaning soon but let us see the EDA process as sliced from another angle: data transformation, used both for cleaning and for seeing single column and multiple column properties in the data.

Data transformation: Single Table Verbs

Let us now focus on core data manipulation commands. These are universal across systems, and by identifying them, we can quickly ask how to do these when we encounter a new system.

See https://gist.github.com/TomAugspurger/6e052140eaa5fdb6e8c0/ which has a comparison of r/dplyr and pandas. I stole and modified this table from there:

dplyr has a small set of nicely defined verbs, which Hadley Wickham has used to identify core data manipulation commands. Here are listed the closest SQL and Pandas verbs, so we can see the universality of these manipulations.

VERB dplyr pandas SQL
QUERY/SELECTION filter() (and slice()) query() (and loc[], iloc[]) SELECT WHERE
SORT arrange() sort() ORDER BY
SELECT-COLUMNS/PROJECTION select() (and rename()) [](__getitem__) (and rename()) SELECT COLUMN
SELECT-DISTINCT distinct() unique(),drop_duplicates() SELECT DISTINCT COLUMN
ASSIGN mutate() (and transmute()) assign ALTER/UPDATE
AGGREGATE summarise() describe(), mean(), max() None, AVG(),MAX()
SAMPLE sample_n() and sample_frac() sample() implementation dep, use RAND()
GROUP-AGG group_by/summarize groupby/agg, count, mean GROUP BY
DELETE ? drop/masking DELETE/WHERE

We'll tackle these one by one in Pandas, since these are data manipulations you will do all the time.

QUERY

In [18]:
dfcwci.amount < 400
Out[18]:
0      False
1       True
2       True
3       True
4       True
5      False
6      False
7       True
8      False
9      False
10     False
11     False
12      True
13     False
14      True
15     False
16     False
17      True
18      True
19     False
20     False
21     False
22      True
23     False
24     False
25      True
26      True
27      True
28     False
29      True
       ...  
145     True
146     True
147     True
148     True
149     True
150    False
151     True
152     True
153     True
154     True
155    False
156    False
157     True
158     True
159    False
160     True
161     True
162     True
163    False
164    False
165     True
166     True
167     True
168    False
169    False
170     True
171    False
172     True
173    False
174    False
Name: amount, Length: 175, dtype: bool

This gives us Trues and Falses. Such a series is called a mask. A mask is the basis of filtering. We can do:

In [19]:
dfcwci[dfcwci.amount < 400].head()
Out[19]:
last_name first_name middle_name street_1 street_2 city state zip amount date candidate_id
1 Ahrens Don NaN 4034 Rennellwood Way NaN Pleasanton CA 94566 250.0 2007-05-16 16
2 Ahrens Don NaN 4034 Rennellwood Way NaN Pleasanton CA 94566 50.0 2007-06-18 16
3 Ahrens Don NaN 4034 Rennellwood Way NaN Pleasanton CA 94566 100.0 2007-06-21 16
4 Akin Charles NaN 10187 Sugar Creek Road NaN Bentonville AR 72712 100.0 2007-06-16 16
7 Aldridge Brittni NaN 808 Capitol Square Place, SW NaN Washington DC 20024 250.0 2007-06-06 16

Notice that the dataframe has been filtered down to only include those contributions with amount < 400. The rows with False in the mask have been eliminated, and those with True in the mask have been kept.

In [20]:
np.sum(dfcwci.amount < 400), np.mean(dfcwci.amount < 400)
Out[20]:
(132, 0.75428571428571434)

Why did that work? The booleans are coerced to integers as below:

In [21]:
1*True, 1*False
Out[21]:
(1, 0)

Or directly, in Pandas, which works since the comparison is a pandas Series.

In [22]:
(dfcwci.amount < 400).mean()
Out[22]:
0.75428571428571434

You can combine queries. Note that we use Python's & operator instead of and. This is because we are "Boolean AND"ing masks to get a series of True's And Falses.

In [23]:
dfcwci[(dfcwci.state=='VA') & (dfcwci.amount < 400)]
Out[23]:
last_name first_name middle_name street_1 street_2 city state zip amount date candidate_id
27 Buckheit Bruce NaN 8904 KAREN DR NaN FAIRFAX VA 220312731 100.00 2007-09-19 20
77 Ranganath Anoop NaN 2507 Willard Drive NaN Charlottesville VA 22903 -100.00 2008-04-21 32
88 Perreault Louise NaN 503 Brockridge Hunt Drive NaN Hampton VA 23666 -34.08 2008-04-21 32
145 ABDELLA THOMAS M. 4231 MONUMENT WALL WAY #340 NaN FAIRFAX VA 220308440 50.00 2007-09-30 35

Here is another way to write the query:

In [24]:
dfcwci.query("state=='VA' & amount < 400")
Out[24]:
last_name first_name middle_name street_1 street_2 city state zip amount date candidate_id
27 Buckheit Bruce NaN 8904 KAREN DR NaN FAIRFAX VA 220312731 100.00 2007-09-19 20
77 Ranganath Anoop NaN 2507 Willard Drive NaN Charlottesville VA 22903 -100.00 2008-04-21 32
88 Perreault Louise NaN 503 Brockridge Hunt Drive NaN Hampton VA 23666 -34.08 2008-04-21 32
145 ABDELLA THOMAS M. 4231 MONUMENT WALL WAY #340 NaN FAIRFAX VA 220308440 50.00 2007-09-30 35

For cleaning, we might want to use this querying ability

In [25]:
dfcwci[dfcwci.state.isnull()]
Out[25]:
last_name first_name middle_name street_1 street_2 city state zip amount date candidate_id
125 BOURNE TRAVIS NaN LAGE KAART 77 NaN BRASSCHATT NaN 2930 -500.0 2008-11-20 35

Or the opposite, which is probably more useful in making the selection:

In [26]:
dfcwci[dfcwci.state.notnull()].head()
Out[26]:
last_name first_name middle_name street_1 street_2 city state zip amount date candidate_id
0 Agee Steven NaN 549 Laurel Branch Road NaN Floyd VA 24091 500.0 2007-06-30 16
1 Ahrens Don NaN 4034 Rennellwood Way NaN Pleasanton CA 94566 250.0 2007-05-16 16
2 Ahrens Don NaN 4034 Rennellwood Way NaN Pleasanton CA 94566 50.0 2007-06-18 16
3 Ahrens Don NaN 4034 Rennellwood Way NaN Pleasanton CA 94566 100.0 2007-06-21 16
4 Akin Charles NaN 10187 Sugar Creek Road NaN Bentonville AR 72712 100.0 2007-06-16 16

For categoricals you can use isin. You can use Boolean not on the mask to implement not in.

In [27]:
dfcwci[dfcwci.state.isin(['VA','WA'])].head(10)
Out[27]:
last_name first_name middle_name street_1 street_2 city state zip amount date candidate_id
0 Agee Steven NaN 549 Laurel Branch Road NaN Floyd VA 24091 500.00 2007-06-30 16
27 Buckheit Bruce NaN 8904 KAREN DR NaN FAIRFAX VA 220312731 100.00 2007-09-19 20
62 BURKE SUZANNE M. 3401 EVANSTON NaN SEATTLE WA 981038677 -700.00 2008-03-05 22
77 Ranganath Anoop NaN 2507 Willard Drive NaN Charlottesville VA 22903 -100.00 2008-04-21 32
88 Perreault Louise NaN 503 Brockridge Hunt Drive NaN Hampton VA 23666 -34.08 2008-04-21 32
100 Aaronson Rebecca NaN 2000 Village Green Dr Apt 12 NaN Mill Creek WA 980125787 100.00 2008-02-08 34
106 Aaronson Rebecca NaN 2000 Village Green Dr Apt 12 NaN Mill Creek WA 980125787 100.00 2008-02-14 34
145 ABDELLA THOMAS M. 4231 MONUMENT WALL WAY #340 NaN FAIRFAX VA 220308440 50.00 2007-09-30 35

And you can chain queries thus.

In [28]:
dfcwci.query("10 <= amount <= 50").head(10)
Out[28]:
last_name first_name middle_name street_1 street_2 city state zip amount date candidate_id
2 Ahrens Don NaN 4034 Rennellwood Way NaN Pleasanton CA 94566 50.0 2007-06-18 16
18 Ardle William NaN 412 Dakota Avenue NaN Springfield OH 45504 50.0 2007-06-28 16
25 Buckler Steve NaN 24351 Armada Dr NaN Dana Point CA 926291306 50.0 2007-07-30 20
26 Buckler Steve NaN 24351 Armada Dr NaN Dana Point CA 926291306 25.0 2007-08-16 20
34 Buck Barbara NaN 1780 NE 138th St NaN North Miami FL 331811316 50.0 2007-09-13 20
35 Buck Barbara NaN 1780 NE 138th St NaN North Miami FL 331811316 50.0 2007-07-19 20
38 Buchanek Elizabeth NaN 7917 Kentbury Dr NaN Bethesda MD 208144615 50.0 2007-09-30 20
49 Harrison Ryan NaN 2247 3rd St NaN La Verne CA 917504918 25.0 2007-07-26 20
101 Aarons Elaine NaN 481 Buck Island Rd Apt 17A APT 17A West Yarmouth MA 26733300 25.0 2008-02-26 34
104 Aaron Shirley NaN 101 Cherry Ave NaN Havana FL 323331311 50.0 2008-02-29 34

SORT

In [29]:
dfcwci.sort_values(by="amount").head(10)
Out[29]:
last_name first_name middle_name street_1 street_2 city state zip amount date candidate_id
90 Kazor Christopher M 707 Spindletree ave NaN Naperville IL 60565 -2592.0 2008-04-21 32
72 BRUNO JOHN NaN 10136 WINDERMERE CHASE BLVD. NaN GOTHA FL 347344707 -2300.0 2008-03-06 22
64 BURKE DONALD J. 12 LOMPOC NaN RANCHO SANTA MARGA CA 926881817 -2300.0 2008-03-11 22
73 BRUNO IRENE NaN 10136 WINDERMERE CHASE BLVD. NaN GOTHA FL 347344707 -2300.0 2008-03-06 22
74 BROWN TIMOTHY J. 26826 MARLOWE COURT NaN STEVENSON RANCH CA 913811020 -2300.0 2008-03-06 22
58 BURTON GLENN M. 4404 CHARLESTON COURT NaN TAMPA FL 336092620 -2300.0 2008-03-05 22
57 BURTON STEVEN G. 9938 DEER CREEK DRIVE NaN TAMPA FL 33647 -2300.0 2008-03-05 22
84 Uihlein Richard NaN 1396 N Waukegan Rd NaN Lake Forest IL 600451147 -2300.0 2008-04-21 32
56 BURTON SUSAN NaN 9338 DEER CREEK DRIVE NaN TAMPA FL 336472286 -2300.0 2008-03-05 22
55 BUSH ERIC NaN P.O. BOX 61046 NaN DENVER CO 802061046 -2300.0 2008-03-06 22
In [30]:
dfcwci[dfcwci.amount < 0]
Out[30]:
last_name first_name middle_name street_1 street_2 city state zip amount date candidate_id
29 Buckel Linda NaN PO Box 683130 NaN Park City UT 840683130 -2300.00 2007-08-14 20
40 Buchanan John NaN 2025 NW 29th Rd NaN Boca Raton FL 334316303 -500.00 2007-09-24 20
43 Buchanan John NaN 2025 NW 29th Rd NaN Boca Raton FL 334316303 -700.00 2007-08-28 20
50 BYNUM HERBERT NaN 332 SUNNYSIDE ROAD NaN TAMPA FL 336177249 -500.00 2008-03-10 22
51 BYINGTON MARGARET E. 2633 MIDDLEBORO LANE N.E. NaN GRAND RAPIDS MI 495061254 -2300.00 2008-03-03 22
52 BYERS BOB A. 13170 TELFAIR AVENUE NaN SYLMAR CA 913423573 -2300.00 2008-03-07 22
53 BYERS AUDREY NaN 2658 LADBROOK WAY NaN THOUSAND OAKS CA 913615073 -200.00 2008-03-07 22
54 BUSH KRYSTIE NaN P.O. BOX 61046 NaN DENVER CO 802061046 -2300.00 2008-03-06 22
55 BUSH ERIC NaN P.O. BOX 61046 NaN DENVER CO 802061046 -2300.00 2008-03-06 22
56 BURTON SUSAN NaN 9338 DEER CREEK DRIVE NaN TAMPA FL 336472286 -2300.00 2008-03-05 22
57 BURTON STEVEN G. 9938 DEER CREEK DRIVE NaN TAMPA FL 33647 -2300.00 2008-03-05 22
58 BURTON GLENN M. 4404 CHARLESTON COURT NaN TAMPA FL 336092620 -2300.00 2008-03-05 22
59 BURKHARDT CRAIG S. 910 15TH STREET N.W. NaN WASHINGTON DC 200052503 -500.00 2008-03-07 22
60 BURKHARDT CRAIG S. 910 15TH STREET N.W. NaN WASHINGTON DC 200052503 -1000.00 2008-03-07 22
61 BURKHARDT BARBARA NaN 910 15TH STREET N.W. NaN WASHINGTON DC 200052503 -500.00 2008-03-07 22
62 BURKE SUZANNE M. 3401 EVANSTON NaN SEATTLE WA 981038677 -700.00 2008-03-05 22
63 BURKE GAIL NaN 165 E. 32ND STREET APARTMENT 9E NEW YORK NY 100166014 -2000.00 2008-03-05 22
64 BURKE DONALD J. 12 LOMPOC NaN RANCHO SANTA MARGA CA 926881817 -2300.00 2008-03-11 22
65 BURGERT RONALD L. 5723 PLUMTREE DRIVE NaN DALLAS TX 752524926 -1000.00 2008-03-05 22
66 BULL BARTLE B. 439 E. 51ST STREET NaN NEW YORK NY 100226473 -800.00 2008-03-10 22
67 BULL BARTLE B. 439 E. 51ST STREET NaN NEW YORK NY 100226473 -1000.00 2008-03-10 22
68 BUKOWSKI DANIEL J. 702 S. WRIGHT STREET NaN NAPERVILLE IL 605406736 -100.00 2008-03-10 22
69 BUISSON MARGARET A. P.O. BOX 197029 NaN LOUISVILLE KY 402597029 -200.00 2008-03-11 22
70 BUCKLEY WALTER W. 1635 COUNTRY ROAD NaN BETHLEHEM PA 180155718 -100.00 2008-03-05 22
71 BUCKLEY MARJORIE B. 1635 COUNTRY ROAD NaN BETHLEHEM PA 180155718 -100.00 2008-03-05 22
72 BRUNO JOHN NaN 10136 WINDERMERE CHASE BLVD. NaN GOTHA FL 347344707 -2300.00 2008-03-06 22
73 BRUNO IRENE NaN 10136 WINDERMERE CHASE BLVD. NaN GOTHA FL 347344707 -2300.00 2008-03-06 22
74 BROWN TIMOTHY J. 26826 MARLOWE COURT NaN STEVENSON RANCH CA 913811020 -2300.00 2008-03-06 22
75 Schuff Bryan NaN 1700 W Sweden Rd NaN Brockport NY 14420 -25.00 2008-08-22 32
76 Hobbs James NaN 229 Cherry Lane NaN White House TN 37188 -25.00 2008-08-19 32
77 Ranganath Anoop NaN 2507 Willard Drive NaN Charlottesville VA 22903 -100.00 2008-04-21 32
78 Nystrom Michael A 93A Fairmont Street NaN Arlington MA 2474 -503.00 2008-04-21 32
79 Muse Nina Jo 2915 Toro Canyon Rd NaN Austin TX 78746 -50.00 2008-04-21 32
80 Waddell James L. 1823 Spel Lane SW NaN Rochester MN 55902 -28.00 2008-04-21 32
81 Brucks William C. PO Box 391 NaN Corona del Mar CA 92625 -150.00 2008-04-21 32
82 Kuehn David NaN 14502 West 93rd Street NaN Lenexa KS 66215 -330.00 2008-04-21 32
83 Verster Jeanette M. 7220 SW 61st St NaN Miami FL 331431807 -1000.00 2008-04-21 32
84 Uihlein Richard NaN 1396 N Waukegan Rd NaN Lake Forest IL 600451147 -2300.00 2008-04-21 32
85 Eskenberry Robert P 10960 Gray Cir NaN Westminster CO 80020 -223.00 2008-04-21 32
86 Froehling Alan L. 302 Broadway St NaN Mount Vernon IL 628645116 -844.80 2008-04-21 32
87 Duryea Marcia A. 123 Bayview Ave NaN Amityville NY 11701 -299.50 2008-04-21 32
88 Perreault Louise NaN 503 Brockridge Hunt Drive NaN Hampton VA 23666 -34.08 2008-04-21 32
89 Rozenfeld Timur NaN 57 Herbert Road NaN Robbinsville NJ 8691 -777.95 2008-04-21 32
90 Kazor Christopher M 707 Spindletree ave NaN Naperville IL 60565 -2592.00 2008-04-21 32
91 Lehner Thomas S. 2701 Star Lane NaN Wadsworth OH 44281 -200.00 2008-04-21 32
92 Plummer Joseph NaN 587 Blake Hill Rd NaN New Hampton NH 32564424 -24.60 2008-04-21 32
93 Raught Philip M 4714 Plum Way NaN Pittsburgh PA 15201 -1046.00 2008-04-21 32
94 Ferrara Judith D 1508 Waterford Road NaN Yardley PA 19067 -1100.00 2008-04-21 32
95 Johnson Cathleen E. 1003 Justin Ln Apt 2016 NaN Austin TX 787572648 -14.76 2008-04-21 32
96 Sanford Bradley NaN 940 Post St #43 NaN San Francisco CA 94109 -24.53 2008-04-21 32
97 Gaarder Bruce NaN PO Box 4085 NaN Mountain Home AFB ID 83648 -261.00 2008-04-21 32
98 Choe Hyeokchan NaN 207 Bridle Way NaN Fort Lee NJ 70246302 -39.50 2008-04-21 32
99 Jacobs Richard G. 14337 Tawya Rd NaN Apple Valley CA 923075545 -1000.00 2008-04-21 32
110 Reid Elizabeth NaN 73 W Patent Rd OPHIR FARM NORTH Bedford Hills NY 105072222 -350.00 2008-08-28 34
111 Reich Thomas NaN 499 Park Ave NaN New York NY 100221240 -2300.00 2008-08-28 34
125 BOURNE TRAVIS NaN LAGE KAART 77 NaN BRASSCHATT NaN 2930 -500.00 2008-11-20 35
126 SECRIST BRIAN L. 3 MULE DEER TRAIL NaN LITTLETON CO 801275722 -1000.00 2008-04-07 35
127 TOLLESTRUP TRAVIS W. 16331 WINECREEK RD. NaN SAN DIEGO CA 92127 -1000.00 2008-05-15 35
In [31]:
dfcwci.sort_values(by="amount", ascending=False).head(10)
Out[31]:
last_name first_name middle_name street_1 street_2 city state zip amount date candidate_id
30 Buckel Linda NaN PO Box 683130 NaN Park City UT 840683130 4600.0 2007-08-14 20
159 ABATE MARIA ELENA 1291 NIGHTINGALE AVENUE NaN MIAMI SPRINGS FL 331663832 2600.0 2008-01-25 37
15 Anthony John NaN 211 Long Island Drive NaN Hot Springs AR 71913 2300.0 2007-06-12 16
33 Buck Blaine M 45 Eaton Ave NaN Camden ME 48431752 2300.0 2007-09-30 20
28 Buckel Linda NaN PO Box 683130 NaN Park City UT 840683130 2300.0 2007-08-14 20
21 Baker David NaN 2550 Adamsbrooke Drive NaN Conway AR 72034 2300.0 2007-04-11 16
13 Altes R.D. NaN 8600 Moody Road NaN Fort Smith AR 72903 2300.0 2007-06-21 16
135 ABRAMOWITZ NIRA NaN 411 HARBOR ROAD NaN SOUTHPORT CT 68901376 2300.0 2007-09-14 35
5 Akin Mike NaN 181 Baywood Lane NaN Monticello AR 71655 1500.0 2007-05-18 16
174 ABRAHAM SALEM A. P.O. BOX 7 NaN CANADIAN TX 790140007 1300.0 2008-01-30 37

SELECT-COLUMNS

In [32]:
dfcwci[['first_name', 'amount']].head(10)
Out[32]:
first_name amount
0 Steven 500.0
1 Don 250.0
2 Don 50.0
3 Don 100.0
4 Charles 100.0
5 Mike 1500.0
6 Rebecca 500.0
7 Brittni 250.0
8 John D. 1000.0
9 John D. 1300.0

SELECT-DISTINCT

Selecting a distinct set is useful for cleaning. Here, we might wish to focus on contributors rather than contributions and see how many distinct contributors we have. Of-course we might be wrong, some people have identical names.

In [33]:
dfcwci[['last_name','first_name']].count()
Out[33]:
last_name     175
first_name    175
dtype: int64
In [34]:
dfcwci[['last_name','first_name']].drop_duplicates().count()
Out[34]:
last_name     126
first_name    126
dtype: int64
In [35]:
dfcwci[['last_name','first_name']].drop_duplicates().head(10)
Out[35]:
last_name first_name
0 Agee Steven
1 Ahrens Don
4 Akin Charles
5 Akin Mike
6 Akin Rebecca
7 Aldridge Brittni
8 Allen John D.
10 Allison John W.
11 Allison Rebecca
13 Altes R.D.

ASSIGN

Assignment to a new column is easy.

In [36]:
dfcwci['name']=dfcwci['last_name']+", "+dfcwci['first_name']
dfcwci.head(10)
Out[36]:
last_name first_name middle_name street_1 street_2 city state zip amount date candidate_id name
0 Agee Steven NaN 549 Laurel Branch Road NaN Floyd VA 24091 500.0 2007-06-30 16 Agee, Steven
1 Ahrens Don NaN 4034 Rennellwood Way NaN Pleasanton CA 94566 250.0 2007-05-16 16 Ahrens, Don
2 Ahrens Don NaN 4034 Rennellwood Way NaN Pleasanton CA 94566 50.0 2007-06-18 16 Ahrens, Don
3 Ahrens Don NaN 4034 Rennellwood Way NaN Pleasanton CA 94566 100.0 2007-06-21 16 Ahrens, Don
4 Akin Charles NaN 10187 Sugar Creek Road NaN Bentonville AR 72712 100.0 2007-06-16 16 Akin, Charles
5 Akin Mike NaN 181 Baywood Lane NaN Monticello AR 71655 1500.0 2007-05-18 16 Akin, Mike
6 Akin Rebecca NaN 181 Baywood Lane NaN Monticello AR 71655 500.0 2007-05-18 16 Akin, Rebecca
7 Aldridge Brittni NaN 808 Capitol Square Place, SW NaN Washington DC 20024 250.0 2007-06-06 16 Aldridge, Brittni
8 Allen John D. NaN 1052 Cannon Mill Drive NaN North Augusta SC 29860 1000.0 2007-06-11 16 Allen, John D.
9 Allen John D. NaN 1052 Cannon Mill Drive NaN North Augusta SC 29860 1300.0 2007-06-29 16 Allen, John D.
In [37]:
dfcwci.assign(ucname=dfcwci.last_name+":"+dfcwci.first_name).head(10)
Out[37]:
last_name first_name middle_name street_1 street_2 city state zip amount date candidate_id name ucname
0 Agee Steven NaN 549 Laurel Branch Road NaN Floyd VA 24091 500.0 2007-06-30 16 Agee, Steven Agee:Steven
1 Ahrens Don NaN 4034 Rennellwood Way NaN Pleasanton CA 94566 250.0 2007-05-16 16 Ahrens, Don Ahrens:Don
2 Ahrens Don NaN 4034 Rennellwood Way NaN Pleasanton CA 94566 50.0 2007-06-18 16 Ahrens, Don Ahrens:Don
3 Ahrens Don NaN 4034 Rennellwood Way NaN Pleasanton CA 94566 100.0 2007-06-21 16 Ahrens, Don Ahrens:Don
4 Akin Charles NaN 10187 Sugar Creek Road NaN Bentonville AR 72712 100.0 2007-06-16 16 Akin, Charles Akin:Charles
5 Akin Mike NaN 181 Baywood Lane NaN Monticello AR 71655 1500.0 2007-05-18 16 Akin, Mike Akin:Mike
6 Akin Rebecca NaN 181 Baywood Lane NaN Monticello AR 71655 500.0 2007-05-18 16 Akin, Rebecca Akin:Rebecca
7 Aldridge Brittni NaN 808 Capitol Square Place, SW NaN Washington DC 20024 250.0 2007-06-06 16 Aldridge, Brittni Aldridge:Brittni
8 Allen John D. NaN 1052 Cannon Mill Drive NaN North Augusta SC 29860 1000.0 2007-06-11 16 Allen, John D. Allen:John D.
9 Allen John D. NaN 1052 Cannon Mill Drive NaN North Augusta SC 29860 1300.0 2007-06-29 16 Allen, John D. Allen:John D.

Will the above command actually change dfcwci?. No, it produces a fresh dataframe.

What if we wanted to change an existing assignment?

In [38]:
dfcwci[dfcwci.state=='VA']
Out[38]:
last_name first_name middle_name street_1 street_2 city state zip amount date candidate_id name
0 Agee Steven NaN 549 Laurel Branch Road NaN Floyd VA 24091 500.00 2007-06-30 16 Agee, Steven
27 Buckheit Bruce NaN 8904 KAREN DR NaN FAIRFAX VA 220312731 100.00 2007-09-19 20 Buckheit, Bruce
77 Ranganath Anoop NaN 2507 Willard Drive NaN Charlottesville VA 22903 -100.00 2008-04-21 32 Ranganath, Anoop
88 Perreault Louise NaN 503 Brockridge Hunt Drive NaN Hampton VA 23666 -34.08 2008-04-21 32 Perreault, Louise
145 ABDELLA THOMAS M. 4231 MONUMENT WALL WAY #340 NaN FAIRFAX VA 220308440 50.00 2007-09-30 35 ABDELLA, THOMAS
In [39]:
dfcwci.loc[dfcwci.state=='VA', 'name']
Out[39]:
0           Agee, Steven
27       Buckheit, Bruce
77      Ranganath, Anoop
88     Perreault, Louise
145      ABDELLA, THOMAS
Name: name, dtype: object
In [40]:
dfcwci.loc[dfcwci.state=='VA', 'name']="junk"
In [41]:
dfcwci.query("state=='VA'")
Out[41]:
last_name first_name middle_name street_1 street_2 city state zip amount date candidate_id name
0 Agee Steven NaN 549 Laurel Branch Road NaN Floyd VA 24091 500.00 2007-06-30 16 junk
27 Buckheit Bruce NaN 8904 KAREN DR NaN FAIRFAX VA 220312731 100.00 2007-09-19 20 junk
77 Ranganath Anoop NaN 2507 Willard Drive NaN Charlottesville VA 22903 -100.00 2008-04-21 32 junk
88 Perreault Louise NaN 503 Brockridge Hunt Drive NaN Hampton VA 23666 -34.08 2008-04-21 32 junk
145 ABDELLA THOMAS M. 4231 MONUMENT WALL WAY #340 NaN FAIRFAX VA 220308440 50.00 2007-09-30 35 junk

Drop-Column

Real simple:

In [42]:
del dfcwci['name']

AGGREGATE

In [43]:
dfcwci.describe()
Out[43]:
zip amount candidate_id
count 1.750000e+02 175.000000 175.000000
mean 3.780014e+08 3.418114 28.000000
std 3.628278e+08 1028.418999 7.823484
min 2.474000e+03 -2592.000000 16.000000
25% 9.336700e+04 -175.000000 20.000000
50% 3.233313e+08 100.000000 32.000000
75% 7.816946e+08 300.000000 35.000000
max 9.951532e+08 4600.000000 37.000000
In [44]:
dfcwci.amount.max()
Out[44]:
4600.0
In [45]:
dfcwci[dfcwci.amount==dfcwci.amount.max()]
Out[45]:
last_name first_name middle_name street_1 street_2 city state zip amount date candidate_id
30 Buckel Linda NaN PO Box 683130 NaN Park City UT 840683130 4600.0 2007-08-14 20
In [46]:
dfcwci[dfcwci.amount > dfcwci.amount.max() - 2300]
Out[46]:
last_name first_name middle_name street_1 street_2 city state zip amount date candidate_id
30 Buckel Linda NaN PO Box 683130 NaN Park City UT 840683130 4600.0 2007-08-14 20
159 ABATE MARIA ELENA 1291 NIGHTINGALE AVENUE NaN MIAMI SPRINGS FL 331663832 2600.0 2008-01-25 37

Aso MIN, SUM, AVG.

Grouping using Pandas and split-apply-combine

In [47]:
grouped_by_state = dfcwci.groupby("state")
grouped_by_state
Out[47]:
In [48]:
grouped_by_state.amount
Out[48]:

How do we get access to these? Standard pandas functions distribute over the groupby, going one by one over the sub-dataframes or sub-series. This is an example of a paradigm called split-apply-combine.

GROUP-AGG

The fourth part of the EDA rubric is to look at properties of the sub-dataframes you get when you make groups. (We'll talk about the graphical aspects of this later). For instance, you might group contributions by state:

In [49]:
dfcwci.groupby("state").describe()
Out[49]:
amount candidate_id zip
count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max
state
AK 3.0 403.333333 100.166528 300.00 355.0000 410.000 455.0000 500.0 3.0 37.000000 0.000000 37.0 37.00 37.0 37.00 37.0 3.0 9.951532e+08 0.000000e+00 995153207.0 9.951532e+08 995153207.0 9.951532e+08 995153207.0
AR 12.0 1183.333333 775.574079 100.00 875.0000 1000.000 1700.0000 2300.0 12.0 16.000000 0.000000 16.0 16.00 16.0 16.00 16.0 12.0 7.206583e+04 4.250251e+02 71603.0 7.165500e+04 72033.5 7.222700e+04 72903.0
AZ 1.0 120.000000 NaN 120.00 120.0000 120.000 120.0000 120.0 1.0 37.000000 NaN 37.0 37.00 37.0 37.00 37.0 1.0 8.600111e+08 NaN 860011121.0 8.600111e+08 860011121.0 8.600111e+08 860011121.0
CA 23.0 -217.988261 942.102438 -2300.00 -175.0000 50.000 225.0000 1000.0 23.0 26.086957 8.151481 16.0 20.00 22.0 34.50 37.0 23.0 6.407113e+08 4.333269e+08 92127.0 9.456600e+04 913811020.0 9.260504e+08 941151435.0
CO 4.0 -1455.750000 1025.166125 -2300.00 -2300.0000 -1650.000 -805.7500 -223.0 4.0 27.750000 6.751543 22.0 22.00 27.0 32.75 35.0 4.0 6.013695e+08 4.008598e+08 80020.0 6.009768e+08 801668384.0 8.020610e+08 802061046.0
CT 1.0 2300.000000 NaN 2300.00 2300.0000 2300.000 2300.0000 2300.0 1.0 35.000000 NaN 35.0 35.00 35.0 35.00 35.0 1.0 6.890138e+07 NaN 68901376.0 6.890138e+07 68901376.0 6.890138e+07 68901376.0
DC 5.0 -309.982000 529.644175 -1000.00 -500.0000 -500.000 200.0900 250.0 5.0 20.400000 2.607681 16.0 20.00 22.0 22.00 22.0 5.0 1.600684e+08 8.946976e+07 20024.0 2.000525e+08 200052503.0 2.000525e+08 200164320.0
FL 30.0 -135.000000 1177.383862 -2300.00 -500.0000 100.000 500.0000 2600.0 30.0 26.766667 7.151963 20.0 20.00 22.0 34.00 37.0 30.0 2.990209e+08 1.015776e+08 33647.0 3.233313e+08 333063809.5 3.343163e+08 347344707.0
IA 1.0 250.000000 NaN 250.00 250.0000 250.000 250.0000 250.0 1.0 16.000000 NaN 16.0 16.00 16.0 16.00 16.0 1.0 5.026600e+04 NaN 50266.0 5.026600e+04 50266.0 5.026600e+04 50266.0
ID 1.0 -261.000000 NaN -261.00 -261.0000 -261.000 -261.0000 -261.0 1.0 32.000000 NaN 32.0 32.00 32.0 32.00 32.0 1.0 8.364800e+04 NaN 83648.0 8.364800e+04 83648.0 8.364800e+04 83648.0
IL 6.0 -931.133333 1230.657981 -2592.00 -1936.2000 -472.400 12.5000 200.0 6.0 29.166667 6.645801 20.0 24.50 32.0 32.00 37.0 6.0 5.070114e+08 2.485720e+08 60565.0 6.005716e+08 603169827.0 6.062808e+08 628645116.0
KS 1.0 -330.000000 NaN -330.00 -330.0000 -330.000 -330.0000 -330.0 1.0 32.000000 NaN 32.0 32.00 32.0 32.00 32.0 1.0 6.621500e+04 NaN 66215.0 6.621500e+04 66215.0 6.621500e+04 66215.0
KY 1.0 -200.000000 NaN -200.00 -200.0000 -200.000 -200.0000 -200.0 1.0 22.000000 NaN 22.0 22.00 22.0 22.00 22.0 1.0 4.025970e+08 NaN 402597029.0 4.025970e+08 402597029.0 4.025970e+08 402597029.0
LA 2.0 650.000000 212.132034 500.00 575.0000 650.000 725.0000 800.0 2.0 37.000000 0.000000 37.0 37.00 37.0 37.00 37.0 2.0 7.030217e+08 4.242415e+06 700021823.0 7.015217e+08 703021663.5 7.045216e+08 706021504.0
MA 6.0 -13.833333 248.197838 -503.00 25.0000 47.500 92.5000 200.0 6.0 34.666667 1.966384 32.0 34.00 34.0 36.25 37.0 6.0 2.050444e+07 1.058572e+07 2474.0 1.975712e+07 25727725.0 2.673330e+07 26733300.0
MD 2.0 150.000000 141.421356 50.00 100.0000 150.000 200.0000 250.0 2.0 27.500000 10.606602 20.0 23.75 27.5 31.25 35.0 2.0 2.081438e+08 1.140563e+03 208143002.0 2.081434e+08 208143808.5 2.081442e+08 208144615.0
ME 4.0 630.000000 1113.522938 50.00 65.0000 85.000 650.0000 2300.0 4.0 30.500000 7.000000 20.0 30.50 34.0 34.00 34.0 4.0 4.141179e+07 4.679973e+06 39071806.0 3.907181e+07 39071806.0 4.141179e+07 48431752.0
MI 5.0 -253.000000 1156.630883 -2300.00 35.0000 200.000 300.0000 500.0 5.0 32.800000 6.220932 22.0 34.00 34.0 37.00 37.0 5.0 4.853947e+08 5.515168e+06 481885097.0 4.818851e+08 483862274.0 4.842798e+08 495061254.0
MN 3.0 107.333333 139.145008 -28.00 36.0000 100.000 175.0000 250.0 3.0 33.333333 1.154701 32.0 33.00 34.0 34.00 34.0 3.0 3.674463e+08 3.181694e+08 55902.0 2.755987e+08 551141508.0 5.511415e+08 551141508.0
MO 1.0 100.000000 NaN 100.00 100.0000 100.000 100.0000 100.0 1.0 20.000000 NaN 20.0 20.00 20.0 20.00 20.0 1.0 6.411100e+04 NaN 64111.0 6.411100e+04 64111.0 6.411100e+04 64111.0
NC 1.0 500.000000 NaN 500.00 500.0000 500.000 500.0000 500.0 1.0 16.000000 NaN 16.0 16.00 16.0 16.00 16.0 1.0 2.750200e+04 NaN 27502.0 2.750200e+04 27502.0 2.750200e+04 27502.0
NH 1.0 -24.600000 NaN -24.60 -24.6000 -24.600 -24.6000 -24.6 1.0 32.000000 NaN 32.0 32.00 32.0 32.00 32.0 1.0 3.256442e+07 NaN 32564424.0 3.256442e+07 32564424.0 3.256442e+07 32564424.0
NJ 2.0 -408.725000 522.163003 -777.95 -593.3375 -408.725 -224.1125 -39.5 2.0 32.000000 0.000000 32.0 32.00 32.0 32.00 32.0 2.0 3.512750e+07 4.966549e+07 8691.0 1.756809e+07 35127496.5 5.268690e+07 70246302.0
NV 4.0 181.250000 213.478141 50.00 68.7500 87.500 200.0000 500.0 4.0 36.000000 1.154701 35.0 35.00 36.0 37.00 37.0 4.0 8.939724e+08 1.883416e+06 891175812.0 8.936894e+08 894810312.5 8.950933e+08 895093326.0
NY 8.0 -809.312500 925.274590 -2300.00 -1250.0000 -575.000 -230.8750 300.0 8.0 29.125000 5.986592 22.0 22.00 32.0 34.00 35.0 8.0 7.576625e+07 4.678602e+07 11701.0 7.512812e+07 100206344.0 1.002265e+08 105072222.0
OH 4.0 112.500000 289.755644 -200.00 -12.5000 75.000 200.0000 500.0 4.0 20.000000 8.000000 16.0 16.00 16.0 20.00 32.0 4.0 4.401775e+04 1.126738e+03 43143.0 4.314300e+04 43712.0 4.458675e+04 45504.0
OK 3.0 266.666667 208.166600 100.00 150.0000 200.000 350.0000 500.0 3.0 34.000000 0.000000 34.0 34.00 34.0 34.00 34.0 3.0 7.341663e+08 5.157607e+06 731188602.0 7.311886e+08 731188602.0 7.356552e+08 740121840.0
PA 5.0 -429.200000 600.635663 -1100.00 -1046.0000 -100.000 -100.0000 200.0 5.0 29.000000 6.708204 22.0 22.00 32.0 32.00 37.0 5.0 1.080998e+08 9.866553e+07 15201.0 1.906700e+04 180153316.0 1.801557e+08 180155718.0
RI 2.0 100.000000 0.000000 100.00 100.0000 100.000 100.0000 100.0 2.0 35.000000 0.000000 35.0 35.00 35.0 35.00 35.0 2.0 2.903295e+07 0.000000e+00 29032946.0 2.903295e+07 29032946.0 2.903295e+07 29032946.0
SC 3.0 800.000000 624.499800 100.00 550.0000 1000.000 1150.0000 1300.0 3.0 23.000000 12.124356 16.0 16.00 16.0 26.50 37.0 3.0 9.873826e+07 1.709680e+08 29860.0 2.986000e+04 29860.0 1.480925e+08 296155069.0
TN 1.0 -25.000000 NaN -25.00 -25.0000 -25.000 -25.0000 -25.0 1.0 32.000000 NaN 32.0 32.00 32.0 32.00 32.0 1.0 3.718800e+04 NaN 37188.0 3.718800e+04 37188.0 3.718800e+04 37188.0
TX 9.0 220.582222 664.483748 -1000.00 -14.7600 50.000 500.0000 1300.0 9.0 33.555556 4.746344 22.0 32.00 35.0 37.00 37.0 9.0 6.912725e+08 2.594552e+08 78746.0 7.735463e+08 775816547.0 7.875726e+08 790140007.0
UT 11.0 459.090909 1716.183000 -2300.00 25.0000 75.000 87.5000 4600.0 11.0 30.909091 7.006490 20.0 27.50 35.0 35.00 35.0 11.0 8.410139e+08 2.208553e+05 840683130.0 8.408522e+08 841176911.0 8.411769e+08 841176911.0
VA 5.0 103.184000 234.748140 -100.00 -34.0800 50.000 100.0000 500.0 5.0 27.000000 8.426150 16.0 20.00 32.0 32.00 35.0 5.0 8.813837e+07 1.206562e+08 22903.0 2.366600e+04 24091.0 2.203084e+08 220312731.0
WA 3.0 -166.666667 461.880215 -700.00 -300.0000 100.000 100.0000 100.0 3.0 30.000000 6.928203 22.0 28.00 34.0 34.00 34.0 3.0 9.804301e+08 5.270573e+05 980125787.0 9.801258e+08 980125787.0 9.805822e+08 981038677.0
In [50]:
dfcwci.groupby("state").sum()
Out[50]:
zip amount candidate_id
state
AK 2985459621 1210.00 111
AR 864790 14200.00 192
AZ 860011121 120.00 37
CA 14736360720 -5013.73 600
CO 2405477834 -5823.00 111
CT 68901376 2300.00 35
DC 800341853 -1549.91 102
FL 8970626520 -4050.00 803
IA 50266 250.00 16
ID 83648 -261.00 32
IL 3042068689 -5586.80 175
KS 66215 -330.00 32
KY 402597029 -200.00 22
LA 1406043327 1300.00 74
MA 123026638 -83.00 208
MD 416287617 300.00 55
ME 165647170 2520.00 122
MI 2426973485 -1265.00 164
MN 1102338918 322.00 100
MO 64111 100.00 20
NC 27502 500.00 16
NH 32564424 -24.60 32
NJ 70254993 -817.45 64
NV 3575889763 725.00 144
NY 606129991 -6474.50 233
OH 176071 450.00 80
OK 2202499044 800.00 102
PA 540499020 -2146.00 145
RI 58065892 200.00 70
SC 296214789 2400.00 69
TN 37188 -25.00 32
TX 6221452245 1985.24 302
UT 9251153394 5050.00 340
VA 440691831 515.92 135
WA 2941290251 -500.00 90
In [51]:
dfcwci.groupby("state")['amount'].mean()
Out[51]:
state
AK     403.333333
AR    1183.333333
AZ     120.000000
CA    -217.988261
CO   -1455.750000
CT    2300.000000
DC    -309.982000
FL    -135.000000
IA     250.000000
ID    -261.000000
IL    -931.133333
KS    -330.000000
KY    -200.000000
LA     650.000000
MA     -13.833333
MD     150.000000
ME     630.000000
MI    -253.000000
MN     107.333333
MO     100.000000
NC     500.000000
NH     -24.600000
NJ    -408.725000
NV     181.250000
NY    -809.312500
OH     112.500000
OK     266.666667
PA    -429.200000
RI     100.000000
SC     800.000000
TN     -25.000000
TX     220.582222
UT     459.090909
VA     103.184000
WA    -166.666667
Name: amount, dtype: float64
In [52]:
dfcwci.groupby("state")['amount'].apply(lambda x: np.std(x))
Out[52]:
state
AK      81.785628
AR     742.555647
AZ       0.000000
CA     921.394361
CO     887.819907
CT       0.000000
DC     473.728151
FL    1157.594489
IA       0.000000
ID       0.000000
IL    1123.431895
KS       0.000000
KY       0.000000
LA     150.000000
MA     226.572591
MD     100.000000
ME     964.339152
MI    1034.522112
MN     113.611424
MO       0.000000
NC       0.000000
NH       0.000000
NJ     369.225000
NV     184.877493
NY     865.515126
OH     250.935749
OK     169.967317
PA     537.224869
RI       0.000000
SC     509.901951
TN       0.000000
TX     626.481285
UT    1636.316287
VA     209.965120
WA     377.123617
Name: amount, dtype: float64

The dictionary-like structure is more obvious in this method of iteration, but it does not do the combining part...

In [53]:
for k, v in dfcwci.groupby('state'):
    print("State", k, "mean amount", v.amount.mean(), "std", v.amount.std())
State AK mean amount 403.3333333333333 std 100.16652800877813
State AR mean amount 1183.3333333333333 std 775.574078676661
State AZ mean amount 120.0 std nan
State CA mean amount -217.9882608695652 std 942.1024379191479
State CO mean amount -1455.75 std 1025.16612474922
State CT mean amount 2300.0 std nan
State DC mean amount -309.98199999999997 std 529.6441745360747
State FL mean amount -135.0 std 1177.3838620520878
State IA mean amount 250.0 std nan
State ID mean amount -261.0 std nan
State IL mean amount -931.1333333333333 std 1230.6579811900083
State KS mean amount -330.0 std nan
State KY mean amount -200.0 std nan
State LA mean amount 650.0 std 212.13203435596427
State MA mean amount -13.833333333333334 std 248.19783775582468
State MD mean amount 150.0 std 141.4213562373095
State ME mean amount 630.0 std 1113.5229379466475
State MI mean amount -253.0 std 1156.630883212099
State MN mean amount 107.33333333333333 std 139.14500829470433
State MO mean amount 100.0 std nan
State NC mean amount 500.0 std nan
State NH mean amount -24.6 std nan
State NJ mean amount -408.725 std 522.163002567206
State NV mean amount 181.25 std 213.47814095749163
State NY mean amount -809.3125 std 925.2745900349181
State OH mean amount 112.5 std 289.75564417856185
State OK mean amount 266.6666666666667 std 208.16659994661327
State PA mean amount -429.2 std 600.6356632768321
State RI mean amount 100.0 std 0.0
State SC mean amount 800.0 std 624.4997998398399
State TN mean amount -25.0 std nan
State TX mean amount 220.5822222222222 std 664.483747614977
State UT mean amount 459.09090909090907 std 1716.1830004137353
State VA mean amount 103.18400000000001 std 234.74814009912836
State WA mean amount -166.66666666666666 std 461.88021535170066

DELETE

In [54]:
dfcwci.head()
Out[54]:
last_name first_name middle_name street_1 street_2 city state zip amount date candidate_id
0 Agee Steven NaN 549 Laurel Branch Road NaN Floyd VA 24091 500.0 2007-06-30 16
1 Ahrens Don NaN 4034 Rennellwood Way NaN Pleasanton CA 94566 250.0 2007-05-16 16
2 Ahrens Don NaN 4034 Rennellwood Way NaN Pleasanton CA 94566 50.0 2007-06-18 16
3 Ahrens Don NaN 4034 Rennellwood Way NaN Pleasanton CA 94566 100.0 2007-06-21 16
4 Akin Charles NaN 10187 Sugar Creek Road NaN Bentonville AR 72712 100.0 2007-06-16 16

In-place drops

In [55]:
df2=dfcwci.copy()
df2.set_index('last_name', inplace=True)
df2.head()
Out[55]:
first_name middle_name street_1 street_2 city state zip amount date candidate_id
last_name
Agee Steven NaN 549 Laurel Branch Road NaN Floyd VA 24091 500.0 2007-06-30 16
Ahrens Don NaN 4034 Rennellwood Way NaN Pleasanton CA 94566 250.0 2007-05-16 16
Ahrens Don NaN 4034 Rennellwood Way NaN Pleasanton CA 94566 50.0 2007-06-18 16
Ahrens Don NaN 4034 Rennellwood Way NaN Pleasanton CA 94566 100.0 2007-06-21 16
Akin Charles NaN 10187 Sugar Creek Road NaN Bentonville AR 72712 100.0 2007-06-16 16
In [56]:
df2.drop(['Ahrens'], inplace=True)
df2.head()
Out[56]:
first_name middle_name street_1 street_2 city state zip amount date candidate_id
last_name
Agee Steven NaN 549 Laurel Branch Road NaN Floyd VA 24091 500.0 2007-06-30 16
Akin Charles NaN 10187 Sugar Creek Road NaN Bentonville AR 72712 100.0 2007-06-16 16
Akin Mike NaN 181 Baywood Lane NaN Monticello AR 71655 1500.0 2007-05-18 16
Akin Rebecca NaN 181 Baywood Lane NaN Monticello AR 71655 500.0 2007-05-18 16
Aldridge Brittni NaN 808 Capitol Square Place, SW NaN Washington DC 20024 250.0 2007-06-06 16
In [57]:
df2.reset_index(inplace=True)
df2.head()
Out[57]:
last_name first_name middle_name street_1 street_2 city state zip amount date candidate_id
0 Agee Steven NaN 549 Laurel Branch Road NaN Floyd VA 24091 500.0 2007-06-30 16
1 Akin Charles NaN 10187 Sugar Creek Road NaN Bentonville AR 72712 100.0 2007-06-16 16
2 Akin Mike NaN 181 Baywood Lane NaN Monticello AR 71655 1500.0 2007-05-18 16
3 Akin Rebecca NaN 181 Baywood Lane NaN Monticello AR 71655 500.0 2007-05-18 16
4 Aldridge Brittni NaN 808 Capitol Square Place, SW NaN Washington DC 20024 250.0 2007-06-06 16

The recommended way to do it is to create a new dataframe. This might be impractical is things are very large.

In [58]:
dfcwci=dfcwci[dfcwci.last_name!='Ahrens']
dfcwci.head(10)
Out[58]:
last_name first_name middle_name street_1 street_2 city state zip amount date candidate_id
0 Agee Steven NaN 549 Laurel Branch Road NaN Floyd VA 24091 500.0 2007-06-30 16
4 Akin Charles NaN 10187 Sugar Creek Road NaN Bentonville AR 72712 100.0 2007-06-16 16
5 Akin Mike NaN 181 Baywood Lane NaN Monticello AR 71655 1500.0 2007-05-18 16
6 Akin Rebecca NaN 181 Baywood Lane NaN Monticello AR 71655 500.0 2007-05-18 16
7 Aldridge Brittni NaN 808 Capitol Square Place, SW NaN Washington DC 20024 250.0 2007-06-06 16
8 Allen John D. NaN 1052 Cannon Mill Drive NaN North Augusta SC 29860 1000.0 2007-06-11 16
9 Allen John D. NaN 1052 Cannon Mill Drive NaN North Augusta SC 29860 1300.0 2007-06-29 16
10 Allison John W. NaN P.O. Box 1089 NaN Conway AR 72033 1000.0 2007-05-18 16
11 Allison Rebecca NaN 3206 Summit Court NaN Little Rock AR 72227 1000.0 2007-04-25 16
12 Allison Rebecca NaN 3206 Summit Court NaN Little Rock AR 72227 200.0 2007-06-12 16

LIMIT

In [59]:
dfcwci[0:3] # also see loc and iloc from the lab
Out[59]:
last_name first_name middle_name street_1 street_2 city state zip amount date candidate_id
0 Agee Steven NaN 549 Laurel Branch Road NaN Floyd VA 24091 500.0 2007-06-30 16
4 Akin Charles NaN 10187 Sugar Creek Road NaN Bentonville AR 72712 100.0 2007-06-16 16
5 Akin Mike NaN 181 Baywood Lane NaN Monticello AR 71655 1500.0 2007-05-18 16

Relationships: JOINs are Cartesian Products.

Finally, there are many occasions you will want to combine dataframes. We might want to see who contributed to Obama:

Simple subselect

In [60]:
dfcand.head()
Out[60]:
id first_name last_name middle_name party
0 33 Joseph Biden NaN D
1 36 Samuel Brownback NaN R
2 34 Hillary Clinton R. D
3 39 Christopher Dodd J. D
4 26 John Edwards NaN D
In [61]:
obamaid=dfcand.query("last_name=='Obama'")['id'].values[0]
In [62]:
obamacontrib=dfcwci.query("candidate_id==%i" % obamaid)
obamacontrib.head()
Out[62]:
last_name first_name middle_name street_1 street_2 city state zip amount date candidate_id
25 Buckler Steve NaN 24351 Armada Dr NaN Dana Point CA 926291306 50.0 2007-07-30 20
26 Buckler Steve NaN 24351 Armada Dr NaN Dana Point CA 926291306 25.0 2007-08-16 20
27 Buckheit Bruce NaN 8904 KAREN DR NaN FAIRFAX VA 220312731 100.0 2007-09-19 20
28 Buckel Linda NaN PO Box 683130 NaN Park City UT 840683130 2300.0 2007-08-14 20
29 Buckel Linda NaN PO Box 683130 NaN Park City UT 840683130 -2300.0 2007-08-14 20

Explicit INNER JOIN

This is the one you will want 90% of the time. It will only match keys that are present in both dataframes.

In [63]:
cols_wanted=['last_name_x', 'first_name_x', 'candidate_id', 'id', 'last_name_y']
dfcwci.merge(dfcand, left_on="candidate_id", right_on="id")[cols_wanted]
Out[63]:
last_name_x first_name_x candidate_id id last_name_y
0 Agee Steven 16 16 Huckabee
1 Akin Charles 16 16 Huckabee
2 Akin Mike 16 16 Huckabee
3 Akin Rebecca 16 16 Huckabee
4 Aldridge Brittni 16 16 Huckabee
5 Allen John D. 16 16 Huckabee
6 Allen John D. 16 16 Huckabee
7 Allison John W. 16 16 Huckabee
8 Allison Rebecca 16 16 Huckabee
9 Allison Rebecca 16 16 Huckabee
10 Altes R.D. 16 16 Huckabee
11 Andres Dale 16 16 Huckabee
12 Anthony John 16 16 Huckabee
13 Arbogast Robert 16 16 Huckabee
14 Arbogast Robert 16 16 Huckabee
15 Ardle William 16 16 Huckabee
16 Atiq Omar 16 16 Huckabee
17 Atiq Omar 16 16 Huckabee
18 Baker David 16 16 Huckabee
19 Bancroft David 16 16 Huckabee
20 Banks Charles 16 16 Huckabee
21 Barbee John 16 16 Huckabee
22 Buckler Steve 20 20 Obama
23 Buckler Steve 20 20 Obama
24 Buckheit Bruce 20 20 Obama
25 Buckel Linda 20 20 Obama
26 Buckel Linda 20 20 Obama
27 Buckel Linda 20 20 Obama
28 Buck Thomas 20 20 Obama
29 Buck Jay 20 20 Obama
... ... ... ... ... ...
142 ABDELLA THOMAS 35 35 Romney
143 ABBOTT WELDON 35 35 Romney
144 ABBOTT WELDON 35 35 Romney
145 ABBOTT GERALD 35 35 Romney
146 ABBOTT GERALD 35 35 Romney
147 ABEDIN ZAINUL 37 37 McCain
148 ABBOTT SYBIL 37 37 McCain
149 ABBOTT SYBIL 37 37 McCain
150 ABBOTT RONALD 37 37 McCain
151 ABBOTT RONALD 37 37 McCain
152 ABBOTT ROBERT 37 37 McCain
153 ABBOTT MIKE 37 37 McCain
154 ABBOT DAVID 37 37 McCain
155 ABBO PAULINE 37 37 McCain
156 ABATE MARIA 37 37 McCain
157 ABAIR PETER 37 37 McCain
158 ABACHERLI SHIRLEY 37 37 McCain
159 AARONS CHARLES 37 37 McCain
160 AARONS CHARLES 37 37 McCain
161 AARONS CHARLES 37 37 McCain
162 ABEL JOHN 37 37 McCain
163 ABEL MARLING 37 37 McCain
164 ABEL RUDOLPH 37 37 McCain
165 ABELE RODNEY 37 37 McCain
166 ABERCROMBIE DENIS 37 37 McCain
167 ABESHAUS MERRILL 37 37 McCain
168 ABRAHAM GEORGE 37 37 McCain
169 ABRAHAMSON PETER 37 37 McCain
170 ABRAHAM SALEM 37 37 McCain
171 ABRAHAM SALEM 37 37 McCain

172 rows × 5 columns

If the names of the columns you wanted to merge on were identical, you could simply say on=id, for example, rather than a left_on and a right_on.

Outer JOIN

left outer (contributors on candidates)

left outer

In [64]:
dfcwci.merge(dfcand, left_on="candidate_id", right_on="id", how="left")[cols_wanted]
Out[64]:
last_name_x first_name_x candidate_id id last_name_y
0 Agee Steven 16 16 Huckabee
1 Akin Charles 16 16 Huckabee
2 Akin Mike 16 16 Huckabee
3 Akin Rebecca 16 16 Huckabee
4 Aldridge Brittni 16 16 Huckabee
5 Allen John D. 16 16 Huckabee
6 Allen John D. 16 16 Huckabee
7 Allison John W. 16 16 Huckabee
8 Allison Rebecca 16 16 Huckabee
9 Allison Rebecca 16 16 Huckabee
10 Altes R.D. 16 16 Huckabee
11 Andres Dale 16 16 Huckabee
12 Anthony John 16 16 Huckabee
13 Arbogast Robert 16 16 Huckabee
14 Arbogast Robert 16 16 Huckabee
15 Ardle William 16 16 Huckabee
16 Atiq Omar 16 16 Huckabee
17 Atiq Omar 16 16 Huckabee
18 Baker David 16 16 Huckabee
19 Bancroft David 16 16 Huckabee
20 Banks Charles 16 16 Huckabee
21 Barbee John 16 16 Huckabee
22 Buckler Steve 20 20 Obama
23 Buckler Steve 20 20 Obama
24 Buckheit Bruce 20 20 Obama
25 Buckel Linda 20 20 Obama
26 Buckel Linda 20 20 Obama
27 Buckel Linda 20 20 Obama
28 Buck Thomas 20 20 Obama
29 Buck Jay 20 20 Obama
... ... ... ... ... ...
142 ABDELLA THOMAS 35 35 Romney
143 ABBOTT WELDON 35 35 Romney
144 ABBOTT WELDON 35 35 Romney
145 ABBOTT GERALD 35 35 Romney
146 ABBOTT GERALD 35 35 Romney
147 ABEDIN ZAINUL 37 37 McCain
148 ABBOTT SYBIL 37 37 McCain
149 ABBOTT SYBIL 37 37 McCain
150 ABBOTT RONALD 37 37 McCain
151 ABBOTT RONALD 37 37 McCain
152 ABBOTT ROBERT 37 37 McCain
153 ABBOTT MIKE 37 37 McCain
154 ABBOT DAVID 37 37 McCain
155 ABBO PAULINE 37 37 McCain
156 ABATE MARIA 37 37 McCain
157 ABAIR PETER 37 37 McCain
158 ABACHERLI SHIRLEY 37 37 McCain
159 AARONS CHARLES 37 37 McCain
160 AARONS CHARLES 37 37 McCain
161 AARONS CHARLES 37 37 McCain
162 ABEL JOHN 37 37 McCain
163 ABEL MARLING 37 37 McCain
164 ABEL RUDOLPH 37 37 McCain
165 ABELE RODNEY 37 37 McCain
166 ABERCROMBIE DENIS 37 37 McCain
167 ABESHAUS MERRILL 37 37 McCain
168 ABRAHAM GEORGE 37 37 McCain
169 ABRAHAMSON PETER 37 37 McCain
170 ABRAHAM SALEM 37 37 McCain
171 ABRAHAM SALEM 37 37 McCain

172 rows × 5 columns

right outer (contributors on candidates) = left outer (candidates on contributors)

right outer

In [65]:
dfcwci.merge(dfcand, left_on="candidate_id", right_on="id", how="right")[cols_wanted]
Out[65]:
last_name_x first_name_x candidate_id id last_name_y
0 Agee Steven 16.0 16 Huckabee
1 Akin Charles 16.0 16 Huckabee
2 Akin Mike 16.0 16 Huckabee
3 Akin Rebecca 16.0 16 Huckabee
4 Aldridge Brittni 16.0 16 Huckabee
5 Allen John D. 16.0 16 Huckabee
6 Allen John D. 16.0 16 Huckabee
7 Allison John W. 16.0 16 Huckabee
8 Allison Rebecca 16.0 16 Huckabee
9 Allison Rebecca 16.0 16 Huckabee
10 Altes R.D. 16.0 16 Huckabee
11 Andres Dale 16.0 16 Huckabee
12 Anthony John 16.0 16 Huckabee
13 Arbogast Robert 16.0 16 Huckabee
14 Arbogast Robert 16.0 16 Huckabee
15 Ardle William 16.0 16 Huckabee
16 Atiq Omar 16.0 16 Huckabee
17 Atiq Omar 16.0 16 Huckabee
18 Baker David 16.0 16 Huckabee
19 Bancroft David 16.0 16 Huckabee
20 Banks Charles 16.0 16 Huckabee
21 Barbee John 16.0 16 Huckabee
22 Buckler Steve 20.0 20 Obama
23 Buckler Steve 20.0 20 Obama
24 Buckheit Bruce 20.0 20 Obama
25 Buckel Linda 20.0 20 Obama
26 Buckel Linda 20.0 20 Obama
27 Buckel Linda 20.0 20 Obama
28 Buck Thomas 20.0 20 Obama
29 Buck Jay 20.0 20 Obama
... ... ... ... ... ...
152 ABBOTT ROBERT 37.0 37 McCain
153 ABBOTT MIKE 37.0 37 McCain
154 ABBOT DAVID 37.0 37 McCain
155 ABBO PAULINE 37.0 37 McCain
156 ABATE MARIA 37.0 37 McCain
157 ABAIR PETER 37.0 37 McCain
158 ABACHERLI SHIRLEY 37.0 37 McCain
159 AARONS CHARLES 37.0 37 McCain
160 AARONS CHARLES 37.0 37 McCain
161 AARONS CHARLES 37.0 37 McCain
162 ABEL JOHN 37.0 37 McCain
163 ABEL MARLING 37.0 37 McCain
164 ABEL RUDOLPH 37.0 37 McCain
165 ABELE RODNEY 37.0 37 McCain
166 ABERCROMBIE DENIS 37.0 37 McCain
167 ABESHAUS MERRILL 37.0 37 McCain
168 ABRAHAM GEORGE 37.0 37 McCain
169 ABRAHAMSON PETER 37.0 37 McCain
170 ABRAHAM SALEM 37.0 37 McCain
171 ABRAHAM SALEM 37.0 37 McCain
172 NaN NaN NaN 33 Biden
173 NaN NaN NaN 36 Brownback
174 NaN NaN NaN 39 Dodd
175 NaN NaN NaN 26 Edwards
176 NaN NaN NaN 24 Gravel
177 NaN NaN NaN 30 Hunter
178 NaN NaN NaN 31 Kucinich
179 NaN NaN NaN 29 Richardson
180 NaN NaN NaN 38 Tancredo
181 NaN NaN NaN 41 Thompson

182 rows × 5 columns

full outer

outer

In [66]:
dfcwci.merge(dfcand, left_on="candidate_id", right_on="id", how="outer")[cols_wanted]
Out[66]:
last_name_x first_name_x candidate_id id last_name_y
0 Agee Steven 16.0 16 Huckabee
1 Akin Charles 16.0 16 Huckabee
2 Akin Mike 16.0 16 Huckabee
3 Akin Rebecca 16.0 16 Huckabee
4 Aldridge Brittni 16.0 16 Huckabee
5 Allen John D. 16.0 16 Huckabee
6 Allen John D. 16.0 16 Huckabee
7 Allison John W. 16.0 16 Huckabee
8 Allison Rebecca 16.0 16 Huckabee
9 Allison Rebecca 16.0 16 Huckabee
10 Altes R.D. 16.0 16 Huckabee
11 Andres Dale 16.0 16 Huckabee
12 Anthony John 16.0 16 Huckabee
13 Arbogast Robert 16.0 16 Huckabee
14 Arbogast Robert 16.0 16 Huckabee
15 Ardle William 16.0 16 Huckabee
16 Atiq Omar 16.0 16 Huckabee
17 Atiq Omar 16.0 16 Huckabee
18 Baker David 16.0 16 Huckabee
19 Bancroft David 16.0 16 Huckabee
20 Banks Charles 16.0 16 Huckabee
21 Barbee John 16.0 16 Huckabee
22 Buckler Steve 20.0 20 Obama
23 Buckler Steve 20.0 20 Obama
24 Buckheit Bruce 20.0 20 Obama
25 Buckel Linda 20.0 20 Obama
26 Buckel Linda 20.0 20 Obama
27 Buckel Linda 20.0 20 Obama
28 Buck Thomas 20.0 20 Obama
29 Buck Jay 20.0 20 Obama
... ... ... ... ... ...
152 ABBOTT ROBERT 37.0 37 McCain
153 ABBOTT MIKE 37.0 37 McCain
154 ABBOT DAVID 37.0 37 McCain
155 ABBO PAULINE 37.0 37 McCain
156 ABATE MARIA 37.0 37 McCain
157 ABAIR PETER 37.0 37 McCain
158 ABACHERLI SHIRLEY 37.0 37 McCain
159 AARONS CHARLES 37.0 37 McCain
160 AARONS CHARLES 37.0 37 McCain
161 AARONS CHARLES 37.0 37 McCain
162 ABEL JOHN 37.0 37 McCain
163 ABEL MARLING 37.0 37 McCain
164 ABEL RUDOLPH 37.0 37 McCain
165 ABELE RODNEY 37.0 37 McCain
166 ABERCROMBIE DENIS 37.0 37 McCain
167 ABESHAUS MERRILL 37.0 37 McCain
168 ABRAHAM GEORGE 37.0 37 McCain
169 ABRAHAMSON PETER 37.0 37 McCain
170 ABRAHAM SALEM 37.0 37 McCain
171 ABRAHAM SALEM 37.0 37 McCain
172 NaN NaN NaN 33 Biden
173 NaN NaN NaN 36 Brownback
174 NaN NaN NaN 39 Dodd
175 NaN NaN NaN 26 Edwards
176 NaN NaN NaN 24 Gravel
177 NaN NaN NaN 30 Hunter
178 NaN NaN NaN 31 Kucinich
179 NaN NaN NaN 29 Richardson
180 NaN NaN NaN 38 Tancredo
181 NaN NaN NaN 41 Thompson

182 rows × 5 columns