EDA, Pandas, and the Grammar of Data¶
We'd like a data structure that can represent the columns in the data above by their name. In particular, we want a structure that can easily store variables of different types, that stores column names, and that we can reference by column name as well as by indexed position. And it would be nice if this data structure came with built-in functions that we can use to manipulate it.
Pandas is a package/library that does all of this! The library is built on top of numpy. There are two basic pandas objects, series and dataframes, which can be thought of as enhanced versions of 1D and 2D numpy arrays, respectively. Indeed, Pandas attempts to keep all the efficiencies that numpy gives us.
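As a minimal sketch of this correspondence (the toy arrays here are made up for illustration):
import numpy as np
import pandas as pd
s = pd.Series(np.array([1, 2, 3]))          # a series: a 1D numpy array with an index attached
df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['a', 'b'])  # a dataframe: a 2D array with named columns
df.values                                    # the underlying numpy array is still accessible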
For reference, here is a useful pandas cheat sheet and the pandas documentation.
The basic EDA workflow¶
Below is a basic checklist for the early stages of exploratory data analysis in Python. While not universally applicable, the rubric covers patterns which recur in several data analysis contexts, so it is useful to keep in mind when encountering a new dataset.
The basic workflow (enunciated in this form by Chris Beaumont, the first Head TF of cs109 ever) is as follows:
- Build a Dataframe from the data (ideally, put all data in this object)
- Clean the Dataframe. It should have the following properties:
  - Each row describes a single object
  - Each column describes a property of that object
  - Columns are numeric whenever appropriate
  - Columns contain atomic properties that cannot be further decomposed
- Explore global properties. Use histograms, scatter plots, and aggregation functions to summarize the data.
- Explore group properties. Use groupby, queries, and small multiples to compare subsets of the data.
This process transforms the data into a format which is easier to work with, gives you a basic overview of the data's properties, and likely generates several questions for you to follow up on in subsequent analysis.
# The %... is an IPython thing, and is not part of the Python language.
# In this case we're just telling the plotting library to draw things on
# the notebook, instead of on a separate window.
%matplotlib inline
# See all the "as ..." constructs? They're just aliasing the package names.
# That way we can call methods like plt.plot() instead of matplotlib.pyplot.plot().
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
import time
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")
Building a dataframe¶
The easiest way to build a dataframe is simply to read in a CSV file. We saw an example of this in lab, and we shall see 2 examples here. We'll also see how we may combine multiple data sources into a larger dataframe.
This example is adapted from: https://github.com/tthibo/SQL-Tutorial
The first 3 lines of the file (!head -3 data/candidates.txt on Mac/Unix) look like this:
id|first_name|last_name|middle_name|party
33|Joseph|Biden||D
36|Samuel|Brownback||R
dfcand=pd.read_csv("./data/candidates.txt", sep='|')
dfcand.head()
A pandas dataframe is a set of columns pasted together into a spreadsheet (the cheat sheet referenced above has a helpful schematic). The columns in pandas are called series objects.
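We can check this directly; the type of a single column is a pandas Series:
type(dfcand['first_name'])   # pandas.core.series.Series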
All the columns in this dataframe:
dfcand.columns
And the types of these columns:
dfcand.dtypes
Access to a particular column can be obtained by treating the column name as an "attribute" of the dataframe:
dfcand.first_name
But Pandas also supports dictionary-like access to columns. This is very useful when column names have spaces, since Python variable names cannot contain spaces.
dfcand['first_name']
We can also get sub-dataframes by choosing a set of series. We pass a list of the columns we want as "dictionary keys" to the dataframe.
columns_i_want=['first_name', 'last_name']
dfcand[columns_i_want]
Categoricals¶
Even though party is a string, it takes on only a finite set of values: 'D' and 'R'. We can model this:
dfcand.party.unique()
dfcand['party'] = dfcand['party'].astype("category")
dfcand.dtypes
dfcand.head()
dfcand.party.describe()
dfcand.party.cat.categories
dfcand.party.cat.ordered
Keep in mind that this is a relatively new feature of Pandas. You don't need to do this, but you might find it useful to keep your types straight. (Using categoricals in machine learning algorithms is more complex and usually involves a process called One-Hot Encoding.)
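For a flavor of what that involves, pandas' get_dummies turns a categorical column into one indicator column per category. A minimal sketch (we won't use this below):
pd.get_dummies(dfcand['party']).head()   # one 0/1 column each for D and R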
Another piece of data¶
This is a file of people who have contributed money to candidates:
(!head -3 data/contributors_with_candidate_id.txt)
id|last_name|first_name|middle_name|street_1|street_2|city|state|zip|amount|date|candidate_id
|Agee|Steven||549 Laurel Branch Road||Floyd|VA|24091|500.00|2007-06-30|16
|Ahrens|Don||4034 Rennellwood Way||Pleasanton|CA|94566|250.00|2007-05-16|16
dfcwci=pd.read_csv("./data/contributors_with_candidate_id.txt", sep="|")
dfcwci.head()
Cleaning Data¶
Most of the techniques you will learn about in Pandas are about getting data into a form that can be used for further analysis. Cleaning usually means dealing with missing values, transforming types appropriately, and taking care of data integrity. But we'll lump everything required to transform data into a form appropriate for analysis under "cleaning", even if what we are doing is, for example, combining multiple data sets or producing processed data from raw data.
Let's start with some regular cleaning:
del dfcwci['id']
dfcwci.head()
We can see the size of our data:
dfcwci.shape, dfcand.shape
We will do some more cleaning soon, but first let us see the EDA process sliced from another angle: data transformation, used both for cleaning and for examining single-column and multiple-column properties of the data.
Data transformation: Single Table Verbs¶
Let us now focus on core data manipulation commands. These are universal across systems, and by identifying them, we can quickly ask how to do these when we encounter a new system.
See https://gist.github.com/TomAugspurger/6e052140eaa5fdb6e8c0/ which has a comparison of r/dplyr and pandas. I stole and modified this table from there:
dplyr has a small set of nicely defined verbs, which Hadley Wickham has used to identify core data manipulation commands. Here are listed the closest SQL and Pandas verbs, so we can see the universality of these manipulations.
VERB | dplyr | pandas | SQL |
---|---|---|---|
QUERY/SELECTION | filter() (and slice()) | query() (and loc[], iloc[]) | SELECT WHERE |
SORT | arrange() | sort_values() | ORDER BY |
SELECT-COLUMNS/PROJECTION | select() (and rename()) | [] (__getitem__) (and rename()) | SELECT COLUMN |
SELECT-DISTINCT | distinct() | unique(), drop_duplicates() | SELECT DISTINCT COLUMN |
ASSIGN | mutate() (and transmute()) | assign() | ALTER/UPDATE |
AGGREGATE | summarise() | describe(), mean(), max() | None, AVG(), MAX() |
SAMPLE | sample_n() and sample_frac() | sample() | implementation dependent, use RAND() |
GROUP-AGG | group_by()/summarize() | groupby()/agg(), count(), mean() | GROUP BY |
DELETE | ? | drop()/masking | DELETE WHERE |
We'll tackle these one by one in Pandas, since these are data manipulations you will do all the time.
QUERY¶
dfcwci.amount < 400
This gives us Trues and Falses. Such a series is called a mask. A mask is the basis of filtering. We can do:
dfcwci[dfcwci.amount < 400].head()
Notice that the dataframe has been filtered down to only include those contributions with amount < 400. The rows with False in the mask have been eliminated, and those with True in the mask have been kept.
np.sum(dfcwci.amount < 400), np.mean(dfcwci.amount < 400)
Why did that work? The booleans are coerced to integers as below:
1*True, 1*False
Or directly in Pandas, which works since the comparison produces a pandas Series:
(dfcwci.amount < 400).mean()
You can combine queries. Note that we use Python's & operator instead of and. This is because we are "Boolean AND"ing masks to get a new series of Trues and Falses.
dfcwci[(dfcwci.state=='VA') & (dfcwci.amount < 400)]
Here is another way to write the query:
dfcwci.query("state=='VA' & amount < 400")
For cleaning, we might want to use this querying ability to find rows with missing data:
dfcwci[dfcwci.state.isnull()]
Or the opposite, which is probably more useful in making the selection:
dfcwci[dfcwci.state.notnull()].head()
For categoricals you can use isin. And you can use the Boolean not operator, ~, on the mask to implement NOT IN.
dfcwci[dfcwci.state.isin(['VA','WA'])].head(10)
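And the NOT IN version, negating the mask with ~:
dfcwci[~dfcwci.state.isin(['VA','WA'])].head(10)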
And you can chain comparisons thus:
dfcwci.query("10 <= amount <= 50").head(10)
SORT¶
dfcwci.sort_values(by="amount").head(10)
dfcwci[dfcwci.amount < 0]
dfcwci.sort_values(by="amount", ascending=False).head(10)
SELECT-COLUMNS¶
dfcwci[['first_name', 'amount']].head(10)
SELECT-DISTINCT¶
Selecting a distinct set is useful for cleaning. Here, we might wish to focus on contributors rather than contributions and see how many distinct contributors we have. Of course, we might be wrong: some people have identical names.
dfcwci[['last_name','first_name']].count()
dfcwci[['last_name','first_name']].drop_duplicates().count()
dfcwci[['last_name','first_name']].drop_duplicates().head(10)
ASSIGN¶
Assignment to a new column is easy.
dfcwci['name']=dfcwci['last_name']+", "+dfcwci['first_name']
dfcwci.head(10)
dfcwci.assign(ucname=dfcwci.last_name+":"+dfcwci.first_name).head(10)
Will the above command actually change dfcwci? No, it produces a fresh dataframe.
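We can verify this: the new column is absent from the original dataframe.
'ucname' in dfcwci.columns   # False: assign returned a fresh dataframe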
What if we wanted to change an existing assignment?¶
dfcwci[dfcwci.state=='VA']
dfcwci.loc[dfcwci.state=='VA', 'name']
dfcwci.loc[dfcwci.state=='VA', 'name']="junk"
dfcwci.query("state=='VA'")
Drop-Column¶
Real simple:
del dfcwci['name']
AGGREGATE¶
dfcwci.describe()
dfcwci.amount.max()
dfcwci[dfcwci.amount==dfcwci.amount.max()]
dfcwci[dfcwci.amount > dfcwci.amount.max() - 2300]
Also MIN, SUM, AVG.
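In pandas these correspond to min(), sum(), and mean():
dfcwci.amount.min(), dfcwci.amount.sum(), dfcwci.amount.mean()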
Grouping using Pandas and split-apply-combine¶
grouped_by_state = dfcwci.groupby("state")
grouped_by_state
grouped_by_state.amount
How do we get access to these? Standard pandas functions distribute over the groupby, going one by one over the sub-dataframes or sub-series. This is an example of a paradigm called split-apply-combine.
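For instance, using the grouped object from above, mean() is applied to each state's sub-series and the per-state results are combined into a new series:
grouped_by_state['amount'].mean().head()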
GROUP-AGG¶
The fourth part of the EDA rubric is to look at properties of the sub-dataframes you get when you make groups. (We'll talk about the graphical aspects of this later). For instance, you might group contributions by state:
dfcwci.groupby("state").describe()
dfcwci.groupby("state").sum()
dfcwci.groupby("state")['amount'].mean()
dfcwci.groupby("state")['amount'].apply(lambda x: np.std(x))
The dictionary-like structure is more obvious in this method of iteration, but it does not do the combining part...
for k, v in dfcwci.groupby('state'):
print("State", k, "mean amount", v.amount.mean(), "std", v.amount.std())
DELETE¶
dfcwci.head()
In-place drops
df2=dfcwci.copy()
df2.set_index('last_name', inplace=True)
df2.head()
df2.drop(['Ahrens'], inplace=True)
df2.head()
df2.reset_index(inplace=True)
df2.head()
The recommended way to do it is to create a new dataframe. This might be impractical if things are very large.
dfcwci=dfcwci[dfcwci.last_name!='Ahrens']
dfcwci.head(10)
LIMIT¶
dfcwci[0:3] # also see loc and iloc from the lab
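For example, the positional equivalent with iloc:
dfcwci.iloc[0:3]   # the same rows, selected by integer position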
Relationships: JOINs are Cartesian Products.¶
Finally, there are many occasions you will want to combine dataframes. We might want to see who contributed to Obama:
Simple subselect¶
dfcand.head()
obamaid=dfcand.query("last_name=='Obama'")['id'].values[0]
obamacontrib=dfcwci.query("candidate_id==%i" % obamaid)
obamacontrib.head()
Explicit INNER JOIN¶
This is the one you will want 90% of the time. It will only match keys that are present in both dataframes.
cols_wanted=['last_name_x', 'first_name_x', 'candidate_id', 'id', 'last_name_y']
dfcwci.merge(dfcand, left_on="candidate_id", right_on="id")[cols_wanted]
If the names of the columns you wanted to merge on were identical, you could simply say on="id", for example, rather than a left_on and a right_on.
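A sketch of that, renaming dfcand's id column so the keys match (the renamed frame here is just for illustration):
dfcand_renamed = dfcand.rename(columns={'id': 'candidate_id'})
dfcwci.merge(dfcand_renamed, on="candidate_id").head()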
Outer JOIN¶
left outer (contributors on candidates)¶
dfcwci.merge(dfcand, left_on="candidate_id", right_on="id", how="left")[cols_wanted]
right outer (contributors on candidates) = left outer (candidates on contributors)¶
dfcwci.merge(dfcand, left_on="candidate_id", right_on="id", how="right")[cols_wanted]
full outer¶
dfcwci.merge(dfcand, left_on="candidate_id", right_on="id", how="outer")[cols_wanted]