CS-109A Introduction to Data Science
Lecture 3: EDA and Visualization Demo¶
Harvard University
Fall 2018
Instructors: Pavlos Protopapas and Kevin Rader
Author: Rahul Dave
## RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
EDA and visualization¶
In the last lecture we focused on the machinery needed to get data into a tabular form. In this lecture we focus on the visualization part of Exploratory Data Analysis. This notebook accompanies the slides. It is strongly recommended that you watch the lecture or read the slides and then work on this notebook. The textual descriptions here are sparse, and the code will become much clearer in the context of visualization best practices.
# The %... is an IPython thing, and is not part of the Python language.
# In this case we're just telling the plotting library to draw things on
# the notebook, instead of on a separate window.
%matplotlib inline
#this line above prepares IPython notebook for working with matplotlib
# See all the "as ..." constructs? They're just aliasing the package names.
# That way we can call methods like plt.plot() instead of matplotlib.pyplot.plot().
import numpy as np # imports a fast numerical programming library
import scipy as sp #imports stats functions, amongst other things
import matplotlib as mpl # this actually imports matplotlib
import matplotlib.cm as cm #allows us easy access to colormaps
import matplotlib.pyplot as plt #sets up plotting under plt
import pandas as pd #lets us handle data as dataframes
#sets up pandas table display
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
The seaborn API is changing, and seaborn.apionly
is being deprecated. The default will soon behave like apionly
and not change the standard matplotlib color scheme and defaults. Here we choose apionly
to make sure we have to do everything explicitly.
# versions below 0.8.1
import seaborn.apionly as sns #sets up styles and gives us more plotting options
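If you are on a newer seaborn where apionly has been removed, a small fallback like the following (a sketch, assuming only that plain import seaborn is available) keeps the notebook running; recent seaborn versions no longer change the matplotlib defaults on import anyway.
# Fallback sketch: apionly is deprecated in 0.8.1 and removed in later versions
try:
    import seaborn.apionly as sns   # older seaborn
except ImportError:
    import seaborn as sns           # newer seaborn; no longer overrides matplotlib defaults on import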
Getting the mtcars dataset into shape¶
The documentation for this data is here but I have extracted some relevant parts below:
Description
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
Usage
mtcars
Format
A data frame with 32 observations on 11 variables.
[, 1] mpg Miles/(US) gallon
[, 2] cyl Number of cylinders
[, 3] disp Displacement (cu.in.)
[, 4] hp Gross horsepower
[, 5] drat Rear axle ratio
[, 6] wt Weight (1000 lbs)
[, 7] qsec 1/4 mile time
[, 8] vs V/S
[, 9] am Transmission (0 = automatic, 1 = manual)
[,10] gear Number of forward gears
[,11] carb Number of carburetors
Source
Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391–411.
dfcars=pd.read_csv("data/mtcars.csv")
dfcars.head()
There is an ugly, poorly named column right here. Let's fix that.
dfcars=dfcars.rename(columns={"Unnamed: 0":"name"})
dfcars.head()
dfcars.shape
We parse out a maker, which we shall later use to group cars.
dfcars['maker'] = dfcars.name.apply(lambda x: x.split()[0])
dfcars['maker']
This is what the dataframe looks like now:
dfcars.head()
We can construct the av_mpg series by using the "split-apply-combine" paradigm, summarizing the within-group data by a mean:
av_mpg = dfcars.groupby('maker').mpg.mean()
av_mpg
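The same split-apply-combine pattern works with several summaries at once; here is a small sketch using agg, where the count column simply shows how many cars each maker contributes.
# Split by maker, apply two summaries to mpg, combine into one table (sketch)
dfcars.groupby('maker').mpg.agg(['mean', 'count'])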
Simple EDA¶
(as provided by Chris, a previous head-TF for cs109)
He says:
I'd like to suggest a basic rubric for the early stages of exploratory data analysis in Python. This isn't universally applicable, but it does cover many patterns which recur in several data analysis contexts. It's useful to keep this rubric in mind when encountering a new dataset.
The basic workflow is as follows:
- Build a DataFrame from the data (ideally, put all data in this object)
- Clean the DataFrame. It should have the following properties:
- Each row describes a single object
- Each column describes a property of that object
- Columns are numeric whenever appropriate
- Columns contain atomic properties that cannot be further decomposed
- Explore global properties. Use histograms, scatter plots, and aggregation functions to summarize the data.
- Explore group properties. Use groupby and small multiples to compare subsets of the data.
This process transforms your data into a format which is easier to work with, gives you a basic overview of the data's properties, and likely generates several questions for you to follow up on in subsequent analysis.
So far we have built the dataframe and carried out some very minimal cleaning (renaming a column).
Exploring global properties¶
So let's focus on visualizing global properties of the data set below. For now, we shall only focus on mpg to illustrate the concepts, but you will want to do this for all the columns. It may identify interesting properties and even errors in the data.
While we do this we shall see several examples of the matplotlib
plotting experience.
We first use seaborn to set the global matplotlib plotting context. Here we set it to notebook, which makes for reasonably sized graphics.
sns.set_context("notebook")
We ask what the default color palette is. Since we imported seaborn.apionly, we get the default matplotlib color palette.
sns.color_palette()
Now we see matplotlib's default color cycle. (Viridis is matplotlib's default colormap for continuous data, which is a different thing from this categorical color cycle.)
sns.palplot(sns.color_palette());
We can choose other palettes. The default palette is for qualitative data. It creates a color cycle we use in our plots. In other words, as we plot multiple things in a plot, it will use these colors one-by-one.
output = sns.choose_colorbrewer_palette(data_type="qualitative")
sns.palplot(output)
sns.set_palette(output)
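The choose_colorbrewer_palette call above pops up an interactive widget. If you are running non-interactively, you can instead pick a ColorBrewer palette by name; the choice of 'Set2' below is just an example.
# Non-interactive alternative (sketch): pick a qualitative ColorBrewer palette by name
output = sns.color_palette("Set2")
sns.palplot(output)
sns.set_palette(output)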
Bar Charts¶
We see that Pandas series very niftily give us bar graphs.
av_mpg.plot(kind="barh")
sns.reset_defaults()
av_mpg.plot(kind="barh")
The default, which comes from matplotlib, is chart-junky (although not as much as MATLAB, or as much as it used to be in earlier versions of matplotlib). But let's clean it up.
plt.figure(figsize=(4, 6))
ax = plt.gca()
av_mpg.plot(kind="barh")
plt.grid(axis = 'x', color ='white', linestyle='-')
ax.tick_params(axis='both', which='both',length=0)
sns.despine()
And more....
with sns.plotting_context("poster"):
ax = plt.gca()
av_mpg.plot(kind="barh")
plt.grid(axis = 'x', color ='white', linestyle='-')
ax.tick_params(axis='both', which='both',length=0)
sns.despine(left=True, bottom=True)
It's important to understand what is going on under the hood. Try to write sns.despine on your own by looking up the matplotlib documentation.
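As a hint, here is a minimal sketch (not seaborn's actual implementation) of what a despine-like helper could look like, using matplotlib's spines API directly; the real sns.despine also handles offsets, trimming, and multiple axes.
# A hand-rolled, stripped-down despine (sketch only)
def my_despine(ax=None, top=True, right=True, left=False, bottom=False):
    ax = ax or plt.gca()
    for side, hide in [("top", top), ("right", right), ("left", left), ("bottom", bottom)]:
        if hide:
            ax.spines[side].set_visible(False)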
plt.figure(figsize=(3, 8))
ax = plt.gca()
av_mpg2 = av_mpg.sort_values()
makes = av_mpg2.index
speeds = av_mpg2.values
nums = np.arange(len(av_mpg2))
plt.barh(nums, speeds)
for p, c, ch in zip(nums, makes, speeds):
plt.annotate(str(ch), xy=(ch + 1, p), va='center')
ticks = plt.yticks(nums, makes)
xt = plt.xticks()[0]
plt.xticks(xt, [' '] * len(xt))
plt.grid(axis = 'x', color ='white', linestyle='-')
ax.tick_params(axis='both', which='both',length=0)
sns.despine(left=True, bottom=True)
Histograms¶
Numerical data leads to distributions, and distributions to histograms. Here is the Pandas default:
dfcars.mpg.hist()
plt.xlabel("mpg");
And matplotlib:
plt.hist(dfcars.mpg.values);
Let's style it a bit with seaborn. The with syntax followed by indentation is a Python construct called a "context manager". Here it sets up a white axes style only for the code that is inside the context (i.e., indented).
with sns.axes_style("white"):
dfcars.mpg.hist(alpha=0.4)
plt.xlabel("mpg");
sns.despine()
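To see what the context manager is saving us from, here is the same plot done by changing the style globally and then resetting it by hand (a sketch):
# Without the context manager (sketch): set the style globally, plot, then reset yourself
sns.set_style("white")
dfcars.mpg.hist(alpha=0.4)
plt.xlabel("mpg")
sns.despine()
plt.show()
sns.reset_defaults()  # back to matplotlib defaults; the context manager restores the previous style for you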
What if I wanted to get the benefit of the default seaborn style and colors?
sns.set()
dfcars.mpg.hist()
plt.xlabel("mpg");
sns.despine()
I can use reset_defaults to go back to the matplotlib defaults.
sns.reset_defaults()
ax = plt.gca()
dfcars.mpg.hist()
plt.xlabel("mpg");
plt.title("Miles per Gallon")
ax.tick_params(axis='both', which='both',length=0)
sns.despine(bottom=True, left=True)
Next, let us exercise some of the most commonly used matplotlib plotting routines.
Here we use a white grid. We also illustrate how a series can be converted to a numpy array with the values attribute (dfcars.mpg.values) and how x and y limits can be set (the difference in the way xlim and ylim are called is deliberate and illustrates two ways you can set the limits). In the hist function you can also change the number of bins. We also show how to label a plot and obtain a legend from it. A vertical line is drawn at the mean.
with sns.axes_style("whitegrid"):
plt.hist(dfcars.mpg.values, bins=20)
plt.xlim(5, 40)
plt.ylim([0, 8])
plt.axvline(dfcars.mpg.mean(), 0, 0.75, color='r', label='Mean')
plt.xlabel("mpg")
plt.ylabel("Counts")
plt.title("Cars Miles per gallon Histogram")
plt.legend()
One can set bins using a list (or a range), and also give the histogram a legend label (not recommended, but shown here for illustration). We also illustrate how to capture the color used and use it again.
with sns.axes_style("whitegrid"):
color = sns.color_palette()[0]
plt.hist(dfcars.mpg.values, bins=range(5, 40, 3), label="histogram", color=color)
plt.xlim(5, 40)
plt.ylim([0, 8])
plt.axvline(dfcars.mpg.mean(), 0, 1.0, color=color, label='Mean')
plt.xlabel("mpg")
plt.ylabel("Counts")
plt.title("Cars Miles per gallon Histogram")
plt.legend()
Seaborn very handily provides kernel density estimates (KDEs), which try to infer a probability distribution from the data. This is more useful when you have lots of data.
sns.set()
sns.kdeplot(dfcars.mpg);
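Under the hood, a KDE places a small kernel (here a Gaussian) on each observation and sums them up. Here is a rough sketch of the same estimate using scipy's gaussian_kde, which chooses a bandwidth automatically:
# A KDE by hand (sketch): scipy puts a Gaussian bump on each data point and sums them
from scipy.stats import gaussian_kde
kde = gaussian_kde(dfcars.mpg.values)
grid = np.linspace(dfcars.mpg.min() - 5, dfcars.mpg.max() + 5, 200)
plt.plot(grid, kde(grid))
plt.xlabel("mpg");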
And a histogram and a kdeplot together:
sns.distplot(dfcars.mpg);
Plotting features against other features¶
Sometimes we want to see co-variation amongst our columns. A scatter-plot does this for us.
plt.scatter(dfcars.wt, dfcars.mpg);
We can also use plot
without lines:
plt.plot(dfcars.wt, dfcars.mpg, marker='o', linestyle='None', color='k');
plt.plot(dfcars.wt, dfcars.mpg, 'o')
plt.show()
But what if we want to save our figure into a file? The extension tells matplotlib the format to save in, and note that savefig needs to be in the same cell as the plotting commands. Go look at the files.
plt.plot(dfcars.wt, dfcars.mpg, 'o', markersize=4, alpha=0.5)
plt.savefig('foo1.pdf')
plt.savefig('foo2.png', bbox_inches='tight') #less whitespace around image
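If you are unsure which file formats your matplotlib backend can write, you can ask it directly (a sketch):
# Dictionary mapping file extensions to descriptions of the formats the backend can save (sketch)
plt.gcf().canvas.get_supported_filetypes()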
Trend¶
The correlation that we saw might suggest a trend. We can capture it with a "regression". We'll learn more about regressions soon, but here we show a quadratic fit with a 1-standard-deviation band to illustrate the graphics side of things. Also see sns.regplot.
x = dfcars.wt
y = dfcars.mpg
params = np.polyfit(x, y, 2)
xp = np.linspace(x.min(), x.max(), 20)
yp = np.polyval(params, xp)
plt.plot(xp, yp, 'k', alpha=0.8, linewidth=1)
plt.plot(dfcars.wt, dfcars.mpg, 'o', markersize=4, alpha=0.5)
sig = np.std(y - np.polyval(params, x))
plt.fill_between(xp, yp - sig, yp + sig,
color='k', alpha=0.2)
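For comparison, seaborn's regplot produces a similar quadratic fit in one call; note that its shaded band is a bootstrapped confidence interval for the fit rather than the 1-standard-deviation band drawn above (a sketch):
# A similar plot via seaborn (sketch): order=2 asks for a quadratic fit
sns.regplot(x='wt', y='mpg', data=dfcars, order=2, scatter_kws={'alpha': 0.5, 's': 15});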
Group Properties¶
Such "co-variational" plots, and for that matter, even single-variable plots, are even more interesting when we look at them conditional upon the value of a categorical variable.
Such conditionality is behind the notion of grouping, where we group our data by various values of categorical variables, for example, whether our cars have an automatic transmission or not.
Grouping of one outcome variable¶
This notion of grouping based on combinations of factors is used by seaborn to make various easy-to-see exploratory visualizations for us. We shall first see how to make these plots quick-and-dirty using seaborn, and then see how to make them ourselves.
First, we make a boxplot of mpg, grouped by transmission style:
sns.boxplot(x='am', y='mpg', data=dfcars);
The box shows the quartiles of the distribution. This can also be done using a violin plot, which uses a KDE to show the distribution itself:
sns.violinplot(x='am', y='mpg', data=dfcars);
The manual cars surveyed have a wider range of mpg and higher fuel efficiency (a difference on the order of one standard deviation), and we may want to investigate whether this is a selection effect or whether there is some true impact of the transmission type. Further subgrouping the data is one way to do this.
Split violin plots let us condition on more than one categorical variable at a time:
sns.violinplot(x='cyl', y='mpg', hue='am', order=[4, 6, 8], data=dfcars, split=True);
One can see that the difference in mpg between 6- and 8-cylinder cars is starker for the automatic transmissions, and that the large-range effect among the manuals is coming almost entirely through the 4-cylinder cars. What about the better mpg for manuals? Mostly in the 4 and 6 cylinders. But we ought to see how representative these subgroups are in our sample. We'll show how to do this soon, but a cross-tabulation, combined with the graph above, gives us some idea that the poor miles per gallon for automatic transmissions may be coming from the preponderance of 8-cylinder automatics in our dataset.
pd.crosstab(dfcars.am, dfcars.cyl)
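To see the imbalance as proportions rather than raw counts, crosstab can normalize within each row (a sketch):
# Fraction of each transmission type that has 4, 6, or 8 cylinders (sketch)
pd.crosstab(dfcars.am, dfcars.cyl, normalize='index')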
Faceting for general grouping¶
Seaborn provides a nice construct: the FacetGrid. You decide what variables to facet over, and then decide the kind of plot you want. Here we want hue to be am, and the different columns in the plot grid to be the number of cylinders. We then ask for a scatter plot of mpg against wt in each facet.
Such plots are often called small multiple plots. They repeat the same plot based on categories, making sure that all plotting parameters are the same so that we have direct comparability.
g = sns.FacetGrid(dfcars, col="cyl", hue="am", palette="Set1")
g.map(plt.scatter, "mpg", "wt", alpha=0.5, s=10);
We can see that the "regression-like" effect is cleanest for the manual-transmission cars among the 4-cylinder group.
SPLOM, or Scatter Plot Matrix¶
We'd get tired if we had to do this on a two-by-two basis for every pair of continuously co-varying features. The PairGrid, colorable by transmission type, allows us to do this comparison for 5 continuous features here, with the diagonal being a kernel density estimate.
g = sns.PairGrid(dfcars, vars=['mpg', 'hp', 'wt', 'qsec', 'disp'], hue="am")
g.map_diag(sns.kdeplot)
g.map_offdiag(plt.scatter, s=15)
In many places, for example mpg vs disp, there seem to be two separate trends for the different transmissions. This will (as we shall see later in this course) suggest to us the addition of a transmission term as an indicator variable in regressions of mpg against various features. This changes the intercept of the regression. But the trends have different slopes as well, which suggests that disp may interact with am, the transmission indicator, to create a varying slope as well.
Correlation¶
The SPLOM seems to suggest correlations. Well, let's calculate them:
dfcars[['mpg', 'wt', 'hp', 'qsec', 'disp']].corr()
Since correlations range from -1 to 1 through 0, a diverging palette is probably our best bet.
dpal = sns.choose_colorbrewer_palette('diverging', as_cmap=True)
We use matplotlib's matshow for the correlation plot. You'll end up doing plots like this both for EDA and to see misclassification from your machine learning algorithms. In other words, EDA is useful even at the analysis stage.
plt.matshow(dfcars[['mpg', 'wt', 'hp', 'qsec', 'disp']].corr(), cmap=dpal)
ax = plt.gca()
ax.tick_params(axis='both', which='both',length=0);
plt.title("Correlation Matrix")
plt.xticks(range(5), ['mpg', 'wt', 'hp', 'qsec', 'disp'])
plt.yticks(range(5), ['mpg', 'wt', 'hp', 'qsec', 'disp']);
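seaborn's heatmap gives a similar picture with the correlation values annotated in each cell (a sketch, reusing the diverging palette chosen above):
# Same correlation matrix via seaborn (sketch), with values written in the cells
corr = dfcars[['mpg', 'wt', 'hp', 'qsec', 'disp']].corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap=dpal, vmin=-1, vmax=1);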
KDE plots and sequential palettes.¶
Here we make a KDE plot of a multivariate normal distribution. Since a probability density is strictly positive, with values near 0 not being so interesting, a sequential palette is our ticket. Seaborn will by default provide such a palette for KDE plots, but you can use your own!
mean, cov = [0, 1], [(1, .5), (.5, 1)]
data = np.random.multivariate_normal(mean, cov, 1000)
df = pd.DataFrame(data, columns=["x", "y"])
df.head()
seqpal = sns.choose_colorbrewer_palette("sequential", as_cmap=True)
sns.kdeplot(df.x, df.y, cmap=seqpal, shade=True);
Matplotlib and multiple plots: Small Multiples¶
There are many cases where we want to see plots side by side. The SPLOMs and facet grids that seaborn makes for us are an example. But we should know how to do this on our own.
Here is a simple example of a plot with one column and 3 rows. It illustrates one way of doing this.
fig = plt.figure(figsize=(5, 9))
ax1 = fig.add_subplot(311)
ax1.plot([1, 2, 3], [1, 2, 3])
ax1.set_xticklabels([])
ax1.set_ylim([1.0, 3.0])
ax2 = fig.add_subplot(312)
ax2.scatter([1, 2, 3], [1, 2, 3])
ax2.set_xticklabels([])
ax2.set_ylim([1.0, 3.0])
ax3 = fig.add_subplot(313)
ax3.plot([1, 2, 3], [1, 2, 3])
ax3.set_ylim([1.0, 3.0])
fig.tight_layout()
Small multiples, yourself¶
Here is another way, which to me is far clearer than the add_subplot way. It basically creates an array of plots and zips this array up with the various data grouped by categories.
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 12))
print(axes)
print(axes.ravel())
carbs = ['==1', '==2', '==3', '>=4']
bins = np.arange(10, 30, 2)
for ax, carb in zip(axes.ravel(), carbs):
data = dfcars.query("carb%s" % carb)
print(data.shape)
#ax.plot(data.wt, data.mpg, 'o', markersize=10, alpha=0.5)
    ax.hist(data.mpg, bins=bins, histtype='stepfilled', density=True, color='r', alpha=.3)
ax.annotate("carb"+str(carb), xy=(12, 0.35), fontsize=14)
#ax.set_yticks([])
ax.set_ylim((0,0.4))
ax.set_xlabel('mpg');
The Interest in Science Problem from lecture slides¶
Dataframe construction:
science = {
'interest': ['Excited', 'Kind of interested', 'OK', 'Not great', 'Bored'],
'before': [19, 25, 40, 5, 11],
'after': [38, 30, 14, 6, 12]
}
dfscience = pd.DataFrame.from_dict(science).set_index("interest")[['before', 'after']]
dfscience
Multiple Pie Charts¶
fig, axs = plt.subplots(1,2, figsize = (8.5,4))
dfscience.before.plot(kind="pie", ax=axs[0], labels=None);
axs[0].legend(loc="upper left", ncol=5, labels=dfscience.index)
dfscience.after.plot(kind="pie", ax=axs[1], labels=None);
Before and after bar charts¶
dfscience.plot(kind="bar");
Stacked Charts¶
dfscience.plot(kind="barh", stacked=True);
This is hard to read. We want to compare before and after easily. Sometimes the solution is a transpose!
dfscience.transpose()
We have to play some games to get the ordering right...
dfst = dfscience.transpose()
dfst.iloc[[1, 0],:]
dfscience.transpose().loc[['after', 'before'], :].plot(kind="barh", stacked=True)
plt.legend(loc=2, ncol=5);
Now we see that the Excited and Kind of interested segments have clearly grown. Try to improve this plot; it still has too much chartjunk.
Difference Bar chart¶
Sometimes a small data transformation makes the effect blindingly obvious. Here we just make a bar chart of the change!
(dfscience.after - dfscience.before).plot(kind="barh");
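Sorting the differences makes the ranking of changes even easier to read (a sketch):
# Same difference chart, sorted (sketch)
(dfscience.after - dfscience.before).sort_values().plot(kind="barh");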
Bacteria Dataset¶
For those who want to play...
antibiotics = [
{ "bacteria": "Mycobacterium tuberculosis", "penicillin": 800, "streptomycin": 5, "neomycin": 2, "gram": "negative" },
{ "bacteria": "Salmonella schottmuelleri", "penicillin": 10, "streptomycin": 0.8, "neomycin": 0.09, "gram": "negative" },
{ "bacteria": "Proteus vulgaris", "penicillin": 3, "streptomycin": 0.1, "neomycin": 0.1, "gram": "negative" },
{ "bacteria": "Klebsiella pneumoniae", "penicillin": 850, "streptomycin": 1.2, "neomycin": 1, "gram": "negative" },
{ "bacteria": "Brucella abortus", "penicillin": 1, "streptomycin": 2, "neomycin": 0.02, "gram": "negative" },
{ "bacteria": "Pseudomonas aeruginosa", "penicillin": 850, "streptomycin": 2, "neomycin": 0.4, "gram": "negative" },
{ "bacteria": "Escherichia coli", "penicillin": 100, "streptomycin": 0.4, "neomycin": 0.1, "gram": "negative" },
{ "bacteria": "Salmonella (Eberthella) typhosa", "penicillin": 1, "streptomycin": 0.4, "neomycin": 0.008, "gram": "negative" },
{ "bacteria": "Aerobacter aerogenes", "penicillin": 870, "streptomycin": 1, "neomycin": 1.6, "gram": "negative" },
{ "bacteria": "Brucella antracis", "penicillin": 0.001, "streptomycin": 0.01, "neomycin": 0.007, "gram": "positive" },
{ "bacteria": "Streptococcus fecalis", "penicillin": 1, "streptomycin": 1, "neomycin": 0.1, "gram": "positive" },
{ "bacteria": "Staphylococcus aureus", "penicillin": 0.03, "streptomycin": 0.03, "neomycin": 0.001, "gram": "positive" },
{ "bacteria": "Staphylococcus albus", "penicillin": 0.007, "streptomycin": 0.1, "neomycin": 0.001, "gram": "positive" },
{ "bacteria": "Streptococcus hemolyticus", "penicillin": 0.001, "streptomycin": 14, "neomycin": 10, "gram": "positive" },
{ "bacteria": "Streptococcus viridans", "penicillin": 0.005, "streptomycin": 10, "neomycin": 40, "gram": "positive" },
{ "bacteria": "Diplococcus pneumoniae", "penicillin": 0.005, "streptomycin": 11, "neomycin": 10, "gram": "positive" }
]
dfabio = pd.DataFrame.from_records(antibiotics)
dfabio