Key Word(s): matplotlib, seaborn, plots, pandas

# CS109A Introduction to Data Science

## Lab 5: Exploratory Data Analysis, `seaborn`, more Plotting

**Harvard University**

**Fall 2019**

**Instructors:** Pavlos Protopapas, Kevin Rader, and Chris Tanner

**Material Preparation:** Eleni Kaxiras.

```
#RUN THIS CELL
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
```

```
# import the necessary libraries
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
import time
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 200)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore')
%config InlineBackend.figure_format ='retina'
```

```
%%javascript
IPython.OutputArea.auto_scroll_threshold = 9999;
```

## Learning Goals

By the end of this lab, you should:

- know how to implement the different types of plots, such as histograms and boxplots, that were mentioned in class.
- have `seaborn` as well as `matplotlib` in your plotting toolbox.

**This lab corresponds to lectures 6 through 9 and maps to homework 3.**

## 1 - Visualization Inspiration

Notice that in “Summers Are Getting Hotter,” above, the histogram has intervals for global summer temperatures on the x-axis, designated from extremely cold to extremely hot, and their frequency on the y-axis.

That was an **infographic** intended for the general public. In contrast, take a look at the plots below of the same data published in a **scientific journal**. They look quite different, don't they?

*James Hansen, Makiko Sato, and Reto Ruedy*, Perception of climate change. PNAS

## 2 - Implementing Various Types of Plots using `matplotlib` and `seaborn`.

Before you start coding your visualization, you need to decide what **type** of visualization to use: a box plot, a histogram, a scatter plot, or something else? That will depend on the purpose of the plot (is it for inspecting your data (EDA), or for showing your results/conclusions to people?) and on the number of variables that you want to plot.

You have a lot of tools for plotting in Python. The basic one, of course, is `matplotlib`, and there are other libraries such as `seaborn` (which is built on top of `matplotlib`), `bokeh`, and `altair`.

In this class we will continue using `matplotlib` and also look into `seaborn`. Those two libraries are the ones you should be using for homework.

### Introduction to `seaborn`

`Seaborn` is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. The library provides a collection of useful datasets for educational purposes that can be loaded by typing:

```
seaborn.load_dataset(name, cache=True, data_home=None, **kws)
```

For information on what these datasets are, see https://github.com/mwaskom/seaborn-data

#### The plotting functions in `seaborn` can be divided into two categories

- **'axes-level'** functions, such as `regplot`, `boxplot`, `kdeplot`, `scatterplot`, `distplot`, which can connect with the `matplotlib` Axes object and its parameters. You can use that object as you would in `matplotlib`: `f, (ax1, ax2) = plt.subplots(2)`, then `sns.regplot(x=x, y=y, ax=ax1)` and `sns.kdeplot(x, ax=ax2)`, or `ax1 = sns.distplot(x, kde=False, bins=20)`.
- **'figure-level'** functions, such as `lmplot`, `factorplot` (renamed `catplot` in `seaborn` 0.9), `jointplot`, `relplot`. In this case, `seaborn` organizes the resulting plot, which may include several Axes, in a meaningful way. That means that the functions need to have total control over the figure, so it isn't possible to plot, say, an `lmplot` onto a figure that already exists. Calling the function always initializes a figure and sets it up for the specific plot it's drawing. These functions return a grid object, such as `FacetGrid` (or `JointGrid` for `jointplot`), with its own methods for operating on the resulting plot.

To set the parameters for figure-level functions:

```
sns.set_context("notebook", font_scale=1, rc={"lines.linewidth": 2.5})
```

### The Titanic dataset

The `titanic` dataset that ships with `seaborn` contains data for 891 passengers on the Titanic. Each row represents one person. The columns describe different attributes of the person, including whether they survived, their age, their on-board class, their sex, and the fare they paid.

```
titanic = sns.load_dataset('titanic');
titanic.info();
```

```
titanic.columns
```

**Exercise: Drop the following features:**

'embarked', 'who', 'adult_male', 'embark_town', 'alive', 'alone'

```
# your code here
columns = ['embarked', 'who', 'adult_male', 'embark_town', 'alive', 'alone']
titanic = titanic.drop(columns=columns)
titanic
```

**Exercise: Find how many passengers have no deck information.**

```
# your code here
# isna() marks missing values as True; summing counts them
missing_decks = titanic['deck'].isna().sum()
missing_decks
```
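As a hedged aside, the same kind of count can be obtained for every column at once with `isna().sum()`. The small DataFrame below is made-up data standing in for `titanic`, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# toy stand-in for the titanic DataFrame (hypothetical values)
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 4.0],
    "deck": ["C", None, None, "E"],
    "fare": [7.25, 71.28, 8.05, 16.70],
})

# isna() gives a boolean frame; summing counts the True values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# a single column works the same way
missing_decks = int(toy["deck"].isna().sum())
```

Looking at all columns together is often a useful first EDA step, because it shows which features may need to be dropped or imputed.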

### Histograms

#### Plotting one variable's distribution (categorical and continuous)

The most convenient way to take a quick look at a univariate distribution in `seaborn` is the `distplot()` function. By default, this will draw a histogram and fit a kernel density estimate (KDE). (In `seaborn` 0.11 and later, `distplot()` is deprecated in favor of `histplot()` and `displot()`.)

A histogram displays a quantitative (numerical) distribution by showing the number (or percentage) of the data values that fall in specified intervals. The intervals are on the x-axis and the number of values falling in each interval, shown as either a number or percentage, are represented by bars drawn above the corresponding intervals.

```
# What was the age distribution among passengers in the Titanic?
sns.set(color_codes=True)
f, ax = plt.subplots(1,1, figsize=(8, 3));
ax = sns.distplot(titanic.age, kde=False, bins=20, ax=ax)
# Do not chain .set() onto the plotting call: Axes.set() does not return the Axes,
# so `ax` would no longer be usable in the lines below.
#ax = sns.distplot(titanic.age, kde=False, bins=20).set(xlim=(0, 90));
ax.set(xlim=(0, 90));
ax.set_ylabel('counts');
```

```
# the same histogram in plain matplotlib; drop the missing ages first
f, ax = plt.subplots(1,1, figsize=(8, 3))
ax.hist(titanic.age.dropna(), bins=20);
ax.set_xlim(0,90);
```
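To make the definition of a histogram concrete, here is a small sketch that computes the bin counts directly with `np.histogram`, once as raw counts and once as percentages. The ages are made-up values used only for illustration.

```python
import numpy as np

# hypothetical ages, not taken from the Titanic data
ages = np.array([1, 2, 4, 8, 15, 22, 22, 29, 35, 38, 41, 54, 63, 71])

# four equal-width intervals covering 0 to 80 years
counts, edges = np.histogram(ages, bins=4, range=(0, 80))
percentages = 100 * counts / counts.sum()

# each interval includes its left edge; the last one also includes its right edge
for left, right, n, pct in zip(edges[:-1], edges[1:], counts, percentages):
    print(f"{left:.0f} to {right:.0f}: {n} values ({pct:.1f}%)")
```

The bars of a histogram are simply these counts (or percentages) drawn above the corresponding intervals.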

**Exercise (pandas trick): Count all the infants on board (age less than 3) and all the children ages 5-10.**

```
# your code here
infants = len(titanic[(titanic.age < 3)])
# children aged 5 to 10 (both ends included), to match the exercise statement
children = len(titanic[(titanic.age >= 5) & (titanic.age <= 10)])
print(f'There were {infants} infants and {children} children on board the Titanic')
```

**Pandas trick:** We want to create virtual "bins" for readability and replace ranges of values with categories.

We will do this in an ad hoc way; **it can be done better**. For example, in the previous plot we could set:

`(age<3) = 'infants'`, `(3`, `(18`

See matplotlib colors here.
```
# set the colors
cmap = plt.get_cmap('Pastel1')
young = cmap(0.5)
middle = cmap(0.2)
older = cmap(0.8)
# get the object we will change - patches is an array with len: num of bins
fig, ax = plt.subplots()
y_values, bins, patches = ax.hist(titanic.age.dropna(), 10)
for patch in patches[0:1]:    # bin 0
    patch.set_facecolor(young)
for patch in patches[1:3]:    # bins 1 and 2
    patch.set_facecolor(middle)
for patch in patches[3:10]:   # 7 remaining bins
    patch.set_facecolor(older)
ax.grid(True)
plt.show()
```
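One way the ad hoc binning could be done more systematically is with `pd.cut`, which maps numeric ranges to labeled categories. The bin edges and the labels other than 'infants' below are assumptions chosen for illustration, not the categories used in the lab, and the ages are made-up values.

```python
import numpy as np
import pandas as pd

# hypothetical ages, including one missing value
ages = pd.Series([0.8, 2.0, 5.0, 17.0, 30.0, 64.0, np.nan])

# right=False makes each bin include its left edge: [0, 3), [3, 18), [18, 120)
age_group = pd.cut(
    ages,
    bins=[0, 3, 18, 120],
    labels=["infants", "children", "adults"],
    right=False,
)
print(age_group.value_counts())
```

The result is a categorical Series, so missing ages stay missing and the labels keep their order when used on a plot axis.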

#### Kernel Density Estimation

The kernel density estimate can be a useful tool for plotting the shape of a distribution. The **bandwidth (bw)** parameter of the KDE controls how tightly the estimation is fit to the data, much like the bin size in a histogram. It corresponds to the width of the individual kernels that are summed to form the estimate. The default behavior tries to guess a good value using a common reference rule, but it may be helpful to try larger or smaller values.

```
# note: seaborn 0.11+ renames these arguments (bw -> bw_method/bw_adjust, shade -> fill)
sns.kdeplot(titanic.age, bw=0.6, label="bw: 0.6", shade=True, color="r");
sns.kdeplot(titanic.age, bw=2, label="bw: 2", shade=True);
```

**Exercise: Plot the distribution of fare paid by passengers**

```
# your code here
sns.kdeplot(titanic.fare, bw=0.5, label="bw: 0.5", shade=True);
```
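To see what the bandwidth does, the following sketch builds a Gaussian KDE by hand with `numpy`. It is an illustration under stated assumptions rather than what `seaborn` runs internally: the helper `gaussian_kde_curve` is a hypothetical name, the data are synthetic, and `bandwidth` here is the standard deviation of each kernel, which may differ from how a given `seaborn` version interprets its own bandwidth argument.

```python
import numpy as np

def gaussian_kde_curve(data, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth`, one centered on each data point."""
    z = (grid[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

rng = np.random.default_rng(0)
data = rng.normal(size=200)          # synthetic sample
grid = np.linspace(-6, 6, 601)
dx = grid[1] - grid[0]

narrow = gaussian_kde_curve(data, grid, bandwidth=0.1)
wide = gaussian_kde_curve(data, grid, bandwidth=1.0)

# both curves integrate to roughly 1, but the narrow one is much wigglier
for name, curve in [("narrow", narrow), ("wide", wide)]:
    area = curve.sum() * dx
    wiggle = np.abs(np.diff(curve)).sum()
    print(f"{name}: area={area:.3f}, total variation={wiggle:.3f}")
```

A small bandwidth follows individual data points closely (like very narrow histogram bins), while a large one smooths them into a single broad bump.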

#### You can mix elements of `matplotlib` such as Axes with `seaborn` elements for the best of both worlds.

```
sns.set(color_codes=True)
x1 = np.random.normal(size=100)
x2 = np.random.normal(size=100)
fig, ax = plt.subplots(1,2, figsize=(15,5))
# seaborn goes in first subplot
sns.set(font_scale=0.5)
sns.distplot(x1, kde=False, bins=15, ax=ax[0]);
sns.distplot(x2, kde=False, bins=15, ax=ax[0]);
ax[0].set_title('seaborn Graph Here', fontsize=14)
ax[0].set_xlabel(r'$x$', fontsize=14)
ax[0].set_ylabel(r'$count$', fontsize=14)
# matplotlib goes in second subplot
ax[1].hist(x1, alpha=0.2, bins=15, label=r'$x1$');
ax[1].hist(x2, alpha=0.5, bins=15, label=r'$x2$');
ax[1].set_xlabel(r'$x$', fontsize=14)
ax[1].set_ylabel(r'$count$', fontsize=14)
ax[1].set_title('matplotlib Graph Here', fontsize=14)
ax[1].legend(loc='best', fontsize=14);
```

#### Introducing the heart disease dataset.

More on this in the in-class exercise at the end of the notebook.

```
columns = ["age", "sex", "cp", "restbp", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]
heart_df = pd.read_csv('../data/heart_disease.csv', header=None, names=columns)
heart_df.head()
```

```
# seaborn
ax = sns.boxplot(x='age', data=titanic)
#ax = sns.boxplot(x=titanic['age']) # another way to write this
ax.set_ylabel(None);
ax.set_xlabel('age', fontsize=14);
ax.set_title('Distribution of age in the Titanic', fontsize=14);
```

#### Two variables

**Exercise: Did more young people or older ones get first class tickets on the Titanic?**

```
# your code here
# two variables seaborn
ax = sns.boxplot(x='class', y='age', data=titanic)
```

```
# two variable boxplot in pandas
titanic.boxplot('age',by='class')
```
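A box plot is drawn from a handful of summary statistics, so the numbers behind a grouped box plot can be inspected directly with `groupby`. The sketch below uses a small made-up DataFrame in place of `titanic`; the values are hypothetical.

```python
import pandas as pd

# toy stand-in for the titanic DataFrame (hypothetical values)
toy = pd.DataFrame({
    "class": ["First", "First", "First", "Third", "Third", "Third"],
    "age":   [38.0, 54.0, 46.0, 22.0, 4.0, 28.0],
})

# quartiles per class: the two box edges (0.25, 0.75) and the middle line (0.5)
summary = toy.groupby("class")["age"].quantile([0.25, 0.5, 0.75]).unstack()
print(summary)
```

Comparing the medians per class answers the exercise question numerically, while the box plot shows the same comparison visually.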

```
f, ax = plt.subplots(1,1, figsize=(10, 5))
sns.scatterplot(x="fare", y="age", data=titanic, ax=ax);
```

```
# pass x and y as keywords; newer seaborn versions no longer accept them positionally
sns.jointplot(x="fare", y="age", data=titanic, s=40, edgecolor="w", linewidth=1)
```

You may control the `seaborn` Figure aesthetics.

```
# matplotlib
fig, ax = plt.subplots(1,1, figsize=(10,6))
ax.scatter(heart_df['age'], heart_df['restbp'], alpha=0.8);
# plain text label: inside $...$ the spaces would be dropped by the math renderer
ax.set_xlabel('Age (yrs)', fontsize=15);
ax.set_ylabel('Resting Blood Pressure (mmHg)', fontsize=15);
ax.set_title('Age vs. Resting Blood Pressure', fontsize=14)
plt.show();
```

#### Plotting the distribution of three variables

```
f, ax = plt.subplots(1,1, figsize=(10, 5))
sns.scatterplot(x="fare", y="age", hue="survived", data=titanic, ax=ax);
```
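For comparison, here is a rough sketch of how the effect of `hue=` could be reproduced in plain `matplotlib`, by drawing one scatter per category. The DataFrame is made-up data standing in for `titanic`.

```python
import pandas as pd
import matplotlib.pyplot as plt

# toy stand-in for the titanic DataFrame (hypothetical values)
toy = pd.DataFrame({
    "fare": [7.3, 71.3, 8.1, 53.1, 8.5, 26.0],
    "age": [22, 38, 26, 35, 54, 4],
    "survived": [0, 1, 1, 1, 0, 1],
})

fig, ax = plt.subplots(figsize=(6, 3))
# one scatter call per category, so each group gets its own color and legend entry
for value, group in toy.groupby("survived"):
    ax.scatter(group["fare"], group["age"], label=f"survived = {value}")
ax.set_xlabel("fare")
ax.set_ylabel("age")
ax.legend()
```

`seaborn` does this grouping, coloring, and legend building in a single call, which is the main convenience of `hue=`.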

#### Plotting the distribution of four variables (going too far?)

**Exercise: Plot the distribution of fare paid by passengers according to age, survival and sex.** Use `size=` for the fourth variable.

```
# your code here
f, ax = plt.subplots(1,1, figsize=(10, 5))
sns.scatterplot(x="fare", y="age", hue="survived", size="sex", data=titanic, ax=ax);
```

### Pairplots

```
titanic.columns
```

```
to_plot = ['age', 'fare', 'survived', 'deck']
```

```
df_to_plot = titanic.loc[:,to_plot]
sns.pairplot(df_to_plot);
```