Key Word(s): matplotlib, seaborn, plots, pandas

# CS109A Introduction to Data Science

## Lab 5: Exploratory Data Analysis, seaborn, more Plotting

Harvard University
Fall 2019
Instructors: Pavlos Protopapas, Kevin Rader, and Chris Tanner
Material Preparation: Eleni Kaxiras.

In [1]:
#RUN THIS CELL
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)

Out[1]:
In [2]:
# import the necessary libraries
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
import time
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 200)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

import warnings
warnings.filterwarnings('ignore')
%config InlineBackend.figure_format ='retina'

In [3]:
%%javascript
IPython.OutputArea.auto_scroll_threshold = 9999;


## Learning Goals

By the end of this lab, you should be able to:

• implement the different types of plots, such as histograms and boxplots, that were mentioned in class.
• use seaborn as well as matplotlib as part of your plotting toolbox.

This lab corresponds to lectures 6 through 9 and maps to homework 3.

## 1 - Visualization Inspiration

source: nytimes.org

Notice that in “Summers Are Getting Hotter,” above, the histogram has intervals for global summer temperatures on the x-axis, designated from extremely cold to extremely hot, and their frequency on the y-axis.

That was an infographic intended for the general public. In contrast, take a look at the plots below of the same data, published in a scientific journal. They look quite different, don't they?

James Hansen, Makiko Sato, and Reto Ruedy, Perception of climate change. PNAS

## 2 - Implementing Various Types of Plots using matplotlib and seaborn

Before you start coding your visualization, you need to decide what type of visualization to use: a box plot, a histogram, a scatter plot, or something else? That will depend on the purpose of the plot (is it for inspecting your data during EDA, or for showing your results and conclusions to other people?) and on the number of variables that you want to plot.

You have a lot of tools for plotting in Python. The basic one, of course, is matplotlib. Some libraries, such as seaborn, are built on top of it; others, such as bokeh and altair, are independent of it.

In this class we will continue using matplotlib and also look into seaborn. Those two libraries are the ones you should be using for homework.

### Introduction to seaborn

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. The library provides a database of useful datasets for educational purposes that can be loaded by typing:

seaborn.load_dataset(name, cache=True, data_home=None, **kws)


For information on these datasets, see https://github.com/mwaskom/seaborn-data

#### The plotting functions in seaborn can be divided into two categories

• 'axes-level' functions, such as regplot, boxplot, kdeplot, scatterplot, distplot, which draw onto a matplotlib Axes object and accept its parameters. You can pass in that object with ax= and use it as you would in matplotlib:

# x and y stand for array-like data defined elsewhere
f, (ax1, ax2) = plt.subplots(2)
sns.regplot(x=x, y=y, ax=ax1)
sns.kdeplot(x, ax=ax2)
ax = sns.distplot(x, kde=False, bins=20)  # without ax=, draws on the current Axes and returns it

• 'figure-level' functions, such as lmplot, factorplot (renamed catplot in seaborn 0.9), jointplot, relplot, pairplot. In this case, seaborn organizes the resulting plot, which may include several Axes, in a meaningful way. That means that the functions need to have total control over the figure, so it isn't possible to plot, say, an lmplot onto a figure that already exists. Calling the function always initializes a figure and sets it up for the specific plot it's drawing. These functions return a grid object (a FacetGrid for lmplot, catplot and relplot, a JointGrid for jointplot, a PairGrid for pairplot) with its own methods for operating on the resulting plot.

To set the parameters for figure-level functions:

sns.set_context("notebook", font_scale=1, rc={"lines.linewidth": 2.5})
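The difference between the two categories can be sketched with a small, self-contained example. The data below is synthetic and the variable names are made up for illustration; the Agg backend line is only there so that the sketch also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs outside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=50)})
df["y"] = 2 * df["x"] + rng.normal(scale=0.5, size=50)

# Axes-level: we create the Axes ourselves and hand it to seaborn
fig, ax = plt.subplots(figsize=(5, 3))
returned_ax = sns.regplot(x="x", y="y", data=df, ax=ax)
returned_ax.set_title("axes-level: drawn on our own Axes")

# Figure-level: seaborn creates and manages its own figure
grid = sns.lmplot(x="x", y="y", data=df)

print(type(returned_ax).__name__)  # a matplotlib Axes type
print(type(grid).__name__)         # a seaborn grid type
```

The axes-level call hands back the same Axes that was passed in, so all the usual matplotlib methods are available on it, while the figure-level call returns a grid object that wraps a new figure.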


### The Titanic dataset

The titanic.csv file contains data for 891 passengers on the Titanic. Each row represents one person. The columns describe different attributes of the person, including whether they survived, their age, their on-board class, their sex, and the fare they paid.

In [4]:
titanic = sns.load_dataset('titanic');
titanic.info();


RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
survived       891 non-null int64
pclass         891 non-null int64
sex            891 non-null object
age            714 non-null float64
sibsp          891 non-null int64
parch          891 non-null int64
fare           891 non-null float64
embarked       889 non-null object
class          891 non-null category
who            891 non-null object
adult_male     891 non-null bool
deck           203 non-null category
embark_town    889 non-null object
alive          891 non-null object
alone          891 non-null bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.6+ KB

In [5]:
titanic.columns

Out[5]:
Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town', 'alive', 'alone'], dtype='object')
Exercise: Drop the following features:

'embarked', 'who', 'adult_male', 'embark_town', 'alive', 'alone'

In [6]:
# your code here
cols_to_drop = ['embarked', 'who', 'adult_male', 'embark_town', 'alive', 'alone']
titanic = titanic.drop(columns=cols_to_drop)
titanic

Out[6]:
survived pclass sex age sibsp parch fare class deck
0 0 3 male 22.0 1 0 7.2500 Third NaN
1 1 1 female 38.0 1 0 71.2833 First C
2 1 3 female 26.0 0 0 7.9250 Third NaN
3 1 1 female 35.0 1 0 53.1000 First C
4 0 3 male 35.0 0 0 8.0500 Third NaN
... ... ... ... ... ... ... ... ... ...
886 0 2 male 27.0 0 0 13.0000 Second NaN
887 1 1 female 19.0 0 0 30.0000 First B
888 0 3 female NaN 1 2 23.4500 Third NaN
889 1 1 male 26.0 0 0 30.0000 First C
890 0 3 male 32.0 0 0 7.7500 Third NaN

891 rows × 9 columns

Exercise: Find how many passengers have no deck information.
In [7]:
# your code here
missing_decks = titanic['deck'].isna().sum()  # True counts as 1 when summed
missing_decks

Out[7]:
688
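The same idea extends to every column at once. The following is a minimal sketch on a small stand-in DataFrame (the values are made up and are not the real Titanic data), showing the count and the share of missing entries per column.

```python
import numpy as np
import pandas as pd

# Small stand-in frame (hypothetical values, not the real Titanic data)
toy = pd.DataFrame({
    "age":  [22.0, np.nan, 26.0, 35.0],
    "deck": ["C", None, None, "B"],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

missing_per_column = toy.isna().sum()   # number of missing entries in each column
missing_share = toy.isna().mean()       # fraction of missing entries in each column

print(missing_per_column)
print(missing_share)
```

On the lab's data, the same two calls on titanic would give the per-column picture that titanic.info() reports as non-null counts.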

### Histograms

#### Plotting one variable's distribution (categorical and continuous)

The most convenient way to take a quick look at a univariate distribution in seaborn is the distplot() function. By default, this will draw a histogram and fit a kernel density estimate (KDE). Note that seaborn 0.11 and later deprecate distplot() in favor of histplot() and displot().

A histogram displays a quantitative (numerical) distribution by showing the number (or percentage) of the data values that fall in specified intervals. The intervals are on the x-axis, and the number of values falling in each interval, shown as either a count or a percentage, is represented by a bar drawn above the corresponding interval.
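To make this definition concrete, the counts and percentages that a histogram displays can be computed directly with numpy's np.histogram. The ages and bin edges below are made up for illustration.

```python
import numpy as np

# Hypothetical ages, chosen only to illustrate the binning
ages = np.array([2, 4, 8, 19, 22, 25, 31, 38, 45, 62])

# Intervals [0, 20), [20, 40), [40, 60), [60, 80]; only the last one includes its right edge
counts, edges = np.histogram(ages, bins=[0, 20, 40, 60, 80])
percent = 100 * counts / counts.sum()

for left, right, c, p in zip(edges[:-1], edges[1:], counts, percent):
    print(f"{left:>2} to {right:>2}: {c} values ({p:.0f}%)")
```

The bars of a histogram are just these counts (or percentages) drawn above their intervals.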

In [9]:
# What was the age distribution among passengers in the Titanic?
import seaborn as sns
sns.set(color_codes=True)

f, ax = plt.subplots(1,1, figsize=(8, 3));
ax = sns.distplot(titanic.age, kde=False, bins=20)

# pitfall: chaining .set() onto the call would bind ax to the return value of .set(), not to the Axes
#ax = sns.distplot(titanic.age, kde=False, bins=20).set(xlim=(0, 90));

ax.set(xlim=(0, 90));
ax.set_ylabel('counts');

In [10]:
f, ax = plt.subplots(1,1, figsize=(8, 3))
ax.hist(titanic.age, bins=20);
ax.set_xlim(0,90);
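The matplotlib call above discards what ax.hist returns. As a small sketch with synthetic values (not the Titanic ages), the return value gives access to the bin counts, the bin edges, and the bar patches, which is what makes per-bar styling possible.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs outside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic values (hypothetical, for illustration only)
rng = np.random.default_rng(1)
values = rng.uniform(0, 80, size=200)

fig, ax = plt.subplots(figsize=(8, 3))
counts, edges, patches = ax.hist(values, bins=20)

print(len(counts), len(edges), len(patches))  # bins, bin edges, bars
patches[0].set_facecolor("red")               # restyle a single bar
```

With 20 bins there are 21 edges, and the counts add up to the number of plotted values.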

Exercise (pandas trick): Count all the infants on board (age less than 3) and all the children aged 3 up to, but not including, 10.
In [11]:
# your code here
infants = len(titanic[(titanic.age < 3)])
children = len(titanic[(titanic.age >= 3) & (titanic.age < 10)])
print(f'There were {infants} infants and {children} children on board the Titanic')

There were 24 infants and 38 children on board the Titanic


Pandas trick: We want to create virtual "bins" for readability and replace ranges of values with categories.

We will do this in an ad hoc way; it can be done better. For example, in the previous plot we could set:

• (age<3) = 'infants',
• (3, …
• (18 …
 See matplotlib colors here. 
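One less ad hoc way to replace ranges of values with categories is pandas' pd.cut. The sketch below uses made-up ages, and its bin edges and labels are illustrative assumptions, not the lab's own categories.

```python
import pandas as pd

# Hypothetical ages, including one missing value
ages = pd.Series([0.8, 2.0, 3.0, 9.5, 10.0, 17.0, 18.0, 40.0, None], name="age")

# right=False closes each bin on the left: [0, 3), [3, 10), [10, 18), [18, 120)
# Bin edges and labels are illustrative choices only.
age_group = pd.cut(
    ages,
    bins=[0, 3, 10, 18, 120],
    right=False,
    labels=["infant", "child", "teen", "adult"],
)

group_counts = age_group.value_counts(sort=False)
print(group_counts)
```

Missing ages stay missing after binning, so they do not end up in any category.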
 
 
In [12]:
# set the colors
cmap = plt.get_cmap('Pastel1')
young = cmap(0.5)
middle = cmap(0.2)
older = cmap(0.8)

# get the object we will change - patches is an array with len: num of bins
fig, ax = plt.subplots()
y_values, bins, patches = ax.hist(titanic.age, 10)

for i in range(0, 1):    # bin 0
    patches[i].set_facecolor(young)
for i in range(1, 3):    # bins 1 and 2
    patches[i].set_facecolor(middle)
for i in range(3, 10):   # 7 remaining bins
    patches[i].set_facecolor(older)

ax.grid(True)
fig.show()

### Kernel Density Estimation

The kernel density estimate can be a useful tool for plotting the shape of a distribution. The bandwidth (bw) parameter of the KDE controls how tightly the estimate fits the data, much like the bin size in a histogram. It corresponds to the width of the kernels we plotted above. The default behavior tries to guess a good value using a common reference rule, but it may be helpful to try larger or smaller values.

In [13]:
sns.kdeplot(titanic.age, bw=0.6, label="bw: 0.6", shade=True, color="r");
sns.kdeplot(titanic.age, bw=2, label="bw: 2", shade=True);

Exercise: Plot the distribution of fare paid by passengers
In [14]:
# your code here
sns.kdeplot(titanic.fare, bw=0.5, label="bw: 0.5", shade=True);

#### You can mix elements of matplotlib such as Axes with seaborn elements to get the best of both worlds.

In [15]:
import seaborn as sns
sns.set(color_codes=True)

x1 = np.random.normal(size=100)
x2 = np.random.normal(size=100)

fig, ax = plt.subplots(1,2, figsize=(15,5))

# seaborn goes in first subplot
sns.set(font_scale=0.5)
sns.distplot(x1, kde=False, bins=15, ax=ax[0]);
sns.distplot(x2, kde=False, bins=15, ax=ax[0]);
ax[0].set_title('seaborn Graph Here', fontsize=14)
ax[0].set_xlabel(r'$x$', fontsize=14)
ax[0].set_ylabel(r'$count$', fontsize=14)

# matplotlib goes in second subplot
ax[1].hist(x1, alpha=0.2, bins=15, label=r'$x1$');
ax[1].hist(x2, alpha=0.5, bins=15, label=r'$x2$');
ax[1].set_xlabel(r'$x$', fontsize=14)
ax[1].set_ylabel(r'$count$', fontsize=14)
ax[1].set_title('matplotlib Graph Here', fontsize=14)
ax[1].legend(loc='best', fontsize=14);

### Introducing the heart disease dataset

More on this in the in-class exercise at the end of the notebook.

In [16]:
columns = ["age", "sex", "cp", "restbp", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]
heart_df = pd.read_csv('../data/heart_disease.csv', header=None, names=columns)
heart_df.head()

Out[16]:
age sex cp restbp chol fbs restecg thalach exang oldpeak slope ca thal num
0 63.0 1.0 1.0 145.0 233.0 1.0 2.0 150.0 0.0 2.3 3.0 0.0 6.0 0.0
1 67.0 1.0 4.0 160.0 286.0 0.0 2.0 108.0 1.0 1.5 2.0 3.0 3.0 2.0
2 67.0 1.0 4.0 120.0 229.0 0.0 2.0 129.0 1.0 2.6 2.0 2.0 7.0 1.0
3 37.0 1.0 3.0 130.0 250.0 0.0 0.0 187.0 0.0 3.5 3.0 0.0 3.0 0.0
4 41.0 0.0 2.0 130.0 204.0 0.0 2.0 172.0 0.0 1.4 1.0 0.0 3.0 0.0

### Boxplots

#### One variable

In [17]:
# seaborn
ax = sns.boxplot(x='age', data=titanic)
#ax = sns.boxplot(x=titanic['age']) # another way to write this
ax.set_ylabel(None);
ax.set_xlabel('age', fontsize=14);
ax.set_title('Distribution of age in the Titanic', fontsize=14);

#### Two variables

Exercise: Did more young people or older ones get first class tickets on the Titanic?
In [18]:
# your code here
# two variables seaborn
ax = sns.boxplot(x="class", y="age", data=titanic)

In [19]:
# two variable boxplot in pandas
titanic.boxplot('age',by='class')

Out[19]:

### Scatterplots

#### Plotting the distribution of two variables

Also called a bivariate distribution, where each observation is shown as a point with x and y values.
You can draw a scatterplot with the matplotlib plt.scatter function, or with the seaborn scatterplot() and jointplot() functions:

In [20]:
f, ax = plt.subplots(1,1, figsize=(10, 5))
sns.scatterplot(x="fare", y="age", data=titanic, ax=ax);

In [21]:
sns.jointplot(x="fare", y="age", data=titanic, s=40, edgecolor="w", linewidth=1)

Out[21]:

You may control the seaborn Figure aesthetics.

In [22]:
# matplotlib
fig, ax = plt.subplots(1,1, figsize=(10,6))
ax.scatter(heart_df['age'], heart_df['restbp'], alpha=0.8);
ax.set_xlabel('Age (yrs)', fontsize=15);  # plain text: math mode would swallow the space
ax.set_ylabel(r'Resting Blood Pressure (mmHg)', fontsize=15);
ax.set_title('Age vs. Resting Blood Pressure', fontsize=14)
plt.show();

#### Plotting the distribution of three variables

In [23]:
f, ax = plt.subplots(1,1, figsize=(10, 5))
sns.scatterplot(x="fare", y="age", hue="survived", data=titanic, ax=ax);

#### Plotting the distribution of four variables (going too far?)

Exercise: Plot the distribution of fare paid by passengers according to age, survival and sex. Use size= for the fourth variable.
In [24]:
# your code here
f, ax = plt.subplots(1,1, figsize=(10, 5))
sns.scatterplot(x="fare", y="age", hue="survived", size="sex", data=titanic, ax=ax);

### Pairplots

In [25]:
titanic.columns

Out[25]:
Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'class', 'deck'], dtype='object')

In [26]:
to_plot = ['age', 'fare', 'survived', 'deck']

In [34]:
df_to_plot = titanic.loc[:,to_plot]
sns.pairplot(df_to_plot);
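By default, pairplot builds its grid from the numeric columns only, so a categorical column such as deck would not get a panel of its own. A minimal sketch on a stand-in DataFrame (made-up values, not the real Titanic data) shows how to check in advance which of the selected columns are numeric.

```python
import numpy as np
import pandas as pd

# Stand-in frame with the same column names as to_plot (hypothetical values)
toy = pd.DataFrame({
    "age": [22.0, 38.0, np.nan],
    "fare": [7.25, 71.28, 7.92],
    "survived": [0, 1, 1],
    "deck": pd.Categorical([None, "C", None]),
})

# Columns that a numeric-only grid would use
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```

A categorical column can still contribute to such a plot, for example by passing it as hue, which is an option of pairplot.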