Key Word(s): pandas
CS109A Introduction to Data Science
Lab 2: EDA with Pandas (+seaborn)¶
Harvard University
Fall 2021
Instructors: Pavlos Protopapas and Natesh Pillai
Authors: Natesh Pillai
## RUN THIS CELL TO GET THE RIGHT FORMATTING
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
In this lecture we will look at tools for plotting using both matplotlib and seaborn.
Load data¶
The file quartets.csv contains 4 different tiny datasets that we will use to quickly understand the value of plotting.
quartets = pd.read_csv('quartets.csv', index_col=0)
Exploration¶
quartets.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 44 entries, 1 to 11
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   x        44 non-null     int64
 1   y        44 non-null     float64
 2   quartet  44 non-null     object
dtypes: float64(1), int64(1), object(1)
memory usage: 1.4+ KB
We see there are 44 entries, two numerical columns x and y and one column to potentially identify every quartet dataset.
What does this dataframe look like?
quartets.head()
| | x | y | quartet |
---|---|---|---|
1 | 10 | 8.04 | I |
2 | 8 | 6.95 | I |
3 | 13 | 7.58 | I |
4 | 9 | 8.81 | I |
5 | 11 | 8.33 | I |
What do random samples look like?
quartets.sample(5)
| | x | y | quartet |
---|---|---|---|
8 | 19 | 12.50 | IV |
3 | 13 | 7.58 | I |
3 | 13 | 12.74 | III |
5 | 11 | 8.33 | I |
2 | 8 | 5.76 | IV |
Quartet names
quartets['quartet'].unique().tolist()
['I', 'II', 'III', 'IV']
Display the first 3 samples from every dataset
quartets.groupby('quartet').head(3)
| | x | y | quartet |
---|---|---|---|
1 | 10 | 8.04 | I |
2 | 8 | 6.95 | I |
3 | 13 | 7.58 | I |
1 | 10 | 9.14 | II |
2 | 8 | 8.14 | II |
3 | 13 | 8.74 | II |
1 | 10 | 7.46 | III |
2 | 8 | 6.77 | III |
3 | 13 | 12.74 | III |
1 | 8 | 6.58 | IV |
2 | 8 | 5.76 | IV |
3 | 8 | 7.71 | IV |
Display 2 random samples from every dataset
quartets.groupby('quartet').sample(2)
| | x | y | quartet |
---|---|---|---|
5 | 11 | 8.33 | I |
10 | 7 | 4.82 | I |
3 | 13 | 8.74 | II |
7 | 6 | 6.13 | II |
10 | 7 | 6.42 | III |
2 | 8 | 6.77 | III |
10 | 8 | 7.91 | IV |
4 | 8 | 8.84 | IV |
Display every quartet's dataset size
quartets.groupby('quartet').size()
quartet
I      11
II     11
III    11
IV     11
dtype: int64
Descriptive Statistics¶
quartets.groupby('quartet').agg(['mean', 'std']).round(3)
quartet | x mean | x std | y mean | y std |
---|---|---|---|---|
I | 9 | 3.317 | 7.501 | 2.032 |
II | 9 | 3.317 | 7.501 | 2.032 |
III | 9 | 3.317 | 7.500 | 2.030 |
IV | 9 | 3.317 | 7.501 | 2.031 |
The mean and standard deviation are almost identical for every quartet.
Judging from these statistics alone, all four quartets look as if they could have been sampled from the same distribution.
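As a small, self-contained sketch of the aggregation above, the snippet below rebuilds quartets I and IV by hand from the values listed in this notebook (so that quartets.csv is not needed) and computes the same per-group statistics. The names `data`, `two_quartets` and `stats` are only illustrative.

```python
import pandas as pd

# x and y values for quartets I and IV, copied from the tables in this notebook
data = {
    "I": ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
          [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# one long dataframe with a 'quartet' label column, like the one loaded above
two_quartets = pd.concat(
    [pd.DataFrame({"x": x, "y": y, "quartet": name}) for name, (x, y) in data.items()],
    ignore_index=True,
)

# per-quartet mean and standard deviation of every numeric column
stats = two_quartets.groupby("quartet").agg(["mean", "std"]).round(3)
print(stats)
```

Even though quartet IV has ten identical x values and a single outlier, its summary statistics match those of quartet I, which is why the numbers alone cannot tell the datasets apart.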
These are tiny datasets so we could read them all
quartets[quartets['quartet'] == 'I']
| | x | y | quartet |
---|---|---|---|
1 | 10 | 8.04 | I |
2 | 8 | 6.95 | I |
3 | 13 | 7.58 | I |
4 | 9 | 8.81 | I |
5 | 11 | 8.33 | I |
6 | 14 | 9.96 | I |
7 | 6 | 7.24 | I |
8 | 4 | 4.26 | I |
9 | 12 | 10.84 | I |
10 | 7 | 4.82 | I |
11 | 5 | 5.68 | I |
quartets[quartets['quartet'] == 'II']
| | x | y | quartet |
---|---|---|---|
1 | 10 | 9.14 | II |
2 | 8 | 8.14 | II |
3 | 13 | 8.74 | II |
4 | 9 | 8.77 | II |
5 | 11 | 9.26 | II |
6 | 14 | 8.10 | II |
7 | 6 | 6.13 | II |
8 | 4 | 3.10 | II |
9 | 12 | 9.13 | II |
10 | 7 | 7.26 | II |
11 | 5 | 4.74 | II |
quartets[quartets['quartet'] == 'III']
| | x | y | quartet |
---|---|---|---|
1 | 10 | 7.46 | III |
2 | 8 | 6.77 | III |
3 | 13 | 12.74 | III |
4 | 9 | 7.11 | III |
5 | 11 | 7.81 | III |
6 | 14 | 8.84 | III |
7 | 6 | 6.08 | III |
8 | 4 | 5.39 | III |
9 | 12 | 8.15 | III |
10 | 7 | 6.42 | III |
11 | 5 | 5.73 | III |
quartets[quartets['quartet'] == 'IV']
| | x | y | quartet |
---|---|---|---|
1 | 8 | 6.58 | IV |
2 | 8 | 5.76 | IV |
3 | 8 | 7.71 | IV |
4 | 8 | 8.84 | IV |
5 | 8 | 8.47 | IV |
6 | 8 | 7.04 | IV |
7 | 8 | 5.25 | IV |
8 | 19 | 12.50 | IV |
9 | 8 | 5.56 | IV |
10 | 8 | 7.91 | IV |
11 | 8 | 6.89 | IV |
Plot or not to plot?¶
Pandas plotting methods use matplotlib as their default backend.
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
BoxPlot¶
quartets.groupby('quartet').boxplot(grid=False);
sns.color_palette()
sns.color_palette('pastel')
palette = 'pastel'
Seaborn's boxplots
Similar boxplots with Matplotlib and Seaborn
fig, axes = plt.subplots(2, 2, figsize=(8,7))
axes = axes.flatten().tolist()
for quartet, g in quartets.groupby('quartet'):
    ax = axes.pop(0)
    sns.boxplot(data=g, ax=ax, palette=palette);
    ax.set_title(f'quartet {quartet}')
plt.suptitle("Quartets' boxplots");
Using seaborn boxplots to compare the quartets' shared features
fig, ax = plt.subplots(1, 1, figsize=(16,4))
sns.boxplot(x='x', y='value', hue='quartet',
            data=pd.melt(quartets, id_vars='quartet', var_name='x', value_name='value'),
            ax=ax, palette=palette)
ax.set_title("quartets' features");
The problem with the plot above is that we are forcing different features (like x and y) to share the same y-axis.
So, another way to accomplish the goal could be this one:
fig, axes = plt.subplots(1, 2, figsize=(16,4))
for i, col in enumerate(['x', 'y']):
    sns.boxplot(x='quartet', y=col, data=quartets, ax=axes[i], palette=palette);
    axes[i].set_title(f'variable {col}')
Histograms¶
Pandas lets us plot each quartet's feature histograms in one line of code.
quartets.groupby('quartet').hist();
The histograms allow us to start seeing some differences.
for quartet, g in quartets.groupby('quartet'):
    fig, axes = plt.subplots(1, 2, figsize=(8, 2.5))
    sns.histplot(data=g, x="x", hue='quartet', ax=axes[0], palette=palette, bins=10, kde=True);
    sns.histplot(data=g, x="y", hue='quartet', ax=axes[1], palette=palette, bins=10, kde=True);
    plt.suptitle(f'Quartet {quartet}')
We can plot both features, x and y, of all quartets in two plots by moving the subplots creation out of the loop.
# some elements are 'bars' (default but too noisy when plotting so many features), 'step', 'poly'
element = 'step'
fig, axes = plt.subplots(1 , 2, figsize=(12, 5))
legends = []
for quartet, g in quartets.groupby('quartet'):
    legends.append(f'quartet {quartet}')
    sns.histplot(data=g, x="x", hue='quartet', ax=axes[0], palette=palette, bins=10, kde=False, alpha=.2, element=element);
    sns.histplot(data=g, x="y", hue='quartet', ax=axes[1], palette=palette, bins=10, kde=False, alpha=.2, element=element);
axes[0].legend(legends)
axes[1].legend(legends);
FacetGrid¶
This is a powerful tool that can be used in combination with plotting methods from seaborn, or even matplotlib, to draw multiple subplots based on some conditional relationship.
seaborn.FacetGrid(): Multi-plot grid for plotting conditional relationships.
Grid of histograms
for feature in ['x', 'y']:
    # create the grid with condition quartet
    g = sns.FacetGrid(quartets, col="quartet", palette=palette, col_wrap=4)
    # for every condition we are going to create a subplot for the grid for column "feature"
    g.map(sns.histplot, feature, bins=10);
    # col_wrap defines the number of columns. Change the value to 3 and 2 to understand visually its behaviour
We can create one FacetGrid for all features. For that we need to convert the dataframe to long format, with one row per (quartet, variable, value) combination.
melted = pd.melt(quartets, id_vars='quartet', var_name='variable', value_name='value')
melted
| | quartet | variable | value |
---|---|---|---|
0 | I | x | 10.00 |
1 | I | x | 8.00 |
2 | I | x | 13.00 |
3 | I | x | 9.00 |
4 | I | x | 11.00 |
... | ... | ... | ... |
83 | IV | y | 5.25 |
84 | IV | y | 12.50 |
85 | IV | y | 5.56 |
86 | IV | y | 7.91 |
87 | IV | y | 6.89 |
88 rows × 3 columns
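To make the reshaping step more concrete, here is a minimal sketch of pd.melt() on a three-row dataframe. The frame `wide` is a hand-made stand-in for `quartets` (its values are taken from the tables above), so the snippet runs without the CSV file.

```python
import pandas as pd

# a tiny wide dataframe: one column per feature
wide = pd.DataFrame({
    "quartet": ["I", "I", "II"],
    "x": [10, 8, 10],
    "y": [8.04, 6.95, 9.14],
})

# unpivot: every (row, feature) pair becomes one row of the long dataframe
long = pd.melt(wide, id_vars="quartet", var_name="variable", value_name="value")
print(long)
```

Three rows with two features become six rows, which is the same doubling we see above (44 rows become 88).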
# create the grid with quartets as columns and variable as rows
g = sns.FacetGrid(melted, row="variable", col='quartet', palette=palette, sharex=False)
g.map(sns.histplot, 'value', bins=10);
# we need set sharex to False to avoid distorting shapes between rows (you can try changing it to True)
Scatter plots¶
Knowing that we have x and y features, we can think about using other kinds of helpful plots. Why not a scatter plot?
quartets.groupby('quartet').plot.scatter(x='x', y='y', s=50);
Scatter plots with seaborn
We can combine matplotlib with seaborn to improve the aesthetics.
fig, axes = plt.subplots(2,2,figsize=(7,7))
axes = axes.flatten().tolist()
for quartet, g in quartets.groupby('quartet'):
    ax = axes.pop(0)
    sns.scatterplot(data=g, x='x', y='y', ax=ax)
    ax.set_title(f'quartet {quartet}')
plt.subplots_adjust(hspace=0.3);
Scatter plots with FacetGrid
FacetGrid is great for avoiding writing too many lines of matplotlib code. In this case we can force the grid to share the x and y ranges, which makes the features easier to compare across quartets.
g = sns.FacetGrid(quartets, col='quartet', palette=palette, col_wrap=2, sharex=True, sharey=True)
g.map(sns.scatterplot, 'x', 'y');
Line plots¶
We could also use a line plot, but for that the points must be sorted along the x axis.
quartets.sort_values(by='x').groupby('quartet').plot(x='x', y='y', marker='o', lw=.7);
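Why the sorting matters can be sketched without drawing anything. A line plot connects rows in the order they appear, so unsorted x values make the line jump back and forth. The small frame `points` below holds the first five rows of quartet I and is only an illustration.

```python
import pandas as pd

# the first rows of quartet I, in their original (unsorted) order
points = pd.DataFrame({"x": [10, 8, 13, 9, 11], "y": [8.04, 6.95, 7.58, 8.81, 8.33]})

# sort by x so that consecutive rows are neighbours on the x axis
ordered = points.sort_values(by="x")
print(ordered["x"].tolist())  # [8, 9, 10, 11, 13]
```

Each y value travels with its x value, so the sorted frame describes the same points, just in drawing order.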
All in one¶
We also can use matplotlib to plot all groups in the same plot
# create one figure of 1 x 1 size.
fig, ax = plt.subplots(1,1,figsize=(16,6))
# plot all 4 quartets in the same ax
quartets.sort_values(by='x').groupby('quartet').plot(x='x', y='y', marker='o', ms=10, lw=.7, alpha=.7, ax=ax)
plt.ylabel('y')
plt.title('All in one quartets');
Lineplots with seaborn¶
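seaborn.lineplot() simplifies the creation of a similar plot. Note that by default it aggregates repeated x values within each hue group into their mean and draws a confidence interval around it.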
fig, ax = plt.subplots(1,1,figsize=(16,6))
sns.lineplot(data=quartets, x='x', y='y', hue='quartet', marker='o', ms=10, lw=.7, alpha=.7, ax=ax)
plt.title('All in one quartets');
And we can plot all quartets together (removing the conditional hue for seaborn)
fig, axes = plt.subplots(1,2,figsize=(16,4))
sns.lineplot(data=quartets, x='x', y='y', lw=.7, ax=axes[0])
axes[0].set_title('one line of seaborn')
quartets.plot(x='x', y='y', lw=.7, ax=axes[1])
axes[1].set_title('one line of matplotlib');
Seaborn is built on matplotlib, so with a few more lines of matplotlib you should be able to arrive at the same seaborn plot.
Let's use a different dataset¶
We will load an already known dataset.
Source: https://www.kaggle.com/spscientist/students-performance-in-exams
Original source generator: http://roycekimmons.com/tools/generated_data/exams
If you go to the original source you will find this is a fictitious dataset created specifically for data science training purposes.
df = pd.read_csv('StudentsPerformance.csv').rename(
    columns={
        'race/ethnicity': 'group',
        'parental level of education': 'parental',
        'test preparation course': 'course',
        'math score': 'math',
        'reading score': 'reading',
        'writing score': 'writing'
    }
)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   gender    1000 non-null   object
 1   group     1000 non-null   object
 2   parental  1000 non-null   object
 3   lunch     1000 non-null   object
 4   course    1000 non-null   object
 5   math      1000 non-null   int64
 6   reading   1000 non-null   int64
 7   writing   1000 non-null   int64
dtypes: int64(3), object(5)
memory usage: 62.6+ KB
df.head()
| | gender | group | parental | lunch | course | math | reading | writing |
---|---|---|---|---|---|---|---|---|
0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 |
1 | female | group C | some college | standard | completed | 69 | 90 | 88 |
2 | female | group B | master's degree | standard | none | 90 | 95 | 93 |
3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 |
4 | male | group C | some college | standard | none | 76 | 78 | 75 |
df['group'].unique().tolist()
['group B', 'group C', 'group A', 'group D', 'group E']
Let's simplify the dataframe¶
We can simplify the group values to the group letter.
Series.str: Vectorized string functions for Series and Index.
df['group'] = df['group'].str[-1]
df['group'].unique().tolist()
['B', 'C', 'A', 'D', 'E']
df.head()
| | gender | group | parental | lunch | course | math | reading | writing |
---|---|---|---|---|---|---|---|---|
0 | female | B | bachelor's degree | standard | none | 72 | 72 | 74 |
1 | female | C | some college | standard | completed | 69 | 90 | 88 |
2 | female | B | master's degree | standard | none | 90 | 95 | 93 |
3 | male | A | associate's degree | free/reduced | none | 47 | 57 | 44 |
4 | male | C | some college | standard | none | 76 | 78 | 75 |
df['course'].unique()
array(['none', 'completed'], dtype=object)
Series.apply: Invoke function on values of Series.
Series.apply(func, convert_dtype=True, args=(), **kwargs)
# we verify that we have not changed this column's values yet
if 'completed' in df['course'].unique().tolist():
    df['course'] = df['course'].apply(lambda x: 1 if x == 'completed' else 0)
# we can change the column values type to boolean
df['course'] = df['course'].astype(bool)
df['course'].unique()
array([False, True])
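The same encoding can also be written without apply. This is a minimal sketch on a made-up `course` Series (the values mirror the column above, the Series itself is an assumption for illustration): comparing the whole Series with 'completed' returns booleans directly.

```python
import pandas as pd

# illustrative values only, mirroring the 'course' column
course = pd.Series(["none", "completed", "none", "none", "completed"])

# element-wise comparison: True where the course was completed
completed = course == "completed"

# the apply-based version used above, for comparison
via_apply = course.apply(lambda x: 1 if x == "completed" else 0).astype(bool)

print(completed.tolist())
```

Both versions give the same result. The comparison is usually shorter to read, while apply stays useful when the transformation cannot be expressed as a vectorized operation.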
df.head()
| | gender | group | parental | lunch | course | math | reading | writing |
---|---|---|---|---|---|---|---|---|
0 | female | B | bachelor's degree | standard | False | 72 | 72 | 74 |
1 | female | C | some college | standard | True | 69 | 90 | 88 |
2 | female | B | master's degree | standard | False | 90 | 95 | 93 |
3 | male | A | associate's degree | free/reduced | False | 47 | 57 | 44 |
4 | male | C | some college | standard | False | 76 | 78 | 75 |
Missing values¶
df.isna().sum()
gender      0
group       0
parental    0
lunch       0
course      0
math        0
reading     0
writing     0
dtype: int64
None of the columns has missing values.
Some questions:
- Does gender affect math scores?
- Do math scores affect gender?
- Do reading and writing scores affect math scores?
- Do math scores affect reading and writing scores?
- Does one group perform better at math than the rest?
- Does the parental level of education affect math scores?
df[['reading','math']].sample(5)
| | reading | math |
---|---|---|
340 | 61 | 58 |
370 | 77 | 84 |
186 | 76 | 80 |
687 | 78 | 77 |
499 | 71 | 76 |
df[['reading','math']].describe()
| | reading | math |
---|---|---|
count | 1000.000000 | 1000.00000 |
mean | 69.169000 | 66.08900 |
std | 14.600192 | 15.16308 |
min | 17.000000 | 0.00000 |
25% | 59.000000 | 57.00000 |
50% | 70.000000 | 66.00000 |
75% | 79.000000 | 77.00000 |
max | 100.000000 | 100.00000 |
It is unusual to see a zero among test scores. Here we find a 0 in math.
df[df['math'] == 0]
| | gender | group | parental | lunch | course | math | reading | writing |
---|---|---|---|---|---|---|---|---|
59 | female | C | some high school | free/reduced | False | 0 | 17 | 10 |
Does this sample look possible? Why?
Histograms for our selected variables¶
df[['reading', 'math']].hist(bins=50, grid=False);
Histograms for our selected variables (seaborn)¶
We can plot the histograms in separate plots using matplotlib subplots
plt.figure(figsize=(12,4))
sns.histplot(df[['reading']], bins=50, ax=plt.subplot(121), palette=palette)
sns.histplot(df[['math']], bins=50, ax=plt.subplot(122), palette=palette);
But knowing that by default sns.histplot merges all features into the same plot, it could be simpler
sns.histplot(df[['reading', 'math']], bins=50, palette=palette);
Kernel Density Estimate¶
df[['reading', 'math']].plot.kde()
plt.title('KDEs');
Seaborn comes with the method seaborn.kdeplot() to create kernel density plots, but we can simply set the kde parameter of histplot to True to combine both.
sns.histplot(df[['reading', 'math']], bins=50, kde=True, palette=palette);
BoxPlot¶
df[['reading', 'math']].boxplot();
At first glance the distributions look similar, as one could expect. The math score distribution looks slightly shifted down.
Boxplots with seaborn
sns.boxplot(data=df[['reading', 'math']], palette=palette);
Boxenplots (letter-value plots)
sns.boxenplot(data=df[['reading', 'math']], palette=palette);
Violinplots
sns.violinplot(data=df[['reading', 'math']], palette=palette);
What about the relation between the scores? Do they interact?
Scatter to the rescue¶
df.plot.scatter(x='reading', y='math', s=10, alpha=.5, figsize=(6,5))
plt.title('reading vs math');
There is a visible correlation between these variables.
Correlation¶
Pandas has implemented a method named corr().
DataFrame.corr(): Compute pairwise correlation of columns, excluding NA/null values.
DataFrame.corr(method='pearson', min_periods=1)
df[['reading', 'math']].corr()
| | reading | math |
---|---|---|
reading | 1.00000 | 0.81758 |
math | 0.81758 | 1.00000 |
Pandas corr() offers different correlation methods. In most cases pearson and/or spearman are the methods to use.
for method in ['pearson', 'kendall', 'spearman']:
    # iloc is used to access the value at the first row, second column.
    corr = df[['reading', 'math']].corr(method=method).iloc[0,1]
    print(f'{method} correlation: {corr:.3f}')
pearson correlation: 0.818
kendall correlation: 0.617
spearman correlation: 0.804
We've confirmed there is a strong (linear) correlation between reading and math scores. Each variable could work as a proxy of the other variable.
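To see how the methods can disagree, here is a minimal sketch on made-up data (not on the student scores). Pearson measures linear association, while Spearman is the Pearson correlation of the ranks, so a relationship that is monotonic but not linear gets a perfect Spearman score and a lower Pearson score.

```python
import pandas as pd

# toy data, for illustration only: y grows with x, but not linearly
toy = pd.DataFrame({"x": range(1, 11)})
toy["y"] = toy["x"] ** 3

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]
print(f"pearson: {pearson:.3f}, spearman: {spearman:.3f}")
```

For reading and math the two values are close (0.818 and 0.804), which is consistent with the roughly linear cloud seen in the scatter plot.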
Boxplot on the whole dataframe¶
df.boxplot();
Correlation between all variables¶
# with pandas >= 2.0, use df.corr(numeric_only=True) to skip the text columns
df.corr()
| | course | math | reading | writing |
---|---|---|---|---|
course | 1.000000 | 0.177702 | 0.241780 | 0.312946 |
math | 0.177702 | 1.000000 | 0.817580 | 0.802642 |
reading | 0.241780 | 0.817580 | 1.000000 | 0.954598 |
writing | 0.312946 | 0.802642 | 0.954598 | 1.000000 |
Reading and writing have a really strong correlation.
Of course one could use plots
cols = ['math', 'reading', 'writing']
for i, c1 in enumerate(cols):
    c2 = cols[i+1] if i < len(cols)-1 else cols[0]
    df.plot.scatter(x=c1, y=c2, s=10, alpha=.5)
    plt.title(f'{c1} vs {c2}')
Scatter plots with seaborn
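Seaborn comes with seaborn.scatterplot() for a single scatter plot and seaborn.pairplot() for a grid of scatter plots over every pair of numeric columns.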
sns.pairplot(df.select_dtypes('number'), palette=palette);
df[['gender','math', 'reading', 'writing']].sample(5)
| | gender | math | reading | writing |
---|---|---|---|---|
663 | female | 65 | 69 | 67 |
903 | female | 93 | 100 | 100 |
443 | female | 73 | 83 | 76 |
362 | female | 52 | 58 | 58 |
137 | male | 70 | 55 | 56 |
df[['gender','math', 'reading', 'writing']].describe()
| | math | reading | writing |
---|---|---|---|
count | 1000.00000 | 1000.000000 | 1000.000000 |
mean | 66.08900 | 69.169000 | 68.054000 |
std | 15.16308 | 14.600192 | 15.195657 |
min | 0.00000 | 17.000000 | 10.000000 |
25% | 57.00000 | 59.000000 | 57.750000 |
50% | 66.00000 | 70.000000 | 69.000000 |
75% | 77.00000 | 79.000000 | 79.000000 |
max | 100.00000 | 100.000000 | 100.000000 |
Pie plot¶
df['gender'].value_counts(normalize=True).plot.pie(figsize=(6,6));
Seaborn doesn't come with a method to plot pie plots
# with pandas >= 2.0, use df.groupby('gender').mean(numeric_only=True)
df.groupby('gender').mean()
gender | course | math | reading | writing |
---|---|---|---|---|
female | 0.355212 | 63.633205 | 72.608108 | 72.467181 |
male | 0.360996 | 68.728216 | 65.473029 | 63.311203 |
df.groupby('gender').boxplot();
pandas.melt() is a powerful method to unpivot a dataframe. We are going to use it to simplify use of some seaborn plots.
score_cols = df.select_dtypes('number').columns.tolist()
id_vars = [c for c in df.columns if c not in score_cols]
score_cols, id_vars
melted = pd.melt(df, id_vars=id_vars, var_name='skill', value_name='score')
melted.head()
| | gender | group | parental | lunch | course | skill | score |
---|---|---|---|---|---|---|---|
0 | female | B | bachelor's degree | standard | False | math | 72 |
1 | female | C | some college | standard | True | math | 69 |
2 | female | B | master's degree | standard | False | math | 90 |
3 | male | A | associate's degree | free/reduced | False | math | 47 |
4 | male | C | some college | standard | False | math | 76 |
When you make things easier to read for seaborn, seaborn will make the plots easier to read for you.
for func in [sns.boxplot, sns.boxenplot, sns.violinplot]:
    g = sns.FacetGrid(melted, col="skill")
    g.map(func, 'score', 'gender', order=None, palette=palette);
sns.pairplot(df, palette=palette, hue='gender');
df.groupby('gender').plot.kde();
df['is_female'] = df['gender'].apply(lambda x: 1 if x == 'female' else 0)
df['is_female'] = df['is_female'].astype(float)
df['is_male'] = df['gender'].apply(lambda x: 1 if x == 'male' else 0)
df['is_male'] = df['is_male'].astype(float)
df.head()
| | gender | group | parental | lunch | course | math | reading | writing | is_female | is_male |
---|---|---|---|---|---|---|---|---|---|---|
0 | female | B | bachelor's degree | standard | False | 72 | 72 | 74 | 1.0 | 0.0 |
1 | female | C | some college | standard | True | 69 | 90 | 88 | 1.0 | 0.0 |
2 | female | B | master's degree | standard | False | 90 | 95 | 93 | 1.0 | 0.0 |
3 | male | A | associate's degree | free/reduced | False | 47 | 57 | 44 | 0.0 | 1.0 |
4 | male | C | some college | standard | False | 76 | 78 | 75 | 0.0 | 1.0 |
Instead of looking at the correlation between all variables, we want to see how the new variable is_female correlates with the scores. Pandas gives us the method DataFrame.corrwith() for this kind of case.
df[['math', 'reading', 'writing']].corrwith(df['is_female'])
math      -0.167982
reading    0.244313
writing    0.301225
dtype: float64
df[['math', 'reading', 'writing']].corrwith(df['is_male'])
math       0.167982
reading   -0.244313
writing   -0.301225
dtype: float64
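The two results above are mirror images of each other, and that is not a coincidence: is_male equals 1 - is_female, and replacing a variable Z by 1 - Z flips the sign of its Pearson correlation with any other variable. A minimal sketch on made-up rows (the numbers below are illustrative, not the full dataset):

```python
import pandas as pd

# made-up scores and genders, for illustration only
toy = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female", "male"],
    "math": [72, 47, 90, 76, 69, 40],
    "reading": [72, 57, 95, 78, 90, 43],
})

is_female = (toy["gender"] == "female").astype(float)
is_male = 1.0 - is_female  # the complementary indicator

corr_female = toy[["math", "reading"]].corrwith(is_female)
corr_male = toy[["math", "reading"]].corrwith(is_male)
print(corr_female + corr_male)  # approximately zero for every column
```

So for a binary variable, one of the two indicator columns already carries all the information.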
One Hot Encoding¶
What we have just done is known as One Hot Encoding of the gender variable.
Pandas has a function to simplify this kind of conversion: get_dummies().
pd.get_dummies(): Convert categorical variable into dummy/indicator variables.
pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None,
                   sparse=False, drop_first=False, dtype=None)
pd.get_dummies(df['gender']).head()
| | female | male |
---|---|---|
0 | 1 | 0 |
1 | 1 | 0 |
2 | 1 | 0 |
3 | 0 | 1 |
4 | 0 | 1 |
For example, we could use this function to create a new DataFrame with the scores and the one-hot encoded version of the gender variable. We use get_dummies() to encode the categorical gender variable and then pd.concat() to concatenate the two DataFrames along the horizontal axis.
df_encoded = pd.concat([df[['math', 'reading', 'writing']], pd.get_dummies(df['gender'])], axis=1)
df_encoded.head()
| | math | reading | writing | female | male |
---|---|---|---|---|---|
0 | 72 | 72 | 74 | 1 | 0 |
1 | 69 | 90 | 88 | 1 | 0 |
2 | 90 | 95 | 93 | 1 | 0 |
3 | 47 | 57 | 44 | 0 | 1 |
4 | 76 | 78 | 75 | 0 | 1 |
And then, with one more line of code, we arrive at the same conclusions.
df_encoded.corr()
| | math | reading | writing | female | male |
---|---|---|---|---|---|
math | 1.000000 | 0.817580 | 0.802642 | -0.167982 | 0.167982 |
reading | 0.817580 | 1.000000 | 0.954598 | 0.244313 | -0.244313 |
writing | 0.802642 | 0.954598 | 1.000000 | 0.301225 | -0.301225 |
female | -0.167982 | 0.244313 | 0.301225 | 1.000000 | -1.000000 |
male | 0.167982 | -0.244313 | -0.301225 | -1.000000 | 1.000000 |
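The correlation of -1 between female and male in the table above shows that the two indicator columns are redundant. As a hedged sketch on a made-up Series, the drop_first parameter of get_dummies() (listed in the signature above) keeps only one of them. Passing dtype=int is our own choice here, so that the output looks the same across pandas versions.

```python
import pandas as pd

# illustrative values only
gender = pd.Series(["female", "female", "male", "female", "male"], name="gender")

# full one-hot encoding: one indicator column per category
full = pd.get_dummies(gender, dtype=int)

# drop_first=True drops the first category ('female'), leaving a single indicator
reduced = pd.get_dummies(gender, drop_first=True, dtype=int)

print(full.columns.tolist(), reduced.columns.tolist())
```

Whether to drop a column depends on the model that will consume the data. Some models are sensitive to perfectly correlated inputs, others are not.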
Extra
Most ML models don't work with categorical variables directly. You will become familiar with methods like get_dummies() from pandas, or similar ones from other libraries, to prepare the data that will feed your models. Sometimes it is convenient to normalize or standardize the data. We already know the range of every column in the new dataset, so we could min-max normalize it to the interval [0, 1] using one line of code.
df_normalized = (df_encoded - df_encoded.min()).div(df_encoded.max() - df_encoded.min())
df_normalized.head()
| | math | reading | writing | female | male |
---|---|---|---|---|---|
0 | 0.72 | 0.662651 | 0.711111 | 1.0 | 0.0 |
1 | 0.69 | 0.879518 | 0.866667 | 1.0 | 0.0 |
2 | 0.90 | 0.939759 | 0.922222 | 1.0 | 0.0 |
3 | 0.47 | 0.481928 | 0.377778 | 0.0 | 1.0 |
4 | 0.76 | 0.734940 | 0.722222 | 0.0 | 1.0 |
The new dataset df_normalized looks like a generic dataset for any kind of ML algorithm.
And we can check that normalizing did not change the correlations.
df_normalized.corr().round(14) == df_encoded.corr().round(14)
| | math | reading | writing | female | male |
---|---|---|---|---|---|
math | True | True | True | True | True |
reading | True | True | True | True | True |
writing | True | True | True | True | True |
female | True | True | True | True | True |
male | True | True | True | True | True |
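Why the correlations survive can be sketched on a few rows. Min-max scaling subtracts a constant from each column and divides by a positive constant, and Pearson correlation is unchanged by such a rescaling. The frame `scores` below holds the five rows shown by head() above, and the scaling formula is the standard min-max one, written out as an assumption about what "normalize" means here.

```python
import numpy as np
import pandas as pd

# a few score rows taken from the tables above
scores = pd.DataFrame({
    "math": [72, 69, 90, 47, 76],
    "reading": [72, 90, 95, 57, 78],
    "writing": [74, 88, 93, 44, 75],
})

# min-max scaling: subtract the column minimum, divide by the column range
scaled = (scores - scores.min()) / (scores.max() - scores.min())

# a positive linear rescaling of each column leaves Pearson correlations unchanged
same_corr = np.allclose(scaled.corr(), scores.corr())
print(scaled.min().tolist(), scaled.max().tolist(), same_corr)
```

After scaling, every column runs from 0 to 1, which makes features with different units comparable.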
Heatmap¶
seaborn.heatmap(): Plot rectangular data as a color-encoded matrix.
A heatmap is a great tool for plotting correlations between features.
fig, ax = plt.subplots(1, 1, figsize=(8,6))
sns.heatmap(df_normalized.corr(), annot=True, fmt='.2f', cmap='Blues', ax=ax);
Who will approve?¶
approval_threshold = 40
df['approved'] = df['math'] >= approval_threshold
df['approved'] = df['approved'].astype(int)
df.head()
| | gender | group | parental | lunch | course | math | reading | writing | is_female | is_male | approved |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | female | B | bachelor's degree | standard | False | 72 | 72 | 74 | 1.0 | 0.0 | 1 |
1 | female | C | some college | standard | True | 69 | 90 | 88 | 1.0 | 0.0 | 1 |
2 | female | B | master's degree | standard | False | 90 | 95 | 93 | 1.0 | 0.0 | 1 |
3 | male | A | associate's degree | free/reduced | False | 47 | 57 | 44 | 0.0 | 1.0 | 1 |
4 | male | C | some college | standard | False | 76 | 78 | 75 | 0.0 | 1.0 | 1 |
df['approved'].value_counts(normalize=True).plot.bar();
Seaborn has a method for plotting counts of feature's values.
sns.countplot(data=df, x='group', palette=palette);
The problem is that the countplot method doesn't come with a normalize parameter, so plotting a normalized version is not as simple as when using pandas (Series.value_counts(normalize=True).plot.bar()).
df['approved'].value_counts(normalize=True).to_frame()
| | approved |
---|---|
1 | 0.96 |
0 | 0.04 |
sns.barplot(data=((df['approved'].value_counts(normalize=True)*100).to_frame()
                  .reset_index().rename(columns={'approved': '%', 'index': 'approved'})),
            x='approved',
            y='%',
            palette=palette);
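The data preparation inside the barplot call can be written as a separate step. The sketch below uses a made-up `approved` Series and names the resulting columns explicitly with rename_axis() and reset_index(name=...), so it does not rely on the default column names, which can differ between pandas versions.

```python
import pandas as pd

# made-up pass/fail flags, for illustration only
approved = pd.Series([1, 1, 1, 0, 1, 1, 1, 1, 0, 1], name="approved")

percentages = (
    approved.value_counts(normalize=True)  # shares that sum to 1
    .mul(100)                              # convert to percentages
    .rename_axis("approved")               # name the index explicitly
    .reset_index(name="%")                 # index -> column, values -> '%'
)
print(percentages)
```

The resulting frame has one row per category and can be passed to seaborn.barplot() with x='approved' and y='%'.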
df[['gender', 'course', 'reading', 'writing', 'math']].groupby('gender').corrwith(df['approved'])
gender | course | reading | writing | math |
---|---|---|---|---|
female | 0.102233 | 0.513812 | 0.549594 | 0.548292 |
male | 0.071767 | 0.317447 | 0.331337 | 0.339371 |
df.groupby('approved')['gender'].value_counts(normalize=True).plot.bar();
We will try to do the same plot with seaborn
tmp = (df.groupby('approved')['gender'].value_counts(normalize=True).to_frame().rename(columns={'gender': '%'})*100).reset_index()
sns.barplot(data=tmp, x='approved', y='%', hue='gender', palette=palette);
Years in a cell¶
When plotting you can think of using one of these four approaches:
- Pandas
- Pandas + Matplotlib
- Pandas + Seaborn
- Pandas + Seaborn + Matplotlib
Pandas
- Learning: easy
- Default Visual: bad
- Custom Visual: regular
- TIP: just knowing which plotting methods are implemented in pandas is enough to start plotting many things and extract information for yourself (but maybe not for a presentation).
Pandas + Matplotlib
- Learning: difficult
- Default Visual: regular
- Custom Visual: excellent but tricky (it's all about learning matplotlib, not easy to start from scratch)
- TIP: Think about the plot you want; then DataFrame.groupby or some condition applied to the dataframe will be enough to feed your plots.
Pandas + Seaborn
- Learning: good
- Default Visual: good
- Custom Visual: very good
- TIP: Seaborn is mostly about preparing a DataFrame to feed the seaborn plot you are looking for. So you need to learn about seaborn's available plots and probably spend some time learning pandas methods like melt and pivot to transform the dataframe into the kind of input seaborn likes.
Pandas + Matplotlib + Seaborn
- Learning: difficult
- Default Visual: good
- Custom Visual: excellent
- TIP: Sky is the limit. Remember that seaborn was built on matplotlib.
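Since the tips above mention both melt and pivot, here is a minimal sketch of the round trip between the two shapes. The frame `long` and its `student` column are invented for illustration: pivot() turns a long dataframe into a wide one, and melt() goes back.

```python
import pandas as pd

# a small long-format dataframe: one row per (student, skill) pair
long = pd.DataFrame({
    "student": [0, 0, 1, 1],
    "skill": ["math", "reading", "math", "reading"],
    "score": [72, 72, 69, 90],
})

# pivot: long -> wide, one column per skill
wide = long.pivot(index="student", columns="skill", values="score")

# melt: wide -> long again
back = wide.reset_index().melt(id_vars="student", var_name="skill", value_name="score")
print(wide)
```

Pandas plotting methods tend to work best with the wide shape (one column per series), while most seaborn functions prefer the long shape with hue, col and row columns.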
sns.countplot(data=df, x='approved', hue='gender', palette=palette);
ax = plt.subplot()
for group, g in df.groupby(['approved','gender']):
    g[['math']].hist(bins=50, ax=ax, alpha=.3, label=f'{group[0]} {group[1]}');
plt.legend();
To do the same plot with seaborn we will need to convert the dataframe like the melted one and add some new column that represents a combination of gender and approved. Sometimes it's better to look for alternatives that let us do the same analysis without too much coding.
g = sns.FacetGrid(df, col='approved', row='gender')
g.map(sns.histplot, 'math', palette=palette);
g = sns.FacetGrid(df, col='approved', row='gender')
g.map(sns.histplot, 'reading', palette=palette);
g = sns.FacetGrid(df, col='approved', row='gender')
g.map(sns.histplot, 'writing', palette=palette);
# let's repeat the three features with violinplots
for feature in ['reading', 'writing', 'math']:
    g = sns.FacetGrid(df, col='approved', row='gender', sharex=True, sharey=True)
    g.map(sns.violinplot, feature, order=None, palette=palette);
If we prepare the data for seaborn, seaborn will give us what we want. For instance, seaborn.violinplot() can split the violin distribution using a secondary binary hue feature. But this can only be done when using the parameters x and y. In this case we can use a dummy feature to plot what we want. Knowing this will help us improve our previous plot.
df['dummy'] = ''
# let's repeat the three features with violinplots
for feature in ['reading', 'writing', 'math']:
    g = sns.FacetGrid(df, col='approved', sharey=True)
    # map_dataframe passes each facet's subset as `data`, so every panel shows only its own rows
    g.map_dataframe(sns.violinplot, x='dummy', y=feature, hue='gender', split=True, order=None, palette=palette);
    g.add_legend()  # we want to display the gender legend
    g.set_ylabels('score')
    g.fig.subplots_adjust(top=0.8)
    g.fig.suptitle(f'feature: {feature}', fontsize=12, font='verdana')
del df['dummy']
PairGrid¶
seaborn.PairGrid() is a great tool that lets us extend seaborn plots easily.
# this should do something similar to pairplot() but without setting the histogram in the diagonal
g = sns.PairGrid(df)
g.map(sns.scatterplot);
del df['is_female']
del df['is_male']
Maybe you didn't see the power of PairGrids. Let's try again with a new custom PairGrid plot with multivariate KDE subplots
# Create a cubehelix colormap to use with kdeplot
cmap = sns.cubehelix_palette(start=0, light=.95, as_cmap=True)
g = sns.PairGrid(df, diag_sharey=False)
g.map_upper(sns.kdeplot, cmap=cmap, fill=True)
g.map_lower(sns.kdeplot, cmap=cmap, fill=True)
g.map_diag(sns.kdeplot, color='#aa0000', fill=True);
Summary¶
In this lecture you've learnt:
Some important things about using SEABORN with PANDAS! ¶
- Importance of plots
- Using pandas for EDA
- Notion of One Hot Encoding
Pandas
- pandas.read_csv()
- pandas.concat()
- pandas.get_dummies()
- DataFrame.info()
- DataFrame.head()
- DataFrame.sample()
- DataFrame.describe()
- Series.unique()
- Series.str
- DataFrame.groupby()
- DataFrame.sort_values()
- DataFrame.corr()
- DataFrame.corrwith()
- DataFrameGroupBy.size()
Pandas (plotting)
- DataFrame.boxplot()
- DataFrame.hist()
- DataFrame.plot()
- DataFrame.plot.kde()
- DataFrame.plot.pie()
- DataFrame.plot.scatter()
matplotlib
- matplotlib.pyplot.subplots()
- matplotlib.pyplot.title()
- matplotlib.pyplot.plot()
- matplotlib.pyplot.suptitle()
- matplotlib.pyplot.subplot()
- matplotlib.pyplot.subplots_adjust()
- matplotlib.pyplot.ylabel()
- matplotlib.pyplot.legend()
- matplotlib.pyplot.figure()
seaborn
- seaborn.boxplot()
- seaborn.boxenplot()
- seaborn.histplot()
- seaborn.barplot()
- seaborn.countplot()
- seaborn.scatterplot()
- seaborn.violinplot()
- seaborn.lineplot()
- seaborn.pairplot()
- seaborn.heatmap()
- seaborn.kdeplot()
- seaborn.FacetGrid()
- seaborn.PairGrid()