


EDA

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)

Italian Olives

I found this data set in the RGGobi book (http://www.ggobi.org/book/), from which the above diagram is taken. It contains "the percentage composition of fatty acids found in the lipid fraction of Italian olive oils", with oils from three regions of Italy: the North, the South, and Sardinia. The regions themselves are subdivided into areas as shown in the map above. The source for this data is:

Forina, M., Armanino, C., Lanteri, S. & Tiscornia, E. (1983), Classification of Olive Oils from their Fatty Acid Composition, in Martens, H. and Russwurm Jr., H., eds, Food Research and Data Analysis, Applied Science Publishers, London, pp. 189–214.

Exploratory Viz

In [4]:
df = pd.read_csv("local-olives-cleaned.csv")
df.head()
Out[4]:
areastring region area palmitic palmitoleic stearic oleic linoleic linolenic arachidic eicosenoic regionstring
0 North-Apulia 1 1 10.75 0.75 2.26 78.23 6.72 0.36 0.60 0.29 South
1 North-Apulia 1 1 10.88 0.73 2.24 77.09 7.81 0.31 0.61 0.29 South
2 North-Apulia 1 1 9.11 0.54 2.46 81.13 5.49 0.31 0.63 0.29 South
3 North-Apulia 1 1 9.66 0.57 2.40 79.52 6.19 0.50 0.78 0.35 South
4 North-Apulia 1 1 10.51 0.67 2.59 77.71 6.72 0.50 0.80 0.46 South

Exploring globally

In [5]:
pd.crosstab(df.areastring, df.regionstring)
Out[5]:
regionstring North Sardinia South
areastring
Calabria 0 0 56
Coast-Sardinia 0 33 0
East-Liguria 50 0 0
Inland-Sardinia 0 65 0
North-Apulia 0 0 25
Sicily 0 0 36
South-Apulia 0 0 206
Umbria 51 0 0
West-Liguria 50 0 0
In [6]:
# Series.value_counts replaces the deprecated top-level pd.value_counts
df.areastring.value_counts(sort=False).plot(kind="bar");
In [7]:
df.regionstring.value_counts(sort=False).plot(kind="barh");
In [9]:
acidlist=['palmitic', 'palmitoleic', 'stearic', 'oleic', 'linoleic', 'linolenic', 'arachidic', 'eicosenoic']
df[acidlist].median().plot(kind="bar");

One can also group the rows with groupby and use aggregate to pass an arbitrary function to each sub-dataframe. The function is applied columnwise.

In [10]:
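# drop the string column first so that the mean is taken over numeric columns only
dfbymean=df.drop(columns="areastring").groupby("regionstring").aggregate("mean")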
dfbymean.head()
Out[10]:
region area palmitic palmitoleic stearic oleic linoleic linolenic arachidic eicosenoic
regionstring
North 3.0 8.006623 10.948013 0.837351 2.308013 77.930530 7.270331 0.217881 0.375762 0.019735
Sardinia 2.0 5.336735 11.113469 0.967449 2.261837 72.680204 11.965306 0.270918 0.731735 0.019388
South 1.0 2.783282 13.322879 1.548019 2.287740 71.000093 10.334985 0.380650 0.631176 0.273220
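To illustrate what "applied columnwise" means, here is a minimal sketch that passes a user-defined function to aggregate. The frame `toy`, its values, and the helper `value_range` are hypothetical and are not taken from the olive data; only the column names mirror the notebook.

```python
import pandas as pd

# Toy frame with made-up numbers (NOT the olive data); only the column
# names mirror the notebook.
toy = pd.DataFrame({
    "regionstring": ["North", "North", "South", "South", "South"],
    "oleic": [78.0, 77.0, 71.0, 70.0, 73.0],
    "linoleic": [7.0, 8.0, 10.0, 11.0, 9.0],
})

def value_range(column):
    """Spread of one column within one group (max minus min)."""
    return column.max() - column.min()

# aggregate calls value_range once per column within each group,
# giving one row per region and one column per acid
toy_ranges = toy.groupby("regionstring").aggregate(value_range)
print(toy_ranges)
```

On the real data, the same call with `df` in place of `toy` (after dropping the string column) should give the per-region spread of every acid.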
In [11]:
with sns.axes_style("white", {'axes.grid': False}):  # 'axes.grid' is the key seaborn recognizes
    dfbymean[acidlist].plot(kind='barh', stacked=True);
    sns.despine()

Figuring the dataset by Region

In [12]:
g=sns.FacetGrid(df, col="region")
g.map(plt.scatter,"eicosenoic", "linoleic");

Clearly, region 1 (the South) can be separated visually by the eicosenoic fraction alone.

In [13]:
with sns.axes_style("white"):
    g=sns.FacetGrid(df, col="region")
    # histplot with a KDE overlay replaces the deprecated distplot
    g.map(sns.histplot, "eicosenoic", kde=True, stat="density")

We make a scatterplot matrix (SPLOM) using seaborn to see in what space the regions may be separated. Note that linoleic and oleic seem promising, and perhaps arachidic paired with eicosenoic as well.

In [14]:
sns.pairplot(df, vars=acidlist, hue="regionstring", height=2.5, diag_kind='kde');  # 'size' was renamed to 'height'

Pandas supports conditional (boolean) indexing. Let's use it to follow up on the clear pattern of Southern oils seeming to be separable by the eicosenoic feature alone.
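The original cell for this check is not shown, so the following is only a sketch of how such a check might look. The frame `toy` is a stand-in: its two Southern values are copied from the `df.head()` output above, while the North and Sardinia values are made-up placeholders. On the real data one would use `df` instead.

```python
import pandas as pd

# First two rows reuse eicosenoic values from df.head() above; the last two
# are made-up placeholders standing in for non-Southern oils.
toy = pd.DataFrame({
    "regionstring": ["South", "South", "North", "Sardinia"],
    "eicosenoic": [0.29, 0.35, 0.02, 0.01],
})

# A boolean Series used as an index keeps only the rows where it is True.
is_south = toy.regionstring == "South"
south_min = toy[is_south].eicosenoic.min()
other_max = toy[~is_south].eicosenoic.max()

# If the smallest Southern value exceeds the largest value elsewhere,
# a single threshold on eicosenoic separates the South.
separable = south_min > other_max
print(south_min, other_max, separable)
```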

Indeed this is the case! The same separation can also be seen using parallel coordinates:

In [40]:
from pandas.plotting import parallel_coordinates  # pandas.tools.plotting was removed
dfna=df[acidlist]
# center each column on its mean, then scale by its range (max - min)
dfna_norm = (dfna - dfna.mean()) / (dfna.max() - dfna.min())
with sns.axes_style("white"):
    parallel_coordinates(df[['regionstring']].join(dfna_norm), 'regionstring', alpha=0.3)
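As a small check of what the normalization above does, the sketch below applies the same formula to the five palmitic and oleic values shown in the `df.head()` output. The name `sample` is introduced here for illustration only.

```python
import pandas as pd

# The five palmitic/oleic values shown in df.head() above.
sample = pd.DataFrame({
    "palmitic": [10.75, 10.88, 9.11, 9.66, 10.51],
    "oleic": [78.23, 77.09, 81.13, 79.52, 77.71],
})

# Subtract each column's mean, then divide by its range (max - min).
# Arithmetic between a DataFrame and a Series aligns on column labels.
sample_norm = (sample - sample.mean()) / (sample.max() - sample.min())

# Each column should now be centred on 0 and span one unit,
# up to floating-point rounding.
print(sample_norm.mean())
print(sample_norm.max() - sample_norm.min())
```

Putting every acid on a comparable scale this way keeps the large oleic percentages from dominating the vertical axis of the parallel coordinates plot.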

Figuring the South of Italy by Area

In [15]:
dfsouth=df[df.regionstring=='South']
dfsouth.head()
Out[15]:
areastring region area palmitic palmitoleic stearic oleic linoleic linolenic arachidic eicosenoic regionstring
0 North-Apulia 1 1 10.75 0.75 2.26 78.23 6.72 0.36 0.60 0.29 South
1 North-Apulia 1 1 10.88 0.73 2.24 77.09 7.81 0.31 0.61 0.29 South
2 North-Apulia 1 1 9.11 0.54 2.46 81.13 5.49 0.31 0.63 0.29 South
3 North-Apulia 1 1 9.66 0.57 2.40 79.52 6.19 0.50 0.78 0.35 South
4 North-Apulia 1 1 10.51 0.67 2.59 77.71 6.72 0.50 0.80 0.46 South

We make a couple of SPLOMs, one with Sicily and one without, to see what is separable. Sicily seems to be a problem. As before, look at the KDEs first to see whether separability exists, and then let the eye look for patterns.

In [16]:
sns.pairplot(dfsouth, hue="areastring", height=2.5, vars=acidlist, diag_kind='kde');  # 'size' was renamed to 'height'
In [17]:
sns.pairplot(dfsouth[dfsouth.areastring!="Sicily"], hue="areastring", height=2.5, vars=acidlist, diag_kind='kde');

It seems that combinations of oleic, palmitic, and palmitoleic might be useful for separating the Southern areas.