Key Word(s): pandas
CS109a Introduction to PANDAS
Lecture 1, Pandas Intro¶
Harvard University
Fall 2021
Instructors: Pavlos Protopapas and Natesh Pillai
Pandas¶
PANDAS is a Python library that provides highly useful data structures, including the DataFrame, which make Exploratory Data Analysis (EDA) easy. Here we will see some of its elementary functions in practice.
Installing¶
Using conda
conda install pandas
Using pip
pip install pandas
TIP: You can install a library from a Jupyter notebook cell by prefixing the shell command with "!"
# using conda
!conda install pandas
# or using pip
!pip install pandas
PANDAS Basics¶
Let's get started with basic functionality of PANDAS!
Importing pandas¶
Importing pandas is as simple as the following line:
import pandas
But for brevity and convenience we usually import it as pd:
import pandas as pd
You can check the version of almost any library using __version__:
pd.__version__
'1.3.2'
Pandas data structures¶
The main data structures in pandas are the Series (useful for time series) and the DataFrame.
- Series
- Formal: One-dimensional ndarray with axis labels (including time series).
- Roughly: you can think of it as a spreadsheet column or a one-column relational database table
- DataFrame
- Formal: Two-dimensional, size-mutable, potentially heterogeneous tabular data.
- Roughly: analogous to a relational database table, where every column of a DataFrame is a Series.
Both DataFrames and Series always have an index.
pd.Series¶
pd.Series(data=None, index=None, dtype=None, name=None, copy=False)
When we don't provide an index, pandas creates one for us:
s1 = pd.Series(range(0, 50, 10))
s1
0 0
1 10
2 20
3 30
4 40
dtype: int64
The data can be strings, not just numbers.
The index can be anything, but the data and index must have the same length.
s = pd.Series(data=['A', 'B', 'C', 'D', 'E'], index=range(10, 5, -1))
s
10    A
9     B
8     C
7     D
6     E
dtype: object
We can independently access the series' values or its index
s.values
array(['A', 'B', 'C', 'D', 'E'], dtype=object)
s.index
RangeIndex(start=10, stop=5, step=-1)
pd.DataFrame¶
pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)
This data structure also contains labeled axes (rows and columns).
index | First Name | Last Name |
---|---|---|
0 | Ann | Gatton |
1 | John | Fosa |
2 | Zack | Kaufman |
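As a minimal sketch (reusing the names from the table above), one way to build that small DataFrame is:
# build the First Name / Last Name table shown above from a dict of columns
pd.DataFrame(data={'First Name': ['Ann', 'John', 'Zack'],
                   'Last Name': ['Gatton', 'Fosa', 'Kaufman']})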
The DataFrame class offers powerful ways to create DataFrames. For instance, the two code lines below generate the same DataFrame object.
# using rows
pd.DataFrame(data=[[1,2], [3,4], [5,6]], columns=['A','B'])
# using columns
pd.DataFrame(data={'A':[1,3,5], 'B': [2,4,6]})
A | B | |
---|---|---|
0 | 1 | 2 |
1 | 3 | 4 |
2 | 5 | 6 |
Loading data¶
It's possible to create DataFrames by hand, but usually we read data from external sources. This is easy to do in pandas, which offers a read_* function for many formats:
tpl = 'https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.{}.html'
for m in ['clipboard', 'csv', 'excel', 'feather', 'fwf', 'gbq',
'hdf', 'html', 'json', 'parquet', 'pickle', 'spss',
'sql', 'sql_query', 'sql_table', 'stata', 'table', 'xml']:
method = f'read_{m}'
url = tpl.format(method)
print(f'{method}\t{url}')
read_clipboard	https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_clipboard.html
read_csv	https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
read_excel	https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
read_feather	https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_feather.html
read_fwf	https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_fwf.html
read_gbq	https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_gbq.html
read_hdf	https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_hdf.html
read_html	https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
read_json	https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html
read_parquet	https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_parquet.html
read_pickle	https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_pickle.html
read_spss	https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_spss.html
read_sql	https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html
read_sql_query	https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql_query.html
read_sql_table	https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql_table.html
read_stata	https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_stata.html
read_table	https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_table.html
read_xml	https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_xml.html
Example
The method read_html is powerful but requires a bit of experience.
- The first line fetches the URL and extracts every HTML table that matches the criterion into a list of DataFrames; we keep the first one.
- The header comes in as the first row of the DataFrame, so in the second line we use the first row's values as column names and then remove that row.
df = pd.read_html('https://en.wikipedia.org/wiki/Harvard_University', match='School')[0]
df
0 | 1 | |
---|---|---|
0 | School | Founded |
1 | Harvard College | 1636 |
2 | Medicine | 1782 |
3 | Divinity | 1816 |
4 | Law | 1817 |
5 | Dental Medicine | 1867 |
6 | Arts and Sciences | 1872 |
7 | Business | 1908 |
8 | Extension | 1910 |
9 | Design | 1914 |
10 | Education | 1920 |
11 | Public Health | 1922 |
12 | Government | 1936 |
13 | Engineering and Applied Sciences | 2007 |
df = pd.read_html('https://en.wikipedia.org/wiki/Harvard_University', match='School')[0]
df = df.rename(columns=df.iloc[0])[1:]
df
School | Founded | |
---|---|---|
1 | Harvard College | 1636 |
2 | Medicine | 1782 |
3 | Divinity | 1816 |
4 | Law | 1817 |
5 | Dental Medicine | 1867 |
6 | Arts and Sciences | 1872 |
7 | Business | 1908 |
8 | Extension | 1910 |
9 | Design | 1914 |
10 | Education | 1920 |
11 | Public Health | 1922 |
12 | Government | 1936 |
13 | Engineering and Applied Sciences | 2007 |
pd.read_csv¶
read_csv is the recommended starting point for anyone learning pandas. You can read its docs at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html.
Let's use it to load Avocado prices¶
It is a well-known fact that Millennials LOVE Avocado Toast. It's also a well-known fact that all Millennials live in their parents' basements.
Clearly, they aren't buying homes because they are buying too much Avocado Toast!
But maybe there's hope… if a Millennial could find a city with cheap avocados, they could live out the Millennial American Dream.
The table below represents weekly 2018 retail scan data for National retail volume (units) and price. Retail scan data comes directly from retailers’ cash registers based on actual retail sales of Hass avocados. Starting in 2013, the table below reflects an expanded, multi-outlet retail data set. Multi-outlet reporting includes an aggregation of the following channels: grocery, mass, club, drug, dollar and military. The Average Price (of avocados) in the table reflects a per unit (per avocado) cost, even when multiple units (avocados) are sold in bags. The Product Lookup codes (PLU’s) in the table are only for Hass avocados. Other varieties of avocados (e.g. greenskins) are not included in this table.
Some relevant columns in the dataset:
- Date: The date of the observation
- AveragePrice: the average price of a single avocado
- type: conventional or organic
- year: the year
- Region: the city or region of the observation
- Total Volume: Total number of avocados sold
- 4046: Total number of avocados with PLU 4046 sold
- 4225: Total number of avocados with PLU 4225 sold
- 4770: Total number of avocados with PLU 4770 sold
Load dataset¶
Read a compressed CSV file. We ask pandas to use the first CSV column as the index, so it doesn't create a new one by default.
TIP: when you don't know what you are loading, or you already know it is a big dataset, you can limit the number of rows loaded with the parameter nrows (the default, nrows=None, loads everything).
df = pd.read_csv('avocado.csv.zip', index_col=0, compression='zip', nrows=None)
Roughly exploring the data¶
We can quickly see the DataFrame's dimensions
df.shape
(18249, 13)
The shape is a tuple with the number of rows and the number of columns
len(df.index), len(df.columns)
(18249, 13)
Show only the columns' names
df.columns
Index(['Date', 'AveragePrice', 'Total Volume', '4046', '4225', '4770', 'Total Bags', 'Small Bags', 'Large Bags', 'XLarge Bags', 'type', 'year', 'region'], dtype='object')
The columns attribute is not a Python list; it is a pandas Index.
type(df.columns) == pd.Index
True
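If you need a plain Python list of the column names, you can convert the Index — a quick sketch:
# convert the pandas Index into a regular Python list
df.columns.tolist()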
Show only the index
df.index
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ... 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype='int64', length=18249)
Sometimes a column's type is incorrect; possible ways to detect this are df.info() and df.dtypes.
df.dtypes
Date             object
AveragePrice    float64
Total Volume    float64
4046            float64
4225            float64
4770            float64
Total Bags      float64
Small Bags      float64
Large Bags      float64
XLarge Bags     float64
type             object
year              int64
region           object
dtype: object
For example, here Date is an object (the way pandas stores strings). We can use a more suitable column type for it.
df['Date'] = pd.to_datetime(df['Date'])
df.dtypes
Date            datetime64[ns]
AveragePrice           float64
Total Volume           float64
4046                   float64
4225                   float64
4770                   float64
Total Bags             float64
Small Bags             float64
Large Bags             float64
XLarge Bags            float64
type                    object
year                     int64
region                  object
dtype: object
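A datetime column also unlocks the .dt accessor for date-based work; as a small illustrative sketch:
# extract the year from each date (this simply mirrors the existing 'year' column)
df['Date'].dt.year.head()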
show first (by default: 5) rows
df.head()
Date | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | type | year | region | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2015-12-27 | 1.33 | 64236.62 | 1036.74 | 54454.85 | 48.16 | 8696.87 | 8603.62 | 93.25 | 0.0 | conventional | 2015 | Albany |
1 | 2015-12-20 | 1.35 | 54876.98 | 674.28 | 44638.81 | 58.33 | 9505.56 | 9408.07 | 97.49 | 0.0 | conventional | 2015 | Albany |
2 | 2015-12-13 | 0.93 | 118220.22 | 794.70 | 109149.67 | 130.50 | 8145.35 | 8042.21 | 103.14 | 0.0 | conventional | 2015 | Albany |
3 | 2015-12-06 | 1.08 | 78992.15 | 1132.00 | 71976.41 | 72.58 | 5811.16 | 5677.40 | 133.76 | 0.0 | conventional | 2015 | Albany |
4 | 2015-11-29 | 1.28 | 51039.60 | 941.48 | 43838.39 | 75.78 | 6183.95 | 5986.26 | 197.69 | 0.0 | conventional | 2015 | Albany |
show last 2 rows
df.tail(2)
Date | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | type | year | region | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10 | 2018-01-14 | 1.93 | 16205.22 | 1527.63 | 2981.04 | 727.01 | 10969.54 | 10919.54 | 50.00 | 0.0 | organic | 2018 | WestTexNewMexico |
11 | 2018-01-07 | 1.62 | 17489.58 | 2894.77 | 2356.13 | 224.53 | 12014.15 | 11988.14 | 26.01 | 0.0 | organic | 2018 | WestTexNewMexico |
display some data info
Sometimes the DataFrame method info() is a great way to take a first snapshot of a dataset with few columns. It displays:
- column names
- number of rows (as entries)
- number of non null values
- data type per column (per Series)
- memory usage
TIP: if you know the number of columns is high (for example after checking df.shape[1]), you can pass verbose=False to info() to reduce the output to global information only.
few_columns = True
df.info(verbose=few_columns)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 18249 entries, 0 to 11
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   Date          18249 non-null  datetime64[ns]
 1   AveragePrice  18249 non-null  float64
 2   Total Volume  18249 non-null  float64
 3   4046          18249 non-null  float64
 4   4225          18249 non-null  float64
 5   4770          18249 non-null  float64
 6   Total Bags    18249 non-null  float64
 7   Small Bags    18249 non-null  float64
 8   Large Bags    18249 non-null  float64
 9   XLarge Bags   18249 non-null  float64
 10  type          18249 non-null  object
 11  year          18249 non-null  int64
 12  region        18249 non-null  object
dtypes: datetime64[ns](1), float64(9), int64(1), object(2)
memory usage: 1.9+ MB
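For a wide DataFrame the tip above applies: passing verbose=False keeps only the global summary. A quick sketch:
# per-column details suppressed; only class, index, dtype counts and memory usage are shown
df.info(verbose=False)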
Descriptive statistics¶
We can take a fast look at some data statistics with one line of code
df.describe()
AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | year | |
---|---|---|---|---|---|---|---|---|---|---|
count | 18249.000000 | 1.824900e+04 | 1.824900e+04 | 1.824900e+04 | 1.824900e+04 | 1.824900e+04 | 1.824900e+04 | 1.824900e+04 | 18249.000000 | 18249.000000 |
mean | 1.405978 | 8.506440e+05 | 2.930084e+05 | 2.951546e+05 | 2.283974e+04 | 2.396392e+05 | 1.821947e+05 | 5.433809e+04 | 3106.426507 | 2016.147899 |
std | 0.402677 | 3.453545e+06 | 1.264989e+06 | 1.204120e+06 | 1.074641e+05 | 9.862424e+05 | 7.461785e+05 | 2.439660e+05 | 17692.894652 | 0.939938 |
min | 0.440000 | 8.456000e+01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 2015.000000 |
25% | 1.100000 | 1.083858e+04 | 8.540700e+02 | 3.008780e+03 | 0.000000e+00 | 5.088640e+03 | 2.849420e+03 | 1.274700e+02 | 0.000000 | 2015.000000 |
50% | 1.370000 | 1.073768e+05 | 8.645300e+03 | 2.906102e+04 | 1.849900e+02 | 3.974383e+04 | 2.636282e+04 | 2.647710e+03 | 0.000000 | 2016.000000 |
75% | 1.660000 | 4.329623e+05 | 1.110202e+05 | 1.502069e+05 | 6.243420e+03 | 1.107834e+05 | 8.333767e+04 | 2.202925e+04 | 132.500000 | 2017.000000 |
max | 3.250000 | 6.250565e+07 | 2.274362e+07 | 2.047057e+07 | 2.546439e+06 | 1.937313e+07 | 1.338459e+07 | 5.719097e+06 | 551693.650000 | 2018.000000 |
[] vs [[]]¶
Using a column name as the key returns the column values as a Series
# returns a Series with dataframe values for column 'my_col'
df['my_col']
# attribute access gives the same result but is not recommended: it fails when the name contains a space or other disallowed characters
df.my_col
Using a Python list of column names as the key returns a sub-DataFrame with those columns
# returns a DataFrame with the two columns
df[['my_col_A', 'my_col_B']]
# returns a one-column DataFrame with my_col_A values (note the double brackets)
df[['my_col_A']]
# this should be False because, as we just said, a column name inside single brackets returns a Series
type(df['AveragePrice']) == pd.DataFrame
False
type(df['AveragePrice']) == pd.Series
True
# this should be True because we say that a list of column names inside brackets returns a sub dataframe
type(df[['AveragePrice']]) == pd.DataFrame
True
Accessing column Series
df['AveragePrice'].head()
0    1.33
1    1.35
2    0.93
3    1.08
4    1.28
Name: AveragePrice, dtype: float64
Accessing a sub-DataFrame of one column
df[['AveragePrice']].head()
AveragePrice | |
---|---|
0 | 1.33 |
1 | 1.35 |
2 | 0.93 |
3 | 1.08 |
4 | 1.28 |
Let's try to visualize the difference once more using values, which returns the data as a NumPy array.
df['AveragePrice'].values
array([1.33, 1.35, 0.93, ..., 1.87, 1.93, 1.62])
df[['AveragePrice']].values
array([[1.33], [1.35], [0.93], ..., [1.87], [1.93], [1.62]])
This is because Series.values returns a one-dimensional array with the column values, while DataFrame.values returns a two-dimensional array that can be thought of as an array of rows.
df['AveragePrice'].values.shape, df[['AveragePrice']].values.shape
((18249,), (18249, 1))
Exercise
In the cell below, fill in the blanks to display the first 10 rows of a sub-DataFrame with the columns Date and AveragePrice. Remember that DataFrame methods can be chained.
df[['Date','AveragePrice']].head(10)
Date | AveragePrice | |
---|---|---|
0 | 2015-12-27 | 1.33 |
1 | 2015-12-20 | 1.35 |
2 | 2015-12-13 | 0.93 |
3 | 2015-12-06 | 1.08 |
4 | 2015-11-29 | 1.28 |
5 | 2015-11-22 | 1.26 |
6 | 2015-11-15 | 0.99 |
7 | 2015-11-08 | 0.98 |
8 | 2015-11-01 | 1.02 |
9 | 2015-10-25 | 1.07 |
Filtering¶
An expression like the one shown below represents a condition that returns a boolean Series with one value for every element of df['Date']; its length therefore equals the number of rows in the DataFrame df.
df['Date'] == '2015-10-25'
The boolean Series is True for rows where the condition holds and False otherwise. Such a boolean mask lets us filter a DataFrame based on the condition.
condition = df['Date'] == '2015-10-25'
condition
0     False
1     False
2     False
3     False
4     False
      ...
7     False
8     False
9     False
10    False
11    False
Name: Date, Length: 18249, dtype: bool
df[condition].head()
Date | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | type | year | region | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9 | 2015-10-25 | 1.07 | 74338.76 | 842.40 | 64757.44 | 113.00 | 8625.92 | 8061.47 | 564.45 | 0.0 | conventional | 2015 | Albany |
9 | 2015-10-25 | 1.09 | 358478.08 | 236814.29 | 64607.97 | 304.36 | 56751.46 | 31826.88 | 24924.58 | 0.0 | conventional | 2015 | Atlanta |
9 | 2015-10-25 | 1.19 | 656892.03 | 53766.25 | 397911.35 | 49085.74 | 156128.69 | 149987.55 | 6141.14 | 0.0 | conventional | 2015 | BaltimoreWashington |
9 | 2015-10-25 | 1.11 | 59874.45 | 29521.58 | 10089.82 | 6551.57 | 13711.48 | 13660.98 | 0.00 | 50.5 | conventional | 2015 | Boise |
9 | 2015-10-25 | 1.02 | 534249.47 | 4005.39 | 430725.78 | 191.31 | 99326.99 | 94581.94 | 4745.05 | 0.0 | conventional | 2015 | Boston |
It's common to see this kind of expression written inline:
df[df['Date'] == '2015-10-25']
Logical expressions¶
Example of conditions
condition = df[col] > value
condition = df[col] <= value
condition = df[col] == value
condition = df[col] != value
# in list
condition = df[col].isin([value1, value2])
# not in list
condition = ~df[col].isin([value1, value2])
# between (inclusive)
condition = df[col].between(value1, value2)
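As an illustrative sketch with the avocado columns (before they are renamed later on), the same patterns look like this:
# rows for either of two regions
df[df['region'].isin(['Albany', 'Boston'])].head()

# rows whose average price falls in a closed interval
df[df['AveragePrice'].between(1.00, 1.25)].head()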
Then we can combine different conditions with logical operators like "&" or "|".
df.loc[cond1 & cond2]
df.loc[cond1 | cond2]
The expressions above can also be executed without the loc operator:
df[cond1 & cond2]
df[cond1 | cond2]
TIP: many problems can be avoided by wrapping each simple condition in parentheses when combining two or more conditions.
df[(df['Date'] == '2015-10-25') & (df['AveragePrice'] < .90)].head()
Date | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | type | year | region | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9 | 2015-10-25 | 0.86 | 1010394.81 | 557469.46 | 301143.50 | 49959.10 | 101822.75 | 96417.63 | 5279.41 | 125.71 | conventional | 2015 | DallasFtWorth |
9 | 2015-10-25 | 0.88 | 933623.58 | 437329.85 | 313129.29 | 81274.85 | 101889.59 | 57577.21 | 44260.60 | 51.78 | conventional | 2015 | Houston |
9 | 2015-10-25 | 0.83 | 761261.71 | 435986.90 | 240689.98 | 19968.66 | 64616.17 | 64585.35 | 30.82 | 0.00 | conventional | 2015 | PhoenixTucson |
9 | 2015-10-25 | 0.86 | 4912068.04 | 2542914.87 | 1537781.45 | 247539.31 | 583832.41 | 475267.20 | 108231.39 | 333.82 | conventional | 2015 | SouthCentral |
9 | 2015-10-25 | 0.82 | 635873.60 | 363487.08 | 166607.85 | 31960.04 | 73818.63 | 72717.86 | 1100.77 | 0.00 | conventional | 2015 | WestTexNewMexico |
# be careful: without parentheses this expression fails because & binds more tightly than the comparisons
df[df['Date'] == '2015-10-25' & df['AveragePrice'] < .90]
.loc[] vs .iloc[]¶
Accessing rows
.loc[]¶
This operator allows us to access information by index label, but by definition it can also be used with a boolean array, as we saw with conditions:
- Access a group of rows and columns by label(s) or a boolean array.
- .loc[] is primarily label based, but may also be used with a boolean array.
From df.info() we saw that the index is made of integer labels, and these labels repeat across regions and years, so a label does not identify a single row. We can use .loc to select every row whose index label is 9.
df.loc[9]
Date | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | type | year | region | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9 | 2015-10-25 | 1.07 | 74338.76 | 842.40 | 64757.44 | 113.00 | 8625.92 | 8061.47 | 564.45 | 0.00 | conventional | 2015 | Albany |
9 | 2015-10-25 | 1.09 | 358478.08 | 236814.29 | 64607.97 | 304.36 | 56751.46 | 31826.88 | 24924.58 | 0.00 | conventional | 2015 | Atlanta |
9 | 2015-10-25 | 1.19 | 656892.03 | 53766.25 | 397911.35 | 49085.74 | 156128.69 | 149987.55 | 6141.14 | 0.00 | conventional | 2015 | BaltimoreWashington |
9 | 2015-10-25 | 1.11 | 59874.45 | 29521.58 | 10089.82 | 6551.57 | 13711.48 | 13660.98 | 0.00 | 50.50 | conventional | 2015 | Boise |
9 | 2015-10-25 | 1.02 | 534249.47 | 4005.39 | 430725.78 | 191.31 | 99326.99 | 94581.94 | 4745.05 | 0.00 | conventional | 2015 | Boston |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9 | 2018-01-21 | 1.27 | 3159.80 | 92.12 | 73.17 | 0.00 | 2994.51 | 2117.69 | 876.82 | 0.00 | organic | 2018 | Syracuse |
9 | 2018-01-21 | 1.52 | 6871.05 | 76.66 | 407.09 | 0.00 | 6387.30 | 6375.55 | 11.75 | 0.00 | organic | 2018 | Tampa |
9 | 2018-01-21 | 1.63 | 1283987.65 | 108705.28 | 259172.13 | 1490.02 | 914409.26 | 710654.40 | 203526.59 | 228.27 | organic | 2018 | TotalUS |
9 | 2018-01-21 | 1.83 | 189317.99 | 27049.44 | 33561.32 | 439.47 | 128267.76 | 76091.99 | 51947.50 | 228.27 | organic | 2018 | West |
9 | 2018-01-21 | 1.87 | 13766.76 | 1191.92 | 2452.79 | 727.94 | 9394.11 | 9351.80 | 42.31 | 0.00 | organic | 2018 | WestTexNewMexico |
432 rows × 13 columns
.iloc[]¶
This operator allows us to access information by index position, the way we usually do in other programming languages like C.
- Purely integer-location based indexing for selection by position.
- .iloc[] is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array.
Using df.iloc[9] we access the 10th row of the DataFrame df. The returned value is a Series whose values are the row values (df.iloc[9].values) and whose index is the column names.
df.iloc[9]
Date            2015-10-25 00:00:00
AveragePrice                   1.07
Total Volume               74338.76
4046                          842.4
4225                       64757.44
4770                          113.0
Total Bags                  8625.92
Small Bags                  8061.47
Large Bags                   564.45
XLarge Bags                     0.0
type                   conventional
year                           2015
region                       Albany
Name: 9, dtype: object
type(df.iloc[9])
pandas.core.series.Series
The name of the Series is the index label of that row in the original DataFrame.
TIP: practice to really learn how and when to use .loc vs .iloc
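A tiny sketch (with a made-up Series) that contrasts the two accessors:
# label-based vs position-based access on a Series with a non-default index
s = pd.Series(['A', 'B', 'C'], index=[10, 20, 30])

s.loc[10]   # 'A' -- selects by the index *label* 10
s.iloc[0]   # 'A' -- selects by the integer *position* 0
# s.loc[0] would raise a KeyError, because 0 is not a label of this index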
Mathematical and other methods on a DataFrame¶
Pandas Series and DataFrames offer access to hundreds of methods such as sum(), mul(), mean(), std(), max(), min(), etc. By default these methods operate over columns, but they can also operate over rows (see the sketch a few cells below).
Look at the next cell results and try to think about what happened (take a look at fields like type or region).
df.sum()
/tmp/ipykernel_32/1703867807.py:1: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError. Select only valid columns before calling the reduction.
  df.sum()
AveragePrice                                              25657.7
Total Volume                                   15523402593.400002
4046                                                5347110739.26
4225                                                5386275717.93
4770                                                 416802342.13
Total Bags                                      4373175798.389999
Small Bags                                          3324870837.51
Large Bags                                           991615770.55
XLarge Bags                                           56689177.33
type            conventionalconventionalconventionalconvention...
year                                                     36792683
region          AlbanyAlbanyAlbanyAlbanyAlbanyAlbanyAlbanyAlba...
dtype: object
df['AveragePrice'].mean()
1.4059784097758825
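As an illustrative sketch of the axis parameter, the cell below sums the bag columns down each column (the default) and then across each row (axis=1), which roughly reproduces Total Bags:
# column-wise (default, axis=0): one total per column
df[['Small Bags', 'Large Bags', 'XLarge Bags']].sum()

# row-wise (axis=1): one total per row
df[['Small Bags', 'Large Bags', 'XLarge Bags']].sum(axis=1).head()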
Missing Data¶
This is a critical problem for any Data Scientist and deserves its own Lecture. What to do when some of the data are missing?
Pandas offers some options to explore a dataframe looking for missing data.
# returns a boolean dataframe of the same size with True values for cells where values are NaN
df.isna()
# returns a boolean dataframe of the same size with True values for cells where values aren't NaN
df.notna()
# alias of the above methods
df.isnull()
df.notnull()
Count the number of NaN values for every column
df.isna().sum()
Date            0
AveragePrice    0
Total Volume    0
4046            0
4225            0
4770            0
Total Bags      0
Small Bags      0
Large Bags      0
XLarge Bags     0
type            0
year            0
region          0
dtype: int64
Count the number of NaN values per row
df.isna().sum(axis=1)
0     0
1     0
2     0
3     0
4     0
     ..
7     0
8     0
9     0
10    0
11    0
Length: 18249, dtype: int64
Count the total number of NaN values in the dataframe
df.isna().sum().sum()
0
Select the rows with at least one NaN value
df[df.isna().any(axis=1)]
Date | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | type | year | region |
---|---|---|---|---|---|---|---|---|---|---|---|---|
There are specific methods for dealing with this problem, such as:
fillna()
bfill()
ffill()
dropna()
It's important to learn how to handle missing data; a small illustrative sketch follows.
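A minimal sketch on a tiny made-up DataFrame (the avocado data has no NaNs, so the names and values below are purely illustrative):
import numpy as np

tmp = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': ['x', None, 'z']})

tmp.fillna({'a': tmp['a'].mean(), 'b': 'missing'})  # replace NaNs with chosen values
tmp.ffill()     # propagate the last valid observation forward to fill gaps
tmp.dropna()    # keep only the rows with no NaN at all
Since this dataset is complete, let's simplify it instead: the next cell drops the PLU and bag columns, which we won't need for the rest of the lecture.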
drop_columns = ['4046', '4225', '4770', 'Total Bags', 'Small Bags', 'Large Bags', 'XLarge Bags']
df = df.drop(columns=drop_columns)
df
Date | AveragePrice | Total Volume | type | year | region | |
---|---|---|---|---|---|---|
0 | 2015-12-27 | 1.33 | 64236.62 | conventional | 2015 | Albany |
1 | 2015-12-20 | 1.35 | 54876.98 | conventional | 2015 | Albany |
2 | 2015-12-13 | 0.93 | 118220.22 | conventional | 2015 | Albany |
3 | 2015-12-06 | 1.08 | 78992.15 | conventional | 2015 | Albany |
4 | 2015-11-29 | 1.28 | 51039.60 | conventional | 2015 | Albany |
... | ... | ... | ... | ... | ... | ... |
7 | 2018-02-04 | 1.63 | 17074.83 | organic | 2018 | WestTexNewMexico |
8 | 2018-01-28 | 1.71 | 13888.04 | organic | 2018 | WestTexNewMexico |
9 | 2018-01-21 | 1.87 | 13766.76 | organic | 2018 | WestTexNewMexico |
10 | 2018-01-14 | 1.93 | 16205.22 | organic | 2018 | WestTexNewMexico |
11 | 2018-01-07 | 1.62 | 17489.58 | organic | 2018 | WestTexNewMexico |
18249 rows × 6 columns
Sorting¶
sort_values and sort_index are common methods when using pandas.
sort_values: Sort by the values along either axis.
df.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort',
na_position='last', ignore_index=False, key=None)
The next cell filters the data for a particular Date and type, then sorts the values by region in ascending order and finally displays only the first 10 rows.
condition = (df['Date'] == '2018-01-07') & (df['type'] == 'organic')
df[condition].sort_values(by='region').head(10)
Date | AveragePrice | Total Volume | type | year | region | |
---|---|---|---|---|---|---|
11 | 2018-01-07 | 1.54 | 4816.90 | organic | 2018 | Albany |
11 | 2018-01-07 | 1.53 | 15714.11 | organic | 2018 | Atlanta |
11 | 2018-01-07 | 1.15 | 82282.71 | organic | 2018 | BaltimoreWashington |
11 | 2018-01-07 | 1.77 | 2553.90 | organic | 2018 | Boise |
11 | 2018-01-07 | 1.91 | 30096.00 | organic | 2018 | Boston |
11 | 2018-01-07 | 1.17 | 9115.92 | organic | 2018 | BuffaloRochester |
11 | 2018-01-07 | 1.95 | 156341.57 | organic | 2018 | California |
11 | 2018-01-07 | 1.08 | 28741.11 | organic | 2018 | Charlotte |
11 | 2018-01-07 | 1.83 | 41573.25 | organic | 2018 | Chicago |
11 | 2018-01-07 | 1.71 | 13141.82 | organic | 2018 | CincinnatiDayton |
Sorting can use multiple columns by passing Python lists to the relevant parameters.
df[condition].sort_values(by=['region', 'AveragePrice'], ascending=[True, False]).head()
Date | AveragePrice | Total Volume | type | year | region | |
---|---|---|---|---|---|---|
11 | 2018-01-07 | 1.54 | 4816.90 | organic | 2018 | Albany |
11 | 2018-01-07 | 1.53 | 15714.11 | organic | 2018 | Atlanta |
11 | 2018-01-07 | 1.15 | 82282.71 | organic | 2018 | BaltimoreWashington |
11 | 2018-01-07 | 1.77 | 2553.90 | organic | 2018 | Boise |
11 | 2018-01-07 | 1.91 | 30096.00 | organic | 2018 | Boston |
sort_index: Sort object by labels (along an axis)
df.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, ignore_index=False, key=None)
The next cell sorts rows based on the index values (in ascending order)
df.sort_index()
Date | AveragePrice | Total Volume | type | year | region | |
---|---|---|---|---|---|---|
0 | 2015-12-27 | 1.33 | 64236.62 | conventional | 2015 | Albany |
0 | 2016-12-25 | 1.85 | 8657.87 | organic | 2016 | PhoenixTucson |
0 | 2015-12-27 | 1.25 | 73109.90 | conventional | 2015 | Pittsburgh |
0 | 2016-12-25 | 1.90 | 11376.97 | organic | 2016 | Philadelphia |
0 | 2016-12-25 | 1.27 | 5601.65 | organic | 2016 | Orlando |
... | ... | ... | ... | ... | ... | ... |
52 | 2017-01-01 | 2.06 | 39260.55 | organic | 2017 | NewYork |
52 | 2017-01-01 | 1.11 | 476239.03 | conventional | 2017 | NorthernNewEngland |
52 | 2017-01-01 | 2.00 | 115256.09 | organic | 2017 | Northeast |
52 | 2017-01-01 | 0.93 | 547565.88 | conventional | 2017 | Atlanta |
52 | 2017-01-01 | 0.97 | 142347.90 | conventional | 2017 | Roanoke |
18249 rows × 6 columns
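The rename method lets us relabel columns; the next cell shortens AveragePrice and Total Volume to price and volume (inplace=True modifies df directly).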
df.rename(columns={'AveragePrice': 'price', 'Total Volume': 'volume'}, inplace=True)
df
Date | price | volume | type | year | region | |
---|---|---|---|---|---|---|
0 | 2015-12-27 | 1.33 | 64236.62 | conventional | 2015 | Albany |
1 | 2015-12-20 | 1.35 | 54876.98 | conventional | 2015 | Albany |
2 | 2015-12-13 | 0.93 | 118220.22 | conventional | 2015 | Albany |
3 | 2015-12-06 | 1.08 | 78992.15 | conventional | 2015 | Albany |
4 | 2015-11-29 | 1.28 | 51039.60 | conventional | 2015 | Albany |
... | ... | ... | ... | ... | ... | ... |
7 | 2018-02-04 | 1.63 | 17074.83 | organic | 2018 | WestTexNewMexico |
8 | 2018-01-28 | 1.71 | 13888.04 | organic | 2018 | WestTexNewMexico |
9 | 2018-01-21 | 1.87 | 13766.76 | organic | 2018 | WestTexNewMexico |
10 | 2018-01-14 | 1.93 | 16205.22 | organic | 2018 | WestTexNewMexico |
11 | 2018-01-07 | 1.62 | 17489.58 | organic | 2018 | WestTexNewMexico |
18249 rows × 6 columns
Counting¶
Counting the number of values without NaNs
We already saw different ways to get the number of rows. But what if you want to count only the non-NaN values? The count() method counts non-NA cells for each column or row.
df.count()
Date      18249
price     18249
volume    18249
type      18249
year      18249
region    18249
dtype: int64
Counting the number of unique values per column (Series) in the DataFrame
Number of unique values for one Series
df.region.nunique()
54
Number of unique values for every Series in the DataFrame
df.nunique()
Date        169
price       259
volume    18237
type          2
year          4
region       54
dtype: int64
Unique values
df['type'].unique()
array(['conventional', 'organic'], dtype=object)
Remember that we can easily convert a DataFrame or Series output into a Python list
df['type'].unique().tolist()
['conventional', 'organic']
Counting rows based on unique values
value_counts() returns a Series containing counts of unique rows in the DataFrame:
DataFrame.value_counts(subset=None, normalize=False, sort=True, ascending=False, dropna=True)
This method is simple but powerful for quick exploration. Let's look at some examples.
df.value_counts(subset='type')
type
conventional    9126
organic         9123
dtype: int64
df.value_counts(subset='year', sort=False)
year
2015    5615
2016    5616
2017    5722
2018    1296
dtype: int64
We can use a bigger subset for a more detailed view
df.value_counts(subset=['year', 'type'], sort=False)
year  type
2015  conventional    2808
      organic         2807
2016  conventional    2808
      organic         2808
2017  conventional    2862
      organic         2860
2018  conventional     648
      organic          648
dtype: int64
And we just need to add one more flag to get the same counts as normalized values (here multiplied by 100 to read as percentages).
df.value_counts(subset=['year', 'type'], sort=False, normalize=True)*100
year  type
2015  conventional    15.387145
      organic         15.381665
2016  conventional    15.387145
      organic         15.387145
2017  conventional    15.683051
      organic         15.672092
2018  conventional     3.550880
      organic          3.550880
dtype: float64
Grouping¶
Have you been looking for power? Meet groupby().
A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True,
group_keys=True, squeeze=<no_default>, observed=False, dropna=True)
The next cell mimics, in some way, the value_counts() behaviour:
- Split the dataset into sub-DataFrames, one for each unique year
- Count the number of non-NaN values for every column in each group
df.groupby('year').count()
Date | price | volume | type | region | |
---|---|---|---|---|---|
year | |||||
2015 | 5615 | 5615 | 5615 | 5615 | 5615 |
2016 | 5616 | 5616 | 5616 | 5616 | 5616 |
2017 | 5722 | 5722 | 5722 | 5722 | 5722 |
2018 | 1296 | 1296 | 1296 | 1296 | 1296 |
The above rows present the same values because the original dataset is free of NaN values. It's a great dataset: No NaNs and Avocados everywhere.
Functions¶
What about max()?
df.groupby('year').max()
Date | price | volume | type | region | |
---|---|---|---|---|---|
year | |||||
2015 | 2015-12-27 | 2.79 | 44655461.51 | organic | WestTexNewMexico |
2016 | 2016-12-25 | 3.25 | 52288697.89 | organic | WestTexNewMexico |
2017 | 2017-12-31 | 3.17 | 61034457.10 | organic | WestTexNewMexico |
2018 | 2018-03-25 | 2.30 | 62505646.52 | organic | WestTexNewMexico |
And mean()?
df.groupby('year').mean()
price | volume | |
---|---|---|
year | ||
2015 | 1.375590 | 7.810274e+05 |
2016 | 1.338640 | 8.584206e+05 |
2017 | 1.515128 | 8.623393e+05 |
2018 | 1.347531 | 1.066928e+06 |
Do you find something different between using max and mean? What are your thoughts?
df.groupby('year').describe()
price | volume | |||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | |
year | ||||||||||||||||
2015 | 5615.0 | 1.375590 | 0.375595 | 0.49 | 1.07 | 1.300 | 1.67 | 2.79 | 5615.0 | 7.810274e+05 | 3.171256e+06 | 84.56 | 6931.6300 | 76146.82 | 400176.6800 | 44655461.51 |
2016 | 5616.0 | 1.338640 | 0.393708 | 0.51 | 1.04 | 1.300 | 1.56 | 3.25 | 5616.0 | 8.584206e+05 | 3.478732e+06 | 385.55 | 10643.6850 | 109597.29 | 451107.2925 | 52288697.89 |
2017 | 5722.0 | 1.515128 | 0.432906 | 0.44 | 1.22 | 1.490 | 1.77 | 3.17 | 5722.0 | 8.623393e+05 | 3.481957e+06 | 515.01 | 13790.6975 | 122915.75 | 426454.5125 | 61034457.10 |
2018 | 1296.0 | 1.347531 | 0.305858 | 0.56 | 1.13 | 1.345 | 1.56 | 2.30 | 1296.0 | 1.066928e+06 | 4.285501e+06 | 2064.90 | 17690.9825 | 157175.09 | 529462.2450 | 62505646.52 |
There are other methods that can be chained to groupby(). For example, first and last return the first and the last row of each group, respectively.
df.groupby('year').first()
Date | price | volume | type | region | |
---|---|---|---|---|---|
year | |||||
2015 | 2015-12-27 | 1.33 | 64236.62 | conventional | Albany |
2016 | 2016-12-25 | 1.52 | 73341.73 | conventional | Albany |
2017 | 2017-12-31 | 1.47 | 113514.42 | conventional | Albany |
2018 | 2018-03-25 | 1.57 | 149396.50 | conventional | Albany |
The previous call is similar to using head, but head() keeps the original index whereas first() sets a new index: the year.
df.groupby('year').head(1)
Date | price | volume | type | year | region | |
---|---|---|---|---|---|---|
0 | 2015-12-27 | 1.33 | 64236.62 | conventional | 2015 | Albany |
0 | 2016-12-25 | 1.52 | 73341.73 | conventional | 2016 | Albany |
0 | 2017-12-31 | 1.47 | 113514.42 | conventional | 2017 | Albany |
0 | 2018-03-25 | 1.57 | 149396.50 | conventional | 2018 | Albany |
Aggregate¶
aggregate: Aggregate using one or more operations over the specified axis (agg is an alias).
The next cell shows the aggregated avocado price and volume values for the year 2018.
condition = (df['year'] == 2018)
df[condition][['price','volume']].agg(['min', 'mean', 'std', 'max'])
price | volume | |
---|---|---|
min | 0.560000 | 2.064900e+03 |
mean | 1.347531 | 1.066928e+06 |
std | 0.305858 | 4.285501e+06 |
max | 2.300000 | 6.250565e+07 |
aggregate can be applied directly to DataFrames, and it can also be chained with groupby().
df.groupby('year')[['price','volume']].agg(['min', 'mean', 'std', 'max'])
price | volume | |||||||
---|---|---|---|---|---|---|---|---|
min | mean | std | max | min | mean | std | max | |
year | ||||||||
2015 | 0.49 | 1.375590 | 0.375595 | 2.79 | 84.56 | 7.810274e+05 | 3.171256e+06 | 44655461.51 |
2016 | 0.51 | 1.338640 | 0.393708 | 3.25 | 385.55 | 8.584206e+05 | 3.478732e+06 | 52288697.89 |
2017 | 0.44 | 1.515128 | 0.432906 | 3.17 | 515.01 | 8.623393e+05 | 3.481957e+06 | 61034457.10 |
2018 | 0.56 | 1.347531 | 0.305858 | 2.30 | 2064.90 | 1.066928e+06 | 4.285501e+06 | 62505646.52 |
Suppose you need to extract percentiles. The quantile() method can be applied directly to a DataFrame to extract them.
df[condition][['price', 'volume']].quantile(.10)
price        0.970
volume    8174.655
Name: 0.1, dtype: float64
But there are cases where you need to extract more than that. For those cases it can be convenient to define custom functions to use with aggregation.
def percentil_10(x): return x.quantile(.10)
def percentil_90(x): return x.quantile(.90)
df[condition][['price','volume']].agg([percentil_10, 'median', percentil_90])
price | volume | |
---|---|---|
percentil_10 | 0.970 | 8174.655 |
median | 1.345 | 157175.090 |
percentil_90 | 1.750 | 1810981.615 |
df.groupby('year')[['price','volume']].agg([percentil_10, 'median', percentil_90])
price | volume | |||||
---|---|---|---|---|---|---|
percentil_10 | median | percentil_90 | percentil_10 | median | percentil_90 | |
year | ||||||
2015 | 0.96 | 1.300 | 1.90 | 2431.434 | 76146.82 | 1285267.958 |
2016 | 0.88 | 1.300 | 1.86 | 4146.935 | 109597.29 | 1351865.735 |
2017 | 0.98 | 1.490 | 2.07 | 5889.687 | 122915.75 | 1398304.817 |
2018 | 0.97 | 1.345 | 1.75 | 8174.655 | 157175.09 | 1810981.615 |
Summary¶
In this lecture you've learnt:
- How to install pandas
- About pandas
- What are Series and DataFrame data structures
- How to create a simple DataFrame
- How to load a DataFrame with an external data source
- How to access column Series
- How to access row Series (index)
- The differences between loc[] and iloc[]
- Different ways to start exploring the data general structure
- Different ways to access descriptive statistics
- How to look for missing data
- How to do data filtering using conditions
- How to do sorting
- How to do counting
- How to group information
- How to do aggregation
Facts:¶
- We've only imported pandas!
- Almost everything was about accessing and processing the data, and not creating it.
Topics left out maybe for other lectures:¶
- DataFrame operations:
- How to add a column to a DataFrame
- Operations between DataFrame columns
- apply()
- applymap()
- pipe()
- Merging DataFrames (merge)
- Concatenating DataFrames (concat)
- Appending DataFrames, Series or a simple row (append)
- Using loops with:
- iterrows()
- itertuples()
- groupby()
Next lecture¶
- Plotting with PANDAS (showing the importance of plots)
- EDA with PANDAS (using Seaborn if possible)