Key Word(s): pandas
CS109A Introduction to PANDAS
Lecture 2, Exercise 2: PANDAS Intro 2¶
Harvard University
Fall 2021
Instructors: Pavlos Protopapas and Natesh Pillai
Exercise 2: PANDAS Intro 2
Let's get some more practice with a few of the core PANDAS
functions.
import pandas as pd
We'll continue working with the StudentsPerformance dataset from Exercise 1.
# import the CSV file
df = pd.read_csv("StudentsPerformance.csv")
df.head()
gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | |
---|---|---|---|---|---|---|---|---|
0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 |
1 | female | group C | some college | standard | completed | 69 | 90 | 88 |
2 | female | group B | master's degree | standard | none | 90 | 95 | 93 |
3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 |
4 | male | group C | some college | standard | none | 76 | 78 | 75 |
Indexing - iloc and loc¶
It's very important to understand the differences between loc and iloc. Looking at the code in the next cell, one might think they do the same thing.
(When you query just one row you obtain an object whose name is the index of the selected row.)
df.iloc[10] == df.loc[10]
gender True race/ethnicity True parental level of education True lunch True test preparation course True math score True reading score True writing score True Name: 10, dtype: bool
all(df.iloc[10] == df.loc[10])
True
The first time we loaded the CSV into a DataFrame we didn't tell pandas to interpret a specific column as an index, so pandas created an index for us. Whether the ordering imposed on our data by this index should be respected is a matter on which iloc and loc disagree.
To really learn the difference between iloc and loc we need to shuffle the rows.
To do that we can use the sample method to take a new sample of the DataFrame's original size (frac=1 means the sample contains 100% of the rows).
_Bonus: Stop and consider: what is the purpose of setting a random_state when we call the sample method?_
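As a quick aside on that bonus question (a toy sketch on a hypothetical frame, separate from the notebook's df): passing the same random_state makes the shuffle reproducible, so you, your graders, and your collaborators all see the same row order.

```python
import pandas as pd

toy = pd.DataFrame({"x": range(5)})

# The same seed yields the same shuffle order on every run
s1 = toy.sample(frac=1, random_state=109)
s2 = toy.sample(frac=1, random_state=109)
print(s1.index.tolist() == s2.index.tolist())  # True
```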
df = df.sample(frac=1, random_state=109)
df
gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | |
---|---|---|---|---|---|---|---|---|
301 | male | group D | some high school | free/reduced | none | 56 | 54 | 52 |
895 | female | group E | some high school | free/reduced | none | 32 | 34 | 38 |
763 | female | group B | high school | standard | none | 62 | 62 | 63 |
854 | male | group C | some high school | standard | none | 62 | 64 | 55 |
49 | male | group C | high school | standard | completed | 82 | 84 | 82 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
399 | male | group D | some high school | standard | none | 60 | 59 | 54 |
141 | female | group C | some college | free/reduced | none | 59 | 62 | 64 |
757 | male | group E | bachelor's degree | free/reduced | completed | 70 | 68 | 72 |
245 | male | group C | associate's degree | standard | none | 85 | 76 | 71 |
262 | female | group C | some high school | free/reduced | none | 44 | 50 | 51 |
1000 rows × 8 columns
Now let's repeat our code from earlier.
df.iloc[10] == df.loc[10]
gender False race/ethnicity False parental level of education False lunch True test preparation course False math score False reading score False writing score False dtype: bool
all(df.iloc[10] == df.loc[10])
False
It turns out that loc filters by index value (something like where df.index == value).
That is, loc's results depend only on the indices (which are now scrambled after sampling). The actual positions of elements in the DataFrame are ignored.
df.loc[10]
gender male race/ethnicity group C parental level of education associate's degree lunch standard test preparation course none math score 58 reading score 54 writing score 52 Name: 10, dtype: object
By contrast, iloc filters by row position (something like where df.index == df.index[value]).
So iloc's results depend on the actual positions of elements in a pandas data structure. The indices of these elements are ignored.
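The distinction shows up clearly on a tiny toy frame (an illustrative sketch, not the notebook's df) whose index labels are deliberately out of order, mimicking a shuffled DataFrame:

```python
import pandas as pd

# Index labels deliberately out of positional order
toy = pd.DataFrame({"word": ["a", "b", "c"]}, index=[2, 0, 1])

print(toy.loc[0, "word"])    # 'b' -- matches the index *label* 0
print(toy.iloc[0]["word"])   # 'a' -- matches the *position* 0
```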
It's this difference that explains counterintuitive results like this:
df.index[10]
342
Consider selecting a single row with iloc:
df.iloc[10]
gender female race/ethnicity group B parental level of education high school lunch standard test preparation course completed math score 69 reading score 76 writing score 74 Name: 342, dtype: object
And take note of where you can find the index in output formatted like this.
Enough chat. Time for...
Exercise
In the cell below, fill in the blank so that the row5 variable stores the 5th row of df. To be clear, imagine our DataFrame looked as follows:
Index | Words |
---|---|
0 | this |
1 | is |
2 | not |
3 | easy |
We'd say the 1st row is the one with the word this, the 2nd row is the one with the word is, the 3rd row is the one with the word not, etc.
### edTest(test_a) ###
row5 = df.iloc[4]
row5
gender male race/ethnicity group C parental level of education high school lunch standard test preparation course completed math score 82 reading score 84 writing score 82 Name: 49, dtype: object
You can display the first rows to have a better understanding of what you did. Can you find the row you've just selected?
df.iloc[:5]
gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | |
---|---|---|---|---|---|---|---|---|
301 | male | group D | some high school | free/reduced | none | 56 | 54 | 52 |
895 | female | group E | some high school | free/reduced | none | 32 | 34 | 38 |
763 | female | group B | high school | standard | none | 62 | 62 | 63 |
854 | male | group C | some high school | standard | none | 62 | 64 | 55 |
49 | male | group C | high school | standard | completed | 82 | 84 | 82 |
Notice how we can use familiar Python slice notation with iloc and loc!
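One subtlety worth noting when slicing (shown on a toy frame, not the notebook's df): iloc slices by position and excludes the stop, just like plain Python slices, while loc slices by label and includes the stop.

```python
import pandas as pd

toy = pd.DataFrame({"x": [10, 20, 30, 40]})  # default integer index 0..3

print(len(toy.iloc[0:2]))  # 2 -- positions 0 and 1, stop excluded
print(len(toy.loc[0:2]))   # 3 -- labels 0, 1 and 2, stop included
```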
Sorting¶
We scrambled our df earlier with sample. We should also know how to tidy things up.
Exercise
In the cell below, fill in the blank so that sorted_df stores df after sorting it by the math score column in decreasing order (HINT).
### edTest(test_b) ###
sorted_df = df.sort_values(by='math score', ascending=False)
sorted_df
gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | |
---|---|---|---|---|---|---|---|---|
623 | male | group A | some college | standard | completed | 100 | 96 | 86 |
625 | male | group D | some college | standard | completed | 100 | 97 | 99 |
149 | male | group E | associate's degree | free/reduced | completed | 100 | 100 | 93 |
458 | female | group E | bachelor's degree | standard | none | 100 | 100 | 100 |
962 | female | group E | associate's degree | standard | none | 100 | 100 | 100 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
145 | female | group C | some college | free/reduced | none | 22 | 39 | 33 |
787 | female | group B | some college | standard | none | 19 | 38 | 32 |
17 | female | group B | some high school | free/reduced | none | 18 | 32 | 28 |
980 | female | group B | high school | free/reduced | none | 8 | 24 | 23 |
59 | female | group C | some high school | free/reduced | none | 0 | 17 | 10 |
1000 rows × 8 columns
Exercise
In the cell below, fill in the blank so that sorted_row5 stores the 5th row of sorted_df. To be clear, imagine our sorted DataFrame looked as follows:
Index | Words |
---|---|
3 | easy |
1 | is |
2 | not |
0 | this |
We'd say the 1st row is the one with the word easy, the 2nd row is the one with the word is, the 3rd row is the one with the word not, etc.
### edTest(test_c) ###
sorted_row5 = sorted_df.iloc[4]
sorted_row5
gender female race/ethnicity group E parental level of education associate's degree lunch standard test preparation course none math score 100 reading score 100 writing score 100 Name: 962, dtype: object
Can you find the row you've just selected?
# len('head()') < len('iloc[:5]') :)
sorted_df.head()
gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | |
---|---|---|---|---|---|---|---|---|
623 | male | group A | some college | standard | completed | 100 | 96 | 86 |
625 | male | group D | some college | standard | completed | 100 | 97 | 99 |
149 | male | group E | associate's degree | free/reduced | completed | 100 | 100 | 93 |
458 | female | group E | bachelor's degree | standard | none | 100 | 100 | 100 |
962 | female | group E | associate's degree | standard | none | 100 | 100 | 100 |
Column Naming Conventions¶
How you've named your df columns can affect the amount of typing required to manipulate your data, the readability of your code, and even the syntax options you have. Let's take a look at some best practices for naming columns.
df
gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | |
---|---|---|---|---|---|---|---|---|
301 | male | group D | some high school | free/reduced | none | 56 | 54 | 52 |
895 | female | group E | some high school | free/reduced | none | 32 | 34 | 38 |
763 | female | group B | high school | standard | none | 62 | 62 | 63 |
854 | male | group C | some high school | standard | none | 62 | 64 | 55 |
49 | male | group C | high school | standard | completed | 82 | 84 | 82 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
399 | male | group D | some high school | standard | none | 60 | 59 | 54 |
141 | female | group C | some college | free/reduced | none | 59 | 62 | 64 |
757 | male | group E | bachelor's degree | free/reduced | completed | 70 | 68 | 72 |
245 | male | group C | associate's degree | standard | none | 85 | 76 | 71 |
262 | female | group C | some high school | free/reduced | none | 44 | 50 | 51 |
1000 rows × 8 columns
We can see that some column names have lengths that are not comfortable for coding. What we can do to make our life easier:
- Work with short column names
- Avoid characters like spaces, so that we can access columns without brackets
- Work in only lower or upper case (preferably lower case; there's no need to shout)
Our df already conforms to this last suggestion.
We'll find that, after some renaming, things become easier for us. Expressions like this one:
condition = (df['test preparation course'] != 'completed') & (df['writing score'] > df['writing score'].median())
become:
condition = (df.course != 'completed') & (df.writing > df.writing.median())
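One caveat about attribute-style access (a sketch on a toy frame, not the notebook's df): it only works when the column name is a valid Python identifier that doesn't shadow an existing DataFrame attribute. Spaces force brackets, and so do names like count that collide with DataFrame methods.

```python
import pandas as pd

toy = pd.DataFrame({"math score": [70], "count": [1]})

# A space forces bracket access; `toy.math score` would be a SyntaxError
print(toy["math score"].iloc[0])  # 70

# A column named like an existing attribute still needs brackets:
# toy.count resolves to the DataFrame.count *method*, not the column
print(callable(toy.count))        # True
print(toy["count"].iloc[0])       # 1
```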
Exercise
In the cell below fill in the blank with these goals in mind:
- rename column race/ethnicity to race
- rename column parental level of education to peduc
- rename column test preparation course to course
- remove score (including the leading space) from the rest of the columns
HINT1: Don't be shy; check the documentation if you need some help.
HINT2: In many cases, it's faster to access docstrings using help:
help(df.rename)
HINT3: TAB autocomplete can be used to explore an object's available methods and attributes.
HINT4: Still more exciting, place your cursor after the opening parenthesis of a function call and press SHIFT+TAB once or twice. Instantly, you're presented with a docstring. It's a whole new world! 🌈
df.columns
Index(['gender', 'race/ethnicity', 'parental level of education', 'lunch', 'test preparation course', 'math score', 'reading score', 'writing score'], dtype='object')
### edTest(test_d) ###
df = df.rename(columns={'race/ethnicity': 'race',
                        'parental level of education': 'peduc',
                        'test preparation course': 'course',
                        'math score': 'math',
                        'reading score': 'reading',
                        'writing score': 'writing'})
df
gender | race | peduc | lunch | course | math | reading | writing | |
---|---|---|---|---|---|---|---|---|
301 | male | group D | some high school | free/reduced | none | 56 | 54 | 52 |
895 | female | group E | some high school | free/reduced | none | 32 | 34 | 38 |
763 | female | group B | high school | standard | none | 62 | 62 | 63 |
854 | male | group C | some high school | standard | none | 62 | 64 | 55 |
49 | male | group C | high school | standard | completed | 82 | 84 | 82 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
399 | male | group D | some high school | standard | none | 60 | 59 | 54 |
141 | female | group C | some college | free/reduced | none | 59 | 62 | 64 |
757 | male | group E | bachelor's degree | free/reduced | completed | 70 | 68 | 72 |
245 | male | group C | associate's degree | standard | none | 85 | 76 | 71 |
262 | female | group C | some high school | free/reduced | none | 44 | 50 | 51 |
1000 rows × 8 columns
Look for missing values¶
Missing values in our DataFrames can cause many issues. They can cause certain operations and function calls to fail and throw an error. Perhaps worse, these problems can happen 'silently,' affecting our results without us realizing there is a problem. Unless we take precautions, of course!
The first step is locating missingness. This dataset doesn't have any missing values, so we'll make some ourselves and 'poke holes' in the DataFrame.
# 'poking holes' in the data
df.iloc[0,5] = None
df.iloc[2,2] = None
Exercise
Fill in the blank to display whether or not entries in the first 5 rows are missing (HINT)
Solution
df.isna().head()
gender | race | peduc | lunch | course | math | reading | writing | |
---|---|---|---|---|---|---|---|---|
301 | False | False | False | False | False | True | False | False |
895 | False | False | False | False | False | False | False | False |
763 | False | False | True | False | False | False | False | False |
854 | False | False | False | False | False | False | False | False |
49 | False | False | False | False | False | False | False | False |
Exercise
Fill in the blanks to sum the total number of missing values in each of the dataset's columns.
### edTest(test_e) ###
resultE = df.isna().sum()
display(resultE)
gender 0 race 0 peduc 1 lunch 0 course 0 math 1 reading 0 writing 0 dtype: int64
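Two related summaries often come in handy alongside the per-column counts (a sketch on a toy frame, not the notebook's df): the total number of missing cells, and the number of rows containing at least one missing value.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1, np.nan, 3], "b": [np.nan, np.nan, 6]})

print(int(toy.isna().sum().sum()))        # 3 -- total missing cells
print(int(toy.isna().any(axis=1).sum()))  # 2 -- rows with any missing value
```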
Now let's deal with these 'holes' we've just made
Exercise
Fill the missing math
entry with that column's mean.
Hint: Select subsets of the data frame using a column name or names
Note: The blanks here represent just one way of doing this. Don't feel constrained by them.
Solution
df['math'] = df['math'].fillna(df['math'].mean())
df.head()
gender | race | peduc | lunch | course | math | reading | writing | |
---|---|---|---|---|---|---|---|---|
301 | male | group D | some high school | free/reduced | none | 66.099099 | 54 | 52 |
895 | female | group E | some high school | free/reduced | none | 32.000000 | 34 | 38 |
763 | female | group B | None | standard | none | 62.000000 | 62 | 63 |
854 | male | group C | some high school | standard | none | 62.000000 | 64 | 55 |
49 | male | group C | high school | standard | completed | 82.000000 | 84 | 82 |
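A variant worth knowing (a sketch on a hypothetical toy frame, not a required part of the exercise): instead of the global mean, fill each missing value with the mean of the row's own group, using transform to broadcast the group means back to the original rows.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"group": ["x", "x", "y", "y"],
                    "score": [10.0, np.nan, 30.0, 50.0]})

# transform("mean") returns a Series aligned with the original rows,
# holding each row's group mean
toy["score"] = toy["score"].fillna(
    toy.groupby("group")["score"].transform("mean"))
print(toy["score"].tolist())  # [10.0, 10.0, 30.0, 50.0]
```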
Exercise
Drop the row with the missing peduc
from the DataFrame.
Hint: to make it easier, consider that this is now the only remaining missing value
Solution
df = df.dropna()
df.head()
gender | race | peduc | lunch | course | math | reading | writing | |
---|---|---|---|---|---|---|---|---|
301 | male | group D | some high school | free/reduced | none | 66.099099 | 54 | 52 |
895 | female | group E | some high school | free/reduced | none | 32.000000 | 34 | 38 |
854 | male | group C | some high school | standard | none | 62.000000 | 64 | 55 |
49 | male | group C | high school | standard | completed | 82.000000 | 84 | 82 |
790 | female | group B | high school | standard | none | 48.000000 | 62 | 60 |
Categorical Columns - nunique() and unique()¶
If some of your data is categorical (i.e., taking on a discrete set of values) you'll often want to know exactly how many unique values you're dealing with. nunique and unique are here to help and can be used on a single column or across multiple columns.
Please consider: unique will return all unique values. What if you ask for the uniques of a column with 1 million different values? 🤔
This particular method should be used wisely.
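One safer pattern when a column might have many distinct values (a sketch, assuming you only need the most frequent ones): value_counts() sorts by frequency, so chaining .head() gives a bounded peek regardless of how many uniques exist.

```python
import pandas as pd

s = pd.Series(["a", "b", "a", "c", "a", "b"])

print(s.nunique())                              # 3
print(s.value_counts().head(2).index.tolist())  # ['a', 'b'] -- two most frequent
```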
df.nunique()
gender 2 race 5 peduc 6 lunch 2 course 2 math 82 reading 72 writing 77 dtype: int64
df['gender'].unique().tolist()
['male', 'female']
# the line below represents the usage that should be avoided
# df['math'].unique()
Exercise
Fill in the blanks using unique()
and nunique()
to complete the method print_uniques
to help us to learn more about the categorical variables in our data.
### edTest(test_f) ###
def print_uniques(df, col, limit=10):
    """Print a column's unique values when the number of uniques is at most limit."""
    n = df[col].nunique()
    if n <= limit:
        print(f'{col}:', df[col].unique().tolist())
    else:
        print(f'{col}:', f'more than {limit} uniques')
for col in df.columns:
print_uniques(df, col)
gender: ['male', 'female'] race: ['group D', 'group E', 'group C', 'group B', 'group A'] peduc: ['some high school', 'high school', "bachelor's degree", 'some college', "associate's degree", "master's degree"] lunch: ['free/reduced', 'standard'] course: ['none', 'completed'] math: more than 10 uniques reading: more than 10 uniques writing: more than 10 uniques
Descriptive statistics¶
In Pandas, DataFrames have a simple method for displaying summary statistics.
df.describe()
math | reading | writing | |
---|---|---|---|
count | 999.000000 | 999.000000 | 999.000000 |
mean | 66.103202 | 69.176176 | 68.059059 |
std | 15.166754 | 14.605740 | 15.202426 |
min | 0.000000 | 17.000000 | 10.000000 |
25% | 57.000000 | 59.000000 | 57.500000 |
50% | 66.000000 | 70.000000 | 69.000000 |
75% | 77.000000 | 79.000000 | 79.000000 |
max | 100.000000 | 100.000000 | 100.000000 |
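Note that describe() defaults to numeric columns, which is why the string columns are absent above. On a toy frame (an illustrative sketch, not the notebook's df), passing include="object" summarizes the categorical columns instead:

```python
import pandas as pd

toy = pd.DataFrame({"grade": ["a", "b", "a"], "score": [1, 2, 3]})

# Numeric-only by default
print(toy.describe().columns.tolist())                     # ['score']

# include="object" reports count, unique, top (most frequent), and freq
print(toy.describe(include="object").loc["top", "grade"])  # 'a'
```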
Exercise
Sometimes we don't want to access all these statistics. In the cell below fill in the blanks to get the mean and the standard deviation of the writing
and reading
columns (HINT).
### edTest(test_g) ###
resultG = df[['writing', 'reading']].aggregate(['mean', 'std'])
display(resultG)
writing | reading | |
---|---|---|
mean | 68.059059 | 69.176176 |
std | 15.202426 | 14.605740 |
Now we can group our DataFrame by the values of a specific column (or columns) and aggregate the other columns.
Hint: Try using agg() as an alternative to aggregate() to spare some typing.
Exercise
Group the dataframe by peduc and gender while aggregating math, reading, and writing with the mean and course with the mode.
Tip: Again, don't feel constrained by the blanks. This command may span multiple lines.
df.groupby(['peduc', 'gender']).agg({'math': 'mean',
'reading': 'mean',
'writing': 'mean'})
math | reading | writing | ||
---|---|---|---|---|
peduc | gender | |||
associate's degree | female | 65.250000 | 74.120690 | 74.000000 |
male | 70.764151 | 67.433962 | 65.405660 | |
bachelor's degree | female | 68.349206 | 77.285714 | 78.380952 |
male | 70.581818 | 68.090909 | 67.654545 | |
high school | female | 59.322581 | 68.268817 | 66.731183 |
male | 64.705882 | 61.480392 | 58.539216 | |
master's degree | female | 66.500000 | 76.805556 | 77.638889 |
male | 74.826087 | 73.130435 | 72.608696 | |
some college | female | 65.406780 | 73.550847 | 74.050847 |
male | 69.009259 | 64.990741 | 63.148148 | |
some high school | female | 59.296703 | 69.109890 | 68.285714 |
male | 67.955672 | 64.693182 | 61.375000 |
We would likely want to treat peduc as an 'ordinal' rather than a categorical variable, as it does have an inherent ordering. Feel free to try using indexing to sort the rows. But after the grouping we now have multiple indices for each row!
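If that MultiIndex gets in the way, one common fix (a sketch on a toy frame, not the notebook's df) is reset_index, which turns the group keys back into ordinary columns:

```python
import pandas as pd

toy = pd.DataFrame({"g": ["x", "x", "y"], "s": ["m", "f", "m"], "v": [1, 2, 3]})
grouped = toy.groupby(["g", "s"]).agg({"v": "mean"})

print(type(grouped.index).__name__)  # 'MultiIndex'

flat = grouped.reset_index()         # group keys become columns again
print(flat.columns.tolist())         # ['g', 's', 'v']
```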
There's still more Pandas to discover 🐼