Key Word(s): pandas
CS109A Introduction to PANDAS
Lecture 2, Exercise 2: PANDAS Intro 2¶
Harvard University
Fall 2021
Instructors: Pavlos Protopapas and Natesh Pillai
Exercise 2: PANDAS Intro 2
Let's get some more practice with a few of the core PANDAS
functions.
import pandas as pd
We'll continue working with StudentsPerformance dataset from Exercise 1.
# import the CSV file
df = pd.read_csv("StudentsPerformance.csv")
df.head()
Indexing - iloc and loc¶
It's very important to understand the differences between loc and iloc. Looking at the code in the next cell, one could think that they do the same thing.
(When you query just one row, you obtain an object whose name is the index of the selected row.)
df.iloc[10] == df.loc[10]
all(df.iloc[10] == df.loc[10])
The first time we loaded the CSV into a DataFrame we didn't tell pandas to interpret a specific column as an index, so pandas created an index for us. Whether the ordering imposed on our data by this index should be respected is a matter on which iloc and loc disagree.
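A sketch of that disagreement on a tiny toy frame (the words are made up for illustration):

```python
import pandas as pd

# A toy frame whose rows we scramble, as we're about to do with df.
toy = pd.DataFrame({"word": ["this", "is", "not", "easy"]})
shuffled = toy.sample(frac=1, random_state=0)

# loc matches the index *label* 2; iloc takes the row at *position* 2.
by_label = shuffled.loc[2, "word"]       # always "not", whatever the order
by_position = shuffled.iloc[2]["word"]   # depends on where the row landed
print(by_label, by_position)
```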
To really learn the difference between iloc and loc we need to shuffle the rows. To do that we can use the sample method to take a new sample of the dataframe's original size (that's what frac=1 means).
_Bonus: Stop and consider: what is the purpose of setting a random_state when we call the sample method?_
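A minimal sketch on toy data of what random_state buys us:

```python
import pandas as pd

# The same random_state always produces the same shuffle, so anyone
# re-running this notebook gets identical (reproducible) results.
toy = pd.DataFrame({"x": range(5)})
order1 = toy.sample(frac=1, random_state=42).index.tolist()
order2 = toy.sample(frac=1, random_state=42).index.tolist()
assert order1 == order2  # identical shuffles
```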
df = df.sample(frac=1, random_state=109)
df
Now let's repeat our code from earlier.
df.iloc[10] == df.loc[10]
all(df.iloc[10] == df.loc[10])
It turns out that loc filters by index value (something like where df.index == value).
That is, loc's results depend only on the indices (which are now scrambled after sampling). The actual positions of elements in the DataFrame are ignored.
df.loc[10]
By contrast, iloc filters by row position (something like where df.index == df.index[value]).
So iloc's results depend on the actual positions of elements in a pandas data structure. The indices of these elements are ignored.
It's this difference that explains counterintuitive results like this:
df.index[10]
Consider indexing a single row with iloc:
df.iloc[10]
And take note of where you can find the index in output formatted like this.
Enough chat. Time for...
Exercise
In the cell below, fill in the blank so that the row5 variable stores the 5th row of df. To be clear, imagine our DataFrame looked as follows:
Index | Words |
---|---|
0 | this |
1 | is |
2 | not |
3 | easy |
We'd say the 1st row is the one with the word this, the 2nd row is the one with the word is, the 3rd row is the one with the word not, etc.
### edTest(test_a) ###
row5 = ________
row5
You can display the first few rows to get a better sense of what you did. Can you find the row you've just selected?
df.iloc[:5]
Notice how we can use familiar Python slice notation with iloc and loc!
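One subtlety worth a sketch (toy data, made-up values): the two slice differently at the endpoint.

```python
import pandas as pd

# A toy frame with the default RangeIndex.
toy = pd.DataFrame({"x": [10, 20, 30, 40, 50]})

# iloc slices by position and EXCLUDES the stop, like plain Python slices.
assert toy.iloc[1:3]["x"].tolist() == [20, 30]

# loc slices by label and INCLUDES the stop label.
assert toy.loc[1:3]["x"].tolist() == [20, 30, 40]
```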
Sorting¶
We scrambled our df earlier with sample. We should also know how to tidy things up.
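One relevant tool is sort_values; here's a sketch on toy data (the words are made up):

```python
import pandas as pd

# sort_values reorders rows by a column's values; the original index
# labels travel with their rows.
toy = pd.DataFrame({"word": ["this", "is", "not", "easy"]})
by_word = toy.sort_values("word", ascending=False)
assert by_word["word"].tolist() == ["this", "not", "is", "easy"]
assert by_word.index.tolist() == [0, 2, 1, 3]
```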
Exercise
In the cell below, fill in the blank so that sorted_df now stores df after sorting it by the math score column in decreasing order (HINT)
### edTest(test_b) ###
sorted_df = ________
sorted_df
Exercise
In the cell below, fill in the blank so that sorted_row5 stores the 5th row of sorted_df. To be clear, imagine our sorted DataFrame looked as follows:
Index | Words |
---|---|
3 | easy |
1 | is |
2 | not |
0 | this |
We'd say the 1st row is the one with easy, the 2nd row is the one with is, the 3rd row is the one with not, etc.
### edTest(test_c) ###
sorted_row5 = ________
sorted_row5
Can you find the row you've just selected?
# len('head()') < len('iloc[:5]') :)
sorted_df.head()
Column Naming Conventions¶
How you've named your df columns can affect the amount of typing required to manipulate your data, the readability of your code, and even the syntax options you have. Let's take a look at some best practices for naming columns.
df
We can see that some column names have lengths that are not comfortable for coding. What we can do to make our life easier:
- Try to work with short column names
- Try to avoid characters like spaces; that allows us to access columns without brackets
- Try working in only lower or upper case (preferably lower case; there's no need to shout)
Our df already conforms to this last suggestion.
We'll find that, after some renaming, things become easier for us, turning expressions like this:
condition = (df['test preparation course'] != 'completed') & (df['writing score'] > df['writing score'].median())
into this:
condition = (df.course != 'completed') & (df.writing > df.writing.median())
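As a sketch of how the renaming itself works (on a toy frame, not the real dataset):

```python
import pandas as pd

# A toy frame (hypothetical values) with two of the awkward column names.
toy = pd.DataFrame({"math score": [70], "test preparation course": ["none"]})

# rename takes an {old: new} mapping and returns a new frame by default.
toy = toy.rename(columns={"math score": "math",
                          "test preparation course": "course"})
assert list(toy.columns) == ["math", "course"]

# Short, space-free names enable attribute access:
assert toy.course.iloc[0] == "none"
```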
Exercise
In the cell below fill in the blank with these goals in mind:
- rename column race/ethnicity to race
- rename column parental level of education to peduc
- rename column test preparation course to course
- remove score (including the leading space) from the rest of the columns
HINT1: Don't be shy, check the documentation if you need some help
HINT2: in many cases, it's faster to access docstrings using help
help(df.rename)
HINT3: TAB autocomplete can be used to explore an object's available methods and attributes.
HINT4: Still more exciting, place your cursor after the opening parenthesis of a function call and press SHIFT+TAB once or twice. Instantly, you're presented with a docstring. It's a whole new world! 🌈
### edTest(test_d) ###
df = df.rename(columns={____})
df
Look for missing values¶
Missing values in our DataFrames can cause many issues. They can cause certain operations and function calls to fail and throw an error. Perhaps worse, these problems can happen 'silently,' affecting our results without us realizing there is a problem. Unless we take precautions, of course!
The first step is locating missingness. This dataset doesn't have any missing values, so we'll make some ourselves and 'poke a few holes' in the DataFrame.
# 'poking holes' in the data
df.iloc[0,5] = None
df.iloc[2,2] = None
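Before tackling the exercises, here's how the missingness tools behave on a toy frame (a sketch; the data and fill strategy are made up):

```python
import numpy as np
import pandas as pd

# A toy frame with holes punched in it.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

# isna() marks each missing cell True; summing the booleans counts them.
assert toy["a"].isna().sum() == 1
assert toy.isna().sum().tolist() == [1, 1]

# fillna replaces holes (here: column a's mean, which skips NaN);
# dropna removes any row that still contains a missing value.
filled = toy.fillna({"a": toy["a"].mean()})
assert filled["a"].tolist() == [1.0, 2.0, 3.0]
assert len(toy.dropna()) == 1  # only the fully complete row survives
```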
Exercise
Fill in the blank to display whether or not entries in the first 5 rows are missing (HINT)
Solution
df.___.___
Exercise
Fill in the blanks to sum the total number of missing values in each of the dataset's columns.
### edTest(test_e) ###
resultE = df.___.___
display(resultE)
Now let's deal with these 'holes' we've just made.
Exercise
Fill the missing math entry with that column's mean.
Hint: Select subsets of the data frame using a column name or names
Note: The blanks here represent just one way of doing this. Don't feel constrained by them.
Solution
df___ = df___.___(df___.___)
df.head()
Exercise
Drop the row with the missing peduc value from the DataFrame.
Hint: to make it easier, consider that this is now the only remaining missing value
Solution
df = df___
df.head()
Categorical Columns - nunique() and unique()¶
If some of your data is categorical (i.e., taking on a discrete set of values), you'll often want to know exactly how many unique values you're dealing with. nunique and unique are here to help and can be used on a single column or across multiple columns.
Please consider: unique will return all of the unique values. What if you ask for the uniques of a column with 1 million different values? 🤔
This particular method should be used wisely.
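A quick sketch on toy data (made-up values) of the difference in what the two return:

```python
import pandas as pd

toy = pd.DataFrame({"gender": ["female", "male", "female"],
                    "score": [70, 65, 80]})

# nunique returns just the count -- cheap even on high-cardinality columns.
assert toy["gender"].nunique() == 2

# unique materializes every distinct value -- fine here, risky on millions.
assert sorted(toy["gender"].unique().tolist()) == ["female", "male"]
```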
df.nunique()
df['gender'].unique().tolist()
# the line below represents the usage that should be avoided
# df['math'].unique()
Exercise
Fill in the blanks using unique() and nunique() to complete the method print_uniques to help us learn more about the categorical variables in our data.
### edTest(test_f) ###
def print_uniques(df, col, limit=10):
"""Print column's uniques values when the number of column's uniques is lower or equal than limit """
n = df[col].______
if n <= limit:
print(f'{col}:', df[col].______.tolist())
else:
print(f'{col}:', f'more than {limit} uniques')
for col in df.columns:
print_uniques(df, col)
Descriptive statistics¶
In Pandas, DataFrames have a simple method for displaying summary statistics.
df.describe()
Exercise
Sometimes we don't want to access all these statistics. In the cell below fill in the blanks to get the mean and the standard deviation of the writing and reading columns (HINT).
### edTest(test_g) ###
resultG = df[[____]].aggregate([____])
display(resultG)
Now we can group our dataframe by the values of a specific column (or columns) and aggregate the other columns.
Hint: Try using agg() as an alternative to spare some typing
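Here's a minimal sketch of the split/apply pattern on toy data (the column names and values are made up):

```python
import pandas as pd

toy = pd.DataFrame({"gender": ["female", "male", "female", "male"],
                    "math": [70, 60, 80, 64]})

# groupby splits rows by key; agg applies a function per group and column.
means = toy.groupby("gender").agg({"math": "mean"})
assert means.loc["female", "math"] == 75.0
assert means.loc["male", "math"] == 62.0
```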
Exercise
Group the dataframe by peduc
and gender
while aggregating math
, reading
, and writing
with the mean and course
with the mode.
Tip: Again, don't feel constrained by the blanks. This command may span multiple lines.
df.___(_____).___(_________)
peduc | gender | math | reading | writing
---|---|---|---|---
associate's degree | female | 65.250000 | 74.120690 | 74.000000
 | male | 70.764151 | 67.433962 | 65.405660
bachelor's degree | female | 68.349206 | 77.285714 | 78.380952
 | male | 70.581818 | 68.090909 | 67.654545
high school | female | 59.322581 | 68.268817 | 66.731183
 | male | 64.705882 | 61.480392 | 58.539216
master's degree | female | 66.500000 | 76.805556 | 77.638889
 | male | 74.826087 | 73.130435 | 72.608696
some college | female | 65.406780 | 73.550847 | 74.050847
 | male | 69.009259 | 64.990741 | 63.148148
some high school | female | 59.296703 | 69.109890 | 68.285714
 | male | 67.955672 | 64.693182 | 61.375000
We would likely want to treat peduc as an 'ordinal' rather than a categorical variable, as it does have an inherent ordering. Feel free to try using indexing to sort the rows. But after the grouping we now have multiple indices for each row!
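One way to impose that inherent ordering is an ordered Categorical; here's a sketch on toy data (the education levels and values are made up, and only a subset of the real levels is used):

```python
import pandas as pd

# Hypothetical education levels in their natural order.
levels = ["high school", "some college", "bachelor's degree"]
toy = pd.DataFrame({"peduc": ["bachelor's degree", "high school", "some college"],
                    "math": [70, 60, 65]})

# An ordered Categorical gives the column an inherent ordering.
toy["peduc"] = pd.Categorical(toy["peduc"], categories=levels, ordered=True)

# groupby now yields groups in the declared order, not alphabetical order.
means = toy.groupby("peduc", observed=True)["math"].mean()
assert means.index.tolist() == levels
assert means.loc["high school"] == 60.0
```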
There's still more Pandas to discover 🐼