Key Word(s): pandas
CS109A Introduction to PANDAS
Lecture 2, Exercise 2: PANDAS Intro 2¶
Harvard University
Fall 2021
Instructors: Pavlos Protopapas and Natesh Pillai
Exercise 2: PANDAS Intro 2
Let's get some more practice with a few of the core PANDAS
functions.
import pandas as pd
We'll continue working with StudentsPerformance dataset from Exercise 1.
# import the CSV file
df = pd.read_csv("StudentsPerformance.csv")
df.head()
Indexing - iloc and loc¶
It's very important to understand the differences between loc and iloc. Looking at the code in the next cell, one could think that they do the same thing.
(When you query just one row, you obtain an object whose name is the index of the selected row.)
df.iloc[10] == df.loc[10]
all(df.iloc[10] == df.loc[10])
The first time we loaded the CSV into a DataFrame we didn't tell pandas to interpret a specific column as an index, so pandas created an index for us. Whether the ordering imposed on our data by this index should be respected is a matter on which iloc and loc disagree.
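A sketch of that disagreement on a tiny toy frame (the words are made up for illustration):

```python
import pandas as pd

# A toy frame whose rows we scramble, as we're about to do with df.
toy = pd.DataFrame({"word": ["this", "is", "not", "easy"]})
shuffled = toy.sample(frac=1, random_state=0)

# loc matches the index *label* 2; iloc takes the row at *position* 2.
by_label = shuffled.loc[2, "word"]       # always "not", whatever the order
by_position = shuffled.iloc[2]["word"]   # depends on where the row landed
print(by_label, by_position)
```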
To really learn the difference between iloc and loc we need to shuffle the rows. To do that we can use the sample method to take a new sample of the dataframe's original size (that's what frac=1 means).
_Bonus: Stop and consider: what is the purpose of setting a random_state when we call the sample method?_
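A minimal sketch on toy data of what random_state buys us:

```python
import pandas as pd

# The same random_state always produces the same shuffle, so anyone
# re-running this notebook gets identical (reproducible) results.
toy = pd.DataFrame({"x": range(5)})
order1 = toy.sample(frac=1, random_state=42).index.tolist()
order2 = toy.sample(frac=1, random_state=42).index.tolist()
assert order1 == order2  # identical shuffles
```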
df = df.sample(frac=1, random_state=109)
df
Now let's repeat our code from earlier.
df.iloc[10] == df.loc[10]
all(df.iloc[10] == df.loc[10])
It turns out that loc filters by index value (something like where df.index == value).
That is, loc's results depend only on the indices (which are now scrambled after sampling). The actual positions of elements in the DataFrame are ignored.
df.loc[10]
By contrast, iloc filters by row position (something like where df.index == df.index[value]).
So iloc's results depend on the actual positions of elements in a pandas data structure. The indices of these elements are ignored.
It's this difference that explains counterintuitive results like this:
df.index[10]
Consider indexing a single row with iloc:
df.iloc[10]
And take note of where you can find the index in output formatted like this.
Enough chat. Time for...
Exercise
In the cell below, fill in the blank so that the row5 variable stores the 5th row of df. To be clear, imagine our DataFrame looked as follows:
Index | Words |
---|---|
0 | this |
1 | is |
2 | not |
3 | easy |
We'd say the 1st row is the one with the word this, the 2nd row is the one with the word is, the 3rd row is the one with the word not, etc.
### edTest(test_a) ###
row5 = ________
row5
You can display the first few rows to get a better sense of what you did. Can you find the row you've just selected?
df.iloc[:5]
Notice how we can use familiar Python slice notation with iloc and loc!
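One subtlety worth a sketch (toy data, made-up values): the two slice differently at the endpoint.

```python
import pandas as pd

# A toy frame with the default RangeIndex.
toy = pd.DataFrame({"x": [10, 20, 30, 40, 50]})

# iloc slices by position and EXCLUDES the stop, like plain Python slices.
assert toy.iloc[1:3]["x"].tolist() == [20, 30]

# loc slices by label and INCLUDES the stop label.
assert toy.loc[1:3]["x"].tolist() == [20, 30, 40]
```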
Sorting¶
We scrambled our df earlier with sample. We should also know how to tidy things up.
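One relevant tool is sort_values; here's a sketch on toy data (the words are made up):

```python
import pandas as pd

# sort_values reorders rows by a column's values; the original index
# labels travel with their rows.
toy = pd.DataFrame({"word": ["this", "is", "not", "easy"]})
by_word = toy.sort_values("word", ascending=False)
assert by_word["word"].tolist() == ["this", "not", "is", "easy"]
assert by_word.index.tolist() == [0, 2, 1, 3]
```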
Exercise
In the cell below, fill in the blank so that sorted_df now stores df after sorting it by the math score column in decreasing order (HINT)
### edTest(test_b) ###
sorted_df = ________
sorted_df
Exercise
In the cell below, fill in the blank so that sorted_row5 stores the 5th row of sorted_df. To be clear, imagine our sorted DataFrame looked as follows:
Index | Words |
---|---|
3 | easy |
1 | is |
2 | not |
0 | this |
We'd say the 1st row is the one with easy, the 2nd row is the one with is, the 3rd row is the one with not, etc.
### edTest(test_c) ###
sorted_row5 = ________
sorted_row5
Can you find the row you've just selected?
# len('head()') < len('iloc[:5]') :)
sorted_df.head()
Column Naming Conventions¶
How you've named your df columns can affect the amount of typing required to manipulate your data, the readability of your code, and even the syntax options you have. Let's take a look at some best practices for naming columns.
df
We can see that some column names have lengths that are not comfortable for coding. What we can do to make our life easier:
- Try to work with short column names
- Try to avoid characters like spaces; that allows us to access columns without brackets
- Try working in only lower or upper case (preferably lower case; there's no need to shout)
Our df already conforms to this last suggestion.
We'll find that, after some renaming, things become easier for us, turning expressions like this:
condition = (df['test preparation course'] != 'completed') & (df['writing score'] > df['writing score'].median())
into this:
condition = (df.course != 'completed') & (df.writing > df.writing.median())
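As a sketch of how the renaming itself works (on a toy frame, not the real dataset):

```python
import pandas as pd

# A toy frame (hypothetical values) with two of the awkward column names.
toy = pd.DataFrame({"math score": [70], "test preparation course": ["none"]})

# rename takes an {old: new} mapping and returns a new frame by default.
toy = toy.rename(columns={"math score": "math",
                          "test preparation course": "course"})
assert list(toy.columns) == ["math", "course"]

# Short, space-free names enable attribute access:
assert toy.course.iloc[0] == "none"
```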
Exercise
In the cell below fill in the blank with these goals in mind:
- rename column race/ethnicity to race
- rename column parental level of education to peduc
- rename column test preparation course to course
- remove score (including the leading space) from the rest of the columns
HINT1: Don't be shy, check the documentation if you need some help
HINT2: in many cases, it's faster to access docstrings using help
help(df.rename)
HINT3: TAB autocomplete can be used to explore an object's available methods and attributes.
HINT4: Still more exciting, place your cursor after the opening parenthesis of a function call and press SHIFT+TAB once or twice. Instantly, you're presented with a docstring. It's a whole new world! 🌈
### edTest(test_d) ###
df = df.rename(columns={____})
df
Look for missing values¶
Missing values in our DataFrames can cause many issues. They can cause certain operations and function calls to fail and throw an error. Perhaps worse, these problems can happen 'silently,' affecting our results without us realizing there is a problem. Unless we take precautions, of course!
The first step is locating missingness. This dataset doesn't have any missing values, so we'll make some ourselves and 'poke a few holes' in the DataFrame.
# 'poking holes' in the data
df.iloc[0,5] = None
df.iloc[2,2] = None
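Before tackling the exercises, here's how the missingness tools behave on a toy frame (a sketch; the data and fill strategy are made up):

```python
import numpy as np
import pandas as pd

# A toy frame with holes punched in it.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

# isna() marks each missing cell True; summing the booleans counts them.
assert toy["a"].isna().sum() == 1
assert toy.isna().sum().tolist() == [1, 1]

# fillna replaces holes (here: column a's mean, which skips NaN);
# dropna removes any row that still contains a missing value.
filled = toy.fillna({"a": toy["a"].mean()})
assert filled["a"].tolist() == [1.0, 2.0, 3.0]
assert len(toy.dropna()) == 1  # only the fully complete row survives
```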
Exercise
Fill in the blank to display whether or not entries in the first 5 rows are missing (HINT)
Solution
df.___.___
Exercise
Fill in the blanks to sum the total number of missing values in each of the dataset's columns.
### edTest(test_e) ###
resultE = df.___.___
display(resultE)
Now let's deal with these 'holes' we've just made.
Exercise
Fill the missing math entry with that column's mean.
Hint: Select subsets of the data frame using a column name or names
Note: The blanks here represent just one way of doing this. Don't feel constrained by them.
Solution
df___ = df___.___(df___.___)
df.head()
Exercise
Drop the row with the missing peduc value from the DataFrame.
Hint: to make it easier, consider that this is now the only remaining missing value
Solution
df = df___
df.head()
Categorical Columns - nunique() and unique()¶
If some of your data is categorical (i.e., taking on a discrete set of values), you'll often want to know exactly how many unique values you're dealing with. nunique and unique are here to help and can be used on a single column or across multiple columns.
Please consider: unique will return all of the unique values. What if you ask for the uniques of a column with 1 million different values? 🤔
This particular method should be used wisely.
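A quick sketch on toy data (made-up values) of the difference in what the two return:

```python
import pandas as pd

toy = pd.DataFrame({"gender": ["female", "male", "female"],
                    "score": [70, 65, 80]})

# nunique returns just the count -- cheap even on high-cardinality columns.
assert toy["gender"].nunique() == 2

# unique materializes every distinct value -- fine here, risky on millions.
assert sorted(toy["gender"].unique().tolist()) == ["female", "male"]
```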
df.nunique()
df['gender'].unique().tolist()
# the line below represents the usage that should be avoided
# df['math'].unique()
Exercise
Fill in the blanks using unique() and nunique() to complete the method print_uniques to help us learn more about the categorical variables in our data.
### edTest(test_f) ###
def print_uniques(df, col, limit=10):
"""Print column's uniques values when the number of column's uniques is lower or equal than limit """
n = df[col].______
if n <= limit:
print(f'{col}:', df[col].______.tolist())
else:
print(f'{col}:', f'more than {limit} uniques')
for col in df.columns:
print_uniques(df, col)
Descriptive statistics¶
In Pandas, DataFrames have a simple method for displaying summary statistics.
df.describe()
Exercise
Sometimes we don't want to access all these statistics. In the cell below fill in the blanks to get the mean and the standard deviation of the writing and reading columns (HINT).
### edTest(test_g) ###
resultG = df[[____]].aggregate([____])
display(resultG)
Now we can group our dataframe by the values of a specific column (or columns) and aggregate the other columns.
Hint: Try using agg() as an alternative to spare some typing
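Here's a minimal sketch of the split/apply pattern on toy data (the column names and values are made up):

```python
import pandas as pd

toy = pd.DataFrame({"gender": ["female", "male", "female", "male"],
                    "math": [70, 60, 80, 64]})

# groupby splits rows by key; agg applies a function per group and column.
means = toy.groupby("gender").agg({"math": "mean"})
assert means.loc["female", "math"] == 75.0
assert means.loc["male", "math"] == 62.0
```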
Exercise
Group the dataframe by peduc
and gender
while aggregating math
, reading
, and writing
with the mean and course
with the mode.
Tip: Again, don't feel constrained by the blanks. This command may span multiple lines.
df.___(_____).___(_________)
peduc | gender | math | reading | writing
---|---|---|---|---
associate's degree | female | 65.250000 | 74.120690 | 74.000000
 | male | 70.764151 | 67.433962 | 65.405660
bachelor's degree | female | 68.349206 | 77.285714 | 78.380952
 | male | 70.581818 | 68.090909 | 67.654545
high school | female | 59.322581 | 68.268817 | 66.731183
 | male | 64.705882 | 61.480392 | 58.539216
master's degree | female | 66.500000 | 76.805556 | 77.638889
 | male | 74.826087 | 73.130435 | 72.608696
some college | female | 65.406780 | 73.550847 | 74.050847
 | male | 69.009259 | 64.990741 | 63.148148
some high school | female | 59.296703 | 69.109890 | 68.285714
 | male | 67.955672 | 64.693182 | 61.375000
We would likely want to treat peduc as an 'ordinal' rather than a categorical variable, as it does have an inherent ordering. Feel free to try using indexing to sort the rows. But after the grouping we now have multiple indices for each row!
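One way to impose that inherent ordering is an ordered Categorical; here's a sketch on toy data (the education levels and values are made up, and only a subset of the real levels is used):

```python
import pandas as pd

# Hypothetical education levels in their natural order.
levels = ["high school", "some college", "bachelor's degree"]
toy = pd.DataFrame({"peduc": ["bachelor's degree", "high school", "some college"],
                    "math": [70, 60, 65]})

# An ordered Categorical gives the column an inherent ordering.
toy["peduc"] = pd.Categorical(toy["peduc"], categories=levels, ordered=True)

# groupby now yields groups in the declared order, not alphabetical order.
means = toy.groupby("peduc", observed=True)["math"].mean()
assert means.index.tolist() == levels
assert means.loc["high school"] == 60.0
```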
There's still more Pandas to discover 🐼