Key Word(s): requests, web scraping, pandas, beautiful soup, parsing, eda

CS109A Introduction to Data Science

Lecture 3, Exercise 2: PANDAS Intro

Harvard University
Fall 2020
Instructors: Pavlos Protopapas, Kevin Rader, and Chris Tanner

NOTE: After running every cell, be sure to auto-grade your work by clicking 'Mark' in the lower-right corner. Otherwise, no credit will be given.

Title

Exercise 2: PANDAS Intro

Description

As discussed in class, PANDAS is a Python library that contains highly useful data structures, including DataFrames, which make Exploratory Data Analysis (EDA) easy. Here, we get practice with some of its elementary functions.

In [1]:

import pandas as pd

For this exercise, we will be working with the CS109 First Day survey results!

In [2]:

# import the CSV file
df = pd.read_csv("cs109a_student_survey.csv")

PANDAS Basics

Let's get started with the basic functionality of PANDAS!

In the cell below, fill in the blank so that the variable cols stores df's column names. NOTE: Please keep the type of the data structure as a pandas Index. You do not have to convert it to a list.

In [31]:

### edTest(test_a) ###
cols = ____
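As a small illustration on made-up data (the DataFrame `toy` and the name `toy_cols` are hypothetical and not part of the survey), column labels come back as a pandas Index rather than a plain list:

```python
import pandas as pd

# Hypothetical toy data, not the survey file
toy = pd.DataFrame({"Name": ["Enrique", "Sheila"], "Age": [25, 67]})

toy_cols = toy.columns   # a pandas Index holding the column labels
print(type(toy_cols))    # some pandas Index type
print(list(toy_cols))    # ['Name', 'Age']
```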

In the cell below, fill in the blank so that:

num_cols stores the number of columns in df

In [4]:

### edTest(test_b) ###
num_rows = df.shape[0]
num_cols = ____
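For reference, a minimal sketch of how `shape` behaves on a made-up DataFrame (the names `toy`, `n_rows`, and `n_cols` are hypothetical): it is a `(rows, columns)` tuple, so each count can be read by position.

```python
import pandas as pd

# Hypothetical toy data: 3 rows, 2 columns
toy = pd.DataFrame({"Name": ["Enrique", "Sheila", "Marcy"],
                    "Age": [25, 67, 21]})

n_rows, n_cols = toy.shape  # unpack the (rows, columns) tuple
print(n_rows, n_cols)       # 3 2
```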

In the cell below, fill in the blank so that sneak_peak is equal to the first 7 rows of df.

In [5]:

### edTest(test_c) ###
sneak_peak = ____

In the cell below, fill in the blank so that the_end is equal to the last 4 rows of df.

In [6]:

### edTest(test_d) ###
the_end = ____
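To see how the two ends of a DataFrame can be sliced off, here is a short sketch on made-up data (`toy`, `first_two`, and `last_one` are hypothetical names). Both `head` and `tail` take the number of rows to keep and return a new DataFrame.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"Name": ["Enrique", "Sheila", "Marcy", "Utibe"],
                    "Age": [25, 67, 21, 33]})

first_two = toy.head(2)  # the rows at positions 0 and 1
last_one = toy.tail(1)   # only the final row

print(first_two)
print(last_one)
```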

In the cell below, fill in the blank so that the python_experiences variable stores a list of the 5 distinct values found within the Python experience column of df.

In [27]:

### edTest(test_e) ###
python_experiences = ________
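One way to collect distinct values, sketched on made-up data (the column `Level` and the names `toy` and `levels` are hypothetical): `Series.unique()` returns the distinct values in order of first appearance, and wrapping the result in `list(...)` gives a plain Python list.

```python
import pandas as pd

# Hypothetical toy data with a repeated value
toy = pd.DataFrame({"Level": ["Beginner", "Expert", "Beginner", "Novice"]})

# unique() drops repeats; list() converts the array-like result to a list
levels = list(toy["Level"].unique())
print(levels)  # ['Beginner', 'Expert', 'Novice']
```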

In the cell below, fill in the blank so that the inventor variable stores the DataFrame row(s) that correspond to everyone who is an "Inventor of Python".

In [8]:

### edTest(test_f) ###
inventor = ________

In the cell below, fill in the blank so that the utc1 variable stores the DataFrame rows that correspond to everyone who has a Timezone value of "UTC+1 (Most of mainland Europe)".

In [9]:

### edTest(test_g) ###
utc1 = ____________
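Selecting rows by a column's value is usually done with a boolean mask. The sketch below uses made-up data (the column `Zone` and the names `toy`, `mask`, and `matches` are hypothetical): comparing a column to a value yields one True/False per row, and indexing the DataFrame with that mask keeps only the True rows.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"Name": ["Enrique", "Sheila", "Marcy", "Utibe"],
                    "Zone": ["UTC+1", "UTC-5", "UTC+1", "UTC+0"]})

mask = toy["Zone"] == "UTC+1"  # boolean Series, one entry per row
matches = toy[mask]            # keeps only the rows where mask is True

print(matches)
```

Note that the rows in `matches` keep their original index labels (here 0 and 2).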

In the cell below, fill in the blank so that the row56 variable stores the 56th row of df. To be clear, imagine our DataFrame looked as follows:

   Name     Age
0  Enrique   25
1  Sheila    67
2  Marcy     21
3  Utibe     33

We'd say the 1st row is the one with Enrique, the 2nd row is the one with Sheila, the 3rd row is the one with Marcy, etc.

In [10]:

### edTest(test_h) ###
row56 = ________
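Because Python counts positions from 0, the "nth row" in the everyday sense sits at position n - 1. A minimal sketch on the example data above (`toy` and `third_row` are hypothetical names), using `iloc` for position-based selection:

```python
import pandas as pd

# The small example table from the text
toy = pd.DataFrame({"Name": ["Enrique", "Sheila", "Marcy", "Utibe"],
                    "Age": [25, 67, 21, 33]})

third_row = toy.iloc[2]   # position 2 is the 3rd row
print(third_row["Name"])  # Marcy
```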

In the cell below, fill in the blank so that sorted_df stores df after sorting it by the Name column in ascending order (A -> Z).

In [11]:

### edTest(test_i) ###
sorted_df = ________

In the cell below, fill in the blank so that sorted_row56 stores the 56th row of sorted_df. To be clear, imagine our sorted DataFrame looked as follows:

   Name     Age
0  Enrique   25
2  Marcy     21
1  Sheila    67
3  Utibe     33

We'd say the 1st row is the one with Enrique, the 2nd row is the one with Marcy, the 3rd row is the one with Sheila, etc.

In [12]:

### edTest(test_j) ###
sorted_row56 = ________
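After sorting, each row keeps its original index label, so position and label no longer agree. The sketch below reproduces the small example table (`toy`, `toy_sorted`, `by_position`, and `by_label` are hypothetical names) and contrasts `iloc`, which selects by position, with `loc`, which selects by label.

```python
import pandas as pd

# The small example table from the text
toy = pd.DataFrame({"Name": ["Enrique", "Sheila", "Marcy", "Utibe"],
                    "Age": [25, 67, 21, 33]})

# sort_values sorts in ascending order by default and returns a new DataFrame
toy_sorted = toy.sort_values("Name")
print(list(toy_sorted.index))      # [0, 2, 1, 3]

by_position = toy_sorted.iloc[1]   # 2nd row in sorted order
by_label = toy_sorted.loc[1]       # row whose index label is 1

print(by_position["Name"])         # Marcy
print(by_label["Name"])            # Sheila
```

Which of the two matches "the nth row" depends on whether the sorted order or the original labels are meant; the exercise text above counts rows in sorted order.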