CS-109A Introduction to Data Science

Lab 1: Introduction to Python and its Numerical Stack

Harvard University
Fall 2018
Instructors: Pavlos Protopapas and Kevin Rader
Lab Instructor: Rahul Dave
Authors: Rahul Dave, David Sondak, Will Claybaugh, Pavlos Protopapas


In [1]:
## RUN THIS CELL TO GET THE RIGHT FORMATTING 
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)

print('hello')
hello

Programming Expectations

All assignments for this class will use Python and the browser-based iPython notebook format you are currently viewing. Python experience is not a prerequisite for this course, as long as you are comfortable learning on your own as needed.

Note, though, that programming at the level of CS 50 is a prerequisite for this course. If you have concerns about the prerequisite, please come speak with any of the instructors.

We will refer to the Python 3 documentation in this lab and throughout the course. There are also many introductory tutorials to help build programming skills, which are listed in the last section of this lab.

Table of Contents

  1. Learning Goals
  2. Getting Started
  3. Lists
  4. Simple Functions
  5. Numpy
  6. Scipy.stats and plotting distributions
  7. Conclusions

Additional Stuff

  1. Dictionaries
  2. Reading CSVs using pandas

Learning Goals

This introductory lab is a condensed introduction to Python numerical programming. By the end of this lab, you will feel more comfortable:

  • Writing short Python code using functions, loops, lists, numpy arrays, and dictionaries.

  • Manipulating Python lists and numpy arrays and understanding the difference between them.

  • Using probability distributions from scipy.stats

  • Making very simple plots using matplotlib

  • Reading and writing CSV files using pandas

  • Learning and reading Python documentation.

Lab 1 relates to material in lectures 0, 1, 2, and 3 and homework 0.

Part 1: Getting Started

Importing modules

All notebooks should begin with code that imports modules, collections of built-in, commonly-used Python functions. Below we import the Numpy module, a fast numerical programming library for scientific computing. Future labs will require additional modules, which we'll import with the same import MODULE_NAME as MODULE_NICKNAME syntax.

In [2]:
import numpy as np #imports a fast numerical programming library

Now that Numpy has been imported, we can access some useful functions. For example, we can use mean to calculate the mean of a set of numbers.

In [3]:
np.mean([1.2, 2, 3.3])

This calculates the mean of 1.2, 2, and 3.3.

The code above is not particularly efficient, and efficiency will be important for you when dealing with large data sets. Later and in lab 2 we will see more efficient options.

Calculations and variables

In [4]:
# // is integer division
1/2, 1//2, 1.0/2.0, 3*3.2

The last line in a cell is returned as the output value, as above. For cells with multiple lines of results, we can display results using print, as can be seen below.

In [5]:
print(1 + 3.0, "\n", 9, 7)
5/3

We can store integer or floating point values as variables. The other basic Python data types -- booleans, strings, lists -- can also be stored as variables.

(more on types here: http://www.diveintopython3.net/native-datatypes.html)

In [6]:
a = 1
b = 2.0

Here we store a list in a variable (more about what a list is later):

In [7]:
a = [1, 2, 3]

Think of a variable as a label for a value, not a box in which you put the value.

(image taken from Fluent Python by Luciano Ramalho)

In [8]:
b = a
b

This DOES NOT create a new copy of a. It merely puts a new label on the memory at a, as can be seen by the following code:

In [9]:
print("a", a)
print("b", b)
a[1] = 7
print("a after change", a)
print("b after change", b)
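If you do want an independent copy rather than a second label, one option is the list's copy method. The sketch below uses made-up variable names (original, alias, shallow) and the is operator to show which names refer to the same object.

```python
# Two labels for one list versus an independent (shallow) copy.
original = [1, 2, 3]
alias = original            # a second label on the same list object
shallow = original.copy()   # a new list with the same elements

original[1] = 7             # change the list through one of its labels

print(alias is original)    # True: both names refer to one list
print(shallow is original)  # False: a separate list object
print(alias)                # [1, 7, 3]
print(shallow)              # [1, 2, 3]
```

list(original) and original[:] produce the same kind of shallow copy.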

Tuples

Multiple items on one line in the interface are returned as a tuple, an immutable sequence of Python objects.

In [10]:
a = 1
b = 2.0
a + a, a - b, b * b, 10*a

We can obtain the type of a variable, and use boolean comparisons to test these types.

In [11]:
type(a) == float
In [12]:
type(a) == int

For reference, below are common arithmetic and comparison operations.

(Figures: tables of common arithmetic and comparison operations.)

EXERCISE: Create a tuple called `tup` with the following five objects:

  • The first element is an integer of your choice
  • The second element is a float of your choice
  • The third element is the sum of the first two elements
  • The fourth element is the difference of the first two elements
  • The fifth element is first element divided by the second element

Display the output of tup. What is the type of the variable tup? What happens if you try to change an item in the tuple?

In [13]:
# your code here

Part 2: Lists

Much of Python is based on the notion of a list. In Python, a list is a sequence of items separated by commas, all within square brackets. The items can be integers, floating-point numbers, or other types. Unlike in C arrays, items in a Python list can be of different types, so Python lists are more versatile than traditional arrays in C or in other languages.

Let's start out by creating a few lists.

In [14]:
empty_list = []
float_list = [1., 3., 5., 4., 2.]
int_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
mixed_list = [1, 2., 3, 4., 5]
print(empty_list)
print(int_list)
print(mixed_list, float_list)

Lists in Python are zero-indexed, as in C. The first entry of the list has index 0, the second has index 1, and so on.

In [15]:
print(int_list[0])
print(float_list[1])

What happens if we try to use an index that doesn't exist for that list? Python will complain!

In [16]:
print(float_list[10])

You can find the length of a list using the builtin function len:

In [17]:
print(float_list)
len(float_list)

Indexing on lists

And since Python is zero-indexed, the last element of float_list is

In [18]:
float_list[len(float_list)-1]

It is more idiomatic in Python to use -1 for the last element, -2 for the second last, and so on.

In [19]:
float_list[-1]

We can use the : operator to access a subset of the list. This is called slicing.

In [20]:
print(float_list[1:5])
print(float_list[0:2])

Below is a summary of list slicing operations:

(Figure: summary of list slicing operations.)

You can slice "backwards" as well:

In [21]:
float_list[:-2] # up to second last
In [22]:
float_list[:4] # up to but not including 5th element

You can also slice with a stride:

In [23]:
float_list[:4:2] # above but skipping every second element
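The slicing forms above all follow the pattern list[start:stop:step], where stop is excluded and any of the three parts may be omitted. Here is a small sketch on a made-up list of letters, so that positions are easy to see.

```python
# Slicing follows list[start:stop:step]; stop is not included.
letters = ['a', 'b', 'c', 'd', 'e']

print(letters[1:4])    # ['b', 'c', 'd']
print(letters[-2:])    # ['d', 'e'] (the last two elements)
print(letters[::2])    # ['a', 'c', 'e'] (every second element)
print(letters[::-1])   # ['e', 'd', 'c', 'b', 'a'] (a reversed copy)
print(letters[10:])    # [] (an out-of-range slice is empty, not an error)
```

Unlike indexing with a single out-of-range position, slicing past the end of a list does not raise an error.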

We can iterate through a list using a loop. Here's a for loop.

In [24]:
for ele in float_list:
    print(ele)

Or, if we like, we can iterate through a list by index, using a for loop over range(len(...)). This is not idiomatic and is not recommended, but accomplishes the same thing as above.

In [25]:
for i in range(len(float_list)):
    print(float_list[i])

What if you wanted the index as well?

Use the built-in Python function enumerate, which yields tuples of the form (index, value) as you loop.

In [26]:
for i, ele in enumerate(float_list):
    print(i,ele)
In [27]:
# or make a list from it using the list constructor
list(enumerate(float_list))

Appending and deleting

We can add items to the end of a list using the + operator or with append. Note that + builds a new list and leaves the original unchanged, while append modifies the list in place.

In [28]:
float_list + [.333]
In [29]:
float_list.append(.444)
In [30]:
print(float_list)
len(float_list)

Go and run the cell with float_list.append a second time. Then run the next line. What happens?

To remove an item from the list, use del.

In [31]:
del float_list[2]
print(float_list)
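To make the contrast between these operations concrete, here is a minimal sketch on a short made-up list (values). It also uses pop, a list method not shown above, which removes the last item and returns it.

```python
values = [1.0, 3.0, 5.0]

combined = values + [0.333]   # + returns a new list; values is unchanged
print(values)                 # [1.0, 3.0, 5.0]
print(combined)               # [1.0, 3.0, 5.0, 0.333]

values.append(0.444)          # append changes values in place
print(values)                 # [1.0, 3.0, 5.0, 0.444]

del values[0]                 # del removes the item at an index
print(values)                 # [3.0, 5.0, 0.444]

removed = values.pop()        # pop removes the last item and returns it
print(removed, values)        # 0.444 [3.0, 5.0]
```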

List Comprehensions

Lists can be constructed in a compact way using a list comprehension. Here's a simple example.

In [32]:
squaredlist = [i*i for i in int_list]
squaredlist

And here's a more complicated one, requiring a conditional.

In [33]:
comp_list1 = [2*i for i in squaredlist if i % 2 == 0]
print(comp_list1)

This is entirely equivalent to creating comp_list1 using a loop with a conditional, as below:

In [34]:
comp_list2 = []
for i in squaredlist:
    if i % 2 == 0:
        comp_list2.append(2*i)
        
print(comp_list2)

The list comprehension syntax

[expression for item in list if conditional]

is equivalent to the syntax

for item in list:
    if conditional:
        expression

Exercise: Build a list that contains every prime number between 1 and 100, in two different ways:

  1. Using for loops and conditional if statements.
  2. (Stretch Goal) Using a list comprehension. You should be able to do this in one line of code, and it may be helpful to look up the function all in the documentation.
In [35]:
# your code here
In [36]:
# your code here

Part 4: Simple Functions

A function object is a reusable block of code that does a specific task. Functions are all over Python, either on their own or on other objects. To invoke a function func, you call it as func(arguments).

We've seen built-in Python functions and methods. For example, len and print are built-in Python functions. And at the beginning, you called np.mean to calculate the mean of three numbers, where mean is a function in the numpy module and numpy was abbreviated as np. This syntax allows us to have multiple "mean" functions in different modules; calling this one as np.mean guarantees that we will pick up numpy's mean function, as opposed to a mean function from a different module.

Methods

A function that belongs to an object is called a method. By "object" here we mean an "instance" of a list, or integer, or floating point variable.

An example of this is append on an existing list. In other words, a method is a function on an instance of a type of object (also called class, in this case, list type).

In [37]:
float_list = [1.0, 2.09, 4.0, 2.0, 0.444]
print(float_list)
float_list.append(56.7) 
float_list

User-defined functions

We'll now learn to write our own user-defined functions. Below is the syntax for defining a basic function with one input argument and one output. You can also define functions with no input or output arguments, or multiple input or output arguments.

def name_of_function(arg):
    ...
    return(output)

We can write functions with one input and one output argument. Here are two such functions.

In [38]:
def square(x):
    x_sqr = x*x
    return(x_sqr)

def cube(x):
    x_cub = x*x*x
    return(x_cub)

square(5),cube(5)

What if you want to return two variables at a time? The usual way is to return a tuple:

In [39]:
def square_and_cube(x):
    x_cub = x*x*x
    x_sqr = x*x
    return(x_sqr, x_cub)

square_and_cube(5)
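A returned tuple can be unpacked straight into separate variables. The sketch below repeats the function in a shorter form so that it runs on its own; the names sq and cb are arbitrary.

```python
def square_and_cube(x):
    # the comma builds a tuple, so two values are returned together
    return x * x, x * x * x

sq, cb = square_and_cube(5)      # tuple unpacking: one name per element
print(sq)                        # 25
print(cb)                        # 125

result = square_and_cube(5)      # or keep the tuple and index into it
print(result[0], result[1])      # 25 125
```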

Lambda functions

Often we quickly define mathematical functions with a one-line function called a lambda function. Lambda functions are great because they enable us to write functions without having to name them, i.e., they're anonymous.
No return statement is needed.

In [40]:
# create an anonymous function and assign it to the variable square
square = lambda x: x*x
print(square(3))


# the hypotenuse is the square root of the sum of the squared sides
hypotenuse = lambda x, y: (x*x + y*y)**0.5

## Same as

# def hypotenuse(x, y):
#     return((x*x + y*y)**0.5)

hypotenuse(3,4)
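Lambda functions are most useful when passed directly to another function, without ever being given a name. As one illustration (the list of points is made up), the built-in sorted accepts a key function that decides the sort order.

```python
# Sort (x, y) points by their squared distance from the origin.
points = [(3, 4), (1, 1), (0, 2)]

by_distance = sorted(points, key=lambda p: p[0]**2 + p[1]**2)
print(by_distance)   # [(1, 1), (0, 2), (3, 4)]
```

Here the lambda is never assigned to a variable, which is what "anonymous" means in practice.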

Refactoring using functions

In an exercise from Lab 0, you wrote code that generated a list of the prime numbers between 1 and 100. For the exercise below, it may help to revisit that code.

Write a function called `isprime` that takes in a positive integer $N$, and determines whether or not it is prime. Return `True` if it's prime and return `False` if it isn't. Then, using a list comprehension and `isprime`, create a list `myprimes` that contains all the prime numbers less than 100.
In [41]:
# your code here

Part 5: Introduction to Numpy

Scientific Python code uses a fast array structure, called the numpy array. Those who have worked in Matlab will find this very natural. For reference, the numpy documentation can be found here.

Let's make a numpy array.

In [42]:
my_array = np.array([1, 2, 3, 4])
my_array
In [43]:
# works as in lists
len(my_array)

The shape attribute of an array is very useful (we'll see more of it later when we talk about 2D and higher dimensional arrays).

In [44]:
my_array.shape

Numpy arrays are typed. This means that all the elements of an array are stored as one type, which numpy infers from the data unless you specify it.

In [45]:
my_array.dtype
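One consequence of typing is worth seeing once. In the sketch below (the arrays are made up for illustration), numpy picks a single type that can hold every element, and a float assigned into an integer array is truncated rather than stored as given.

```python
import numpy as np

ints = np.array([1, 2, 3, 4])
mixed = np.array([1, 2.5, 3])

print(ints.dtype)    # a platform-dependent integer type, often int64
print(mixed.dtype)   # float64: the integers were converted to floats

ints[0] = 9.9        # the array stays integer, so the value is truncated
print(ints)          # [9 2 3 4]
```

A Python list, by contrast, would simply store 9.9 alongside the integers.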

Numpy arrays are listy (i.e. they act like lists)! Below we compute length, slice, and iterate.

In [46]:
print(len(my_array))
print(my_array[2:4])
for ele in my_array:
    print(ele)

In general you should manipulate numpy arrays by using numpy module functions (np.mean, for example). This is for efficiency purposes, which we will discuss in Lab 2.

You can calculate the mean of the array elements either by calling the method .mean on a numpy array or by applying the function np.mean with the numpy array as an argument.

In [47]:
print(my_array.mean())
print(np.mean(my_array))

The way we constructed the numpy array above seems redundant. After all we already had a regular python list. Indeed, it is the other ways we have to construct numpy arrays that make them super useful.

There are many such numpy array constructors. Here are some commonly used constructors. Look them up in the documentation.

In [48]:
np.ones(10) # generates 10 floating point ones

Numpy gains a lot of its efficiency from being typed. That is, all elements in the array have the same type, such as integer or floating point. The default type, as can be seen above, is a 64-bit floating-point number (float64).

In [49]:
np.dtype(float).itemsize # in bytes
In [50]:
np.ones(10, dtype='int') # generates 10 integer ones
In [51]:
np.zeros(10)

Often you will want random numbers. Use the random constructor!

In [52]:
np.random.random(10) # uniform on [0, 1): 0 is possible, 1 is not

You can generate random numbers from a normal distribution with mean 0 and variance 1:

In [53]:
normal_array = np.random.randn(1000)
print("The sample mean and standard deviation are %f and %f, respectively." %(np.mean(normal_array), np.std(normal_array)))

You can sample with and without replacement from an array. Let's first construct a grid:

In [54]:
grid = np.arange(0., 1.01, 0.1)
grid

Without replacement

In [55]:
np.random.choice(grid, 5, replace=False)
In [56]:
np.random.choice(grid, 20, replace=False) # raises ValueError: grid has only 11 elements

With replacement:

In [57]:
np.random.choice(grid, 12, replace=True)
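Random draws change on every run. If you want repeatable results, you can seed the generator first. The sketch below assumes an arbitrary seed of 0, and also shows what happens when you ask for more values without replacement than the grid contains.

```python
import numpy as np

np.random.seed(0)                      # arbitrary seed, for repeatable draws
grid = np.arange(0., 1.01, 0.1)        # 11 points: 0.0, 0.1, ..., 1.0

sample = np.random.choice(grid, 5, replace=False)
print(len(set(sample)))                # 5: no value is drawn twice

error_message = None
try:
    # 20 draws without replacement from 11 values is impossible
    np.random.choice(grid, 20, replace=False)
except ValueError as err:
    error_message = str(err)
print(error_message)
```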

Numpy supports vector operations

What does this mean? It means that instead of adding two arrays, element by element, you can just say: add the two arrays. Note that this behavior is very different from python lists.

In [58]:
first = np.ones(5)
second = np.ones(5)
first + second
In [59]:
first_list = [1., 1., 1., 1., 1.]
second_list = [1., 1., 1., 1., 1.]
first_list + second_list # concatenates the two lists, which is not what you want here

On some computer chips this addition actually happens in parallel, so speedups can be high. But even on regular chips, the advantage of greater readability is important.

Numpy supports a concept known as broadcasting, which dictates how arrays of different sizes are combined together. There are too many rules to list here, but importantly, multiplying an array by a number multiplies each element by the number. Adding a number adds the number to each element.

In [60]:
first + 1
In [61]:
first*5
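Broadcasting also applies between arrays of different shapes, not only between an array and a number. As a small sketch with made-up arrays, a row of length 3 is added to every row of a 2x3 matrix, and a column of height 2 is added to every column.

```python
import numpy as np

matrix = np.arange(6).reshape(2, 3)    # [[0 1 2]
                                       #  [3 4 5]]
row = np.array([10, 20, 30])           # shape (3,)
col = np.array([[100], [200]])         # shape (2, 1)

by_row = matrix + row                  # row is added to each row
by_col = matrix + col                  # col is added to each column

print(by_row)    # [[10 21 32]
                 #  [13 24 35]]
print(by_col)    # [[100 101 102]
                 #  [203 204 205]]
```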

This means that if you wanted samples from a normal distribution with mean 5 and standard deviation 7, that is $N(5, 7^2)$, you could do:

In [62]:
normal_5_7 = 5 + 7*normal_array
np.mean(normal_5_7), np.std(normal_5_7)

Multiplying two arrays multiplies them element-by-element.

In [63]:
(first +1) * (first*5)

You might have wanted to compute the dot product instead:

In [64]:
np.dot((first +1) , (first*5))

You can also use the @ operator for this purpose

In [65]:
(first +1) @ (first*5)
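For one-dimensional arrays, the dot product is just the sum of the element-by-element products. The sketch below checks that on two made-up vectors, u and v.

```python
import numpy as np

u = np.array([1., 2., 3.])
v = np.array([4., 5., 6.])

elementwise = u * v                # [ 4. 10. 18.]
print(np.sum(elementwise))         # 32.0
print(np.dot(u, v))                # 32.0
print(u @ v)                       # 32.0
```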

Part 6: Probability Distributions from scipy.stats

Since we'll be using many distributions, we'll want to access the pdf/pmf functions of these distributions and obtain samples from them. We already saw how to obtain samples from the continuous uniform and normal distributions. But we might want to obtain their pdfs as well.

scipy.stats allows us to obtain the pdf function as well as samples. The programming interface is identical for all the distributions.

To plot samples from and the pdfs of these distributions, we'll first import matplotlib, python's plotting library.

The %matplotlib inline incantation ensures that plots are rendered inline in the browser.

In [66]:
%matplotlib inline
import matplotlib.pyplot as plt

Let's get the normal distribution namespace from scipy.stats. Docs here.

In [67]:
from scipy.stats import norm

Let's create 1000 points between -10 and 10.

In [68]:
x = np.linspace(-10, 10, 1000)
x[:10], x[-10:]

Let's get the pdf of a normal distribution with a mean of 1 and standard deviation 3 and plot it using the grid points computed before...

In [69]:
pdf_x = norm.pdf(x, 1, 3)
plt.plot(x, pdf_x);

And you can draw random samples (random variates) using the rvs function.

In [70]:
norm.rvs(size=30, loc=1, scale=3)

We can use a more instance based way of getting both the pdf and samples. The documentation calls this instance a "frozen" distribution:

In [71]:
frozen_norm = norm(loc=1, scale=3)
type(frozen_norm)
In [72]:
plt.plot(x, frozen_norm.pdf(x))
In [73]:
frozen_norm.rvs(10)
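A frozen distribution carries its parameters with it, so its other methods need no loc or scale arguments. As a quick sanity check, the sketch below compares the pdf at the mean against the closed-form peak height $1/(\sigma\sqrt{2\pi})$. It also calls cdf, mean, and std, which are scipy.stats methods not otherwise used in this lab.

```python
import math
from scipy.stats import norm

frozen = norm(loc=1, scale=3)                 # mean 1, standard deviation 3

peak = frozen.pdf(1)                          # pdf evaluated at the mean
expected = 1 / (3 * math.sqrt(2 * math.pi))   # closed-form peak height
print(peak, expected)                         # the two values agree

print(frozen.cdf(1))                          # 0.5: half the mass lies below the mean
print(frozen.mean(), frozen.std())            # 1.0 3.0
```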

We can now plot a histogram of the samples using matplotlib:

(see docs on plt.hist by typing ?plt.hist in a cell by itself)

In [74]:
plt.hist(frozen_norm.rvs(1000), bins=30);

By default the histogram gives us counts. We can re-normalize these counts to get an approximation to the probability distribution from the samples.

In [75]:
plt.hist(frozen_norm.rvs(1000), bins=30, density=True); # density replaces the removed normed argument
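To see what this normalization does, you can compute the same histogram numerically with np.histogram, which takes the same bins and density arguments. This is only a sketch: the seed is arbitrary and the samples are generated with numpy rather than scipy.stats. With density normalization, the bar heights times the bin widths sum to 1.

```python
import numpy as np

rng = np.random.RandomState(0)            # arbitrary seed for repeatability
samples = 1 + 3 * rng.randn(1000)         # normal with mean 1, std 3

counts, edges = np.histogram(samples, bins=30)
density, _ = np.histogram(samples, bins=30, density=True)
widths = np.diff(edges)                   # width of each bin

print(counts.sum())                       # 1000: raw counts add up to the sample size
print((density * widths).sum())           # approximately 1.0: total area under the bars
```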

Part 7: Conclusions

For more practice exercises (with solutions) and discussion, see this page. Some of these exercises are particularly relevant. Check them out!

Don't forget to look up Jake's book.

Finally, we would like to suggest using Chris Albon's web site as a reference. Lots of useful information there.

Additional Stuff

Part 1: Dictionaries

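A dictionary is another storage container. Like a list, a dictionary is a collection of items. Unlike a list, its items are accessed with keys rather than integer positions. (Since Python 3.7 a dictionary remembers the order in which items were inserted, but you still cannot index it by position.)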

Dictionaries are the closest container we have to a database.

Let's make a dictionary with a few Harvard courses and their corresponding enrollment numbers.

In [76]:
enroll2017_dict = {'CS50': 692, 'CS109A / Stat 121A / AC 209A': 352, 'Econ1011a': 95, 'AM21a': 153, 'Stat110': 485}
enroll2017_dict

One can obtain the value corresponding to a key thus:

In [77]:
enroll2017_dict['CS50']

Or thus, which allows for the key to not be in the dictionary

In [78]:
enroll2017_dict.get('CS01', 5), enroll2017_dict.get('CS01')
In [79]:
enroll2017_dict.get('CS50')
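The difference between the two lookup styles only shows when the key is missing. The sketch below uses a shortened, hypothetical copy of the enrollment dictionary (enroll) so that it runs on its own.

```python
enroll = {'CS50': 692, 'Stat110': 485}

print(enroll.get('CS01'))        # None: get returns None for a missing key
print(enroll.get('CS01', 0))     # 0: or the default you supply
print('CS01' in enroll)          # False: membership test on the keys

missing_key = None
try:
    enroll['CS01']               # square brackets raise KeyError instead
except KeyError as err:
    missing_key = err.args[0]
print(missing_key)               # CS01
```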

All sorts of iterations are supported:

In [80]:
enroll2017_dict.values()
In [81]:
enroll2017_dict.items()

We can iterate over the tuples obtained above (to read more about how the print formatting works, look at https://docs.python.org/3/library/stdtypes.html#old-string-formatting and https://docs.python.org/3/tutorial/inputoutput.html):

In [82]:
for key, value in enroll2017_dict.items():
    print("%s: %d" %(key, value))

Simply iterating over a dictionary gives us the keys. This is useful when we want to do something with each item:

In [83]:
second_dict={}
for key in enroll2017_dict:
    second_dict[key] = enroll2017_dict[key]
second_dict

The above is an actual copy to another part of memory, unlike second_dict = enroll2017_dict, which would have made both variables label the same memory location.
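A shorter way to get the same kind of copy is the dictionary's copy method. The sketch below, on a small made-up dictionary (courses), contrasts it with plain assignment.

```python
courses = {'CS50': 692, 'Stat110': 485}

alias = courses            # a second label on the same dictionary
copied = courses.copy()    # a new dictionary with the same items

courses['AM21a'] = 153     # add an entry through the original name

print(alias is courses)    # True
print(len(alias))          # 3: the alias sees the new entry
print(len(copied))         # 2: the copy does not
```

dict(courses) produces the same kind of copy.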

In this example, the keys were strings corresponding to course names. Keys don't have to be strings though.

Like lists, you can construct dictionaries using a dictionary comprehension, which is similar to a list comprehension. Notice the braces {} and the use of zip, which pairs up the elements of two lists one by one.

In [84]:
my_dict = {k:v for (k, v) in zip(int_list, float_list)}
my_dict
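One detail of zip is easy to miss: it stops at the end of the shorter input, so extra elements in the longer list are silently dropped. A minimal sketch with two made-up lists of unequal length:

```python
keys = [1, 2, 3, 4, 5]
vals = [1.0, 3.0]

pairs = list(zip(keys, vals))
print(pairs)               # [(1, 1.0), (2, 3.0)]

small_dict = {k: v for k, v in zip(keys, vals)}
print(small_dict)          # {1: 1.0, 2: 3.0}
```

This is why my_dict above has only as many entries as the shorter of int_list and float_list.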

You can also create dictionaries using the constructor function dict.

In [85]:
dict(a = 1, b = 2)

Part 2: Introduction to Pandas

Often data is stored in comma separated values (CSV) files. For the remainder of this lab, we'll be working with automobile data, where we've extracted relevant parts below.

Note that CSV files can be output by any spreadsheet software, and are plain text, hence are a great way to share data.

Importing data with pandas

Now let's read in our automobile data as a pandas dataframe structure.

In [86]:
import pandas as pd
In [87]:
# Read in the csv files
dfcars=pd.read_csv("data/mtcars.csv")
type(dfcars)
In [88]:
dfcars.head()

What we have now is a spreadsheet with indexed rows and named columns, called a dataframe in pandas. dfcars is an instance of the pd.DataFrame class, created by calling the pd.read_csv "constructor function".

The take-away is that dfcars is a dataframe object, and it has methods (functions) belonging to it. For example, dfcars.head() is a method that shows the first 5 rows of the dataframe.

A pandas dataframe is a set of columns pasted together into a spreadsheet, as shown in the schematic below, which is taken from the cheatsheet above. The columns in pandas are called series objects.

Let's look again at the first five rows of dfcars.

In [89]:
dfcars.head()

Notice the poorly named first column: "Unnamed: 0". Why did that happen?

The first column, which seems to be the name of the car, does not have a name. Here are the first 3 lines of the file:

"","mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"
"Mazda RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
"Mazda RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4

Let's clean that up:

In [90]:
dfcars = dfcars.rename(columns={"Unnamed: 0": "name"})
dfcars.head()

In the above, the argument columns = {"Unnamed: 0": "name"} of rename changed the name of the first column in the dataframe from Unnamed: 0 to name.
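An alternative to renaming is to tell read_csv to use the unnamed first column as the row index, via its index_col argument. The sketch below works on an in-memory, shortened copy of the first lines of the file (only three columns are kept), so it does not need data/mtcars.csv.

```python
import io
import pandas as pd

# shortened copy of the first lines of the file, kept in memory
csv_text = '''"","mpg","cyl"
"Mazda RX4",21,6
"Mazda RX4 Wag",21,6
'''

default = pd.read_csv(io.StringIO(csv_text))
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default.columns))             # ['Unnamed: 0', 'mpg', 'cyl']
print(list(indexed.columns))             # ['mpg', 'cyl']
print(indexed.loc['Mazda RX4', 'mpg'])   # 21
```

With the car names as the index, rows can be looked up by name using .loc. Whether that is preferable to a regular name column depends on what you do next with the data.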

Let's save this cleaned dataframe out to a CSV file.

In [91]:
# don't store the 0,1,2,3,4.. index
dfcars.to_csv("data/cleaned-mtcars.csv", index=False, header=True)

The output will look something like this:

name,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2
Valiant,18.1,6,225.0,105,2.76,3.46,20.22,1,0,3,1

To access a series (column), you can use either dictionary syntax or instance-variable syntax.

Dictionary syntax is very useful when column names have spaces: Python variables cannot have spaces in them.

In [92]:
dfcars.mpg
In [93]:
dfcars['mpg']

You can get a numpy array of values from the Pandas Series:

In [94]:
dfcars.mpg.values

And we can produce a histogram from these values

In [95]:
# the .values isn't really needed; a series behaves like a list for
# plotting purposes
plt.hist(dfcars.mpg.values, bins=20);
plt.xlabel("mpg");
plt.ylabel("Frequency")
plt.title("Miles per Gallon");

But pandas is very cool: you can get a histogram directly:

In [96]:
dfcars.mpg.hist(bins=20);
plt.xlabel("mpg");
plt.ylabel("Frequency")
plt.title("Miles per Gallon");

We can also get sub-dataframes by choosing a set of series. We pass a list of the columns we want as "dictionary keys" to the dataframe.

In [97]:
dfcars[['am', 'mpg']]

Scatter plots

We often want to see co-variation among our columns, for example, miles/gallon versus weight. This can be done with a scatter plot.

In [98]:
plt.scatter(dfcars.wt, dfcars.mpg);
plt.xlabel("weight");
plt.ylabel("miles per gallon");

You could have used plot instead of scatter.

In [99]:
plt.plot(dfcars.wt, dfcars.mpg, 'o');
plt.xlabel("weight");
plt.ylabel("miles per gallon");

Usually we use plt.show() at the end of every plot to display the plot. Our magical incantation %matplotlib inline takes care of this for us, and we don't have to do it in the Jupyter notebook. But if you run your Python program from a file, you will need to explicitly have a call to show. We include it for completeness.

In [100]:
plt.plot(dfcars.wt, dfcars.mpg, 'ko')  #black dots
plt.xlabel("weight");
plt.ylabel("miles per gallon");
plt.show()

Suppose we'd like to save a figure to a file. We do this by including the savefig command in the same cell as the plotting commands. The file extension tells you how the file will be saved.

In [101]:
plt.plot(dfcars.wt, dfcars.mpg, 'o')
plt.xlabel("weight");
plt.ylabel("miles per gallon");
plt.savefig('images/foo1.pdf')
plt.savefig('images/foo1.png', bbox_inches='tight') #less whitespace around image
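Outside the notebook, the same plot can be saved from a plain Python script. The sketch below makes several assumptions: it selects the non-interactive "Agg" backend so that it runs without a display, writes to a temporary directory rather than images/, and uses only the six cars listed in the CSV output above.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")                 # non-interactive backend: no window is opened
import matplotlib.pyplot as plt

# weight and mpg for the six cars shown in the CSV output above
weights = [2.62, 2.875, 2.32, 3.215, 3.44, 3.46]
mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1]

fig, ax = plt.subplots()
ax.plot(weights, mpg, 'ko')           # black dots
ax.set_xlabel("weight")
ax.set_ylabel("miles per gallon")

# the file extension (.png here) selects the output format
out_path = os.path.join(tempfile.mkdtemp(), "mpg_vs_weight.png")
fig.savefig(out_path, bbox_inches="tight")
plt.close(fig)                        # release the figure's memory
print(out_path)
```

The fig and ax objects are matplotlib's object-oriented interface. The plt.plot and plt.xlabel calls used elsewhere in this lab act on the current figure and would work here too.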

And this is what the saved png looks like. Code in Markdown to show this is:

![](images/foo1.png)

Below is a summary of the most commonly used matplotlib plotting routines.

Exercise
Create a scatter plot showing the co-variation between two columns of your choice. Label the axes. See if you can do this without copying and pasting code from earlier in the lab. What can you conclude, if anything, from your scatter plot?

In [102]:
# your code here