Key Word(s): Lists, Dictionaries, Functions, Read Data, NumPy, Pandas, Matplotlib
CS-109A Introduction to Data Science
Lab 1: Introduction to Python and its Numerical Stack¶
Harvard University
Fall 2018
Instructors: Pavlos Protopapas and Kevin Rader
Lab Instructor: Rahul Dave
Authors: Rahul Dave, David Sondak, Will Claybaugh, Pavlos Protopapas
## RUN THIS CELL TO GET THE RIGHT FORMATTING
import requests
from IPython.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
print('hello')
Programming Expectations¶
All assignments for this class will use Python and the browser-based iPython notebook format you are currently viewing. Python experience is not a prerequisite for this course, as long as you are comfortable learning on your own as needed.
Note, though, that programming at the level of CS 50 is a prerequisite for this course. If you have concerns about the prerequisite, please come speak with any of the instructors.
We will refer to the Python 3 documentation in this lab and throughout the course. There are also many introductory tutorials to help build programming skills, which are listed in the last section of this lab.
Learning Goals¶
This introductory lab is a condensed introduction to Python numerical programming. By the end of this lab, you will feel more comfortable:
- Writing short Python code using functions, loops, lists, numpy arrays, and dictionaries.
- Manipulating Python lists and numpy arrays and understanding the difference between them.
- Using probability distributions from scipy.stats.
- Making very simple plots using matplotlib.
- Reading and writing CSV files using pandas.
- Learning and reading Python documentation.
Lab 1 relates to material in lectures 0, 1, 2, and 3 and homework 0.
Part 1: Getting Started¶
Importing modules¶
All notebooks should begin with code that imports modules, which are collections of commonly used Python functions. Below we import the Numpy module, a fast numerical programming library for scientific computing. Future labs will require additional modules, which we'll import with the same import MODULE_NAME as MODULE_NICKNAME syntax.
import numpy as np #imports a fast numerical programming library
Now that Numpy has been imported, we can access some useful functions. For example, we can use mean to calculate the mean of a set of numbers.
np.mean([1.2, 2, 3.3])
This calculates the mean of 1.2, 2, and 3.3.
The code above is not particularly efficient, and efficiency will be important when dealing with large data sets. Later in this lab and in Lab 2, we will see more efficient options.
Calculations and variables¶
# // is integer division
1/2, 1//2, 1.0/2.0, 3*3.2
The last line in a cell is returned as the output value, as above. For cells with multiple lines of results, we can display results using print, as can be seen below.
print(1 + 3.0, "\n", 9, 7)
5/3
We can store integer or floating point values as variables. The other basic Python data types -- booleans, strings, lists -- can also be stored as variables.
(more on types here: http://www.diveintopython3.net/native-datatypes.html)
a = 1
b = 2.0
Here we store a list in a variable (more about what a list is later):
a = [1, 2, 3]
Think of a variable as a label for a value, not a box in which you put the value
(image taken from Fluent Python by Luciano Ramalho)
b = a
b
This DOES NOT create a new copy of a. It merely puts a new label on the object that a refers to, as can be seen by the following code:
print("a", a)
print("b", b)
a[1] = 7
print("a after change", a)
print("b after change", b)
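If an independent copy is what you want, one option is the list's copy method (list(a) works too). The following is a minimal sketch; the names original, alias, and clone are illustrative and not part of the lab's own code.

```python
# Illustrative names; not part of the lab's own code.
original = [1, 2, 3]
alias = original          # a second label on the same list object
clone = original.copy()   # a new list holding the same elements (a shallow copy)

original[1] = 7

print(alias)              # the alias sees the change
print(clone)              # the copy does not
print(alias is original, clone is original)
```

The is operator tests whether two labels point at the same object, which makes it a quick way to tell a label from a copy.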
Tuples
Multiple items on one line in the interface are returned as a tuple, an immutable sequence of Python objects.
a = 1
b = 2.0
a + a, a - b, b * b, 10*a
We can obtain the type of a variable, and use boolean comparisons to test these types.
type(a) == float
type(a) == int
For reference, below are common arithmetic and comparison operations.
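As a rough stand-in for that reference, the sketch below exercises the standard Python arithmetic and comparison operators on two placeholder values. The names x, y, arithmetic, and comparison are assumptions made for illustration and do not come from the lab.

```python
x, y = 7, 2   # placeholder values for illustration

# Arithmetic operators
arithmetic = {
    "x + y": x + y,    # addition
    "x - y": x - y,    # subtraction
    "x * y": x * y,    # multiplication
    "x / y": x / y,    # true division (always returns a float)
    "x // y": x // y,  # floor (integer) division
    "x % y": x % y,    # remainder
    "x ** y": x ** y,  # exponentiation
}

# Comparison operators (each evaluates to True or False)
comparison = {
    "x == y": x == y,
    "x != y": x != y,
    "x < y": x < y,
    "x <= y": x <= y,
    "x > y": x > y,
    "x >= y": x >= y,
}

for label, value in {**arithmetic, **comparison}.items():
    print(label, "->", value)
```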
Exercise: create a tuple called tup with the following elements:
- The first element is an integer of your choice
- The second element is a float of your choice
- The third element is the sum of the first two elements
- The fourth element is the difference of the first two elements
- The fifth element is first element divided by the second element
Display the output of tup. What is the type of the variable tup? What happens if you try to change an item in the tuple?
# your code here
Part 2: Lists¶
Much of Python is based on the notion of a list. In Python, a list is a sequence of items separated by commas, all within square brackets. The items can be integers, floating points, or another type. Unlike in C arrays, items in a Python list can be different types, so Python lists are more versatile than traditional arrays in C or in other languages.
Let's start out by creating a few lists.
empty_list = []
float_list = [1., 3., 5., 4., 2.]
int_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
mixed_list = [1, 2., 3, 4., 5]
print(empty_list)
print(int_list)
print(mixed_list, float_list)
Lists in Python are zero-indexed, as in C. The first entry of the list has index 0, the second has index 1, and so on.
print(int_list[0])
print(float_list[1])
What happens if we try to use an index that doesn't exist for that list? Python will complain!
print(float_list[10])
You can find the length of a list using the built-in function len:
print(float_list)
len(float_list)
Indexing on lists¶
And since Python is zero-indexed, the last element of float_list is
float_list[len(float_list)-1]
It is more idiomatic in Python to use -1 for the last element, -2 for the second last, and so on:
float_list[-1]
We can use the : operator to access a subset of the list. This is called slicing.
print(float_list[1:5])
print(float_list[0:2])
Below is a summary of list slicing operations:
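As a rough sketch of the most common slice forms, the example below uses a throwaway list named letters (an assumption for illustration, not a variable from the lab). The general form is list[start:stop:step], where stop is excluded and any of the three parts may be omitted.

```python
letters = ["a", "b", "c", "d", "e"]   # throwaway list for illustration

print(letters[1:3])    # from index 1 up to, but not including, index 3
print(letters[:2])     # the first two elements
print(letters[2:])     # everything from index 2 onward
print(letters[-2:])    # the last two elements
print(letters[::2])    # every second element
print(letters[::-1])   # a reversed copy
print(letters[:])      # a shallow copy of the whole list
```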
You can slice "backwards" as well:
float_list[:-2] # up to second last
float_list[:4] # up to but not including 5th element
You can also slice with a stride:
float_list[:4:2] # above but skipping every second element
We can iterate through a list using a loop. Here's a for loop.
for ele in float_list:
    print(ele)
Or, if we like, we can iterate through a list by index, using a for loop with in range. This is not idiomatic and is not recommended, but it accomplishes the same thing as above.
for i in range(len(float_list)):
    print(float_list[i])
What if you wanted the index as well?
Use the built-in Python function enumerate, which yields tuples of the form (index, value).
for i, ele in enumerate(float_list):
    print(i, ele)
# or make a list from it using the list constructor
list(enumerate(float_list))
Appending and deleting¶
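We can also append items to the end of the list using the + operator or with append. Note that + returns a new list and leaves float_list unchanged, whereas append modifies the list in place.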
float_list + [.333]
float_list.append(.444)
print(float_list)
len(float_list)
Go and run the cell with float_list.append a second time. Then run the next line. What happens?
To remove an item from the list, use del.
del(float_list[2])
print(float_list)
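Besides del, Python lists offer the pop and remove methods for deleting items. The sketch below uses an illustrative list named values, kept separate from float_list so that the lab's own cells are unaffected.

```python
values = [10, 20, 30, 20, 40]   # illustrative list, separate from float_list

last = values.pop()      # removes and returns the last item
second = values.pop(1)   # removes and returns the item at index 1
values.remove(20)        # removes the first remaining item equal to 20

print(last, second, values)
```

Roughly speaking, del and pop work by position, while remove works by value; pop is the one to reach for when the removed item is still needed.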
List Comprehensions¶
Lists can be constructed in a compact way using a list comprehension. Here's a simple example.
squaredlist = [i*i for i in int_list]
squaredlist
And here's a more complicated one, requiring a conditional.
comp_list1 = [2*i for i in squaredlist if i % 2 == 0]
print(comp_list1)
This is entirely equivalent to creating comp_list1 using a loop with a conditional, as below:
comp_list2 = []
for i in squaredlist:
    if i % 2 == 0:
        comp_list2.append(2*i)
print(comp_list2)
The list comprehension syntax
[expression for item in list if conditional]
is equivalent to the syntax
for item in list:
    if conditional:
        expression
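Strictly speaking, the loop form also needs a result list that collects each value of the expression. A small sketch of that correspondence, with made-up names (numbers, from_comprehension, from_loop):

```python
numbers = [1, 2, 3, 4, 5, 6]   # made-up data for illustration

# Comprehension: keep the odd numbers and add 10 to each
from_comprehension = [n + 10 for n in numbers if n % 2 == 1]

# Loop form: the expression is appended to a list created beforehand
from_loop = []
for n in numbers:
    if n % 2 == 1:
        from_loop.append(n + 10)

print(from_comprehension)
print(from_comprehension == from_loop)
```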
- Using for loops and conditional if statements.
- (Stretch Goal) Using a list comprehension. You should be able to do this in one line of code, and it may be helpful to look up the function all in the documentation.
# your code here
# your code here
Part 4: Simple Functions¶
A function object is a reusable block of code that does a specific task. Functions are all over Python, either on their own or on other objects. To invoke a function func, you call it as func(arguments).
We've seen built-in Python functions and methods. For example, len and print are built-in Python functions. And at the beginning, you called np.mean to calculate the mean of three numbers, where mean is a function in the numpy module and numpy was abbreviated as np. This syntax allows us to have multiple "mean" functions in different modules; calling this one as np.mean guarantees that we will pick up numpy's mean function, as opposed to a mean function from a different module.
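To make the namespace point concrete, the sketch below calls two different functions named mean: one from the standard library's statistics module and one from numpy. The variable name data is an illustrative assumption.

```python
import statistics
import numpy as np

data = [1.2, 2, 3.3]   # the same numbers used at the start of the lab

# Two different functions that happen to share the name "mean"
print(statistics.mean(data))  # from the standard library's statistics module
print(np.mean(data))          # from numpy
```

Because each call is prefixed with its module name, there is no ambiguity about which implementation runs.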
Methods¶
A function that belongs to an object is called a method. By "object" here we mean an "instance" of a list, or integer, or floating point variable.
An example of this is append on an existing list. In other words, a method is a function on an instance of a type of object (also called a class; in this case, the list type).
float_list = [1.0, 2.09, 4.0, 2.0, 0.444]
print(float_list)
float_list.append(56.7)
float_list
User-defined functions¶
We'll now learn to write our own user-defined functions. Below is the syntax for defining a basic function with one input argument and one output. You can also define functions with no input or output arguments, or multiple input or output arguments.
def name_of_function(arg):
    ...
    return(output)
We can write functions with one input and one output argument. Here are two such functions.
def square(x):
    x_sqr = x*x
    return(x_sqr)
def cube(x):
    x_cub = x*x*x
    return(x_cub)
square(5),cube(5)
What if you want to return two variables at a time? The usual way is to return a tuple:
def square_and_cube(x):
    x_cub = x*x*x
    x_sqr = x*x
    return(x_sqr, x_cub)
square_and_cube(5)
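A returned tuple can be unpacked straight into separate variables. In the sketch below the function is repeated so the example runs on its own; the names sq and cb are illustrative.

```python
def square_and_cube(x):
    # Same behavior as the function above, repeated so this sketch is self-contained
    return (x * x, x * x * x)

# Unpack the returned tuple into two variables
sq, cb = square_and_cube(5)
print(sq, cb)
```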
Lambda functions¶
Often we quickly define mathematical functions with a one-line function called a lambda function. Lambda functions are great because they enable us to write functions without having to name them, i.e., they're anonymous.
No return statement is needed.
# create an anonymous function and assign it to the variable square
square = lambda x: x*x
print(square(3))
hypotenuse = lambda x, y: (x*x + y*y)**0.5
## Same as
# def hypotenuse(x, y):
#     return((x*x + y*y)**0.5)
hypotenuse(3,4)
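Anonymous functions are often most useful when passed directly to another function. As one hedged illustration, the sketch below hands a lambda to the built-in sorted through its key argument; the list points and the name by_distance are made up for the example.

```python
points = [(3, 4), (1, 1), (0, 5), (2, 2)]   # illustrative data

# Sort by squared distance from the origin, using an anonymous key function
by_distance = sorted(points, key=lambda p: p[0] * p[0] + p[1] * p[1])
print(by_distance)
```

The lambda is never given a name here, which is the situation where it reads more naturally than a def.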
Refactoring using functions¶
In an exercise from Lab 0, you wrote code that generated a list of the prime numbers between 1 and 100. For the exercise below, it may help to revisit that code.
# your code here
my_array = np.array([1, 2, 3, 4])
my_array
# works as in lists
len(my_array)
The shape attribute of an array is very useful (we'll see more of it later when we talk about 2D and higher dimensional arrays).
my_array.shape
Numpy arrays are typed. This means that, by default, all the elements are assumed to be of one type:
my_array.dtype
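One consequence of typing is that numpy picks a single type able to hold every element. The sketch below shows this on a few small arrays; the names ints, mixed, and as_float are illustrative assumptions.

```python
import numpy as np

ints = np.array([1, 2, 3])                    # all integers: an integer dtype
mixed = np.array([1, 2.5, 3])                 # one float forces a float array
as_float = np.array([1, 2, 3], dtype=float)   # the type can also be requested explicitly

print(ints.dtype, mixed.dtype, as_float.dtype)
print(mixed)
```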
Numpy arrays are listy (i.e. they act like lists)! Below we compute length, slice, and iterate.
print(len(my_array))
print(my_array[2:4])
for ele in my_array:
    print(ele)
In general you should manipulate numpy arrays by using numpy module functions (np.mean, for example). This is for efficiency purposes, and a discussion about this will happen in Lab 2.
You can calculate the mean of the array elements either by calling the method .mean on a numpy array or by applying the function np.mean with the numpy array as an argument.
print(my_array.mean())
print(np.mean(my_array))
The way we constructed the numpy array above seems redundant. After all we already had a regular python list. Indeed, it is the other ways we have to construct numpy arrays that make them super useful.
There are many such numpy array constructors. Here are some commonly used constructors. Look them up in the documentation.
np.ones(10) # generates 10 floating point ones
Numpy gains a lot of its efficiency from being typed. That is, all elements in the array have the same type, such as integer or floating point. The default type, as can be seen above, is a 64-bit float (float64).
np.dtype(float).itemsize # in bytes
np.ones(10, dtype='int') # generates 10 integer ones
np.zeros(10)
Often you will want random numbers. Use the random constructor!
np.random.random(10) # uniform on [0, 1)
You can generate random numbers from a normal distribution with mean 0 and variance 1:
normal_array = np.random.randn(1000)
print("The sample mean and standard deviation are %f and %f, respectively." %(np.mean(normal_array), np.std(normal_array)))
You can sample with and without replacement from an array. Let's first construct a grid:
grid = np.arange(0., 1.01, 0.1)
grid
Without replacement:
np.random.choice(grid, 5, replace=False)
np.random.choice(grid, 20, replace=False) # raises ValueError: grid has only 11 elements
With replacement:
np.random.choice(grid, 12, replace=True)
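Each run of the cells above gives different samples. If reproducible draws are wanted, one option (assuming NumPy 1.17 or later) is a seeded generator from np.random.default_rng. In the sketch below the seed value 109 and the names rng_a, rng_b, sample_a, and sample_b are arbitrary choices for illustration.

```python
import numpy as np

grid = np.arange(0., 1.01, 0.1)   # same grid as above

# Two generators created with the same (arbitrary) seed
rng_a = np.random.default_rng(109)
rng_b = np.random.default_rng(109)

sample_a = rng_a.choice(grid, 5, replace=False)
sample_b = rng_b.choice(grid, 5, replace=False)

print(sample_a)
print(np.array_equal(sample_a, sample_b))   # same seed, same draws
```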
Numpy supports vector operations¶
What does this mean? It means that instead of adding two arrays, element by element, you can just say: add the two arrays. Note that this behavior is very different from python lists.
first = np.ones(5)
second = np.ones(5)
first + second
first_list = [1., 1., 1., 1., 1.]
second_list = [1., 1., 1., 1., 1.]
first_list + second_list # not what you want: this concatenates the two lists
On some computer chips this addition actually happens in parallel, so speedups can be high. But even on regular chips, the advantage of greater readability is important.
Numpy supports a concept known as broadcasting, which dictates how arrays of different sizes are combined together. There are too many rules to list here, but importantly, multiplying an array by a number multiplies each element by the number. Adding a number adds the number to each element.
first + 1
first*5
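Broadcasting also applies between arrays of different shapes, not just between an array and a number. As a small sketch (the names table and row are illustrative), adding a length-3 array to a 2-by-3 array adds it to each row:

```python
import numpy as np

table = np.arange(6).reshape(2, 3)   # shape (2, 3): [[0, 1, 2], [3, 4, 5]]
row = np.array([10, 20, 30])         # shape (3,)

# The row is stretched across both rows of the table
result = table + row
print(result)
```

This works because the trailing dimensions match (3 and 3); shapes that cannot be lined up this way raise a ValueError.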
This means that if you wanted samples from a normal distribution with mean 5 and standard deviation 7, $N(5, 7^2)$, you could do:
normal_5_7 = 5 + 7*normal_array
np.mean(normal_5_7), np.std(normal_5_7)
Multiplying two arrays multiplies them element-by-element
(first +1) * (first*5)
You might have wanted to compute the dot product instead:
np.dot((first +1) , (first*5))
You can also use the @ operator for this purpose
(first +1) @ (first*5)
Part 6: Probability Distributions from scipy.stats¶
Since we'll be using many distributions, we'll want to access the pdf/pmf functions of these distributions and obtain samples from them. We already saw how to obtain samples from the continuous uniform and normal distributions. But we might want to obtain their pdfs as well.
scipy.stats allows us to obtain the pdf function as well as samples. The programming interface is identical for all the distributions.
To plot samples from and the pdfs of these distributions, we'll first import matplotlib, Python's plotting library.
The %matplotlib inline incantation ensures that plots are rendered inline in the browser.
%matplotlib inline
import matplotlib.pyplot as plt
Let's get the normal distribution namespace from scipy.stats. See the scipy.stats.norm documentation for details.
from scipy.stats import norm
Let's create 1000 points between -10 and 10:
x = np.linspace(-10, 10, 1000)
x[:10], x[-10:]
Let's get the pdf of a normal distribution with a mean of 1 and standard deviation 3 and plot it using the grid points computed before:
pdf_x = norm.pdf(x, 1, 3)
plt.plot(x, pdf_x);
And you can draw random samples using the rvs function.
norm.rvs(size=30, loc=1, scale=3)
We can use a more instance based way of getting both the pdf and samples. The documentation calls this instance a "frozen" distribution:
frozen_norm = norm(loc=1, scale=3)
type(frozen_norm)
plt.plot(x, frozen_norm.pdf(x))
frozen_norm.rvs(10)
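A frozen distribution exposes more than pdf and rvs. The sketch below, which rebuilds the same distribution under the illustrative name frozen, also queries its mean, standard deviation, cumulative distribution function (cdf), and inverse cdf (ppf).

```python
from scipy.stats import norm

frozen = norm(loc=1, scale=3)   # same parameters as frozen_norm above

print(frozen.mean(), frozen.std())   # the distribution's mean and standard deviation
print(frozen.cdf(1))                 # probability of a value at or below the mean
print(frozen.ppf(0.5))               # inverse of the cdf: the median
```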
We can now plot a histogram of the samples using matplotlib (see the docs on plt.hist by typing ?plt.hist in a cell by itself):
plt.hist(frozen_norm.rvs(1000), bins=30);
By default the histogram gives us counts. We can re-normalize these counts to get an approximation to the probability distribution from the samples.
plt.hist(frozen_norm.rvs(1000), bins=30, density=True); # density replaces the removed normed argument
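One way to check the normalization numerically is to confirm that bar heights times bin widths sum to 1. The sketch below uses np.histogram with its density=True argument, which is the binning routine plt.hist relies on; the seed and the names samples, heights, edges, and widths are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import norm

# 1000 draws from the same normal distribution; the seed 0 is arbitrary
samples = norm.rvs(loc=1, scale=3, size=1000, random_state=0)

# np.histogram computes the bin heights without drawing anything
heights, edges = np.histogram(samples, bins=30, density=True)
widths = np.diff(edges)

print(np.sum(heights * widths))   # total area under the normalized histogram
```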
Part 7: Conclusions¶
For more practice exercises (with solutions) and discussion, see this page. Some of these exercises are particularly relevant. Check them out!
Don't forget to look up Jake's book.
Finally, we would like to suggest using Chris Albon's web site as a reference. Lots of useful information there.
Additional Stuff¶
Part 1: Dictionaries¶
A dictionary is another storage container. Like a list, a dictionary is a collection of items. Unlike a list, its items are accessed with keys and not integer positions. (Since Python 3.7, dictionaries preserve insertion order, but they still cannot be indexed by position.)
Dictionaries are the closest container we have to a database.
Let's make a dictionary with a few Harvard courses and their corresponding enrollment numbers.
enroll2017_dict = {'CS50': 692, 'CS109A / Stat 121A / AC 209A': 352, 'Econ1011a': 95, 'AM21a': 153, 'Stat110': 485}
enroll2017_dict
One can obtain the value corresponding to a key thus:
enroll2017_dict['CS50']
Or thus, which allows for the key to not be in the dictionary (get returns the supplied default, or None if no default is given):
enroll2017_dict.get('CS01', 5), enroll2017_dict.get('CS01')
enroll2017_dict.get('CS50')
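For contrast, the sketch below shows what happens with square brackets when a key is missing, along with the in operator for testing membership. It uses a shortened, illustrative dictionary named enrollment rather than the lab's enroll2017_dict.

```python
enrollment = {'CS50': 692, 'Stat110': 485}   # a shortened, illustrative copy

print('CS50' in enrollment)    # membership test on keys
print('CS01' in enrollment)

try:
    enrollment['CS01']         # square brackets raise KeyError for a missing key
except KeyError as err:
    print("missing key:", err)

print(enrollment.get('CS01', 0))   # get returns the supplied default instead
```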
All sorts of iterations are supported:
enroll2017_dict.values()
enroll2017_dict.items()
We can iterate over the tuples obtained above (to read more about how the print formatting works, look at https://docs.python.org/3/library/stdtypes.html#old-string-formatting and https://docs.python.org/3/tutorial/inputoutput.html):
for key, value in enroll2017_dict.items():
    print("%s: %d" %(key, value))
Simply iterating over a dictionary gives us the keys. This is useful when we want to do something with each item:
second_dict={}
for key in enroll2017_dict:
    second_dict[key] = enroll2017_dict[key]
second_dict
The above makes an actual copy in another part of memory, unlike second_dict = enroll2017_dict, which would have made both variables label the same memory location.
In this example, the keys were strings corresponding to course names. Keys don't have to be strings though.
Like lists, dictionaries can be constructed with a comprehension, which is similar to a list comprehension. Notice the braces {} and the use of zip, which is another iterator that pairs up the elements of two lists.
my_dict = {k:v for (k, v) in zip(int_list, float_list)}
my_dict
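One detail worth knowing about zip is that it stops at the end of the shorter input, so a dictionary built this way has only as many entries as the shorter list. A small sketch with made-up lists (keys, vals):

```python
keys = [1, 2, 3, 4]   # made-up data for illustration
vals = ['a', 'b']

pairs = list(zip(keys, vals))   # zip stops at the end of the shorter input
print(pairs)

lookup = {k: v for k, v in zip(keys, vals)}
print(lookup)
```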
You can also create dictionaries using the constructor function dict.
dict(a = 1, b = 2)
Part 2: Introduction to Pandas¶
Often data is stored in comma separated values (CSV) files. For the remainder of this lab, we'll be working with automobile data, where we've extracted relevant parts below.
Note that CSV files can be output by any spreadsheet software, and are plain text, hence are a great way to share data.
Importing data with pandas¶
Now let's read in our automobile data as a pandas dataframe structure.
import pandas as pd
# Read in the csv files
dfcars=pd.read_csv("data/mtcars.csv")
type(dfcars)
dfcars.head()
What we have now is a spreadsheet with indexed rows and named columns, called a dataframe in pandas. dfcars is an instance of the pd.DataFrame class, created by calling the pd.read_csv "constructor function".
The take-away is that dfcars is a dataframe object, and it has methods (functions) belonging to it. For example, dfcars.head() is a method that shows the first 5 rows of the dataframe.
A pandas dataframe is a set of columns pasted together into a spreadsheet, as shown in the schematic below, which is taken from the cheatsheet above. The columns in pandas are called series objects.
Let's look again at the first five rows of dfcars.
dfcars.head()
Notice the poorly named first column: "Unnamed: 0". Why did that happen?
The first column, which seems to be the name of the car, does not have a name. Here are the first 3 lines of the file:
"","mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"
"Mazda RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
"Mazda RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4
Let's clean that up:
dfcars = dfcars.rename(columns={"Unnamed: 0": "name"})
dfcars.head()
In the above, the argument columns = {"Unnamed: 0": "name"} of rename changed the name of the first column in the dataframe from Unnamed: 0 to name.
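An alternative to renaming, sketched here under the assumption that the car names are better treated as row labels than as a data column, is to pass index_col=0 to pd.read_csv. The example reads the three lines quoted earlier from an in-memory string instead of the lab's data file; the names raw and cars_by_name are illustrative.

```python
import io
import pandas as pd

# The first three lines of the file, as quoted earlier
raw = '''"","mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"
"Mazda RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
"Mazda RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4
'''

# index_col=0 uses the unnamed first column as row labels
cars_by_name = pd.read_csv(io.StringIO(raw), index_col=0)

print(cars_by_name.index.tolist())
print(cars_by_name.loc["Mazda RX4 Wag", "wt"])
```

With the names in the index, rows can be looked up by car name through .loc, at the cost of the name no longer being an ordinary column.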
Let's save this cleaned dataframe out to a CSV file.
# don't store the 0,1,2,3,4.. index
dfcars.to_csv("data/cleaned-mtcars.csv", index=False, header=True)
The output will look something like this:
name,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2
Valiant,18.1,6,225.0,105,2.76,3.46,20.22,1,0,3,1
To access a series (column), you can use either dictionary syntax or instance-variable syntax.
Dictionary syntax is very useful when column names have spaces: Python variables cannot have spaces in them.
dfcars.mpg
dfcars['mpg']
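To illustrate the point about spaces, the sketch below builds a tiny made-up dataframe; the column name "miles per gallon" is hypothetical and does not appear in the mtcars file.

```python
import pandas as pd

# Small made-up frame; "miles per gallon" is not a column in mtcars
df = pd.DataFrame({"miles per gallon": [21.0, 22.8], "cyl": [6, 4]})

print(df["miles per gallon"])   # dictionary syntax handles the spaces
print(df.cyl)                   # attribute syntax works for simple names
```

Dictionary syntax is also the safer choice when a column name collides with an existing dataframe attribute or method.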
You can get a numpy array of values from the Pandas Series:
dfcars.mpg.values
And we can produce a histogram from these values
# the .values isn't really needed; a Series behaves like a list for
# plotting purposes
plt.hist(dfcars.mpg.values, bins=20);
plt.xlabel("mpg");
plt.ylabel("Frequency")
plt.title("Miles per Gallon");
But pandas is very cool: you can get a histogram directly:
dfcars.mpg.hist(bins=20);
plt.xlabel("mpg");
plt.ylabel("Frequency")
plt.title("Miles per Gallon");
We can also get sub-dataframes by choosing a set of series. We pass a list of the columns we want as "dictionary keys" to the dataframe.
dfcars[['am', 'mpg']]
Scatter plots¶
We often want to see co-variation among our columns, for example, miles/gallon versus weight. This can be done with a scatter plot.
plt.scatter(dfcars.wt, dfcars.mpg);
plt.xlabel("weight");
plt.ylabel("miles per gallon");
You could have used plot instead of scatter.
plt.plot(dfcars.wt, dfcars.mpg, 'o');
plt.xlabel("weight");
plt.ylabel("miles per gallon");
Usually we use plt.show() at the end of every plot to display the plot. Our magical incantation %matplotlib inline takes care of this for us, and we don't have to do it in the Jupyter notebook. But if you run your Python program from a file, you will need to explicitly have a call to show. We include it for completeness.
plt.plot(dfcars.wt, dfcars.mpg, 'ko') #black dots
plt.xlabel("weight");
plt.ylabel("miles per gallon");
plt.show()
Suppose we'd like to save a figure to a file. We do this by including the savefig command in the same cell as the plotting commands. The file extension determines the format in which the file is saved.
plt.plot(dfcars.wt, dfcars.mpg, 'o')
plt.xlabel("weight");
plt.ylabel("miles per gallon");
plt.savefig('images/foo1.pdf')
plt.savefig('images/foo1.png', bbox_inches='tight') #less whitespace around image
And this is what the saved png looks like. Code in Markdown to show this is:
![](images/foo1.png)
Below is a summary of the most commonly used matplotlib plotting routines.
# your code here