CS-109A Introduction to Data Science
Lab 1: Introduction to Python and its Numerical Stack¶
Harvard University
Fall 2019
Instructors: Pavlos Protopapas, Kevin Rader, and Chris Tanner
Lab Instructor: Eleni Kaxiras
Authors: Rahul Dave, David Sondak, Will Claybaugh, Pavlos Protopapas, Chris Tanner, and Eleni Kaxiras
## RUN THIS CELL TO GET THE RIGHT FORMATTING
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
PATHTOSOLUTIONS = '../solutions'
Programming Expectations¶
All assignments for this class will use Python and the browser-based iPython notebook format you are currently viewing. Programming at the level of CS 50 is a prerequisite for this course. If you have concerns about this, come speak with any of the instructors.
We will refer to the Python 3 documentation in this lab and throughout the course.
Learning Goals¶
This introductory lab is a condensed introduction to Python numerical programming. By the end of this lab, you will feel more comfortable:
- Setting up your own Anaconda environment with the necessary dependencies.
- Writing short Python code using functions, loops, lists, numpy arrays, and dictionaries.
- Manipulating Python lists and numpy arrays and understanding the difference between them.
- Using the stats libraries scipy.stats and statsmodels.
Part 1: Set up a Conda Python Environment and Clone the Class Repository¶
On Python installation packages¶
There are two main package installers for Python, conda and pip. Pip is the Python Packaging Authority's recommended tool for installing packages from the Python Package Index (PyPI). Conda is a cross-platform package and environment manager that installs and manages conda packages from the Anaconda repository and Anaconda Cloud. Conda does not assume any specific configuration on your computer and will install the Python interpreter along with the other Python packages, whereas pip assumes that a Python interpreter is already installed. Since many operating systems ship with Python, this is usually not a problem.
To summarize their differences in one sentence: conda can create isolated environments that contain different versions of Python and/or of the packages installed in them. This can be extremely useful when working with data science tools, as different tools may have conflicting requirements that prevent them from all being installed into a single environment. You can also have isolated environments with pip, but you would have to use a tool such as virtualenv or venv. You may use either; we recommend conda because in our experience it leads to fewer incompatibilities between packages and thus fewer broken environments.
Conclusion: Use Both. Most often in our data science environments we want to combine pip with conda when one or more packages are only available via pip. Thousands of packages are available in the Anaconda repository, including the most popular data science, machine learning, and AI frameworks, but many more are available on PyPI. Even if you created your environment with conda, you can use pip to install individual packages.
Installing Conda¶
- First check if you have conda¶
In MacOS or Linux open a Terminal window and at the prompt type
conda -V
If you get the version number (e.g. conda 4.6.14) you are all set! If you get an error, that means you do not have Anaconda, and it would be a good idea to install it.
- If you do not have it, you can install it by following the instructions:¶
Mac : https://docs.anaconda.com/anaconda/install/mac-os/
Windows : https://docs.anaconda.com/anaconda/install/windows (Note: #8 is important: DO NOT add to your path. The reason is that Windows paths may include spaces, which clashes with the way conda understands paths.)
- If you do have anaconda consider upgrading it so you get the latest version of the packages:¶
conda update conda
Conda allows you to work in 'computing sandboxes' called environments. You may have environments installed on your computer to access different versions of Python and different libraries to avoid conflict between libraries which can cause errors.
NOTE (Sept.6, 2019):¶
If you are still having issues please check the Announcements and the Discussion Forum (Ed) via the 2019-CS109a Canvas site
Also please check the latest version of the cs109a.yml file. We have edited it as of today.
What are environments and do I need them?¶
Environments in Python are like sandboxes that have different versions of Python and/or packages installed in them. You can create, export, list, remove, and update environments. Switching or moving between environments is called activating the environment. When you are done with an environment you may deactivate it.
For this class we want to have a bit more control over the packages that will be installed with the environment, so we will create an environment from a so-called YAML file named cs109a.yml. Originally YAML was said to mean Yet Another Markup Language, referencing its purpose as a markup language with the yet another construct, but it was then repurposed as YAML Ain't Markup Language [source: wikipedia]. The cs109a.yml file is included in the Lab directory in the class git repository.
Creating an environment from an environment.yml file¶
Using your browser, visit the class git repository https://github.com/Harvard-IACS/2019-CS109A
Go to content --> labs --> lab1 and look for the cs109a.yml file. Download it to a local directory on your computer.
Then in the Terminal again type
conda env create -f {PATH-TO-FILE}/cs109a.yml
Activate the new environment:¶
source activate cs109a
You should see the name of the environment at the start of your command prompt in parentheses.
Verify that the new environment was installed correctly:¶
conda list
This will give you a list of the packages installed in this environment.
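As a complementary check from inside Python, the sketch below reports which interpreter is running and whether a few packages can be found. The helper name check_packages and the package list are illustrative assumptions, not part of the course materials; adjust the list to match what cs109a.yml installs.

```python
import sys
import importlib.util

def check_packages(names):
    """Return a dict mapping each package name to True if Python can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

# When the cs109a environment is active, this path should point inside it.
print(sys.executable)

# Hypothetical package list, chosen only for illustration.
print(check_packages(["numpy", "scipy", "statsmodels"]))
```

A package that maps to False is not visible to the interpreter that is currently running, which usually means the environment was not activated before starting Python.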
Clone the class repository¶
In the Terminal type:
git clone https://github.com/Harvard-IACS/2019-CS109A.git
Starting the Jupyter Notebook¶
Once all is installed go in the Terminal and type
jupyter notebook
to start the jupyter notebook server. This will spawn a process that will keep running in the Terminal window until you are done working with the notebook. When you are done, press control-C to stop it.
Starting the notebook will bring up a browser window with your file structure. Look for the 2019-CS109A folder. It should be where you cloned it previously. When you return to this folder in the future, go to its top level in the Terminal and type
git pull
This will update the contents of the folder with whatever is new. Make sure you are at the top level of the folder by typing
pwd
which should print a path ending in /2019-CS109A
For more on using the Notebook see: https://jupyter-notebook.readthedocs.io/en/latest/
Part 2: Getting Started with Python¶
Importing modules¶
All notebooks should begin with code that imports modules, collections of built-in, commonly-used Python functions. Below we import the Numpy module, a fast numerical programming library for scientific computing. Future labs will require additional modules, which we'll import with the same syntax.
import MODULE_NAME as MODULE_NICKNAME
import numpy as np #imports a fast numerical programming library
Now that Numpy has been imported, we can access some useful functions. For example, we can use mean to calculate the mean of a set of numbers.
my_list = [1.2, 2, 3.3]
np.mean(my_list)
Calculations and variables¶
# // is integer division
1/2, 1//2, 1.0/2, 3*3.2
The last line in a cell is returned as the output value, as above. For cells with multiple lines of results, we can display results using print, as can be seen below.
print(1 + 3.0, "\n", 9, 7)
5/3
We can store integer or floating point values as variables. The other basic Python data types -- booleans, strings, lists -- can also be stored as variables.
a = 1
b = 2.0
Here we store a list in a variable:
a = [1, 2, 3]
Think of a variable as a label for a value, not a box in which you put the value
(image: Fluent Python by Luciano Ramalho)
b = a
b
This DOES NOT create a new copy of a. It merely puts a new label on the same object in memory, as can be seen by the following code:
print("a", a)
print("b", b)
a[1] = 7
print("a after change", a)
print("b after change", b)
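To contrast a second label with a real copy, here is a minimal sketch. The name c and the use of the list method copy() are additions for illustration; the is operator tests whether two names refer to the same object.

```python
a = [1, 2, 3]
b = a          # b is a second label for the same list object
c = a.copy()   # c is a new list with the same contents (a shallow copy)

a[1] = 7       # change the list through the label a

print(b is a)  # True: same object
print(c is a)  # False: different object
print(b)       # [1, 7, 3]
print(c)       # [1, 2, 3]
```

The change made through a is visible through b but not through c, because only c owns separate storage.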
Tuples
Multiple items on one line in the interface are returned as a tuple, an immutable sequence of Python objects. See the end of this notebook for an interesting use of tuples.
a = 1
b = 2.0
a + a, a - b, b * b, 10*a
type()¶
We can obtain the type of a variable, and use boolean comparisons to test these types. VERY USEFUL when things go wrong and you cannot understand why this method does not work on a specific variable!
type(a) == float
type(a) == int
type(a)
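A related built-in is isinstance, which tests a value against one type or against a tuple of types. The sketch below is only an illustration of that alternative to comparing type(a) directly.

```python
a = 1
b = 2.0

print(isinstance(a, int))           # True
print(isinstance(b, int))           # False
print(isinstance(b, (int, float)))  # True: b matches one of the listed types
```

Passing a tuple of types is convenient when a function should accept any kind of number.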
For reference, below are common arithmetic and comparison operations.
Exercise 1: Create a tuple called tup with the following five elements:
- The first element is an integer of your choice
- The second element is a float of your choice
- The third element is the sum of the first two elements
- The fourth element is the difference of the first two elements
- The fifth element is the first element divided by the second element
Display the output of tup. What is the type of the variable tup? What happens if you try to change an item in the tuple?
# your code here
tup = (1,1.1,1+1.1,1-1.1,1/1.1)
print(tup)
print(type(tup))
# TO RUN THE SOLUTIONS
# 1. uncomment the first line of the cell below so you have just %load
# 2. Run the cell AGAIN to execute the python code, it will not run when you execute the %load command!!
# %load ../solutions/exercise1.py
a = 3
b = 4.0
c = a + b
d = a - b
e = a / b
tup = (a, b, c, d, e)
tup
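The last question of the exercise can be explored with the sketch below, which catches the error raised on assignment. The variable names error_message and new_tup are chosen here for illustration.

```python
tup = (3, 4.0, 7.0, -1.0, 0.75)

try:
    tup[0] = 10  # tuples do not support item assignment
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)

# To "change" a tuple, build a new one instead.
new_tup = (10,) + tup[1:]
print(new_tup)
```

The original tuple is untouched; new_tup is a separate object assembled from a one-element tuple and a slice of the old one.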
Lists¶
Much of Python is based on the notion of a list. In Python, a list is a sequence of items separated by commas, all within square brackets. The items can be integers, floating points, or another type. Unlike in C arrays, items in a Python list can be different types, so Python lists are more versatile than traditional arrays in C or other languages.
Let's start out by creating a few lists.
empty_list = []
float_list = [1., 3., 5., 4., 2.]
int_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
mixed_list = [1, 2., 3, 4., 5]
print(empty_list)
print(int_list)
print(mixed_list, float_list)
Lists in Python are zero-indexed, as in C. The first entry of the list has index 0, the second has index 1, and so on.
print(int_list[0])
print(float_list[1])
What happens if we try to use an index that doesn't exist for that list? Python will complain!
print(float_list[10])
You can find the length of a list using the built-in function len:
print(float_list)
len(float_list)
Indexing on lists plus Slicing¶
And since Python is zero-indexed, the last element of float_list is
float_list[len(float_list)-1]
It is more idiomatic in Python to use -1 for the last element, -2 for the second last, and so on:
float_list[-1]
We can use the : operator to access a subset of the list. This is called slicing.
print(float_list[1:5])
print(float_list[0:2])
Below is a summary of list slicing operations:
lst = ['hi', 7, 'c', 'cat', 'hello', 8]
lst[:2]
You can slice "backwards" as well:
float_list[:-2] # up to second last
float_list[:4] # up to but not including 5th element
You can also slice with a stride:
float_list[:4:2] # above but skipping every second element
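The general slice form is list[start:stop:step], where stop is excluded and any of the three parts may be omitted. The sketch below collects a few cases on a fresh copy of float_list so that it can be run on its own.

```python
float_list = [1., 3., 5., 4., 2.]

print(float_list[1:4])    # [3.0, 5.0, 4.0]  (indices 1, 2, 3)
print(float_list[::2])    # [1.0, 5.0, 2.0]  (every second element)
print(float_list[-2:])    # [4.0, 2.0]       (last two elements)
print(float_list[::-1])   # [2.0, 4.0, 5.0, 3.0, 1.0]  (reversed copy)
print(float_list[1:100])  # [3.0, 5.0, 4.0, 2.0]  (out-of-range slice bounds are clipped)
```

Unlike indexing with a single out-of-range index, slicing past the end of a list does not raise an error.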
We can iterate through a list using a loop. Here's a for loop.
for ele in float_list:
    print(ele)
What if you wanted the index as well?
Use the built-in Python function enumerate, which yields tuples of the form (index, value).
for i, ele in enumerate(float_list):
    print(i, ele)
Appending and deleting¶
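We can add items to the end of a list using the + operator or with append. The + operator returns a new list and leaves the original unchanged, whereas append modifies the list in place.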
float_list + [.333]
float_list.append(.444)
print(float_list)
len(float_list)
Now, run the cell with float_list.append() a second time. Then run the subsequent cell. What happens?
To remove an item from the list, use del.
del(float_list[2])
print(float_list)
You may also add an element (elem) in a specific position (index) in the list
elem = '3.14'
index = 1
float_list.insert(index, elem)
float_list
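The difference between + and append is easy to miss, so here is a minimal sketch on a small list. The names nums, combined, and result are illustrative.

```python
nums = [1., 3., 5.]

combined = nums + [.333]    # builds a new list; nums is unchanged
print(nums)                 # [1.0, 3.0, 5.0]
print(combined)             # [1.0, 3.0, 5.0, 0.333]

result = nums.append(.444)  # modifies nums in place and returns None
print(nums)                 # [1.0, 3.0, 5.0, 0.444]
print(result)               # None
```

Because append returns None, writing nums = nums.append(x) is a common mistake that replaces the list with None.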
List Comprehensions¶
Lists can be constructed in a compact way using a list comprehension. Here's a simple example.
squaredlist = [i*i for i in int_list]
squaredlist
And here's a more complicated one, requiring a conditional.
comp_list1 = [2*i for i in squaredlist if i % 2 == 0]
print(comp_list1)
This is entirely equivalent to creating comp_list1 using a loop with a conditional, as below:
comp_list2 = []
for i in squaredlist:
    if i % 2 == 0:
        comp_list2.append(2*i)
print(comp_list2)
The list comprehension syntax
[expression for item in list if conditional]
is equivalent to the syntax
for item in list:
    if conditional:
        expression
Exercise 2: Build a list that contains all the prime numbers between 1 and 100, in two different ways:
- 2.1 Using for loops and conditional if statements.
- 2.2 (Stretch Goal) Using a list comprehension. You should be able to do this in one line of code. Hint: it might help to look up the function all() in the documentation.
primes = []
for i in range(1,101):
    # skip i if any prime found so far divides it
    if sum([(i % p) == 0 for p in primes]) > 0:
        continue
    if i != 1:
        primes.append(i)
primes
[i for i in range(2,101) if all(i % j != 0 for j in range(2,i))]
# %load ../solutions/exercise2_1.py
N = 100
# using loops and if statements
primes = []
for j in range(2, N):
    count = 0
    for i in range(2,j):
        if j % i == 0:
            count = count + 1
    if count == 0:
        primes.append(j)
print(primes)
# %load ../solutions/exercise2_2.py
primes_lc = [j for j in range(2, N) if all(j % i != 0 for i in range(2, j))]
print(primes)
print(primes_lc)
Simple Functions¶
A function object is a reusable block of code that does a specific task. Functions are commonplace in Python, either on their own or as they belong to other objects. To invoke a function func, you call it as func(arguments).
We've seen built-in Python functions and methods (details below). For example, len() and print() are built-in Python functions. And at the beginning, you called np.mean() to calculate the mean of three numbers, where mean() is a function in the numpy module and numpy was abbreviated as np. This syntax allows us to have multiple "mean" functions in different modules; calling this one as np.mean() guarantees that we will execute numpy's mean function, as opposed to a mean function from a different module.
User-defined functions¶
We'll now learn to write our own user-defined functions. Below is the syntax for defining a basic function with one input argument and one output. You can also define functions with no input or output arguments, or multiple input or output arguments.
def name_of_function(arg):
    ...
    return(output)
We can write functions with one input and one output argument. Here are two such functions.
def square(x):
    x_sqr = x*x
    return(x_sqr)
def cube(x):
    x_cub = x*x*x
    return(x_cub)
square(5),cube(5)
What if you want to return two variables at a time? The usual way is to return a tuple:
def square_and_cube(x):
    x_cub = x*x*x
    x_sqr = x*x
    return(x_sqr, x_cub)
square_and_cube(5)
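A returned tuple can be unpacked directly into separate names. The sketch below redefines square_and_cube in a compact form so that it runs on its own; the names sq, cb, and result are illustrative.

```python
def square_and_cube(x):
    return x*x, x*x*x

sq, cb = square_and_cube(5)   # unpack the returned tuple into two names
print(sq, cb)                 # 25 125

result = square_and_cube(5)   # or keep the tuple and index into it
print(result[0], result[1])   # 25 125
```

Unpacking requires the number of names on the left to match the number of items in the tuple.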
Lambda functions¶
Often we quickly define mathematical functions with a one-line function called a lambda function. Lambda functions are great because they enable us to write functions without having to name them, i.e., they're anonymous.
No return statement is needed.
# create an anonymous function and assign it to the variable square
square = lambda x: x*x
print(square(3))
# length of the hypotenuse of a right triangle with legs x and y
hypotenuse = lambda x, y: (x*x + y*y) ** 0.5
## Same as
# def hypotenuse(x, y):
#     return((x*x + y*y) ** 0.5)
hypotenuse(3,4)
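Lambda functions are most useful when passed as an argument to another function. The sketch below shows two common cases, the key argument of sorted and the built-in map; the small courses list reuses enrollment figures that appear later in this lab.

```python
courses = [("CS50", 692), ("Stat110", 485), ("AM21a", 153)]

# Sort by the second item of each tuple (the enrollment), smallest first.
by_enrollment = sorted(courses, key=lambda pair: pair[1])
print(by_enrollment)

# Apply a lambda to every element with map.
doubled = list(map(lambda x: 2 * x, [1, 2, 3]))
print(doubled)  # [2, 4, 6]
```

In both cases the function is used once and never needs a name, which is the situation lambdas are designed for.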
Methods¶
A function that belongs to an object is called a method. By "object," we mean an "instance" of a class (e.g., list, integer, or floating point variable).
For example, when we invoke append() on an existing list, append() is a method.
In other words, a method is a function on a specific instance of a class (i.e., object). In this example, our class is a list. float_list is an instance of a list (thus, an object), and the append() function is technically a method since it pertains to the specific instance float_list.
float_list = [1.0, 2.09, 4.0, 2.0, 0.444]
print(float_list)
float_list.append(56.7)
float_list
In Exercise 2, above, you wrote code that generated a list of the prime numbers between 1 and 100. Now, write a function called isprime() that takes in a positive integer $N$, and determines whether or not it is prime. Return True if it's prime and return False if it isn't. Then, using a list comprehension and isprime(), create a list myprimes that contains all the prime numbers less than 100.
# your code here
def isprime(n):
    return all([n % i != 0 for i in range(2,n)])
[n for n in range(2,100) if isprime(n)]
# %load ../solutions/exercise3.py
def isprime(N):
    count = 0
    # reject anything that is not an integer greater than 1
    if not isinstance(N, int):
        return False
    if N <= 1:
        return False
    # count the divisors of N between 2 and N-1
    for i in range(2, N):
        if N % i == 0:
            count = count + 1
    if count == 0:
        return(True)
    else:
        return(False)
print(isprime(3.0), isprime("pavlos"), isprime(0), isprime(-1), isprime(1), isprime(2), isprime(93), isprime(97))
myprimes = [j for j in range(1, 100) if isprime(j)]
print(myprimes)
my_array = np.array([1, 2, 3, 4])
my_array
# works as it would with a standard list
len(my_array)
The shape attribute of an array is very useful (we'll see more of it later when we talk about 2D arrays -- matrices -- and higher-dimensional arrays).
my_array.shape
Numpy arrays are typed. This means that by default, all the elements will be assumed to be of the same type (e.g., integer, float, String).
my_array.dtype
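One consequence of typing is that numpy picks a single type for all elements when the array is built. The sketch below illustrates this with made-up values; the names ints, mixed, and as_float are chosen for illustration.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])   # one float makes the whole array float
print(ints.dtype.kind)          # 'i' (an integer type)
print(mixed.dtype)              # float64

# Assigning a float into an integer array drops the fractional part.
ints[0] = 7.9
print(ints)                     # [7 2 3]

as_float = ints.astype(float)   # astype returns a converted copy
print(as_float.dtype)           # float64
```

The silent truncation on assignment is a frequent source of surprises, so it is worth checking dtype before storing fractional values.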
Numpy arrays have similar functionality as lists! Below, we compute the length, slice the array, and iterate through it (one could identically perform the same with a list).
print(len(my_array))
print(my_array[2:4])
for ele in my_array:
    print(ele)
There are two ways to compute with numpy arrays: a) by calling the array's own methods (e.g., my_array.mean()) or b) by applying a numpy module function such as np.mean() with the array as an argument.
print(my_array.mean())
print(np.mean(my_array))
A constructor is a general programming term that refers to the mechanism for creating a new object (e.g., list, array, String).
There are many other efficient ways to construct numpy arrays. Here are some commonly used numpy array constructors. Read more details in the numpy documentation.
np.ones(10) # generates 10 floating point ones
Numpy gains a lot of its efficiency from being typed. That is, all elements in the array have the same type, such as integer or floating point. The default type, as can be seen above, is a float. (By default this is a 64-bit float, so each element uses 8 bytes of memory, regardless of whether the machine is 32-bit or 64-bit.)
np.dtype(float).itemsize # in bytes (remember, 1 byte = 8 bits)
np.ones(10, dtype='int') # generates 10 integer ones
np.zeros(10)
Often, you will want random numbers. Use the random constructor!
np.random.random(10) # uniform from [0, 1)
You can generate random numbers from a normal distribution with mean 0 and variance 1:
normal_array = np.random.randn(1000)
print("The sample mean and standard deviation are %f and %f, respectively." %(np.mean(normal_array), np.std(normal_array)))
len(normal_array)
You can sample with and without replacement from an array. Let's first construct an array with evenly-spaced values:
grid = np.arange(0., 1.01, 0.1)
grid
Without replacement:
np.random.choice(grid, 5, replace=False)
np.random.choice(grid, 20, replace=False) # raises ValueError: grid has only 11 values to draw from
With replacement:
np.random.choice(grid, 20, replace=True)
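Random draws change on every run, which makes results hard to compare. One option, sketched below, is to fix the seed of numpy's global random generator before sampling. The seed value 109 is arbitrary and chosen only for illustration.

```python
import numpy as np

grid = np.arange(0., 1.01, 0.1)

np.random.seed(109)   # arbitrary seed value
first_draw = np.random.choice(grid, 5, replace=False)

np.random.seed(109)   # resetting the same seed replays the same draws
second_draw = np.random.choice(grid, 5, replace=False)

print(first_draw)
print(np.array_equal(first_draw, second_draw))  # True
print(len(set(first_draw)) == 5)                # True: no repeats without replacement
```

Setting the seed once at the top of a notebook is usually enough to make the whole notebook repeatable.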
Tensors¶
We can think of tensor as a general name for multidimensional arrays of numerical values. While tensors first emerged in the 20th century, they have since been applied to numerous other disciplines, including machine learning. In this class you will only be using scalars, vectors, and 2D arrays, so you do not need to worry about the name 'tensor'.
We will use the following naming conventions:
- scalar = just a number = rank 0 tensor ( $a \in F$ )
- vector = 1D array = rank 1 tensor ( $x = (x_1, \ldots, x_n)^\top \in F^n$ )
- matrix = 2D array = rank 2 tensor ( $\textbf{X} = [a_{ij}] \in F^{m \times n}$ )
- 3D array = rank 3 tensor ( $\mathscr{X} = [t_{i,j,k}] \in F^{m \times n \times l}$ )
- $\mathscr{N}$D array = rank $\mathscr{N}$ tensor ( $\mathscr{T} = [t_{i_1,\ldots,i_\mathscr{N}}] \in F^{n_1 \times \cdots \times n_\mathscr{N}}$ )
Slicing a 2D array¶
# how do we get just the second row of the above array?
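The array this cell refers to is not shown here, so the sketch below builds a small stand-in (an assumption made only for illustration) and shows how rows, columns, and sub-blocks can be selected with the [row, column] indexing form.

```python
import numpy as np

# Stand-in 3x4 array; the values are arbitrary.
arr2d = np.array([[ 1,  2,  3,  4],
                  [ 5,  6,  7,  8],
                  [ 9, 10, 11, 12]])

print(arr2d.shape)      # (3, 4)
print(arr2d[1])         # second row: [5 6 7 8]
print(arr2d[1, :])      # same row, with the column slice written out
print(arr2d[:, 0])      # first column: [1 5 9]
print(arr2d[0:2, 1:3])  # rows 0-1 and columns 1-2
```

The first index always selects rows and the second selects columns, and each position accepts the same slice syntax used for lists.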
Numpy supports vector operations¶
What does this mean? It means that instead of adding two arrays, element by element, you can just say: add the two arrays.
first = np.ones(5)
second = np.ones(5)
first + second # adds element by element and returns a new array
Note that this behavior is very different from python lists where concatenation happens.
first_list = [1., 1., 1., 1., 1.]
second_list = [1., 1., 1., 1., 1.]
first_list + second_list # concatenation
On some computer chips, this numpy addition actually happens in parallel and can yield significant increases in speed. But even on regular chips, the advantage of greater readability is important.
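To make the readability point concrete, the sketch below computes the same sum twice, once with an explicit loop and once with a single vectorized expression. The names loop_sum and vector_sum are illustrative.

```python
import numpy as np

first = np.ones(5)
second = np.ones(5)

# Explicit loop: add one pair of elements at a time.
loop_sum = np.empty(5)
for i in range(len(first)):
    loop_sum[i] = first[i] + second[i]

# Vectorized: one expression, same result.
vector_sum = first + second

print(vector_sum)                            # [2. 2. 2. 2. 2.]
print(np.array_equal(loop_sum, vector_sum))  # True
```

Both versions give identical values; the vectorized form simply hands the loop to numpy's compiled code.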
Broadcasting¶
Numpy supports a concept known as broadcasting, which dictates how arrays of different sizes are combined together. There are too many rules to list here, but importantly, multiplying an array by a number multiplies each element by the number. Adding a number adds the number to each element.
first + 1
first*5
This means that if you wanted samples from the distribution $N(5, 7)$ (mean 5, standard deviation 7), you could do:
normal_5_7 = 5 + 7*normal_array
np.mean(normal_5_7), np.std(normal_5_7)
Multiplying two arrays multiplies them element-by-element
(first +1) * (first*5)
You might have wanted to compute the dot product instead:
np.dot((first +1) , (first*5))
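Broadcasting also applies when both operands are arrays of different shapes. The following sketch, with made-up shapes and values, adds a row vector and a column vector to a 2D array.

```python
import numpy as np

matrix = np.zeros((3, 4))              # shape (3, 4)
row = np.array([1., 2., 3., 4.])       # shape (4,)
col = np.array([[10.], [20.], [30.]])  # shape (3, 1)

print(matrix + row)        # row is added to every row of matrix
print(matrix + col)        # col is added to every column of matrix
print((row + col).shape)   # (3, 4): both inputs are stretched
```

Shapes are compared from the last dimension backwards, and two dimensions are compatible when they are equal or when one of them is 1.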
Probability Distributions from scipy.stats and statsmodels¶
Two useful statistics libraries in python are scipy and statsmodels.
For example, to load and run a z-test for proportions:
from statsmodels.stats.proportion import proportions_ztest
x = np.array([74,100])  # number of successes in each sample
n = np.array([152,266]) # number of trials in each sample
zstat, pvalue = proportions_ztest(x, n)
print("Two-sided z-test for proportions: \n","z =",zstat,", pvalue =",pvalue)
#The `%matplotlib inline` ensures that plots are rendered inline in the browser.
%matplotlib inline
import matplotlib.pyplot as plt
Let's get the normal distribution namespace from scipy.stats. See the scipy.stats documentation for details.
from scipy.stats import norm
Let's create 1,000 points between -10 and 10
x = np.linspace(-10, 10, 1000) # linspace() returns evenly-spaced numbers over a specified interval
x[0:10], x[-10:]
Let's get the pdf of a normal distribution with a mean of 1 and standard deviation 3, and plot it using the grid points computed before:
pdf_x = norm.pdf(x, 1, 3)
plt.plot(x, pdf_x);
And you can draw random samples from the distribution using the rvs function.
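As a minimal sketch of rvs, the example below draws samples from the same normal distribution used for the pdf (mean 1, standard deviation 3). The sample size and the random_state value are arbitrary choices made for illustration.

```python
from scipy.stats import norm

# Draw 1,000 samples from a normal distribution with mean 1 and standard deviation 3.
# random_state is set only to make this illustration repeatable.
samples = norm.rvs(loc=1, scale=3, size=1000, random_state=0)

print(samples.shape)                  # (1000,)
print(samples.mean(), samples.std())  # close to 1 and 3, but not exactly
```

The sample mean and standard deviation approach the distribution's parameters as the sample size grows.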
References¶
A useful book by Jake Vanderplas: PythonDataScienceHandbook.
You may also benefit from using Chris Albon's web site as a reference. It contains lots of useful information.
Dictionaries¶
A dictionary is another data structure (aka storage container) -- arguably the most powerful. Like a list, a dictionary is a collection of items. Unlike a list, its items are accessed with keys and not integer positions. (Since Python 3.7 a dictionary remembers insertion order, but it still cannot be indexed by position.)
Dictionaries are the closest data structure we have to a database.
Let's make a dictionary with a few Harvard courses and their corresponding enrollment numbers.
enroll2017_dict = {'CS50': 692, 'CS109A / Stat 121A / AC 209A': 352, 'Econ1011a': 95, 'AM21a': 153, 'Stat110': 485}
enroll2017_dict
One can obtain the value corresponding to a key via:
enroll2017_dict['CS50']
If you try to access a key that isn't present, your code will yield an error:
enroll2017_dict['CS630']
Alternatively, the .get() function allows one to gracefully handle these situations by providing a default value if the key isn't found:
enroll2017_dict.get('CS630', 5)
Note, this does not store a new value for the key; it only provides a value to return if the key isn't found.
enroll2017_dict['CS630']
enroll2017_dict.get('C730', None)
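To see that get() leaves the dictionary untouched, and how to test for a key before using it, consider this small sketch. The dictionary enroll is a shortened stand-in for enroll2017_dict.

```python
enroll = {'CS50': 692, 'Stat110': 485}

print('CS50' in enroll)         # True
print('CS630' in enroll)        # False

print(enroll.get('CS630', 0))   # 0, and the dictionary is unchanged
print(len(enroll))              # 2

enroll['CS630'] = 5             # assignment is what actually stores a key
print(enroll['CS630'])          # 5
```

The in operator checks keys only, so it is a cheap way to avoid a KeyError before indexing.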
All sorts of iterations are supported:
enroll2017_dict.values()
enroll2017_dict.items()
We can iterate over the tuples obtained above:
for key, value in enroll2017_dict.items():
    print("%s: %d" %(key, value))
Simply iterating over a dictionary gives us the keys. This is useful when we want to do something with each item:
second_dict={}
for key in enroll2017_dict:
    second_dict[key] = enroll2017_dict[key]
second_dict
The above is an actual copy of the contents of enroll2017_dict in newly allocated memory, unlike second_dict = enroll2017_dict, which would have made both variables label the same memory location.
In the previous dictionary example, the keys were strings corresponding to course names. Keys don't have to be strings, though; they can be other immutable data types such as numbers or tuples (not lists, as lists are mutable).
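The sketch below illustrates tuple keys and the error raised when a list is used as a key. The (course, year) layout and the 2016 figure are hypothetical and invented only for illustration.

```python
# Keys are (course, year) tuples; the 2016 value is made up for illustration.
enroll_by_year = {('CS50', 2016): 700, ('CS50', 2017): 692}
print(enroll_by_year[('CS50', 2017)])   # 692

try:
    bad = {['CS50', 2017]: 692}         # a list is mutable, hence unhashable
except TypeError as err:
    key_error_message = str(err)
    print("TypeError:", key_error_message)
```

Tuple keys are a simple way to index values by two or more attributes at once without nesting dictionaries.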
Dictionary comprehension: "Do not try this at home"¶
You can construct dictionaries using a dictionary comprehension, which is similar to a list comprehension. Notice the brackets {} and the use of zip (see next cell for more on zip).
float_list = [1., 3., 5., 4., 2.]
int_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
my_dict = {k:v for (k, v) in zip(int_list, float_list)}
my_dict
Creating tuples with zip¶
zip is a Python built-in function that returns an iterator that aggregates elements from each of the iterables. This is an iterator of tuples, where the i-th tuple contains the i-th element from each of the argument sequences or iterables. The iterator stops when the shortest input iterable is exhausted. The set() built-in function returns a set object, optionally with elements taken from another iterable. By using set() you can make the output of zip printable (note that a set does not preserve the original order). In the example below, the iterables are the two lists, float_list and int_list. We can have more than two iterables.
float_list = [1., 3., 5., 4., 2.]
int_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
viz_zip = set(zip(int_list, float_list))
viz_zip
type(viz_zip)
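If the pairing order matters, one alternative is to wrap zip in list() instead of set(). The sketch below also shows dict(zip(...)) and the zip(*pairs) idiom for splitting pairs back apart; the names pairs, as_dict, ints_back, and floats_back are illustrative.

```python
float_list = [1., 3., 5., 4., 2.]
int_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

pairs = list(zip(int_list, float_list))    # list keeps the pairing order
print(pairs)        # [(1, 1.0), (2, 3.0), (3, 5.0), (4, 4.0), (5, 2.0)]
print(len(pairs))   # 5: zip stops at the shorter input

as_dict = dict(zip(int_list, float_list))  # same result as the comprehension
print(as_dict)

ints_back, floats_back = zip(*pairs)       # "unzip" back into two tuples
print(ints_back)    # (1, 2, 3, 4, 5)
```

Note that the last five entries of int_list are silently dropped, because zip stops as soon as float_list runs out.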