Key Word(s): Python, YAML, numpy



CS-109A Introduction to Data Science

Lab 1: Introduction to Python and its Numerical Stack

Harvard University
Fall 2019
Instructors: Pavlos Protopapas, Kevin Rader, and Chris Tanner
Lab Instructor: Eleni Kaxiras
Authors: Rahul Dave, David Sondak, Will Claybaugh, Pavlos Protopapas, Chris Tanner, and Eleni Kaxiras


In [1]:
## RUN THIS CELL TO GET THE RIGHT FORMATTING 
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
Out[1]:
In [2]:
PATHTOSOLUTIONS = '../solutions'

Programming Expectations

All assignments for this class will use Python and the browser-based iPython notebook format you are currently viewing. Programming at the level of CS 50 is a prerequisite for this course. If you have concerns about this, come speak with any of the instructors.

We will refer to the Python 3 documentation in this lab and throughout the course.

Learning Goals

This introductory lab is a condensed introduction to Python numerical programming. By the end of this lab, you will feel more comfortable:

  • Setting up your own Anaconda environment with the necessary dependencies.

  • Writing short Python code using functions, loops, lists, numpy arrays, and dictionaries.

  • Manipulating Python lists and numpy arrays and understanding the difference between them.

  • Using the stats libraries scipy.stats and statsmodels.

Part 1: Set up a Conda Python Environment and Clone the Class Repository

On Python installation packages

There are two main package installers for Python, conda and pip. Pip is the Python Packaging Authority’s recommended tool for installing packages from the Python Package Index (PyPI). Conda is a cross-platform package and environment manager that installs and manages conda packages from the Anaconda repository and Anaconda Cloud. Conda does not assume any specific configuration on your computer and will install the Python interpreter along with the other Python packages, whereas pip assumes that a Python interpreter is already installed. Since many operating systems ship with Python, this is usually not a problem.

In one sentence, the main difference is that conda can create isolated environments that contain different versions of Python and/or of the packages installed in them. This can be extremely useful when working with data science tools, as different tools may have conflicting requirements that prevent them from all being installed into a single environment. You can also get isolated environments with pip, but you need a separate tool such as virtualenv or venv. You may use either; we recommend conda because in our experience it leads to fewer incompatibilities between packages and thus fewer broken environments.

Conclusion: Use both. Most often in our data science environments we want to combine pip with conda, because one or more packages are only available via pip. Thousands of packages are available in the Anaconda repository, including the most popular data science, machine learning, and AI frameworks, but many more are available on PyPI. Even if you created your environment with conda, you can use pip to install individual packages.

(source: anaconda site)

Installing Conda

- First check if you have conda

In MacOS or Linux open a Terminal window and at the prompt type

conda -V

If you get the version number (e.g. conda 4.6.14) you are all set! If you get an error, you do not have Anaconda, and it would be a good idea to install it.

- If you do not have it, you can install it by following the instructions:

Mac : https://docs.anaconda.com/anaconda/install/mac-os/

Windows : https://docs.anaconda.com/anaconda/install/windows (Note: #8 is important: DO NOT add to your path. The reason is that Windows contains paths that may include spaces and that clashes with the way conda understands paths.)

- If you do have Anaconda, consider updating conda to its latest version:

conda update conda

Conda allows you to work in 'computing sandboxes' called environments. You may have environments installed on your computer to access different versions of Python and different libraries to avoid conflict between libraries which can cause errors.


NOTE (Sept.6, 2019):

If you are still having issues please check the Announcements and the Discussion Forum (Ed) via the 2019-CS109a Canvas site

Also please check the latest version of the cs109a.yml file. We have edited it as of today.


What are environments and do I need them?

Environments in Python are like sandboxes that have different versions of Python and/or packages installed in them. You can create, export, list, remove, and update environments. Switching or moving between environments is called activating the environment. When you are done with an environment you may deactivate it.

For this class we want a bit more control over the packages installed in the environment, so we will create it from a YAML file called cs109a.yml, which is included in the Lab directory of the class git repository. YAML originally stood for Yet Another Markup Language, but it was later reinterpreted as YAML Ain't Markup Language [source: Wikipedia].

Creating an environment from an environment.yml file

Using your browser, visit the class git repository https://github.com/Harvard-IACS/2019-CS109A

Go to content --> labs/ --> lab1 and look for the cs109a.yml file. Download it to a local directory in your computer.

Then in the Terminal again type

conda env create -f {PATH-TO-FILE}/cs109a.yml

Activate the new environment:

conda activate cs109a

(Older conda versions use source activate cs109a on MacOS or Linux instead.)

You should see the name of the environment in parentheses at the start of your command prompt.

Verify that the new environment was installed correctly:

conda list

This will give you a list of the packages installed in this environment.
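As a complementary check from inside Python, the sketch below reports the interpreter version and whether a few packages can be found. The package names listed here are an assumption chosen for illustration; the authoritative list is whatever cs109a.yml specifies.

```python
import sys
import importlib.util

# Hypothetical list of packages to check; adjust it to match cs109a.yml
packages = ["numpy", "scipy", "statsmodels", "matplotlib"]

print("Python version:", sys.version.split()[0])

# find_spec returns None when a top-level package cannot be found
status = {name: importlib.util.find_spec(name) is not None for name in packages}
for name, found in status.items():
    print(name, "found" if found else "MISSING")
```

If a package is reported as MISSING, the environment is probably not activated or was not created from the latest YAML file.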

References

Manage conda environments

Clone the class repository

In the Terminal type:

git clone https://github.com/Harvard-IACS/2019-CS109A.git

Starting the Jupyter Notebook

Once everything is installed, go to the Terminal and type

jupyter notebook

to start the Jupyter notebook server. This will spawn a process that keeps running in the Terminal window until you are done working with the notebook. When you are done, press Control-C to stop it.

Starting the notebook will bring up a browser window with your file structure. Look for the 2019-CS109A folder. It should be where you cloned it previously. When you return to this folder in the future, go to its top level in the Terminal and type

git pull

This will update the contents of the folder with whatever is new. Make sure you are in the top-level folder by typing

pwd

which should print a path ending in /2019-CS109A

For more on using the Notebook see: https://jupyter-notebook.readthedocs.io/en/latest/

Part 2: Getting Started with Python

Importing modules

All notebooks should begin with code that imports modules, which are collections of commonly used Python functions. Below we import the Numpy module, a fast numerical programming library for scientific computing. Future labs will require additional modules, which we'll import with the same syntax.

import MODULE_NAME as MODULE_NICKNAME

In [3]:
import numpy as np #imports a fast numerical programming library

Now that Numpy has been imported, we can access some useful functions. For example, we can use mean to calculate the mean of a set of numbers.

In [4]:
my_list = [1.2, 2, 3.3]
np.mean(my_list)
Out[4]:
2.1666666666666665

Calculations and variables

In [5]:
# // is integer division
1/2, 1//2, 1.0/2, 3*3.2
Out[5]:
(0.5, 0, 0.5, 9.600000000000001)

The last line in a cell is returned as the output value, as above. For cells with multiple lines of results, we can display results using print, as can be seen below.

In [6]:
print(1 + 3.0, "\n", 9, 7)
5/3
4.0
 9 7
Out[6]:
1.6666666666666667

We can store integer or floating point values as variables. The other basic Python data types -- booleans, strings, lists -- can also be stored as variables.

In [7]:
a = 1
b = 2.0

Here we store a list in a variable:

In [1]:
a = [1, 2, 3]

Think of a variable as a label for a value, not a box in which you put the value

(image: Fluent Python by Luciano Ramalho)

In [2]:
b = a
b
Out[2]:
[1, 2, 3]

This DOES NOT create a new copy of a. It merely puts a new label on the memory at a, as can be seen by the following code:

In [3]:
print("a", a)
print("b", b)
a[1] = 7
print("a after change", a)
print("b after change", b)
a [1, 2, 3]
b [1, 2, 3]
a after change [1, 7, 3]
b after change [1, 7, 3]
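If an independent copy is what you want, one option is the list's copy() method. The following is a minimal sketch; the name c is introduced here only for illustration.

```python
a = [1, 2, 3]
b = a          # b is a second label for the same list object
c = a.copy()   # c is a new list with the same elements (list(a) or a[:] also work)

a[1] = 7

print(b)        # [1, 7, 3]  (b sees the change)
print(c)        # [1, 2, 3]  (c does not)
print(b is a)   # True
print(c is a)   # False
```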

Tuples

Multiple items on one line in the interface are returned as a tuple, an immutable sequence of Python objects. See the end of this notebook for an interesting use of tuples.

In [10]:
a = 1
b = 2.0
a + a, a - b, b * b, 10*a
Out[10]:
(2, -1.0, 4.0, 10)

type()

We can obtain the type of a variable, and use boolean comparisons to test these types. VERY USEFUL when things go wrong and you cannot understand why this method does not work on a specific variable!

In [11]:
type(a) == float
Out[11]:
False
In [12]:
type(a) == int
Out[12]:
True
In [13]:
type(a)
Out[13]:
int

For reference, below are common arithmetic and comparison operations.

(images: tables of common arithmetic and comparison operators)
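As a runnable complement to the reference above, here is a short sketch of the most common arithmetic and comparison operators. The values of x and y are arbitrary.

```python
x, y = 7, 2

# Arithmetic operators
print(x + y)    # 9     addition
print(x - y)    # 5     subtraction
print(x * y)    # 14    multiplication
print(x / y)    # 3.5   true division (always a float in Python 3)
print(x // y)   # 3     floor division
print(x % y)    # 1     remainder (modulo)
print(x ** y)   # 49    exponentiation

# Comparison operators return booleans
print(x == y, x != y)   # False True
print(x < y, x <= y)    # False False
print(x > y, x >= y)    # True True
```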

EXERCISE 1: Create a tuple called `tup` with the following five objects:
  • The first element is an integer of your choice
  • The second element is a float of your choice
  • The third element is the sum of the first two elements
  • The fourth element is the difference of the first two elements
  • The fifth element is the first element divided by the second element

  • Display the output of tup. What is the type of the variable tup? What happens if you try to change an item in the tuple?

In [6]:
# your code here
tup = (1,1.1,1+1.1,1-1.1,1/1.1)
print(tup)
print(type(tup))
(1, 1.1, 2.1, -0.10000000000000009, 0.9090909090909091)
<class 'tuple'>

In [73]:
# TO RUN THE SOLUTIONS
# 1. uncomment the first line of the cell below so you have just %load
# 2. Run the cell AGAIN to execute the python code, it will not run when you execute the %load command!!
In [8]:
# %load ../solutions/exercise1.py
a = 3
b = 4.0
c = a + b
d = a - b
e = a / b
tup = (a, b, c, d, e)
tup
Out[8]:
(3, 4.0, 7.0, -1.0, 0.75)
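The last question of the exercise can be explored with a small sketch like the one below. It reuses the solution's values; new_tup is a name introduced here for illustration.

```python
tup = (3, 4.0, 7.0, -1.0, 0.75)

try:
    tup[0] = 10          # tuples do not support item assignment
except TypeError as err:
    print("TypeError:", err)

# To "change" a tuple, build a new one instead
new_tup = (10,) + tup[1:]
print(new_tup)           # (10, 4.0, 7.0, -1.0, 0.75)
```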

Lists

Much of Python is based on the notion of a list. In Python, a list is a sequence of items separated by commas, all within square brackets. The items can be integers, floating points, or another type. Unlike in C arrays, items in a Python list can be different types, so Python lists are more versatile than traditional arrays in C or other languages.

Let's start out by creating a few lists.

In [16]:
empty_list = []
float_list = [1., 3., 5., 4., 2.]
int_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
mixed_list = [1, 2., 3, 4., 5]
print(empty_list)
print(int_list)
print(mixed_list, float_list)
[]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 2.0, 3, 4.0, 5] [1.0, 3.0, 5.0, 4.0, 2.0]

Lists in Python are zero-indexed, as in C. The first entry of the list has index 0, the second has index 1, and so on.

In [17]:
print(int_list[0])
print(float_list[1])
1
3.0

What happens if we try to use an index that doesn't exist for that list? Python will complain!

In [18]:
print(float_list[10])
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
 in ()
----> 1 print(float_list[10])

IndexError: list index out of range

You can find the length of a list using the built-in function len:

In [19]:
print(float_list)
len(float_list)
[1.0, 3.0, 5.0, 4.0, 2.0]
Out[19]:
5

Indexing on lists plus Slicing

And since Python is zero-indexed, the last element of float_list is

In [20]:
float_list[len(float_list)-1]
Out[20]:
2.0

It is more idiomatic in Python to use -1 for the last element, -2 for the second last, and so on

In [21]:
float_list[-1]
Out[21]:
2.0

We can use the : operator to access a subset of the list. This is called slicing.

In [22]:
print(float_list[1:5])
print(float_list[0:2])
[3.0, 5.0, 4.0, 2.0]
[1.0, 3.0]

Below is a summary of list slicing operations:

(image: summary table of list slicing operations)
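The most common slicing patterns can also be sketched in code. The list letters is a hypothetical example used only here.

```python
letters = ['a', 'b', 'c', 'd', 'e']

print(letters[1:3])    # ['b', 'c']   from index 1 up to, but not including, 3
print(letters[:2])     # ['a', 'b']   first two elements
print(letters[2:])     # ['c', 'd', 'e']   from index 2 to the end
print(letters[-2:])    # ['d', 'e']   last two elements
print(letters[::2])    # ['a', 'c', 'e']   every second element
print(letters[::-1])   # ['e', 'd', 'c', 'b', 'a']   reversed copy
print(letters[:])      # ['a', 'b', 'c', 'd', 'e']   shallow copy of the whole list
```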

In [24]:
lst = ['hi', 7, 'c', 'cat', 'hello', 8]
lst[:2]
Out[24]:
['hi', 7]

You can slice "backwards" as well:

In [25]:
float_list[:-2] # up to second last
Out[25]:
[1.0, 3.0, 5.0]
In [26]:
float_list[:4] # up to but not including 5th element
Out[26]:
[1.0, 3.0, 5.0, 4.0]

You can also slice with a stride:

In [27]:
float_list[:4:2] # above but skipping every second element
Out[27]:
[1.0, 5.0]

We can iterate through a list using a loop. Here's a for loop.

In [28]:
for ele in float_list:
    print(ele)
1.0
3.0
5.0
4.0
2.0

What if you wanted the index as well?

Use the built-in Python function enumerate, which yields tuples of the form (index, value).

In [29]:
for i, ele in enumerate(float_list):
    print(i, ele)
0 1.0
1 3.0
2 5.0
3 4.0
4 2.0

Appending and deleting

We can also add items to the end of a list using the + operator, which returns a new list, or with append, which modifies the list in place.

In [30]:
float_list + [.333]
Out[30]:
[1.0, 3.0, 5.0, 4.0, 2.0, 0.333]
In [31]:
float_list.append(.444)
In [32]:
print(float_list)
len(float_list)
[1.0, 3.0, 5.0, 4.0, 2.0, 0.444]
Out[32]:
6

Now, run the cell with float_list.append(.444) a second time. Then run the subsequent cell. What happens?
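The difference between + and append can be seen in isolation with a small sketch. The list nums is a hypothetical example, separate from float_list.

```python
nums = [1.0, 2.0]

longer = nums + [0.333]   # + builds a new list; nums is unchanged
print(nums)               # [1.0, 2.0]
print(longer)             # [1.0, 2.0, 0.333]

nums.append(0.444)        # append modifies nums in place and returns None
nums.append(0.444)        # running it again appends a second copy
print(nums)               # [1.0, 2.0, 0.444, 0.444]
```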

To remove an item from the list, use del.

In [33]:
del(float_list[2])
print(float_list)
[1.0, 3.0, 4.0, 2.0, 0.444]

You may also add an element (elem) in a specific position (index) in the list

In [34]:
elem = '3.14'
index = 1
float_list.insert(index, elem)
float_list
Out[34]:
[1.0, '3.14', 3.0, 4.0, 2.0, 0.444]

List Comprehensions

Lists can be constructed in a compact way using a list comprehension. Here's a simple example.

In [35]:
squaredlist = [i*i for i in int_list]
squaredlist
Out[35]:
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

And here's a more complicated one, requiring a conditional.

In [36]:
comp_list1 = [2*i for i in squaredlist if i % 2 == 0]
print(comp_list1)
[8, 32, 72, 128, 200]

This is entirely equivalent to creating comp_list1 using a loop with a conditional, as below:

In [37]:
comp_list2 = []
for i in squaredlist:
    if i % 2 == 0:
        comp_list2.append(2*i)       
print(comp_list2)
[8, 32, 72, 128, 200]

The list comprehension syntax

[expression for item in list if conditional]

is equivalent to building the list with the loop

result = []
for item in list:
    if conditional:
        result.append(expression)
Exercise 2: (do at home) Build a list that contains every prime number between 1 and 100, in two different ways:
  • 2.1 Using for loops and conditional if statements.
  • 2.2 (Stretch Goal) Using a list comprehension. You should be able to do this in one line of code. Hint: it might help to look up the function all() in the documentation.
In [14]:
primes = []
for i in range(1,101):
    if sum([(i % p) == 0 for p in primes]) > 0:
        continue
    if i != 1:
        primes.append(i)
primes
Out[14]:
[2,
 3,
 5,
 7,
 11,
 13,
 17,
 19,
 23,
 29,
 31,
 37,
 41,
 43,
 47,
 53,
 59,
 61,
 67,
 71,
 73,
 79,
 83,
 89,
 97]
In [18]:
[i for i in range(2,101) if all(i % j != 0 for j in range(2,i))]
Out[18]:
[2,
 3,
 5,
 7,
 11,
 13,
 17,
 19,
 23,
 29,
 31,
 37,
 41,
 43,
 47,
 53,
 59,
 61,
 67,
 71,
 73,
 79,
 83,
 89,
 97]
In [ ]:
# %load ../solutions/exercise2_1.py
N = 100

# using loops and if statements
primes = []
for j in range(2, N):
    count = 0
    for i in range(2, j):
        if j % i == 0:
            count = count + 1
    if count == 0:
        primes.append(j)
print(primes)
In [ ]:
# %load ../solutions/exercise2_2.py
primes_lc = [j for j in range(2, N) if all(j % i != 0 for i in range(2, j))]

print(primes)
print(primes_lc)

Simple Functions

A function object is a reusable block of code that does a specific task. Functions are commonplace in Python, either on their own or as they belong to other objects. To invoke a function func, you call it as func(arguments).

We've seen built-in Python functions and methods (details below). For example, len() and print() are built-in Python functions. And at the beginning, you called np.mean() to calculate the mean of three numbers, where mean() is a function in the numpy module and numpy was abbreviated as np. This syntax allows us to have multiple "mean" functions in different modules; calling this one as np.mean() guarantees that we will execute numpy's mean function, as opposed to a mean function from a different module.
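To make the namespace point concrete, the sketch below calls two different functions that share the name mean. The statistics module from the standard library is brought in here only as an example of "a different module".

```python
import statistics
import numpy as np

values = [1.2, 2, 3.3]

# Two different functions that happen to share the name "mean"
numpy_result = np.mean(values)
stdlib_result = statistics.mean(values)

print(numpy_result)
print(stdlib_result)
```

The module prefix tells Python, and the reader, which implementation is being used.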

User-defined functions

We'll now learn to write our own user-defined functions. Below is the syntax for defining a basic function with one input argument and one output. You can also define functions with no input or output arguments, or multiple input or output arguments.

def name_of_function(arg):
    ...
    return(output)

We can write functions with one input and one output argument. Here are two such functions.

In [42]:
def square(x):
    x_sqr = x*x
    return(x_sqr)

def cube(x):
    x_cub = x*x*x
    return(x_cub)

square(5),cube(5)
Out[42]:
(25, 125)

What if you want to return two variables at a time? The usual way is to return a tuple:

In [43]:
def square_and_cube(x):
    x_cub = x*x*x
    x_sqr = x*x
    return(x_sqr, x_cub)

square_and_cube(5)
Out[43]:
(25, 125)
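A returned tuple can be unpacked directly into separate variables. In this sketch the names sq and cb are chosen only for illustration.

```python
def square_and_cube(x):
    return x*x, x*x*x   # the comma builds the tuple; parentheses are optional

sq, cb = square_and_cube(5)   # unpack the returned tuple into two names
print(sq)   # 25
print(cb)   # 125
```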

Lambda functions

Often we quickly define mathematical functions with a one-line function called a lambda function. Lambda functions are great because they enable us to write functions without having to name them, i.e., they're anonymous. No return statement is needed.

In [44]:
# create an anonymous function and assign it to the variable square
square = lambda x: x*x
print(square(3))

hypotenuse = lambda x, y: (x*x + y*y) ** 0.5

## Same as
# def hypotenuse(x, y):
#     return((x*x + y*y) ** 0.5)

hypotenuse(3,4)
9
Out[44]:
5.0
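Lambda functions are most useful when passed directly to another function without ever being named. A minimal sketch, using a hypothetical list of words and the built-in sorted:

```python
words = ["numpy", "pip", "conda", "statsmodels"]

# Sort by word length; the lambda is used once and never named
by_length = sorted(words, key=lambda w: len(w))
print(by_length)   # ['pip', 'numpy', 'conda', 'statsmodels']
```

Because sorted is stable, words of equal length keep their original relative order.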

Methods

A function that belongs to an object is called a method. By "object," we mean an "instance" of a class (e.g., list, integer, or floating point variable).

For example, when we invoke append() on an existing list, append() is a method.

In other words, a method is a function on a specific instance of a class (i.e., object). In this example, our class is a list. float_list is an instance of a list (thus, an object), and the append() function is technically a method since it pertains to the specific instance float_list.

In [45]:
float_list = [1.0, 2.09, 4.0, 2.0, 0.444]
print(float_list)
float_list.append(56.7) 
float_list
[1.0, 2.09, 4.0, 2.0, 0.444]
Out[45]:
[1.0, 2.09, 4.0, 2.0, 0.444, 56.7]
Exercise 3: (do at home) Write a function that tests whether a number is prime

In Exercise 2, above, you wrote code that generated a list of the prime numbers between 1 and 100. Now, write a function called isprime() that takes in a positive integer $N$, and determines whether or not it is prime. Return True if it's prime and return False if it isn't. Then, using a list comprehension and isprime(), create a list myprimes that contains all the prime numbers less than 100.

In [19]:
# your code here
def isprime(n):
    return n > 1 and all(n % i != 0 for i in range(2, n))
In [26]:
[n for n in range(2,100) if isprime(n)]
Out[26]:
[2,
 3,
 5,
 7,
 11,
 13,
 17,
 19,
 23,
 29,
 31,
 37,
 41,
 43,
 47,
 53,
 59,
 61,
 67,
 71,
 73,
 79,
 83,
 89,
 97]
In [ ]:
# %load ../solutions/exercise3.py
def isprime(N):
    count = 0
    if not isinstance(N, int):
        return False
    if N <= 1:
        return False
    for i in range(2, N):
        if N % i == 0:
            count = count + 1
    if count == 0:
        return True
    else:
        return False
    
print(isprime(3.0), isprime("pavlos"), isprime(0), isprime(-1), isprime(1), isprime(2), isprime(93), isprime(97))    
myprimes = [j for j in range(1, 100) if isprime(j)]
print(myprimes)

Introduction to Numpy

Scientific Python code uses a fast array structure called the numpy array. Those who have programmed in Matlab will find this very natural. For reference, see the numpy documentation.

Let's make a numpy array:

In [48]:
my_array = np.array([1, 2, 3, 4])
my_array
Out[48]:
array([1, 2, 3, 4])
In [49]:
# works as it would with a standard list
len(my_array)
Out[49]:
4

The shape attribute of an array is very useful (we'll see more of it later when we talk about 2D arrays -- matrices -- and higher-dimensional arrays).

In [50]:
my_array.shape
Out[50]:
(4,)

Numpy arrays are typed. This means that all the elements of an array share a single type (e.g., integer, float, or string), which numpy infers from the data unless you specify it.

In [51]:
my_array.dtype
Out[51]:
dtype('int64')
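Type inference can be observed directly. The following sketch, with array names chosen for illustration, shows how a single float promotes the whole array and how to convert explicitly with astype.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])     # one float promotes every element to float
as_float = ints.astype(float)     # explicit conversion returns a new array

print(ints.dtype)       # an integer type; the exact width depends on the platform
print(mixed.dtype)      # float64
print(mixed)            # [1.  2.5 3. ]
print(as_float.dtype)   # float64
```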

Numpy arrays have similar functionality as lists! Below, we compute the length, slice the array, and iterate through it (one could identically perform the same with a list).

In [52]:
print(len(my_array))
print(my_array[2:4])
for ele in my_array:
    print(ele)
4
[3 4]
1
2
3
4

There are two ways to manipulate numpy arrays: a) by calling the array's own methods (e.g., my_array.mean()), or b) by applying a numpy module function such as np.mean() with the array as an argument.

In [53]:
print(my_array.mean())
print(np.mean(my_array))
2.5
2.5

A constructor is a general programming term that refers to the mechanism for creating a new object (e.g., list, array, String).

There are many other efficient ways to construct numpy arrays. Here are some commonly used numpy array constructors. Read more details in the numpy documentation.

In [54]:
np.ones(10) # generates 10 floating point ones
Out[54]:
array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])

Numpy gains a lot of its efficiency from being typed. That is, all elements in the array have the same type, such as integer or floating point. The default type, as can be seen above, is a float. (Numpy's default float is float64, which uses 64 bits, i.e. 8 bytes, per element whether the code is running on a 32-bit or a 64-bit machine.)

In [55]:
np.dtype(float).itemsize # in bytes (remember, 1 byte = 8 bits)
Out[55]:
8
In [56]:
np.ones(10, dtype='int') # generates 10 integer ones
Out[56]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
In [57]:
np.zeros(10)
Out[57]:
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])

Often, you will want random numbers. Use the random constructor!

In [58]:
np.random.random(10) # uniform on the half-open interval [0, 1)
Out[58]:
array([ 0.85115672,  0.37346821,  0.3298871 ,  0.47496563,  0.69940192,
        0.97207796,  0.91488615,  0.36063927,  0.81240722,  0.16128617])

You can generate random numbers from a normal distribution with mean 0 and variance 1:

In [59]:
normal_array = np.random.randn(1000)
print("The sample mean and standard deviation are %f and %f, respectively." %(np.mean(normal_array), np.std(normal_array)))
The sample mean and standard deviation are 0.025195 and 1.026880, respectively.
In [60]:
len(normal_array)
Out[60]:
1000
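Because these draws are random, rerunning the cells gives different numbers each time. If you need repeatable results, one option is to seed the generator first. This is a minimal sketch; the seed value 109 is arbitrary.

```python
import numpy as np

np.random.seed(109)            # 109 is an arbitrary seed chosen for illustration
first_draw = np.random.randn(5)

np.random.seed(109)            # resetting the seed replays the same sequence
second_draw = np.random.randn(5)

print(np.array_equal(first_draw, second_draw))   # True
```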

You can sample with and without replacement from an array. Let's first construct an array with evenly-spaced values:

In [61]:
grid = np.arange(0., 1.01, 0.1)
grid
Out[61]:
array([ 0. ,  0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9,  1. ])

Without replacement

In [62]:
np.random.choice(grid, 5, replace=False)
Out[62]:
array([ 0.3,  0.8,  0.7,  1. ,  0. ])
In [63]:
np.random.choice(grid, 20, replace=False)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
 in ()
----> 1 np.random.choice(grid, 20, replace=False)

mtrand.pyx in mtrand.RandomState.choice()

ValueError: Cannot take a larger sample than population when 'replace=False'

With replacement:

In [64]:
np.random.choice(grid, 20, replace=True)
Out[64]:
array([ 0. ,  1. ,  0.3,  1. ,  0. ,  0.9,  1. ,  0.7,  0.2,  0.7,  0.4,
        0.6,  0.1,  0.6,  0.4,  0.3,  0.6,  0.3,  0.8,  1. ])

Tensors

We can think of tensors as a general name for multidimensional arrays of numerical values. Tensors originated in mathematics and physics and have since been applied in numerous other disciplines, including machine learning. In this class you will only be using scalars, vectors, and 2D arrays, so you do not need to worry about the name 'tensor'.

We will use the following naming conventions:

  • scalar = just a number = rank 0 tensor ( $a \in F$ )

  • vector = 1D array = rank 1 tensor ( $x = (x_1, \dots, x_n)^\top \in F^n$ )

  • matrix = 2D array = rank 2 tensor ( $\textbf{X} = [a_{ij}] \in F^{m \times n}$ )

  • 3D array = rank 3 tensor ( $\mathscr{X} = [t_{ijk}] \in F^{m \times n \times l}$ )

  • $\mathscr{N}$D array = rank $\mathscr{N}$ tensor ( $\mathscr{T} = [t_{i_1 \dots i_\mathscr{N}}] \in F^{n_1 \times \dots \times n_\mathscr{N}}$ )

Slicing a 2D array

In [ ]:
# how do we get just the second row of the above array?
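The array this cell refers to does not appear in this copy of the notebook, so the sketch below builds a small hypothetical 2D array (my_array2d is an assumed name) and shows one way to pull out rows, columns, and sub-blocks.

```python
import numpy as np

# Hypothetical 3x4 array standing in for the missing example
my_array2d = np.array([[1, 2, 3, 4],
                       [5, 6, 7, 8],
                       [9, 10, 11, 12]])

print(my_array2d.shape)       # (3, 4)
print(my_array2d[1])          # second row: [5 6 7 8]
print(my_array2d[1, :])       # same row, written with an explicit column slice
print(my_array2d[:, 1])       # second column: [ 2  6 10]
print(my_array2d[0:2, 1:3])   # rows 0-1 and columns 1-2
```

The index before the comma selects rows and the index after it selects columns.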

Numpy supports vector operations

What does this mean? It means that instead of looping over two arrays and adding them element by element, you can just say: add the two arrays.

In [ ]:
first = np.ones(5)
second = np.ones(5)
first + second # element-wise addition; returns a new array and leaves first and second unchanged

Note that this behavior is very different from Python lists, where + performs concatenation.

In [ ]:
first_list = [1., 1., 1., 1., 1.]
second_list = [1., 1., 1., 1., 1.]
first_list + second_list # concatenation

On some computer chips, this numpy addition actually happens in parallel and can yield significant increases in speed. But even on regular chips, the advantage of greater readability is important.
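To see what the vectorized expression replaces, the sketch below computes the same sum twice: once with an explicit element-by-element loop over Python lists, and once with a single array expression. The names loop_sum and vector_sum are introduced only for this comparison.

```python
import numpy as np

first = np.ones(5)
second = np.ones(5)

# Explicit element-by-element loop over two Python lists
loop_sum = [x + y for x, y in zip(first.tolist(), second.tolist())]

# The same computation as a single vectorized expression
vector_sum = first + second

print(loop_sum)      # [2.0, 2.0, 2.0, 2.0, 2.0]
print(vector_sum)    # [2. 2. 2. 2. 2.]
```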

Broadcasting

Numpy supports a concept known as broadcasting, which dictates how arrays of different sizes are combined together. There are too many rules to list here, but importantly, multiplying an array by a number multiplies each element by the number. Adding a number adds the number to each element.

In [ ]:
first + 1
In [ ]:
first*5

This means that if you wanted samples from a normal distribution with mean 5 and standard deviation 7, i.e. $N(5, 7^2)$, you could do:

In [ ]:
normal_5_7 = 5 + 7*normal_array
np.mean(normal_5_7), np.std(normal_5_7)
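Broadcasting also applies between arrays of different shapes, not only between an array and a number. The following sketch uses small hypothetical arrays to show a row and a column being combined with a 2D array.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])      # shape (2, 3)
row = np.array([10, 20, 30])        # shape (3,)

# The row is "stretched" across both rows of the matrix
print(matrix + row)
# [[11 22 33]
#  [14 25 36]]

column = np.array([[100], [200]])   # shape (2, 1)

# The column is "stretched" across all three columns of the matrix
print(matrix + column)
# [[101 102 103]
#  [204 205 206]]
```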

Multiplying two arrays multiplies them element-by-element

In [ ]:
(first +1) * (first*5)

You might have wanted to compute the dot product instead:

In [ ]:
np.dot((first +1) , (first*5))
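The dot product is the sum of the element-wise products, which the sketch below checks in three equivalent ways. The names u and v are introduced here for readability.

```python
import numpy as np

first = np.ones(5)
u = first + 1          # array of 2s
v = first * 5          # array of 5s

print(np.dot(u, v))    # 50.0
print(np.sum(u * v))   # 50.0, sum of the element-wise products
print(u @ v)           # 50.0, the @ operator also computes the dot product
```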

Probability Distributions from scipy.stats and statsmodels

Two useful statistics libraries in Python are scipy and statsmodels.

For example, to load the z-test for proportions:

In [ ]:
import statsmodels
from statsmodels.stats.proportion import proportions_ztest
In [ ]:
x = np.array([74,100])
n = np.array([152,266])

zstat, pvalue = proportions_ztest(x, n)
print("Two-sided z-test for proportions: \n","z =",zstat,", pvalue =",pvalue)
In [ ]:
#The `%matplotlib inline` ensures that plots are rendered inline in the browser.
%matplotlib inline
import matplotlib.pyplot as plt

Let's get the normal distribution namespace from scipy.stats. See the scipy.stats documentation for details.

In [ ]:
from scipy.stats import norm

Let's create 1,000 points between -10 and 10

In [ ]:
x = np.linspace(-10, 10, 1000) # linspace() returns evenly-spaced numbers over a specified interval
x[0:10], x[-10:]

Let's get the pdf of a normal distribution with a mean of 1 and standard deviation 3, and plot it using the grid points computed before:

In [ ]:
pdf_x = norm.pdf(x, 1, 3)
plt.plot(x, pdf_x);

And you can draw random samples (random variates) from the distribution using the rvs function.

References

A useful book by Jake Vanderplas: Python Data Science Handbook.
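A minimal sketch of rvs, assuming the same mean of 1 and standard deviation of 3 used for the pdf above. The random_state value is an arbitrary seed added here so the draw is repeatable.

```python
import numpy as np
from scipy.stats import norm

# Draw 1,000 samples from a normal distribution with mean 1 and standard deviation 3.
# random_state fixes the seed; 0 is an arbitrary choice.
samples = norm.rvs(loc=1, scale=3, size=1000, random_state=0)

print(samples.shape)                       # (1000,)
print(np.mean(samples), np.std(samples))   # close to 1 and 3, up to sampling noise
```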

You may also benefit from using Chris Albon's web site as a reference. It contains lots of useful information.

Dictionaries

A dictionary is another data structure (aka storage container) -- arguably the most powerful. Like a list, a dictionary is a collection of items. Unlike a list, its items are accessed with keys and not integer positions. (Since Python 3.7 dictionaries preserve insertion order, but they still cannot be indexed by position.)

Dictionaries are the closest data structure we have to a database.

Let's make a dictionary with a few Harvard courses and their corresponding enrollment numbers.

In [ ]:
enroll2017_dict = {'CS50': 692, 'CS109A / Stat 121A / AC 209A': 352, 'Econ1011a': 95, 'AM21a': 153, 'Stat110': 485}
enroll2017_dict

One can obtain the value corresponding to a key via:

In [ ]:
enroll2017_dict['CS50']

If you try to access a key that isn't present, your code will yield an error:

In [ ]:
enroll2017_dict['CS630']

Alternatively, the .get() method allows one to gracefully handle these situations by providing a default value if the key isn't found:

In [ ]:
enroll2017_dict.get('CS630', 5)

Note, this does not store a new value for the key; it only provides a value to return if the key isn't found.

In [ ]:
enroll2017_dict['CS630']
In [ ]:
enroll2017_dict.get('C730', None)

All sorts of iterations are supported:

In [ ]:
enroll2017_dict.values()
In [ ]:
enroll2017_dict.items()

We can iterate over the tuples obtained above:

In [ ]:
for key, value in enroll2017_dict.items():
    print("%s: %d" %(key, value))

Simply iterating over a dictionary gives us the keys. This is useful when we want to do something with each item:

In [ ]:
second_dict={}
for key in enroll2017_dict:
    second_dict[key] = enroll2017_dict[key]
second_dict

The above creates a new dictionary with the same keys and values as enroll2017_dict, unlike second_dict = enroll2017_dict, which would have made both variables label the same object in memory.
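The difference between copying and aliasing a dictionary can be checked with a short sketch. It uses a two-entry dictionary named enroll for brevity; the copy() method is a shorter alternative to the loop above.

```python
enroll = {'CS50': 692, 'Stat110': 485}

alias = enroll           # another label for the same dictionary
copied = enroll.copy()   # a new dictionary with the same keys and values

enroll['CS50'] = 700     # modify the original

print(alias['CS50'])     # 700  (alias sees the change)
print(copied['CS50'])    # 692  (the copy does not)
```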

In the previous dictionary example, the keys were strings corresponding to course names. Keys don't have to be strings, though; they can be other immutable data types such as numbers or tuples (not lists, as lists are mutable).
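Here is a small sketch of tuple keys, and of what happens when a list is used as a key. The (course, year) key layout and the 2016 enrollment number are made up for illustration.

```python
# Keys are (course, year) tuples; the 2016 number is invented for this example
enroll_by_year = {('CS50', 2016): 700, ('CS50', 2017): 692}
print(enroll_by_year[('CS50', 2017)])   # 692

error_message = None
try:
    bad = {['CS50', 2017]: 692}         # a list is mutable, hence unhashable
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)
```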

Dictionary comprehension: "Do not try this at home"

You can construct dictionaries using a dictionary comprehension, which is similar to a list comprehension. Notice the braces {} and the use of zip (see the next section for more on zip).

In [ ]:
float_list = [1., 3., 5., 4., 2.]
int_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

my_dict = {k:v for (k, v) in zip(int_list, float_list)}
my_dict

Creating tuples with zip

zip is a Python built-in function that returns an iterator of tuples, where the i-th tuple contains the i-th element from each of the argument iterables. The iterator stops when the shortest input iterable is exhausted. Because zip returns an iterator, displaying it directly does not show its contents; wrapping it in set() (or list()) collects the tuples so they can be printed. Note that a set does not preserve the order of the pairs. In the example below, the iterables are the two lists float_list and int_list, but zip accepts more than two iterables.

In [ ]:
float_list = [1., 3., 5., 4., 2.]
int_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

viz_zip = set(zip(int_list, float_list))
viz_zip
In [ ]:
type(viz_zip)
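If the order of the pairs matters, list() can be used in place of set(). The sketch below also shows how zip(*pairs) reverses the operation; the names pairs, ints, and floats are introduced only for illustration.

```python
float_list = [1., 3., 5., 4., 2.]
int_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

pairs = list(zip(int_list, float_list))   # list() keeps the original order
print(pairs)        # [(1, 1.0), (2, 3.0), (3, 5.0), (4, 4.0), (5, 2.0)]
print(len(pairs))   # 5, the length of the shorter input

# zip(*pairs) splits the pairs back into two tuples
ints, floats = zip(*pairs)
print(ints)         # (1, 2, 3, 4, 5)
print(floats)       # (1.0, 3.0, 5.0, 4.0, 2.0)
```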