CS 109A/STAT 121A/AC 209A/CSCI E-109A¶

Programming Style: Illustrated with an example¶

Harvard University
Fall 2017
Instructors: Pavlos Protopapas, Kevin Rader, Rahul Dave, Margo Levine

In [1]:

# IPython, the interactive Python shell used by the Notebook, has a set of predefined "magic functions"
# that you can call with a command-line style syntax.
# The one below asks the plotting library to draw inside the notebook instead of in a separate window,
# which prepares the Jupyter notebook for working with matplotlib.

%matplotlib inline 

# Below is the set of most useful data science modules that you will need for your code
# See all the "as ..." constructs? They're just aliasing the package names.
# That way we can call methods like plt.plot() instead of matplotlib.pyplot.plot().
# Notice that we use short aliases here; these are conventional in the Python community.

import numpy as np               # imports a fast numerical programming library
import scipy as sp               # imports stats functions, amongst other things
import matplotlib as mpl         # this actually imports matplotlib
import matplotlib.cm as cm       # allows us easy access to colormaps
import matplotlib.pyplot as plt  # sets up plotting under plt
import pandas as pd              # lets us handle data as dataframes

# sets up pandas table display
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)

import seaborn as sns            # sets up styles and gives us more plotting options
from bs4 import BeautifulSoup    # imports web parsing library https://www.crummy.com/software/BeautifulSoup/

Python Best Practices¶

Code readability is key. Python syntax itself is close to plain English, so well-chosen names and layout let the code read almost like prose.

Your variables and functions should be given descriptive identifiers!¶

Identifiers for variables should be descriptive words separated by underscores (not spaces) and in all lower case. Classes can use CamelCase.

BAD: var6 = 25, AG3ofMoTh3R = 25. GOOD: age_of_mother = 25

Do this:

def optimize_animals(cats, dogs):
    return cats ** 2 + np.sqrt(dogs)

Not this:

def feval(x, y):
    return x**2+np.sqrt(y)

White Space¶

You should use white space to increase readability. Put spaces around operators, but not around the = of a keyword argument in a function call or inside slices. Also insert blank lines where they separate logical steps, ideally with an insightful comment introducing each step.

BAD: x=[2,3,4], GOOD: x = [2, 3, 4]

Do this:

# Normalize x by adding 12
x = x + 12  

# Train on first 10 data points
train(data[:10], bias=x)

Not this:

x=x+12 #Add 12
train( data[: 10], bias = x )

Avoid name shadowing.¶

Please use names that differ from Python's built-in functions, so that the built-ins remain usable.

Do this:

train_data = data[:10] 
max_iterations = 100

Not this:

input = data[:10]  # Shadows input() function
iter = 100  # Shadows iter() function
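
To make the consequence concrete, here is a minimal sketch of what happens once a built-in name has been reassigned. The variable names (values, shadow_error, largest) are illustrative only and not part of the course code.

```python
import builtins

values = [3, 1, 2]

max = 10  # shadows the built-in max() for the rest of this scope

try:
    max(values)  # max is now an int, so calling it fails
    shadow_error = None
except TypeError as err:
    shadow_error = str(err)

# The original function is still reachable through the builtins module,
# but a non-clashing name avoids the problem in the first place.
largest = builtins.max(values)
max_iterations = 10
```

After the assignment, any later call to max() in the same notebook fails until the kernel is restarted or the name is deleted, which is why a descriptive name such as max_iterations is the safer choice.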

Additional¶

You should liberally intersperse your code with comments!
Proper indentation is non-negotiable. Be consistent in what you use. In Python, indentation matters!
If possible, do not use map and reduce where you can also run a for-loop (unless your environment, like Spark, specifically requires it).
Be consistent with variable names across all your work through the semester. For example, you might want to always write standard deviation as 'std_dev' or always as 'sigma', but not mix the two.
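
As a small illustration of the map/reduce guideline, the sketch below computes a sum of squares both ways. The data and variable names (readings, sum_of_squares) are made up for this example.

```python
from functools import reduce

readings = [1.5, 2.0, 3.5]

# Functional version: compact, but the reader has to unpack two lambdas
sum_of_squares_functional = reduce(
    lambda total, value: total + value,
    map(lambda value: value ** 2, readings),
    0.0,
)

# Loop version: each step is spelled out and easy to comment or debug
sum_of_squares = 0.0
for value in readings:
    sum_of_squares += value ** 2
```

Both versions produce the same number; the loop is simply easier to read and to step through.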

General Guidelines¶

The general guideline for Python code is: try to stick to PEP8 as much as possible. The complete style guide is available here: https://www.python.org/dev/peps/pep-0008/. Please read this document: while we don't expect strict adherence to it, following it will make your code readable by the entire Python community.

If you would like to adhere to it strictly, check out the code linter flake8. This is a tool that will complain if your code does not adhere to PEP8.

The one place we differ is in writing documentation: we will follow the NumPy style guide instead. See the docstring section of the NumPy style guide.
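
For orientation, here is a minimal sketch of a NumPy-style docstring. The function mean_absolute_deviation is a hypothetical example, not part of the course material. In this sketch the docstring is the first statement inside the function body, which is what makes it available through help() and the __doc__ attribute.

```python
import numpy as np


def mean_absolute_deviation(values):
    """
    Compute the mean absolute deviation of a sample around its mean.

    Parameters
    ----------
    values : array_like
        One-dimensional numeric data.

    Returns
    -------
    deviation : float
        The average of |value - mean| over all entries of `values`.

    Examples
    --------
    >>> mean_absolute_deviation([1.0, 2.0, 3.0, 6.0])
    1.5
    """
    values = np.asarray(values, dtype=float)
    # float() converts the NumPy scalar to a plain Python float
    return float(np.abs(values - values.mean()).mean())
```

The section names (Parameters, Returns, Examples) and their dashed underlines are the parts of the convention that tools and readers rely on.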

A question may arise: If you are expected to provide documentation in code-cells, what is the notebook markdown cell for? The answer is that it is for English and Math explanation of your process, and a very high level overview and skeleton of the code, rather than details.

The example below illustrates all of these considerations, in addition to providing good notebook text to set up and motivate the problem, and to explain the code.

The Monty Hall Problem¶

Here's a fun and perhaps surprising statistical riddle, and a good way to get some practice writing Python functions. It also illustrates our style guide.

In a gameshow, contestants try to guess which of 3 closed doors contains a cash prize (goats are behind the other two doors). Of course, the odds of choosing the correct door are 1 in 3. As a twist, the host of the show occasionally opens a door after a contestant makes his or her choice. This door is always one of the two the contestant did not pick, and is also always one of the goat doors (note that it is always possible to do this, since there are two goat doors). At this point, the contestant has the option of keeping his or her original choice, or switching to the other unopened door. The question is: is there any benefit to switching doors? The answer surprises many people who haven't heard the question before.

We can answer the problem by running simulations in Python. We'll do it in several parts.

First, write a function called simulate_prizedoor. This function will simulate the location of the prize in many games -- see the detailed specification below:

In [2]:

"""
Function
--------
simulate_prizedoor

Generate a random array of 0s, 1s, and 2s, representing
hiding a prize between door 0, door 1, and door 2

Parameters
----------
nsim : int
    The number of simulations to run

Returns
-------
sims : array
    Random array of 0s, 1s, and 2s

Example
-------
>>> simulate_prizedoor(3)
array([0, 0, 2])
"""

def simulate_prizedoor(nsim):
    return np.random.randint(0, 3, size=nsim)

Next, write a function that simulates the contestant's guesses for nsim simulations. Call this function simulate_guess. The specs:

In [3]:

"""
Function
--------
simulate_guess

Return any strategy for guessing which door a prize is behind. This
could be a random strategy, one that always guesses 2, whatever.

Parameters
----------
nsim : int
    The number of simulations to generate guesses for

Returns
-------
guesses : array
    An array of guesses. Each guess is a 0, 1, or 2

Example
-------
>>> simulate_guess(5)
array([0, 0, 0, 0, 0])
"""

def simulate_guess(nsim):
    # the np.int alias has been removed from recent NumPy releases; the built-in int works everywhere
    return np.zeros(nsim, dtype=int)

Next, write a function, goat_door, to simulate randomly revealing one of the goat doors that a contestant didn't pick.

In [4]:

"""
Function
--------
goat_door

Simulate the opening of a "goat door" that doesn't contain the prize,
and is different from the contestant's guess

Parameters
----------
prizedoors : array
    The door that the prize is behind in each simulation
guesses : array
    The door that the contestant guessed in each simulation

Returns
-------
goats : array
    The goat door that is opened for each simulation. Each item is 0, 1, or 2, and is different
    from both prizedoors and guesses

Examples
--------
>>> goat_door(np.array([0, 1, 2]), np.array([1, 1, 1]))
array([2, 2, 0])
"""

def goat_door(prizedoors, guesses):
    
    # strategy: generate random answers, and
    # keep updating until they satisfy the rule
    # that they aren't a prizedoor or a guess
    result = np.random.randint(0, 3, prizedoors.size)
    while True:
        bad = (result == prizedoors) | (result == guesses)
        if not bad.any():
            return result
        result[bad] = np.random.randint(0, 3, bad.sum())
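
Because goat_door is random, its output cannot be compared against one fixed array. One way to test it is to check the rule from the docstring directly. The helper is_valid_goat_reveal below is a hypothetical sketch for that purpose, not part of the original assignment.

```python
import numpy as np


def is_valid_goat_reveal(prizedoors, guesses, goats):
    """Return True if every revealed door is 0, 1, or 2 and is neither the prize door nor the guess."""
    in_range = np.isin(goats, [0, 1, 2])
    avoids_prize = goats != prizedoors
    avoids_guess = goats != guesses
    return bool(np.all(in_range & avoids_prize & avoids_guess))


example_prizedoors = np.array([0, 1, 2])
example_guesses = np.array([1, 1, 1])

# A legal reveal: no entry matches the prize door or the guess
valid = is_valid_goat_reveal(example_prizedoors, example_guesses, np.array([2, 2, 0]))

# An illegal reveal: the first entry opens the prize door
invalid = is_valid_goat_reveal(example_prizedoors, example_guesses, np.array([0, 2, 0]))
```

The same check can be applied to the array returned by goat_door for any inputs.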

Write a function, switch_guess, that represents the strategy of always switching a guess after the goat door is opened.

In [5]:

"""
Function
--------
switch_guess

The strategy that always switches a guess after the goat door is opened

Parameters
----------
guesses : array
     Array of original guesses, for each simulation
goatdoors : array
     Array of revealed goat doors for each simulation

Returns
-------
new_guesses : array
     The new door after switching. Should be different from both guesses and goatdoors

Examples
--------
>>> switch_guess(np.array([0, 1, 2]), np.array([1, 2, 1]))
array([2, 0, 0])
"""

def switch_guess(guesses, goatdoors):
    # integer array, so the returned doors have the same type as the inputs
    result = np.zeros(guesses.size, dtype=int)
    # maps (guess, goat door) to the one remaining door
    switch = {(0, 1): 2, (0, 2): 1, (1, 0): 2, (1, 2): 0, (2, 0): 1, (2, 1): 0}
    for i in [0, 1, 2]:
        for j in [0, 1, 2]:
            mask = (guesses == i) & (goatdoors == j)
            if not mask.any():
                continue
            result[mask] = switch[(i, j)]
    return result
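
As an aside, the lookup table can be replaced by arithmetic: the three door labels 0, 1, and 2 sum to 3, so the remaining door is 3 minus the guess minus the goat door. The function name switch_guess_arithmetic is hypothetical and used only in this sketch; it assumes the guess and the goat door differ in every simulation.

```python
import numpy as np


def switch_guess_arithmetic(guesses, goatdoors):
    # 0 + 1 + 2 == 3, so subtracting the two known doors leaves the third
    return 3 - guesses - goatdoors


switched = switch_guess_arithmetic(np.array([0, 1, 2]), np.array([1, 2, 1]))
```

This version is shorter, but it relies on the doors being labelled 0, 1, and 2, whereas the dictionary makes every case explicit.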

Last function: write a win_percentage function that takes arrays of guesses and prizedoors, and returns the percentage of correct guesses.

In [6]:

"""
Function
--------
win_percentage

Calculate the percent of times that a simulation of guesses is correct

Parameters
----------
guesses : array
    Guesses for each simulation
prizedoors : array
    Location of prize for each simulation

Returns
-------
percentage : number between 0 and 100
    The win percentage

Examples
--------
>>> print(round(win_percentage(np.array([0, 1, 2]), np.array([0, 0, 0])), 3))
33.333
"""

def win_percentage(guesses, prizedoors):
    return 100 * (guesses == prizedoors).mean()

Now, put it together. Simulate 10000 games where the contestant keeps the original guess, and 10000 games where the contestant switches doors after a goat door is revealed. Compute the percentage of games the contestant wins under each strategy. Is one strategy better than the other?

In [7]:

nsim = 10000

# keep guesses
print("Win percentage when keeping original door")
print(win_percentage(simulate_guess(nsim), simulate_prizedoor(nsim)))

# switch; the prize array is named prizedoors because pd is already the pandas alias
prizedoors = simulate_prizedoor(nsim)
guesses = simulate_guess(nsim)
goats = goat_door(prizedoors, guesses)
switched_guesses = switch_guess(guesses, goats)
print("Win percentage when switching doors")
print(win_percentage(switched_guesses, prizedoors))

Many people find this answer counter-intuitive. Famously, PhD mathematicians have incorrectly claimed the result must be wrong. (Clearly, none of them knew Python.)

One of the best ways to build intuition about why opening a goat door affects the odds is to re-run the experiment with 100 doors and one prize. If the game show host opens 98 goat doors after you make your initial selection, would you want to keep your first pick or switch? Can you generalize your simulation code to handle the case of n doors?
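
One possible starting point for the n-door case is sketched below. It is not a reference solution: the function name simulate_n_doors, its seed parameter, and the random guessing strategy are all assumptions made for this sketch. It also assumes the host opens every goat door except one (98 of them when n is 100), and it uses that fact to skip simulating the host explicitly.

```python
import numpy as np


def simulate_n_doors(n_doors, nsim, seed=None):
    """
    Estimate win percentages for keeping and for switching with n_doors doors.

    Assumes the host opens n_doors - 2 goat doors, so exactly one door
    besides the contestant's guess stays closed.
    """
    rng = np.random.default_rng(seed)
    prizedoors = rng.integers(0, n_doors, size=nsim)
    guesses = rng.integers(0, n_doors, size=nsim)

    keep_wins = guesses == prizedoors
    # Only two doors remain closed: the guess and one other. If the guess is
    # wrong, the other closed door must hide the prize, so switching wins
    # exactly when keeping loses.
    switch_wins = ~keep_wins

    return float(100 * keep_wins.mean()), float(100 * switch_wins.mean())


keep_pct, switch_pct = simulate_n_doors(n_doors=100, nsim=10000, seed=0)
print("Keep:", keep_pct, "Switch:", switch_pct)
```

A fuller generalization would also simulate which doors the host opens, in the spirit of goat_door, and would let the host open fewer than n - 2 doors.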