Key Word(s): MCMC, PyMC3, Bayesian inference, hierarchical models



Title:

MCMC and PyMC3 Hierarchical Models

Description :

Download this notebook to your local environment. Some parts of it are time-consuming, so we are providing the output cells as well. Once you have gone through them, make changes and run the cells again.

CS109B Data Science 2: Advanced Topics in Data Science

Lecture 13 - MCMC and Hierarchical Models in PyMC3

Harvard University
Spring 2021
Instructors: Pavlos Protopapas, Mark Glickman, and Chris Tanner
Additional Instructor: Eleni Angelaki Kaxiras
Content: Eleni Angelaki Kaxiras


In [1]:
## RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2019-CS109B/master/content/styles/cs109.css").text
HTML(styles)
Out[1]:
In [2]:
import pymc3 as pm
from pymc3 import summary
import arviz as az
from matplotlib import gridspec
# Ignore a common pymc3 warning that comes from library functions, not our code.
# Pymc3 may throw additional warnings, but other warnings should be manageable
# by following the instructions included within the warning messages.
import warnings
messages=[
    "Using `from_pymc3` without the model will be deprecated in a future release",
]

# or silence all warnings (not recommended)
# warnings.filterwarnings('ignore')

for m in messages:
    warnings.filterwarnings("ignore", message=m)
print(f"Using PyMC3 version: {pm.__version__}")
print(f"Using ArviZ version: {az.__version__}")
Using PyMC3 version: 3.8
Using ArviZ version: 0.7.0
In [4]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
%matplotlib inline
In [5]:
# from pymc3 import Model, Normal, HalfNormal, model_to_graphviz
# from pymc3 import NUTS, sample, find_MAP
from scipy import optimize
In [6]:
%%javascript
IPython.OutputArea.auto_scroll_threshold = 20000;
In [7]:
#pandas trick
pd.options.display.max_columns = 50  # None -> No Restrictions
pd.options.display.max_rows = 200    # None -> Be careful with this 
pd.options.display.max_colwidth = 100
pd.options.display.precision = 3

Learning Objectives

By the end of this lab, you should be able to understand how to:

  • run a PyMC3 Bayesian model using MCMC.
  • implement hierarchical models in PyMC3.


1. Markov Chain Monte Carlo (MCMC) Simulations

PyMC3 is a Python library for Bayesian analysis: data creation, model definition, model fitting, and posterior analysis. It is built around the concept of a model, which assigns parametric statistical distributions to the unknown quantities in the problem. Within a model we define random variables and their distributions. A distribution requires at least a name argument, plus the other parameters that define it. You may also use the logp() method in the model to build the model log-likelihood function. We first define the model, then fit it.

PyMC3 includes a comprehensive set of pre-defined statistical distributions that can be used as model building blocks. Although they are not meant to be used outside of a model, you can invoke them by using the prefix pm, as in pm.Normal.
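For instance (a minimal sketch, not part of the original lab), the .dist() classmethod creates a stand-alone distribution object whose log-probability can be evaluated directly:

# evaluate the log-density of a standard Normal at 0; expect about -0.9189
d = pm.Normal.dist(mu=0, sigma=1)
print(d.logp(0.).eval())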

Monte Carlo methods date back to the 1940s, when Enrico Fermi, John von Neumann, Stan Ulam, Nicholas Metropolis, and others used random numbers to study physics from a stochastic point of view. The name is attributed to Metropolis.

For a history of the Monte Carlo Method first used in Physics see: The beginning of the Monte Carlo Method

PyMC3 uses the No-U-Turn Sampler (NUTS) and Random-Walk Metropolis, two Markov chain Monte Carlo (MCMC) algorithms, for sampling in posterior space. The "Monte Carlo" in the name refers to the fact that, when we sample in posterior space, we choose each move via a pseudo-random process. NUTS is a sophisticated algorithm that can handle a large number of unknown (albeit continuous) variables.
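By default, pm.sample auto-assigns NUTS to continuous variables, but you can request a different step method explicitly. Below is a minimal, hypothetical sketch (the toy_model name and its single parameter are not part of this lab) showing how to force Random-Walk Metropolis:

with pm.Model() as toy_model:
    theta = pm.Normal('theta', mu=0, sigma=1)    # a single unknown quantity
    step = pm.Metropolis()                       # Random-Walk Metropolis instead of NUTS
    toy_trace = pm.sample(1000, step=step, tune=500)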

In [8]:
#help(pm.Poisson)

Bayesian Linear Regression

We will artificially create the data, so the true parameter values are known. We will then see whether our model recovers them correctly.

In [9]:
np.random.seed(123)

######## True parameter values 
##### our model does not see these
sigma = 1
beta0 = 1
beta = [1, 2.5]
###############################
# Size of dataset
size = 100

# Feature variables
x1 = np.linspace(0, 1., size)
x2 = np.linspace(0,2., size)

# Create outcome variable with random noise
Y = beta0 + beta[0]*x1 + beta[1]*x2 + np.random.randn(size)*sigma
In [10]:
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
fontsize=14
labelsize=8
title='Observed Data (created artificially by ' + r'$Y(x_1,x_2)$)'
ax = fig.add_subplot(111, projection='3d')

ax.scatter(x1, x2, Y)
ax.set_xlabel(r'$x_1$', fontsize=fontsize)
ax.set_ylabel(r'$x_2$', fontsize=fontsize)
ax.set_zlabel(r'$Y$', fontsize=fontsize)

ax.tick_params(labelsize=labelsize)

fig.suptitle(title, fontsize=fontsize)
fig.tight_layout(pad=.1, w_pad=10.1, h_pad=2.)
#fig.subplots_adjust(); #top=0.5
plt.show()

Now let's see if our model will correctly recover the values of our unknown parameters, namely $\beta_0$, $\beta_1$, $\beta_2$, and $\sigma$.

Defining the Problem

Our problem is the following: we want to perform multiple linear regression to predict an outcome variable $Y$ which depends on variables $\bf{x}_1$ and $\bf{x}_2$.

We will model $Y$ as normally distributed observations with an expected value $\mu$ that is a linear function of the two predictor variables, $\bf{x}_1$ and $\bf{x}_2$.

\begin{equation} Y \sim \mathcal{N}(\mu,\,\sigma^{2}) \end{equation}

\begin{equation} \mu = \beta_0 + \beta_1 \bf{x}_1 + \beta_2 \bf{x}_2 \end{equation}

where $\sigma$ represents the measurement error (the data above were generated with $\sigma = 1$).

We also choose priors for the parameters, matching the model definition below:

\begin{eqnarray} \beta_i \sim \mathcal{N}(0,\,10^2) \\ \sigma \sim |\mathcal{N}(0,\,1)| \end{eqnarray}

Defining a Model in PyMC3

In [11]:
with pm.Model() as my_linear_model:

    # Priors for unknown model parameters, specifically created stochastic random variables 
    # with Normal prior distributions for the regression coefficients,
    # and a half-normal distribution for the standard deviation of the observations.
    # These are our parameters. P(theta)

    beta0 = pm.Normal('beta0', mu=0, sd=10)
    # Note: betas is a vector of two variables, b1 and b2, (denoted by shape=2)
    # so, in array notation, our beta1 = betas[0], and beta2=betas[1]
    betas = pm.Normal('betas', mu=0, sd=10, shape=2)
    sigma = pm.HalfNormal('sigma', sd=1)

    # mu is what is called a deterministic random variable: its value is completely
    # determined by its parents' values (beta0 and betas in our case, plus the data).
    # There is no uncertainty in the variable beyond that which is inherent in the parents' values.

    mu = beta0 + betas[0]*x1 + betas[1]*x2

    # Likelihood function = how probable is my observed data?
    # This is a special case of a stochastic variable that we call an observed stochastic.
    # It is identical to a standard stochastic, except that its observed argument, 
    # which passes the data to the variable, indicates that the values for this variable were observed, 
    # and should not be changed by any fitting algorithm applied to the model. 
    # The data can be passed in the form of either a numpy.ndarray or pandas.DataFrame object.

    Y_obs = pm.Normal('Y_obs', mu=mu, sd=sigma, observed=Y)

Note: if our problem were a classification problem, we would use logistic regression for the likelihood instead.

Python Note: pm.Model is designed as a simple API that abstracts away the details of the inference. For the use of with, see compound statements in Python.

In [12]:
## do not worry about this, it's just a nice graph to have
## you need to install python-graphviz first
# conda install -c conda-forge python-graphviz
pm.model_to_graphviz(my_linear_model)
Out[12]:
[Model graph: beta0 ~ Normal, betas ~ Normal (plate of 2), and sigma ~ HalfNormal all feed into Y_obs ~ Normal (plate of 100 observations).]

Fitting the Model with Sampling - Doing Inference

Below is the signature of PyMC3's sampling method. It has quite a few parameters; most are set to sensible defaults by the package, but for some it's useful to set your own values.

pymc3.sampling.sample(draws=500, step=None, n_init=200000, chains=None,
                      cores=None, tune=500, random_seed=None)

Parameters to set:

  • draws: (int): Number of samples to keep when drawing, defaults to 500. The count starts after tuning has ended.
  • tune: (int): Number of iterations to use for tuning the model, also called the burn-in period, defaults to 500. Samples from the tuning period will be discarded.
  • target_accept (float in $[0, 1]$). The step size is tuned such that we approximate this acceptance rate. Higher values like 0.9 or 0.95 often work better for problematic posteriors.
  • (optional) chains (int) number of chains to run in parallel, defaults to the number of CPUs in the system, but at most 4.

pm.sample returns a pymc3.backends.base.MultiTrace object that contains the samples. We usually name it a variation of the word trace. All the information about the posterior is in trace, which also provides statistics about the sampler.
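A few common ways to inspect a MultiTrace (a sketch, using the trace produced in the sampling cell below):

trace['beta0']                          # all kept draws of beta0, chains concatenated
trace.get_values('beta0', chains=[0])   # draws from a single chain
trace.nchains, len(trace)               # number of chains and draws per chain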

What does a good trace plot look like? Is this a good one?

In [13]:
## uncomment this to see more about pm.sample
#help(pm.sample)
In [14]:
#help(pm.backends.base.MultiTrace)
In [15]:
with my_linear_model:

    print(f'Starting MCMC process')
    # draw nsamples posterior samples and run the default number of chains = 4 
    nsamples = 1000 # number of samples to keep
    burnin = 100 # burnin period
    trace = pm.sample(nsamples, tune=burnin, target_accept=0.8)
    print(f'DONE')
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Starting MCMC process
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, betas, beta0]
Sampling 4 chains, 0 divergences: 100%|██████████| 4400/4400 [00:12<00:00, 352.64draws/s]
The acceptance probability does not match the target. It is 0.8975560054708723, but should be close to 0.8. Try to increase the number of tuning steps.
The acceptance probability does not match the target. It is 0.9136382172502662, but should be close to 0.8. Try to increase the number of tuning steps.
The acceptance probability does not match the target. It is 0.9276833880914027, but should be close to 0.8. Try to increase the number of tuning steps.
The number of effective samples is smaller than 25% for some parameters.
DONE
In [16]:
var_names = trace.varnames
var_names.remove('sigma_log__')  # list.remove() works in place and returns None
var_names

Model Plotting

PyMC3 provides a variety of visualizations via plots: https://docs.pymc.io/api/plots.html. arviz is another library that you can use.

In [17]:
title = f'Traceplots for our artificial dataset with parameters: {var_names}'
pm.traceplot(trace);
plt.suptitle(
    title,
    fontsize=20
)
plt.show()
In [18]:
az.plot_trace(trace, var_names=var_names);
plt.suptitle(
    title,
    fontsize=20
)
plt.show()
In [19]:
# generate results table from trace samples
# remember our true hidden values sigma = 1, beta0 = 1, beta = [1, 2.5] 
# We want R_hat < 1.1
results = az.summary(
    trace,
    var_names=var_names
)
display(results)
            mean     sd   hpd_3%  hpd_97%  mcse_mean  mcse_sd  ess_mean  ess_sd  ess_bulk  ess_tail  r_hat
beta0      1.016  0.232    0.572    1.446      0.005    0.003    2313.0  2214.0    2327.0    1608.0   1.00
betas[0]   1.439  8.795  -15.281   17.336      0.366    0.259     576.0   576.0     577.0     687.0   1.00
betas[1]   2.292  4.402   -5.836   10.352      0.184    0.130     572.0   572.0     573.0     684.0   1.01
sigma      1.147  0.082    1.005    1.311      0.001    0.001    3021.0  2964.0    3091.0    2468.0   1.00
In [20]:
#help(pm.Normal)

$\hat{R}$ is a metric for checking how well a chain has converged to the equilibrium distribution by comparing its behavior to other randomly initialized Markov chains. Multiple chains initialized from different initial conditions should give similar results. If all chains converge to the same equilibrium, $\hat{R}$ will be close to 1. If the chains have not converged to a common distribution, $\hat{R}$ will be larger than 1.01. A value of $\hat{R}$ close to 1 is a necessary but not a sufficient condition for convergence.

For details on the $\hat{R}$ see Gelman and Rubin (1992).
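As a quick check (a sketch, assuming the trace from above), $\hat{R}$ can also be computed directly with ArviZ instead of being read off the summary table:

# one R-hat value per parameter, returned as an xarray.Dataset
print(az.rhat(trace))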

This linear regression example is from the original paper on PyMC3: Salvatier J, Wiecki TV, Fonnesbeck C. 2016. Probabilistic programming in Python using PyMC3. PeerJ Computer Science 2:e55 https://doi.org/10.7717/peerj-cs.55


2. Hierarchical Models

Gelman et al.'s famous radon dataset is a classic for hierarchical modeling. In this dataset the amount of the radioactive gas radon was measured in households in counties of several states. Radon is the leading cause of lung cancer among non-smokers. It is believed to be more strongly present in households with a basement, and its level is thought to vary with the type of soil. Here we'll investigate these differences and try to predict radon levels in different counties based on the county itself and the presence of a basement. We'll look at Minnesota, a state with 85 counties in which measurements were taken, ranging from 1 to 116 measurements per county.

In [21]:
df = pd.read_csv('data/radon.csv', index_col=[0])
df['log_radon'] = df['log_radon'].astype('float')
county_names = df.county.unique()
county_idx = df.county_code.values
n_counties = len(df.county.unique())

df.head()
Out[21]:
idnum state state2 stfips zip region typebldg floor room basement windoor rep stratum wave starttm stoptm startdt stopdt activity pcterr adjwt dupflag zipflag cntyfips county fips Uppm county_code log_radon
0 5081.0 MN MN 27.0 55735 5.0 1.0 1.0 3.0 N 2 4.0 41 930.0 930.0 12088.0 12288.0 2.2 9.7 1146.499 1.0 0.0 1.0 AITKIN 27001.0 0.502 0 0.833
1 5082.0 MN MN 27.0 55748 5.0 1.0 0.0 4.0 Y 5 2.0 40 1615.0 1615.0 11888.0 12088.0 2.2 14.5 471.366 0.0 0.0 1.0 AITKIN 27001.0 0.502 0 0.833
2 5083.0 MN MN 27.0 55748 5.0 1.0 0.0 4.0 Y 3 2.0 42 1030.0 1515.0 20288.0 21188.0 2.9 9.6 433.317 0.0 0.0 1.0 AITKIN 27001.0 0.502 0 1.099
3 5084.0 MN MN 27.0 56469 5.0 1.0 0.0 4.0 Y 2 2.0 24 1410.0 1410.0 122987.0 123187.0 1.0 24.3 461.624 0.0 0.0 1.0 AITKIN 27001.0 0.502 0 0.095
4 5085.0 MN MN 27.0 55011 3.0 1.0 0.0 4.0 Y 3 2.0 40 600.0 600.0 12888.0 13088.0 3.1 13.8 433.317 0.0 0.0 3.0 ANOKA 27003.0 0.429 1 1.163
In [22]:
df.shape
Out[22]:
(919, 29)

Each row in the dataframe represents the radon measurements for one house in a specific county including whether the house has a basement (floor = 0) or not (floor = 1). We are interested in whether having a basement increases the radon measured in the house.

To keep things simple, let's keep only the following three variables: county, log_radon, and floor.

Let's check how many different counties we have. We also notice that counties differ in the number of houses measured: some have many, some only one.

In [23]:
print(f'We have {n_counties} counties in {len(df.state.value_counts())} state.')
df['county'].value_counts().head(5)
We have 85 counties in 1 state.
Out[23]:
ST LOUIS      116
HENNEPIN      105
DAKOTA         63
ANOKA          52
WASHINGTON     46
Name: county, dtype: int64
In [24]:
# keep only these variables
data = df[['county', 'log_radon', 'floor']]
data.head()
Out[24]:
county log_radon floor
0 AITKIN 0.833 1.0
1 AITKIN 0.833 0.0
2 AITKIN 1.099 0.0
3 AITKIN 0.095 0.0
4 ANOKA 1.163 0.0
In [25]:
data['county'].value_counts()[-5:]
Out[25]:
ROCK        2
STEVENS     2
MURRAY      1
WILKIN      1
MAHNOMEN    1
Name: county, dtype: int64

Pooling: Same Linear Regression for all

We can just pool all the data and estimate one big regression to assess the influence of having a basement on radon levels across all counties. Our model would be:

\begin{equation} y_{i} = \alpha + \beta \cdot \text{floor}_{i} \end{equation}

where $i$ indexes the measurement (house), and floor is 0 if the house has a basement and 1 otherwise. By ignoring the county feature, we do not differentiate between counties.

In [26]:
with pm.Model() as pooled_model:
    # common priors for all
    a = pm.Normal('a', mu=0, sigma=100)
    b = pm.Normal('b', mu=0, sigma=100)

    # radon estimate
    radon_est = a + b*data['floor'].values

    # likelihood of the observed radon measurements
    radon_obs = pm.Normal('radon_obs', mu=radon_est,
                           observed=data['log_radon']) # note here we enter the whole dataset
In [27]:
pm.model_to_graphviz(pooled_model)
Out[27]:
[Model graph: a ~ Normal and b ~ Normal feed into radon_obs ~ Normal (plate of 919 observations).]
In [28]:
with pooled_model:

    pooled_trace = pm.sample(2000, tune=1000, target_accept=0.9)
    print(f'DONE')
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [b, a]
Sampling 4 chains, 0 divergences: 100%|██████████| 12000/12000 [00:02<00:00, 4005.26draws/s]
DONE
In [29]:
pm.traceplot(pooled_trace);

Remember, with the pooled model we have only one intercept, $\alpha$, and only one slope, $\beta$ for all the counties. Let's plot the regression lines.

In [30]:
# plot just a subset of the counties: the five with the most measurements
# and five sparsely measured ones
counties_by_counts = data['county'].value_counts()
counties = counties_by_counts.index[:5].tolist() + counties_by_counts.index[75:-5].tolist()
counties
Out[30]:
['ST LOUIS',
 'HENNEPIN',
 'DAKOTA',
 'ANOKA',
 'WASHINGTON',
 'MILLE LACS',
 'FILLMORE',
 'COOK',
 'LAC QUI PARLE',
 'POPE']
In [31]:
# plot just a subset of the counties
#counties = ['HENNEPIN','AITKIN','WASHINGTON', 'MURRAY', 'YELLOW MEDICINE', 'MAHNOMEN']

plt.figure(figsize=(15,5))
rows = 2
gs = gridspec.GridSpec(rows, len(counties)//rows)

for i, county in enumerate(counties):
    county_data = data.loc[data['county']==county]
    x = np.linspace(-0.2, 1.2)
    radon_est = pooled_trace['a'].mean() + pooled_trace['b'].mean()*x
    subplt = plt.subplot(gs[i])

    subplt.set_ylim(0.,4.)
    subplt.scatter(county_data['floor'], county_data['log_radon'])
    subplt.plot(x, radon_est, c='r', label='pooled line');
    subplt.set_xlabel('floor', fontsize=10)
    subplt.set_ylabel('radon level', fontsize=10)
    subplt.set_title(str(county) + ' County')
    subplt.legend()
    plt.tight_layout()

Unpooling: Separate Linear Regression for each county

We believe that different counties have different relationships between radon and basements. Our model would be:

\begin{equation} \textbf{radon}_{i,c} = \alpha_c + \beta_c \cdot \textbf{floor}_{i,c} \end{equation}

where $i$ indexes the measurement, $c$ the county, and floor is 0 if the house has a basement and 1 otherwise.

Notice that we now have separate coefficients $\alpha_c$ and $\beta_c$ for each county. They are completely independent; they could even come from different distributions.

As an example, we will do this for only one county, since fitting a separate model for every county is very time-consuming.
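For reference, here is a sketch (not run in this lab) of what a fully unpooled model over all counties could look like: one independent (a, b) pair per county and no shared hyperpriors, reusing the data, county_idx, and n_counties variables defined earlier.

with pm.Model() as fully_unpooled_model:
    # one independent intercept and slope per county, fixed (non-shared) priors
    a = pm.Normal('a', mu=0., sigma=100, shape=n_counties)
    b = pm.Normal('b', mu=0., sigma=100, shape=n_counties)

    radon_est = a[county_idx] + b[county_idx]*data['floor'].values
    radon_obs = pm.Normal('radon_obs', mu=radon_est,
                          observed=data['log_radon'])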

In [32]:
# choose a county
county = 'YELLOW MEDICINE'
county_data = data.loc[data['county']==county]
county_data
Out[32]:
county log_radon floor
917 YELLOW MEDICINE 1.335 0.0
918 YELLOW MEDICINE 1.099 0.0
In [33]:
#help(pm.Normal)
In [34]:
with pm.Model() as unpooled_model:
    mu_a = pm.Normal('mu_a', mu=0., sigma=100)
    sigma_a = pm.HalfNormal('sigma_a', 5.)
    mu_b = pm.Normal('mu_b', mu=0., sigma=100)
    sigma_b = pm.HalfNormal('sigma_b', 5.)

    a = pm.Normal('a', mu=mu_a, sigma=sigma_a)
    b = pm.Normal('b', mu=mu_b, sigma=sigma_b)

    radon_est = a + b*county_data['floor'].values

    radon_obs = pm.Normal('radon_like', mu=radon_est,
                           observed=county_data['log_radon'])
In [35]:
pm.model_to_graphviz(unpooled_model)
Out[35]:
[Model graph: mu_a and sigma_a feed into a ~ Normal; mu_b and sigma_b feed into b ~ Normal; a and b feed into radon_like ~ Normal (plate of 2 observations).]
In [36]:
with unpooled_model:
    unpooled_trace = pm.sample(5000, tune=1000, target_accept=0.99)
    print(f'DONE')
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [b, a, sigma_b, mu_b, sigma_a, mu_a]
Sampling 4 chains, 238 divergences: 100%|██████████| 24000/24000 [10:05<00:00, 39.65draws/s]
There were 53 divergences after tuning. Increase `target_accept` or reparameterize.
There were 53 divergences after tuning. Increase `target_accept` or reparameterize.
There were 61 divergences after tuning. Increase `target_accept` or reparameterize.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
There were 71 divergences after tuning. Increase `target_accept` or reparameterize.
The number of effective samples is smaller than 25% for some parameters.
DONE
In [37]:
pm.traceplot(unpooled_trace);

Plot the regression line for our chosen county alone, comparing the unpooled and pooled fits.

In [38]:
county_data = data.loc[data['county']==county]
print(len(county_data))

x = np.linspace(-0.2, 1.2)
radon_est_unpooled = unpooled_trace['a'].mean() + unpooled_trace['b'].mean()*x

xx = np.linspace(-0.2, 1.)
radon_est_pooled = pooled_trace['a'].mean() + pooled_trace['b'].mean()*xx

plt.scatter(county_data['floor'], county_data['log_radon'])
plt.xlim(-0.1,1.1)
plt.xlabel('floor', fontsize=10)
plt.ylabel('radon level', fontsize=10)
plt.title(f'{str(county)} county Radon levels')
plt.plot(x, radon_est_unpooled, c='g', label='unpooled line');
plt.plot(xx, radon_est_pooled, c='r', label='pooled line');
plt.legend();
2

Partial pooling: Hierarchical Regression (Varying-Coefficients Model)

Counties, of course, have similarities, so there is a middle ground between these two extremes. Specifically, we may assume that while $\alpha_c$ and $\beta_c$ are different for each county, as in the unpooled case, the coefficients are all drawn from the same distribution:

\begin{equation} \textbf{radon}_{i,c} = \alpha_c + \beta_c \cdot \textbf{floor}_{i,c} \end{equation}

\begin{equation} \alpha_c \sim \mathcal{N}(\mu_a,\,\sigma_a^{2}) \end{equation}

\begin{equation} \beta_c \sim \mathcal{N}(\mu_b,\,\sigma_b^{2}) \end{equation}

where the common hyperparameters, matching the model code below, are: \begin{eqnarray} \mu_a \sim \mathcal{N}(0,\,100^2) \\ \sigma_a \sim |\mathcal{N}(0,\,5)| \\ \mu_b \sim \mathcal{N}(0,\,100^2) \\ \sigma_b \sim |\mathcal{N}(0,\,5)| \end{eqnarray}

The different counties effectively share information through the common priors. We thus observe what is known as shrinkage: the groups are modeled neither as fully independent from each other nor as one single group, but as related.

Discussion: how can we best handle this data? Does it make sense to make inferences without taking into account the county?

TALKING POINTS

We saw that some counties had only one sample, so if that house happens to be a really old house with old lead pipes, our prediction will be that all houses in that county have high radon levels. At the other extreme, if we sampled a newer house with no radon, we would again get misleading results. In one case you overestimate the problem, in the other you underestimate it. Under a hierarchical model, the misestimation of one group is offset by the information provided by the other groups. As always, gathering more data helps, if that is an option.

In [39]:
with pm.Model() as hierarchical_model:
    # Hyperpriors for group nodes
    mu_a = pm.Normal('mu_a', mu=0., sigma=100)
    sigma_a = pm.HalfNormal('sigma_a', 5.)
    mu_b = pm.Normal('mu_b', mu=0., sigma=100)
    sigma_b = pm.HalfNormal('sigma_b', 5.)

    # Above we just set mu and sd to a fixed value while here we
    # plug in a common group distribution for all a and b (which are
    # vectors of length n_counties).

    # Intercept for each county, distributed around group mean mu_a
    a = pm.Normal('a', mu=mu_a, sigma=sigma_a, shape=n_counties)

    # beta for each county, distributed around group mean mu_b
    b = pm.Normal('b', mu=mu_b, sigma=sigma_b, shape=n_counties)

    # Model error
    #eps = pm.HalfCauchy('eps', 5.)

    radon_est = a[county_idx] + b[county_idx]*data['floor'].values

    # Data likelihood with sigma for random error
#     radon_like = pm.Normal('radon_like', mu=radon_est,
#                            sigma=eps, observed=data['log_radon'])

    # Data likelihood without an explicit error term (sd defaults to 1)
    radon_like = pm.Normal('radon_like', mu=radon_est, #sigma=eps, 
                           observed=data['log_radon'])
In [40]:
# uncomment to create graph
pm.model_to_graphviz(hierarchical_model)
Out[40]:
[Model graph: mu_a and sigma_a feed into a ~ Normal (plate of 85 counties); mu_b and sigma_b feed into b ~ Normal (plate of 85 counties); a and b feed into radon_like ~ Normal (plate of 919 observations).]

Inference

In [41]:
%%time
with hierarchical_model:
    hierarchical_trace = pm.sample(10000, tune=5000, target_accept=.9)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [b, a, sigma_b, mu_b, sigma_a, mu_a]
Sampling 4 chains, 6,158 divergences: 100%|██████████| 60000/60000 [02:31<00:00, 397.35draws/s]
There were 1194 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.7248100688488703, but should be close to 0.9. Try to increase the number of tuning steps.
There were 503 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.7737296457857813, but should be close to 0.9. Try to increase the number of tuning steps.
There were 1450 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.7519143067834009, but should be close to 0.9. Try to increase the number of tuning steps.
There were 3011 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.5847316188952049, but should be close to 0.9. Try to increase the number of tuning steps.
The estimated number of effective samples is smaller than 200 for some parameters.
CPU times: user 44.8 s, sys: 5.47 s, total: 50.3 s
Wall time: 2min 47s
In [42]:
pm.traceplot(hierarchical_trace, var_names=['mu_a', 'mu_b',
                                            'sigma_a', 'sigma_b']);
In [43]:
results = pm.summary(hierarchical_trace)
In [44]:
results[:20] # printing only 20 of them
Out[44]:
         mean     sd  hpd_3%  hpd_97%  mcse_mean  mcse_sd  ess_mean  ess_sd  ess_bulk  ess_tail  r_hat
mu_a    1.469  0.056   1.362    1.574      0.003    0.002     444.0   435.0     451.0     235.0   1.00
mu_b   -0.642  0.095  -0.818   -0.457      0.003    0.002     866.0   857.0     829.0    2698.0   1.01
a[0]    1.354  0.231   0.920    1.780      0.006    0.005    1321.0  1264.0    1279.0    1527.0   1.00
a[1]    1.052  0.133   0.825    1.322      0.010    0.007     164.0   164.0     163.0     417.0   1.02
a[2]    1.474  0.239   1.027    1.927      0.009    0.007     676.0   676.0     675.0     839.0   1.01
a[3]    1.496  0.209   1.082    1.882      0.005    0.003    1851.0  1851.0    1822.0    2660.0   1.01
a[4]    1.440  0.223   1.023    1.870      0.005    0.004    1672.0  1596.0    1630.0    4032.0   1.00
a[5]    1.496  0.237   1.040    1.932      0.010    0.007     586.0   576.0     577.0    1197.0   1.01
a[6]    1.723  0.190   1.363    2.072      0.008    0.005     603.0   603.0     598.0    2845.0   1.01
a[7]    1.562  0.231   1.134    2.005      0.012    0.008     394.0   394.0     389.0    2584.0   1.01
a[8]    1.303  0.198   0.929    1.666      0.005    0.004    1632.0  1577.0    1636.0    2268.0   1.00
a[9]    1.481  0.211   1.086    1.890      0.004    0.003    3168.0  3037.0    2949.0   15587.0   1.00
a[10]   1.463  0.220   1.051    1.883      0.005    0.004    1900.0  1756.0    1922.0    3058.0   1.00
a[11]   1.515  0.230   1.072    1.930      0.007    0.005    1228.0  1228.0    1206.0    2406.0   1.00
a[12]   1.365  0.214   0.953    1.759      0.009    0.007     510.0   494.0     500.0    2764.0   1.01
a[13]   1.711  0.192   1.368    2.092      0.006    0.004    1017.0  1017.0    1006.0    2924.0   1.00
a[14]   1.446  0.230   1.009    1.874      0.007    0.005    1111.0  1015.0    1107.0     974.0   1.00
a[15]   1.386  0.243   0.934    1.848      0.008    0.006     886.0   815.0     873.0     800.0   1.00
a[16]   1.429  0.224   0.997    1.849      0.005    0.004    1975.0  1828.0    1939.0    2158.0   1.01
a[17]   1.323  0.192   0.957    1.665      0.005    0.004    1394.0  1394.0    1412.0    4408.0   1.00
In [45]:
counties
Out[45]:
['ST LOUIS',
 'HENNEPIN',
 'DAKOTA',
 'ANOKA',
 'WASHINGTON',
 'MILLE LACS',
 'FILLMORE',
 'COOK',
 'LAC QUI PARLE',
 'POPE']
In [46]:
# use counties from before or choose new ones
#counties = ['HENNEPIN','AITKIN','WASHINGTON', 'LAKE OF THE WOODS', 'YELLOW MEDICINE', 'ANOKA']
#counties = ['HENNEPIN', 'DAKOTA', 'ANOKA', 'WILKIN', 'MURRAY']

plt.figure(figsize=(15,5))
rows = 2
gs = gridspec.GridSpec(rows, len(counties)//rows)

for i, county in enumerate(counties):
    county_data = data.loc[data['county']==county]
    if county_data.shape[0]==1: continue  # skip counties with a single measurement
    subplt = plt.subplot(gs[i])

    # pooled line (a single coefficient pair for all counties)
    xx = np.linspace(-0.2, 1.2)
    radon_est = pooled_trace['a'].mean() + pooled_trace['b'].mean()*xx

    # hierarchical line: index the trace by the county's code, not the loop
    # position (county_code follows the order of df.county.unique())
    c = list(county_names).index(county)
    radon_est_hier = hierarchical_trace['a'][:, c].mean() + \
                     hierarchical_trace['b'][:, c].mean()*xx

    subplt.set_ylim(0.,4.)
    subplt.scatter(county_data['floor'], county_data['log_radon'])
    subplt.plot(xx, radon_est, c='r', label='pooled');

    # plot the hierarchical, varying coefficient model
    subplt.plot(xx, radon_est_hier, c='g', label='hierarchical');

    subplt.set_xlabel('floor', fontsize=10)
    subplt.set_ylabel('radon level', fontsize=10)
    subplt.set_title(str(county) + ' County')
    subplt.legend()
    plt.tight_layout()

Appendix A: Is this a fair coin?

Can we find out if this is a fair coin without using sampling?

Let's say you visit the casino. You want to test your theory that casinos are dubious places where coins have been manipulated to have a larger probability of tails. So you will try to estimate how fair a coin is based on a certain number of flips. You have no prior opinion on the coin's fairness (i.e. on what $\theta$ might be) and begin flipping the coin. You get either Heads ($H$) or Tails ($T$) as your observed data, and you want to see whether your posterior probabilities change as you obtain more data, that is, more coin flips. A nice way to visualize this is to plot the posterior probabilities as we observe more data. Your data is the pair ($n,k$), where $n$ is the number of flips and $k$ is the number of successes.

We will be using Bayes rule. $\textbf{D}$ is our data.

Posterior $\propto$ Prior $\times$ Likelihood
\begin{equation} P(\theta|\textbf{D}) \propto P(\theta) \times P(\textbf{D} |\theta) \end{equation}

a) Let's say we believe that our experiment is governed by a $\text{Binomial}$ distribution (this will be our Likelihood).

\begin{equation} P(\theta|\textbf{D}) \propto \text{Beta(a,b)} \times \text{Binomial}(\textbf{k},\textbf{n}|\theta) \end{equation}

b) We start with a non-informative prior, a $\text{Beta}$ distribution with $a=b=1$:

\begin{equation} P(\theta) = \text{Beta}(1, 1) \end{equation}

c) We enter the data as pairs ($\textbf{k},\textbf{n}$) (we observe $\textbf{k}$ heads in $\textbf{n}$ tosses), and we update our Beta with new $a, b$ as follows $^*$:

\begin{equation} P(\theta \,|\, k, n) = \text{Beta}(a + k,\; b + (n - k)) \end{equation}

d) We are done! This is why using a conjugate prior lets you find your posterior parameters analytically.

$^*$ (the proof for this formula is beyond our scope, if interested, see this Wikipedia article)
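As a quick worked example of the update rule: with a flat $\text{Beta}(1,1)$ prior and $k=8$ heads in $n=10$ tosses, the posterior is

\begin{equation} P(\theta \,|\, k=8, n=10) = \text{Beta}(1+8,\; 1+2) = \text{Beta}(9, 3), \end{equation}

whose mean is $9/(9+3) = 0.75$, already pulled close to the empirical frequency $8/10$.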

In [47]:
# change the data
trials = np.array([0, 1, 3, 5, 10, 15, 20, 500, 200, 400])
heads = np.array( [0, 1, 2, 4, 8,  10, 10, 250, 180, 10])

# non-informative prior Beta(a=1, b=1): shows our ignorance about the coin
# informative prior Beta(a=10, b=10): shows our belief that this is a fair coin
# informative prior Beta(a=20, b=2): shows our belief that this is NOT a fair coin
alphas = [1., 10., 20.]
betas = [1., 10., 2.]
colors = ['#348ABD', '#000BCC', '#999BCC']
opacity = [0.4, 0.2, 0.1]

plt.figure(figsize=(10,12))
x = np.linspace(0, 1, 100)
for k, N in enumerate(trials):
    sublabel=f'n trials = {trials[k]}, k = {heads[k]}'
    sx = plt.subplot(len(trials)//2, 2, k+1)  # integer division: subplot indices must be ints
    for i in range(len(alphas)):
        posterior = stats.beta.pdf(x, alphas[i] + heads[k], betas[i] + trials[k] - heads[k])
        plt.plot(x, posterior, alpha = 0.5, c=colors[i], label=f'a={alphas[i]},b={betas[i]}');
        plt.fill_between(x, 0, posterior, color=colors[i], alpha=opacity[i])
        plt.legend(loc='upper left', fontsize=10)
        plt.title(sublabel)
    plt.legend()
    plt.autoscale(tight=True)

plt.suptitle("Posterior probabilities for coin flips", fontsize=15);
plt.tight_layout()
plt.subplots_adjust(top=0.88)

Appendix B: MAP

Fitting the Model with MAP (FYI, we will not directly use this method)

In Bayesian analysis we have our prior(s), we define our likelihood, and, having specified our model, we try to calculate posterior estimates for the unknown variables in the model. We could try to calculate the posterior estimates analytically, but for most models this is not feasible. What we do instead is compute summaries based on samples drawn from the posterior distribution using Markov chain Monte Carlo (MCMC) sampling methods.

\begin{equation} P(\theta|\textbf{D}) \rightarrow \{\theta_1,....\theta_n\} \end{equation}

Then we can find any estimate we want by using these samples, for example:

\begin{equation} \mathbb{E}[f(\theta)] = \int f(\theta)\, p(\theta \,|\, \textbf{D})\, d\theta \approx \frac{1}{n}\sum_{i=1}^{n} f(\theta_i) \end{equation}

Alternatively, we can calculate just the maximum a posteriori (MAP) point using optimization methods:

\begin{equation} \hat{\theta}_{MAP} = \underset{\theta}{\operatorname{argmax}}\; p(\theta \,|\, \textbf{D}) \end{equation}

The maximum a posteriori (MAP) estimate for a model is the mode of the posterior distribution and is generally found using numerical optimization methods. This is often fast and easy to do, but it only gives a point estimate for the parameters and can be biased if the mode isn't representative of the distribution. PyMC3 provides this functionality with the find_MAP function.
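As a sketch (reusing my_linear_model and trace from earlier in this notebook), the MAP point can be computed and compared against the posterior means obtained by sampling:

map_estimate = pm.find_MAP(model=my_linear_model)    # dict of point estimates
print(map_estimate['beta0'], trace['beta0'].mean())  # MAP vs. posterior mean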

The MAP estimate is not always reasonable, especially if the mode is at an extreme, the distribution is multimodal, or the posterior is high-dimensional. This often occurs in hierarchical models with the variance parameter for the random effect: if the individual group means are all the same, the posterior will have near-infinite density when the variance parameter for the group means approaches zero. Moreover, most techniques for finding the MAP estimate only find a local optimum (which is often good enough), and can therefore fail badly for multimodal posteriors, as mentioned above.

To solve these issues we turn to sampling as our method for finding the posterior.

You do not have to worry about MAP in our problems; PyMC3 can use it under the hood to initialize variables, and we do not have to set this explicitly.

References:

Information about PyMC3 functions including descriptions of distributions, sampling methods, and other functions, is available via the help command.