CS109B Data Science 2: Advanced Topics in Data Science
Lecture 13 - MCMC and Hierarchical Models in PyMC3¶
Harvard University
Spring 2021
Instructors: Pavlos Protopapas, Mark Glickman, and Chris Tanner
Additional Instructor: Eleni Angelaki Kaxiras
Content: Eleni Angelaki Kaxiras
## RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2019-CS109B/master/content/styles/cs109.css").text
HTML(styles)
import pymc3 as pm
from pymc3 import summary
import arviz as az
from matplotlib import gridspec
# Ignore a common pymc3 warning that comes from library functions, not our code.
# Pymc3 may throw additional warnings, but other warnings should be manageable
# by following the instructions included within the warning messages.
import warnings
messages = [
    "Using `from_pymc3` without the model will be deprecated in a future release",
]
# or silence all warnings (not recommended)
# warnings.filterwarnings('ignore')
for m in messages:
    warnings.filterwarnings("ignore", message=m)
print(f"Using PyMC3 version: {pm.__version__}")
print(f"Using ArviZ version: {az.__version__}")
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
%matplotlib inline
# from pymc3 import Model, Normal, HalfNormal, model_to_graphviz
# from pymc3 import NUTS, sample, find_MAP
from scipy import optimize
%%javascript
IPython.OutputArea.auto_scroll_threshold = 20000;
#pandas trick
pd.options.display.max_columns = 50 # None -> No Restrictions
pd.options.display.max_rows = 200 # None -> Be careful with this
pd.options.display.max_colwidth = 100
pd.options.display.precision = 3
Learning Objectives¶
By the end of this lab, you should be able to:
- run a PyMC3 Bayesian model using MCMC.
- implement hierarchical models in PyMC3.
1. Markov Chain Monte Carlo (MCMC) Simulations¶
PyMC3 is a Python library for programming Bayesian analysis, and more specifically, data creation, model definition, model fitting, and posterior analysis. It is built around the concept of a model, which assigns parametric statistical distributions to the unknown quantities in the problem. Within a model we define random variables and their distributions. A distribution requires at least a name argument, along with the other parameters that define it. You may also use the logp() method on the model to build the model log-likelihood function. We then define and fit the model.

PyMC3 includes a comprehensive set of pre-defined statistical distributions that can be used as model building blocks. Although they are not meant to be used outside of a model, you can invoke them by using the prefix pm, as in pm.Normal.
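Here is a minimal sketch (a toy model, not used elsewhere in this lab) showing a named random variable inside a model context, and the model's compiled log-probability function:

import pymc3 as pm

with pm.Model() as sketch_model:
    theta = pm.Normal('theta', mu=0, sd=1)  # 'theta' is the required name argument

# logp is the model's compiled log-probability function;
# test_point is the model's default starting point
print(sketch_model.logp(sketch_model.test_point))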
Monte Carlo methods are rooted in the 1940s, when Enrico Fermi, John von Neumann, Stan Ulam, Nicolas Metropolis, and others employed random numbers to study physics from a stochastic point of view. The name is attributed to Metropolis.
For a history of the Monte Carlo Method first used in Physics see: The beginning of the Monte Carlo Method
PyMC3 uses the No-U-Turn Sampler (NUTS) and the Random Walk Metropolis, two Markov chain Monte Carlo (MCMC) algorithms for sampling in posterior space. Monte Carlo gets its name because when we sample in posterior space, we choose our next move via a pseudo-random process. NUTS is a sophisticated algorithm that can handle a large number of unknown (albeit continuous) variables.
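For illustration, a hedged sketch of a toy model (not used later in this lab) where we explicitly request the Random Walk Metropolis step method instead of letting PyMC3 default to NUTS:

import pymc3 as pm

with pm.Model() as coin_model:
    p = pm.Beta('p', alpha=1, beta=1)
    obs = pm.Bernoulli('obs', p=p, observed=[0, 1, 1, 1, 0])
    # pass a step method explicitly; omit `step` to let PyMC3 assign NUTS
    step = pm.Metropolis()
    trace_mh = pm.sample(1000, step=step, tune=500)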
#help(pm.Poisson)
Bayesian Linear Regression¶
We will artificially create data from known parameter values, and then see whether our model recovers those values correctly.
np.random.seed(123)
######## True parameter values
##### our model does not see these
sigma = 1
beta0 = 1
beta = [1, 2.5]
###############################
# Size of dataset
size = 100
# Feature variables
x1 = np.linspace(0, 1., size)
x2 = np.linspace(0,2., size)
# Create outcome variable with random noise
Y = beta0 + beta[0]*x1 + beta[1]*x2 + np.random.randn(size)*sigma
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
fontsize=14
labelsize=8
title='Observed Data (created artificially by ' + r'$Y(x_1,x_2)$)'
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x1, x2, Y)
ax.set_xlabel(r'$x_1$', fontsize=fontsize)
ax.set_ylabel(r'$x_2$', fontsize=fontsize)
ax.set_zlabel(r'$Y$', fontsize=fontsize)
ax.tick_params(labelsize=labelsize)
fig.suptitle(title, fontsize=fontsize)
fig.tight_layout(pad=.1, w_pad=10.1, h_pad=2.)
#fig.subplots_adjust(); #top=0.5
plt.show()
Now let's see if our model will correctly recover the values of our unknown parameters, namely $\beta_0$, $\beta_1$, $\beta_2$, and $\sigma$.
Defining the Problem¶
Our problem is the following: we want to perform multiple linear regression to predict an outcome variable $Y$ which depends on variables $\bf{x}_1$ and $\bf{x}_2$.
We will model $Y$ as normally distributed observations with an expected value $\mu$ that is a linear function of the two predictor variables, $\bf{x}_1$ and $\bf{x}_2$.

\begin{equation} Y \sim \mathcal{N}(\mu,\,\sigma^{2}) \end{equation}
\begin{equation} \mu = \beta_0 + \beta_1 \bf{x}_1 + \beta_2 \bf{x}_2 \end{equation}

where $\sigma$ represents the measurement error (in the artificial data above, its true value is $\sigma = 1$).

We also choose normal priors for the parameters, with hyperparameters set by us (matching the code below):

\begin{eqnarray} \beta_i \sim \mathcal{N}(0,\,10^2) \\ \sigma \sim |\mathcal{N}(0,\,1)| \end{eqnarray}

Defining a Model in PyMC3¶
with pm.Model() as my_linear_model:
    # Priors for unknown model parameters: stochastic random variables
    # with Normal prior distributions for the regression coefficients,
    # and a half-normal distribution for the standard deviation of the
    # observations. These are our parameters: P(theta)
    beta0 = pm.Normal('beta0', mu=0, sd=10)
    # Note: betas is a vector of two variables, b1 and b2 (denoted by shape=2),
    # so, in array notation, beta1 = betas[0] and beta2 = betas[1]
    betas = pm.Normal('betas', mu=0, sd=10, shape=2)
    sigma = pm.HalfNormal('sigma', sd=1)

    # mu is what is called a deterministic random variable: its value is
    # completely determined by its parents' values (beta0 and betas in our case).
    # There is no uncertainty in the variable beyond that which is inherent
    # in the parents' values
    mu = beta0 + betas[0]*x1 + betas[1]*x2

    # Likelihood function = how probable is my observed data?
    # This is a special case of a stochastic variable that we call an observed
    # stochastic. It is identical to a standard stochastic, except for its
    # observed argument, which passes the data to the variable and indicates
    # that these values were observed and should not be changed by any fitting
    # algorithm applied to the model. The data can be passed in the form of
    # either a numpy.ndarray or pandas.DataFrame object.
    Y_obs = pm.Normal('Y_obs', mu=mu, sd=sigma, observed=Y)
Note: if our problem were a classification problem, we would use logistic regression instead. A hedged sketch of that variant follows.
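A minimal sketch of the logistic-regression analogue of the model above; y01 is a hypothetical binary label, fabricated here from Y purely for illustration:

# hypothetical binary labels derived from Y, for illustration only
y01 = (Y > Y.mean()).astype(int)

with pm.Model() as my_logistic_model:
    beta0 = pm.Normal('beta0', mu=0, sd=10)
    betas = pm.Normal('betas', mu=0, sd=10, shape=2)
    # squash the linear predictor through a sigmoid to get P(y = 1)
    p = pm.math.sigmoid(beta0 + betas[0]*x1 + betas[1]*x2)
    # Bernoulli likelihood replaces the Normal likelihood
    y_obs = pm.Bernoulli('y_obs', p=p, observed=y01)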
Python Note: pm.Model is designed as a simple API that abstracts away the details of the inference. For the use of with, see compound statements in Python.
## do not worry about this, it's just a nice graph to have
## you need to install python-graphviz first
# conda install -c conda-forge python-graphviz
pm.model_to_graphviz(my_linear_model)
Fitting the Model with Sampling - Doing Inference¶
See below for PyMC3's sampling method. As you can see it has quite a few parameters. Most of them are set to default values by the package. For some, it's useful to set your own values.
pymc3.sampling.sample(draws=500, step=None, n_init=200000, chains=None,
cores=None, tune=500, random_seed=None)
Parameters to set:
- draws: (int): Number of samples to keep when drawing, defaults to 500. Number starts after the tuning has ended.
- tune: (int): Number of iterations to use for tuning the model, also called the burn-in period, defaults to 500. Samples from the tuning period will be discarded.
- target_accept (float in $[0, 1]$). The step size is tuned such that we approximate this acceptance rate. Higher values like 0.9 or 0.95 often work better for problematic posteriors.
- (optional) chains (int) number of chains to run in parallel, defaults to the number of CPUs in the system, but at most 4.
pm.sample returns a pymc3.backends.base.MultiTrace object that contains the samples. We usually name it some variation of the word trace. All the information about the posterior is in trace, which also provides statistics about the sampler.
What does a good trace plot look like? Is this a good one?¶
## uncomment this to see more about pm.sample
#help(pm.sample)
#help(pm.backends.base.MultiTrace)
with my_linear_model:
    print('Starting MCMC process')
    # draw nsamples posterior samples and run the default number of chains = 4
    nsamples = 1000  # number of samples to keep
    burnin = 100     # burn-in period
    trace = pm.sample(nsamples, tune=burnin, target_accept=0.8)
    print('DONE')
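Before filtering the variable names below, a small sketch of pulling raw posterior samples out of the MultiTrace (assuming the sampling cell above has run):

# trace['beta0'] concatenates the kept draws from all chains
beta0_samples = trace['beta0']
print(beta0_samples.shape)                        # (nsamples * n_chains,)
print(beta0_samples.mean(), beta0_samples.std())  # posterior mean and sd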
var_names = trace.varnames
# list.remove() works in place and returns None, so don't reassign the result
var_names.remove('sigma_log__')
var_names
Model Plotting¶
PyMC3 provides a variety of visualizations via plots: https://docs.pymc.io/api/plots.html. arviz is another library that you can use.
title = f'Traceplots for our artificial dataset with parameters: {var_names}'
pm.traceplot(trace);
plt.suptitle(
title,
fontsize=20
)
plt.show()
az.plot_trace(trace, var_names=var_names);
plt.suptitle(
title,
fontsize=20
)
plt.show()
# generate results table from trace samples
# remember our true hidden values sigma = 1, beta0 = 1, beta = [1, 2.5]
# We want R_hat < 1.1
results = az.summary(
trace,
var_names=var_names
)
display(results)
#help(pm.Normal)
$\hat{R}$ is a metric for assessing how well a chain has converged to the equilibrium distribution, by comparing its behavior to that of other randomly initialized Markov chains. Multiple chains initialized from different initial conditions should give similar results. If all chains converge to the same equilibrium, $\hat{R}$ will be 1; if the chains have not converged to a common distribution, $\hat{R}$ will be larger (a common rule of thumb flags values above roughly 1.01). A small $\hat{R}$ is a necessary but not sufficient condition for convergence.
For details on the $\hat{R}$ see Gelman and Rubin (1992).
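As a quick check, $\hat{R}$ can also be computed per variable directly with ArviZ; a minimal sketch (the az.summary table above already includes an r_hat column):

# converting the MultiTrace together with the model avoids the
# from_pymc3 deprecation warning we silenced at the top
idata = az.from_pymc3(trace, model=my_linear_model)
print(az.rhat(idata))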
This linear regression example is from the original paper on PyMC3: Salvatier J, Wiecki TV, Fonnesbeck C. 2016. Probabilistic programming in Python using PyMC3. PeerJ Computer Science 2:e55 https://doi.org/10.7717/peerj-cs.55
2. Hierarchical Models¶
Gelman et al.'s famous radon dataset is a classic for hierarchical modeling. In this dataset the amount of the radioactive gas radon was measured in households in counties of several states. Radon gas is known to be the leading cause of lung cancer in non-smokers. It is believed to be more strongly present in households containing a basement, and to vary with the type of soil. Here we'll investigate these differences and try to predict radon levels in different counties based on the county itself and the presence of a basement. We'll look at Minnesota, a state that contains 85 counties with a varying number of measurements in each, ranging from 2 to 116.
df = pd.read_csv('data/radon.csv', index_col=[0])
df['log_radon'] = df['log_radon'].astype('float')
county_names = df.county.unique()
county_idx = df.county_code.values
n_counties = len(df.county.unique())
df.head()
df.shape
Each row in the dataframe represents the radon measurements for one house in a specific county including whether the house has a basement (floor = 0) or not (floor = 1). We are interested in whether having a basement increases the radon measured in the house.
To keep things simple, let's keep only the following three variables: county, log_radon, and floor.

Let's check how many different counties we have. Notice that they have different numbers of houses measured: some have many, some only one.
print(f'We have {n_counties} counties in {len(df.state.value_counts())} state.')
df['county'].value_counts().head(5)
# keep only these variables
data = df[['county', 'log_radon', 'floor']]
data.head()
data['county'].value_counts()[-5:]
Pooling: Same Linear Regression for all¶
We can just pool all the data and estimate one big regression to assess the influence of having a basement on radon levels across all counties. Our model would be:

\begin{equation} \textbf{radon}_{i} = \alpha + \beta\,\textbf{floor}_{i} \end{equation}

Where $i$ represents the measurement (house), and floor is 0 if the house has a basement and 1 if it does not. By ignoring the county feature, we do not differentiate between counties.
with pm.Model() as pooled_model:
    # common priors for all counties
    a = pm.Normal('a', mu=0, sigma=100)
    b = pm.Normal('b', mu=0, sigma=100)

    # radon estimate
    radon_est = a + b*data['floor'].values

    # likelihood of the radon observations
    radon_obs = pm.Normal('radon_obs', mu=radon_est,
                          observed=data['log_radon'])  # note: we pass the whole dataset
pm.model_to_graphviz(pooled_model)
with pooled_model:
    pooled_trace = pm.sample(2000, tune=1000, target_accept=0.9)
    print('DONE')
pm.traceplot(pooled_trace);
Remember, with the pooled model we have only one intercept, $\alpha$, and only one slope, $\beta$, for all the counties. Let's plot the regression lines.
# plot just a subset of the counties: the five with the most measurements
# and five from near the bottom of the count list
counties_by_counts = data['county'].value_counts()
counties = counties_by_counts.index[:5].tolist() + counties_by_counts.index[75:-5].tolist()
counties
# plot just a subset of the counties
#counties = ['HENNEPIN','AITKIN','WASHINGTON', 'MURRAY', 'YELLOW MEDICINE', 'MAHNOMEN']
plt.figure(figsize=(15,5))
rows = 2
gs = gridspec.GridSpec(rows, len(counties)//rows)
for i, county in enumerate(counties):
    county_data = data.loc[data['county']==county]
    x = np.linspace(-0.2, 1.2)
    radon_est = pooled_trace['a'].mean() + pooled_trace['b'].mean()*x
    subplt = plt.subplot(gs[i])
    subplt.set_ylim(0.,4.)
    subplt.scatter(county_data['floor'], county_data['log_radon'])
    subplt.plot(x, radon_est, c='r', label='pooled line')
    subplt.set_xlabel('floor', fontsize=10)
    subplt.set_ylabel('radon level', fontsize=10)
    subplt.set_title(str(county) + ' County')
    subplt.legend()
plt.tight_layout()
Unpooling: Separate Linear Regression for each county¶
We believe that different counties have different relationships between radon and basements. Our model would be:

\begin{equation} \textbf{radon}_{i,c} = \alpha_c + \beta_c\,\textbf{floor}_{i,c} \end{equation}

Where $i$ represents the measurement, $c$ the county, and floor is 0 if the house has a basement and 1 if it does not.

Notice we have separate coefficients $\alpha_c$ and $\beta_c$ for each county. They are estimated completely independently and could even come from different distributions.

As an example, we will do this for only one county, since fitting a separate model for each one is very time-consuming.
# choose a county
county = 'YELLOW MEDICINE'
county_data = data.loc[data['county']==county]
county_data
#help(pm.Normal)
with pm.Model() as unpooled_model:
    mu_a = pm.Normal('mu_a', mu=0., sigma=100)
    sigma_a = pm.HalfNormal('sigma_a', 5.)
    mu_b = pm.Normal('mu_b', mu=0., sigma=100)
    sigma_b = pm.HalfNormal('sigma_b', 5.)

    a = pm.Normal('a', mu=mu_a, sigma=sigma_a)
    b = pm.Normal('b', mu=mu_b, sigma=sigma_b)

    radon_est = a + b*county_data['floor'].values
    radon_obs = pm.Normal('radon_like', mu=radon_est,
                          observed=county_data['log_radon'])
pm.model_to_graphviz(unpooled_model)
with unpooled_model:
    unpooled_trace = pm.sample(5000, tune=1000, target_accept=0.99)
    print('DONE')
pm.traceplot(unpooled_trace);
Plot the regression line for our chosen county alone, next to the pooled line.
county_data = data.loc[data['county']==county]
print(len(county_data))
x = np.linspace(-0.2, 1.2)
radon_est_unpooled = unpooled_trace['a'].mean() + unpooled_trace['b'].mean()*x
xx = np.linspace(-0.2, 1.)
radon_est_pooled = pooled_trace['a'].mean() + pooled_trace['b'].mean()*xx
plt.scatter(county_data['floor'], county_data['log_radon'])
plt.xlim(-0.1,1.1)
plt.xlabel('floor', fontsize=10)
plt.ylabel('radon level', fontsize=10)
plt.title(f'{str(county)} county Radon levels')
plt.plot(x, radon_est_unpooled, c='g', label='unpooled line');
plt.plot(xx, radon_est_pooled, c='r', label='pooled line');
plt.legend();
Partial pooling: Hierarchical Regression (Varying-Coefficients Model)¶
Counties, of course, have similarities, so there is a middle ground between these two extremes. Specifically, we may assume that while $\alpha_c$ and $\beta_c$ are different for each county, as in the unpooled case, the coefficients are all drawn from common distributions:

\begin{equation} a_c \sim \mathcal{N}(\mu_a,\,\sigma_a^{2}) \end{equation}
\begin{equation} b_c \sim \mathcal{N}(\mu_b,\,\sigma_b^{2}) \end{equation}

where the common hyperparameters have the priors (matching the code below):

\begin{eqnarray} \mu_a \sim \mathcal{N}(0,\,100) \\ \sigma_a \sim |\mathcal{N}(0,\,5)| \\ \mu_b \sim \mathcal{N}(0,\,100) \\ \sigma_b \sim |\mathcal{N}(0,\,5)| \end{eqnarray}

The different counties effectively share information through the common priors. We thus observe what is known as shrinkage: the groups are modeled neither as fully independent of each other nor as a single group, but as related; we quantify this numerically after fitting, below.
TALKING POINTS¶
We saw that some counties had only one sample. If that single house is a really old one with old lead pipes, our prediction will be that all houses in this county have high radon; on the other extreme, if it is a newer house with no radon, we will again get misleading results. In one case you will overestimate the bad quality, and in the other underestimate it. Under a hierarchical model, the mis-estimation of one group will be offset by the information provided by the other groups. As always, gathering more data helps, if that is an option.
with pm.Model() as hierarchical_model:
    # Hyperpriors for group nodes
    mu_a = pm.Normal('mu_a', mu=0., sigma=100)
    sigma_a = pm.HalfNormal('sigma_a', 5.)
    mu_b = pm.Normal('mu_b', mu=0., sigma=100)
    sigma_b = pm.HalfNormal('sigma_b', 5.)

    # Above we just set mu and sd to fixed values, while here we
    # plug in a common group distribution for all a and b (which are
    # vectors of length n_counties).

    # Intercept for each county, distributed around group mean mu_a
    a = pm.Normal('a', mu=mu_a, sigma=sigma_a, shape=n_counties)
    # Slope for each county, distributed around group mean mu_b
    b = pm.Normal('b', mu=mu_b, sigma=sigma_b, shape=n_counties)

    # Model error
    #eps = pm.HalfCauchy('eps', 5.)

    radon_est = a[county_idx] + b[county_idx]*data['floor'].values

    # Data likelihood with sigma for random error
    # radon_like = pm.Normal('radon_like', mu=radon_est,
    #                        sigma=eps, observed=data['log_radon'])

    # Data likelihood without an explicit random-error term
    radon_like = pm.Normal('radon_like', mu=radon_est, #sigma=eps,
                           observed=data['log_radon'])
# create the graph (requires python-graphviz)
pm.model_to_graphviz(hierarchical_model)
Inference¶
%%time
with hierarchical_model:
    hierarchical_trace = pm.sample(10000, tune=5000, target_accept=.9)
pm.traceplot(hierarchical_trace, var_names=['mu_a', 'mu_b',
'sigma_a', 'sigma_b']);
results = pm.summary(hierarchical_trace)
results[:20] # printing only 20 of them
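Before plotting, a small numeric sketch of the shrinkage mentioned above: the per-county slopes from the hierarchical model should cluster around the group-level mean rather than scatter as freely as fully independent fits would.

# hierarchical_trace['b'] has shape (n_draws, n_counties),
# so averaging over axis 0 gives one posterior-mean slope per county
b_county = hierarchical_trace['b'].mean(axis=0)
print(f"pooled slope:           {pooled_trace['b'].mean():.3f}")
print(f"group-level slope mu_b: {hierarchical_trace['mu_b'].mean():.3f}")
print(f"county slopes range:    [{b_county.min():.3f}, {b_county.max():.3f}]")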
counties
# use counties from before or choose new ones
#counties = ['HENNEPIN','AITKIN','WASHINGTON', 'LAKE OF THE WOODS', 'YELLOW MEDICINE', 'ANOKA']
#counties = ['HENNEPIN', 'DAKOTA', 'ANOKA', 'WILKIN', 'MURRAY']
plt.figure(figsize=(15,5))
rows = 2
gs = gridspec.GridSpec(rows, len(counties)//rows)
for i, county in enumerate(counties):
    county_data = data.loc[data['county']==county]
    if county_data.shape[0] == 1:
        break
    subplt = plt.subplot(gs[i])

    # pooled line (single coefficients for all counties)
    xx = np.linspace(-0.2, 1.2)
    radon_est = pooled_trace['a'].mean() + pooled_trace['b'].mean()*xx

    # hierarchical line: look up this county's coefficients by its county code,
    # since hierarchical_trace['a'] has shape (n_draws, n_counties)
    c_code = df.loc[df['county']==county, 'county_code'].values[0]
    radon_est_hier = hierarchical_trace['a'][:, c_code].mean() + \
                     hierarchical_trace['b'][:, c_code].mean()*xx

    subplt.set_ylim(0.,4.)
    subplt.scatter(county_data['floor'], county_data['log_radon'])
    subplt.plot(xx, radon_est, c='r', label='pooled')
    # plot the hierarchical, varying-coefficient model
    subplt.plot(xx, radon_est_hier, c='g', label='hierarchical')
    subplt.set_xlabel('floor', fontsize=10)
    subplt.set_ylabel('radon level', fontsize=10)
    subplt.set_title(str(county) + ' County')
    subplt.legend()
plt.tight_layout()
Adapted from PyMC3 docs: https://docs.pymc.io/notebooks/GLM-hierarchical.html
Appendix A: Is this a fair coin?¶
Can we find out if this is a fair coin without using sampling?¶
Let's say you visit the casino. You want to test your theory that casinos are dubious places where coins have been manipulated to have a larger probability of tails. You will try to estimate how fair a coin is based on a certain number of flips. You have no prior opinion on the coin's fairness (i.e. what $\theta$ might be), and you begin flipping the coin. You get either Heads ($H$) or Tails ($T$) as your observed data and want to see if your posterior probabilities change as you obtain more data, that is, more coin flips. A nice way to visualize this is to plot the posterior probabilities as we observe more data. Your data is the pair ($n,k$), where $n$ is the number of flips and $k$ is the number of successes.
We will be using Bayes' rule; $\textbf{D}$ is our data.

a) Let's say we believe that our experiment is governed by a Binomial distribution (this will be our likelihood):
\begin{equation} P(\theta|\textbf{D}) \propto \text{Beta}(a,b) \times \text{Binomial}(\textbf{k},\textbf{n}|\theta) \end{equation}

b) We start with a non-informative prior, a Beta distribution with $a=b=1$:
\begin{equation} P(\theta) = \text{Beta}(1., 1.) \end{equation}

c) We enter the data as pairs of ($\textbf{k},\textbf{n}$) (we observe $\textbf{k}$ heads in $\textbf{n}$ tosses), and we update our Beta with new $a, b$ as follows$^*$:
\begin{equation} P(\theta|\textbf{k}) = \text{Beta}(\alpha + \textbf{k},\ \beta + (n - \textbf{k})) \end{equation}

d) We are done! That is why using a conjugate prior lets you find your posterior parameters analytically.

$^*$ (the proof of this formula is beyond our scope; if interested, see this Wikipedia article)
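As a quick numerical check of the update rule: starting from the non-informative $\text{Beta}(1,1)$ prior and observing $k=8$ heads in $n=10$ flips gives a $\text{Beta}(9, 3)$ posterior, whose mean is $9/12 = 0.75$:

import scipy.stats as stats

a, b = 1., 1.  # non-informative prior
n, k = 10, 8   # observed: k heads in n flips
posterior = stats.beta(a + k, b + (n - k))  # Beta(9, 3)
print(posterior.mean())                     # 0.75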
# change the data
trials = np.array([0, 1, 3, 5, 10, 15, 20, 500, 200, 400])
heads = np.array( [0, 1, 2, 4, 8, 10, 10, 250, 180, 10])
# non-informative prior Beta(a=1, b=1): shows our ignorance about the coin
# informative prior Beta(a=10, b=10): shows our belief that this is a fair coin
# informative prior Beta(a=20, b=2): shows our belief that this is NOT a fair coin
alphas = [1., 10., 20.]
betas = [1., 10., 2.]
colors = ['#348ABD', '#000BCC', '#999BCC']
opacity = [0.4, 0.2, 0.1]
plt.figure(figsize=(10,12))
x = np.linspace(0, 1, 100)
for k, N in enumerate(trials):
    sublabel = f'n trials = {trials[k]}, k = {heads[k]}'
    # integer division: plt.subplot expects integer grid dimensions
    sx = plt.subplot(len(trials)//2, 2, k+1)
    for i in range(len(alphas)):
        posterior = stats.beta.pdf(x, alphas[i] + heads[k], betas[i] + trials[k] - heads[k])
        plt.plot(x, posterior, alpha=0.5, c=colors[i], label=f'a={alphas[i]},b={betas[i]}')
        plt.fill_between(x, 0, posterior, color=colors[i], alpha=opacity[i])
    plt.legend(loc='upper left', fontsize=10)
    plt.title(sublabel)
    plt.autoscale(tight=True)
plt.suptitle("Posterior probabilities for coin flips", fontsize=15);
plt.tight_layout()
plt.subplots_adjust(top=0.88)
Appendix B: MAP¶
Fitting the Model with MAP (FYI, we will not directly use this method)¶
In Bayesian analysis we have our prior(s), we define our likelihood, and, having specified our model, we try to calculate posterior estimates for the unknown variables in the model. We could try to calculate the posterior estimates analytically, but for most models this is not feasible. What we do instead is compute summaries based on samples drawn from the posterior distribution using Markov Chain Monte Carlo (MCMC) sampling methods.
\begin{equation} P(\theta|\textbf{D}) \rightarrow \{\theta_1, \ldots, \theta_n\} \end{equation}

Then we can find any estimate we want by using these samples, for example:

\begin{equation} \mathbb{E}[f(\theta)] = \int p(\theta|\textbf{D})\,f(\theta)\,d\theta \approx \frac{1}{n}\sum_{i=1}^{n} f(\theta_i) \end{equation}

Alternatively, we can calculate the maximum a posteriori (MAP) point using optimization methods:

\begin{equation} \hat{\theta} = \underset{\theta}{\operatorname{argmax}}\, p(\theta|\textbf{D}) \end{equation}

The maximum a posteriori (MAP) estimate for a model is the mode of the posterior distribution and is generally found using numerical optimization methods. This is often fast and easy to do, but it only gives a point estimate for the parameters and can be biased if the mode isn't representative of the distribution. PyMC3 provides this functionality with the find_MAP function.
The MAP estimate is not always reasonable, especially if the mode is at an extreme, if the distribution is multimodal, or if the posterior is high-dimensional. This will often occur in hierarchical models with a variance parameter for the random effect: if the individual group means are all the same, the posterior will have near-infinite density when the variance parameter for the group means is almost zero. Most techniques for finding the MAP estimate only find a local optimum (which is often good enough), and can therefore fail badly for multimodal posteriors, as mentioned above.
To solve these issues we turn to sampling as our method for finding the posterior.
You do not have to worry about MAP in our problems. Our PyMC3 models use the MAP method to initialize the variables under the hood, and we do not have to set this explicitly.
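For completeness, a minimal sketch (FYI only) of computing the MAP point for the linear model we built earlier:

# MAP point estimate for our earlier linear model
# (in this lab we rely on sampling instead)
with my_linear_model:
    map_estimate = pm.find_MAP()
print(map_estimate)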
References:
- Salvatier J, Wiecki TV, Fonnesbeck C. 2016. Probabilistic programming in Python using PyMC3. PeerJ Computer Science 2:e55 (https://doi.org/10.7717/peerj-cs.55)
- Distributions in PyMC3
- More Details on Distributions
Information about PyMC3 functions, including descriptions of distributions, sampling methods, and other functions, is available via the help command.