CS109B Data Science 2: Advanced Topics in Data Science
Lab 9 - Bayes, Part 2 - LDA¶
Harvard University
Spring 2019
Instructors: Mark Glickman and Pavlos Protopapas
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import gridspec
import re
import scipy.stats
import pyjags
## RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2019-CS109B/master/content/styles/cs109.css").text
HTML(styles)
Schools Data and Bayesian Modeling¶
Once upon a time, eight different schools each implemented a particular SAT prep program. At the end of the year, students from each school's program took the SATs and we've recorded the students' average performance relative to a control group that got no treatment. We've also recorded the standard error for each school's estimated increase, a measure of how accurate the estimated increase is. (Standard Error factors in the number of students who took the program and how variable their scores were).
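For intuition about where those standard errors come from: for an average over $n$ students whose individual score changes have standard deviation $s$, the standard error behaves roughly like
$$SE \approx \frac{s}{\sqrt{n}}$$
so schools with more students (or less variable students) get smaller standard errors. This is background intuition only; the lab takes each school's standard error as given.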
school_data = pd.read_csv("data/gelman_schools.csv")
school_data
The measurements look something like this:
x_vals = np.linspace(-50,80,1000)
for cur_school_id in range(8):
    name = school_data["School"][cur_school_id]
    mean = school_data["Estimated Score Increase"][cur_school_id]
    sd = school_data["Standard error of the estimate"][cur_school_id]
    norm_obj = scipy.stats.norm(mean,sd)
    plt.plot(x_vals, norm_obj.pdf(x_vals), label=name)
plt.xlabel("School's Performance")
plt.ylabel("Probability Density")
plt.legend()
plt.show()
- How effective is the program?
- Does it increase scores? By how much?
A Hierarchical Model for the Schools¶
Gelman (or maybe Rubin?) proposed the following model to explain the data and understand the effect of the program on SAT scores. As in class, the point of the model is that it neither takes each school's result at face value, nor ignores variation from school to school.
$$\sigma_j\ given$$ $$\mu \sim Uniform(-20,20)$$ $$\tau \sim Uniform(0,10)$$ $$\theta_j \sim Normal(Mean=\mu,\ SD=\tau)$$ $$y_j \sim Normal(Mean=\theta_j,\ SD=\sigma_j)$$
$y_j$ are the observed SAT increase at each school
$\sigma_j$ are the observed standard errors at each school
$\theta_j$ is the 'true' effect of each school's program
$\mu$ and $\tau$ govern the distribution of possible true effects.
Bayesian models (and others) are often presented like this. Actually, they're usually presented even less clearly (e.g. missing the definition of each symbol). Let's learn how to make sense of this jumble.
First pass: Understanding the parameters
This pass is typically easiest from bottom to top, starting with the observed parameters that you know how to interpret. It often also requires reading about the context of the data or what the modeler is trying to encode with each parameter.
$$y_j \sim Normal(Mean=\theta_j,\ SD=\sigma_j)$$
First, the observed data (one value per school) are 1) normally distributed 2) each centered at a different value, one per school. The 8 $\theta_j$ parameters are the 'true average effect' of the program in school j, separate from the (noisy) effect we actually observe.
$$\theta_j \sim Normal(Mean=\mu,\ SD=\tau)$$
Where do the 'true average effects' in each school come from? The line above says that they're all draws from a normal distribution with a particular mean and standard deviation. Okay, so they all come from the same family; that's fine for now. Moving on,
$$\sigma_j\ given$$ $$\mu \sim Uniform(-20,20)$$ $$\tau \sim Uniform(0,10)$$
Finally, the parameters defining what 'true average effects' we might see come from specific uniform distributions. In particular, the author encodes that the center of the 'true average effect' distribution is strictly between -20 and 20, and the spread of the 'true average effect' distribution is somewhere between 0 and 10.
Second Pass: Summarizing: The author's story is that when someone implements this curriculum in a given school, the actual (long-term?) effectiveness of that program is secretly picked from a normal distribution. Then the observed effect is a noisy version of that actual effectiveness, with the noise set by that school's standard error. (The school's standard error is determined by the number of students who were in the program and the variability in how well those students did, but this model takes it as given for each school.)
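To make the story concrete, here is a minimal forward simulation of that narrative (a sketch that draws from the priors above; it is not a fit to the data, and the seed is arbitrary):
np.random.seed(109)  # arbitrary seed for this illustration
J = school_data.shape[0]
sigma = school_data["Standard error of the estimate"].values   # given

mu_sim = np.random.uniform(-20, 20)                    # center of the 'true effect' distribution
tau_sim = np.random.uniform(0, 10)                     # spread of the 'true effect' distribution
theta_sim = np.random.normal(mu_sim, tau_sim, size=J)  # each school's secret true effect
y_sim = np.random.normal(theta_sim, sigma)             # the noisy effects we'd actually observe

print("mu, tau:", np.round(mu_sim,1), np.round(tau_sim,1))
print("simulated observed effects:", np.round(y_sim,1))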
Third pass: Critiquing the story
- Does the author's overall story make sense?
- Do you believe in a 'true' effectiveness in each school, distinct from the observed effectiveness?
- Do you believe that schools' 'true' effectiveness all come from some distribution? Or do you think they're all the same? Or all unrelated?
- What does it mean, in context, for the author to say $\mu$ is definitely between -20 and 20? Does that seem reasonable to you?
- What does it mean, in context, for the author to say $\tau$ is definitely between 0 and 10? Does that seem reasonable to you?
As a class, come up with a prior for $\mu$ and $\tau$
Coding the Model¶
To work in JAGS, we need to write out the model in a specific code-like format. That format is designed to be a mix of the equation description of the model above and R syntax.
The tricky parts are:
- Writing a loop for any variables with indices, and double loops if the variable has two indices
- Looking up the abbreviation for each distribution (usually the letter d plus a short-ish version of the distribution name)
- Looking up the parameters the distributions want (does it ask for mean, sd, or precision?)
- Looking up how to do functions like $x^2$ in R
Compare:
There are J schools
$$\sigma_j\ given$$ $$\mu \sim Uniform(-20,20)$$ $$\tau \sim Uniform(0,10)$$ $$\theta_j \sim Normal(Mean=\mu,\ SD=\tau)$$ $$y_j \sim Normal(Mean=\theta_j,\ SD=\sigma_j)$$
To:
schools_model_code = '''
model {
    mu ~ dunif(-20,20)
    tau ~ dunif(0,10)
    for (j in 1:J){
        theta[j] ~ dnorm(mu, 1/pow(tau,2))
    }
    for (j in 1:J){
        y[j] ~ dnorm(theta[j], 1/pow(sigma[j],2))
    }
}
'''
Running the model¶
To run the model, you need to pass in a dictionary of the observed data. Pyjags is pretty good about giving useful error messages, but definitely turn on line numbers in Jupyter!
We run 500 burn-in samples (MCMC needs some unknown number of steps before it produces samples from the target distribution). We then collect 2500 actual samples from each of 4 chains.
observed_vals = {'y': school_data["Estimated Score Increase"],
                 'sigma': school_data["Standard error of the estimate"],
                 'J': school_data.shape[0]}
num_chains = 4
school_model = pyjags.Model(schools_model_code, data=observed_vals, chains=num_chains)
burnin = school_model.sample(500) #warmup/burn-in
samples = school_model.sample(2500) #cf 7500
Checking Convergence¶
MCMC is only guaranteed to work if you run it for infinite time. It can give good samples after finite, or even short time, but it's worth checking whether it looks like it did return good samples.
The first thing to check is whether the sampler got stuck in one place for a bit by looking for flat/thin regions in the trace. Luckily, we have a lot to show you
def plot_trace(samples, varname, entry=0):
    plt.plot()
    sample_array = samples[varname]
    vec_len, num_samples, num_chains = sample_array.shape
    for cur_chain in range(num_chains):
        cur_label = "Chain {}".format(cur_chain)
        plt.plot(range(num_samples), sample_array[entry,:,cur_chain], label=cur_label)
    plt.legend()
    plt.show()
We check the $\mu$ and $\tau$ parameters
plt.xlabel("Iteration")
plt.ylabel("Value of Mu")
plot_trace(samples,'mu')
plt.xlabel("Iteration")
plt.ylabel("Value of Tau")
plot_trace(samples,'tau')
and the 8 different $\theta_j$
for cur_school in range(8):
print("Theta for School {}".format(cur_school))
plt.xlabel("Iteration")
plt.ylabel("Value of Theta_{}".format(cur_school))
plot_trace(samples, 'theta', entry=cur_school)
print("------")
Overall, we see pretty rough traces: lots of places where a chain shows limited variation.
Fixing these defects is tough. Simply running more samples gives you better odds that each chain spends a comparable amount of time in each 'trap', though that's more 'hope' than 'strategy'. Changing the priors, or even how the model is written, can ease the issues. More advanced samplers (e.g. the Hamiltonian Monte Carlo implemented in pymc3 or Stan) can help, too.
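For example, one common way to rewrite this kind of hierarchical model is the 'non-centered' parameterization, where each $\theta_j$ is built as $\mu$ plus a scaled standard-normal draw. A sketch of what that might look like in JAGS syntax is below (an illustration of 'changing how the model is written', not the model this lab actually runs):
schools_model_noncentered_code = '''
model {
    mu ~ dunif(-20,20)
    tau ~ dunif(0,10)
    for (j in 1:J){
        eta[j] ~ dnorm(0, 1)               # standard normal (precision 1)
        theta[j] <- mu + tau * eta[j]      # theta defined deterministically
        y[j] ~ dnorm(theta[j], 1/pow(sigma[j],2))
    }
}
'''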
There are other measures of whether the traces look reasonable: effective sample size, R-hat, and the Geweke diagnostic.
In real life, you should carefully vet your traces and adjust the model/sampler until they look good. See AM207 for much more on this topic. Here, we're just going to press on as if the traces and samples are legitimate.
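For instance, a rough (unsplit) version of the Gelman-Rubin R-hat can be computed directly from the chains we already have; values near 1 are reassuring. This is a simplified hand-rolled sketch, not a substitute for a proper diagnostics library:
def simple_rhat(samples, varname, entry=0):
    """Rough Gelman-Rubin statistic: compares between-chain and within-chain variance."""
    chains = samples[varname][entry,:,:]     # shape: (num_samples, num_chains)
    n, m = chains.shape
    chain_means = chains.mean(axis=0)
    W = chains.var(axis=0, ddof=1).mean()    # average within-chain variance
    B = n * chain_means.var(ddof=1)          # between-chain variance
    var_hat = (n-1)/n * W + B/n
    return np.sqrt(var_hat / W)

print("R-hat for mu: ", simple_rhat(samples, 'mu'))
print("R-hat for tau:", simple_rhat(samples, 'tau'))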
Exploring The Posterior¶
The samples produced are basically a big data frame where each row is a sample and each column is one of the parameters of the model. This is everything we know about the posterior. Conceptually, from here forward all we do is describe this data frame: means or histograms of the columns, correlations, etc.
(The samples aren't actually stored as a data frame, but conversion code is provided below)
samples
display(samples['theta'].shape)
display(samples['mu'].shape)
The raw samples from pyjags are a dictionary of parameter names -> 3d arrays
- Why are the samples object's arrays shaped like this?
We can equivalently organize the samples as a data frame (one per chain). The code below will handle this for you
def convert_to_dfs(samples, parameter_names, num_chains):
"""Converts a pyjags sampling result to a list of data frames, one per chain"""
big_list = []
for cur_chain_num in range(num_chains):
df_list = []
for k in parameter_names:
v = samples[k]
chain1_data = v[:,:,cur_chain_num]
cur_df = pd.DataFrame(chain1_data.T)
if cur_df.shape[1]==1:
cur_df = cur_df.rename({0:k}, axis=1)
else:
cur_df = cur_df.add_prefix(k)
df_list.append(cur_df)
chain1_samples_df = pd.concat(df_list, axis=1)
big_list.append(chain1_samples_df)
return big_list
chain_df_list = convert_to_dfs(samples,["J","mu","tau","theta","sigma"],num_chains)
chain_df_list[0].head(15)
Learn About the Parameters¶
Once we have the posterior dataset, we can analyze it like we'd analyze any other dataset (with the warning that the rows are correlated, not IID)
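For example, a posterior mean and a central 95% interval for $\mu$ are just summaries of one column (here using chain 0; a quick sketch):
mu_samples = chain_df_list[0]['mu']
print("Posterior mean of mu:", np.round(mu_samples.mean(), 2))
print("Central 95% interval:", np.round(np.percentile(mu_samples, [2.5, 97.5]), 2))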
for cur_chain in range(num_chains):
    plt.hist(chain_df_list[cur_chain]['mu'], bins=100, histtype='step', density=True, label="Chain {}".format(cur_chain));
plt.xlabel("Value of mu")
plt.ylabel("Approx. Posterior Probability")
plt.legend()
plt.show()
Above, we see that (assuming this model+priors and factoring in all data) $\mu$ seems to range from about -2.5 to about 20, and its most likely value is between 5 and 10.
Example:
- What's the probability that $\mu$ is above 15?
- What does this mean in context?
Answer:
count_above_15 = np.sum(chain_df_list[0]['mu'] > 15)
total = len(chain_df_list[0]['mu'])
count_above_15/total
(Using just chain 0), there's roughly a 2-8% chance (it changes from run to run) that $\mu$ is above 15. In context, this means there's very little chance the program, on average, increases student scores by 15 or more points. Practically, we might decide that the program isn't worth it; maybe we feel that only increases of 15 or more SAT points are worth paying for.
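A quick sketch of how much that estimate moves from chain to chain (one estimate per chain, rather than a pooled answer):
for cur_chain in range(num_chains):
    mu_chain = chain_df_list[cur_chain]['mu']
    print("Chain {}: P(mu > 15) is about {:.3f}".format(cur_chain, np.mean(mu_chain > 15)))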
- Is it more appropriate to use one chain, or to combine all four?
Repeating that plot for $\tau$
for cur_chain in range(num_chains):
    plt.hist(chain_df_list[cur_chain]['tau'], bins=100, histtype='step', density=True, label="Chain {}".format(cur_chain));
plt.xlabel("Value of tau")
plt.ylabel("Approx. Posterior Probability")
plt.legend()
plt.show()
- What can you conclude from the plot above?
Answer:
Exploring effect in each school¶
Investigating the $\theta_j$ is a touch more complex because there is one per school. Overall, it looks like each school's theta is about the same: the model is suggesting we pool all the schools' data together. We can also see some issues from the sampler manifesting; different chains get stuck and create bumps in different places.
fig, ax = plt.subplots(1, num_chains, figsize=(20,5), sharex=True, sharey=True)
for cur_chain in range(num_chains):
    ax[cur_chain].set_xlabel("Thetas from Chain {}".format(cur_chain))
    ax[cur_chain].set_ylabel("Approx. Posterior Probability")
    for i in range(8):
        cur_name = 'theta'+str(i)
        ax[cur_chain].hist(chain_df_list[cur_chain][cur_name],
                           bins=100,
                           histtype='step',
                           label="School {}".format(school_data["School"][i]),
                           density=True
                           )
plt.legend()
plt.show()
- Interrogate the results
- Summarize the effect of school A's program: how many SAT points do students seem to gain on average?
- What's the probability that the program, on average, actually lowers scores?
- What's the probability that a particular school would end up with a program that lowers student scores?
- Can you make the MCMC converge better?
- Better priors?
- Longer sampling run?
Answers:
- Overall, what does a Bayesian analysis buy you?
- What is the price of entry?
Harry Potter and the Latent Dirichlet Allocation¶
Before we do anything else, let's talk about the Dirichlet distribution.
The Dirichlet distribution takes in N parameters and spits out a probability vector of length N. Density plots show which probability vectors are likely for different parameter settings, but it's easiest just to see for yourself:
length = 3
for rep in range(5):
    vec = np.random.dirichlet([.1]*length)
    print(np.round(vec,3))
(above) Values less than 1 make a few entries of the output vector large, and most entries small
for rep in range(5):
    vec = np.random.dirichlet([1]*length)
    print(np.round(vec,3))
(above) Values of 1 make all possible probability vectors equally likely, in some sense
- What happens when the inputs to Dirichlet are all large (above 1)?
- What happens if you make one entry substantially bigger than the others?
Answer:
The LDA Model (as code)¶
Prof Glickman did a good job in lecture covering LDA from lots of different viewpoints. The one thing we haven't seen is actual code to produce a document via the LDA framework. I honestly think this is the clearest way to tell LDA's story.
np.random.seed(7)
# scalar givens
num_docs = 5
num_topics = 3
vocab = ["yes", "no", "harry", "hermione", "ron", "dumbledore"]
vocab_len = len(vocab)
# vector givens: alphas and betas are given (or found when we solve the model)
alphas = .8*np.ones(num_topics)
betas = .3*np.ones(vocab_len)
# each document has a probability of talking about each topic
thetas = np.zeros((num_docs, num_topics))
for cur_doc in range(num_docs):
    thetas[cur_doc,:] = np.random.dirichlet(alphas)
# each topic has a probability of talking about each word
phis = np.zeros((num_topics, vocab_len))
for cur_topic in range(num_topics):
    phis[cur_topic,:] = np.random.dirichlet(betas)
##
# write document 1 for 20 words, as an example
##
cur_doc = 1
doc_words = []
# get the document's probability of talking about each topic
topic_probs = thetas[cur_doc,:]
# for each word in the document's length:
for cur_word in range(20):
    # Using the document's topic probabilities, randomly decide the topic this word belongs to
    cur_topic_vec = np.random.multinomial(1, topic_probs)
    cur_topic_int = np.argmax(cur_topic_vec)
    # Using the topic's word probabilities, randomly decide which word will appear
    word_probs = phis[cur_topic_int,:]    # use this word's topic, not a leftover loop variable
    cur_word_vec = np.random.multinomial(1, word_probs)
    cur_word_int = np.argmax(cur_word_vec)
    # store the word
    doc_words.append(vocab[cur_word_int])
print("Document 1's Topics")
print(thetas[cur_doc,:])
print("Document 1's Words")
print(doc_words)
- Interpret the $\theta$ and $\phi$ matrices, below
- Why do a few particular words show up over and over in document 1?
display(np.round(thetas,2))
display(np.round(phis,2))
Answer:
Answer:
The LDA model (as equations)¶
There are T topics, D documents, V words in the corpus
$$\alpha_t,\ given \text{ for t=1,2,...T}$$ $$\beta_w,\ given \text{ for w=1,2,...V}$$ $$ \theta_{d,:} \sim Dirichlet\left(\alpha_1,\alpha_2, ...,\alpha_T\right) \text{ for d=1,2,...D}$$ $$ \phi_{t,:} \sim Dirichlet\left(\beta_1,\beta_2, ...,\beta_V\right) \text{ for t=1,2,...T}$$
$$z_{d,i} \sim Multinomial\left(\theta_{d,1},\theta_{d,2},...\theta_{d,T}\right) \text{ for each d,i}$$ $$w_{d,i} \sim Multinomial\left(\phi_{z_{d,i},1},\phi_{z_{d,i},2},...\phi_{z_{d,i},V}\right) \text{ for each d,i}$$
$w_{d,i}$ is the $i$th word in the $d$th document
$z_{d,i}$ is the topic of the $i$th word in the $d$th document
$\phi_{t,w}$ is the probability of word $w$ appearing when the topic is $t$
$\theta_{d,t}$ is the probability of document $d$ talking about topic $t$
The $\beta_w$ determine how likely overall each word is, and how many words typically occur in the same topic
The $\alpha_t$ determine how likely overall each topic is, and how many topics typically occur in the same document
Fitting LDA¶
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from gensim.corpora import Dictionary
from gensim.models.ldamodel import LdaModel
from gensim.models import CoherenceModel
Below is the code used to clean the Harry Potter books. Because we're not about to host the full text on a public GitHub, these cells will produce errors. If you see someone ask about it on Piazza, please cheekily refer them to this sentence.
First, we define what counts as a word for us. The regular expression below translates to "any run of letters, digits, underscores, and apostrophes". Note that we'll count "Ron" and "Ron's" as different words, and "fast-acting" as two words.
tokenizer = RegexpTokenizer(r'[\w\']+') # \w means any letter, digit, or underscore.
#Overall, "words" in the corpus are any run of letters, digits, underscores, and apostrophes. No hyphens or other fanciness.
Often, very common words are dropped from the text before extracting topics from it. NLTK provides a list of common words in different languages, which we augment with Potter-specific words.
stop_words = set(stopwords.words('english'))
stop_words.update(['harry','hermione','ron']) #add stop words for three main characters
stop_words.update(['said','got','get','would','could']) #empirically find these words very common
You can see the cleaning process below. For each book, we split it into chapters by hunting for the phrase "Chapter" or "Epilogue", lowercase the chapter, split it into a list of individual words, and purge the common words.
hp_text_array = [0]*7
for booknum in range(7):
    print(booknum)
    with open("data/harrypotter_book"+str(booknum+1)+".txt", "r", encoding="UTF-8") as book:
        book_text = book.read()
    chapter_text_list = re.split(r'Chapter\s?\d*|Epilogue', book_text)
    hp_text_array[booknum] = chapter_text_list
    for cur_chapter_id,cur_chapter_text in enumerate(chapter_text_list):
        #make everything lower case
        cur_chapter_text = cur_chapter_text.lower()
        #tokenize
        chapter_word_list = tokenizer.tokenize(cur_chapter_text)
        #remove stop words
        purged_word_list = [cur_word for cur_word in chapter_word_list if cur_word not in stop_words]
        #store: [book][chapter]->list of words
        hp_text_array[booknum][cur_chapter_id] = purged_word_list
import pickle
with open("data/HP_words.pkl", "wb") as outfile:
    pickle.dump(hp_text_array, outfile)
This is the cell that will load the processed data so you can continue with the lab. An example of the structure we're working with is below.
with open("data/HP_words.pkl", "rb") as infile:
    hp_text_array = pickle.load(infile)
hp_text_array[0][11] #all (non-trivial) words in book 1, chapter 11 (yes 11- there's a preamble)
Exploring¶
Let's see if books 1 and 7 differ in their most common words.
Python's Counter object is really good at, well, counting.
from collections import Counter
counts = Counter(hp_text_array[0][11]) #count which words are in Book 1, Chapter 11
counts.most_common(15)
You can add counter objects together. Let's check out the top words in Book 1 versus Book 7. Does Book 7 seem darker?
book1_counter = Counter()
for cur_chapter_words in hp_text_array[0]:
    chapter_word_count = Counter(cur_chapter_words)
    book1_counter += chapter_word_count
book1_counter.most_common(10)
book7_counter = Counter()
for cur_chapter_words in hp_text_array[6]:
    chapter_word_count = Counter(cur_chapter_words)
    book7_counter += chapter_word_count
book7_counter.most_common(10)
Hmmm, there's a lot of overlap ("back","know","like","looked"), but book 1 has a lot more Hagrid, and book 7 has a lot more Voldemort.
- Find the 20 most common words across all the books
Answer:
Fitting LDA¶
LDA wants to operate on a list of all documents. (Here, we're treating each chapter as its own document). We need to restructure our data
list_of_docs = []
for book_id in range(7):
    for chapter in hp_text_array[book_id]:
        list_of_docs.append(chapter)
We build a gensim Dictionary on all the documents; this tracks and numbers all words used in any of the documents.
masterdictionary = Dictionary(list_of_docs)
We use the doc2bow method to convert each document to a numerical format
mastercorpus = [masterdictionary.doc2bow(doc) for doc in list_of_docs]
mastercorpus[11][:20] #20 words and their counts from book 1, chapter 11
Invoking the dictionary, we can translate to see that book 1, chapter 11 used the word 'hogwarts' once and the word 'one' eleven times.
masterdictionary[59],masterdictionary[82]
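The same trick translates a whole bag-of-words entry back into readable (word, count) pairs; a quick sketch:
# translate the first few (word id, count) pairs of book 1, chapter 11 back to words
[(masterdictionary[word_id], count) for word_id, count in mastercorpus[11][:10]]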
Now, we're ready to actually fit a model
seven_book_model = LdaModel(mastercorpus, num_topics=7, id2word = masterdictionary, passes=10)
We can investigate any particular topic
seven_book_model.show_topic(2, topn=20)
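So far we've looked at topics from the word side. We can also ask which topics a particular chapter mixes, using gensim's get_document_topics; a quick sketch, reusing the same mastercorpus indexing as above:
seven_book_model.get_document_topics(mastercorpus[11]) # topic mixture for book 1, chapter 11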
It's nicer to plot the heavy-hitting words in each topic, though
top_words = [[word for word,_ in seven_book_model.show_topic(topicno, topn=50)] for topicno in range(seven_book_model.num_topics)]
top_betas = [[beta for _,beta in seven_book_model.show_topic(topicno, topn=50)] for topicno in range(seven_book_model.num_topics)]
top_words[0][:5]
top_betas[0][:5]
gs = gridspec.GridSpec(3,3)
gs.update(wspace=0.5, hspace=0.5)
plt.figure(figsize=(11,8.5))
for i in range(7):
    #new subplot
    ax = plt.subplot(gs[i])
    plt.barh(range(5), top_betas[i][:5], align='center',color='blue', ecolor='black')
    ax.invert_yaxis()
    ax.set_yticks(range(5))
    ax.set_yticklabels(top_words[i][:5])
    plt.title("Topic "+str(i))
#finding the optimal number of topics via the coherence measure u_mass
coherence_vals = []
for ntop in range(1,12):
    mod = LdaModel(mastercorpus, num_topics = ntop, id2word = masterdictionary, passes=10)
    cmod = CoherenceModel(model=mod, corpus=mastercorpus, dictionary=masterdictionary, coherence='u_mass')
    cval = cmod.get_coherence()
    print(ntop,cval)
    coherence_vals.append(cval)
_ = plt.figure(figsize=(11,8.5))
_ = plt.plot(range(1,12),coherence_vals)
_ = plt.xlabel("Number of Topics")
_ = plt.ylabel("Coherence Score")
- Overall, is LDA more like KNN or K-means?