CS109B Data Science 2: Advanced Topics in Data Science

Lab 9 - Bayes, Part 2 - LDA

Harvard University
Spring 2019
Instructors: Mark Glickman and Pavlos Protopapas


In [1]:
import pandas as pd
import numpy as np
%matplotlib inline 
import matplotlib.pyplot as plt
from matplotlib import gridspec
import re

import scipy.stats
import pyjags
In [1]:
## RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2019-CS109B/master/content/styles/cs109.css").text
HTML(styles)
Out[1]:

Schools Data and Bayesian Modeling

Once upon a time, eight different schools each implemented a particular SAT prep program. At the end of the year, students from each school's program took the SATs and we've recorded the students' average performance relative to a control group that got no treatment. We've also recorded the standard error for each school's estimated increase, a measure of how accurate the estimated increase is. (Standard Error factors in the number of students who took the program and how variable their scores were).

In [3]:
school_data = pd.read_csv("data/gelman_schools.csv")
school_data
Out[3]:
School Estimated Score Increase Standard error of the estimate
0 A 28 15
1 B 8 10
2 C -3 16
3 D 7 11
4 E -1 9
5 F 1 11
6 G 18 10
7 H 12 18

The measurements look something like this:

In [4]:
x_vals = np.linspace(-50,80,1000)
for cur_school_id in range(8):
    
    name = school_data["School"][cur_school_id]
    mean = school_data["Estimated Score Increase"][cur_school_id]
    sd = school_data["Standard error of the estimate"][cur_school_id]
    
    norm_obj = scipy.stats.norm(mean,sd)
    plt.plot(x_vals, norm_obj.pdf(x_vals), label=name)

plt.xlabel("School's Performance")
plt.ylabel("Probability Density")
plt.legend()
plt.show()
Discussion
  • How effective is the program?
    • Does it increase scores? By how much?

A Hierarchical Model for the Schools

Gelman (or maybe Rubin?) proposed the following model to explain the data and understand the effect of the program on SAT scores. As in class, the point of the model is that it neither takes each school's result at face value, nor ignores variation from school to school.

$$\sigma_j\ given$$ $$\mu \sim Uniform(-20,20)$$ $$\tau \sim Uniform(0,10)$$ $$\theta_j \sim Normal(Mean=\mu,\ SD=\tau)$$ $$y_j \sim Normal(Mean=\theta_j,\ SD=\sigma_j)$$

$y_j$ are the observed SAT increase at each school
$\sigma_j$ are the observed standard errors at each school
$\theta_j$ is the 'true' effect of each school's program
$\mu$ and $\tau$ govern the distribution of possible true effects.

Bayesian models (and others) are often presented like this. Actually, they're usually presented even less clearly (e.g. missing the definition of each symbol). Let's learn how to make sense of this jumble.

First pass: Understanding the parameters

This pass is typically easiest from bottom to top, starting with the observed parameters that you know how to interpret. It often also requires reading about the context of the data or what the modeler is trying to encode with each parameter.

$$y_j \sim Normal(Mean=\theta_j,\ SD=\sigma_j)$$

First, the observed data (one value per school) are 1) normally distributed 2) each centered at a different value, one per school. The 8 $\theta_j$ parameters are the 'true average effect' of the program in school j, separate from the (noisy) effect we actually observe.

$$\theta_j \sim Normal(Mean=\mu,\ SD=\tau)$$

Where do the 'true average effects' in each school come from? Line 2, above, says that they're all draws from a normal distribution with a particular mean and center. Okay, so they're all form the same family, that's fine for now. Moving on,

$$\sigma_j\ given$$ $$\mu \sim Uniform(-20,20)$$ $$\tau \sim Uniform(0,10)$$

Finally, the parameters defining what 'true average effects' we might see come from specific uniform distributions. In particular, the author encodes that the center of the 'true average effect' distribution is strictly between -20 and 20, and the spread of the 'true average effect' distribution is somewhere betweeen 0 and 10.

Seconnd Pass: Summarizing: The author's story is that when someone goes to implement this cirriculum in a given school, the actual (long term?) effectiveness of that program is secretly picked from a normal distribution. Then the actual observed effect is a noisy version of that actual effectiveness, with the noise set by that school's standard error. (The school's standard error is set by the number of students who were in the program and the variability in how well those students did, but this model takes that as given for each school).

Third pass: Critiquing the story

Discussion
  1. Does the author's overall story make sense?
    • Do you believe in a 'true' effectiveness in each school, distinct from the observed effectiveness?
    • Do you believe that schools' 'true' effectiveness all come from some distribution? Or do you think they're all the same? Or all unrelated?
  2. What does it mean, in context, for the author to say $\mu$ is definitely between -20 and 20? Does that seem reasonable to you?
  3. What does it mean, in context, for the author to say $\tau$ is definitely between 0 and 10? Does that seem reasonable to you?

  4. As a class, come up with a prior for $\mu$ and $\tau$

Coding the Model

To work in JAGS, we need to write out the model in a specific code-like format. That format is designed to be a mix of the equation description of the model above and R syntax.

The tricky parts are:

  • Writing a loop for any varibles with indices, and double loops if the variable has two indices
  • Looking up the abreviation for each ditribution (usualy the letter d and a short-ish version of the distribution name)
  • Looking up the parameters the distributions want (does it ask for mean, sd, or precision?)
  • Looking up how to do functions like $x^2$ in R

Compare:
There are J schools

$$\sigma_j\ given$$ $$\mu \sim Uniform(-20,20)$$ $$\tau \sim Uniform(0,10)$$ $$\theta_j \sim Normal(Mean=\mu,\ SD=\tau)$$ $$y_j \sim Normal(Mean=\theta_j,\ SD=\sigma_j)$$

To:

In [5]:
schools_model_code = '''
model {
    
    mu ~ dunif(-20,20)
    tau ~ dunif(0,10)
    
    for (j in 1:J){
        theta[j] ~ dnorm(mu, 1/pow(tau,2))
    }
    
    for (j in 1:J){
        y[j] ~ dnorm(theta[j], 1/pow(sigma[j],2))
    }
}
'''

Running the model

To run the model, you need to pass in a dictionary of the observed data. Pyjags is pretty good about giving useful error messages, but definitely turn on line numbers in Jupyter!

We run 500 samples of burn-in (MCMC needs some undefined amount of steps before it produces samples from the target distribution). We then collect 2500 actual samples from each of 4 chains.

In [6]:
observed_vals = {'y':school_data["Estimated Score Increase"],
                 'sigma':school_data["Standard error of the estimate"],
                 'J': school_data.shape[0]}

num_chains = 4
school_model = pyjags.Model(schools_model_code, data=observed_vals, chains=num_chains)
burnin = school_model.sample(500) #warmup/burn-in
samples = school_model.sample(2500) #cf 7500
adapting: iterations 4000 of 4000, elapsed 0:00:00, remaining 0:00:00
sampling: iterations 2000 of 2000, elapsed 0:00:00, remaining 0:00:00
sampling: iterations 10000 of 10000, elapsed 0:00:00, remaining 0:00:00

Checking Convergence

MCMC is only guaranteed to work if you run it for infinite time. It can give good samples after finite, or even short time, but it's worth checking whether it looks like it did return good samples.

The first thing to check is whether the sampler got stuck in one place for a bit by looking for flat/thin regions in the trace. Luckily, we have a lot to show you

In [7]:
def plot_trace(samples, varname, entry=0):
    plt.plot()
    sample_array = samples[varname]

    vec_len, num_samples, num_chains = sample_array.shape
    for cur_chain in range(num_chains):
        cur_label = "Chain {}".format(cur_chain)
        plt.plot(range(num_samples),sample_array[entry,:,cur_chain], label=cur_label)
    plt.legend()
    plt.show()

We check the $\mu$ and $\tau$ parameters

In [8]:
plt.xlabel("Iteration")
plt.ylabel("Value of Mu")
plot_trace(samples,'mu')


plt.xlabel("Iteration")
plt.ylabel("Value of Tau")
plot_trace(samples,'tau')

and the 8 different $\theta_j$

In [9]:
for cur_school in range(8):
    print("Theta for School {}".format(cur_school))
    plt.xlabel("Iteration")
    plt.ylabel("Value of Theta_{}".format(cur_school))
    plot_trace(samples, 'theta', entry=cur_school)
    print("------")
Theta for School 0
------
Theta for School 1
------
Theta for School 2
------
Theta for School 3
------
Theta for School 4
------
Theta for School 5
------
Theta for School 6
------
Theta for School 7
------

Overall, we see pretty rough traces- lots of places with limited variation.

Fixing these defects is tough. Simply running more samples gives you better odds that you've got stuck for an even amount of time in each trap, though it's more 'hope' than 'strategy'. Changing the priors or even how the model is written can help ease the issues. More advanced samplers (e.g. Hamiltonian Monte Carlo implemented in pymc3 or Stan) can help, too.

There are other measures of whether the traces look reasonable- effective sample size, R-hat, and Geweke.

In real life, you should carefully vet your traces and adjust the model/sampler until they look good. See AM207 for lots more on this topic. Here, we're just going to press on as if the traces and samples are legitimate

Exploring The Posterior

The samples produced are basically a big data frame where each row is a sample and each column is one of the prameters of the model. This is everything we know about the posterior. Conceptually, from here forward all we do is describe this data frame- means or histograms or the columns, correlations, etc.

(The samples aren't actually stored as a data frame, but conversion code is provided below)

In [10]:
samples
Out[10]:
{'J': array([[[8., 8., 8., 8.],
         [8., 8., 8., 8.],
         [8., 8., 8., 8.],
         ...,
         [8., 8., 8., 8.],
         [8., 8., 8., 8.],
         [8., 8., 8., 8.]]]),
 'mu': array([[[11.44200703, -5.2483797 ,  3.67912647,  0.0738187 ],
         [12.26069247, -5.48494222,  3.61115491,  4.95019625],
         [10.73895563, -5.085733  ,  3.18646544,  4.80098972],
         ...,
         [11.17434202,  8.62852242,  8.57332894,  3.38636574],
         [11.32547262,  3.60811946,  7.74219932,  6.88329731],
         [11.64152507,  3.81368591,  9.61093627,  7.36313944]]]),
 'sigma': array([[[15., 15., 15., 15.],
         [15., 15., 15., 15.],
         [15., 15., 15., 15.],
         ...,
         [15., 15., 15., 15.],
         [15., 15., 15., 15.],
         [15., 15., 15., 15.]],
 
        [[10., 10., 10., 10.],
         [10., 10., 10., 10.],
         [10., 10., 10., 10.],
         ...,
         [10., 10., 10., 10.],
         [10., 10., 10., 10.],
         [10., 10., 10., 10.]],
 
        [[16., 16., 16., 16.],
         [16., 16., 16., 16.],
         [16., 16., 16., 16.],
         ...,
         [16., 16., 16., 16.],
         [16., 16., 16., 16.],
         [16., 16., 16., 16.]],
 
        ...,
 
        [[11., 11., 11., 11.],
         [11., 11., 11., 11.],
         [11., 11., 11., 11.],
         ...,
         [11., 11., 11., 11.],
         [11., 11., 11., 11.],
         [11., 11., 11., 11.]],
 
        [[10., 10., 10., 10.],
         [10., 10., 10., 10.],
         [10., 10., 10., 10.],
         ...,
         [10., 10., 10., 10.],
         [10., 10., 10., 10.],
         [10., 10., 10., 10.]],
 
        [[18., 18., 18., 18.],
         [18., 18., 18., 18.],
         [18., 18., 18., 18.],
         ...,
         [18., 18., 18., 18.],
         [18., 18., 18., 18.],
         [18., 18., 18., 18.]]]),
 'tau': array([[[5.22944064, 0.51424254, 0.57377092, 7.70896573],
         [4.12126626, 0.62418476, 0.55211472, 5.65434903],
         [5.02714961, 1.34672645, 0.55877668, 7.79227361],
         ...,
         [1.74828205, 5.63677814, 8.53923862, 7.55739259],
         [1.74222428, 5.65749486, 8.58125167, 8.09525945],
         [2.69541102, 4.94712847, 7.39562856, 7.32618989]]]),
 'theta': array([[[16.61972838, -5.56344442,  3.49366297, 13.04717757],
         [ 7.77682071, -5.49960964,  3.53405496,  6.83151556],
         [16.46103894, -5.75185783,  2.86776548,  3.62641676],
         ...,
         [16.8258211 , 10.39344698, -1.57631373, 18.03982207],
         [ 8.72831671,  8.30629003, 13.73868418, 17.59792133],
         [12.05765791, -2.24196389, 27.22559778, 25.53386541]],
 
        [[19.56954737, -5.42537957,  4.05026727,  1.03910367],
         [11.59722917, -5.74709919,  4.54484246,  5.87960291],
         [ 6.367036  , -6.01464545,  3.39471538,  4.42970772],
         ...,
         [ 8.57546221, 10.32027721, 12.7106045 ,  1.31254366],
         [ 8.38944263,  6.24801873,  7.44531707,  5.27553666],
         [10.90258801,  3.43412943, 10.33769674,  9.0311747 ]],
 
        [[ 5.88837309, -4.81366293,  3.15127896, -6.05951826],
         [14.78128587, -5.73556426,  3.71896808,  2.57675358],
         [12.54597606, -5.0653308 ,  3.18413593,  0.49493214],
         ...,
         [11.98891272, 14.48338533, 13.87393438,  0.28335609],
         [10.9354531 ,  8.07284582, -3.44999072,  9.18619352],
         [11.64331293, 10.64072935,  0.95014784, -0.21079075]],
 
        ...,
 
        [[ 1.22263673, -4.68733117,  3.31083688,  0.25609424],
         [18.55268161, -6.09756273,  4.05988122, 12.64476468],
         [ 6.71437458, -4.61057384,  3.52748741, -9.89316271],
         ...,
         [12.20384369,  4.68861594,  6.5070424 ,  3.57409133],
         [ 8.62450786,  8.8124001 ,  4.52069646,  0.46528057],
         [10.86269654,  6.22219415,  4.76471052,  6.61856667]],
 
        [[14.92735666, -5.68611841,  3.78224308,  9.02318814],
         [14.07864705, -4.10942606,  3.2962622 , 14.81264762],
         [16.15917815, -5.94679351,  3.74045733,  5.33112382],
         ...,
         [13.46168853,  8.76175389,  8.48475549,  3.892842  ],
         [ 9.52856636,  7.49486197, 15.59616436,  2.57200075],
         [12.38932389, 10.33383833,  1.72877462,  7.3614203 ]],
 
        [[10.48061998, -5.45665423,  3.91330464,  0.06519383],
         [16.33537128, -5.73639233,  4.15078261, 13.75552883],
         [10.26217066, -6.05283819,  3.24939955, 11.2550645 ],
         ...,
         [ 8.85214681,  4.31514814,  5.14246541,  8.71776731],
         [13.72143201,  9.21182701, 10.58598474, -1.14178797],
         [12.69258184,  6.78170005, 14.78258857, -1.19588779]]]),
 'y': array([[[28., 28., 28., 28.],
         [28., 28., 28., 28.],
         [28., 28., 28., 28.],
         ...,
         [28., 28., 28., 28.],
         [28., 28., 28., 28.],
         [28., 28., 28., 28.]],
 
        [[ 8.,  8.,  8.,  8.],
         [ 8.,  8.,  8.,  8.],
         [ 8.,  8.,  8.,  8.],
         ...,
         [ 8.,  8.,  8.,  8.],
         [ 8.,  8.,  8.,  8.],
         [ 8.,  8.,  8.,  8.]],
 
        [[-3., -3., -3., -3.],
         [-3., -3., -3., -3.],
         [-3., -3., -3., -3.],
         ...,
         [-3., -3., -3., -3.],
         [-3., -3., -3., -3.],
         [-3., -3., -3., -3.]],
 
        ...,
 
        [[ 1.,  1.,  1.,  1.],
         [ 1.,  1.,  1.,  1.],
         [ 1.,  1.,  1.,  1.],
         ...,
         [ 1.,  1.,  1.,  1.],
         [ 1.,  1.,  1.,  1.],
         [ 1.,  1.,  1.,  1.]],
 
        [[18., 18., 18., 18.],
         [18., 18., 18., 18.],
         [18., 18., 18., 18.],
         ...,
         [18., 18., 18., 18.],
         [18., 18., 18., 18.],
         [18., 18., 18., 18.]],
 
        [[12., 12., 12., 12.],
         [12., 12., 12., 12.],
         [12., 12., 12., 12.],
         ...,
         [12., 12., 12., 12.],
         [12., 12., 12., 12.],
         [12., 12., 12., 12.]]])}
In [11]:
display(samples['theta'].shape)
display(samples['mu'].shape)
(8, 2500, 4)
(1, 2500, 4)

The raw samples from pyjags are a dictionary of parameeter names -> 3d arrays

Discussion
  • Why are the sample object's arrays shaped like this?

We can equivalently organize the samples as a data frame (one per chain). The code below will handle this for you

In [12]:
def convert_to_dfs(samples, parameter_names, num_chains):
    """Converts a pyjags sampling result to a list of data frames, one per chain"""
    big_list = []
    for cur_chain_num in range(num_chains):
        df_list = []
        for k in parameter_names:
            v = samples[k]

            chain1_data = v[:,:,cur_chain_num]
            cur_df = pd.DataFrame(chain1_data.T)

            if cur_df.shape[1]==1:
                cur_df = cur_df.rename({0:k}, axis=1)
            else:
                cur_df = cur_df.add_prefix(k)

            df_list.append(cur_df)

        chain1_samples_df = pd.concat(df_list, axis=1)
        big_list.append(chain1_samples_df)
        
    return big_list

chain_df_list = convert_to_dfs(samples,["J","mu","tau","theta","sigma"],num_chains)
chain_df_list[0].head(15)
Out[12]:
J mu tau theta0 theta1 theta2 theta3 theta4 theta5 theta6 theta7 sigma0 sigma1 sigma2 sigma3 sigma4 sigma5 sigma6 sigma7
0 8.0 11.442007 5.229441 16.619728 19.569547 5.888373 11.555302 14.000452 1.222637 14.927357 10.480620 15.0 10.0 16.0 11.0 9.0 11.0 10.0 18.0
1 8.0 12.260692 4.121266 7.776821 11.597229 14.781286 8.849931 11.661383 18.552682 14.078647 16.335371 15.0 10.0 16.0 11.0 9.0 11.0 10.0 18.0
2 8.0 10.738956 5.027150 16.461039 6.367036 12.545976 8.353043 13.858865 6.714375 16.159178 10.262171 15.0 10.0 16.0 11.0 9.0 11.0 10.0 18.0
3 8.0 10.516138 4.554490 7.474957 6.259871 11.719390 10.502339 4.189236 11.253609 16.150101 5.926606 15.0 10.0 16.0 11.0 9.0 11.0 10.0 18.0
4 8.0 10.960373 4.869065 18.473639 12.840373 10.860589 14.785369 3.523220 2.416270 12.767866 10.243774 15.0 10.0 16.0 11.0 9.0 11.0 10.0 18.0
5 8.0 11.096228 3.164246 14.732883 13.481348 5.766054 10.089107 6.318225 15.063753 13.295595 9.481372 15.0 10.0 16.0 11.0 9.0 11.0 10.0 18.0
6 8.0 11.823150 4.010776 12.051695 15.428234 5.185046 13.173989 10.775218 13.420324 7.560915 14.283472 15.0 10.0 16.0 11.0 9.0 11.0 10.0 18.0
7 8.0 11.970236 8.292315 20.341468 10.596276 14.109678 21.072512 8.798100 5.908609 15.986449 15.522908 15.0 10.0 16.0 11.0 9.0 11.0 10.0 18.0
8 8.0 8.504100 7.625681 20.081649 11.187542 -7.337301 4.319860 16.882968 -7.501047 14.923680 23.086732 15.0 10.0 16.0 11.0 9.0 11.0 10.0 18.0
9 8.0 6.782231 6.401574 21.331827 -7.331606 8.548878 7.238340 -1.487119 -7.016653 11.106292 9.737838 15.0 10.0 16.0 11.0 9.0 11.0 10.0 18.0
10 8.0 6.687746 7.104521 12.442858 12.048259 2.113918 9.093249 6.816101 3.553985 13.849016 11.038066 15.0 10.0 16.0 11.0 9.0 11.0 10.0 18.0
11 8.0 6.814846 8.586887 13.693062 7.003443 -1.932552 -5.210427 1.169831 0.766485 15.865447 12.148062 15.0 10.0 16.0 11.0 9.0 11.0 10.0 18.0
12 8.0 5.745068 6.131445 12.496375 15.859827 3.879309 13.467793 4.079493 4.300865 6.819747 -3.968503 15.0 10.0 16.0 11.0 9.0 11.0 10.0 18.0
13 8.0 4.639780 8.804461 18.416387 6.921307 14.155640 -2.364373 8.944646 -2.533777 11.618003 7.952452 15.0 10.0 16.0 11.0 9.0 11.0 10.0 18.0
14 8.0 5.674181 9.300716 20.937233 -0.637050 3.997101 11.917414 11.382043 7.458958 17.209642 19.006517 15.0 10.0 16.0 11.0 9.0 11.0 10.0 18.0

Learn About the Parameters

Once we have the posterior dataset, we can analyze it like we'd analyze any other dataset (with the warning that the rows are correlated, not IID)

In [13]:
for cur_chain in range(num_chains):
    plt.hist(chain_df_list[cur_chain]['mu'], bins=100, histtype='step', density=True, label="Chain {}".format(cur_chain));
plt.xlabel("Value of mu")
plt.ylabel("Approx. Posterior Probability")
plt.legend()
plt.show()

Above, we see that (assuming this model+priors and factoring in all data) $\mu$ seems to range from about -2.5 to about 20, and its most likely value is between 5 and 10.

Example:

  • What's the probability that $\mu$ is above 5?
    • What does this mean in context?

Answer:

In [14]:
count_above_5 = np.sum(chain_df_list[0]['mu'] > 15)
total = len(chain_df_list[0]['mu'])

count_above_5/total
Out[14]:
0.0468

(Using just chain 0), there's a roughly 2-8% chance (changes from run to run) that $\mu$ is above 15. In context, this means that a there's very little chance the program, on average, increases student scores by 15 or more. Practically, we might decide that the program isn't worth it; maybe we only feel increases of less than 50 SAT points aren't worth paying for.

Discussion
  • Is it more apropriate to use one chain, or combine all four?

Repeating that plot for $\tau$

In [15]:
for cur_chain in range(num_chains):
    plt.hist(chain_df_list[cur_chain]['tau'], bins=100, histtype='step', density=True, label="Chain {}".format(cur_chain));
plt.xlabel("Value of tau")
plt.ylabel("Approx. Posterior Probability")
plt.legend()
plt.show()
Exercise 1
  • What can you conclude from the plot above?

Answer:

Exploring effect in each school

Investigating the $\theta_j$ is a touch more complex becuase there is one per school. Overall, it looks like each school's theta is about the same-- the model is suggesting pooling all the schools' data together. We can also see some issues from the sampler manifesting; different chains get stuck in and create bumps in different places.

In [17]:
fig, ax = plt.subplots(1,num_chains, figsize=(20,5), sharex=True, sharey=True)

for cur_chain in range(num_chains):
    ax[cur_chain].set_xlabel("Thetas from Chain {}".format(cur_chain))
    ax[cur_chain].set_ylabel("Approx. Posterior Probability".format(cur_chain))
    for i in range(8):
        cur_name = 'theta'+str(i)
        all_theta_n = samples['theta'][i,:,:]
        ax[cur_chain].hist(chain_df_list[cur_chain][cur_name], 
                            bins=100, 
                            histtype='step', 
                            label="School {}".format(school_data["School"][i]),
                            density=True
                            )
plt.legend()
plt.show()
Exercise 2
  • Interrogate the results
    • Summarize the effect of school A's program- how many SAT points do students seem to gain on average?
    • What's the probability that the program, on average, actually lowers scores?
    • What's the probability that a particular school would end up with a program that lowers student scores?
  • Can you make the MCMC converge better?
    • Better priors?
    • Longer sampling run?

Answers:

Discussion
  • Overall, what does a Bayesian analysis buy you?
  • What is the price of entry?

Harry Potter and the Latent Dirichlet Analysis

Before we do anything else, let's talk about the Dirichlet distribution

image.png

The Dirichlet distribution takes in N parameters and spits out a probability vector of length N. The above graphs show which probability vectors are likely for different parameter settings, but it's easiest just to see for yourself:

In [50]:
length = 3
for rep in range(5):
    vec = np.random.dirichlet([.1]*length)
    print(np.round(vec,3))
[0.036 0.034 0.931]
[0.    0.034 0.966]
[0. 1. 0.]
[0.344 0.656 0.   ]
[0. 1. 0.]

(above) Values less than 1 make a few entries of the output vector large, and most entries small

In [51]:
for rep in range(5):
    vec = np.random.dirichlet([1]*length)
    print(np.round(vec,3))
[0.559 0.329 0.112]
[0.263 0.038 0.699]
[0.716 0.133 0.151]
[0.408 0.086 0.506]
[0.657 0.272 0.071]

(above) Values of 1 make all possible probability vectors equally likely, in some sense

Exercise 3
  • What happens when the inputs to Dirichlet are all large (above 1)?
  • What happens if you make one entry substantially bigger than the others?

Answer:

The LDA Model (as code)

Prof Glickman did a good job in lecture covering LDA from lots of different viewpoints. The one thing we haven't seen is actual code to produce a document via the LDA framework. I honestly think this is the clearest way to tell LDA's story.

In [54]:
np.random.seed(7)

# scalar givens
num_docs = 5
num_topics = 3
vocab = ["yes", "no", "harry", "hermione", "ron", "dumbledore"]
vocab_len = len(vocab)

# vector givens: alphas and betas are given (or found when we solve the model)
alphas = .8*np.ones(num_topics)
betas = .3*np.ones(vocab_len)

# each document has a probability of talking about each topic
thetas = np.zeros((num_docs, num_topics))
for cur_doc in range(num_docs):
    thetas[cur_doc,:] = np.random.dirichlet(alphas)

# each topic has a probability of talking about each word
phis = np.zeros((num_topics, vocab_len))
for cur_topic in range(num_topics):
    phis[cur_topic,:] = np.random.dirichlet(betas)


##
# write document 1 for 20 words, as an example
##
cur_doc =1
doc_words = []

# get the document's probability of talking about each topic
topic_probs = thetas[cur_doc,:]

# for each word in the document's length:
for cur_word in range(20):
    # Using the document's topic probabilities, randomly decide the topic this word belongs to
    cur_topic_vec = np.random.multinomial(1,topic_probs)
    cur_topic_int = np.argmax(cur_topic_vec)
    
    # Using the topic's word probabilites, randomly decide which word will appear
    word_probs = phis[cur_topic,:]
    cur_word_vec = np.random.multinomial(1,word_probs)
    cur_word_int = np.argmax(cur_word_vec)
    
    # store the word
    doc_words.append(vocab[cur_word_int])

print("Dpcument 1's Topics")
print(thetas[0,:])
print("Document 1's Words")
print(doc_words)
Dpcument 1's Topics
[0.008885   0.08921879 0.90189621]
Document 1's Words
['ron', 'ron', 'dumbledore', 'ron', 'ron', 'ron', 'dumbledore', 'ron', 'no', 'no', 'no', 'no', 'ron', 'ron', 'ron', 'yes', 'no', 'ron', 'ron', 'ron']
Exercise 2
  • Interpret the $\theta$ and $\phi$ matrices, below
  • Why do "no", "dumbledor" and "ron" show up so much in document 1?
In [55]:
display(np.round(thetas,2))
display(np.round(phis,2))
array([[0.01, 0.09, 0.9 ],
       [0.31, 0.12, 0.57],
       [0.47, 0.32, 0.21],
       [0.22, 0.05, 0.73],
       [0.35, 0.56, 0.09]])
array([[0.04, 0.04, 0.44, 0.16, 0.07, 0.25],
       [0.18, 0.13, 0.01, 0.14, 0.54, 0.  ],
       [0.02, 0.38, 0.  , 0.03, 0.45, 0.12]])

Answer:

Answer:

The LDA model (as equations)

There are T topics, D documents, V words in the corpus

$$\alpha_t,\ given \text{ for t=1,2,...T}$$ $$\beta_w,\ given \text{ for w=1,2,...V}$$ $$ \theta_{d,:} \sim Dirichlet\left(\alpha_1,\alpha_2, ...,\alpha_T\right) \text{ for d=1,2,...D}$$ $$ \phi_{t,:} \sim Dirichlet\left(\beta_1,\beta_2, ...,\beta_W\right) \text{ for t=1,2,...T}$$

$$z_{d,i} \sim Multinomial\left(\theta_{d,1},\theta_{d,2},...\theta_{d,T}\right) \text{ for each d,i}$$ $$w_{d,i} \sim Multinomial\left(\phi_{z_{d,i},1},\phi_{z_{d,i},2},...\phi_{z_{d,i},V}\right) \text{ for each d,i}$$

$w_{d,i}$ is the $i$th word in the $d$th document
$z_{d,i}$ is the topic of the $i$th word in the $d$th document
$\phi_{t,i}$ is the probability of word $i$ appearing when the topic is $t$
$\theta_{d,t}$ is the probability of document $d$ talking about topic $t$
The $\beta_w$ determine how likely overall each word is, and how many words typically occur in the same topic
The $\alpha_t$ determine how likely overall each topic is, and how many topics typically occur in the same document

Fitting LDA

In [73]:
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

from gensim.corpora import Dictionary
from gensim.models.ldamodel import LdaModel
from gensim.models import CoherenceModel
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/40960295/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

Below is the code used to clean the Harry Potter books. Because we're not about to host the full text on a public GitHub, these cells will produce errors. If you see someone ask about it on Piazza, please cheekily refer them to this sentence.

First, we define what counts as a word for us. The regular expression below translates to "any number of letters, digits, numbers, and apostrophes". Note that we'll include "Ron" and "Ron's" as different words, and "fast-acting" is two words.

In [74]:
tokenizer = RegexpTokenizer(r'[\w\']+') # \w means any letter or digit. 
#Overall, "words" in the corpus are any number of letters, digits, and apostrophes. No hyphens or other fancyness.

Often, any very-common words are dropped from the text before extracting topics from it. NLTK provides a list of common words in different languages, which we augment with Potter-specfic words.

In [75]:
stop_words = set(stopwords.words('english'))
stop_words.update(['harry','hermione','ron']) #add stop words for three main characters
stop_words.update(['said','got','get','would','could']) #empirically find these words very common

You can see the cleaning process below. For each book, we split it into chapters by hunting for the phrase "Chapter" or "Epilogue", lowercase the chapter, split it into a list of individual words, and purge the common words.

In [112]:
hp_text_array = [0]*7
for booknum in range(7):
    print(booknum)
    with open("data/harrypotter_book"+str(booknum+1)+".txt", "r", encoding="UTF-8") as book: 
        book_text = book.read()
        chapter_text_list = re.split(r'Chapter\s?\d*|Epilogue',book_text)
        hp_text_array[booknum] = chapter_text_list
        
        for cur_chapter_id,cur_chapter_text in enumerate(chapter_text_list):
            #make everything lower case
            cur_chapter_text = cur_chapter_text.lower()
            #tokenize
            chapter_word_list = tokenizer.tokenize(cur_chapter_text)
            #remove stop words
            purged_word_list = [cur_word for cur_word in chapter_word_list if cur_word not in stop_words]
            
            #store: [book][chapter]->list of words
            hp_text_array[booknum][cur_chapter_id] = purged_word_list
0
1
2
3
4
5
6
In [116]:
import pickle
with open("data/HP_words.pkl", "wb") as outfile:
    pickle.dump(hp_text_array, outfile)

This is the cell that will load the processesed data so you can continue with lab. An example of the structure we're working with is below

In [117]:
with open("data/HP_words.pkl", "rb") as infile:
    hp_text_array = pickle.load(infile)
In [118]:
hp_text_array[0][11] #all (non-trivial) words in book 1, chapter 11 (yes 11- there's a preamble)
Out[118]:
['quidditch',
 'entered',
 'november',
 'weather',
 'turned',
 'cold',
 'mountains',
 'around',
 'school',
 'became',
 'icy',
 'gray',
 'lake',
 'like',
 'chilled',
 'steel',
 'every',
 'morning',
 'ground',
 'covered',
 'frost',
 'hagrid',
 'seen',
 'upstairs',
 'windows',
 'defrost',
 'ing',
 'broomsticks',
 'quidditch',
 'field',
 'bundled',
 'long',
 'moleskin',
 'overcoat',
 'rabbit',
 'fur',
 'gloves',
 'enormous',
 'beaverskin',
 'boots',
 'quidditch',
 'season',
 'begun',
 'saturday',
 'playing',
 'first',
 'match',
 'weeks',
 'training',
 'gryffindor',
 'versus',
 'slytherin',
 'gryffindor',
 'move',
 'second',
 'place',
 'house',
 'championship',
 'hardly',
 'anyone',
 'seen',
 'play',
 'wood',
 'decided',
 'secret',
 'weapon',
 'kept',
 'well',
 'secret',
 'news',
 'playing',
 'seeker',
 'leaked',
 'somehow',
 'know',
 'worse',
 'people',
 'telling',
 'brilliant',
 'people',
 'telling',
 'running',
 'around',
 'neath',
 'holding',
 'mattress',
 'really',
 'lucky',
 'friend',
 'know',
 'gotten',
 'homework',
 'last',
 'minute',
 'quidditch',
 'practice',
 'wood',
 'making',
 'also',
 'lent',
 'quidditch',
 'ages',
 'turned',
 'interesting',
 'read',
 'learned',
 'seven',
 'hundred',
 'ways',
 'commit',
 'ting',
 'quidditch',
 'foul',
 'happened',
 'world',
 'cup',
 'match',
 '1473',
 'seekers',
 'usually',
 'smallest',
 'fastest',
 'players',
 'serious',
 'quidditch',
 'accidents',
 'seemed',
 'happen',
 'although',
 'people',
 'rarely',
 'died',
 'play',
 'ing',
 'quidditch',
 'referees',
 'known',
 'vanish',
 'turn',
 'months',
 'later',
 'sahara',
 'desert',
 'become',
 'bit',
 'relaxed',
 'breaking',
 'rules',
 'since',
 'saved',
 'mountain',
 'troll',
 'much',
 'nicer',
 'day',
 'first',
 'quidditch',
 'match',
 'three',
 'freezing',
 'courtyard',
 'break',
 'conjured',
 'bright',
 'blue',
 'fire',
 'carried',
 'around',
 'jam',
 'jar',
 'standing',
 'backs',
 'getting',
 'warm',
 'snape',
 'crossed',
 'yard',
 'noticed',
 'snape',
 'limping',
 'moved',
 'closer',
 'together',
 'block',
 'fire',
 'view',
 'sure',
 'allowed',
 'unfortunately',
 'something',
 'guilty',
 'faces',
 'caught',
 'snape',
 'eye',
 'limped',
 'seen',
 'fire',
 'seemed',
 'looking',
 'reason',
 'tell',
 'anyway',
 'potter',
 'quidditch',
 'ages',
 'showed',
 'library',
 'books',
 'taken',
 'outside',
 'school',
 'snape',
 'give',
 'five',
 'points',
 'gryffindor',
 'made',
 'rule',
 'muttered',
 'angrily',
 'snape',
 'limped',
 'away',
 'wonder',
 'wrong',
 'leg',
 'dunno',
 'hope',
 'really',
 'hurting',
 'bitterly',
 'gryffindor',
 'common',
 'room',
 'noisy',
 'evening',
 'sat',
 'together',
 'next',
 'window',
 'checking',
 'charms',
 'homework',
 'never',
 'let',
 'copy',
 'learn',
 'asking',
 'read',
 'right',
 'answers',
 'anyway',
 'felt',
 'restless',
 'wanted',
 'quidditch',
 'ages',
 'back',
 'take',
 'mind',
 'nerves',
 'tomorrow',
 'afraid',
 'snape',
 'getting',
 'told',
 'go',
 'ing',
 'ask',
 'snape',
 'better',
 'together',
 'idea',
 'snape',
 'refuse',
 'teachers',
 'listening',
 'made',
 'way',
 'staffroom',
 'knocked',
 'answer',
 'knocked',
 'nothing',
 'perhaps',
 'snape',
 'left',
 'book',
 'worth',
 'try',
 'pushed',
 'door',
 'ajar',
 'peered',
 'inside',
 'horrible',
 'scene',
 'met',
 'eyes',
 'snape',
 'filch',
 'inside',
 'alone',
 'snape',
 'holding',
 'robes',
 'knees',
 'one',
 'legs',
 'bloody',
 'mangled',
 'filch',
 'handing',
 'snape',
 'bandages',
 'blasted',
 'thing',
 'snape',
 'saying',
 'supposed',
 'keep',
 'eyes',
 'three',
 'heads',
 'tried',
 'shut',
 'door',
 'quietly',
 'potter',
 'snape',
 'face',
 'twisted',
 'fury',
 'dropped',
 'robes',
 'quickly',
 'hide',
 'leg',
 'gulped',
 'wondered',
 'book',
 'back',
 'left',
 'snape',
 'take',
 'points',
 'gryffindor',
 'sprinted',
 'back',
 'upstairs',
 'asked',
 'joined',
 'matter',
 'low',
 'whisper',
 'told',
 'seen',
 'know',
 'means',
 'finished',
 'breathlessly',
 'tried',
 'past',
 'three',
 'headed',
 'dog',
 'halloween',
 'going',
 'saw',
 'whatever',
 'guarding',
 'bet',
 'broomstick',
 'let',
 'troll',
 'make',
 'diversion',
 'eyes',
 'wide',
 'know',
 'nice',
 'try',
 'steal',
 'something',
 'dumbledore',
 'keeping',
 'safe',
 'honestly',
 'think',
 'teachers',
 'saints',
 'thing',
 'snapped',
 'put',
 'anything',
 'past',
 'snape',
 'dog',
 'guarding',
 'went',
 'bed',
 'head',
 'buzzing',
 'ques',
 'tion',
 'neville',
 'snoring',
 'loudly',
 'sleep',
 'tried',
 'empty',
 'mind',
 'needed',
 'sleep',
 'first',
 'quidditch',
 'match',
 'hours',
 'expression',
 'snape',
 'face',
 'seen',
 'leg',
 'easy',
 'forget',
 'next',
 'morning',
 'dawned',
 'bright',
 'cold',
 'great',
 'hall',
 'full',
 'delicious',
 'smell',
 'fried',
 'sausages',
 'cheerful',
 'chatter',
 'everyone',
 'looking',
 'forward',
 'good',
 'quidditch',
 'match',
 'eat',
 'breakfast',
 'want',
 'anything',
 'bit',
 'toast',
 'wheedled',
 'hungry',
 'felt',
 'terrible',
 'hour',
 'time',
 'walking',
 'onto',
 'field',
 'need',
 'strength',
 'seamus',
 'finnigan',
 'seek',
 'ers',
 'always',
 'ones',
 'clobbered',
 'team',
 'thanks',
 'seamus',
 'watching',
 'seamus',
 'pile',
 'ketchup',
 'sausages',
 'eleven',
 'clock',
 'whole',
 'school',
 'seemed',
 'stands',
 'around',
 'quidditch',
 'pitch',
 'many',
 'students',
 'binoculars',
 'seats',
 'might',
 'raised',
 'high',
 'air',
 'still',
 'difficult',
 'see',
 'going',
 'sometimes',
 'joined',
 'neville',
 'seamus',
 'dean',
 'west',
 'ham',
 'fan',
 'top',
 'row',
 'surprise',
 'painted',
 'large',
 'banner',
 'one',
 'sheets',
 'scabbers',
 'ruined',
 'potter',
 'president',
 'dean',
 'good',
 'drawing',
 'done',
 'large',
 'gryffindor',
 'lion',
 'underneath',
 'performed',
 'tricky',
 'little',
 'charm',
 'paint',
 'flashed',
 'different',
 'colors',
 'meanwhile',
 'locker',
 'room',
 'rest',
 'team',
 'changing',
 'scarlet',
 'quidditch',
 'robes',
 'slytherin',
 'playing',
 'green',
 'wood',
 'cleared',
 'throat',
 'silence',
 'okay',
 'men',
 'women',
 'chaser',
 'angelina',
 'johnson',
 'women',
 'wood',
 'agreed',
 'big',
 'one',
 'fred',
 'weasley',
 'one',
 'waiting',
 'george',
 'know',
 'oliver',
 'speech',
 'heart',
 'fred',
 'told',
 'team',
 'last',
 'year',
 'shut',
 'two',
 'wood',
 'best',
 'team',
 'gryf',
 'findor',
 'years',
 'going',
 'win',
 'know',
 'glared',
 'say',
 'else',
 'right',
 'time',
 'good',
 'luck',
 'followed',
 'fred',
 'george',
 'locker',
 'room',
 'hoping',
 'knees',
 'going',
 'give',
 'way',
 'walked',
 'onto',
 'field',
 'loud',
 'cheers',
 'madam',
 'hooch',
 'refereeing',
 'stood',
 'middle',
 'field',
 'waiting',
 'two',
 'teams',
 'broom',
 'hand',
 'want',
 'nice',
 'fair',
 'game',
 'gathered',
 'around',
 'noticed',
 'seemed',
 'speaking',
 'particularly',
 'slytherin',
 'captain',
 'marcus',
 'flint',
 'fifth',
 'year',
 'thought',
 'flint',
 'looked',
 'troll',
 'blood',
 'corner',
 'eye',
 'saw',
 'fluttering',
 'banner',
 'high',
 'flashing',
 'potter',
 'president',
 'crowd',
 'heart',
 'skipped',
 'felt',
 'braver',
 'mount',
 'brooms',
 'please',
 'clambered',
 'onto',
 'nimbus',
 'two',
 'thousand',
 'madam',
 'hooch',
 'gave',
 'loud',
 'blast',
 'silver',
 'whistle',
 'fifteen',
 'brooms',
 'rose',
 'high',
 'high',
 'air',
 'quaffle',
 'taken',
 'immediately',
 'angelina',
 'johnson',
 'gryffindor',
 'excellent',
 'chaser',
 'girl',
 'rather',
 'tractive',
 'jordan',
 'sorry',
 'professor',
 'weasley',
 'twins',
 'friend',
 'lee',
 'jordan',
 'commen',
 'tary',
 'match',
 'closely',
 'watched',
 'professor',
 'mcgonagall',
 'really',
 'belting',
 'along',
 'neat',
 'pass',
 'alicia',
 'spinnet',
 'good',
 'find',
 'oliver',
 'wood',
 'last',
 'year',
 'reserve',
 'back',
 'johnson',
 'slytherins',
 'taken',
 'quaffle',
 'slytherin',
 'captain',
 'marcus',
 'flint',
 'gains',
 'quaffle',
 'goes',
 'flint',
 'flying',
 'like',
 'eagle',
 'going',
 'sc',
 'stopped',
 'excellent',
 'move',
 'gryffindor',
 'keeper',
 'wood',
 'gryffindors',
 'take',
 'quaffle',
 'chaser',
 'katie',
 'bell',
 'gryffindor',
 'nice',
 'dive',
 'around',
 'flint',
 'field',
 'ouch',
 'must',
 'hurt',
 'hit',
 'back',
 'head',
 'bludger',
 'quaffle',
 'taken',
 'slytherins',
 'adrian',
 'pucey',
 'speeding',
 'toward',
 'goal',
 'posts',
 'blocked',
 'second',
 'bludger',
 'sent',
 'way',
 'fred',
 'george',
 'weasley',
 'tell',
 'nice',
 'play',
 'gryffindor',
 'beater',
 'anyway',
 'johnson',
 'back',
 'possession',
 'quaffle',
 'clear',
 'field',
 'ahead',
 'goes',
 'really',
 'flying',
 'dodges',
 'speeding',
 'bludger',
 'goal',
 'posts',
 'ahead',
 'come',
 'angelina',
 'keeper',
 'bletch',
 'ley',
 'dives',
 'misses',
 'gryffindors',
 'score',
 'gryffindor',
 'cheers',
 'filled',
 'cold',
 'air',
 'howls',
 'moans',
 'slytherins',
 'budge',
 'move',
 'along',
 'hagrid',
 'squeezed',
 'together',
 'give',
 'hagrid',
 'enough',
 'space',
 'join',
 'bin',
 'watchin',
 'hut',
 'hagrid',
 'patting',
 'large',
 'pair',
 'binoculars',
 'around',
 'neck',
 'bein',
 'crowd',
 'sign',
 'snitch',
 'yet',
 'eh',
 'nope',
 'much',
 'yet',
 'kept',
 'outta',
 'trouble',
 'though',
 'somethin',
 'hagrid',
 'rais',
 'ing',
 'binoculars',
 'peering',
 'skyward',
 'speck',
 'way',
 'gliding',
 'game',
 'squinting',
 'sign',
 'snitch',
 'part',
 'wood',
 'game',
 'plan',
 'keep',
 'way',
 'catch',
 'sight',
 'snitch',
 'wood',
 'want',
 'attacked',
 'angelina',
 'scored',
 'done',
 'couple',
 'loop',
 'loops',
 'let',
 'feelings',
 'back',
 'staring',
 'around',
 'snitch',
 'caught',
 'sight',
 'flash',
 'gold',
 'reflection',
 'one',
 'weasleys',
 'wristwatches',
 'bludger',
 'decided',
 'come',
 'pelting',
 'way',
 'like',
 'cannonball',
 'anything',
 'dodged',
 'fred',
 'weasley',
 'came',
 'chas',
 'ing',
 'right',
 'time',
 'yell',
 'beat',
 'bludger',
 'furiously',
 'toward',
 'marcus',
 'flint',
 'slytherin',
 'possession',
 'lee',
 'jordan',
 'saying',
 'chaser',
 'pucey',
 'ducks',
 'two',
 'bludgers',
 'two',
 'weasleys',
 'chaser',
 'bell',
 'speeds',
 'toward',
 'wait',
 'moment',
 'snitch',
 'murmur',
 'ran',
 'crowd',
 'adrian',
 'pucey',
 'dropped',
 'quaffle',
 'busy',
 'looking',
 'shoulder',
 'flash',
 'gold',
 'passed',
 'left',
 'ear',
 'saw',
 'great',
 'rush',
 'excitement',
 'dived',
 'downward',
 'streak',
 'gold',
 'slytherin',
 'seeker',
 'terence',
 'higgs',
 'seen',
 'neck',
 'neck',
 'hurtled',
 'toward',
 'snitch',
 'chasers',
 'seemed',
 'forgotten',
 'supposed',
 'hung',
 'midair',
 'watch',
 'faster',
 'higgs',
 'see',
 'little',
 'round',
 'ball',
 'wings',
 'fluttering',
 'darting',
 'ahead',
 'put',
 'extra',
 'spurt',
 'speed',
 'wham',
 'roar',
 'rage',
 'echoed',
 'gryffindors',
 'marcus',
 'flint',
 'blocked',
 'purpose',
 ...]

Exploring

Let's see if books 1 and 7 differ in their most common words.

Python's Counter object is really good at, well, counting.

In [119]:
from collections import Counter
counts = Counter(hp_text_array[0][11]) #count which words are in Book 1, Chapter 11
counts.most_common(15)
Out[119]:
[('snape', 31),
 ('hagrid', 19),
 ('gryffindor', 17),
 ('quidditch', 15),
 ('broom', 15),
 ('flint', 14),
 ('back', 12),
 ('know', 11),
 ('one', 11),
 ('wood', 10),
 ('quaffle', 9),
 ('around', 8),
 ('snitch', 8),
 ('ing', 7),
 ('field', 7)]

You can add counter objects together. Let's check out the top words in Book 1 versus Book 7. Does Book 7 seem darker?

In [120]:
book1_counter = Counter()
for cur_chapter_words in hp_text_array[0]:
    chapter_word_count = Counter(cur_chapter_words)
    book1_counter += chapter_word_count
    
book1_counter.most_common(10)
Out[120]:
[('hagrid', 357),
 ('back', 261),
 ('one', 258),
 ('know', 215),
 ('like', 192),
 ('see', 180),
 ('snape', 171),
 ('professor', 170),
 ('looked', 169),
 ('dumbledore', 152)]
In [121]:
book7_counter = Counter()
for cur_chapter_words in hp_text_array[6]:
    chapter_word_count = Counter(cur_chapter_words)
    book7_counter += chapter_word_count
    
book7_counter.most_common(10)
Out[121]:
[('wand', 596),
 ('dumbledore', 591),
 ('back', 544),
 ('know', 542),
 ('like', 458),
 ('one', 449),
 ('voldemort', 446),
 ('looked', 432),
 ('around', 363),
 ('still', 352)]

Hmmm, there's a lot of overlap ("back","know","like","looked"), but book 1 has a lot more Hagrid, and book 7 has a lot more Voldemort.

Exercise 2
  • Find the 20 most common words across all the books

Answer:

Fitting LDA

LDA wants to operate on a list of all documents. (Here, we're treating each chapter as its own document). We need to restructure our data

In [123]:
list_of_docs = []
for book_id in range(7):
    for chapter in hp_text_array[book_id]:
        list_of_docs.append(chapter)

We build a gensim Dictionary on all the documents -- this tracks and numbers all words used in any of the documents.

In [124]:
masterdictionary = Dictionary(list_of_docs)

We use the doc2bow to convert each document to a numerical format

In [125]:
mastercorpus = [masterdictionary.doc2bow(doc) for doc in list_of_docs]
In [126]:
mastercorpus[11][:20] #20 words and their counts from book 1, chapter 11
Out[126]:
[(32, 1),
 (54, 3),
 (57, 2),
 (58, 2),
 (59, 1),
 (70, 1),
 (76, 1),
 (82, 11),
 (86, 1),
 (90, 4),
 (111, 3),
 (113, 1),
 (130, 1),
 (133, 1),
 (138, 1),
 (139, 1),
 (145, 1),
 (148, 6),
 (151, 1),
 (152, 2)]

Invoking the dictionary, we can translate to see that book 1, chapter 11 used the word 'hogwarts' once and the word 'one' eleven times.

In [127]:
masterdictionary[59],masterdictionary[82]
Out[127]:
('hogwarts', 'one')

Now, we're ready to actually fit a model

In [128]:
seven_book_model = LdaModel(mastercorpus, num_topics=7, id2word = masterdictionary, passes=10)

We can investigate any particular topic

In [129]:
seven_book_model.show_topic(2, topn=20)
Out[129]:
[('weasley', 0.009971628),
 ('mr', 0.0061152214),
 ('back', 0.0056994683),
 ('know', 0.00521109),
 ('one', 0.003969954),
 ('like', 0.003714172),
 ('car', 0.0036524823),
 ('fred', 0.0035978646),
 ('mrs', 0.0035469406),
 ('around', 0.0033618358),
 ('looked', 0.0031710735),
 ('lupin', 0.0030880375),
 ('george', 0.003071988),
 ('time', 0.0030347034),
 ('door', 0.0030331232),
 ('see', 0.0030115694),
 ('right', 0.00298992),
 ('go', 0.0029416424),
 ('think', 0.0027970911),
 ('hogwarts', 0.0027610017)]

It's nicer to plot the heavy-hitting words in each topic, though

In [130]:
top_words = [[word for word,_ in seven_book_model.show_topic(topicno, topn=50)] for topicno in range(seven_book_model.num_topics)]
top_betas = [[beta for _,beta in seven_book_model.show_topic(topicno, topn=50)] for topicno in range(seven_book_model.num_topics)]

top_words[0][:5]
top_betas[0][:5]
gs  = gridspec.GridSpec(3,3)
gs.update(wspace=0.5, hspace=0.5)
plt.figure(figsize=(11,8.5))
for i in range(7):
    #new subplot
    ax = plt.subplot(gs[i])
    plt.barh(range(5), top_betas[i][:5], align='center',color='blue', ecolor='black')
    ax.invert_yaxis()
    ax.set_yticks(range(5))
    ax.set_yticklabels(top_words[i][:5])
    plt.title("Topic "+str(i))
In [131]:
#finding optimal number of topics for book 1 via coherence measure u_mass
coherence_vals = []
for ntop in range(1,12):
    mod = LdaModel(mastercorpus, num_topics = ntop, id2word = masterdictionary, passes=10)
    cmod = CoherenceModel(model=mod, corpus=mastercorpus, dictionary=masterdictionary, coherence='u_mass')
    cval = cmod.get_coherence()
    print(ntop,cval)
    coherence_vals.append(cval)
1 -0.06658074296200506
2 -0.09209196248981996
3 -0.12238222054387032
4 -0.12166218708149992
5 -0.17002467133413787
6 -0.890047503574987
7 -0.2968548788467019
8 -0.6939044922084053
9 -0.5234586594541473
10 -0.19845307086804698
11 -0.3880701368848243
In [132]:
_ = plt.figure(figsize=(11,8.5))
_ = plt.plot(range(1,12),coherence_vals)
_ = plt.xlabel("Number of Topics")
_ = plt.ylabel("Coherence Score")
Discussion
  • Overall, is LDA more like KNN or K-means?