CS109B Data Science 2: Advanced Topics in Data Science
Lab 9 - Bayes, Part 2 - LDA¶
Harvard University
Spring 2019
Instructors: Mark Glickman and Pavlos Protopapas
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import gridspec
import re
import scipy.stats
import pyjags
## RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2019-CS109B/master/content/styles/cs109.css").text
HTML(styles)
Schools Data and Bayesian Modeling¶
Once upon a time, eight different schools each implemented a particular SAT prep program. At the end of the year, students from each school's program took the SATs and we've recorded the students' average performance relative to a control group that got no treatment. We've also recorded the standard error for each school's estimated increase, a measure of how accurate the estimated increase is. (Standard Error factors in the number of students who took the program and how variable their scores were).
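For intuition about where those standard errors come from: for an average over $n$ students whose individual score changes have standard deviation $s$, the standard error behaves roughly like
$$SE \approx \frac{s}{\sqrt{n}}$$
so schools with more students (or less variable students) get smaller standard errors. This is background intuition only; the lab takes each school's standard error as given.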
school_data = pd.read_csv("data/gelman_schools.csv")
school_data
The measurements look something like this:
x_vals = np.linspace(-50,80,1000)
for cur_school_id in range(8):
    name = school_data["School"][cur_school_id]
    mean = school_data["Estimated Score Increase"][cur_school_id]
    sd = school_data["Standard error of the estimate"][cur_school_id]
    norm_obj = scipy.stats.norm(mean,sd)
    plt.plot(x_vals, norm_obj.pdf(x_vals), label=name)
plt.xlabel("School's Performance")
plt.ylabel("Probability Density")
plt.legend()
plt.show()
- How effective is the program?
- Does it increase scores? By how much?
A Hierarchical Model for the Schools¶
Gelman (or maybe Rubin?) proposed the following model to explain the data and understand the effect of the program on SAT scores. As in class, the point of the model is that it neither takes each school's result at face value, nor ignores variation from school to school.
$$\sigma_j\ given$$ $$\mu \sim Uniform(-20,20)$$ $$\tau \sim Uniform(0,10)$$ $$\theta_j \sim Normal(Mean=\mu,\ SD=\tau)$$ $$y_j \sim Normal(Mean=\theta_j,\ SD=\sigma_j)$$
$y_j$ are the observed SAT increase at each school
$\sigma_j$ are the observed standard errors at each school
$\theta_j$ is the 'true' effect of each school's program
$\mu$ and $\tau$ govern the distribution of possible true effects.
Bayesian models (and others) are often presented like this. Actually, they're usually presented even less clearly (e.g. missing the definition of each symbol). Let's learn how to make sense of this jumble.
First pass: Understanding the parameters
This pass is typically easiest from bottom to top, starting with the observed parameters that you know how to interpret. It often also requires reading about the context of the data or what the modeler is trying to encode with each parameter.
$$y_j \sim Normal(Mean=\theta_j,\ SD=\sigma_j)$$
First, the observed data (one value per school) are 1) normally distributed 2) each centered at a different value, one per school. The 8 $\theta_j$ parameters are the 'true average effect' of the program in school j, separate from the (noisy) effect we actually observe.
$$\theta_j \sim Normal(Mean=\mu,\ SD=\tau)$$
Where do the 'true average effects' in each school come from? The line above says that they're all draws from a normal distribution with a particular mean and standard deviation. Okay, so they all come from the same family; that's fine for now. Moving on,
$$\sigma_j\ given$$ $$\mu \sim Uniform(-20,20)$$ $$\tau \sim Uniform(0,10)$$
Finally, the parameters defining what 'true average effects' we might see come from specific uniform distributions. In particular, the author encodes that the center of the 'true average effect' distribution is strictly between -20 and 20, and the spread of the 'true average effect' distribution is somewhere between 0 and 10.
Second Pass: Summarizing: The author's story is that when someone implements this curriculum in a given school, the actual (long-term?) effectiveness of that program is secretly picked from a normal distribution. Then the observed effect is a noisy version of that actual effectiveness, with the noise set by that school's standard error. (The school's standard error is determined by the number of students who were in the program and the variability in how well those students did, but this model takes it as given for each school.)
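To make the story concrete, here is a minimal forward simulation of that narrative (a sketch that draws from the priors above; it is not a fit to the data, and the seed is arbitrary):
np.random.seed(109)  # arbitrary seed for this illustration
J = school_data.shape[0]
sigma = school_data["Standard error of the estimate"].values   # given

mu_sim = np.random.uniform(-20, 20)                    # center of the 'true effect' distribution
tau_sim = np.random.uniform(0, 10)                     # spread of the 'true effect' distribution
theta_sim = np.random.normal(mu_sim, tau_sim, size=J)  # each school's secret true effect
y_sim = np.random.normal(theta_sim, sigma)             # the noisy effects we'd actually observe

print("mu, tau:", np.round(mu_sim,1), np.round(tau_sim,1))
print("simulated observed effects:", np.round(y_sim,1))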
Third pass: Critiquing the story
- Does the author's overall story make sense?
- Do you believe in a 'true' effectiveness in each school, distinct from the observed effectiveness?
- Do you believe that schools' 'true' effectiveness all come from some distribution? Or do you think they're all the same? Or all unrelated?
- What does it mean, in context, for the author to say $\mu$ is definitely between -20 and 20? Does that seem reasonable to you?
- What does it mean, in context, for the author to say $\tau$ is definitely between 0 and 10? Does that seem reasonable to you?
As a class, come up with a prior for $\mu$ and $\tau$
Coding the Model¶
To work in JAGS, we need to write out the model in a specific code-like format. That format is designed to be a mix of the equation description of the model above and R syntax.
The tricky parts are:
- Writing a loop for any variables with indices, and double loops if the variable has two indices
- Looking up the abbreviation for each distribution (usually the letter d plus a short-ish version of the distribution name)
- Looking up the parameters the distributions want (does it ask for mean, sd, or precision?)
- Looking up how to do functions like $x^2$ in R
Compare:
There are J schools
$$\sigma_j\ given$$ $$\mu \sim Uniform(-20,20)$$ $$\tau \sim Uniform(0,10)$$ $$\theta_j \sim Normal(Mean=\mu,\ SD=\tau)$$ $$y_j \sim Normal(Mean=\theta_j,\ SD=\sigma_j)$$
To:
schools_model_code = '''
model {
    mu ~ dunif(-20,20)
    tau ~ dunif(0,10)
    for (j in 1:J){
        theta[j] ~ dnorm(mu, 1/pow(tau,2))
    }
    for (j in 1:J){
        y[j] ~ dnorm(theta[j], 1/pow(sigma[j],2))
    }
}
'''
Running the model¶
To run the model, you need to pass in a dictionary of the observed data. Pyjags is pretty good about giving useful error messages, but definitely turn on line numbers in Jupyter!
We run 500 burn-in samples (MCMC needs some unknown number of steps before it produces samples from the target distribution). We then collect 2500 actual samples from each of 4 chains.
observed_vals = {'y': school_data["Estimated Score Increase"],
                 'sigma': school_data["Standard error of the estimate"],
                 'J': school_data.shape[0]}
num_chains = 4
school_model = pyjags.Model(schools_model_code, data=observed_vals, chains=num_chains)
burnin = school_model.sample(500) #warmup/burn-in
samples = school_model.sample(2500) #cf 7500
Checking Convergence¶
MCMC is only guaranteed to work if you run it for infinite time. It can give good samples after finite, or even short time, but it's worth checking whether it looks like it did return good samples.
The first thing to check is whether the sampler got stuck in one place for a bit by looking for flat/thin regions in the trace. Luckily, we have a lot to show you
def plot_trace(samples, varname, entry=0):
    plt.plot()
    sample_array = samples[varname]
    vec_len, num_samples, num_chains = sample_array.shape
    for cur_chain in range(num_chains):
        cur_label = "Chain {}".format(cur_chain)
        plt.plot(range(num_samples), sample_array[entry,:,cur_chain], label=cur_label)
    plt.legend()
    plt.show()
We check the $\mu$ and $\tau$ parameters
plt.xlabel("Iteration")
plt.ylabel("Value of Mu")
plot_trace(samples,'mu')
plt.xlabel("Iteration")
plt.ylabel("Value of Tau")
plot_trace(samples,'tau')
and the 8 different $\theta_j$
for cur_school in range(8):
print("Theta for School {}".format(cur_school))
plt.xlabel("Iteration")
plt.ylabel("Value of Theta_{}".format(cur_school))
plot_trace(samples, 'theta', entry=cur_school)
print("------")
Overall, we see pretty rough traces: lots of places where a chain shows limited variation.
Fixing these defects is tough. Simply running more samples gives you better odds that each chain spends a comparable amount of time in each 'trap', though that's more 'hope' than 'strategy'. Changing the priors, or even how the model is written, can ease the issues. More advanced samplers (e.g. the Hamiltonian Monte Carlo implemented in pymc3 or Stan) can help, too.
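For example, one common way to rewrite this kind of hierarchical model is the 'non-centered' parameterization, where each $\theta_j$ is built as $\mu$ plus a scaled standard-normal draw. A sketch of what that might look like in JAGS syntax is below (an illustration of 'changing how the model is written', not the model this lab actually runs):
schools_model_noncentered_code = '''
model {
    mu ~ dunif(-20,20)
    tau ~ dunif(0,10)
    for (j in 1:J){
        eta[j] ~ dnorm(0, 1)               # standard normal (precision 1)
        theta[j] <- mu + tau * eta[j]      # theta defined deterministically
        y[j] ~ dnorm(theta[j], 1/pow(sigma[j],2))
    }
}
'''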
There are other measures of whether the traces look reasonable: effective sample size, R-hat, and the Geweke diagnostic.
In real life, you should carefully vet your traces and adjust the model/sampler until they look good. See AM207 for much more on this topic. Here, we're just going to press on as if the traces and samples are legitimate.
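For instance, a rough (unsplit) version of the Gelman-Rubin R-hat can be computed directly from the chains we already have; values near 1 are reassuring. This is a simplified hand-rolled sketch, not a substitute for a proper diagnostics library:
def simple_rhat(samples, varname, entry=0):
    """Rough Gelman-Rubin statistic: compares between-chain and within-chain variance."""
    chains = samples[varname][entry,:,:]     # shape: (num_samples, num_chains)
    n, m = chains.shape
    chain_means = chains.mean(axis=0)
    W = chains.var(axis=0, ddof=1).mean()    # average within-chain variance
    B = n * chain_means.var(ddof=1)          # between-chain variance
    var_hat = (n-1)/n * W + B/n
    return np.sqrt(var_hat / W)

print("R-hat for mu: ", simple_rhat(samples, 'mu'))
print("R-hat for tau:", simple_rhat(samples, 'tau'))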
Exploring The Posterior¶
The samples produced are basically a big data frame where each row is a sample and each column is one of the parameters of the model. This is everything we know about the posterior. Conceptually, from here forward all we do is describe this data frame: means or histograms of the columns, correlations, etc.
(The samples aren't actually stored as a data frame, but conversion code is provided below)
samples
display(samples['theta'].shape)
display(samples['mu'].shape)
The raw samples from pyjags are a dictionary of parameter names -> 3d arrays
- Why are the samples object's arrays shaped like this?
We can equivalently organize the samples as a data frame (one per chain). The code below will handle this for you
def convert_to_dfs(samples, parameter_names, num_chains):
"""Converts a pyjags sampling result to a list of data frames, one per chain"""
big_list = []
for cur_chain_num in range(num_chains):
df_list = []
for k in parameter_names:
v = samples[k]
chain1_data = v[:,:,cur_chain_num]
cur_df = pd.DataFrame(chain1_data.T)
if cur_df.shape[1]==1:
cur_df = cur_df.rename({0:k}, axis=1)
else:
cur_df = cur_df.add_prefix(k)
df_list.append(cur_df)
chain1_samples_df = pd.concat(df_list, axis=1)
big_list.append(chain1_samples_df)
return big_list
chain_df_list = convert_to_dfs(samples,["J","mu","tau","theta","sigma"],num_chains)
chain_df_list[0].head(15)
Learn About the Parameters¶
Once we have the posterior dataset, we can analyze it like we'd analyze any other dataset (with the warning that the rows are correlated, not IID)
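For example, a posterior mean and a central 95% interval for $\mu$ are just summaries of one column (here using chain 0; a quick sketch):
mu_samples = chain_df_list[0]['mu']
print("Posterior mean of mu:", np.round(mu_samples.mean(), 2))
print("Central 95% interval:", np.round(np.percentile(mu_samples, [2.5, 97.5]), 2))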
for cur_chain in range(num_chains):
    plt.hist(chain_df_list[cur_chain]['mu'], bins=100, histtype='step', density=True, label="Chain {}".format(cur_chain));
plt.xlabel("Value of mu")
plt.ylabel("Approx. Posterior Probability")
plt.legend()
plt.show()
Above, we see that (assuming this model+priors and factoring in all data) $\mu$ seems to range from about -2.5 to about 20, and its most likely value is between 5 and 10.
Example:
- What's the probability that $\mu$ is above 15?
- What does this mean in context?
Answer:
count_above_15 = np.sum(chain_df_list[0]['mu'] > 15)
total = len(chain_df_list[0]['mu'])
count_above_15/total
(Using just chain 0), there's roughly a 2-8% chance (it changes from run to run) that $\mu$ is above 15. In context, this means there's very little chance the program, on average, increases student scores by 15 or more points. Practically, we might decide that the program isn't worth it; maybe we feel that only increases of 15 or more SAT points are worth paying for.
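A quick sketch of how much that estimate moves from chain to chain (one estimate per chain, rather than a pooled answer):
for cur_chain in range(num_chains):
    mu_chain = chain_df_list[cur_chain]['mu']
    print("Chain {}: P(mu > 15) is about {:.3f}".format(cur_chain, np.mean(mu_chain > 15)))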
- Is it more appropriate to use one chain, or to combine all four?
Repeating that plot for $\tau$
for cur_chain in range(num_chains):
    plt.hist(chain_df_list[cur_chain]['tau'], bins=100, histtype='step', density=True, label="Chain {}".format(cur_chain));
plt.xlabel("Value of tau")
plt.ylabel("Approx. Posterior Probability")
plt.legend()
plt.show()
- What can you conclude from the plot above?
Answer:
Exploring effect in each school¶
Investigating the $\theta_j$ is a touch more complex because there is one per school. Overall, it looks like each school's theta is about the same: the model is suggesting we pool all the schools' data together. We can also see some issues from the sampler manifesting; different chains get stuck and create bumps in different places.
fig, ax = plt.subplots(1, num_chains, figsize=(20,5), sharex=True, sharey=True)
for cur_chain in range(num_chains):
    ax[cur_chain].set_xlabel("Thetas from Chain {}".format(cur_chain))
    ax[cur_chain].set_ylabel("Approx. Posterior Probability")
    for i in range(8):
        cur_name = 'theta'+str(i)
        ax[cur_chain].hist(chain_df_list[cur_chain][cur_name],
                           bins=100,
                           histtype='step',
                           label="School {}".format(school_data["School"][i]),
                           density=True
                           )
plt.legend()
plt.show()
- Interrogate the results
- Summarize the effect of school A's program: how many SAT points do students seem to gain on average?
- What's the probability that the program, on average, actually lowers scores?
- What's the probability that a particular school would end up with a program that lowers student scores?
- Can you make the MCMC converge better?
- Better priors?
- Longer sampling run?
Answers:
- Overall, what does a Bayesian analysis buy you?
- What is the price of entry?
Harry Potter and the Latent Dirichlet Allocation¶
Before we do anything else, let's talk about the Dirichlet distribution.
The Dirichlet distribution takes in N parameters and spits out a probability vector of length N. Density plots show which probability vectors are likely for different parameter settings, but it's easiest just to see for yourself:
length = 3
for rep in range(5):
    vec = np.random.dirichlet([.1]*length)
    print(np.round(vec,3))
(above) Values less than 1 make a few entries of the output vector large, and most entries small
for rep in range(5):
    vec = np.random.dirichlet([1]*length)
    print(np.round(vec,3))
(above) Values of 1 make all possible probability vectors equally likely, in some sense
- What happens when the inputs to Dirichlet are all large (above 1)?
- What happens if you make one entry substantially bigger than the others?
Answer:
The LDA Model (as code)¶
Prof Glickman did a good job in lecture covering LDA from lots of different viewpoints. The one thing we haven't seen is actual code to produce a document via the LDA framework. I honestly think this is the clearest way to tell LDA's story.
np.random.seed(7)
# scalar givens
num_docs = 5
num_topics = 3
vocab = ["yes", "no", "harry", "hermione", "ron", "dumbledore"]
vocab_len = len(vocab)
# vector givens: alphas and betas are given (or found when we solve the model)
alphas = .8*np.ones(num_topics)
betas = .3*np.ones(vocab_len)
# each document has a probability of talking about each topic
thetas = np.zeros((num_docs, num_topics))
for cur_doc in range(num_docs):
    thetas[cur_doc,:] = np.random.dirichlet(alphas)
# each topic has a probability of talking about each word
phis = np.zeros((num_topics, vocab_len))
for cur_topic in range(num_topics):
    phis[cur_topic,:] = np.random.dirichlet(betas)
##
# write document 1 for 20 words, as an example
##
cur_doc = 1
doc_words = []
# get the document's probability of talking about each topic
topic_probs = thetas[cur_doc,:]
# for each word in the document's length:
for cur_word in range(20):
    # Using the document's topic probabilities, randomly decide the topic this word belongs to
    cur_topic_vec = np.random.multinomial(1, topic_probs)
    cur_topic_int = np.argmax(cur_topic_vec)
    # Using the topic's word probabilities, randomly decide which word will appear
    word_probs = phis[cur_topic_int,:]    # use this word's topic, not a leftover loop variable
    cur_word_vec = np.random.multinomial(1, word_probs)
    cur_word_int = np.argmax(cur_word_vec)
    # store the word
    doc_words.append(vocab[cur_word_int])
print("Document 1's Topics")
print(thetas[cur_doc,:])
print("Document 1's Words")
print(doc_words)
- Interpret the $\theta$ and $\phi$ matrices, below
- Why do a few particular words show up over and over in document 1?
display(np.round(thetas,2))
display(np.round(phis,2))
Answer:
Answer:
The LDA model (as equations)¶
There are T topics, D documents, V words in the corpus
$$\alpha_t,\ given \text{ for t=1,2,...T}$$ $$\beta_w,\ given \text{ for w=1,2,...V}$$ $$ \theta_{d,:} \sim Dirichlet\left(\alpha_1,\alpha_2, ...,\alpha_T\right) \text{ for d=1,2,...D}$$ $$ \phi_{t,:} \sim Dirichlet\left(\beta_1,\beta_2, ...,\beta_V\right) \text{ for t=1,2,...T}$$
$$z_{d,i} \sim Multinomial\left(\theta_{d,1},\theta_{d,2},...\theta_{d,T}\right) \text{ for each d,i}$$ $$w_{d,i} \sim Multinomial\left(\phi_{z_{d,i},1},\phi_{z_{d,i},2},...\phi_{z_{d,i},V}\right) \text{ for each d,i}$$
$w_{d,i}$ is the $i$th word in the $d$th document
$z_{d,i}$ is the topic of the $i$th word in the $d$th document
$\phi_{t,w}$ is the probability of word $w$ appearing when the topic is $t$
$\theta_{d,t}$ is the probability of document $d$ talking about topic $t$
The $\beta_w$ determine how likely overall each word is, and how many words typically occur in the same topic
The $\alpha_t$ determine how likely overall each topic is, and how many topics typically occur in the same document
Fitting LDA¶
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from gensim.corpora import Dictionary
from gensim.models.ldamodel import LdaModel
from gensim.models import CoherenceModel
Below is the code used to clean the Harry Potter books. Because we're not about to host the full text on a public GitHub, these cells will produce errors. If you see someone ask about it on Piazza, please cheekily refer them to this sentence.
First, we define what counts as a word for us. The regular expression below translates to "any run of letters, digits, underscores, and apostrophes". Note that we'll count "Ron" and "Ron's" as different words, and "fast-acting" as two words.
tokenizer = RegexpTokenizer(r'[\w\']+') # \w means any letter, digit, or underscore.
#Overall, "words" in the corpus are any run of letters, digits, underscores, and apostrophes. No hyphens or other fanciness.
Often, very common words are dropped from the text before extracting topics from it. NLTK provides a list of common words in different languages, which we augment with Potter-specific words.
stop_words = set(stopwords.words('english'))
stop_words.update(['harry','hermione','ron']) #add stop words for three main characters
stop_words.update(['said','got','get','would','could']) #empirically find these words very common
You can see the cleaning process below. For each book, we split it into chapters by hunting for the phrase "Chapter" or "Epilogue", lowercase the chapter, split it into a list of individual words, and purge the common words.
hp_text_array = [0]*7
for booknum in range(7):
    print(booknum)
    with open("data/harrypotter_book"+str(booknum+1)+".txt", "r", encoding="UTF-8") as book:
        book_text = book.read()
    chapter_text_list = re.split(r'Chapter\s?\d*|Epilogue', book_text)
    hp_text_array[booknum] = chapter_text_list
    for cur_chapter_id,cur_chapter_text in enumerate(chapter_text_list):
        #make everything lower case
        cur_chapter_text = cur_chapter_text.lower()
        #tokenize
        chapter_word_list = tokenizer.tokenize(cur_chapter_text)
        #remove stop words
        purged_word_list = [cur_word for cur_word in chapter_word_list if cur_word not in stop_words]
        #store: [book][chapter]->list of words
        hp_text_array[booknum][cur_chapter_id] = purged_word_list
import pickle
with open("data/HP_words.pkl", "wb") as outfile:
    pickle.dump(hp_text_array, outfile)
This is the cell that will load the processed data so you can continue with the lab. An example of the structure we're working with is below.
with open("data/HP_words.pkl", "rb") as infile:
    hp_text_array = pickle.load(infile)
hp_text_array[0][11] #all (non-trivial) words in book 1, chapter 11 (yes 11- there's a preamble)
Exploring¶
Let's see if books 1 and 7 differ in their most common words.
Python's Counter object is really good at, well, counting.
from collections import Counter
counts = Counter(hp_text_array[0][11]) #count which words are in Book 1, Chapter 11
counts.most_common(15)
You can add counter objects together. Let's check out the top words in Book 1 versus Book 7. Does Book 7 seem darker?
book1_counter = Counter()
for cur_chapter_words in hp_text_array[0]:
    chapter_word_count = Counter(cur_chapter_words)
    book1_counter += chapter_word_count
book1_counter.most_common(10)
book7_counter = Counter()
for cur_chapter_words in hp_text_array[6]:
    chapter_word_count = Counter(cur_chapter_words)
    book7_counter += chapter_word_count
book7_counter.most_common(10)
Hmmm, there's a lot of overlap ("back","know","like","looked"), but book 1 has a lot more Hagrid, and book 7 has a lot more Voldemort.
- Find the 20 most common words across all the books
Answer:
Fitting LDA¶
LDA wants to operate on a list of all documents. (Here, we're treating each chapter as its own document). We need to restructure our data
list_of_docs = []
for book_id in range(7):
    for chapter in hp_text_array[book_id]:
        list_of_docs.append(chapter)
We build a gensim Dictionary on all the documents; this tracks and numbers all words used in any of the documents.
masterdictionary = Dictionary(list_of_docs)
We use the doc2bow method to convert each document to a numerical format
mastercorpus = [masterdictionary.doc2bow(doc) for doc in list_of_docs]
mastercorpus[11][:20] #20 words and their counts from book 1, chapter 11
Invoking the dictionary, we can translate to see that book 1, chapter 11 used the word 'hogwarts' once and the word 'one' eleven times.
masterdictionary[59],masterdictionary[82]
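The same trick translates a whole bag-of-words entry back into readable (word, count) pairs; a quick sketch:
# translate the first few (word id, count) pairs of book 1, chapter 11 back to words
[(masterdictionary[word_id], count) for word_id, count in mastercorpus[11][:10]]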
Now, we're ready to actually fit a model
seven_book_model = LdaModel(mastercorpus, num_topics=7, id2word = masterdictionary, passes=10)
We can investigate any particular topic
seven_book_model.show_topic(2, topn=20)
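So far we've looked at topics from the word side. We can also ask which topics a particular chapter mixes, using gensim's get_document_topics; a quick sketch, reusing the same mastercorpus indexing as above:
seven_book_model.get_document_topics(mastercorpus[11]) # topic mixture for book 1, chapter 11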
It's nicer to plot the heavy-hitting words in each topic, though
top_words = [[word for word,_ in seven_book_model.show_topic(topicno, topn=50)] for topicno in range(seven_book_model.num_topics)]
top_betas = [[beta for _,beta in seven_book_model.show_topic(topicno, topn=50)] for topicno in range(seven_book_model.num_topics)]
top_words[0][:5]
top_betas[0][:5]
gs = gridspec.GridSpec(3,3)
gs.update(wspace=0.5, hspace=0.5)
plt.figure(figsize=(11,8.5))
for i in range(7):
    #new subplot
    ax = plt.subplot(gs[i])
    plt.barh(range(5), top_betas[i][:5], align='center',color='blue', ecolor='black')
    ax.invert_yaxis()
    ax.set_yticks(range(5))
    ax.set_yticklabels(top_words[i][:5])
    plt.title("Topic "+str(i))
#finding the optimal number of topics via the coherence measure u_mass
coherence_vals = []
for ntop in range(1,12):
    mod = LdaModel(mastercorpus, num_topics = ntop, id2word = masterdictionary, passes=10)
    cmod = CoherenceModel(model=mod, corpus=mastercorpus, dictionary=masterdictionary, coherence='u_mass')
    cval = cmod.get_coherence()
    print(ntop,cval)
    coherence_vals.append(cval)
_ = plt.figure(figsize=(11,8.5))
_ = plt.plot(range(1,12),coherence_vals)
_ = plt.xlabel("Number of Topics")
_ = plt.ylabel("Coherence Score")
- Overall, is LDA more like KNN or K-means?