Key Word(s): Language Modeling, Syntax, Semantics, Discourse, Morphology
Title: Unigram LM
Description:
Text data is unlike the typical "design matrix" of i.i.d. observations we've often worked with. Here, you'll gain practice working with actual words as you parse, count, and calculate probabilities.
An individual unigram's likelihood (unsmoothed) is defined as:
$$L\left(w\right)=\frac{n_w\left(D_t\right)}{n_o\left(D_t\right)}$$
where the numerator $n_w\left(D_t\right)$ is the number of times word $w$ appears in the training corpus $D_t$, and the denominator $n_o\left(D_t\right)$ is the total number of tokens in $D_t$.
For this exercise, we will define the smoothed unigram's likelihood as:
$$L\left(w\right)=\frac{n_w\left(D_t\right)+\alpha}{n_o\left(D_t\right)+\alpha\left|V\right|}$$
where $\alpha$ is a specified real-valued number (it need not be an integer), and $|V|$ is the cardinality of the lexicon (i.e., the number of distinct word types in the vocabulary).
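For example (with illustrative, made-up counts): if the word "data" appears twice in a training corpus of 10 tokens drawn from a vocabulary of 6 distinct types, then with $\alpha = 0.5$:
$$L\left(\text{data}\right)=\frac{2+0.5}{10+0.5\cdot6}=\frac{2.5}{13}\approx0.19$$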
The likelihood of a new sequence $H$ is simply the product of the likelihoods of its tokens:
$$L\left(H\right)=\prod_{w\in H}L\left(w\right)$$
HINTS:
Depending on your approach, these functions could help you:
- re.sub()
- (regular expression).split()
- .lower()
- .strip()
- .replace()
- sum()
- the defaultdict data structure
- the Counter data structure
**REMINDER**: After running every cell, be sure to auto-grade your work by clicking 'Mark' in the lower-right corner. Otherwise, no credit will be given.
# imports some libraries you might find useful
import re
import math
from collections import Counter
from collections import defaultdict
# necessary for our experiments
training_file = "ex1_train.txt"
dev_file = "ex1_dev.txt"
punctuation = ['.', '!', '?']
sample1 = "I love data science!"
sample2 = "I love NLP!"
Write a function parse_string() which takes as input a string (e.g., the contents of a file). It should return the text as a list of tokens. Specifically, the tokens should:
- be lowercased
- be separated by whitespace and by any character present in the punctuation list
- include no trailing or preceding whitespace (none of the returned tokens should be whitespace-only or empty)

For example, if the input is " I LOVE daTa!!", it should return ["i", "love", "data", "!", "!"]
### edTest(test_a) ###
def parse_string(text):
    # YOUR CODE STARTS HERE
    # YOUR CODE ENDS HERE
    return tokens
# DO NOT EDIT THE LINES BELOW
text = open(training_file).read()
tokens = parse_string(text)
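If you get stuck, here is one possible sketch (illustrative only; parse_string_sketch is a hypothetical name, and other approaches such as re.split() work just as well). It pads each punctuation mark with spaces so it survives as its own token, then relies on str.split(), which discards empty strings:
# A minimal sketch, assuming the `punctuation` list defined above
def parse_string_sketch(text):
    text = text.lower()
    for p in punctuation:
        text = text.replace(p, f" {p} ")  # isolate each punctuation mark as its own token
    return text.split()  # splitting on whitespace drops empty/whitespace-only tokens

print(parse_string_sketch("  I LOVE daTa!!"))  # ['i', 'love', 'data', '!', '!']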
Write a function count_tokens() that takes a list of tokens and simply outputs a dictionary-style count of the items. For example, if the input is ['run', 'forrest', 'run'], it should return a dict, defaultdict, or Counter with 2 keys: {'run': 2, 'forrest': 1}
### edTest(test_b) ###
def count_tokens(tokens):
    # YOUR CODE STARTS HERE
    # YOUR CODE ENDS HERE
    return word_counts
# DO NOT EDIT THIS LINE
word_counts = count_tokens(tokens)
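For reference, a minimal sketch (the _sketch suffix is just to avoid clobbering your graded function): Counter tallies an iterable directly and satisfies the dictionary-style requirement.
def count_tokens_sketch(tokens):
    # Counter counts each distinct item, e.g. Counter({'run': 2, 'forrest': 1})
    return Counter(tokens)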
Write a function calculate_likelihood() that takes tokens (a list of strings) and word_counts (a dictionary-type object) and returns the likelihood of the sequence of tokens. You will run your function with the tokens parsed from the sample1 string.
### edTest(test_c) ###
def calculate_likelihood(tokens, word_counts):
    total_likelihood = 1
    # YOUR CODE STARTS HERE
    # YOUR CODE ENDS HERE
    return total_likelihood
# DO NOT EDIT THE LINES BELOW
sample1_tokens = parse_string(sample1)
likelihood = calculate_likelihood(sample1_tokens, word_counts)
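One way to structure this, as a sketch: per the unsmoothed formula, the denominator $n_o(D_t)$ is the total number of training tokens, which you can recover by summing the counts. Note that .get(w, 0) gives unseen words a count of 0, which zeroes out the whole product; that is exactly the problem smoothing fixes.
def calculate_likelihood_sketch(tokens, word_counts):
    total = sum(word_counts.values())  # n_o(D_t): total tokens in the training corpus
    total_likelihood = 1.0
    for w in tokens:
        total_likelihood *= word_counts.get(w, 0) / total  # unseen words contribute 0
    return total_likelihood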
Write a function calculate_smoothed_likelihood() that is the same as the previous function but includes a smoothing parameter alpha. Again, you should return the likelihood of the sequence of tokens.
### edTest(test_d) ###
def calculate_smoothed_likelihood(alpha, tokens, word_counts):
    total_likelihood = 1
    # YOUR CODE STARTS HERE
    # YOUR CODE ENDS HERE
    return total_likelihood
# DO NOT EDIT THE LINES BELOW
sample1_tokens = parse_string(sample1)
sample1_likelihood = calculate_smoothed_likelihood(0.5, sample1_tokens, word_counts)
sample2_tokens = parse_string(sample2)
sample2_likelihood = calculate_smoothed_likelihood(0.5, sample2_tokens, word_counts)
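A sketch of the smoothed version, assuming $|V|$ is taken to be the number of distinct word types in word_counts (if your grader defines the vocabulary differently, adjust accordingly):
def calculate_smoothed_likelihood_sketch(alpha, tokens, word_counts):
    total = sum(word_counts.values())  # n_o(D_t): total training tokens
    vocab_size = len(word_counts)      # |V|: distinct word types seen in training
    total_likelihood = 1.0
    for w in tokens:
        total_likelihood *= (word_counts.get(w, 0) + alpha) / (total + alpha * vocab_size)
    return total_likelihood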