Title:

Unigram LM

Description:

Text data is unlike the typical "design matrix" of i.i.d. data that we've often worked with. Here, you'll gain practice working with actual words, as you parse, count, and calculate probabilities.

An individual unigram's likelihood (unsmoothed) is defined as:

$$L\left(w\right)=\frac{n_w\left(D_t\right)}{n_o\left(D_t\right)}$$

where the numerator $n_w\left(D_t\right)$ is the number of times word $w$ appears in the training corpus $D_t$, and the denominator $n_o\left(D_t\right)$ is the total number of tokens in $D_t$.
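
For instance, with made-up numbers: if the word "data" occurred 3 times in a training corpus of 100 total tokens, then

$$L\left(\text{data}\right)=\frac{3}{100}=0.03$$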

For this exercise, we will define the smoothed unigram's likelihood as:

$$L\left(w\right)=\frac{n_w\left(D_t\right)+\alpha}{n_o\left(D_t\right)+\alpha\left|V\right|}$$

where $\alpha$ is a specified real-valued number (it need not be an integer), and $|V|$ is the cardinality of the lexicon (i.e., the number of distinct word types in the vocabulary).
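
Continuing the made-up example with $\alpha=0.5$ and a vocabulary of $|V|=50$ distinct word types:

$$L\left(\text{data}\right)=\frac{3+0.5}{100+0.5\cdot50}=\frac{3.5}{125}=0.028$$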

The likelihood of a new sequence $H$ is simply the product of the likelihoods of its tokens:

$$L\left(H\right)=\prod_{w\in H}L\left(w\right)$$
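
To make these definitions concrete, here is a minimal Python sketch on an invented toy corpus (the corpus, query sequence, and alpha below are all made up for illustration):

from collections import Counter

toy_corpus = ["i", "love", "data", "i", "love", "nlp"]
counts = Counter(toy_corpus)      # {'i': 2, 'love': 2, 'data': 1, 'nlp': 1}
total = sum(counts.values())      # n_o(D_t) = 6 total tokens
vocab_size = len(counts)          # |V| = 4 distinct word types
alpha = 0.5

seq = ["i", "love", "nlp"]
unsmoothed = 1.0
smoothed = 1.0
for w in seq:
    unsmoothed *= counts[w] / total                                  # (2/6)*(2/6)*(1/6)
    smoothed *= (counts[w] + alpha) / (total + alpha * vocab_size)   # (2.5/8)*(2.5/8)*(1.5/8)

print(unsmoothed)  # ~0.0185
print(smoothed)    # ~0.0183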

HINTS:

Depending on your approach, these functions could help you:

  • re.sub() (regular expression)
  • .split()
  • .lower()
  • .strip()
  • .replace()
  • sum() (built-in)
  • defaultdict data structure
  • Counter data structure

**REMINDER**: After running every cell, be sure to auto-grade your work by clicking 'Mark' in the lower-right corner. Otherwise, no credit will be given.

In [1]:
# imports some libraries you might find useful
import re
import math
from collections import Counter
from collections import defaultdict
In [2]:
# necessary for our experiments
training_file = "ex1_train.txt"
dev_file = "ex1_dev.txt"
punctuation = ['.', '!', '?']

sample1 = "I love data science!"
sample2 = "I love NLP!"

Write a function parse_string() which takes as input a string (e.g., the contents of a file). It should return this text as a list of tokens. Specifically, the tokens should:

  • be lowercased
  • be split on whitespace and on any character present in the punctuation list (each punctuation mark becomes its own token)
  • include no trailing or preceding whitespace (no returned token should be empty or consist only of whitespace)

For example, if the input is " I LOVE daTa!!", it should return ["i", "love", "data", "!", "!"].
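
One possible approach (a sketch only, not necessarily the intended solution) is to lowercase the text, pad each punctuation mark with spaces, and let split() discard all extra whitespace:

def parse_sketch(text):
    text = text.lower()
    for p in punctuation:                 # ['.', '!', '?'] from the cell above
        text = text.replace(p, f" {p} ")  # "data!!" -> "data !  ! "
    return text.split()                   # split() drops empty/whitespace pieces

parse_sketch(" I LOVE daTa!!")            # ['i', 'love', 'data', '!', '!']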

In [3]:
### edTest(test_a) ###
def parse_string(text):
    
    # YOUR CODE STARTS HERE

    # YOUR CODE ENDS HERE
    return text

# DO NOT EDIT THE LINES BELOW
text = open(training_file).read()
tokens = parse_string(text)

Write a function count_tokens() that takes a list of tokens and simply outputs a dictionary-style count of the items. For example, if the input is ['run', 'forrest', 'run'], it should return a dict, defaultdict, or Counter with 2 keys: {'run':2, 'forrest':1}
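
For reference, the hinted Counter class produces exactly this kind of mapping:

from collections import Counter
Counter(['run', 'forrest', 'run'])   # Counter({'run': 2, 'forrest': 1})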

In [4]:
### edTest(test_b) ###
def count_tokens(tokens):
    
    # YOUR CODE STARTS HERE

    # YOUR CODE ENDS HERE
    return word_counts

# DO NOT EDIT THIS LINE
word_counts = count_tokens(tokens)

Write a function calculate_likelihood() that takes tokens (a list of strings) and word_counts (dictionary-type) and returns the likelihood of the sequence of tokens. You will run your function with the tokens parsed from the sample1 string.
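
One way to structure the computation, as a rough sketch with placeholder names (note that .get(w, 0) makes any word unseen in training contribute a probability of 0):

def likelihood_sketch(tokens, word_counts):
    total = sum(word_counts.values())    # n_o(D_t): total training tokens
    likelihood = 1.0
    for w in tokens:
        likelihood *= word_counts.get(w, 0) / total
    return likelihood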

In [5]:
### edTest(test_c) ###
def calculate_likelihood(tokens, word_counts):
    total_likelihood = 1
    
    # YOUR CODE STARTS HERE

    # YOUR CODE ENDS HERE
        
    return total_likelihood

# DO NOT EDIT THE LINES BELOW
sample1_tokens = parse_string(sample1)
likelihood = calculate_likelihood(sample1_tokens, word_counts)

Write a function calculate_smoothed_likelihood() that is the same as the previous function but includes a smoothing parameter alpha. Again, you should return the likelihood of the sequence of tokens.
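
A sketch of how smoothing changes the per-token term (placeholder names again; using len(word_counts) for |V| assumes the vocabulary is the set of distinct training tokens):

def smoothed_sketch(alpha, tokens, word_counts):
    total = sum(word_counts.values())    # n_o(D_t)
    vocab_size = len(word_counts)        # |V|: distinct word types
    likelihood = 1.0
    for w in tokens:
        likelihood *= (word_counts.get(w, 0) + alpha) / (total + alpha * vocab_size)
    return likelihood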

In [6]:
### edTest(test_d) ###

def calculate_smoothed_likelihood(alpha, tokens, word_counts):

    total_likelihood = 1

    # YOUR CODE STARTS HERE
        
    # YOUR CODE ENDS HERE
    return total_likelihood

# DO NOT EDIT THE LINES BELOW
sample1_tokens = parse_string(sample1)
sample1_likelihood = calculate_smoothed_likelihood(0.5, sample1_tokens, word_counts)

sample2_tokens = parse_string(sample2)
sample2_likelihood = calculate_smoothed_likelihood(0.5, sample2_tokens, word_counts)