

Title :

Unigram LM

Description :

Text data is unlike the typical "design matrix" of i.i.d. data that we've often worked with. Here, you'll gain practice working with actual words, as you'll parse, count, and calculate a probability.

An individual unigram's likelihood (unsmoothed) is defined as:

$$L(w) = \frac{n_w(D_t)}{n_o(D_t)}$$

where the numerator $n_w(D_t)$ is the number of times word $w$ appears in the training corpus $D_t$, and the denominator $n_o(D_t)$ is the total number of tokens in $D_t$.

For this exercise, we will define the smoothed unigram's likelihood as:

$$L(w) = \frac{n_w(D_t) + \alpha}{n_o(D_t) + \alpha|V|}$$

where $\alpha$ is a specified real-valued number (it doesn't have to be an integer), and $|V|$ is the cardinality of the lexicon (i.e., the number of distinct word types in the vocabulary).

The likelihood of a new sequence $H$ is simply the product of the likelihoods of its tokens:

$$L(H) = \prod_{w \in H} L(w)$$
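
As a small worked example (with toy numbers made up for illustration, not taken from the actual training file): suppose the training corpus contains 10 tokens, the word "data" occurs twice, there are $|V| = 6$ distinct word types, and $\alpha = 0.5$. Then the smoothed likelihood of "data" is $(2 + 0.5)\,/\,(10 + 0.5 \cdot 6) = 2.5/13 \approx 0.19$, and the likelihood of a sequence is just the product of such per-token values.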

HINTS :

Depending on your approach, these functions could help you:

  • re.sub() (regular expression)
  • .split()
  • .lower()
  • .strip()
  • .replace()
  • .sum()
  • defaultdict data structure
  • Counter data structure

**REMINDER**: After running every cell, be sure to auto-grade your work by clicking 'Mark' in the lower-right corner. Otherwise, no credit will be given.

In [1]:
# imports some libraries you might find useful
import re
import math
from collections import Counter
from collections import defaultdict
In [2]:
# necessary for our experiments
training_file = "ex1_train.txt"
dev_file = "ex1_dev.txt"
punctuation = ['.', '!', '?']

sample1 = "I love data science!"
sample2 = "I love NLP!"

Write a function parse_string() which takes as input a string (e.g., the contents of a file). It should return this text as a list of tokens. Specifically, the tokens should:

  • be lowercased
  • be split on whitespace and on any character present in the list of punctuation (each punctuation character should remain as its own token).
  • include no trailing or preceding whitespace (no returned token should be whitespace-only or empty)

For example, if the input is " I LOVE daTa!!", it should return ["i", "love", "data", "!", "!"]
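
A rough sketch of one possible approach (illustration only, not necessarily the intended solution; the helper name is made up, and your real implementation belongs in the cell below): pad each punctuation mark with spaces so it survives as its own token, lowercase the text, and split on whitespace.

def parse_string_sketch(text, punctuation=['.', '!', '?']):
    # Surround every punctuation mark with spaces so it becomes its own token.
    for p in punctuation:
        text = text.replace(p, f" {p} ")
    # Lowercase, then split on runs of whitespace; .split() drops empty strings.
    return text.lower().split()

parse_string_sketch(" I LOVE daTa!!")  # ['i', 'love', 'data', '!', '!']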

In [3]:
### edTest(test_a) ###
def parse_string(text):
    
    # YOUR CODE STARTS HERE

    # YOUR CODE ENDS HERE
    return text

# DO NOT EDIT THE LINES BELOW
text = open(training_file).read()
tokens = parse_string(text)

Write a function count_tokens() that takes a list of tokens and simply outputs a dictionary-style count of the items. For example, if the input is ['run', 'forrest', 'run'], it should return a dict, defaultdict, or Counter with 2 keys: {'run':2, 'forrest':1}
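
One possible sketch (with a hypothetical helper name; your version goes in the cell below): Counter, imported in the first cell, already does exactly this kind of counting.

from collections import Counter

def count_tokens_sketch(tokens):
    # Counter builds a dict-like mapping from token -> frequency.
    return Counter(tokens)

count_tokens_sketch(['run', 'forrest', 'run'])  # Counter({'run': 2, 'forrest': 1})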

In [4]:
### edTest(test_b) ###
def count_tokens(tokens):
    
    # YOUR CODE STARTS HERE

    # YOUR CODE ENDS HERE
    return word_counts

# DO NOT EDIT THIS LINE
word_counts = count_tokens(tokens)

Write a function calculate_likelihood() that takes tokens (a list of strings) and word_counts (dictionary-type) and returns the likelihood of the sequence of tokens. You will run your function with the tokens parsed from the sample1 string.
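
One way this could look (a sketch under the unsmoothed definition above, with a made-up helper name): divide each token's count by the total number of training tokens and multiply the per-token likelihoods together. Note that a token never seen in training drives the product to 0.

def calculate_likelihood_sketch(tokens, word_counts):
    total_tokens = sum(word_counts.values())  # n_o(D_t): total tokens in training
    likelihood = 1.0
    for w in tokens:
        likelihood *= word_counts.get(w, 0) / total_tokens  # unseen words contribute 0
    return likelihood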

In [5]:
### edTest(test_c) ###
def calculate_likelihood(tokens, word_counts):
    total_likelihood = 1
    
    # YOUR CODE STARTS HERE

    # YOUR CODE ENDS HERE
        
    return total_likelihood

# DO NOT EDIT THE LINES BELOW
sample1_tokens = parse_string(sample1)
likelihood = calculate_likelihood(sample1_tokens, word_counts)

Write a function calculate_smoothed_likelihood() that is the same as the previous function but includes a smoothing parameter alpha. Again, you should return the likelihood of the sequence of tokens.
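
A minimal sketch of the smoothed version (again hypothetical, simply mirroring the formula above): add alpha to each count and alpha·|V| to the denominator, where |V| is taken to be the number of distinct word types in word_counts.

def calculate_smoothed_likelihood_sketch(alpha, tokens, word_counts):
    total_tokens = sum(word_counts.values())  # n_o(D_t)
    vocab_size = len(word_counts)             # |V|: number of distinct word types
    likelihood = 1.0
    for w in tokens:
        likelihood *= (word_counts.get(w, 0) + alpha) / (total_tokens + alpha * vocab_size)
    return likelihood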

In [6]:
### edTest(test_d) ###

def calculate_smoothed_likelihood(alpha, tokens, word_counts):

    total_likelihood = 1

    # YOUR CODE STARTS HERE
        
    # YOUR CODE ENDS HERE
    return total_likelihood

# DO NOT EDIT THE LINES BELOW
sample1_tokens = parse_string(sample1)
sample1_likelihood = calculate_smoothed_likelihood(0.5, sample1_tokens, word_counts)

sample2_tokens = parse_string(sample2)
sample2_likelihood = calculate_smoothed_likelihood(0.5, sample2_tokens, word_counts)