

Title :

Unigram LM

Description :

Text data is unlike the typical "design matrix" of i.i.d. data that we've often worked with. Here, you'll gain practice working with actual words, as you'll parse, count, and calculate a probability.

An individual unigram's likelihood (unsmoothed) is defined as:

$$L(w) = \frac{n_w(D_t)}{n_o(D_t)}$$

where the numerator $n_w(D_t)$ is the number of times word $w$ appears in the training corpus $D_t$, and the denominator $n_o(D_t)$ is the total number of tokens in $D_t$.

For this exercise, we will define the smoothed unigram's likelihood as:

$$L(w) = \frac{n_w(D_t) + \alpha}{n_o(D_t) + \alpha|V|}$$

where $\alpha$ is a specified real-valued number (it doesn't have to be an integer), and $|V|$ is the cardinality of the lexicon (i.e., the number of distinct word types in the vocabulary).

The likelihood of a new sequence $H$ is simply the product of the likelihoods of its tokens:

$$L(H) = \prod_{w \in H} L(w)$$
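
As a small worked example (with toy numbers made up for illustration, not taken from the actual training file): suppose the training corpus contains 10 tokens, the word "data" occurs twice, there are $|V| = 6$ distinct word types, and $\alpha = 0.5$. Then the smoothed likelihood of "data" is $(2 + 0.5)\,/\,(10 + 0.5 \cdot 6) = 2.5/13 \approx 0.19$, and the likelihood of a sequence is just the product of such per-token values.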

HINTS :

Depending on your approach, these functions could help you:

  • re.sub() (regular expression)
  • .split()
  • .lower()
  • .strip()
  • .replace()
  • .sum()
  • defaultdict data structure
  • Counter data structure

**REMINDER**: After running every cell, be sure to auto-grade your work by clicking 'Mark' in the lower-right corner. Otherwise, no credit will be given.

In [1]:
# imports some libraries you might find useful
import re
import math
from collections import Counter
from collections import defaultdict
In [2]:
# necessary for our experiments
training_file = "ex1_train.txt"
dev_file = "ex1_dev.txt"
punctuation = ['.', '!', '?']

sample1 = "I love data science!"
sample2 = "I love NLP!"

Write a function parse_string() which takes as input a string (e.g., the contents of a file). It should return this text as a list of tokens. Specifically, the tokens should:

  • be lowercased
  • be split on whitespace and on any character present in the list of punctuation (each punctuation character should remain as its own token).
  • include no trailing or preceding whitespace (no returned token should be whitespace-only or empty)

For example, if the input is " I LOVE daTa!!", it should return ["i", "love", "data", "!", "!"]
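
A rough sketch of one possible approach (illustration only, not necessarily the intended solution; the helper name is made up, and your real implementation belongs in the cell below): pad each punctuation mark with spaces so it survives as its own token, lowercase the text, and split on whitespace.

def parse_string_sketch(text, punctuation=['.', '!', '?']):
    # Surround every punctuation mark with spaces so it becomes its own token.
    for p in punctuation:
        text = text.replace(p, f" {p} ")
    # Lowercase, then split on runs of whitespace; .split() drops empty strings.
    return text.lower().split()

parse_string_sketch(" I LOVE daTa!!")  # ['i', 'love', 'data', '!', '!']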

In [3]:
### edTest(test_a) ###
def parse_string(text):
    
    # YOUR CODE STARTS HERE

    # YOUR CODE ENDS HERE
    return text

# DO NOT EDIT THE LINES BELOW
text = open(training_file).read()
tokens = parse_string(text)

Write a function count_tokens() that takes a list of tokens and simply outputs a dictionary-style count of the items. For example, if the input is ['run', 'forrest', 'run'], it should return a dict, defaultdict, or Counter with 2 keys: {'run':2, 'forrest':1}
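
One possible sketch (with a hypothetical helper name; your version goes in the cell below): Counter, imported in the first cell, already does exactly this kind of counting.

from collections import Counter

def count_tokens_sketch(tokens):
    # Counter builds a dict-like mapping from token -> frequency.
    return Counter(tokens)

count_tokens_sketch(['run', 'forrest', 'run'])  # Counter({'run': 2, 'forrest': 1})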

In [4]:
### edTest(test_b) ###
def count_tokens(tokens):
    
    # YOUR CODE STARTS HERE

    # YOUR CODE ENDS HERE
    return word_counts

# DO NOT EDIT THIS LINE
word_counts = count_tokens(tokens)

Write a function calculate_likelihood() that takes tokens (a list of strings) and word_counts (dictionary-type) and returns the likelihood of the sequence of tokens. You will run your function with the tokens parsed from the sample1 string.
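
One way this could look (a sketch under the unsmoothed definition above, with a made-up helper name): divide each token's count by the total number of training tokens and multiply the per-token likelihoods together. Note that a token never seen in training drives the product to 0.

def calculate_likelihood_sketch(tokens, word_counts):
    total_tokens = sum(word_counts.values())  # n_o(D_t): total tokens in training
    likelihood = 1.0
    for w in tokens:
        likelihood *= word_counts.get(w, 0) / total_tokens  # unseen words contribute 0
    return likelihood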

In [5]:
### edTest(test_c) ###
def calculate_likelihood(tokens, word_counts):
    total_likelihood = 1
    
    # YOUR CODE STARTS HERE

    # YOUR CODE ENDS HERE
        
    return total_likelihood

# DO NOT EDIT THE LINES BELOW
sample1_tokens = parse_string(sample1)
likelihood = calculate_likelihood(sample1_tokens, word_counts)

Write a function calculate_smoothed_likelihood() that is the same as the previous function but includes a smoothing parameter alpha. Again, you should return the likelihood of the sequence of tokens.
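
A minimal sketch of the smoothed version (again hypothetical, simply mirroring the formula above): add alpha to each count and alpha·|V| to the denominator, where |V| is taken to be the number of distinct word types in word_counts.

def calculate_smoothed_likelihood_sketch(alpha, tokens, word_counts):
    total_tokens = sum(word_counts.values())  # n_o(D_t)
    vocab_size = len(word_counts)             # |V|: number of distinct word types
    likelihood = 1.0
    for w in tokens:
        likelihood *= (word_counts.get(w, 0) + alpha) / (total_tokens + alpha * vocab_size)
    return likelihood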

In [6]:
### edTest(test_d) ###

def calculate_smoothed_likelihood(alpha, tokens, word_counts):

    total_likelihood = 1

    # YOUR CODE STARTS HERE
        
    # YOUR CODE ENDS HERE
    return total_likelihood

# DO NOT EDIT THE LINES BELOW
sample1_tokens = parse_string(sample1)
sample1_likelihood = calculate_smoothed_likelihood(0.5, sample1_tokens, word_counts)

sample2_tokens = parse_string(sample2)
sample2_likelihood = calculate_smoothed_likelihood(0.5, sample2_tokens, word_counts)