Key Word(s): WordNet, ConceptNet



Title¶

GloVe embeddings

Description :¶

In this exercise, you'll get practice loading word embeddings and finding the most similar words to a given query word ('bank'). The file glove_mini.txt contains GloVe embeddings for nearly 3,000 of the most common English words. Each line of the file contains 51 space-separated pieces of data:

word 50-values

e.g., the first two lines of the file:

the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658 -0.0095095 0.011658 0.10204 -0.12792 -0.8443 -0.12181 -0.016801 -0.33279 -0.1552 -0.23131 -0.19181 -1.8823 -0.76746 0.099051 -0.42125 -0.19526 4.0071 -0.18594 -0.52287 -0.31681 0.00059213 0.0074449 0.17778 -0.15897 0.012041 -0.054223 -0.29871 -0.15749 -0.34758 -0.045637 -0.44251 0.18785 0.0027849 -0.18411 -0.11514 -0.78581

of 0.70853 0.57088 -0.4716 0.18048 0.54449 0.72603 0.18157 -0.52393 0.10381 -0.17566 0.078852 -0.36216 -0.11829 -0.83336 0.11917 -0.16605 0.061555 -0.012719 -0.56623 0.013616 0.22851 -0.14396 -0.067549 -0.38157 -0.23698 -1.7037 -0.86692 -0.26704 -0.2589 0.1767 3.8676 -0.1613 -0.13273 -0.68881 0.18444 0.0052464 -0.33874 -0.078956 0.24185 0.36576 -0.34727 0.28483 0.075693 -0.062178 -0.38988 0.22902 -0.21617 -0.22562 -0.093918 -0.80375

We provide the cosine_sim() function.

HINT :¶

One can sort a dict d by values (as opposed to keys) via:

  • sorted(d.items(), key=operator.itemgetter(1))

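For instance, with a toy dict of (hypothetical) similarity scores, the hint's one-liner sorts ascending by value; pass reverse=True to get the most similar words first:

```python
import operator

# A toy dict mapping words to similarity scores (hypothetical values).
scores = {"river": 0.61, "money": 0.74, "tree": 0.12}

# Sort (word, score) pairs by score, ascending.
ascending = sorted(scores.items(), key=operator.itemgetter(1))

# For "most similar first", sort descending instead.
descending = sorted(scores.items(), key=operator.itemgetter(1), reverse=True)

print(descending)  # [('money', 0.74), ('river', 0.61), ('tree', 0.12)]
```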

**REMINDER**: After running every cell, be sure to auto-grade your work by clicking 'Mark' in the lower-right corner. Otherwise, no credit will be given.¶

In [1]:
# imports useful libraries
import math
import operator
In [2]:
# calculates the cosine similarity of the passed-in lists
def cosine_sim(a, b):
    numerator = 0
    denom_a = 0
    denom_b = 0
    for x, y in zip(a, b):
        numerator += x * y
        denom_a += x * x
        denom_b += y * y
    return numerator / (math.sqrt(denom_a) * math.sqrt(denom_b))
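As a quick sanity check (not part of the exercise), cosine similarity is 1.0 for parallel vectors and 0.0 for orthogonal ones:

```python
import math

# Same cosine_sim as in the cell above.
def cosine_sim(a, b):
    numerator = 0
    denom_a = 0
    denom_b = 0
    for x, y in zip(a, b):
        numerator += x * y
        denom_a += x * x
        denom_b += y * y
    return numerator / (math.sqrt(denom_a) * math.sqrt(denom_b))

print(cosine_sim([1, 2, 3], [2, 4, 6]))  # ≈ 1.0 (parallel)
print(cosine_sim([1, 0], [0, 1]))        # 0.0 (orthogonal)
```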

Write the function load_embeddings(), which takes a filename as input and returns the embeddings as a dict. That is, the key should be the word (string), and the value should be a list of floats. For example, {"the": [0.418, 0.24968, -0.41242, ..., -0.78581]} should exist within your dictionary.

In [9]:
### edTest(test_a) ###
def load_embeddings(filename):

    embeddings = {}
    f = open(filename)

    # YOUR CODE STARTS HERE

    # YOUR CODE ENDS HERE
    f.close()
    return embeddings

# DO NOT EDIT THIS LINE
embeddings = load_embeddings("glove_mini.txt")
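One possible approach (a sketch only; you should fill in the cell above yourself) is to split each line on whitespace, take the first token as the word, and convert the rest to floats. The demo below writes a tiny two-line file in the same format as glove_mini.txt rather than using the real data:

```python
import tempfile, os

def load_embeddings_sketch(filename):
    embeddings = {}
    with open(filename) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            # First token is the word; the rest are the embedding values.
            embeddings[parts[0]] = [float(v) for v in parts[1:]]
    return embeddings

# Demo on a tiny file mimicking glove_mini.txt's format (truncated vectors).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("the 0.418 0.24968\nof 0.70853 0.57088\n")
    path = tmp.name

emb = load_embeddings_sketch(path)
print(emb["the"])  # [0.418, 0.24968]
os.remove(path)
```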

Write the function get_most_similar(), which finds the top K most similar words (per cosine similarity of their embeddings) to the passed-in word.

To be clear, the function's inputs are:

  • a word (string), to which all other words will be compared
  • k (int), the number of top words to return.

The output should be a list of strings (the words).

In [4]:
### edTest(test_b) ###
def get_most_similar(word, k):
    
    top_words = []
    
    # YOUR CODE STARTS HERE

    # YOUR CODE ENDS HERE
    return top_words

# DO NOT EDIT THIS LINE
bank_words = get_most_similar('bank', 10)
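One possible shape for this function (a sketch, not the graded solution) scores every other word against the query with cosine_sim, sorts descending as in the hint, and keeps the top k. The embedding table below is a tiny hand-made one with made-up 2-dimensional values, not real GloVe data:

```python
import math, operator

def cosine_sim(a, b):
    num = sum(x * y for x, y in zip(a, b))
    return num / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# Tiny hand-made embedding table (hypothetical values, not real GloVe data).
embeddings = {
    "bank":  [0.9, 0.1],
    "money": [0.8, 0.2],
    "river": [0.1, 0.9],
    "tree":  [0.0, 1.0],
}

def get_most_similar_sketch(word, k):
    # Score every other word against the query word.
    sims = {w: cosine_sim(embeddings[word], v)
            for w, v in embeddings.items() if w != word}
    # Sort by similarity, most similar first, and keep the top k words.
    ranked = sorted(sims.items(), key=operator.itemgetter(1), reverse=True)
    return [w for w, _ in ranked[:k]]

print(get_most_similar_sketch("bank", 2))  # ['money', 'river']
```

Note that the query word itself is excluded, since it would trivially rank first with similarity 1.0.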