Key Word(s): WordNet, ConceptNet
Title
GloVe embeddings
Description:
In this exercise, you'll get practice loading word embeddings and finding the most similar words to a given query word ('bank'). The file glove_mini.txt contains GloVe embeddings for nearly 3,000 of the most common English words. Each line of the file contains 51 space-separated pieces of data:
word value_1 value_2 ... value_50
(the word itself, followed by its 50 embedding values)
e.g., the first two lines of the file:
the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658 -0.0095095 0.011658 0.10204 -0.12792 -0.8443 -0.12181 -0.016801 -0.33279 -0.1552 -0.23131 -0.19181 -1.8823 -0.76746 0.099051 -0.42125 -0.19526 4.0071 -0.18594 -0.52287 -0.31681 0.00059213 0.0074449 0.17778 -0.15897 0.012041 -0.054223 -0.29871 -0.15749 -0.34758 -0.045637 -0.44251 0.18785 0.0027849 -0.18411 -0.11514 -0.78581
of 0.70853 0.57088 -0.4716 0.18048 0.54449 0.72603 0.18157 -0.52393 0.10381 -0.17566 0.078852 -0.36216 -0.11829 -0.83336 0.11917 -0.16605 0.061555 -0.012719 -0.56623 0.013616 0.22851 -0.14396 -0.067549 -0.38157 -0.23698 -1.7037 -0.86692 -0.26704 -0.2589 0.1767 3.8676 -0.1613 -0.13273 -0.68881 0.18444 0.0052464 -0.33874 -0.078956 0.24185 0.36576 -0.34727 0.28483 0.075693 -0.062178 -0.38988 0.22902 -0.21617 -0.22562 -0.093918 -0.80375
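For instance, a single line can be split into its word and vector like this (an illustration of the format only, not part of the graded code; the example line is truncated):
line = "the 0.418 0.24968 -0.41242"      # truncated example line
parts = line.split()
word = parts[0]                          # "the"
vector = [float(x) for x in parts[1:]]   # [0.418, 0.24968, -0.41242]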
We provide the cosine_sim() function below.
HINT:
One can sort a dict d by values (as opposed to keys) via:
- sorted(d.items(), key=operator.itemgetter(1))
Alternative approaches are detailed here.
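For instance, with a made-up dict of similarity scores, the highest-scoring entries can be pulled out like so (an illustration only; the variable names are arbitrary):
import operator

d = {"river": 0.71, "money": 0.65, "tree": 0.12}
ranked = sorted(d.items(), key=operator.itemgetter(1), reverse=True)
print(ranked)                        # [('river', 0.71), ('money', 0.65), ('tree', 0.12)]
print([w for w, _ in ranked[:2]])    # ['river', 'money']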
**REMINDER**: After running every cell, be sure to auto-grade your work by clicking 'Mark' in the lower-right corner. Otherwise, no credit will be given.
# import useful libraries
import math
import operator

# calculates the cosine similarity of the two passed-in lists of numbers
def cosine_sim(a, b):
    numerator = 0
    denom_a = 0
    denom_b = 0
    for x, y in zip(a, b):
        numerator += x * y   # dot product
        denom_a += x * x     # squared magnitude of a
        denom_b += y * y     # squared magnitude of b
    denom_a = math.sqrt(denom_a)
    denom_b = math.sqrt(denom_b)
    return numerator / (denom_a * denom_b)
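For example, a quick sanity check (not required by the exercise): orthogonal vectors should score 0 and parallel vectors should score 1.
print(cosine_sim([1.0, 0.0], [0.0, 1.0]))   # 0.0  (orthogonal vectors)
print(cosine_sim([1.0, 2.0], [2.0, 4.0]))   # ~1.0 (parallel vectors, up to floating-point rounding)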
Write the function load_embeddings(), which takes a filename as input and returns the embeddings saved to a dict. That is, each key should be a word (string), and its value should be a list of floats. For example, {"the": [0.418, 0.24968, -0.41242, ..., -0.78581]} should exist within your dictionary.
### edTest(test_a) ###
def load_embeddings(filename):
    embeddings = {}
    f = open(filename)
    # YOUR CODE STARTS HERE
    # YOUR CODE ENDS HERE
    f.close()
    return embeddings

# DO NOT EDIT THIS LINE
embeddings = load_embeddings("glove_mini.txt")
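If you get stuck, one possible approach is sketched below (named load_embeddings_sketch so it doesn't clash with your graded function; it is an illustration, not necessarily the graded solution): split each line on whitespace, take the first token as the word, and convert the remaining tokens to floats.
def load_embeddings_sketch(filename):
    embeddings = {}
    with open(filename) as f:
        for line in f:
            parts = line.split()
            # first token is the word; the remaining 50 tokens are its values
            embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings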
Write the function get_most_similar(), which finds the top k most similar words (per cosine similarity of their embeddings) to the passed-in word.
To be clear, the function's inputs are:
- a word (string), to which all other words will be compared
- k (int), the number of top words to return
The output should be a list of strings (the words).
### edTest(test_b) ###
def get_most_similar(word, k):
    top_words = []
    # YOUR CODE STARTS HERE
    # YOUR CODE ENDS HERE
    return top_words

# DO NOT EDIT THIS LINE
bank_words = get_most_similar('bank', 10)
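If helpful, here is one possible approach as a sketch (named get_most_similar_sketch so it doesn't clash with your graded function). It assumes the global embeddings dict and cosine_sim() defined above, and that the query word itself should be excluded from the results:
def get_most_similar_sketch(word, k):
    query = embeddings[word]
    scores = {}
    for other, vec in embeddings.items():
        if other != word:                      # skip the query word itself
            scores[other] = cosine_sim(query, vec)
    # sort by similarity, highest first, and keep the top k words
    ranked = sorted(scores.items(), key=operator.itemgetter(1), reverse=True)
    return [w for w, _ in ranked[:k]]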