


Data Science 2: Advanced Topics in Data Science

Section 3: Recurrent Neural Networks

Harvard University
Spring 2021
Instructors: Mark Glickman, Pavlos Protopapas, and Chris Tanner
Authors: Chris Gumb and Eleni Kaxiras


In [1]:
## RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2019-CS109B/master/content/styles/cs109.css").text
HTML(styles)
Out[1]:

Learning Objectives

By the end of this lab, you should understand:

  • how to perform basic preprocessing on text data
  • the layers used in Keras to construct RNNs and their variants (GRU, LSTM)
  • how the model's task (i.e., many-to-1, many-to-many) affects architecture choices
In [3]:
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.models import Sequential, Model, load_model
from tensorflow.keras.layers import BatchNormalization, Bidirectional, Dense, Embedding, GRU, LSTM, SimpleRNN,\
                                    Input, TimeDistributed, Dropout, RepeatVector
from tensorflow.keras.layers import Conv1D, Conv2D, Flatten, MaxPool1D, MaxPool2D, Lambda
from tensorflow.keras.callbacks import EarlyStopping, LambdaCallback, ModelCheckpoint
from tensorflow.keras.initializers import Constant
from tensorflow.keras.preprocessing import sequence
from sklearn.model_selection import train_test_split
import tensorflow_datasets
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import re, sys
# fix random seed for reproducibility
np.random.seed(109)

Case Study: IMDB Review Classifier

Let's frame our discussion of RNNs around the example of a text classifier. Specifically, we'll build and evaluate various models that all attempt to discriminate between positive and negative reviews from the Internet Movie Database (IMDB). The dataset is again made available to us through the TensorFlow Datasets API.

In [4]:
import tensorflow_datasets
In [5]:
(train, test), info = tensorflow_datasets.load('imdb_reviews', split=['train', 'test'], with_info=True)

The helpful info object provides details about the dataset.

In [6]:
info
Out[6]:
tfds.core.DatasetInfo(
    name='imdb_reviews',
    full_name='imdb_reviews/plain_text/1.0.0',
    description="""
    Large Movie Review Dataset.
    This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
    """,
    config_description="""
    Plain text
    """,
    homepage='http://ai.stanford.edu/~amaas/data/sentiment/',
    data_path='/home/10914655/tensorflow_datasets/imdb_reviews/plain_text/1.0.0',
    download_size=80.23 MiB,
    dataset_size=129.83 MiB,
    features=FeaturesDict({
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
        'text': Text(shape=(), dtype=tf.string),
    }),
    supervised_keys=('text', 'label'),
    splits={
        'test': ,
        'train': ,
        'unsupervised': ,
    },
    citation="""@InProceedings{maas-EtAl:2011:ACL-HLT2011,
      author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
      title     = {Learning Word Vectors for Sentiment Analysis},
      booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
      month     = {June},
      year      = {2011},
      address   = {Portland, Oregon, USA},
      publisher = {Association for Computational Linguistics},
      pages     = {142--150},
      url       = {http://www.aclweb.org/anthology/P11-1015}
    }""",
)
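
Aside: if you want to pull specific fields out of info programmatically rather than reading the printed summary, something like the following should work (a sketch assuming the standard tfds.core.DatasetInfo attributes):

# Inspect individual DatasetInfo fields (sketch, not part of the original lab)
print(info.features['label'].names)        # class names, e.g. ['neg', 'pos']
print(info.splits['train'].num_examples)   # number of training reviews (25,000)
print(info.splits['test'].num_examples)    # number of test reviews (25,000)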

We see that the dataset consists of text reviews and binary good/bad labels. Here are two examples:

In [7]:
labels = {0: 'bad', 1: 'good'}
seen = {'bad': False, 'good': False}
for review in train:
    label = review['label'].numpy()
    if not seen[labels[label]]:
        print(f"text:\n{review['text'].numpy().decode()}\n")
        print(f"label: {labels[label]}\n")
        seen[labels[label]] = True
    if all(val == True for val in seen.values()):
        break
text:
This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.

label: bad

text:
This is the kind of film for a snowy Sunday afternoon when the rest of the world can go ahead with its own business as you descend into a big arm-chair and mellow for a couple of hours. Wonderful performances from Cher and Nicolas Cage (as always) gently row the plot along. There are no rapids to cross, no dangerous waters, just a warm and witty paddle through New York life at its best. A family film in every sense and one that deserves the praise it received.

label: good

Great! But unfortunately, computers can't read! 📖--🤖❓

Preprocessing Text Data

Computers have no built-in knowledge of language and cannot understand text data in any rich way that humans do -- at least not without some help! The first crucial step in natural language processing is to clean and preprocess your data so that your algorithms and models can make use of it.

We'll look at a few preprocessing steps:

- tokenization
- padding
- numerical encoding

Depending on your NLP task, you may want to take additional preprocessing steps which we will not cover here. These can include:

  • converting all characters to lowercase
  • treating each punctuation mark as a token (e.g., , . ! ? are each separate tokens)
  • removing punctuation altogether
  • separating each sentence with a unique symbol (e.g., <SOS> and <EOS>)
  • removing words that are incredibly common (e.g., function words, (in)definite articles); these are referred to as 'stopwords'
  • Lemmatizing (replacing words with their 'dictionary entry form')
  • Stemming (removing grammatical morphemes)

Useful NLP Python libraries such as NLTK and spaCy provide built-in methods for many of these preprocessing steps; a small sketch of a few of them is shown below.
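
For example, a few of the optional steps above can be done with NLTK. This is just a sketch (not part of the original lab); it assumes NLTK is installed and that the listed resources have been downloaded.

# Sketch: stopword removal, lemmatization, and stemming with NLTK
import nltk
nltk.download('punkt')      # tokenizer models
nltk.download('stopwords')  # stopword lists
nltk.download('wordnet')    # lemmatizer dictionary
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.tokenize import word_tokenize

text = "The actors were great, but the endings of both films were terrible!"
tokens = word_tokenize(text.lower())                                  # lowercase + tokenize
no_stop = [t for t in tokens if t not in stopwords.words('english')]  # drop stopwords
lemmas = [WordNetLemmatizer().lemmatize(t) for t in no_stop]          # dictionary forms
stems = [PorterStemmer().stem(t) for t in no_stop]                    # strip grammatical morphemes
print(no_stop, lemmas, stems, sep='\n')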

Tokenization

Tokens are the atomic units of meaning which our model will be working with. What should these units be? These could be characters, words, or even sentences. For our movie review classifier we will be working at the word level.
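
To make the distinction concrete, here is a tiny sketch (not from the original lab) of the same string tokenized at the three granularities:

# Character-, word-, and (crude) sentence-level tokenization of one string
import re
s = "I loved it. Best film ever!"
char_tokens = list(s)                                          # character level
word_tokens = s.split()                                        # word level (whitespace split)
sent_tokens = [t for t in re.split(r'(?<=[.!?])\s+', s) if t]  # sentence level (very rough)
print(char_tokens[:10], word_tokens, sent_tokens, sep='\n')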

For this example we will process just a subset of the original dataset.

In [8]:
SAMPLE_SIZE = 10
subset = list(train.take(SAMPLE_SIZE))
subset[0]
Out[8]:
{'label': ,
 'text': }

The TFDS format allows for the construction of efficient preprocessing pipelines. But for our own preprocessing example we will be primarily working with Python list objects. This gives us a chance to practice Python list comprehensions, which are a powerful tool to have at your disposal. They will serve you well when processing arbitrary text which may not already be in a nice TFDS format (such as in the HW 😉).

We'll convert our data subset into X and y lists.

In [9]:
X = [x['text'].numpy().decode() for x in subset]
y = [x['label'].numpy() for x in subset]
In [10]:
print(f'X has {len(X)} reviews')
print(f'y has {len(y)} labels')
X has 10 reviews
y has 10 labels
In [11]:
N_CHARS = 20
print(f'First {N_CHARS} characters of all reviews:\n{[x[:20]+"..." for x in X]}\n')
print(f'All labels:\n{y}')
First 20 characters of all reviews:
['This was an absolute...', 'I have been known to...', 'Mann photographs the...', 'This is the kind of ...', 'As others have menti...', 'This is a film which...', 'Okay, you have:

Each observation in X is a review. A review is a str object which we can think of as a sequence of characters. This is indeed how Python treats strings, as made clear by how we print 'slices' of each review in the code cell above.

We'll see a bit later that you can in fact successfully train a neural network on text data at the character level.

But for the moment we will work at the word level, treating each word as a token. This means our observations should be organized as sequences of words rather than sequences of characters.

In [12]:
# list comprehensions again to the rescue!
X = [x.split() for x in X]
# The same thing can be accomplished with:
# list(map(str.split, X))
# but that is much harder to parse! O_o

Now let's look at the first 10 tokens in the first 2 reviews.

In [13]:
X[0][:10], X[1][:10]
Out[13]:
(['This',
  'was',
  'an',
  'absolutely',
  'terrible',
  'movie.',
  "Don't",
  'be',
  'lured',
  'in'],
 ['I',
  'have',
  'been',
  'known',
  'to',
  'fall',
  'asleep',
  'during',
  'films,',
  'but'])

Padding

Let's take a look at the lengths of the reviews in our subset.

In [14]:
[len(x) for x in X]
Out[14]:
[116, 112, 132, 88, 81, 289, 557, 111, 223, 127]

If we were training our RNN one sentence at a time, it would be okay to have sentences of varying lengths. However, as with any neural network, it can sometimes be advantageous to train on inputs in batches. When doing so with RNNs, our input tensors need to be of the same length/dimensions.

Here are two examples of tokenized reviews padded to have a length of 5.

['I', 'loved', 'it', '', '']
['It', 'stinks', '', '', '']

Now let's pad our own examples. Note that 'padding' in this context also means truncating sequences that are longer than our specified max length.

In [15]:
MAX_LEN = 500
PAD = ''
# truncate
X = [x[:MAX_LEN] for x in X]
# pad
for x in X:
    while len(x) < MAX_LEN:
        x.append(PAD)
In [16]:
[len(x) for x in X]
Out[16]:
[500, 500, 500, 500, 500, 500, 500, 500, 500, 500]

Now all reviews are of a uniform length!

Numerical Encoding

If each review in our dataset is an observation, then the features of each observation are the tokens, in this case, words. But these words are still strings. Our machine learning methods require us to be able to multiply our features by weights. If we want to use these words as inputs for a neural network we'll have to convert them into some numerical representation.

One solution is to create a one-to-one mapping between unique words and integers.

If the six sentences below were our entire corpus, our conversion could look like this (a short code sketch follows the mapping):

  1. i have books - [1, 4, 2]
  2. interesting books are useful - [9, 2, 5, 7]
  3. i have computers - [1, 4, 3]
  4. computers are interesting and useful - [3, 5, 9, 8, 7]
  5. books and computers are both valuable - [2, 8, 3, 5, 11, 10]
  6. bye bye - [6, 6]

i-1, books-2, computers-3, have-4, are-5, bye-6, useful-7, and-8, interesting-9, valuable-10, both-11
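
Here is a minimal sketch (not part of the original lab) that builds such a mapping for the toy corpus above. Note that the integer assigned to each word depends on the order in which the vocabulary is enumerated, so the exact numbers may differ from the hand-made table.

# Build a word-to-integer mapping for the toy corpus and encode each sentence
toy_corpus = [
    "i have books",
    "interesting books are useful",
    "i have computers",
    "computers are interesting and useful",
    "books and computers are both valuable",
    "bye bye",
]
toy_vocab = sorted({word for sentence in toy_corpus for word in sentence.split()})
toy_word2idx = {word: idx + 1 for idx, word in enumerate(toy_vocab)}  # reserve 0 for padding
encoded = [[toy_word2idx[word] for word in sentence.split()] for sentence in toy_corpus]
print(toy_word2idx)
print(encoded)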

To accomplish this we'll first need to know what all the unique words are in our dataset.

In [17]:
all_tokens = [word for review in X for word in review]
In [18]:
# sanity check
len(all_tokens), sum([len(x) for x in X])
Out[18]:
(5000, 5000)

Casting our list of words into a set is a great way to get all the unique words in the data.

In [19]:
vocab = sorted(set(all_tokens))
print('Unique Words:', len(vocab))
Unique Words: 892

Now we need to create a mapping from words to integers. For this we will use a dictionary comprehension.

In [20]:
word2idx = {word: idx for idx, word in enumerate(vocab)}
In [21]:
word2idx
Out[21]:
{'"Absolute': 0,
 '"Bohlen"-Fan': 1,
 '"Brideshead': 2,
 '"Candy"?).': 3,
 '"City': 4,
 '"Dieter': 5,
 '"Dieter"': 6,
 '"Dragonfly"': 7,
 '"I\'ve': 8,
 '"Lady."Ah,': 43,
 '/>And': 44,
 '/>But': 45,
 '/>Canadian': 46,
 '/>David': 47,
 '/>First': 48,
 '/>Henceforth,': 49,
 '/>Joanna': 50,
 '/>Journalist': 51,
 '/>Nothing': 52,
 '/>OK,': 53,
 '/>Penelope': 54,
 '/>Peter': 55,
 '/>Second': 56,
 '/>So': 57,
 '/>Susan': 58,
 '/>Thank': 59,
 '/>Third': 60,
 '/>To': 61,
 '/>When': 62,
 '/>Wrong!': 63,
 '/>and': 64,
 '1-dimensional': 65,
 '14': 66,
 '1950s': 67,
 '20': 68,
 '': 69,
 ...}

We repeat the process, this time mapping integers to words.

In [22]:
idx2word = {idx: word for idx, word in enumerate(vocab)}
In [23]:
idx2word
Out[23]:
{0: '"Absolute',
 1: '"Bohlen"-Fan',
 2: '"Brideshead',
 3: '"Candy"?).',
 4: '"City',
 5: '"Dieter',
 6: '"Dieter"',
 7: '"Dragonfly"',
 8: '"I\'ve',
 9: '"Lady."Ah,',
 44: '/>And',
 45: '/>But',
 46: '/>Canadian',
 47: '/>David',
 48: '/>First',
 49: '/>Henceforth,',
 50: '/>Joanna',
 51: '/>Journalist',
 52: '/>Nothing',
 53: '/>OK,',
 54: '/>Penelope',
 55: '/>Peter',
 56: '/>Second',
 57: '/>So',
 58: '/>Susan',
 59: '/>Thank',
 60: '/>Third',
 61: '/>To',
 62: '/>When',
 63: '/>Wrong!',
 64: '/>and',
 65: '1-dimensional',
 66: '14',
 67: '1950s',
 68: '20',
 69: '',
 ...}

Now, perform the mapping to encode the observations in our subset. Note the use of nested list comprehensions!

In [24]:
X_proc = [[word2idx[word] for word in review] for review in X]
X_proc[0][:10], X_proc[1][:10]
Out[24]:
([211, 851, 272, 233, 793, 587, 109, 303, 557, 517],
 [131, 495, 308, 536, 819, 436, 289, 406, 449, 327])

X_proc is a list of lists, but if we are going to feed it into a Keras model we should convert both it and y into NumPy arrays.

In [25]:
X_proc = np.hstack(X_proc).reshape(-1, MAX_LEN)
y = np.array(y)
X_proc, y
Out[25]:
(array([[211, 851, 272, ...,  69,  69,  69],
        [131, 495, 308, ...,  69,  69,  69],
        [160, 649, 799, ...,  69,  69,  69],
        ...,
        [206, 445, 525, ...,  69,  69,  69],
        [131, 687, 552, ...,  69,  69,  69],
        [201, 810, 622, ...,  69,  69,  69]]),
 array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0]))

Now, just to prove that we've successfully processed the data, we perform a train-test split and feed the data into a simple feed-forward neural network (FFNN).

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X_proc, y, test_size=0.2, stratify=y)
In [27]:
model = Sequential()

model.add(Dense(8, activation='relu',input_dim=MAX_LEN))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=5, batch_size=2, verbose=2)

scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense (Dense)                (None, 8)                 4008
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 9
=================================================================
Total params: 4,017
Trainable params: 4,017
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/5
4/4 - 2s - loss: 187.6442 - accuracy: 0.2500 - val_loss: 149.4720 - val_accuracy: 0.5000
Epoch 2/5
4/4 - 0s - loss: 16.7689 - accuracy: 0.7500 - val_loss: 332.2443 - val_accuracy: 0.5000
Epoch 3/5
4/4 - 0s - loss: 21.6830 - accuracy: 0.7500 - val_loss: 360.5525 - val_accuracy: 0.5000
Epoch 4/5
4/4 - 0s - loss: 11.4073 - accuracy: 0.7500 - val_loss: 362.2109 - val_accuracy: 0.5000
Epoch 5/5
4/4 - 0s - loss: 1.7824e-10 - accuracy: 1.0000 - val_loss: 346.8892 - val_accuracy: 0.5000
Accuracy: 50.00%

It worked! But our subset was very small, so we shouldn't get too excited about the results above.

The IMDB dataset is very popular, so Keras also includes an alternative method for loading the data. This method can save us a lot of time for several reasons:

  • Cleaned text with less meaningless punctuation
  • Pre-tokenized and numerically encoded
  • Allows us to specify maximum vocabulary size
In [29]:
from tensorflow.keras.datasets import imdb
import warnings
warnings.filterwarnings('ignore')
In [30]:
# We want to have a finite vocabulary to make sure that our word matrices are not arbitrarily large
MAX_VOCAB = 10000
INDEX_FROM = 3   # word index offset 
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=MAX_VOCAB, index_from=INDEX_FROM)

imdb.get_word_index loads a JSON file into a dictionary. This gives us the word-to-integer mapping.

In [31]:
word2idx = imdb.get_word_index(path='imdb_word_index.json')
word2idx = {k:(v + INDEX_FROM) for k,v in word2idx.items()}
word2idx[""] = 0
word2idx[""] = 1
word2idx[""] = 2
word2idx[""] = 3
word2idx
Out[31]:
{'fawn': 34704,
 'tsukino': 52009,
 'nunnery': 52010,
 'sonja': 16819,
 'vani': 63954,
 'woods': 1411,
 'spiders': 16118,
 'hanging': 2348,
 'woody': 2292,
 'trawling': 52011,
 "hold's": 52012,
 'comically': 11310,
 'localized': 40833,
 'disobeying': 30571,
 "'royale": 52013,
 "harpo's": 40834,
 'canet': 52014,
 'aileen': 19316,
 'acurately': 52015,
 "diplomat's": 52016,
 'rickman': 25245,
 'arranged': 6749,
 'rumbustious': 52017,
 'familiarness': 52018,
 "spider'": 52019,
 'hahahah': 68807,
 "wood'": 52020,
 'transvestism': 40836,
 "hangin'": 34705,
 'bringing': 2341,
 'seamier': 40837,
 'wooded': 34706,
 'bravora': 52021,
 'grueling': 16820,
 'wooden': 1639,
 'wednesday': 16821,
 "'prix": 52022,
 'altagracia': 34707,
 'circuitry': 52023,
 'crotch': 11588,
 'busybody': 57769,
 "tart'n'tangy": 52024,
 'burgade': 14132,
 'thrace': 52026,
 "tom's": 11041,
 'snuggles': 52028,
 'francesco': 29117,
 'complainers': 52030,
 'templarios': 52128,
 '272': 40838,
 '273': 52031,
 'zaniacs': 52133,
 '275': 34709,
 'consenting': 27634,
 'snuggled': 40839,
 'inanimate': 15495,
 'uality': 52033,
 'bronte': 11929,
 'errors': 4013,
 'dialogs': 3233,
 "yomada's": 52034,
 "madman's": 34710,
 'dialoge': 30588,
 'usenet': 52036,
 'videodrome': 40840,
 "kid'": 26341,
 'pawed': 52037,
 "'girlfriend'": 30572,
 "'pleasure": 52038,
 "'reloaded'": 52039,
 "kazakos'": 40842,
 'rocque': 52040,
 'mailings': 52041,
 'brainwashed': 11930,
 'mcanally': 16822,
 "tom''": 52042,
 'kurupt': 25246,
 'affiliated': 21908,
 'babaganoosh': 52043,
 "noe's": 40843,
 'quart': 40844,
 'kids': 362,
 'uplifting': 5037,
 'controversy': 7096,
 'kida': 21909,
 'kidd': 23382,
 "error'": 52044,
 'neurologist': 52045,
 'spotty': 18513,
 'cobblers': 30573,
 'projection': 9881,
 'fastforwarding': 40845,
 'sters': 52046,
 "eggar's": 52047,
 'etherything': 52048,
 'gateshead': 40846,
 'airball': 34711,
 'unsinkable': 25247,
 'stern': 7183,
 "cervi's": 52049,
 'dnd': 40847,
 'dna': 11589,
 'insecurity': 20601,
 "'reboot'": 52050,
 'trelkovsky': 11040,
 'jaekel': 52051,
 'sidebars': 52052,
 "sforza's": 52053,
 'distortions': 17636,
 'mutinies': 52054,
 'sermons': 30605,
 '7ft': 40849,
 'boobage': 52055,
 "o'bannon's": 52056,
 'populations': 23383,
 'chulak': 52057,
 'mesmerize': 27636,
 'quinnell': 52058,
 'yahoo': 10310,
 'meteorologist': 52060,
 'beswick': 42580,
 'boorman': 15496,
 'voicework': 40850,
 "ster'": 52061,
 'blustering': 22925,
 'hj': 52062,
 'intake': 27637,
 'morally': 5624,
 'jumbling': 40852,
 'bowersock': 52063,
 "'porky's'": 52064,
 'gershon': 16824,
 'ludicrosity': 40853,
 'coprophilia': 52065,
 'expressively': 40854,
 "india's": 19503,
 "post's": 34713,
 'wana': 52066,
 'wang': 5286,
 'wand': 30574,
 'wane': 25248,
 'edgeways': 52324,
 'titanium': 34714,
 'pinta': 40855,
 'want': 181,
 'pinto': 30575,
 'whoopdedoodles': 52068,
 'tchaikovsky': 21911,
 'travel': 2106,
 "'victory'": 52069,
 'copious': 11931,
 'gouge': 22436,
 "chapters'": 52070,
 'barbra': 6705,
 'uselessness': 30576,
 "wan'": 52071,
 'assimilated': 27638,
 'petiot': 16119,
 'most\x85and': 52072,
 'dinosaurs': 3933,
 'wrong': 355,
 'seda': 52073,
 'stollen': 52074,
 'sentencing': 34715,
 'ouroboros': 40856,
 'assimilates': 40857,
 'colorfully': 40858,
 'glenne': 27639,
 'dongen': 52075,
 'subplots': 4763,
 'kiloton': 52076,
 'chandon': 23384,
 "effect'": 34716,
 'snugly': 27640,
 'kuei': 40859,
 'welcomed': 9095,
 'dishonor': 30074,
 'concurrence': 52078,
 'stoicism': 23385,
 "guys'": 14899,
 "beroemd'": 52080,
 'butcher': 6706,
 "melfi's": 40860,
 'aargh': 30626,
 'playhouse': 20602,
 'wickedly': 11311,
 'fit': 1183,
 'labratory': 52081,
 'lifeline': 40862,
 'screaming': 1930,
 'fix': 4290,
 'cineliterate': 52082,
 'fic': 52083,
 'fia': 52084,
 'fig': 34717,
 'fmvs': 52085,
 'fie': 52086,
 'reentered': 52087,
 'fin': 30577,
 'doctresses': 52088,
 'fil': 52089,
 'zucker': 12609,
 'ached': 31934,
 'counsil': 52091,
 'paterfamilias': 52092,
 'songwriter': 13888,
 'shivam': 34718,
 'hurting': 9657,
 'effects': 302,
 'slauther': 52093,
 "'flame'": 52094,
 'sommerset': 52095,
 'interwhined': 52096,
 'whacking': 27641,
 'bartok': 52097,
 'barton': 8778,
 'frewer': 21912,
 "fi'": 52098,
 'ingrid': 6195,
 'stribor': 30578,
 'approporiately': 52099,
 'wobblyhand': 52100,
 'tantalisingly': 52101,
 'ankylosaurus': 52102,
 'parasites': 17637,
 'childen': 52103,
 "jenkins'": 52104,
 'metafiction': 52105,
 'golem': 17638,
 'indiscretion': 40863,
 "reeves'": 23386,
 "inamorata's": 57784,
 'brittannica': 52107,
 'adapt': 7919,
 "russo's": 30579,
 'guitarists': 48249,
 'abbott': 10556,
 'abbots': 40864,
 'lanisha': 17652,
 'magickal': 40866,
 'mattter': 52108,
 "'willy": 52109,
 'pumpkins': 34719,
 'stuntpeople': 52110,
 'estimate': 30580,
 'ugghhh': 40867,
 'gameplay': 11312,
 "wern't": 52111,
 "n'sync": 40868,
 'sickeningly': 16120,
 'chiara': 40869,
 'disturbed': 4014,
 'portmanteau': 40870,
 'ineffectively': 52112,
 "duchonvey's": 82146,
 "nasty'": 37522,
 'purpose': 1288,
 'lazers': 52115,
 'lightened': 28108,
 'kaliganj': 52116,
 'popularism': 52117,
 "damme's": 18514,
 'stylistics': 30581,
 'mindgaming': 52118,
 'spoilerish': 46452,
 "'corny'": 52120,
 'boerner': 34721,
 'olds': 6795,
 'bakelite': 52121,
 'renovated': 27642,
 'forrester': 27643,
 "lumiere's": 52122,
 'gaskets': 52027,
 'needed': 887,
 'smight': 34722,
 'master': 1300,
 "edie's": 25908,
 'seeber': 40871,
 'hiya': 52123,
 'fuzziness': 52124,
 'genesis': 14900,
 'rewards': 12610,
 'enthrall': 30582,
 "'about": 40872,
 "recollection's": 52125,
 'mutilated': 11042,
 'fatherlands': 52126,
 "fischer's": 52127,
 'positively': 5402,
 '270': 34708,
 'ahmed': 34723,
 'zatoichi': 9839,
 'bannister': 13889,
 'anniversaries': 52130,
 "helm's": 30583,
 "'work'": 52131,
 'exclaimed': 34724,
 "'unfunny'": 52132,
 '274': 52032,
 'feeling': 547,
 "wanda's": 52134,
 'dolan': 33269,
 '278': 52136,
 'peacoat': 52137,
 'brawny': 40873,
 'mishra': 40874,
 'worlders': 40875,
 'protags': 52138,
 'skullcap': 52139,
 'dastagir': 57599,
 'affairs': 5625,
 'wholesome': 7802,
 'hymen': 52140,
 'paramedics': 25249,
 'unpersons': 52141,
 'heavyarms': 52142,
 'affaire': 52143,
 'coulisses': 52144,
 'hymer': 40876,
 'kremlin': 52145,
 'shipments': 30584,
 'pixilated': 52146,
 "'00s": 30585,
 'diminishing': 18515,
 'cinematic': 1360,
 'resonates': 14901,
 'simplify': 40877,
 "nature'": 40878,
 'temptresses': 40879,
 'reverence': 16825,
 'resonated': 19505,
 'dailey': 34725,
 '2\x85': 52147,
 'treize': 27644,
 'majo': 52148,
 'kiya': 21913,
 'woolnough': 52149,
 'thanatos': 39800,
 'sandoval': 35734,
 'dorama': 40882,
 "o'shaughnessy": 52150,
 'tech': 4991,
 'fugitives': 32021,
 'teck': 30586,
 "'e'": 76128,
 'doesn’t': 40884,
 'purged': 52152,
 'saying': 660,
 "martians'": 41098,
 'norliss': 23421,
 'dickey': 27645,
 'dicker': 52155,
 "'sependipity": 52156,
 'padded': 8425,
 'ordell': 57795,
 "sturges'": 40885,
 'independentcritics': 52157,
 'tempted': 5748,
 "atkinson's": 34727,
 'hounded': 25250,
 'apace': 52158,
 'clicked': 15497,
 "'humor'": 30587,
 "martino's": 17180,
 "'supporting": 52159,
 'warmongering': 52035,
 "zemeckis's": 34728,
 'lube': 21914,
 'shocky': 52160,
 'plate': 7479,
 'plata': 40886,
 'sturgess': 40887,
 "nerds'": 40888,
 'plato': 20603,
 'plath': 34729,
 'platt': 40889,
 'mcnab': 52162,
 'clumsiness': 27646,
 'altogether': 3902,
 'massacring': 42587,
 'bicenntinial': 52163,
 'skaal': 40890,
 'droning': 14363,
 'lds': 8779,
 'jaguar': 21915,
 "cale's": 34730,
 'nicely': 1780,
 'mummy': 4591,
 "lot's": 18516,
 'patch': 10089,
 'kerkhof': 50205,
 "leader's": 52164,
 "'movie": 27647,
 'uncomfirmed': 52165,
 'heirloom': 40891,
 'wrangle': 47363,
 'emotion\x85': 52166,
 "'stargate'": 52167,
 'pinoy': 40892,
 'conchatta': 40893,
 'broeke': 41131,
 'advisedly': 40894,
 "barker's": 17639,
 'descours': 52169,
 'lots': 775,
 'lotr': 9262,
 'irs': 9882,
 'lott': 52170,
 'xvi': 40895,
 'irk': 34731,
 'irl': 52171,
 'ira': 6890,
 'belzer': 21916,
 'irc': 52172,
 'ire': 27648,
 'requisites': 40896,
 'discipline': 7696,
 'lyoko': 52964,
 'extend': 11313,
 'nature': 876,
 "'dickie'": 52173,
 'optimist': 40897,
 'lapping': 30589,
 'superficial': 3903,
 'vestment': 52174,
 'extent': 2826,
 'tendons': 52175,
 "heller's": 52176,
 'quagmires': 52177,
 'miyako': 52178,
 'moocow': 20604,
 "coles'": 52179,
 'lookit': 40898,
 'ravenously': 52180,
 'levitating': 40899,
 'perfunctorily': 52181,
 'lookin': 30590,
 "lot'": 40901,
 'lookie': 52182,
 'fearlessly': 34873,
 'libyan': 52184,
 'fondles': 40902,
 'gopher': 35717,
 'wearying': 40904,
 "nz's": 52185,
 'minuses': 27649,
 'puposelessly': 52186,
 'shandling': 52187,
 'decapitates': 31271,
 'humming': 11932,
 "'nother": 40905,
 'smackdown': 21917,
 'underdone': 30591,
 'frf': 40906,
 'triviality': 52188,
 'fro': 25251,
 'bothers': 8780,
 "'kensington": 52189,
 'much': 76,
 'muco': 34733,
 'wiseguy': 22618,
 "richie's": 27651,
 'tonino': 40907,
 'unleavened': 52190,
 'fry': 11590,
 "'tv'": 40908,
 'toning': 40909,
 'obese': 14364,
 'sensationalized': 30592,
 'spiv': 40910,
 'spit': 6262,
 'arkin': 7367,
 'charleton': 21918,
 'jeon': 16826,
 'boardroom': 21919,
 'doubts': 4992,
 'spin': 3087,
 'hepo': 53086,
 'wildcat': 27652,
 'venoms': 10587,
 'misconstrues': 52194,
 'mesmerising': 18517,
 'misconstrued': 40911,
 'rescinds': 52195,
 'prostrate': 52196,
 'majid': 40912,
 'climbed': 16482,
 'canoeing': 34734,
 'majin': 52198,
 'animie': 57807,
 'sylke': 40913,
 'conditioned': 14902,
 'waddell': 40914,
 '3\x85': 52199,
 'hyperdrive': 41191,
 'conditioner': 34735,
 'bricklayer': 53156,
 'hong': 2579,
 'memoriam': 52201,
 'inventively': 30595,
 "levant's": 25252,
 'portobello': 20641,
 'remand': 52203,
 'mummified': 19507,
 'honk': 27653,
 'spews': 19508,
 'visitations': 40915,
 'mummifies': 52204,
 'cavanaugh': 25253,
 'zeon': 23388,
 "jungle's": 40916,
 'viertel': 34736,
 'frenchmen': 27654,
 'torpedoes': 52205,
 'schlessinger': 52206,
 'torpedoed': 34737,
 'blister': 69879,
 'cinefest': 52207,
 'furlough': 34738,
 'mainsequence': 52208,
 'mentors': 40917,
 'academic': 9097,
 'stillness': 20605,
 'academia': 40918,
 'lonelier': 52209,
 'nibby': 52210,
 "losers'": 52211,
 'cineastes': 40919,
 'corporate': 4452,
 'massaging': 40920,
 'bellow': 30596,
 'absurdities': 19509,
 'expetations': 53244,
 'nyfiken': 40921,
 'mehras': 75641,
 'lasse': 52212,
 'visability': 52213,
 'militarily': 33949,
 "elder'": 52214,
 'gainsbourg': 19026,
 'hah': 20606,
 'hai': 13423,
 'haj': 34739,
 'hak': 25254,
 'hal': 4314,
 'ham': 4895,
 'duffer': 53262,
 'haa': 52216,
 'had': 69,
 'advancement': 11933,
 'hag': 16828,
 "hand'": 25255,
 'hay': 13424,
 'mcnamara': 20607,
 "mozart's": 52217,
 'duffel': 30734,
 'haq': 30597,
 'har': 13890,
 'has': 47,
 'hat': 2404,
 'hav': 40922,
 'haw': 30598,
 'figtings': 52218,
 'elders': 15498,
 'underpanted': 52219,
 'pninson': 52220,
 'unequivocally': 27655,
 "barbara's": 23676,
 "bello'": 52222,
 'indicative': 13000,
 'yawnfest': 40923,
 'hexploitation': 52223,
 "loder's": 52224,
 'sleuthing': 27656,
 "justin's": 32625,
 "'ball": 52225,
 "'summer": 52226,
 "'demons'": 34938,
 "mormon's": 52228,
 "laughton's": 34740,
 'debell': 52229,
 'shipyard': 39727,
 'unabashedly': 30600,
 'disks': 40404,
 'crowd': 2293,
 'crowe': 10090,
 "vancouver's": 56437,
 'mosques': 34741,
 'crown': 6630,
 'culpas': 52230,
 'crows': 27657,
 'surrell': 53347,
 'flowless': 52232,
 'sheirk': 52233,
 "'three": 40926,
 "peterson'": 52234,
 'ooverall': 52235,
 'perchance': 40927,
 'bottom': 1324,
 'chabert': 53366,
 'sneha': 52236,
 'inhuman': 13891,
 'ichii': 52237,
 'ursla': 52238,
 'completly': 30601,
 'moviedom': 40928,
 'raddick': 52239,
 'brundage': 51998,
 'brigades': 40929,
 'starring': 1184,
 "'goal'": 52240,
 'caskets': 52241,
 'willcock': 52242,
 "threesome's": 52243,
 "mosque'": 52244,
 "cover's": 52245,
 'spaceships': 17640,
 'anomalous': 40930,
 'ptsd': 27658,
 'shirdan': 52246,
 'obscenity': 21965,
 'lemmings': 30602,
 'duccio': 30603,
 "levene's": 52247,
 "'gorby'": 52248,
 "teenager's": 25258,
 'marshall': 5343,
 'honeymoon': 9098,
 'shoots': 3234,
 'despised': 12261,
 'okabasho': 52249,
 'fabric': 8292,
 'cannavale': 18518,
 'raped': 3540,
 "tutt's": 52250,
 'grasping': 17641,
 'despises': 18519,
 "thief's": 40931,
 'rapes': 8929,
 'raper': 52251,
 "eyre'": 27659,
 'walchek': 52252,
 "elmo's": 23389,
 'perfumes': 40932,
 'spurting': 21921,
 "exposition'\x85": 52253,
 'denoting': 52254,
 'thesaurus': 34743,
 "shoot'": 40933,
 'bonejack': 49762,
 'simpsonian': 52256,
 'hebetude': 30604,
 "hallow's": 34744,
 'desperation\x85': 52257,
 'incinerator': 34745,
 'congratulations': 10311,
 'humbled': 52258,
 "else's": 5927,
 'trelkovski': 40848,
 "rape'": 52259,
 "'chapters'": 59389,
 '1600s': 52260,
 'martian': 7256,
 'nicest': 25259,
 'eyred': 52262,
 'passenger': 9460,
 'disgrace': 6044,
 'moderne': 52263,
 'barrymore': 5123,
 'yankovich': 52264,
 'moderns': 40934,
 'studliest': 52265,
 'bedsheet': 52266,
 'decapitation': 14903,
 'slurring': 52267,
 "'nunsploitation'": 52268,
 "'character'": 34746,
 'cambodia': 9883,
 'rebelious': 52269,
 'pasadena': 27660,
 'crowne': 40935,
 "'bedchamber": 52270,
 'conjectural': 52271,
 'appologize': 52272,
 'halfassing': 52273,
 'paycheque': 57819,
 'palms': 20609,
 "'islands": 52274,
 'hawked': 40936,
 'palme': 21922,
 'conservatively': 40937,
 'larp': 64010,
 'palma': 5561,
 'smelling': 21923,
 'aragorn': 13001,
 'hawker': 52275,
 'hawkes': 52276,
 'explosions': 3978,
 'loren': 8062,
 "pyle's": 52277,
 'shootout': 6707,
 "mike's": 18520,
 "driscoll's": 52278,
 'cogsworth': 40938,
 "britian's": 52279,
 'childs': 34747,
 "portrait's": 52280,
 'chain': 3629,
 'whoever': 2500,
 'puttered': 52281,
 'childe': 52282,
 'maywether': 52283,
 'chair': 3039,
 "rance's": 52284,
 'machu': 34748,
 'ballet': 4520,
 'grapples': 34749,
 'summerize': 76155,
 'freelance': 30606,
 "andrea's": 52286,
 '\x91very': 52287,
 'coolidge': 45882,
 'mache': 18521,
 'balled': 52288,
 'grappled': 40940,
 'macha': 18522,
 'underlining': 21924,
 'macho': 5626,
 'oversight': 19510,
 'machi': 25260,
 'verbally': 11314,
 'tenacious': 21925,
 'windshields': 40941,
 'paychecks': 18560,
 'jerk': 3399,
 "good'": 11934,
 'prancer': 34751,
 'prances': 21926,
 'olympus': 52289,
 'lark': 21927,
 'embark': 10788,
 'gloomy': 7368,
 'jehaan': 52290,
 'turaqui': 52291,
 "child'": 20610,
 'locked': 2897,
 'pranced': 52292,
 'exact': 2591,
 'unattuned': 52293,
 'minute': 786,
 'skewed': 16121,
 'hodgins': 40943,
 'skewer': 34752,
 'think\x85': 52294,
 'rosenstein': 38768,
 'helmit': 52295,
 'wrestlemanias': 34753,
 'hindered': 16829,
 "martha's": 30607,
 'cheree': 52296,
 "pluckin'": 52297,
 'ogles': 40944,
 'heavyweight': 11935,
 'aada': 82193,
 'chopping': 11315,
 'strongboy': 61537,
 'hegemonic': 41345,
 'adorns': 40945,
 'xxth': 41349,
 'nobuhiro': 34754,
 'capitães': 52301,
 'kavogianni': 52302,
 'antwerp': 13425,
 'celebrated': 6541,
 'roarke': 52303,
 'baggins': 40946,
 'cheeseburgers': 31273,
 'matras': 52304,
 "nineties'": 52305,
 "'craig'": 52306,
 'celebrates': 13002,
 'unintentionally': 3386,
 'drafted': 14365,
 'climby': 52307,
 '303': 52308,
 'oldies': 18523,
 'climbs': 9099,
 'honour': 9658,
 'plucking': 34755,
 '305': 30077,
 'address': 5517,
 'menjou': 40947,
 "'freak'": 42595,
 'dwindling': 19511,
 'benson': 9461,
 'white’s': 52310,
 'shamelessness': 40948,
 'impacted': 21928,
 'upatz': 52311,
 'cusack': 3843,
 "flavia's": 37570,
 'effette': 52312,
 'influx': 34756,
 'boooooooo': 52313,
 'dimitrova': 52314,
 'houseman': 13426,
 'bigas': 25262,
 'boylen': 52315,
 'phillipenes': 52316,
 'fakery': 40949,
 "grandpa's": 27661,
 'darnell': 27662,
 'undergone': 19512,
 'handbags': 52318,
 'perished': 21929,
 'pooped': 37781,
 'vigour': 27663,
 'opposed': 3630,
 'etude': 52319,
 "caine's": 11802,
 'doozers': 52320,
 'photojournals': 34757,
 'perishes': 52321,
 'constrains': 34758,
 'migenes': 40951,
 'consoled': 30608,
 'alastair': 16830,
 'wvs': 52322,
 'ooooooh': 52323,
 'approving': 34759,
 'consoles': 40952,
 'disparagement': 52067,
 'futureistic': 52325,
 'rebounding': 52326,
 "'date": 52327,
 'gregoire': 52328,
 'rutherford': 21930,
 'americanised': 34760,
 'novikov': 82199,
 'following': 1045,
 'munroe': 34761,
 "morita'": 52329,
 'christenssen': 52330,
 'oatmeal': 23109,
 'fossey': 25263,
 'livered': 40953,
 'listens': 13003,
 "'marci": 76167,
 "otis's": 52333,
 'thanking': 23390,
 'maude': 16022,
 'extensions': 34762,
 'ameteurish': 52335,
 "commender's": 52336,
 'agricultural': 27664,
 'convincingly': 4521,
 'fueled': 17642,
 'mahattan': 54017,
 "paris's": 40955,
 'vulkan': 52339,
 'stapes': 52340,
 'odysessy': 52341,
 'harmon': 12262,
 'surfing': 4255,
 'halloran': 23497,
 'unbelieveably': 49583,
 "'offed'": 52342,
 'quadrant': 30610,
 'inhabiting': 19513,
 'nebbish': 34763,
 'forebears': 40956,
 'skirmish': 34764,
 'ocassionally': 52343,
 "'resist": 52344,
 'impactful': 21931,
 'spicier': 52345,
 'touristy': 40957,
 "'football'": 52346,
 'webpage': 40958,
 'exurbia': 52348,
 'jucier': 52349,
 'professors': 14904,
 'structuring': 34765,
 'jig': 30611,
 'overlord': 40959,
 'disconnect': 25264,
 'sniffle': 82204,
 'slimeball': 40960,
 'jia': 40961,
 'milked': 16831,
 'banjoes': 40962,
 'jim': 1240,
 'workforces': 52351,
 'jip': 52352,
 'rotweiller': 52353,
 'mundaneness': 34766,
 "'ninja'": 52354,
 "dead'": 11043,
 "cipriani's": 40963,
 'modestly': 20611,
 "professor'": 52355,
 'shacked': 40964,
 'bashful': 34767,
 'sorter': 23391,
 'overpowering': 16123,
 'workmanlike': 18524,
 'henpecked': 27665,
 'sorted': 18525,
 "jōb's": 52357,
 "'always": 52358,
 "'baptists": 34768,
 'dreamcatchers': 52359,
 "'silence'": 52360,
 'hickory': 21932,
 'fun\x97yet': 52361,
 'breakumentary': 52362,
 'didn': 15499,
 'didi': 52363,
 'pealing': 52364,
 'dispite': 40965,
 "italy's": 25265,
 'instability': 21933,
 'quarter': 6542,
 'quartet': 12611,
 'padmé': 52365,
 "'bleedmedry": 52366,
 'pahalniuk': 52367,
 'honduras': 52368,
 'bursting': 10789,
 "pablo's": 41468,
 'irremediably': 52370,
 'presages': 40966,
 'bowlegged': 57835,
 'dalip': 65186,
 'entering': 6263,
 'newsradio': 76175,
 'presaged': 54153,
 "giallo's": 27666,
 'bouyant': 40967,
 'amerterish': 52371,
 'rajni': 18526,
 'leeves': 30613,
 'macauley': 34770,
 'seriously': 615,
 'sugercoma': 52372,
 'grimstead': 52373,
 "'fairy'": 52374,
 'zenda': 30614,
 "'twins'": 52375,
 'realisation': 17643,
 'highsmith': 27667,
 'raunchy': 7820,
 'incentives': 40968,
 'flatson': 52377,
 'snooker': 35100,
 'crazies': 16832,
 'crazier': 14905,
 'grandma': 7097,
 'napunsaktha': 52378,
 'workmanship': 30615,
 'reisner': 52379,
 "sanford's": 61309,
 '\x91doña': 52380,
 'modest': 6111,
 "everything's": 19156,
 'hamer': 40969,
 "couldn't'": 52382,
 'quibble': 13004,
 'socking': 52383,
 'tingler': 21934,
 'gutman': 52384,
 'lachlan': 40970,
 'tableaus': 52385,
 'headbanger': 52386,
 'spoken': 2850,
 'cerebrally': 34771,
 "'road": 23493,
 'tableaux': 21935,
 "proust's": 40971,
 'periodical': 40972,
 "shoveller's": 52388,
 'tamara': 25266,
 'affords': 17644,
 'concert': 3252,
 "yara's": 87958,
 'someome': 52389,
 'lingering': 8427,
 "abraham's": 41514,
 'beesley': 34772,
 'cherbourg': 34773,
 'kagan': 28627,
 'snatch': 9100,
 "miyazaki's": 9263,
 'absorbs': 25267,
 "koltai's": 40973,
 'tingled': 64030,
 'crossroads': 19514,
 'rehab': 16124,
 'falworth': 52392,
 'sequals': 52393,
 ...}
In [32]:
idx2word = {v: k for k,v in word2idx.items()}
idx2word
Out[32]:
{34704: 'fawn',
 52009: 'tsukino',
 52010: 'nunnery',
 16819: 'sonja',
 63954: 'vani',
 1411: 'woods',
 16118: 'spiders',
 2348: 'hanging',
 2292: 'woody',
 52011: 'trawling',
 52012: "hold's",
 11310: 'comically',
 40833: 'localized',
 30571: 'disobeying',
 52013: "'royale",
 40834: "harpo's",
 52014: 'canet',
 19316: 'aileen',
 52015: 'acurately',
 52016: "diplomat's",
 25245: 'rickman',
 6749: 'arranged',
 52017: 'rumbustious',
 52018: 'familiarness',
 52019: "spider'",
 68807: 'hahahah',
 52020: "wood'",
 40836: 'transvestism',
 34705: "hangin'",
 2341: 'bringing',
 40837: 'seamier',
 34706: 'wooded',
 52021: 'bravora',
 16820: 'grueling',
 1639: 'wooden',
 16821: 'wednesday',
 52022: "'prix",
 34707: 'altagracia',
 52023: 'circuitry',
 11588: 'crotch',
 57769: 'busybody',
 52024: "tart'n'tangy",
 14132: 'burgade',
 52026: 'thrace',
 11041: "tom's",
 52028: 'snuggles',
 29117: 'francesco',
 52030: 'complainers',
 52128: 'templarios',
 40838: '272',
 52031: '273',
 52133: 'zaniacs',
 34709: '275',
 27634: 'consenting',
 40839: 'snuggled',
 15495: 'inanimate',
 52033: 'uality',
 11929: 'bronte',
 4013: 'errors',
 3233: 'dialogs',
 52034: "yomada's",
 34710: "madman's",
 30588: 'dialoge',
 52036: 'usenet',
 40840: 'videodrome',
 26341: "kid'",
 52037: 'pawed',
 30572: "'girlfriend'",
 52038: "'pleasure",
 52039: "'reloaded'",
 40842: "kazakos'",
 52040: 'rocque',
 52041: 'mailings',
 11930: 'brainwashed',
 16822: 'mcanally',
 52042: "tom''",
 25246: 'kurupt',
 21908: 'affiliated',
 52043: 'babaganoosh',
 40843: "noe's",
 40844: 'quart',
 362: 'kids',
 5037: 'uplifting',
 7096: 'controversy',
 21909: 'kida',
 23382: 'kidd',
 52044: "error'",
 52045: 'neurologist',
 18513: 'spotty',
 30573: 'cobblers',
 9881: 'projection',
 40845: 'fastforwarding',
 52046: 'sters',
 52047: "eggar's",
 52048: 'etherything',
 40846: 'gateshead',
 34711: 'airball',
 25247: 'unsinkable',
 7183: 'stern',
 52049: "cervi's",
 40847: 'dnd',
 11589: 'dna',
 20601: 'insecurity',
 52050: "'reboot'",
 11040: 'trelkovsky',
 52051: 'jaekel',
 52052: 'sidebars',
 52053: "sforza's",
 17636: 'distortions',
 52054: 'mutinies',
 30605: 'sermons',
 40849: '7ft',
 52055: 'boobage',
 52056: "o'bannon's",
 23383: 'populations',
 52057: 'chulak',
 27636: 'mesmerize',
 52058: 'quinnell',
 10310: 'yahoo',
 52060: 'meteorologist',
 42580: 'beswick',
 15496: 'boorman',
 40850: 'voicework',
 52061: "ster'",
 22925: 'blustering',
 52062: 'hj',
 27637: 'intake',
 5624: 'morally',
 40852: 'jumbling',
 52063: 'bowersock',
 52064: "'porky's'",
 16824: 'gershon',
 40853: 'ludicrosity',
 52065: 'coprophilia',
 40854: 'expressively',
 19503: "india's",
 34713: "post's",
 52066: 'wana',
 5286: 'wang',
 30574: 'wand',
 25248: 'wane',
 52324: 'edgeways',
 34714: 'titanium',
 40855: 'pinta',
 181: 'want',
 30575: 'pinto',
 52068: 'whoopdedoodles',
 21911: 'tchaikovsky',
 2106: 'travel',
 52069: "'victory'",
 11931: 'copious',
 22436: 'gouge',
 52070: "chapters'",
 6705: 'barbra',
 30576: 'uselessness',
 52071: "wan'",
 27638: 'assimilated',
 16119: 'petiot',
 52072: 'most\x85and',
 3933: 'dinosaurs',
 355: 'wrong',
 52073: 'seda',
 52074: 'stollen',
 34715: 'sentencing',
 40856: 'ouroboros',
 40857: 'assimilates',
 40858: 'colorfully',
 27639: 'glenne',
 52075: 'dongen',
 4763: 'subplots',
 52076: 'kiloton',
 23384: 'chandon',
 34716: "effect'",
 27640: 'snugly',
 40859: 'kuei',
 9095: 'welcomed',
 30074: 'dishonor',
 52078: 'concurrence',
 23385: 'stoicism',
 14899: "guys'",
 52080: "beroemd'",
 6706: 'butcher',
 40860: "melfi's",
 30626: 'aargh',
 20602: 'playhouse',
 11311: 'wickedly',
 1183: 'fit',
 52081: 'labratory',
 40862: 'lifeline',
 1930: 'screaming',
 4290: 'fix',
 52082: 'cineliterate',
 52083: 'fic',
 52084: 'fia',
 34717: 'fig',
 52085: 'fmvs',
 52086: 'fie',
 52087: 'reentered',
 30577: 'fin',
 52088: 'doctresses',
 52089: 'fil',
 12609: 'zucker',
 31934: 'ached',
 52091: 'counsil',
 52092: 'paterfamilias',
 13888: 'songwriter',
 34718: 'shivam',
 9657: 'hurting',
 302: 'effects',
 52093: 'slauther',
 52094: "'flame'",
 52095: 'sommerset',
 52096: 'interwhined',
 27641: 'whacking',
 52097: 'bartok',
 8778: 'barton',
 21912: 'frewer',
 52098: "fi'",
 6195: 'ingrid',
 30578: 'stribor',
 52099: 'approporiately',
 52100: 'wobblyhand',
 52101: 'tantalisingly',
 52102: 'ankylosaurus',
 17637: 'parasites',
 52103: 'childen',
 52104: "jenkins'",
 52105: 'metafiction',
 17638: 'golem',
 40863: 'indiscretion',
 23386: "reeves'",
 57784: "inamorata's",
 52107: 'brittannica',
 7919: 'adapt',
 30579: "russo's",
 48249: 'guitarists',
 10556: 'abbott',
 40864: 'abbots',
 17652: 'lanisha',
 40866: 'magickal',
 52108: 'mattter',
 52109: "'willy",
 34719: 'pumpkins',
 52110: 'stuntpeople',
 30580: 'estimate',
 40867: 'ugghhh',
 11312: 'gameplay',
 52111: "wern't",
 40868: "n'sync",
 16120: 'sickeningly',
 40869: 'chiara',
 4014: 'disturbed',
 40870: 'portmanteau',
 52112: 'ineffectively',
 82146: "duchonvey's",
 37522: "nasty'",
 1288: 'purpose',
 52115: 'lazers',
 28108: 'lightened',
 52116: 'kaliganj',
 52117: 'popularism',
 18514: "damme's",
 30581: 'stylistics',
 52118: 'mindgaming',
 46452: 'spoilerish',
 52120: "'corny'",
 34721: 'boerner',
 6795: 'olds',
 52121: 'bakelite',
 27642: 'renovated',
 27643: 'forrester',
 52122: "lumiere's",
 52027: 'gaskets',
 887: 'needed',
 34722: 'smight',
 1300: 'master',
 25908: "edie's",
 40871: 'seeber',
 52123: 'hiya',
 52124: 'fuzziness',
 14900: 'genesis',
 12610: 'rewards',
 30582: 'enthrall',
 40872: "'about",
 52125: "recollection's",
 11042: 'mutilated',
 52126: 'fatherlands',
 52127: "fischer's",
 5402: 'positively',
 34708: '270',
 34723: 'ahmed',
 9839: 'zatoichi',
 13889: 'bannister',
 52130: 'anniversaries',
 30583: "helm's",
 52131: "'work'",
 34724: 'exclaimed',
 52132: "'unfunny'",
 52032: '274',
 547: 'feeling',
 52134: "wanda's",
 33269: 'dolan',
 52136: '278',
 52137: 'peacoat',
 40873: 'brawny',
 40874: 'mishra',
 40875: 'worlders',
 52138: 'protags',
 52139: 'skullcap',
 57599: 'dastagir',
 5625: 'affairs',
 7802: 'wholesome',
 52140: 'hymen',
 25249: 'paramedics',
 52141: 'unpersons',
 52142: 'heavyarms',
 52143: 'affaire',
 52144: 'coulisses',
 40876: 'hymer',
 52145: 'kremlin',
 30584: 'shipments',
 52146: 'pixilated',
 30585: "'00s",
 18515: 'diminishing',
 1360: 'cinematic',
 14901: 'resonates',
 40877: 'simplify',
 40878: "nature'",
 40879: 'temptresses',
 16825: 'reverence',
 19505: 'resonated',
 34725: 'dailey',
 52147: '2\x85',
 27644: 'treize',
 52148: 'majo',
 21913: 'kiya',
 52149: 'woolnough',
 39800: 'thanatos',
 35734: 'sandoval',
 40882: 'dorama',
 52150: "o'shaughnessy",
 4991: 'tech',
 32021: 'fugitives',
 30586: 'teck',
 76128: "'e'",
 40884: 'doesn’t',
 52152: 'purged',
 660: 'saying',
 41098: "martians'",
 23421: 'norliss',
 27645: 'dickey',
 52155: 'dicker',
 52156: "'sependipity",
 8425: 'padded',
 57795: 'ordell',
 40885: "sturges'",
 52157: 'independentcritics',
 5748: 'tempted',
 34727: "atkinson's",
 25250: 'hounded',
 52158: 'apace',
 15497: 'clicked',
 30587: "'humor'",
 17180: "martino's",
 52159: "'supporting",
 52035: 'warmongering',
 34728: "zemeckis's",
 21914: 'lube',
 52160: 'shocky',
 7479: 'plate',
 40886: 'plata',
 40887: 'sturgess',
 40888: "nerds'",
 20603: 'plato',
 34729: 'plath',
 40889: 'platt',
 52162: 'mcnab',
 27646: 'clumsiness',
 3902: 'altogether',
 42587: 'massacring',
 52163: 'bicenntinial',
 40890: 'skaal',
 14363: 'droning',
 8779: 'lds',
 21915: 'jaguar',
 34730: "cale's",
 1780: 'nicely',
 4591: 'mummy',
 18516: "lot's",
 10089: 'patch',
 50205: 'kerkhof',
 52164: "leader's",
 27647: "'movie",
 52165: 'uncomfirmed',
 40891: 'heirloom',
 47363: 'wrangle',
 52166: 'emotion\x85',
 52167: "'stargate'",
 40892: 'pinoy',
 40893: 'conchatta',
 41131: 'broeke',
 40894: 'advisedly',
 17639: "barker's",
 52169: 'descours',
 775: 'lots',
 9262: 'lotr',
 9882: 'irs',
 52170: 'lott',
 40895: 'xvi',
 34731: 'irk',
 52171: 'irl',
 6890: 'ira',
 21916: 'belzer',
 52172: 'irc',
 27648: 'ire',
 40896: 'requisites',
 7696: 'discipline',
 52964: 'lyoko',
 11313: 'extend',
 876: 'nature',
 52173: "'dickie'",
 40897: 'optimist',
 30589: 'lapping',
 3903: 'superficial',
 52174: 'vestment',
 2826: 'extent',
 52175: 'tendons',
 52176: "heller's",
 52177: 'quagmires',
 52178: 'miyako',
 20604: 'moocow',
 52179: "coles'",
 40898: 'lookit',
 52180: 'ravenously',
 40899: 'levitating',
 52181: 'perfunctorily',
 30590: 'lookin',
 40901: "lot'",
 52182: 'lookie',
 34873: 'fearlessly',
 52184: 'libyan',
 40902: 'fondles',
 35717: 'gopher',
 40904: 'wearying',
 52185: "nz's",
 27649: 'minuses',
 52186: 'puposelessly',
 52187: 'shandling',
 31271: 'decapitates',
 11932: 'humming',
 40905: "'nother",
 21917: 'smackdown',
 30591: 'underdone',
 40906: 'frf',
 52188: 'triviality',
 25251: 'fro',
 8780: 'bothers',
 52189: "'kensington",
 76: 'much',
 34733: 'muco',
 22618: 'wiseguy',
 27651: "richie's",
 40907: 'tonino',
 52190: 'unleavened',
 11590: 'fry',
 40908: "'tv'",
 40909: 'toning',
 14364: 'obese',
 30592: 'sensationalized',
 40910: 'spiv',
 6262: 'spit',
 7367: 'arkin',
 21918: 'charleton',
 16826: 'jeon',
 21919: 'boardroom',
 4992: 'doubts',
 3087: 'spin',
 53086: 'hepo',
 27652: 'wildcat',
 10587: 'venoms',
 52194: 'misconstrues',
 18517: 'mesmerising',
 40911: 'misconstrued',
 52195: 'rescinds',
 52196: 'prostrate',
 40912: 'majid',
 16482: 'climbed',
 34734: 'canoeing',
 52198: 'majin',
 57807: 'animie',
 40913: 'sylke',
 14902: 'conditioned',
 40914: 'waddell',
 52199: '3\x85',
 41191: 'hyperdrive',
 34735: 'conditioner',
 53156: 'bricklayer',
 2579: 'hong',
 52201: 'memoriam',
 30595: 'inventively',
 25252: "levant's",
 20641: 'portobello',
 52203: 'remand',
 19507: 'mummified',
 27653: 'honk',
 19508: 'spews',
 40915: 'visitations',
 52204: 'mummifies',
 25253: 'cavanaugh',
 23388: 'zeon',
 40916: "jungle's",
 34736: 'viertel',
 27654: 'frenchmen',
 52205: 'torpedoes',
 52206: 'schlessinger',
 34737: 'torpedoed',
 69879: 'blister',
 52207: 'cinefest',
 34738: 'furlough',
 52208: 'mainsequence',
 40917: 'mentors',
 9097: 'academic',
 20605: 'stillness',
 40918: 'academia',
 52209: 'lonelier',
 52210: 'nibby',
 52211: "losers'",
 40919: 'cineastes',
 4452: 'corporate',
 40920: 'massaging',
 30596: 'bellow',
 19509: 'absurdities',
 53244: 'expetations',
 40921: 'nyfiken',
 75641: 'mehras',
 52212: 'lasse',
 52213: 'visability',
 33949: 'militarily',
 52214: "elder'",
 19026: 'gainsbourg',
 20606: 'hah',
 13423: 'hai',
 34739: 'haj',
 25254: 'hak',
 4314: 'hal',
 4895: 'ham',
 53262: 'duffer',
 52216: 'haa',
 69: 'had',
 11933: 'advancement',
 16828: 'hag',
 25255: "hand'",
 13424: 'hay',
 20607: 'mcnamara',
 52217: "mozart's",
 30734: 'duffel',
 30597: 'haq',
 13890: 'har',
 47: 'has',
 2404: 'hat',
 40922: 'hav',
 30598: 'haw',
 52218: 'figtings',
 15498: 'elders',
 52219: 'underpanted',
 52220: 'pninson',
 27655: 'unequivocally',
 23676: "barbara's",
 52222: "bello'",
 13000: 'indicative',
 40923: 'yawnfest',
 52223: 'hexploitation',
 52224: "loder's",
 27656: 'sleuthing',
 32625: "justin's",
 52225: "'ball",
 52226: "'summer",
 34938: "'demons'",
 52228: "mormon's",
 34740: "laughton's",
 52229: 'debell',
 39727: 'shipyard',
 30600: 'unabashedly',
 40404: 'disks',
 2293: 'crowd',
 10090: 'crowe',
 56437: "vancouver's",
 34741: 'mosques',
 6630: 'crown',
 52230: 'culpas',
 27657: 'crows',
 53347: 'surrell',
 52232: 'flowless',
 52233: 'sheirk',
 40926: "'three",
 52234: "peterson'",
 52235: 'ooverall',
 40927: 'perchance',
 1324: 'bottom',
 53366: 'chabert',
 52236: 'sneha',
 13891: 'inhuman',
 52237: 'ichii',
 52238: 'ursla',
 30601: 'completly',
 40928: 'moviedom',
 52239: 'raddick',
 51998: 'brundage',
 40929: 'brigades',
 1184: 'starring',
 52240: "'goal'",
 52241: 'caskets',
 52242: 'willcock',
 52243: "threesome's",
 52244: "mosque'",
 52245: "cover's",
 17640: 'spaceships',
 40930: 'anomalous',
 27658: 'ptsd',
 52246: 'shirdan',
 21965: 'obscenity',
 30602: 'lemmings',
 30603: 'duccio',
 52247: "levene's",
 52248: "'gorby'",
 25258: "teenager's",
 5343: 'marshall',
 9098: 'honeymoon',
 3234: 'shoots',
 12261: 'despised',
 52249: 'okabasho',
 8292: 'fabric',
 18518: 'cannavale',
 3540: 'raped',
 52250: "tutt's",
 17641: 'grasping',
 18519: 'despises',
 40931: "thief's",
 8929: 'rapes',
 52251: 'raper',
 27659: "eyre'",
 52252: 'walchek',
 23389: "elmo's",
 40932: 'perfumes',
 21921: 'spurting',
 52253: "exposition'\x85",
 52254: 'denoting',
 34743: 'thesaurus',
 40933: "shoot'",
 49762: 'bonejack',
 52256: 'simpsonian',
 30604: 'hebetude',
 34744: "hallow's",
 52257: 'desperation\x85',
 34745: 'incinerator',
 10311: 'congratulations',
 52258: 'humbled',
 5927: "else's",
 40848: 'trelkovski',
 52259: "rape'",
 59389: "'chapters'",
 52260: '1600s',
 7256: 'martian',
 25259: 'nicest',
 52262: 'eyred',
 9460: 'passenger',
 6044: 'disgrace',
 52263: 'moderne',
 5123: 'barrymore',
 52264: 'yankovich',
 40934: 'moderns',
 52265: 'studliest',
 52266: 'bedsheet',
 14903: 'decapitation',
 52267: 'slurring',
 52268: "'nunsploitation'",
 34746: "'character'",
 9883: 'cambodia',
 52269: 'rebelious',
 27660: 'pasadena',
 40935: 'crowne',
 52270: "'bedchamber",
 52271: 'conjectural',
 52272: 'appologize',
 52273: 'halfassing',
 57819: 'paycheque',
 20609: 'palms',
 52274: "'islands",
 40936: 'hawked',
 21922: 'palme',
 40937: 'conservatively',
 64010: 'larp',
 5561: 'palma',
 21923: 'smelling',
 13001: 'aragorn',
 52275: 'hawker',
 52276: 'hawkes',
 3978: 'explosions',
 8062: 'loren',
 52277: "pyle's",
 6707: 'shootout',
 18520: "mike's",
 52278: "driscoll's",
 40938: 'cogsworth',
 52279: "britian's",
 34747: 'childs',
 52280: "portrait's",
 3629: 'chain',
 2500: 'whoever',
 52281: 'puttered',
 52282: 'childe',
 52283: 'maywether',
 3039: 'chair',
 52284: "rance's",
 34748: 'machu',
 4520: 'ballet',
 34749: 'grapples',
 76155: 'summerize',
 30606: 'freelance',
 52286: "andrea's",
 52287: '\x91very',
 45882: 'coolidge',
 18521: 'mache',
 52288: 'balled',
 40940: 'grappled',
 18522: 'macha',
 21924: 'underlining',
 5626: 'macho',
 19510: 'oversight',
 25260: 'machi',
 11314: 'verbally',
 21925: 'tenacious',
 40941: 'windshields',
 18560: 'paychecks',
 3399: 'jerk',
 11934: "good'",
 34751: 'prancer',
 21926: 'prances',
 52289: 'olympus',
 21927: 'lark',
 10788: 'embark',
 7368: 'gloomy',
 52290: 'jehaan',
 52291: 'turaqui',
 20610: "child'",
 2897: 'locked',
 52292: 'pranced',
 2591: 'exact',
 52293: 'unattuned',
 786: 'minute',
 16121: 'skewed',
 40943: 'hodgins',
 34752: 'skewer',
 52294: 'think\x85',
 38768: 'rosenstein',
 52295: 'helmit',
 34753: 'wrestlemanias',
 16829: 'hindered',
 30607: "martha's",
 52296: 'cheree',
 52297: "pluckin'",
 40944: 'ogles',
 11935: 'heavyweight',
 82193: 'aada',
 11315: 'chopping',
 61537: 'strongboy',
 41345: 'hegemonic',
 40945: 'adorns',
 41349: 'xxth',
 34754: 'nobuhiro',
 52301: 'capitães',
 52302: 'kavogianni',
 13425: 'antwerp',
 6541: 'celebrated',
 52303: 'roarke',
 40946: 'baggins',
 31273: 'cheeseburgers',
 52304: 'matras',
 52305: "nineties'",
 52306: "'craig'",
 13002: 'celebrates',
 3386: 'unintentionally',
 14365: 'drafted',
 52307: 'climby',
 52308: '303',
 18523: 'oldies',
 9099: 'climbs',
 9658: 'honour',
 34755: 'plucking',
 30077: '305',
 5517: 'address',
 40947: 'menjou',
 42595: "'freak'",
 19511: 'dwindling',
 9461: 'benson',
 52310: 'white’s',
 40948: 'shamelessness',
 21928: 'impacted',
 52311: 'upatz',
 3843: 'cusack',
 37570: "flavia's",
 52312: 'effette',
 34756: 'influx',
 52313: 'boooooooo',
 52314: 'dimitrova',
 13426: 'houseman',
 25262: 'bigas',
 52315: 'boylen',
 52316: 'phillipenes',
 40949: 'fakery',
 27661: "grandpa's",
 27662: 'darnell',
 19512: 'undergone',
 52318: 'handbags',
 21929: 'perished',
 37781: 'pooped',
 27663: 'vigour',
 3630: 'opposed',
 52319: 'etude',
 11802: "caine's",
 52320: 'doozers',
 34757: 'photojournals',
 52321: 'perishes',
 34758: 'constrains',
 40951: 'migenes',
 30608: 'consoled',
 16830: 'alastair',
 52322: 'wvs',
 52323: 'ooooooh',
 34759: 'approving',
 40952: 'consoles',
 52067: 'disparagement',
 52325: 'futureistic',
 52326: 'rebounding',
 52327: "'date",
 52328: 'gregoire',
 21930: 'rutherford',
 34760: 'americanised',
 82199: 'novikov',
 1045: 'following',
 34761: 'munroe',
 52329: "morita'",
 52330: 'christenssen',
 23109: 'oatmeal',
 25263: 'fossey',
 40953: 'livered',
 13003: 'listens',
 76167: "'marci",
 52333: "otis's",
 23390: 'thanking',
 16022: 'maude',
 34762: 'extensions',
 52335: 'ameteurish',
 52336: "commender's",
 27664: 'agricultural',
 4521: 'convincingly',
 17642: 'fueled',
 54017: 'mahattan',
 40955: "paris's",
 52339: 'vulkan',
 52340: 'stapes',
 52341: 'odysessy',
 12262: 'harmon',
 4255: 'surfing',
 23497: 'halloran',
 49583: 'unbelieveably',
 52342: "'offed'",
 30610: 'quadrant',
 19513: 'inhabiting',
 34763: 'nebbish',
 40956: 'forebears',
 34764: 'skirmish',
 52343: 'ocassionally',
 52344: "'resist",
 21931: 'impactful',
 52345: 'spicier',
 40957: 'touristy',
 52346: "'football'",
 40958: 'webpage',
 52348: 'exurbia',
 52349: 'jucier',
 14904: 'professors',
 34765: 'structuring',
 30611: 'jig',
 40959: 'overlord',
 25264: 'disconnect',
 82204: 'sniffle',
 40960: 'slimeball',
 40961: 'jia',
 16831: 'milked',
 40962: 'banjoes',
 1240: 'jim',
 52351: 'workforces',
 52352: 'jip',
 52353: 'rotweiller',
 34766: 'mundaneness',
 52354: "'ninja'",
 11043: "dead'",
 40963: "cipriani's",
 20611: 'modestly',
 52355: "professor'",
 40964: 'shacked',
 34767: 'bashful',
 23391: 'sorter',
 16123: 'overpowering',
 18524: 'workmanlike',
 27665: 'henpecked',
 18525: 'sorted',
 52357: "jōb's",
 52358: "'always",
 34768: "'baptists",
 52359: 'dreamcatchers',
 52360: "'silence'",
 21932: 'hickory',
 52361: 'fun\x97yet',
 52362: 'breakumentary',
 15499: 'didn',
 52363: 'didi',
 52364: 'pealing',
 40965: 'dispite',
 25265: "italy's",
 21933: 'instability',
 6542: 'quarter',
 12611: 'quartet',
 52365: 'padmé',
 52366: "'bleedmedry",
 52367: 'pahalniuk',
 52368: 'honduras',
 10789: 'bursting',
 41468: "pablo's",
 52370: 'irremediably',
 40966: 'presages',
 57835: 'bowlegged',
 65186: 'dalip',
 6263: 'entering',
 76175: 'newsradio',
 54153: 'presaged',
 27666: "giallo's",
 40967: 'bouyant',
 52371: 'amerterish',
 18526: 'rajni',
 30613: 'leeves',
 34770: 'macauley',
 615: 'seriously',
 52372: 'sugercoma',
 52373: 'grimstead',
 52374: "'fairy'",
 30614: 'zenda',
 52375: "'twins'",
 17643: 'realisation',
 27667: 'highsmith',
 7820: 'raunchy',
 40968: 'incentives',
 52377: 'flatson',
 35100: 'snooker',
 16832: 'crazies',
 14905: 'crazier',
 7097: 'grandma',
 52378: 'napunsaktha',
 30615: 'workmanship',
 52379: 'reisner',
 61309: "sanford's",
 52380: '\x91doña',
 6111: 'modest',
 19156: "everything's",
 40969: 'hamer',
 52382: "couldn't'",
 13004: 'quibble',
 52383: 'socking',
 21934: 'tingler',
 52384: 'gutman',
 40970: 'lachlan',
 52385: 'tableaus',
 52386: 'headbanger',
 2850: 'spoken',
 34771: 'cerebrally',
 23493: "'road",
 21935: 'tableaux',
 40971: "proust's",
 40972: 'periodical',
 52388: "shoveller's",
 25266: 'tamara',
 17644: 'affords',
 3252: 'concert',
 87958: "yara's",
 52389: 'someome',
 8427: 'lingering',
 41514: "abraham's",
 34772: 'beesley',
 34773: 'cherbourg',
 28627: 'kagan',
 9100: 'snatch',
 9263: "miyazaki's",
 25267: 'absorbs',
 40973: "koltai's",
 64030: 'tingled',
 19514: 'crossroads',
 16124: 'rehab',
 52392: 'falworth',
 52393: 'sequals',
 ...}

We can see that the text data is already preprocessed for us.

In [33]:
print('Number of reviews', len(X_train))
print('Length of first and fifth review before padding', len(X_train[0]) ,len(X_train[4]))
print('First review', X_train[0])
print('First label', y_train[0])
Number of reviews 25000
Length of first and fifth review before padding 218 147
First review [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
First label 1

Here's an example of using the index-to-word mapping we created from the loaded JSON file to view a review in its original form.

In [34]:
def show_review(x):
    review = ' '.join([idx2word[idx] for idx in x])
    print(review)

show_review(X_train[0])
 this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert  is an amazing actor and now the same being director  father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for  and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also  to the two little boy's that played the  of norman and paul they were just brilliant children are often left out of the  list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all

The only thing that isn't done for us is the padding. Looking at the distribution of review lengths will help us determine a reasonable length to pad to.

In [35]:
plt.hist([len(x) for x in X_train])
plt.title('review lengths');

We saw one way of doing this earlier, but Keras actually has a built-in pad_sequences helper function. It handles both padding and truncating. By default, padding is added to the beginning of a sequence, and sequences longer than maxlen are truncated from the beginning as well.
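
As a quick standalone illustration (this small example is not part of the lab's pipeline), the default behavior pads and truncates from the front:

from tensorflow.keras.preprocessing.sequence import pad_sequences

demo = [[1, 2, 3], [4, 5, 6, 7, 8, 9]]
print(pad_sequences(demo, maxlen=5))
# [[0 0 1 2 3]   <- the short sequence is padded with zeros at the front
#  [5 6 7 8 9]]  <- the long sequence is truncated from the front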

Q: Why might we want to truncate? Why might we want to pad from the beginning?
In [36]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
In [37]:
MAX_LEN = 500
X_train = pad_sequences(X_train, maxlen=MAX_LEN)
X_test = pad_sequences(X_test, maxlen=MAX_LEN)
print('Length of first and fifth review after padding', len(X_train[0]) ,len(X_train[4]))
Length of first and fifth review after padding 500 500

Model 1: Naive Feed-Forward Network

Let us build a single-layer feed-forward net with a hidden layer of 250 nodes. Each input is a 500-dimensional vector of token indices, since we padded all our sequences to length 500.


Q: How would you calculate the number of parameters in this network?
In [40]:
model = Sequential(name='Naive_FFNN')
model.add(Dense(250, activation='relu',input_dim=MAX_LEN))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=128, verbose=2)

scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
Model: "Naive_FFNN"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_2 (Dense)              (None, 250)               125250
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 251
=================================================================
Total params: 125,501
Trainable params: 125,501
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/10
196/196 - 1s - loss: 178.4060 - accuracy: 0.4996 - val_loss: 91.7812 - val_accuracy: 0.4996
Epoch 2/10
196/196 - 0s - loss: 48.6640 - accuracy: 0.5822 - val_loss: 48.4361 - val_accuracy: 0.5026
Epoch 3/10
196/196 - 0s - loss: 17.7305 - accuracy: 0.6612 - val_loss: 31.7317 - val_accuracy: 0.5022
Epoch 4/10
196/196 - 0s - loss: 7.5028 - accuracy: 0.7264 - val_loss: 21.0285 - val_accuracy: 0.5017
Epoch 5/10
196/196 - 0s - loss: 3.9465 - accuracy: 0.7623 - val_loss: 15.6753 - val_accuracy: 0.5025
Epoch 6/10
196/196 - 0s - loss: 2.2523 - accuracy: 0.7980 - val_loss: 12.4736 - val_accuracy: 0.5039
Epoch 7/10
196/196 - 0s - loss: 1.4916 - accuracy: 0.8150 - val_loss: 10.7774 - val_accuracy: 0.5057
Epoch 8/10
196/196 - 0s - loss: 1.1314 - accuracy: 0.8334 - val_loss: 9.6000 - val_accuracy: 0.5060
Epoch 9/10
196/196 - 0s - loss: 0.8617 - accuracy: 0.8504 - val_loss: 8.9963 - val_accuracy: 0.5055
Epoch 10/10
196/196 - 0s - loss: 0.7458 - accuracy: 0.8602 - val_loss: 8.7728 - val_accuracy: 0.5083
Accuracy: 50.83%
Q: Why was the performance so poor? How could we improve our tokenization?

Model 2: Feed-Forward Network /w Embeddings

One can view the embedding process as a linear projection from one vector space to another. For NLP, we usually use embeddings to project the sparse one-hot encodings of words on to a lower-dimensional continuous space so that the input surface is 'dense' and possibly smooth. Thus, one can view this embedding layer process as just a transformation from $\mathbb{R}^{inp}$ to $\mathbb{R}^{emb}$

This not only reduces dimensionality but also allows semantic similarities between tokens to be captured by 'similarities' between the embedding vectors. This was not possible with one-hot encodings, as all of those vectors are orthogonal to one another.

It is also possible to load pretrained embeddings that were learned from giant corpora. This would be an instance of transfer learning.

If you are interested in learning more, start with the astronomically impactful word2vec and GloVe papers.
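
To make the projection idea concrete, here is a tiny NumPy sketch (the sizes and values are made up purely for illustration): multiplying a one-hot vector by an embedding matrix selects one of its rows, which is exactly the lookup an Embedding layer performs.

import numpy as np

vocab_size, embed_dim = 5, 3
E = np.random.rand(vocab_size, embed_dim)     # embedding matrix mapping R^inp -> R^emb

token_id = 2
one_hot = np.eye(vocab_size)[token_id]        # sparse one-hot encoding of the token

print(np.allclose(one_hot @ E, E[token_id]))  # True: the projection is just a row lookup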

In Keras we use the Embedding layer:

tf.keras.layers.Embedding(
    input_dim, output_dim, embeddings_initializer='uniform',
    embeddings_regularizer=None, activity_regularizer=None,
    embeddings_constraint=None, mask_zero=False, input_length=None, **kwargs
)

We'll need to specify the input_dim and output_dim. If working with sequences, as we are, you'll also need to set the input_length.

In [42]:
EMBED_DIM = 100

model = Sequential(name='FFNN_EMBED')
model.add(Embedding(MAX_VOCAB, EMBED_DIM, input_length=MAX_LEN))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=2, batch_size=128, verbose=2)

scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
Model: "FFNN_EMBED"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 500, 100)          1000000
_________________________________________________________________
flatten_1 (Flatten)          (None, 50000)             0
_________________________________________________________________
dense_6 (Dense)              (None, 250)               12500250
_________________________________________________________________
dense_7 (Dense)              (None, 1)                 251
=================================================================
Total params: 13,500,501
Trainable params: 13,500,501
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/2
196/196 - 6s - loss: 0.6433 - accuracy: 0.6078 - val_loss: 0.3630 - val_accuracy: 0.8497
Epoch 2/2
196/196 - 6s - loss: 0.2349 - accuracy: 0.9025 - val_loss: 0.2977 - val_accuracy: 0.8747
Accuracy: 87.47%

Model 3: 1-Dimensional Convolutional Network

Text can be thought of as a 1-dimensional sequence (a single, long vector), and we can apply 1D convolutions over the corresponding sequence of word embeddings.

More information on convolutions on text data can be found on this blog. If you want to learn more, read this published and well-cited paper from Eleni's friend, Byron Wallace.

Q: Why do we use Conv1D if our input, a sequence of word embeddings, is 2D?
In [43]:
model = Sequential(name='1D_CNN')
model.add(Embedding(MAX_VOCAB, EMBED_DIM, input_length=MAX_LEN))
model.add(Conv1D(filters=200, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPool1D(pool_size=2))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

model.fit(X_train, y_train, epochs=2, batch_size=128)

scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
Model: "1D_CNN"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_2 (Embedding)      (None, 500, 100)          1000000
_________________________________________________________________
conv1d (Conv1D)              (None, 500, 200)          60200
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 250, 200)          0
_________________________________________________________________
flatten_2 (Flatten)          (None, 50000)             0
_________________________________________________________________
dense_8 (Dense)              (None, 250)               12500250
_________________________________________________________________
dense_9 (Dense)              (None, 1)                 251
=================================================================
Total params: 13,560,701
Trainable params: 13,560,701
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/2
196/196 [==============================] - 9s 34ms/step - loss: 0.5958 - accuracy: 0.6403
Epoch 2/2
196/196 [==============================] - 7s 34ms/step - loss: 0.1796 - accuracy: 0.9358
Accuracy: 88.69%
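
As a sanity check on the Conv1D entry in the summary above: each of the 200 filters spans the full 100-dimensional embedding over a window of 3 time steps, so the layer has $200 \times (3 \times 100 + 1) = 60{,}200$ parameters (the $+1$ is the bias per filter), matching the count Keras reports.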

Model 4: Simple RNN

At a high-level, an RNN is similar to a feed-forward neural network (FFNN) in that there is an input layer, a hidden layer, and an output layer. The input layer is fully connected to the hidden layer, and the hidden layer is fully connected to the output layer. However, the crux of what makes it a recurrent neural network is that the hidden layer for a given time t is not only based on the input layer at time t but also the hidden layer from time t-1.

Here's a popular blog post on The Unreasonable Effectiveness of Recurrent Neural Networks.
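
A minimal NumPy sketch of that recurrence (the shapes and random initialization here are illustrative assumptions, not Keras's internals): the hidden state at time t mixes the current input with the previous hidden state.

import numpy as np

input_dim, hidden_dim, T = 4, 3, 5
W_x = np.random.randn(input_dim, hidden_dim)   # input-to-hidden weights
W_h = np.random.randn(hidden_dim, hidden_dim)  # hidden-to-hidden (recurrent) weights
b = np.zeros(hidden_dim)

xs = np.random.randn(T, input_dim)             # a toy sequence of T time steps
h = np.zeros(hidden_dim)                       # initial hidden state
for x_t in xs:
    h = np.tanh(x_t @ W_x + h @ W_h + b)       # h_t depends on both x_t and h_{t-1}
print(h.shape)                                 # (3,) -- like SimpleRNN(3) with return_sequences=False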

In Keras, the vanilla RNN unit is implemented as the SimpleRNN layer:

tf.keras.layers.SimpleRNN(
    units, activation='tanh', use_bias=True,
    kernel_initializer='glorot_uniform',
    recurrent_initializer='orthogonal',
    bias_initializer='zeros', kernel_regularizer=None,
    recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None,
    kernel_constraint=None, recurrent_constraint=None, bias_constraint=None,
    dropout=0.0, recurrent_dropout=0.0, return_sequences=False, return_state=False,
    go_backwards=False, stateful=False, unroll=False, **kwargs
)

As you can see, recurrent layers in Keras take many arguments. We only need to be concerned with units, which specifies the size of the hidden state, and return_sequences, which will be discussed shortly. For the moment it is fine to leave it set to the default of False.

Due to the limitations of the vanilla RNN unit (more on that next) it tends not to be used much in practice. For this reason, it seems the Keras developers neglected to implement GPU acceleration for this layer! Notice how much slower the training is, even for a network with far fewer parameters.

In [45]:
model = Sequential(name='SimpleRNN')
model.add(Embedding(MAX_VOCAB, EMBED_DIM, input_length=MAX_LEN))
model.add(SimpleRNN(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

model.fit(X_train, y_train, epochs=3, batch_size=128)

scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
Model: "SimpleRNN"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_3 (Embedding)      (None, 500, 100)          1000000
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, 100)               20100
_________________________________________________________________
dense_10 (Dense)             (None, 1)                 101
=================================================================
Total params: 1,020,201
Trainable params: 1,020,201
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/3
196/196 [==============================] - 53s 267ms/step - loss: 0.6720 - accuracy: 0.5660
Epoch 2/3
196/196 [==============================] - 52s 266ms/step - loss: 0.5283 - accuracy: 0.7444
Epoch 3/3
196/196 [==============================] - 52s 265ms/step - loss: 0.3406 - accuracy: 0.8588
Accuracy: 83.13%

Vanishing/Exploding Gradients


We need to backpropagate through every time step to calculate the gradients used for our weight updates.

This requires the use of the chain rule which amounts to repeated multiplications.

This can cause two types of problems. First, this product can quickly 'explode,' becoming very large and causing destructive updates to the model as well as numerical overflow. One hack to solve this problem is to clip the gradient at some threshold.

Alternatively, the gradient can 'vanish,' getting smaller and smaller as it moves backwards in time. Gradient clipping will not help us here. If we can't propagate gradients sufficiently far back in time, then our network will be unable to learn long temporal dependencies. This problem motivates the architecture of the GRU and LSTM units as substitutes for the 'vanilla' RNN.

For a more detailed look at the vanishing/exploding gradient problem, please see Marios's excellent Advanced Section.
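
The clipping hack mentioned above is available directly on Keras optimizers via the clipnorm (or clipvalue) argument. A minimal sketch of how it could be wired into any of the compile steps in this notebook (the threshold of 1.0 is just an illustrative choice):

from tensorflow.keras.optimizers import Adam

clipped_adam = Adam(learning_rate=1e-3, clipnorm=1.0)  # clip gradient norms at 1.0 before each update
model.compile(loss='binary_crossentropy', optimizer=clipped_adam, metrics=['accuracy'])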

Model 5: GRU

  • $X_{t}$: input
  • $U$, $V$, and $\beta$: parameter matrices and vector
  • $\tilde{h_t}$: candidate activation vector
  • $h_{t}$: output vector
  • $R_t$: reset gate
  • $Z_t$: update gate

The gates of the GRU allow for the gradients to flow more freely to previous time steps, helping to mitigate the vanishing gradient problem.
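
For reference, one standard formulation of the GRU, written in the notation above (the grouping of the parameter matrices $U$, $V$ and bias $\beta$ per gate is one common convention, not necessarily the one used in lecture; some texts swap the roles of $Z_t$ and $1 - Z_t$):

$Z_t = \sigma(U_Z X_t + V_Z h_{t-1} + \beta_Z)$
$R_t = \sigma(U_R X_t + V_R h_{t-1} + \beta_R)$
$\tilde{h}_t = \tanh(U_h X_t + V_h (R_t \odot h_{t-1}) + \beta_h)$
$h_t = (1 - Z_t) \odot h_{t-1} + Z_t \odot \tilde{h}_t$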

In Keras, the GRU layer is used in exactly the same way as the SimpleRNN layer.

tf.keras.layers.GRU(
    units, activation='tanh', recurrent_activation='sigmoid',
    use_bias=True, kernel_initializer='glorot_uniform',
    recurrent_initializer='orthogonal',
    bias_initializer='zeros', kernel_regularizer=None,
    recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None,
    kernel_constraint=None, recurrent_constraint=None, bias_constraint=None,
    dropout=0.0, recurrent_dropout=0.0, return_sequences=False, return_state=False,
    go_backwards=False, stateful=False, unroll=False, time_major=False,
    reset_after=True, **kwargs
)

Here we just swap it into the previous architecture. Note how much faster it trains with GPU acceleration than the simple RNN!

In [48]:
model = Sequential(name='GRU')
model.add(Embedding(MAX_VOCAB, EMBED_DIM, input_length=MAX_LEN))
model.add(GRU(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

model.fit(X_train, y_train, epochs=3, batch_size=64)

scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
Model: "GRU"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_6 (Embedding)      (None, 500, 100)          1000000
_________________________________________________________________
gru_1 (GRU)                  (None, 100)               60600
_________________________________________________________________
dense_13 (Dense)             (None, 1)                 101
=================================================================
Total params: 1,060,701
Trainable params: 1,060,701
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/3
391/391 [==============================] - 13s 30ms/step - loss: 0.5626 - accuracy: 0.6781
Epoch 2/3
391/391 [==============================] - 12s 30ms/step - loss: 0.2510 - accuracy: 0.9011
Epoch 3/3
391/391 [==============================] - 12s 30ms/step - loss: 0.1757 - accuracy: 0.9349
Accuracy: 88.02%

Model 6: LSTM

The LSTM lacks the GRU's 'short cut' connection (see GRU's $h_t$ above).

The LSTM also has a distinct 'cell state' in addition to the hidden state.
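
For comparison, one standard formulation of the LSTM cell in similar notation (again, the exact grouping and naming of the parameters is one common convention); $F_t$, $I_t$, and $O_t$ are the forget, input, and output gates, and $C_t$ is the cell state:

$F_t = \sigma(U_F X_t + V_F h_{t-1} + \beta_F)$
$I_t = \sigma(U_I X_t + V_I h_{t-1} + \beta_I)$
$O_t = \sigma(U_O X_t + V_O h_{t-1} + \beta_O)$
$\tilde{C}_t = \tanh(U_C X_t + V_C h_{t-1} + \beta_C)$
$C_t = F_t \odot C_{t-1} + I_t \odot \tilde{C}_t$
$h_t = O_t \odot \tanh(C_t)$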

Further reading:

Again, Keras's LSTM works like all the other recurrent layers.

tf.keras.layers.LSTM(
    units, activation='tanh', recurrent_activation='sigmoid',
    use_bias=True, kernel_initializer='glorot_uniform',
    recurrent_initializer='orthogonal',
    bias_initializer='zeros', unit_forget_bias=True,
    kernel_regularizer=None, recurrent_regularizer=None, bias_regularizer=None,
    activity_regularizer=None, kernel_constraint=None, recurrent_constraint=None,
    bias_constraint=None, dropout=0.0, recurrent_dropout=0.0,
    return_sequences=False, return_state=False, go_backwards=False, stateful=False,
    time_major=False, unroll=False, **kwargs
)
In [47]:
model = Sequential(name='LSTM')
model.add(Embedding(MAX_VOCAB, EMBED_DIM, input_length=MAX_LEN))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

model.fit(X_train, y_train, epochs=3, batch_size=64)

scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
Model: "LSTM"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_5 (Embedding)      (None, 500, 100)          1000000
_________________________________________________________________
lstm (LSTM)                  (None, 100)               80400
_________________________________________________________________
dense_12 (Dense)             (None, 1)                 101
=================================================================
Total params: 1,080,501
Trainable params: 1,080,501
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/3
391/391 [==============================] - 14s 33ms/step - loss: 0.5209 - accuracy: 0.7265
Epoch 2/3
391/391 [==============================] - 13s 33ms/step - loss: 0.3275 - accuracy: 0.8671
Epoch 3/3
391/391 [==============================] - 13s 33ms/step - loss: 0.2021 - accuracy: 0.9268
Accuracy: 86.39%

BiDirectional Layer

We may want our model to learn dependencies in either direction. A bidirectional RNN consists of two separate recurrent units: one processes the sequence from left to right, while the other processes that same sequence in reverse, from right to left. The outputs of the two units are then merged together (typically concatenated) and fed to the next layer of the network.

Creating a bidirectional RNN in Keras is quite simple. We just 'wrap' a recurrent layer in the Bidirectional layer. The default behavior is to concatenate the outputs from each direction.

tf.keras.layers.Bidirectional(
    layer, merge_mode='concat', weights=None, backward_layer=None,
    **kwargs
)

Example:

model = Sequential()
...
model.add(Bidirectional(SimpleRNN(n_nodes)))
...
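
A complete minimal sketch in the setting of this lab, reusing MAX_VOCAB, EMBED_DIM, and MAX_LEN from above (the layer size of 50 is just an illustrative choice; this model is not trained here):

model = Sequential(name='BiGRU_sketch')
model.add(Embedding(MAX_VOCAB, EMBED_DIM, input_length=MAX_LEN))
model.add(Bidirectional(GRU(50)))   # forward and backward outputs are concatenated -> 100 features
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])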

Deep RNNs

We may want to stack RNN layers one after another. But there is a problem: a recurrent layer expects to be given a sequence as input, and yet we can see that the recurrent layer in each of our models above outputs a single vector. This is because the default behavior of Keras's recurrent layers is to suppress the output until the final time step. If we want to have two recurrent units in a row, then the first will have to give an output after each time step, thus providing a sequence to the second recurrent layer.

We can have our recurrent layers output at each time step by setting return_sequences=True.
Example:

model = Sequential()
...
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
...

TimeDistributed Layer

TimeDistributed is a 'wrapper' that applies a layer to all time steps of an input sequence.

tf.keras.layers.TimeDistributed(
    layer, **kwargs
)

We use TimeDistributed when we want to input a sequence into a layer that doesn't normally expect a time dimension, such as Dense.

In [146]:
model = Sequential()
model.add(TimeDistributed(Dense(8), input_shape=(3, 5)))
input_array = np.random.randint(10, size=(1,3,5))
print("Shape of input : ", input_array.shape)

model.compile('rmsprop', 'mse')
output_array = model.predict(input_array)
print("Shape of output : ", output_array.shape)
Shape of input :  (1, 3, 5)
Shape of output :  (1, 3, 8)

RepeatVector Layer

RepeatVector repeats the vector a specified number of times. Dimension changes from
(batch_size, number_of_elements)
to
(batch_size, number_of_repetitions, number_of_elements)

This effectively generates a sequence from a single input.

In [88]:
model = Sequential()
model.add(Dense(2, input_dim=1))
model.add(RepeatVector(3))
model.summary()
Model: "sequential_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_37 (Dense)             (None, 2)                 4
_________________________________________________________________
repeat_vector_5 (RepeatVecto (None, 3, 2)              0
=================================================================
Total params: 4
Trainable params: 4
Non-trainable params: 0
_________________________________________________________________

Model 7: CNN + RNN

CNNs are good at learning spatial features, and sentences can be thought of as 1-D spatial vectors (the dimensionality is determined by the number of words in the sentence). We can then take the features learned by the CNN (after a maxpooling layer) and feed them into an RNN! We expect the CNN to be able to pick out invariant features across the 1-D spatial structure (i.e., the sentence) that characterize good and bad sentiment. These learned spatial features may then be processed as sequences by a recurrent layer. The classification step is then performed by a final dense layer.

Exercise: Build a CNN + Deep, BiDirectional GRU Model

Let's put together everything we've learned so far.
Create a network with:

  • word embeddings in a 100-dimensional space
  • conv layer with 32 filters, kernels of width 3, 'same' padding, and ReLU activation
  • max pooling of size 2
  • 2 bidirectional GRU layers, each with 50 units per direction
  • dense output layer for binary classification
In [39]:
model = Sequential(name='CNN_GRU')
# your code here
model.add(Embedding(MAX_VOCAB, 100, input_length=MAX_LEN))
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPool1D(pool_size=2))
model.add(Bidirectional(GRU(50, return_sequences=True)))
model.add(Bidirectional(GRU(50)))
model.add(Dense(1, activation='sigmoid'))
In [40]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

model.fit(X_train, y_train, epochs=3, batch_size=64)

scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
Model: "CNN_GRU"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, 500, 100)          1000000
_________________________________________________________________
conv1d (Conv1D)              (None, 500, 32)           9632
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 250, 32)           0
_________________________________________________________________
bidirectional (Bidirectional (None, 250, 100)          25200
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100)               45600
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 101
=================================================================
Total params: 1,080,533
Trainable params: 1,080,533
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/3
391/391 [==============================] - 27s 43ms/step - loss: 0.5076 - accuracy: 0.7144
Epoch 2/3
391/391 [==============================] - 17s 43ms/step - loss: 0.1790 - accuracy: 0.9341
Epoch 3/3
391/391 [==============================] - 17s 43ms/step - loss: 0.1019 - accuracy: 0.9653
Accuracy: 87.90%

What is the worst movie review in the test set according to your model? 🍅

In [41]:
preds = model.predict_proba(X_test)
worst_review = X_test[preds.argmin()]
show_review(worst_review)
                                                                                                                                                                                                                                                                                                                                                                                          steven seagal has made a really dull bad and boring movie steven seagal plays a doctor this movie has got a few action scenes but they are poorly directed and have nothing to do with the rest of the movie a group of american nazis spread a lethal virus which is able to wipe out the state of montana wesley seagal s character tries desperately to find a cure and that is the story of the  the  is an extremely boring film because nothing happens it is filled with boring dialogue and illogical gaps between events and stupid actors steven seagal has totally up in this movie and i would not recommend this  to my worst enemy 3 10

What is the best movie review in the test set according to your model? 🏆

In [78]:
best_review = X_test[preds.argmax()]
show_review(best_review)
 village with their artist uncle john  after the death of their parents  sister and john's brother simon has given up trying to convince john to allow he and susan to take care of the children and have  to using private detectives to catch him in either  behavior or unemployed and therefore unable to care for the children properly susan finally decides to take matters into her own hands and goes to  village herself posing as an actress to try to gain information and or  him to see reason what she discovers however is that she not only likes the free and artistic lifestyle john and his friends are living and that the girls are being brought up well but that she is quickly falling in love with john inevitably her true identity is discovered and she is faced with the task of convincing everyone on both sides of the custody debate who should belong with whom br br i really enjoyed this film and found that its very short running time 70 minutes was the perfect length to spin this simple but endearing story  hopkins one of the great 1930's 1940's actresses is delightful in this film her energy style and wholesome beauty really lend themselves to creating an endearing character even though you know that she's pulling a fast one on the people she quickly befriends this is the earliest film i've seen ray  in and he was actually young and non  looking and apparently three years younger than his co star his energy and  manner in wise girl were a refreshing change to the demeanor he affects in his usual darker films honestly though i am usually not remotely a fan of child actors i really enjoyed the two young girls who played   they were   and were really the  of the film unfortunately i can't dig up any other films that either of them were subsequently in after this one which is a shame since both  a large amount of natural talent br br wise girl was a film that was made three years after the hollywood code was and to some extent this was  clear by the quick happy ending and the pie in the sky and ease with which the characters lived the alleged  co  was in fact a gorgeous  de  where the artists lived for free or for trade and everything is tied up very nicely throughout fortunately this was a light enough film and the characters were charming enough to make  for its  and short  and i was able to just take wise girl for what it was a good old fashioned love story that was as entertaining as it was endearing unfortunately films of the romantic comedy drama genre today are considerably less intelligent and entertaining or i wouldn't find myself continuously returning to the classics 7 10
End of Exercise: Please return to the main room

Heavy Metal Lyric Generator

Here we'll design an RNN to generate song lyrics character by character!

The model will take in 40-character 'windows' of text and predict the most probable next character. This new character is then appended to the original sequence, the first character is dropped, and this new sequence is fed back into the model. We can repeat this process for as long as we like to generate output of arbitrary length.

In [89]:
metal_df = pd.read_csv('data/metal_lyrics_PG.csv')
In [90]:
metal_df.shape
Out[90]:
(4785, 4)

How do we know these are heavy metal lyrics?

In [91]:
metal_df[metal_df.lyrics.str.contains('elves')]
Out[91]:
song year artist lyrics
116 vinvm-sabbati 2001 behemoth waters running down\nby the silver moon rays\n...
197 gaya-s-dream 1993 the-gathering open the gates of the past\nwith the key to ou...
202 generations 2013 answer-with-metal look around, the air is full with fear and we ...
250 dark-of-the-sun 2006 arch-enemy like insects of the night, we are drawn into t...
258 shadows-and-dust 2006 arch-enemy at the mercy of our conscience\nconfined withi...
... ... ... ... ...
4589 scorn 2006 allegiance likes rats we strip the earth\nrampant animals...
4600 armies-of-valinor 2007 galadriel into the battle we ride again\nagainst the dar...
4609 new-priesthood 2006 dark-angel history's shown you that answers can't be foun...
4704 ride-for-glory 2007 dragonland \n"we yearn for the battle and the glory...but...
4782 principle-of-speed 2007 drifter i made an experience\nit just happens once in ...

107 rows × 4 columns

Ok, I'm convinced.

In [92]:
n_samples = 1000
lyrics_sample = metal_df.sample(n=n_samples, random_state=109).lyrics.values
In [93]:
raw_text = ' \n '.join(lyrics_sample)
# remove bad chars
raw_text = re.sub(r"[^\s\w']", "", raw_text)

chars = sorted(set(raw_text))
char2idx = dict((c,i) for i, c in enumerate(chars))
idx2char = dict((i, c) for i, c in enumerate(chars))

n_chars = len(raw_text)
n_vocab = len(chars)

print(f'Sample Corpus Length: {len(raw_text)}')
Sample Corpus Length: 720944

Creating Input/Target Pairs

We need to slice up our lyric data to create input/target pairs that can be fed to our model for its supervised prediction task. Each input will be a sequence of seq_len characters; this can be thought of as a sliding window across the concatenated lyric data. The target is the character immediately following that window in the training data.

In [94]:
# prepare the dataset of input to output pairs encoded as integers
seq_len = 40
seqs = []
targets = []
for i in range(0, n_chars - seq_len):
    seq = raw_text[i:i + seq_len]
    target = raw_text[i + seq_len]
    seqs.append([char2idx[char] for char in seq])
    targets.append(char2idx[target])
n_seqs = len(seqs)
print("Total Char Sequences: ", n_seqs)
Total Char Sequences:  720904

We can create a one-hot encoding by indexing into an n_vocab sized identity matrix using the character index values.

In [95]:
X = np.reshape(seqs, (-1, seq_len))
eye = np.eye(n_vocab)
X = eye[seqs]
y = eye[targets]
In [96]:
X.shape, y.shape
Out[96]:
((720904, 40, 29), (720904, 29))
In [97]:
# remove some large variables from memory
del metal_df
del lyrics_sample
del seqs

LambdaCallback

The loss score is usually not the best way to judge whether our language model is learning to generate 'quality' text. It would be better if we could periodically see examples of the kind of text it can generate as it trains, so we can judge for ourselves.

The LambdaCallback allows us to execute arbitrary functions at different points in the training process, which makes it especially useful when evaluating generative models. We'll use it to generate some sample text at the end of every other epoch.

In [98]:
from tensorflow.keras.callbacks import LambdaCallback
In [99]:
def on_epoch_end(epoch, _):
    # only triggers on every 2nd epoch
    if((epoch + 1) % 2 == 0):
        # select a random seed sequence
        start = np.random.randint(0, len(X)-1)
        seq = X[start]
        seed = ''.join([idx2char[np.argmax(x)] for x in seq])

        print(f"---Seed: \"{repr(seed)}\"---")
        print(f"{seed}", end='')
        # generate characters
        for i in range(200):
            x = seq.reshape(1, seq_len, -1)
            pred = model.predict(x, verbose=0)[0]
            # sampling gives us more 'serendipity' than argmax
#             index = np.argmax(pred)
            index = np.random.choice(n_vocab, p=pred)
            result = idx2char[index]
            sys.stdout.write(result)
            # shift sequence over
            seq[:-1] = seq[1:]
            seq[-1] = eye[index]
        print()

generate_text = LambdaCallback(on_epoch_end=on_epoch_end)

We then add the LambdaCallback to the callbacks list along with ModelCheckpoint and EarlyStopping to be passed to the fit() method at train time.

In [100]:
# define the checkpoint
model_name = 'metal-char'
filepath=f'models/{model_name}.hdf5'
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1,
                             save_weights_only=False,
                             save_best_only=True, mode='min')
es = EarlyStopping(monitor='loss', patience=3, verbose=0,
                   mode='auto',restore_best_weights=True)
callbacks_list = [checkpoint, generate_text, es]
Exercise: Build a Character-Based Lyric Generator

Architecture

  • Bidirectional LSTM with a hidden dimension of 128 in each direction
  • BatchNormalization to speed up training (Don't tell Pavlos!)
  • Dense output layer to predict the next character
In [11]:
# your code here
hidden_dim = 128
model = Sequential()
model.add(Bidirectional(LSTM(hidden_dim), input_shape=(seq_len, n_vocab)))
model.add(BatchNormalization())
model.add(Dense(n_vocab, activation='softmax'))
In [12]:
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
bidirectional (Bidirectional (None, 256)               161792
_________________________________________________________________
batch_normalization (BatchNo (None, 256)               1024
_________________________________________________________________
dense (Dense)                (None, 29)                7453
=================================================================
Total params: 170,269
Trainable params: 169,757
Non-trainable params: 512
_________________________________________________________________
In [16]:
model.fit(X, y, epochs=30, batch_size=128, callbacks=callbacks_list)
Epoch 1/30
5633/5633 [==============================] - 47s 7ms/step - loss: 2.1992

Epoch 00001: loss improved from inf to 2.02026, saving model to models/metal-char.hdf5
Epoch 2/30
5633/5633 [==============================] - 39s 7ms/step - loss: 1.7666

Epoch 00002: loss improved from 2.02026 to 1.72630, saving model to models/metal-char.hdf5
---Seed: ""eeze but why\nwon't you believe\nwon't you""---
eeze but why
won't you believe
won't you will how kyon't now it you've goe
th
so noundapop and so shand will foring hpres
and like heredging return
all the stong love a cryttgin
fe i'm noblows the glort
comes evermeht ard a gong holortions
Epoch 3/30
5633/5633 [==============================] - 39s 7ms/step - loss: 1.6288

Epoch 00003: loss improved from 1.72630 to 1.60815, saving model to models/metal-char.hdf5
Epoch 4/30
5633/5633 [==============================] - 39s 7ms/step - loss: 1.5469

Epoch 00004: loss improved from 1.60815 to 1.53622, saving model to models/metal-char.hdf5
---Seed: "'l know me inside and outside\ni am here t'"---
l know me inside and outside
i am here the scarture for some no andon gounder time
nair a flose the the shorious of perpore
and glow but arourd she we just you bark for go darkness's need the missing oh your
were the sinsh away
the sky thei
Epoch 5/30
5633/5633 [==============================] - 39s 7ms/step - loss: 1.4940

Epoch 00005: loss improved from 1.53622 to 1.48602, saving model to models/metal-char.hdf5
Epoch 6/30
5633/5633 [==============================] - 39s 7ms/step - loss: 1.4527

Epoch 00006: loss improved from 1.48602 to 1.44772, saving model to models/metal-char.hdf5
---Seed: "'y your kindred watch us pronouncing sent'"---
y your kindred watch us pronouncing sently life
we go the conceliw
 a watmallch age here release
wrock agard to the haunhown
your life worntly way
too just right yer so many you makes me am now
homence the geas to get away
take it down gol
Epoch 7/30
5633/5633 [==============================] - 39s 7ms/step - loss: 1.4195

Epoch 00007: loss improved from 1.44772 to 1.41843, saving model to models/metal-char.hdf5
Epoch 8/30
5633/5633 [==============================] - 39s 7ms/step - loss: 1.3948

Epoch 00008: loss improved from 1.41843 to 1.39306, saving model to models/metal-char.hdf5
---Seed: "'here\ndespite what you may think jesus wi'"---
here
despite what you may think jesus with that just sunkness
memories remember free
this i can never knees i am drifted
laying here
infore through the sign
i can't be the completeh something burning stamber cross
reach to our arms
than her
Epoch 9/30
5633/5633 [==============================] - 39s 7ms/step - loss: 1.3715

Epoch 00009: loss improved from 1.39306 to 1.37268, saving model to models/metal-char.hdf5
Epoch 10/30
5633/5633 [==============================] - 39s 7ms/step - loss: 1.3516

Epoch 00010: loss improved from 1.37268 to 1.35401, saving model to models/metal-char.hdf5
---Seed: "'ence what does it mean\nare you satisfied'"---
ence what does it mean
are you satisfied
are yelcary
our voodes breathled with
weak the pullable do ac truth
there's a neck reach we let me good yoursele
relefcion around freedom
noin the burnt to beer with home
sail becauses
let down it ha
Epoch 11/30
5633/5633 [==============================] - 39s 7ms/step - loss: 1.3380

Epoch 00011: loss improved from 1.35401 to 1.33822, saving model to models/metal-char.hdf5
Epoch 12/30
5633/5633 [==============================] - 39s 7ms/step - loss: 1.3220

Epoch 00012: loss improved from 1.33822 to 1.32432, saving model to models/metal-char.hdf5
---Seed: "' know now all the reasons\nand all the co'"---
 know now all the reasons
and all the corcely game time
with me
but image tears
the the mairin
take hault me ashameeds
at lay in the ocean of time leave
my hope repouration you're my tears of your heart
like the day is let the way
driving t
Epoch 13/30
5633/5633 [==============================] - 39s 7ms/step - loss: 1.3081

Epoch 00013: loss improved from 1.32432 to 1.31071, saving model to models/metal-char.hdf5
Epoch 14/30
5633/5633 [==============================] - 39s 7ms/step - loss: 1.2938

Epoch 00014: loss improved from 1.31071 to 1.29885, saving model to models/metal-char.hdf5
---Seed: "'ars\nnow beneath\nelectrical skies\nartille'"---
ars
now beneath
electrical skies
artilleds i tollimily
fall it comes
dopt all to eternity
eternity the humand as the flaughhohhohhohhohhohhohhohhohhohhohhohhohhohhohhohhohhohhohhahhahhahhahhahhahhahhahhahhahhahhahhahhahhahhahhahhahhahh
hhoh
Epoch 15/30
5633/5633 [==============================] - 39s 7ms/step - loss: 1.2828

Epoch 00015: loss improved from 1.29885 to 1.28791, saving model to models/metal-char.hdf5
Epoch 16/30
5633/5633 [==============================] - 39s 7ms/step - loss: 1.2699

Epoch 00016: loss improved from 1.28791 to 1.27619, saving model to models/metal-char.hdf5
---Seed: "'ady to strike\ncall for us and you will s'"---
ady to strike
call for us and you will sounds shin frozes
into the rold you will open
now you wake up why will come us to me open death
i cannot toll me
i am not envence i could is sidt
the sun away from the tender
on the cloud silfit exuce
Epoch 17/30
5633/5633 [==============================] - 39s 7ms/step - loss: 1.2591

Epoch 00017: loss improved from 1.27619 to 1.26540, saving model to models/metal-char.hdf5
Epoch 18/30
5633/5633 [==============================] - 39s 7ms/step - loss: 1.2475

Epoch 00018: loss improved from 1.26540 to 1.25442, saving model to models/metal-char.hdf5
---Seed: ""'ve never been hurt\nact like you don't n""---
've never been hurt
act like you don't never gike and read
i know you sustappine everypherians
to my fingers hard to go away
you feel ignorade suffer to
your vicions free put the grave
tils murriors wind skies of sangring imputting building
Epoch 19/30
5633/5633 [==============================] - 39s 7ms/step - loss: 1.2388

Epoch 00019: loss improved from 1.25442 to 1.24440, saving model to models/metal-char.hdf5
Epoch 20/30
5633/5633 [==============================] - 39s 7ms/step - loss: 1.2306

Epoch 00020: loss improved from 1.24440 to 1.23521, saving model to models/metal-char.hdf5
---Seed: "'bered by offsprings\nwitness the end of m'"---
bered by offsprings
witness the end of madical preceine
but exists havelen't mind too much to avoid
walk harnts of the just throne
an a land lay it leaves
the beauth around myself
this hopely staity that a long shall
it fears the grances of
Epoch 21/30
5633/5633 [==============================] - 39s 7ms/step - loss: 1.2214

Epoch 00021: loss improved from 1.23521 to 1.22585, saving model to models/metal-char.hdf5
Epoch 22/30
5633/5633 [==============================] - 39s 7ms/step - loss: 1.2067

Epoch 00022: loss improved from 1.22585 to 1.21731, saving model to models/metal-char.hdf5
---Seed: "' deep world of darkness so infinite and '"---
 deep world of darkness so infinite and shelter
 like spring distones tormore
the winds spow i never asween
but when the reasons puning it
i don't know we'll shane this girl enthroid
what do you pane
break what you can led by once was too
Epoch 23/30
5633/5633 [==============================] - 39s 7ms/step - loss: 1.1998

Epoch 00023: loss improved from 1.21731 to 1.20829, saving model to models/metal-char.hdf5
Epoch 24/30
5633/5633 [==============================] - 39s 7ms/step - loss: 1.1934

Epoch 00024: loss improved from 1.20829 to 1.19972, saving model to models/metal-char.hdf5
---Seed: "'ed\nascend\nto darkness we sail\neternal re'"---
ed
ascend
to darkness we sail
eternal remember
the
when is the humand time is hurt
with each in chance spread the indectimeom the journey begin
we'll be fixano just exploit for us
into the bett it returned
by both our kinn light
i am still
Epoch 25/30
5633/5633 [==============================] - 39s 7ms/step - loss: 1.1828

Epoch 00025: loss improved from 1.19972 to 1.19151, saving model to models/metal-char.hdf5
Epoch 26/30
5633/5633 [==============================] - 39s 7ms/step - loss: 1.1740

Epoch 00026: loss improved from 1.19151 to 1.18459, saving model to models/metal-char.hdf5
---Seed: "'christian society\nwomen were anathematiz'"---
christian society
women were anathematized of goes
 guestival powers of apotheration fass of my north
no fiust
gliding the provelosms of distrocked
snowledged who now
and this impromise
fullles me stays
before we melding
denied will reveag
Epoch 27/30
5633/5633 [==============================] - 39s 7ms/step - loss: 1.1674

Epoch 00027: loss improved from 1.18459 to 1.17692, saving model to models/metal-char.hdf5
Epoch 28/30
5633/5633 [==============================] - 39s 7ms/step - loss: 1.1592

Epoch 00028: loss improved from 1.17692 to 1.16979, saving model to models/metal-char.hdf5
---Seed: "'only void just about everywhere\nso i din'"---
only void just about everywhere
so i dinacl
fear for rain and carvable
we shared by it
give us the way that a commour and darows from althounds
now always creatured down and beside the soul
life i never be seen
your nevered in things and tw
Epoch 29/30
5633/5633 [==============================] - 39s 7ms/step - loss: 1.1546

Epoch 00029: loss improved from 1.16979 to 1.16355, saving model to models/metal-char.hdf5
Epoch 30/30
5633/5633 [==============================] - 39s 7ms/step - loss: 1.1478

Epoch 00030: loss improved from 1.16355 to 1.15678, saving model to models/metal-char.hdf5
---Seed: "'ng\nwhen the life forsaken again is the o'"---
ng
when the life forsaken again is the one
and dough i can't take her
feel myself spected
sucame alone
my last is always we pride on me
the darkness ofwerd the times
there's a strength to last
the words' dies anate majesty
triet
the old ste
Out[16]:
End of Exercise: Please return to the main room
In [20]:
model = load_model(f'models/{model_name}.hdf5')

With some helper functions we can generate text from an arbitrary seed string.

In [16]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


def gen_text_char(seq, temperature=0.3):
    print("Seed:")
    print("\"", ''.join([idx2char[np.argmax(x)] for x in seq]), "\"", end='')
    # generate characters
    for i in range(1000):
        x = seq.reshape(1, seq_len, -1)
        pred = model.predict(x, verbose=0)[0]
        index = sample(pred, temperature)
#             index = np.argmax(pred)
        result = idx2char[index]
        sys.stdout.write(result)
        # shift sequence over
        seq[:-1] = seq[1:]
        seq[-1] = eye[index]
    print()


def text_from_seed(s, temperature=0.3):
    s = s.lower()
    s = re.sub(r"[^\s\w']", "", s)
    char2idx = {c: i for i, c in idx2char.items()}
    seq = [char2idx[c] for c in s]
    if len(seq) < seq_len:
        print(f'Seed must be at least {seq_len} characters long!')
        return
    seq = seq[:seq_len]
    x = np.copy(np.array(seq))
    x = eye[x]
    gen_text_char(x, temperature)
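
The sample helper above implements temperature sampling: dividing the log-probabilities by a temperature $T$ before re-normalizing is equivalent to sampling from $\tilde{p}_i = p_i^{1/T} / \sum_j p_j^{1/T}$. A temperature below 1 sharpens the distribution toward the argmax character (more conservative text), while a temperature above 1 flattens it (more surprising, often less coherent text).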

Set a seed string and see where your model takes it.

In [30]:
seed = "Sunshine, lollipops and rainbows\nEverything that's wonderful is what I feel when we're together "
text_from_seed(seed)
Seed:
" sunshine lollipops and rainbows
everythi "ng is not all the world
the shadows of a dead my life
in the world in the wind
the dark and see the land
the soul of a devolut of the darkness
we are the ones who says there a blackened with the way
the world is speak
i want to go back to me
the beauty of the darkness and destroyed are so come
the reason to the battle
with the wind is not all the world
i want to love you met you
i'm doing to me
i want to see you and i don't know
i was so were we see
to be the one we left behind
the fear of the dark in a desert of the ancient life
i will not come to me
the more i can't see the past
and i was the one who was now
the season where the streets of my heart
i wish i was better one the sea
we go to care in the sun of the attic
i will find a world in the wind
the world in the wind
the sands of the halls of a new world
i can't see the fire is all i need
the silence of the beast never been said
and survive on the waves
i'm waiting for an answer
the land of the sun
in the wind to carry on the wind
Q: How might you improve upon this simple model architecture?

Arithmetic with an RNN

Thanks go to Eleni for this code example.

In this exercise, we are going to teach addition to our model. Given two numbers of at most 3 digits each, the model outputs their sum (at most 4 digits). The input is provided as a string '231+432' and the model provides its output as '663 ' (here the space is the padding character). We are not going to use any external dataset; we will construct our own dataset for this exercise.

This exercise effectively "translates" a sequence of characters, '231+432', to another sequence of characters, '663 ', and hence this class of models is called sequence-to-sequence (aka seq2seq) models. Such architectures have profound applications in several real-life tasks such as machine translation, summarization, and image captioning.

To be clear, sequence-to-sequence (aka seq2seq) models take as input a sequence of length N and return a sequence of length M, where N and M may or may not differ, and every single observation/input may be of different values, too. For example, machine translation concerns converting text from one natural language to another (e.g., translating English to French). Google Translate is an example, and their system is a seq2seq model. The input (e.g., an English sentence) can be of any length, and the output (e.g., a French sentence) may be of any length.

Background knowledge: The earliest and simplest seq2seq model works by having one RNN for the input, just like we've always done, and we refer to it as the "encoder." The final hidden state of the encoder RNN is fed as input to another RNN that we refer to as the "decoder." The job of the decoder is to generate each token, one word at a time. This may seem really limiting, as it relies on the encoder encapsulating the entire input sequence in just one hidden state. It seems unrealistic that we could encode the entire meaning of a sentence with just one hidden state. Yet, results even in this simplistic manner can be quite impressive. In fact, these early results were compelling enough that these models immediately replaced decades of earlier machine translation work.

In [134]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, RepeatVector, TimeDistributed

Data Generation and Preprocessing

We can simply generate all the training data we need.

In [141]:
class CharacterTable(object):
    def __init__(self, chars):
        self.chars = sorted(set(chars))
        self.char_indices = {c: i for i, c in enumerate(self.chars)}
        self.indices_char = {i: c for i, c in enumerate(self.chars)}

    # converts a String of characters into a one-hot embedding/vector
    def encode(self, C, num_rows):
        x = np.zeros((num_rows, len(self.chars)))
        for i, c in enumerate(C):
            x[i, self.char_indices[c]] = 1
        return x

    # converts a one-hot embedding/vector into a String of characters
    def decode(self, x, calc_argmax=True):
        if calc_argmax:
            x = x.argmax(axis=-1)
        return ''.join(self.indices_char[x] for x in x)
In [142]:
TRAINING_SIZE = 50000
DIGITS = 3
MAXOUTPUTLEN = DIGITS + 1
MAXLEN = DIGITS + 1 + DIGITS

chars = '0123456789+ '
ctable = CharacterTable(chars)
In [143]:
def return_random_digit():
    return np.random.choice(list('0123456789'))

# generate a new number of length `DIGITS`
def generate_number():
    num_digits = np.random.randint(1, DIGITS + 1)
    return int(''.join( return_random_digit()
                      for i in range(num_digits)))

# generate `TRAINING_SIZE` # of pairs of random numbers
def data_generate(num_examples):
    questions = []
    answers = []
    seen = set()
    print('Generating data...')
    while len(questions) < TRAINING_SIZE:
        a, b = generate_number(), generate_number()

        # don't allow duplicates; this is good practice for training,
        # as we will minimize memorizing seen examples
        key = tuple(sorted((a, b)))
        if key in seen:
            continue
        seen.add(key)

        # pad the data with spaces so that the length is always MAXLEN.
        q = '{}+{}'.format(a, b)
        query = q + ' ' * (MAXLEN - len(q))
        ans = str(a + b)

        # answers can be of maximum size DIGITS + 1.
        ans += ' ' * (MAXOUTPUTLEN - len(ans))
        questions.append(query)
        answers.append(ans)
    print('Total addition questions:', len(questions))
    return questions, answers

def encode_examples(questions, answers):
    x = np.zeros((len(questions), MAXLEN, len(chars)), dtype=np.bool)
    y = np.zeros((len(questions), DIGITS + 1, len(chars)), dtype=np.bool)
    for i, sentence in enumerate(questions):
        x[i] = ctable.encode(sentence, MAXLEN)
    for i, sentence in enumerate(answers):
        y[i] = ctable.encode(sentence, DIGITS + 1)

    indices = np.arange(len(y))
    np.random.shuffle(indices)
    return x[indices],y[indices]
In [144]:
q,a = data_generate(TRAINING_SIZE)
x,y = encode_examples(q,a)

# divides our data into training and validation
split_at = len(x) - len(x) // 10
x_train, x_val, y_train, y_val = x[:split_at], x[split_at:],y[:split_at],y[split_at:]

print('Training Data shape:')
print('X : ', x_train.shape)
print('Y : ', y_train.shape)

print('Sample Question(in encoded form) : ', x_train[0], y_train[0])
print('Sample Question(in decoded form) : ', ctable.decode(x_train[0]),'Sample Output : ', ctable.decode(y_train[0]))
Generating data...
Total addition questions: 50000
Training Data shape:
X :  (45000, 7, 12)
Y :  (45000, 4, 12)
Sample Question(in encoded form) :  [[False False False False False False False  True False False False False]
 [False False False False False False False False False False False  True]
 [False False  True False False False False False False False False False]
 [False  True False False False False False False False False False False]
 [False False False False False False False False  True False False False]
 [False False False False False False False False False False  True False]
 [ True False False False False False False False False False False False]] [[False False False False False False False False  True False False False]
 [False False False False False False False  True False False False False]
 [False False False False False False False False False False  True False]
 [ True False False False False False False False False False False False]]
Sample Question(in decoded form) :  590+68  Sample Output :  658
In [145]:
x_train
Out[145]:
array([[[False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False,  True],
        [False, False,  True, ..., False, False, False],
        ...,
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False,  True, False],
        [ True, False, False, ..., False, False, False]],

       [[False, False, False, ..., False,  True, False],
        [False, False, False, ...,  True, False, False],
        [False, False, False, ..., False, False, False],
        ...,
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False,  True, False],
        [ True, False, False, ..., False, False, False]],

       [[False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False,  True, False],
        ...,
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [ True, False, False, ..., False, False, False]],

       ...,

       [[False, False, False, ...,  True, False, False],
        [False,  True, False, ..., False, False, False],
        [False, False, False, ...,  True, False, False],
        ...,
        [False, False, False, ..., False, False, False],
        [ True, False, False, ..., False, False, False],
        [ True, False, False, ..., False, False, False]],

       [[False, False, False, ..., False, False, False],
        [False, False, False, ..., False,  True, False],
        [False,  True, False, ..., False, False, False],
        ...,
        [False, False,  True, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [ True, False, False, ..., False, False, False]],

       [[False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        ...,
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [ True, False, False, ..., False, False, False]]])
Build an RNN for Arithmetic

Note: Whenever you initialize an LSTM in Keras, the option return_sequences defaults to False, meaning the layer passes along only its final hidden state. If you instead set return_sequences = True, the LSTM returns its hidden state at every time step, so the next component must be able to consume a full sequence rather than a single vector.

Think about how this is relevant to the architecture of this model and to the TimeDistributed wrapper we just learned about.
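
To make the shape difference concrete, here is a minimal sketch (not part of the lab's solution code) that feeds the same toy input through an LSTM with and without return_sequences. The input shape mirrors the (MAXLEN, len(chars)) = (7, 12) encoding used above; the 8-unit layer size is an arbitrary illustrative choice.

from tensorflow.keras.layers import Input, LSTM
from tensorflow.keras.models import Model

inp = Input(shape=(7, 12))                        # (time steps, characters)

# default (return_sequences=False): only the final hidden state is returned
last_state = LSTM(8)(inp)
print(Model(inp, last_state).output_shape)        # (None, 8)

# return_sequences=True: the hidden state at every time step is returned
all_states = LSTM(8, return_sequences=True)(inp)
print(Model(inp, all_states).output_shape)        # (None, 7, 8)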

Build an encoder and a decoder, each a single LSTM layer with 128 units, plus an appropriate dense layer as required by the model's output.

In [150]:
# Hyperparameters
HIDDEN_SIZE = 128
BATCH_SIZE = 128
LAYERS = 1

print('Build model...')
model = Sequential()

#ENCODING
model.add(LSTM(HIDDEN_SIZE, input_shape=(MAXLEN, len(chars))))
model.add(RepeatVector(MAXOUTPUTLEN))

#DECODING
for _ in range(LAYERS):
    # return hidden layer at each time step
    model.add(LSTM(HIDDEN_SIZE, return_sequences=True))

model.add(TimeDistributed(Dense(len(chars), activation='softmax')))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.summary()
Build model...
Model: "sequential_22"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
lstm_16 (LSTM)               (None, 128)               72192
_________________________________________________________________
repeat_vector_2 (RepeatVecto (None, 4, 128)            0
_________________________________________________________________
lstm_17 (LSTM)               (None, 4, 128)            131584
_________________________________________________________________
time_distributed_1 (TimeDist (None, 4, 12)             1548
=================================================================
Total params: 205,324
Trainable params: 205,324
Non-trainable params: 0
_________________________________________________________________
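
As a sanity check on the parameter counts in the summary above, recall that an LSTM with n units receiving m-dimensional inputs has 4(n(n + m) + n) weights: four gates, each with an input kernel, a recurrent kernel, and a bias. A quick back-of-the-envelope calculation (a sketch, not part of the original cell) reproduces the totals:

n, m = HIDDEN_SIZE, len(chars)        # 128 hidden units, 12 characters
enc   = 4 * (n * (n + m) + n)         # encoder LSTM: 4 * (128*(128+12) + 128) = 72,192
dec   = 4 * (n * (n + n) + n)         # decoder LSTM sees the 128-dim encoder vector: 131,584
dense = n * m + m                     # TimeDistributed Dense: 128*12 + 12 = 1,548
print(enc, dec, dense, enc + dec + dense)   # 72192 131584 1548 205324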

Let's check how well our model trained.

In [151]:
for iteration in range(1, 2):
    print()
    model.fit(x_train, y_train,
              batch_size=BATCH_SIZE,
              epochs=20,
              validation_data=(x_val, y_val))
    # Select 20 samples from the validation set at random so
    # we can visualize errors.
    print('Finished iteration ', iteration)
    numcorrect = 0
    numtotal = 20

    for i in range(numtotal):
        ind = np.random.randint(0, len(x_val))
        rowx, rowy = x_val[np.array([ind])], y_val[np.array([ind])]
        # predict_classes returns the argmax class index at each output time step
        preds = model.predict_classes(rowx, verbose=0)
        q = ctable.decode(rowx[0])
        correct = ctable.decode(rowy[0])
        guess = ctable.decode(preds[0], calc_argmax=False)
        print('Question', q, end=' ')
        print('True', correct, end=' ')
        print('Guess', guess, end=' ')
        if guess == correct:
            print('Good job')
            numcorrect += 1
        else:
            print('Fail')
    print('The model scored ', numcorrect*100/numtotal,' % in its test.')
Epoch 1/20
352/352 [==============================] - 5s 6ms/step - loss: 2.0174 - accuracy: 0.2885 - val_loss: 1.7942 - val_accuracy: 0.3441
Epoch 2/20
352/352 [==============================] - 2s 5ms/step - loss: 1.7772 - accuracy: 0.3434 - val_loss: 1.7091 - val_accuracy: 0.3704
Epoch 3/20
352/352 [==============================] - 2s 5ms/step - loss: 1.6578 - accuracy: 0.3831 - val_loss: 1.5676 - val_accuracy: 0.4189
Epoch 4/20
352/352 [==============================] - 2s 5ms/step - loss: 1.5156 - accuracy: 0.4305 - val_loss: 1.4042 - val_accuracy: 0.4744
Epoch 5/20
352/352 [==============================] - 2s 5ms/step - loss: 1.3777 - accuracy: 0.4846 - val_loss: 1.2784 - val_accuracy: 0.5250
Epoch 6/20
352/352 [==============================] - 2s 5ms/step - loss: 1.2598 - accuracy: 0.5288 - val_loss: 1.1922 - val_accuracy: 0.5512
Epoch 7/20
352/352 [==============================] - 2s 5ms/step - loss: 1.1605 - accuracy: 0.5640 - val_loss: 1.1004 - val_accuracy: 0.5843
Epoch 8/20
352/352 [==============================] - 2s 5ms/step - loss: 1.0719 - accuracy: 0.5927 - val_loss: 1.0159 - val_accuracy: 0.6151
Epoch 9/20
352/352 [==============================] - 2s 5ms/step - loss: 0.9892 - accuracy: 0.6241 - val_loss: 0.9340 - val_accuracy: 0.6425
Epoch 10/20
352/352 [==============================] - 2s 5ms/step - loss: 0.8879 - accuracy: 0.6606 - val_loss: 0.8111 - val_accuracy: 0.6805
Epoch 11/20
352/352 [==============================] - 2s 5ms/step - loss: 0.7547 - accuracy: 0.7132 - val_loss: 0.6573 - val_accuracy: 0.7530
Epoch 12/20
352/352 [==============================] - 2s 5ms/step - loss: 0.6126 - accuracy: 0.7778 - val_loss: 0.5363 - val_accuracy: 0.8079
Epoch 13/20
352/352 [==============================] - 2s 5ms/step - loss: 0.4939 - accuracy: 0.8360 - val_loss: 0.4236 - val_accuracy: 0.8686
Epoch 14/20
352/352 [==============================] - 2s 5ms/step - loss: 0.3952 - accuracy: 0.8824 - val_loss: 0.3391 - val_accuracy: 0.9003
Epoch 15/20
352/352 [==============================] - 2s 5ms/step - loss: 0.3146 - accuracy: 0.9160 - val_loss: 0.2851 - val_accuracy: 0.9208
Epoch 16/20
352/352 [==============================] - 2s 5ms/step - loss: 0.2535 - accuracy: 0.9382 - val_loss: 0.2221 - val_accuracy: 0.9458
Epoch 17/20
352/352 [==============================] - 2s 5ms/step - loss: 0.2063 - accuracy: 0.9535 - val_loss: 0.1934 - val_accuracy: 0.9529
Epoch 18/20
352/352 [==============================] - 2s 5ms/step - loss: 0.1786 - accuracy: 0.9584 - val_loss: 0.1608 - val_accuracy: 0.9613
Epoch 19/20
352/352 [==============================] - 2s 5ms/step - loss: 0.1441 - accuracy: 0.9707 - val_loss: 0.1314 - val_accuracy: 0.9708
Epoch 20/20
352/352 [==============================] - 2s 5ms/step - loss: 0.1210 - accuracy: 0.9758 - val_loss: 0.1156 - val_accuracy: 0.9750
Finished iteration  1
Question 579+42  True 621  Guess 621  Good job
Question 778+40  True 818  Guess 818  Good job
Question 34+574  True 608  Guess 608  Good job
Question 5+553   True 558  Guess 558  Good job
Question 2+27    True 29   Guess 29   Good job
Question 506+30  True 536  Guess 536  Good job
Question 51+714  True 765  Guess 765  Good job
Question 258+31  True 289  Guess 289  Good job
Question 9+70    True 79   Guess 89   Fail
Question 14+83   True 97   Guess 97   Good job
Question 59+378  True 437  Guess 437  Good job
Question 94+836  True 930  Guess 920  Fail
Question 875+483 True 1358 Guess 1358 Good job
Question 482+34  True 516  Guess 516  Good job
Question 257+49  True 306  Guess 306  Good job
Question 591+5   True 596  Guess 596  Good job
Question 88+771  True 859  Guess 859  Good job
Question 248+86  True 334  Guess 334  Good job
Question 27+929  True 956  Guess 956  Good job
Question 23+854  True 877  Guess 877  Good job
The model scored  90.0  % in its test.
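
A note on the prediction code above: Sequential.predict_classes is deprecated in more recent TensorFlow releases and has since been removed. If it is unavailable in your environment, the same step can be written with np.argmax over the character dimension, as in this minimal equivalent sketch:

# equivalent to model.predict_classes(rowx): per-time-step class indices
probs = model.predict(rowx, verbose=0)     # shape (1, MAXOUTPUTLEN, len(chars))
preds = np.argmax(probs, axis=-1)          # shape (1, MAXOUTPUTLEN)
print('Guess', ctable.decode(preds[0], calc_argmax=False))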

Possible Experimentation

  • Try changing the hyperparameters, using other RNN cells (SimpleRNN, GRU), or adding more layers, and check whether training for more epochs helps.

  • Try reversing the operands in the validation set and check whether the model has learned the commutative property of addition.

  • Try printing the hidden representation for two commutative inputs (e.g. 12+345 and 345+12) and check whether the learned representations are the same or similar. Do we expect them to be? Why or why not? You can access a layer by index with model.layers, and layer.output gives that layer's output tensor; see the sketch after this list.

  • Try doing addition in the RNN the way we do it by hand: reverse the order of the digits and, at each time step, feed in one digit from each operand and produce one output digit (units in the first time step, tens in the second, and so on).
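
The following sketch addresses the third bullet. It is not part of the original lab code and assumes the queries are space-padded to MAXLEN and encoded with ctable.encode, as in the data-generation code above (if your data_generate reverses the query strings, apply the same reversal here). It builds a sub-model that exposes the encoder LSTM's output and compares the hidden representations of two commutative questions:

from tensorflow.keras.models import Model

# sub-model from the original input to the encoder LSTM's output (model.layers[0])
encoder = Model(model.input, model.layers[0].output)

def encode_query(q):
    # pad the question with spaces to MAXLEN, then one-hot encode it
    return ctable.encode(q + ' ' * (MAXLEN - len(q)), MAXLEN)

pair = np.array([encode_query('12+345'), encode_query('345+12')])
h = encoder.predict(pair)                 # shape (2, HIDDEN_SIZE)
print(np.allclose(h[0], h[1]))            # are the hidden representations identical?
print(np.linalg.norm(h[0] - h[1]))        # if not, how far apart are they?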