{
"cells": [
{
"cell_type": "markdown",
"id": "promising-photography",
"metadata": {},
"source": [
"# Data Science 2: Advanced Topics in Data Science \n",
"## Section 3: Recurrent Neural Networks\n",
"\n",
"\n",
"**Harvard University** \n",
"**Spring 2021** \n",
"**Instructors**: Mark Glickman, Pavlos Protopapas, and Chris Tanner \n",
"**Authors**: Chris Gumb and Eleni Kaxiras\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "central-speech",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES\n",
"import requests\n",
"from IPython.core.display import HTML\n",
"styles = requests.get(\"https://raw.githubusercontent.com/Harvard-IACS/2019-CS109B/master/content/styles/cs109.css\").text\n",
"HTML(styles)"
]
},
{
"cell_type": "markdown",
"id": "terminal-northern",
"metadata": {},
"source": [
"## Learning Objectives\n",
"\n",
"By the end of this lab, you should understand:\n",
"- how to perform basic preprocessing on text data\n",
"- the layers used in `keras` to construct RNNs and their variants (GRU, LSTM)\n",
"- how the model's task (e.g., many-to-one, many-to-many) affects architecture choices"
]
},
{
"cell_type": "markdown",
"id": "broken-brain",
"metadata": {},
"source": [
"\n",
"\n",
"## Notebook Contents\n",
"- [**IMDB Review Dataset**](#imdb)\n",
"- [**Preprocessing Text Data**](#prep)\n",
" - [Tokenization](#token)\n",
" - [Padding](#pad)\n",
" - [Numerical Encoding](#encode)\n",
"- [**Movie Review Sentiment Analysis**](#FFNN)\n",
" - [Naive FFNN](#FFNN)\n",
" - [Embedding Layer](#embed)\n",
" - [1D CNN](#cnn)\n",
" - [Vanilla RNN](#rnn)\n",
" - [Vanishing/Exploding Gradients](#vanish)\n",
" - [GRU](#gru)\n",
" - [LSTM](#lstm)\n",
" - [BiDirectional Layer](#bidir)\n",
" - [Deep RNNs](#deep)\n",
" - [TimeDistributed Layer](#timedis)\n",
" - [RepeatVector Layer](#repeatvec)\n",
" - [CNN + RNN](#cnnrnn)\n",
"- [**Heavy Metal Lyric Generator**](#metal)\n",
" - [Creating Input/Target Pairs](#pairs)\n",
" - [LambdaCallback](#lambdacall)\n",
"- [**Arithmetic w/ RNNs**](#math)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "skilled-yield",
"metadata": {},
"outputs": [],
"source": [
"import tensorflow as tf\n",
"from tensorflow.keras.datasets import imdb\n",
"from tensorflow.keras.models import Sequential, Model, load_model\n",
"from tensorflow.keras.layers import BatchNormalization, Bidirectional, Dense, Embedding, GRU, LSTM, SimpleRNN,\\\n",
" Input, TimeDistributed, Dropout, RepeatVector\n",
"from tensorflow.keras.layers import Conv1D, Conv2D, Flatten, MaxPool1D, MaxPool2D, Lambda\n",
"from tensorflow.keras.callbacks import EarlyStopping, LambdaCallback, ModelCheckpoint\n",
"from tensorflow.keras.initializers import Constant\n",
"from tensorflow.keras.preprocessing import sequence\n",
"from sklearn.model_selection import train_test_split\n",
"import tensorflow_datasets\n",
"from matplotlib import pyplot as plt\n",
"import numpy as np\n",
"import pandas as pd\n",
"import re, sys\n",
"# fix random seed for reproducibility\n",
"np.random.seed(109)"
]
},
{
"cell_type": "markdown",
"id": "capable-classics",
"metadata": {},
"source": [
"## Case Study: IMDB Review Classifier <a id='imdb'></a>\n",
"\n",
"\n",
"Let's frame our discussion of RNNs around the example of a text classifier. Specifically, we'll build and evaluate various models that all attempt to discriminate between positive and negative reviews from the Internet Movie Database (IMDB). The dataset is again made available to us through the TensorFlow Datasets API."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "coupled-alignment",
"metadata": {},
"outputs": [],
"source": [
"import tensorflow_datasets"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "infinite-consent",
"metadata": {},
"outputs": [],
"source": [
"(train, test), info = tensorflow_datasets.load('imdb_reviews', split=['train', 'test'], with_info=True)"
]
},
{
"cell_type": "markdown",
"id": "beneficial-course",
"metadata": {},
"source": [
"The helpful `info` object provides details about the dataset."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "athletic-scheme",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"tfds.core.DatasetInfo(\n",
" name='imdb_reviews',\n",
" full_name='imdb_reviews/plain_text/1.0.0',\n",
" description=\"\"\"\n",
" Large Movie Review Dataset.\n",
" This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.\n",
" \"\"\",\n",
" config_description=\"\"\"\n",
" Plain text\n",
" \"\"\",\n",
" homepage='http://ai.stanford.edu/~amaas/data/sentiment/',\n",
" data_path='/home/10914655/tensorflow_datasets/imdb_reviews/plain_text/1.0.0',\n",
" download_size=80.23 MiB,\n",
" dataset_size=129.83 MiB,\n",
" features=FeaturesDict({\n",
" 'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),\n",
" 'text': Text(shape=(), dtype=tf.string),\n",
" }),\n",
" supervised_keys=('text', 'label'),\n",
" splits={\n",
" 'test': ,\n",
" 'train': ,\n",
" 'unsupervised': ,\n",
" },\n",
" citation=\"\"\"@InProceedings{maas-EtAl:2011:ACL-HLT2011,\n",
" author = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},\n",
" title = {Learning Word Vectors for Sentiment Analysis},\n",
" booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},\n",
" month = {June},\n",
" year = {2011},\n",
" address = {Portland, Oregon, USA},\n",
" publisher = {Association for Computational Linguistics},\n",
" pages = {142--150},\n",
" url = {http://www.aclweb.org/anthology/P11-1015}\n",
" }\"\"\",\n",
")"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"info"
]
},
{
"cell_type": "markdown",
"id": "continuous-perfume",
"metadata": {},
"source": [
"We see that the dataset consists of text reviews and binary good/bad labels. Here are two examples:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "static-concern",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"text:\n",
"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.\n",
"\n",
"label: bad\n",
"\n",
"text:\n",
"This is the kind of film for a snowy Sunday afternoon when the rest of the world can go ahead with its own business as you descend into a big arm-chair and mellow for a couple of hours. Wonderful performances from Cher and Nicolas Cage (as always) gently row the plot along. There are no rapids to cross, no dangerous waters, just a warm and witty paddle through New York life at its best. A family film in every sense and one that deserves the praise it received.\n",
"\n",
"label: good\n",
"\n"
]
}
],
"source": [
"labels = {0: 'bad', 1: 'good'}\n",
"seen = {'bad': False, 'good': False}\n",
"for review in train:\n",
" label = review['label'].numpy()\n",
" if not seen[labels[label]]:\n",
" print(f\"text:\\n{review['text'].numpy().decode()}\\n\")\n",
" print(f\"label: {labels[label]}\\n\")\n",
" seen[labels[label]] = True\n",
" if all(val == True for val in seen.values()):\n",
" break"
]
},
{
"cell_type": "markdown",
"id": "nuclear-genealogy",
"metadata": {},
"source": [
"Great! But unfortunately, computers can't read! 📖--🤖❓"
]
},
{
"cell_type": "markdown",
"id": "meaning-default",
"metadata": {},
"source": [
"## Preprocessing Text Data <a id='prep'></a>\n",
"\n",
"Computers have no built-in knowledge of language and cannot understand text data in any rich way that humans do -- at least not without some help! The first crucial step in natural language processing is to clean and preprocess your data so that your algorithms and models can make use of it.\n",
" \n",
"We'll look at a few preprocessing steps:\n",
" - tokenization\n",
" - padding\n",
" - numerical encoding\n",
" \n",
"Depending on your NLP task, you may want to take additional preprocessing steps which we will not cover here. These can include:\n",
"- converting all characters to lowercase\n",
"- treating each punctuation mark as a token (e.g., `,` `.` `!` `?` are each separate tokens)\n",
"- removing punctuation altogether\n",
"- separating each sentence with unique boundary symbols (e.g., start- and end-of-sentence tokens)\n",
"- removing words that are incredibly common (e.g., function words, (in)definite articles). These are referred to as 'stopwords'.\n",
"- lemmatizing (replacing words with their 'dictionary entry form')\n",
"- stemming (removing grammatical morphemes)\n",
" \n",
"Useful NLP Python libraries such as [NLTK](https://www.nltk.org/) and [spaCy](https://spacy.io/) provide built-in methods for many of these preprocessing steps."
]
},
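{
"cell_type": "markdown",
"id": "added-nltk-sketch",
"metadata": {},
"source": [
"The cells below roll our own minimal pipeline (tokenization, padding, encoding). For the extra steps listed above, here is a small illustrative sketch using NLTK -- the specific choices (Porter stemmer, English stopword list, the toy sentence) are just assumptions for the example, not part of this lab's pipeline:\n",
"```\n",
"import nltk\n",
"from nltk.corpus import stopwords\n",
"from nltk.stem import PorterStemmer\n",
"\n",
"nltk.download('punkt')      # tokenizer models\n",
"nltk.download('stopwords')  # stopword list\n",
"\n",
"text = 'This was an absolutely terrible movie.'\n",
"tokens = nltk.word_tokenize(text.lower())     # lowercase, then split into word/punctuation tokens\n",
"tokens = [t for t in tokens if t not in stopwords.words('english')]  # drop very common words\n",
"stemmer = PorterStemmer()\n",
"tokens = [stemmer.stem(t) for t in tokens]    # strip grammatical morphemes\n",
"print(tokens)\n",
"```"
]
},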
{
"cell_type": "markdown",
"id": "parallel-huntington",
"metadata": {},
"source": [
"### Tokenization <a id='token'></a>\n",
"\n",
"**Tokens** are the atomic units of meaning which our model will be working with. What should these units be? These could be characters, words, or even sentences. For our movie review classifier we will be working at the word level."
]
},
{
"cell_type": "markdown",
"id": "utility-secret",
"metadata": {},
"source": [
"For this example we will process just a subset of the original dataset."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "bacterial-toolbox",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'label': ,\n",
" 'text': }"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"SAMPLE_SIZE = 10\n",
"subset = list(train.take(SAMPLE_SIZE))\n",
"subset[0]"
]
},
{
"cell_type": "markdown",
"id": "tested-greenhouse",
"metadata": {},
"source": [
"The TFDS format allows for the construction of efficient preprocessing pipelines. But for our own preprocessing example we will be primarily working with Python `list` objects. This gives us a chance to practice the Python **list comprehension** which is a powerful tool to have at your disposal. It will serve you well when processing arbitrary text which may not already be in a nice TFDS format (such as in the HW 😉).\n",
"\n",
"We'll convert our data subset into X and y lists."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "artificial-attraction",
"metadata": {},
"outputs": [],
"source": [
"X = [x['text'].numpy().decode() for x in subset]\n",
"y = [x['label'].numpy() for x in subset]"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "hispanic-pottery",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"X has 10 reviews\n",
"y has 10 labels\n"
]
}
],
"source": [
"print(f'X has {len(X)} reviews')\n",
"print(f'y has {len(y)} labels')"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "destroyed-hierarchy",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"First 20 characters of all reviews:\n",
"['This was an absolute...', 'I have been known to...', 'Mann photographs the...', 'This is the kind of ...', 'As others have menti...', 'This is a film which...', 'Okay, you have: ...', ...]\n"
]
}
],
"source": [
"print('First 20 characters of all reviews:')\n",
"print([x[:20] + '...' for x in X])"
]
},
{
"cell_type": "markdown",
"id": "word-level-note",
"metadata": {},
"source": [
"We'll see a bit later that you can in fact successfully train a neural network on text data at the character level.\n",
"\n",
"But for the moment we will work at the word level. This means our observations should be organized as **sequences of words** rather than sequences of characters."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "orange-event",
"metadata": {},
"outputs": [],
"source": [
"# list comprehensions again to the rescue!\n",
"X = [x.split() for x in X]\n",
"# The same thing can be accomplished with:\n",
"# list(map(str.split, X))\n",
"# but that is much harder to parse! O_o"
]
},
{
"cell_type": "markdown",
"id": "addressed-detective",
"metadata": {},
"source": [
"Now let's look at the first 10 **tokens** in the first 2 reviews."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "understanding-rabbit",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(['This',\n",
" 'was',\n",
" 'an',\n",
" 'absolutely',\n",
" 'terrible',\n",
" 'movie.',\n",
" \"Don't\",\n",
" 'be',\n",
" 'lured',\n",
" 'in'],\n",
" ['I',\n",
" 'have',\n",
" 'been',\n",
" 'known',\n",
" 'to',\n",
" 'fall',\n",
" 'asleep',\n",
" 'during',\n",
" 'films,',\n",
" 'but'])"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X[0][:10], X[1][:10]"
]
},
{
"cell_type": "markdown",
"id": "departmental-hello",
"metadata": {},
"source": [
"### Padding <a id='pad'></a>\n",
"\n",
"Let's take a look at the lengths of the reviews in our subset."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "experienced-order",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"[116, 112, 132, 88, 81, 289, 557, 111, 223, 127]"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[len(x) for x in X]"
]
},
{
"cell_type": "markdown",
"id": "described-mexico",
"metadata": {},
"source": [
"If we were training our RNN one sentence at a time, it would be okay to have sentences of varying lengths. However, as with any neural network, it can sometimes be advantageous to train on inputs in batches. When doing so with RNNs, our input tensors need to be of the same length/dimensions."
]
},
{
"cell_type": "markdown",
"id": "divided-venice",
"metadata": {},
"source": [
"Here are two examples of tokenized reviews padded to have a length of 5.\n",
"```\n",
"['I', 'loved', 'it', '<pad>', '<pad>']\n",
"['It', 'stinks', '<pad>', '<pad>', '<pad>']\n",
"```\n",
"Now let's pad our own examples. Note that 'padding' in this context also means truncating sequences that are longer than our specified max length."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "turkish-closing",
"metadata": {},
"outputs": [],
"source": [
"MAX_LEN = 500\n",
"PAD = '<pad>'\n",
"# truncate\n",
"X = [x[:MAX_LEN] for x in X]\n",
"# pad\n",
"for x in X:\n",
" while len(x) < MAX_LEN:\n",
" x.append(PAD)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "endangered-juvenile",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[500, 500, 500, 500, 500, 500, 500, 500, 500, 500]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[len(x) for x in X]"
]
},
{
"cell_type": "markdown",
"id": "understood-barcelona",
"metadata": {},
"source": [
"Now all reviews are of a uniform length!"
]
},
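{
"cell_type": "markdown",
"id": "added-pad-sequences-note",
"metadata": {},
"source": [
"As an aside, Keras provides a helper that does this padding/truncating for us: `sequence.pad_sequences` (we imported `sequence` above). It is typically applied to integer-encoded sequences like the ones we build next; a quick sketch on toy data:\n",
"```\n",
"from tensorflow.keras.preprocessing import sequence\n",
"\n",
"toy = [[1, 4, 2], [11, 2, 9, 8]]   # two integer-encoded 'reviews' of different lengths\n",
"padded = sequence.pad_sequences(toy, maxlen=5, padding='post', value=0)\n",
"print(padded)\n",
"# [[ 1  4  2  0  0]\n",
"#  [11  2  9  8  0]]\n",
"```"
]
},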
{
"cell_type": "markdown",
"id": "close-worse",
"metadata": {},
"source": [
"### Numerical Encoding <a id='encode'></a>\n",
"\n",
"If each review in our dataset is an observation, then the features of each observation are the tokens, in this case, words. But these words are still strings. Our machine learning methods require us to be able to multiply our features by weights. If we want to use these words as inputs for a neural network we'll have to convert them into some numerical representation.\n",
"\n",
"One solution is to create a one-to-one mapping between unique words and integers.\n",
"\n",
"If the six sentences below were our entire corpus, our conversion would look like this:\n",
"\n",
"1. i have books - [1, 4, 2]\n",
"2. interesting books are useful - [9, 2, 5, 7]\n",
"3. i have computers - [1, 4, 3]\n",
"4. computers are interesting and useful - [3, 5, 9, 8, 7]\n",
"5. books and computers are both valuable - [2, 8, 3, 5, 10, 11]\n",
"6. bye bye - [6, 6]\n",
"\n",
"i-1, books-2, computers-3, have-4, are-5, bye-6, useful-7, and-8, interesting-9, both-10, valuable-11\n",
"\n",
"To accomplish this we'll first need to know what all the unique words are in our dataset."
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "broke-oakland",
"metadata": {},
"outputs": [],
"source": [
"all_tokens = [word for review in X for word in review]"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "sensitive-transparency",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(5000, 5000)"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# sanity check\n",
"len(all_tokens), sum([len(x) for x in X])"
]
},
{
"cell_type": "markdown",
"id": "matched-information",
"metadata": {},
"source": [
"Casting our `list` of words into a `set` is a great way to get all the *unique* words in the data."
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "broad-section",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Unique Words: 892\n"
]
}
],
"source": [
"vocab = sorted(set(all_tokens))\n",
"print('Unique Words:', len(vocab))"
]
},
{
"cell_type": "markdown",
"id": "external-preliminary",
"metadata": {},
"source": [
"Now we need to create a mapping from words to integers. For this we will use a **dictionary comprehension**."
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "absent-coordinate",
"metadata": {},
"outputs": [],
"source": [
"word2idx = {word: idx for idx, word in enumerate(vocab)}"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "northern-teens",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'\"Absolute': 0,\n",
" '\"Bohlen\"-Fan': 1,\n",
" '\"Brideshead': 2,\n",
" '\"Candy\"?).': 3,\n",
" '\"City': 4,\n",
" '\"Dieter': 5,\n",
" '\"Dieter\"': 6,\n",
" '\"Dragonfly\"': 7,\n",
" '\"I\\'ve': 8,\n",
" ...}"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"word2idx"
]
},
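{
"cell_type": "markdown",
"id": "added-encode-sketch",
"metadata": {},
"source": [
"With this mapping in hand, integer-encoding our (padded) reviews is just one more list comprehension. A quick sketch to illustrate the idea on our small subset:\n",
"```\n",
"# map every token in every review to its integer index\n",
"X_encoded = [[word2idx[word] for word in review] for review in X]\n",
"print(X_encoded[0][:10])  # integer ids of the first 10 tokens of the first review\n",
"```"
]
},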
{
"cell_type": "markdown",
"id": "known-rover",
"metadata": {},
"source": [
"Let us build a single-layer feed-forward net with a hidden layer of 250 nodes. Each input would be a 500-dim vector of tokens since we padded all our sequences to size 500.\n",
"\n",
"<div class='exercise'><b>Exercise</b>\n",
"\n",
"Q: How would you calculate the number of parameters in this network?\n",
"</div>\n",
"\n",
"### Embedding Layer <a id='embed'></a>\n",
"\n",
"One can view the embedding process as a linear projection from one vector space to another. For NLP, we usually use embeddings to project the sparse one-hot encodings of words on to a lower-dimensional continuous space so that the input surface is 'dense' and possibly smooth. Thus, one can view this embedding layer process as just a transformation from $\\mathbb{R}^{inp}$ to $\\mathbb{R}^{emb}$.\n",
"\n",
"This not only reduces dimensionality but also allows semantic similarities between tokens to be captured by 'similarities' between the embedding vectors. This was not possible with one-hot encoding as all vectors there were orthogonal to one another. \n",
"\n",
"It is also possible to load pretrained embeddings that were learned from giant corpora. This would be an instance of transfer learning.\n",
"\n",
"If you are interested in learning more, start with the astronomically impactful papers of [word2vec](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) and [GloVe](https://www.aclweb.org/anthology/D14-1162.pdf).\n",
"\n",
"In Keras we use the [`Embedding`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) layer:\n",
"```\n",
"tf.keras.layers.Embedding(\n",
" input_dim, output_dim, embeddings_initializer='uniform',\n",
" embeddings_regularizer=None, activity_regularizer=None,\n",
" embeddings_constraint=None, mask_zero=False, input_length=None, **kwargs\n",
")\n",
"```\n",
"We'll need to specify the `input_dim` and `output_dim`. If working with sequences, as we are, you'll also need to set the `input_length`."
]
},
{
"cell_type": "code",
"execution_count": 42,
"id": "covered-automation",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model: \"FFNN_EMBED\"\n",
"_________________________________________________________________\n",
"Layer (type) Output Shape Param # \n",
"=================================================================\n",
"embedding_1 (Embedding) (None, 500, 100) 1000000 \n",
"_________________________________________________________________\n",
"flatten_1 (Flatten) (None, 50000) 0 \n",
"_________________________________________________________________\n",
"dense_6 (Dense) (None, 250) 12500250 \n",
"_________________________________________________________________\n",
"dense_7 (Dense) (None, 1) 251 \n",
"=================================================================\n",
"Total params: 13,500,501\n",
"Trainable params: 13,500,501\n",
"Non-trainable params: 0\n",
"_________________________________________________________________\n",
"None\n",
"Epoch 1/2\n",
"196/196 - 6s - loss: 0.6433 - accuracy: 0.6078 - val_loss: 0.3630 - val_accuracy: 0.8497\n",
"Epoch 2/2\n",
"196/196 - 6s - loss: 0.2349 - accuracy: 0.9025 - val_loss: 0.2977 - val_accuracy: 0.8747\n",
"Accuracy: 87.47%\n"
]
}
],
"source": [
"EMBED_DIM = 100\n",
"\n",
"model = Sequential(name='FFNN_EMBED')\n",
"model.add(Embedding(MAX_VOCAB, EMBED_DIM, input_length=MAX_LEN))\n",
"model.add(Flatten())\n",
"model.add(Dense(250, activation='relu'))\n",
"model.add(Dense(1, activation='sigmoid'))\n",
"model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n",
"print(model.summary())\n",
"\n",
"model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=2, batch_size=128, verbose=2)\n",
"\n",
"scores = model.evaluate(X_test, y_test, verbose=0)\n",
"print(\"Accuracy: %.2f%%\" % (scores[1]*100))"
]
},
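{
"cell_type": "markdown",
"id": "added-param-count",
"metadata": {},
"source": [
"A quick check of where those parameter counts come from (assuming `MAX_VOCAB` is 10,000, which is consistent with the 1,000,000 embedding parameters in the summary above):\n",
"\n",
"- `Embedding`: 10,000 × 100 = 1,000,000\n",
"- `Flatten`: 500 × 100 = 50,000 values per review (no parameters)\n",
"- `Dense(250)`: 50,000 × 250 + 250 = 12,500,250\n",
"- `Dense(1)`: 250 + 1 = 251\n",
"- Total: 13,500,501"
]
},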
{
"cell_type": "markdown",
"id": "silent-reporter",
"metadata": {},
"source": [
"### Model 3: 1-Dimensional Convolutional Network <a id='cnn'></a>"
]
},
{
"cell_type": "markdown",
"id": "continental-import",
"metadata": {},
"source": [
"Text can be thought of as a 1-dimensional sequence (a single, long vector) and we can apply 1D convolutions over a set of word embeddings. \n",
"\n",
"More information on convolutions on text data can be found on [this blog](http://debajyotidatta.github.io/nlp/deep/learning/word-embeddings/2016/11/27/Understanding-Convolutions-In-Text/). If you want to learn more, read this [published and well-cited paper](https://www.aclweb.org/anthology/I17-1026.pdf) from Eleni's friend, Byron Wallace."
]
},
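{
"cell_type": "markdown",
"id": "added-conv1d-sketch",
"metadata": {},
"source": [
"A minimal sketch of what a 1D-convolutional review classifier could look like using the layers we imported earlier -- the filter count, kernel size, and pooling choices are illustrative, not the lab's exact architecture:\n",
"```\n",
"model = Sequential(name='CNN1D')\n",
"model.add(Embedding(MAX_VOCAB, EMBED_DIM, input_length=MAX_LEN))\n",
"model.add(Conv1D(32, kernel_size=3, padding='same', activation='relu'))  # slide filters along the word dimension\n",
"model.add(MaxPool1D(pool_size=2))   # downsample the sequence\n",
"model.add(Flatten())\n",
"model.add(Dense(250, activation='relu'))\n",
"model.add(Dense(1, activation='sigmoid'))\n",
"model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n",
"```"
]
},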
{
"cell_type": "markdown",
"id": "adjacent-response",
"metadata": {},
"source": [
"<div class='exercise'><b>Exercise</b>\n",
"\n",
"Q: Why do we use Conv1D if our input, a sequence of word embeddings, is 2D?\n",
"</div>"
]
},
{
"cell_type": "markdown",
"id": "secret-atlas",
"metadata": {},
"source": [
"At a high-level, an RNN is similar to a feed-forward neural network (FFNN) in that there is an input layer, a hidden layer, and an output layer. The input layer is fully connected to the hidden layer, and the hidden layer is fully connected to the output layer. However, the crux of what makes it a **recurrent** neural network is that the hidden layer for a given time _t_ is not only based on the input layer at time _t_ but also the hidden layer from time _t-1_.\n",
"\n",
"Here's a popular blog post on [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).\n",
"\n",
"In Keras, the vanilla RNN unit is implemented as the `SimpleRNN` layer:\n",
"```\n",
"tf.keras.layers.SimpleRNN(\n",
" units, activation='tanh', use_bias=True,\n",
" kernel_initializer='glorot_uniform',\n",
" recurrent_initializer='orthogonal',\n",
" bias_initializer='zeros', kernel_regularizer=None,\n",
" recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None,\n",
" kernel_constraint=None, recurrent_constraint=None, bias_constraint=None,\n",
" dropout=0.0, recurrent_dropout=0.0, return_sequences=False, return_state=False,\n",
" go_backwards=False, stateful=False, unroll=False, **kwargs\n",
")\n",
"```\n",
"As you can see, recurrent layers in Keras take many arguments. We only need to be concerned with `units`, which specifies the size of the hidden state, and `return_sequences`, which will be discussed shortly. For the moment it is fine to leave this set to the default of `False`.\n",
"\n",
"Due to the limitations of the vanilla RNN unit (more on that next) it tends not to be used much in practice. For this reason it seems that the Keras developers neglected to implement GPU acceleration for this layer! Notice how much slower the training is even for a network with far fewer parameters."
]
},
{
"cell_type": "code",
"execution_count": 45,
"id": "stretch-andorra",
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model: \"SimpleRNN\"\n",
"_________________________________________________________________\n",
"Layer (type) Output Shape Param # \n",
"=================================================================\n",
"embedding_3 (Embedding) (None, 500, 100) 1000000 \n",
"_________________________________________________________________\n",
"simple_rnn (SimpleRNN) (None, 100) 20100 \n",
"_________________________________________________________________\n",
"dense_10 (Dense) (None, 1) 101 \n",
"=================================================================\n",
"Total params: 1,020,201\n",
"Trainable params: 1,020,201\n",
"Non-trainable params: 0\n",
"_________________________________________________________________\n",
"None\n",
"Epoch 1/3\n",
"196/196 [==============================] - 53s 267ms/step - loss: 0.6720 - accuracy: 0.5660\n",
"Epoch 2/3\n",
"196/196 [==============================] - 52s 266ms/step - loss: 0.5283 - accuracy: 0.7444\n",
"Epoch 3/3\n",
"196/196 [==============================] - 52s 265ms/step - loss: 0.3406 - accuracy: 0.8588\n",
"Accuracy: 83.13%\n"
]
}
],
"source": [
"model = Sequential(name='SimpleRNN')\n",
"model.add(Embedding(MAX_VOCAB, EMBED_DIM, input_length=MAX_LEN))\n",
"model.add(SimpleRNN(100))\n",
"model.add(Dense(1, activation='sigmoid'))\n",
"model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n",
"print(model.summary())\n",
"\n",
"model.fit(X_train, y_train, epochs=3, batch_size=128)\n",
"\n",
"scores = model.evaluate(X_test, y_test, verbose=0)\n",
"print(\"Accuracy: %.2f%%\" % (scores[1]*100))"
]
},
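{
"cell_type": "markdown",
"id": "added-rnn-param-count",
"metadata": {},
"source": [
"A quick check on the summary above: the `SimpleRNN` layer's 20,100 parameters come from units × (input features + units + 1) = 100 × (100 + 100 + 1) -- one weight matrix applied to the embedded input, one applied to the previous hidden state, and a bias vector. The hidden-to-hidden matrix is exactly what makes the layer recurrent."
]
},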
{
"cell_type": "markdown",
"id": "dependent-blast",
"metadata": {},
"source": [
"### Vanishing/Exploding Gradients <a id='vanish'></a>"
]
},
{
"cell_type": "markdown",
"id": "weekly-comparison",
"metadata": {},
"source": [
"\n",
"We need to backpropagate through every time step to calculate the gradients used for our weight updates.\n",
"\n",
"This requires the use of the chain rule, which amounts to repeated multiplications.\n",
"\n",
"This can cause two types of problems. First, this product can quickly 'explode,' becoming large, causing destructive updates to the model and numerical overflow. One hack to solve this problem is to **clip** the gradient at some threshold.\n",
"\n",
"Alternatively, the gradient can 'vanish,' getting smaller and smaller as the gradient moves backwards in time. Gradient clipping will not help us here. If we can't propagate gradients sufficiently far back in time then our network will be unable to learn long temporal dependencies. This problem motivates the architecture of the GRU and LSTM units as substitutes for the 'vanilla' RNN.\n",
"\n",
"For a more detailed look at the vanishing/exploding gradient problem, please see [Marios's excellent Advanced Section](https://edstem.org/us/courses/3773/lessons/11753/slides/56629). "
]
},
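{
"cell_type": "markdown",
"id": "added-clipping-sketch",
"metadata": {},
"source": [
"As an aside, Keras optimizers support gradient clipping directly via the `clipnorm`/`clipvalue` arguments. A minimal sketch (the threshold of 1.0 is an illustrative choice):\n",
"```\n",
"from tensorflow.keras.optimizers import Adam\n",
"\n",
"# clip the gradient norm at 1.0 before each weight update\n",
"clipped_adam = Adam(learning_rate=1e-3, clipnorm=1.0)\n",
"model.compile(loss='binary_crossentropy', optimizer=clipped_adam, metrics=['accuracy'])\n",
"```"
]
},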
{
"cell_type": "markdown",
"id": "annual-composition",
"metadata": {},
"source": [
"### Model 5: GRU <a id='gru'></a>\n",
"\n",
"$X_{t}$: input \n",
"$U$, $V$, and $\\beta$: parameter matrices and vector \n",
"$\\tilde{h_t}$: candidate activation vector \n",
"$h_{t}$: output vector \n",
"$R_t$: reset gate \n",
"$Z_t$: update gate \n",
"\n",
"The gates of the GRU allow for the gradients to flow more freely to previous time steps, helping to mitigate the vanishing gradient problem.\n",
"\n",
"In Keras, the [`GRU`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GRU) layer is used in exactly the same way as the `SimpleRNN` layer. \n",
"```\n",
"tf.keras.layers.GRU(\n",
" units, activation='tanh', recurrent_activation='sigmoid',\n",
" use_bias=True, kernel_initializer='glorot_uniform',\n",
" recurrent_initializer='orthogonal',\n",
" bias_initializer='zeros', kernel_regularizer=None,\n",
" recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None,\n",
" kernel_constraint=None, recurrent_constraint=None, bias_constraint=None,\n",
" dropout=0.0, recurrent_dropout=0.0, return_sequences=False, return_state=False,\n",
" go_backwards=False, stateful=False, unroll=False, time_major=False,\n",
" reset_after=True, **kwargs\n",
")\n",
"```\n",
"\n",
"Here we just swap it into the previous architecture. Note how much faster it trains with GPU acceleration than the SimpleRNN!"
]
},
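{
"cell_type": "markdown",
"id": "added-gru-equations",
"metadata": {},
"source": [
"For reference, one standard way to write the GRU updates in the notation above (the gate-specific subscripts on $U$, $V$, and $\\beta$ just mark separate copies of those parameters; whether $Z_t$ or $1-Z_t$ multiplies the previous state varies between sources, so take this as one common formulation rather than the lab's canonical one):\n",
"\n",
"$$Z_t = \\sigma(U_z X_t + V_z h_{t-1} + \\beta_z)$$\n",
"$$R_t = \\sigma(U_r X_t + V_r h_{t-1} + \\beta_r)$$\n",
"$$\\tilde{h_t} = \\tanh(U_h X_t + V_h (R_t \\odot h_{t-1}) + \\beta_h)$$\n",
"$$h_t = (1 - Z_t) \\odot h_{t-1} + Z_t \\odot \\tilde{h_t}$$"
]
},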
{
"cell_type": "code",
"execution_count": 48,
"id": "organizational-chosen",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model: \"GRU\"\n",
"_________________________________________________________________\n",
"Layer (type) Output Shape Param # \n",
"=================================================================\n",
"embedding_6 (Embedding) (None, 500, 100) 1000000 \n",
"_________________________________________________________________\n",
"gru_1 (GRU) (None, 100) 60600 \n",
"_________________________________________________________________\n",
"dense_13 (Dense) (None, 1) 101 \n",
"=================================================================\n",
"Total params: 1,060,701\n",
"Trainable params: 1,060,701\n",
"Non-trainable params: 0\n",
"_________________________________________________________________\n",
"None\n",
"Epoch 1/3\n",
"391/391 [==============================] - 13s 30ms/step - loss: 0.5626 - accuracy: 0.6781\n",
"Epoch 2/3\n",
"391/391 [==============================] - 12s 30ms/step - loss: 0.2510 - accuracy: 0.9011\n",
"Epoch 3/3\n",
"391/391 [==============================] - 12s 30ms/step - loss: 0.1757 - accuracy: 0.9349\n",
"Accuracy: 88.02%\n"
]
}
],
"source": [
"model = Sequential(name='GRU')\n",
"model.add(Embedding(MAX_VOCAB, EMBED_DIM, input_length=MAX_LEN))\n",
"model.add(GRU(100))\n",
"model.add(Dense(1, activation='sigmoid'))\n",
"model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n",
"print(model.summary())\n",
"\n",
"model.fit(X_train, y_train, epochs=3, batch_size=64)\n",
"\n",
"scores = model.evaluate(X_test, y_test, verbose=0)\n",
"print(\"Accuracy: %.2f%%\" % (scores[1]*100))"
]
},
{
"cell_type": "markdown",
"id": "adopted-anchor",
"metadata": {},
"source": [
"### Bidirectional Layer <a id='bidir'></a>\n",
"\n",
"We may want our model to learn dependencies in either direction. A **Bidirectional RNN** consists of two separate recurrent units. One processes the sequence from left to right; the other processes that same sequence in reverse, from right to left. The outputs of the two units are then merged together (typically concatenated) and fed to the next layer of the network. \n",
"\n",
"Creating a Bidirectional RNN in Keras is quite simple. We just 'wrap' a recurrent layer in the [`Bidirectional`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Bidirectional) layer. The default behavior is to concatenate the output from each direction.\n",
"\n",
"```\n",
"tf.keras.layers.Bidirectional(\n",
" layer, merge_mode='concat', weights=None, backward_layer=None,\n",
" **kwargs\n",
")\n",
"```\n",
"\n",
"Example:\n",
"```\n",
"model = Sequential()\n",
"...\n",
"model.add(Bidirectional(SimpleRNN(n_nodes)))\n",
"...\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "valuable-canvas",
"metadata": {},
"source": [
"### Deep RNNs <a id='deep'></a>\n",
"\n",
"We may want to stack RNN layers one after another. But there is a problem. A recurrent layer expects to be given a sequence as input, and yet we can see that the recurrent layer in each of our models above outputs a single vector. This is because the default behavior of Keras's recurrent layers is to suppress the output until the final time step. If we want to have two recurrent units in a row then the first will have to give an output after each time step, thus providing a sequence to the 2nd recurrent layer.\n",
"\n",
"We can have our recurrent layers output at each time step by setting `return_sequences=True`. \n",
"Example:\n",
"```\n",
"model = Sequential()\n",
"...\n",
"model.add(LSTM(100, return_sequences=True))\n",
"model.add(LSTM(100))\n",
"...\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "assisted-pollution",
"metadata": {},
"source": [
"### TimeDistributed Layer <a id='timedis'></a>"
]
},
{
"cell_type": "markdown",
"id": "loved-camcorder",
"metadata": {},
"source": [
"[`TimeDistributed`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TimeDistributed) is a 'wrapper' that applies a layer to all time steps of an input sequence.\n",
"```\n",
"tf.keras.layers.TimeDistributed(\n",
" layer, **kwargs\n",
")\n",
"```\n",
"We use `TimeDistributed` when we want to input a sequence into a layer that doesn't normally expect a time dimension, such as `Dense`."
]
},
{
"cell_type": "code",
"execution_count": 146,
"id": "noted-database",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Shape of input : (1, 3, 5)\n",
"Shape of output : (1, 3, 8)\n"
]
}
],
"source": [
"model = Sequential()\n",
"model.add(TimeDistributed(Dense(8), input_shape=(3, 5)))\n",
"input_array = np.random.randint(10, size=(1,3,5))\n",
"print(\"Shape of input : \", input_array.shape)\n",
"\n",
"model.compile('rmsprop', 'mse')\n",
"output_array = model.predict(input_array)\n",
"print(\"Shape of output : \", output_array.shape)"
]
},
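{
"cell_type": "markdown",
"id": "added-many-to-many-sketch",
"metadata": {},
"source": [
"Combining `return_sequences=True` with `TimeDistributed` gives a many-to-many architecture: one output per time step. A minimal sketch (the layer sizes are illustrative):\n",
"```\n",
"model = Sequential()\n",
"model.add(Embedding(MAX_VOCAB, EMBED_DIM, input_length=MAX_LEN))\n",
"model.add(LSTM(100, return_sequences=True))                 # a hidden state at every time step\n",
"model.add(TimeDistributed(Dense(1, activation='sigmoid')))  # same Dense applied to each step\n",
"# output shape: (batch, 500, 1) -- one prediction per input token\n",
"```"
]
},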
{
"cell_type": "markdown",
"id": "touched-occasions",
"metadata": {},
"source": [
"### CNN + RNN <a id='cnnrnn'></a>"
]
},
{
"cell_type": "markdown",
"id": "brilliant-welding",
"metadata": {},
"source": [
"CNNs are good at learning spatial features, and sentences can be thought of as 1-D spatial vectors (dimensionality is determined by the number of words in the sentence). We can then take the features learned by the CNN (after a maxpooling layer) and feed them into an RNN! We expect the CNN to be able to pick out invariant features across the 1-D spatial structure (i.e., sentence) that characterize good and bad sentiment. These learned spatial features may then be treated as sequences by a recurrent layer. The classification step is then performed by a final dense layer."
]
},
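{
"cell_type": "markdown",
"id": "added-cnn-rnn-sketch",
"metadata": {},
"source": [
"A minimal sketch of such a CNN + RNN stack, using the layers imported earlier (the filter count, kernel size, and unit sizes are illustrative choices, not the lab's exact architecture):\n",
"```\n",
"model = Sequential(name='CNN_RNN')\n",
"model.add(Embedding(MAX_VOCAB, EMBED_DIM, input_length=MAX_LEN))\n",
"model.add(Conv1D(32, kernel_size=3, padding='same', activation='relu'))  # learn local n-gram features\n",
"model.add(MaxPool1D(pool_size=2))          # downsample the feature sequence\n",
"model.add(LSTM(100))                       # read the feature sequence in order\n",
"model.add(Dense(1, activation='sigmoid'))  # good/bad sentiment\n",
"model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n",
"```"
]
},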
{
"cell_type": "markdown",
"id": "chicken-verse",
"metadata": {},
"source": [
"