{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# CS-109B Advanced Data Science\n", "## Lab 8: Recurrent Neural Networks\n", "\n", "**Harvard University**
\n", "**Fall 2020**
\n", "**Instructors:** Mark Glickman, Pavlos Protopapas, and Chris Tanner
\n", "**Lab Instructors:** Chris Tanner and Eleni Angelaki Kaxiras
\n", "**Content:** Srivatsan Srinivasan, Pavlos Protopapas, Chris Tanner" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES\n", "import requests\n", "from IPython.core.display import HTML\n", "styles = requests.get(\"https://raw.githubusercontent.com/Harvard-IACS/2019-CS109B/master/content/styles/cs109.css\").text\n", "HTML(styles)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Learning Goals\n", "\n", "In this lab we will look at Recurrent Neural Networks (RNNs), LSTMs and their building blocks.\n", "\n", "By the end of this lab, you should:\n", "\n", "- know how to put together the building blocks used in RNNs and their variants (GRU, LSTM) in `keras` with an example.\n", "- have a good understanding of how sequences -- any data that has some temporal semantics (e.g., time series, natural language, images, etc.) -- fit into and benefit from a recurrent architecture\n", "- be familiar with preprocessing text and dynamic embeddings\n", "- be familiar with the gradient issues RNNs face when processing longer sequences\n", "- understand different kinds of LSTM architectures, classifiers, sequence-to-sequence models and their far-reaching applications\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. IMDb Review Classification: Feedforward, CNN, RNN, LSTM\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this task, we are going to do sentiment classification on a movie review dataset. We are going to build a feedforward net, a convolutional neural net, and a recurrent net, and combine one or more of them to understand the performance of each. A sentence can be thought of as a sequence of words that collectively represent meaning. Individual words impact the meaning. Thus, the context matters; words that occur earlier in the sentence influence the sentence's structure and meaning in the latter part of the sentence (e.g., Jose asked Anqi if she were going to the library today). Likewise, words that occur later in a sentence can affect the meaning of earlier words (e.g., Apple is an interesting company). As we have seen in lecture, if we wish to make use of a full sentence's context in both directions, then we should use a bi-directional RNN (e.g., Bi-LSTM). For the purpose of this tutorial, we are going to restrict ourselves to only uni-directional RNNs."
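, "\n", "\n", "To make the uni- vs. bi-directional distinction concrete, here is a minimal sketch (an added illustration, not used in the rest of the lab; it assumes the standard `keras` layers, where `Bidirectional` is the wrapper provided for this purpose). A bi-directional model simply runs a second copy of the recurrent layer right-to-left and concatenates the two outputs:\n", "\n", "```python\n", "from keras.models import Sequential\n", "from keras.layers import Embedding, LSTM, Bidirectional, Dense\n", "\n", "# uni-directional: the hidden state at each position only sees words to its left\n", "uni = Sequential([Embedding(10000, 100), LSTM(100), Dense(1, activation='sigmoid')])\n", "\n", "# bi-directional: a second LSTM reads the review right-to-left; the two outputs are concatenated\n", "bi = Sequential([Embedding(10000, 100), Bidirectional(LSTM(100)), Dense(1, activation='sigmoid')])\n", "```"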
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import numpy\n", "from keras.datasets import imdb\n", "from keras.models import Sequential\n", "from keras.layers import Dense\n", "from keras.layers import LSTM, SimpleRNN\n", "from keras.layers.embeddings import Embedding\n", "from keras.layers import Flatten\n", "from keras.preprocessing import sequence\n", "from keras.layers.convolutional import Conv1D\n", "from keras.layers.convolutional import MaxPooling1D\n", "import numpy as np\n", "# fix random seed for reproducibility\n", "numpy.random.seed(1)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# We want to have a finite vocabulary to make sure that our word matrices are not arbitrarily large\n", "vocabulary_size = 10000\n", "\n", "# We also want to have a finite length of reviews and not have to process really long sentences.\n", "max_review_length = 500" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Computers have no built-in knowledge of words or their meanings and cannot understand them in any rich way that humans do -- hence, the purpose of Natural Language Processing (NLP). As with any data science, computer science, or machine learning task, the first crucial step is to clean (pre-process) your data so that you can soundly make use of it. Within NLP, this first step is called Tokenization and it concerns how to represent each token (a.k.a. word) of your corpus (i.e., dataset).\n", "\n", "#### TOKENIZATION\n", "\n", "A ``token`` refers to a single, atomic unit of meaning (i.e., a word). How should our computers represent each word? We could read in our corpus word by word and store each word as a String (data structure). However, Strings tend to use more computer memory than Integers and can become cumbersome. As long as we preserve the uniqueness of the tokens and are consistent, we are better off converting each distinct word to a distinct number (Integer). This is standard practice within NLP / computer science / data science, etc.\n", "Let us look at a small example of tokenization.\n", "\n", "If the six sentences below were our entire corpus, our conversion would look as follows:\n", "\n", "1. i have books - [4, 5, 1]\n", "2. interesting books are useful - [6, 1, 2, 7]\n", "3. i have computers - [4, 5, 3]\n", "4. computers are interesting and useful - [3, 2, 6, 8, 7]\n", "5. books and computers are both valuable - [1, 8, 3, 2, 10, 11]\n", "6. bye bye - [9, 9]\n", "\n", "We create the vocabulary's tokens based on frequency of occurrence (more frequent words get smaller indices, with ties broken arbitrarily). Hence, we assign the following tokens:\n", "\n", "books-1, are-2, computers-3, i-4, have-5, interesting-6, useful-7, and-8, bye-9, both-10, valuable-11\n", "\n", "Thankfully, our dataset is already represented in such a tokenized form.\n", "\n", "**NOTE:** Oftentimes, depending on your NLP task, it is useful to also perform other pre-processing, cleaning steps, such as:\n", "- treating each punctuation mark as a token (e.g., , . ! ? are each separate tokens)\n", "- lower-casing all words (so that a given word isn't treated differently just because it starts a sentence or not)\n", "- separating each sentence with a unique symbol (e.g., `<s>` and `</s>`)\n", "- removing words that are incredibly common (e.g., function words, (in)definite articles). These are referred to as 'stopwords'. For language modelling, we DO NOT remove stopwords.
A sentence's meaning needs to include all of the original words." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Load data" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[ 0 0 0 ... 19 178 32]\n", " [ 0 0 0 ... 16 145 95]\n", " [ 0 0 0 ... 7 129 113]\n", " [ 687 23 4 ... 21 64 2574]]\n", "Number of reviews 25000\n", "Length of first and fifth review before padding 500 500\n", "First review [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", " 0 0 1 14 22 16 43 530 973 1622 1385 65 458 4468\n", " 66 3941 4 173 36 256 5 25 100 43 838 112 50 670\n", " 2 9 35 480 284 5 150 4 172 112 167 2 336 385\n", " 39 4 172 4536 1111 17 546 38 13 447 4 192 50 16\n", " 6 147 2025 19 14 22 4 1920 4613 469 4 22 71 87\n", " 12 16 43 530 38 76 15 13 1247 4 22 17 515 17\n", " 12 16 626 18 2 5 62 386 12 8 316 8 106 5\n", " 4 2223 5244 16 480 66 3785 33 4 130 12 16 38 619\n", " 5 25 124 51 36 135 48 25 1415 33 6 22 12 215\n", " 28 77 52 5 14 407 16 82 2 8 4 107 117 5952\n", " 15 256 4 2 7 3766 5 723 36 71 43 530 476 26\n", " 400 317 46 7 4 2 1029 13 104 88 4 381 15 297\n", " 98 32 2071 56 26 141 6 194 7486 18 4 226 22 21\n", " 134 476 26 480 5 144 30 5535 18 51 36 28 224 92\n", " 25 104 4 226 65 16 38 1334 88 12 16 283 5 16\n", " 4472 113 103 32 15 16 5345 19 178 32]\n", "First label 1\n" ] } ], "source": [ "(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=vocabulary_size)\n", "print('Number of reviews', len(X_train))\n", "print('Length of first and fifth review before padding', len(X_train[0]) ,len(X_train[4]))\n", "print('First review', X_train[0])\n", "print('First label', y_train[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Preprocess data\n", "\n", "If we were training our RNN one sentence at a time, it would be okay to have sentences of varying lengths. However, as with any neural network, it can be sometimes be advantageous to train inputs in batches. When doing so with RNNs, our input tensors need to be of the same length/dimensions. Thus, let's pad our sentences." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Length of first and fifth review after padding 500 500\n" ] } ], "source": [ "X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)\n", "X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)\n", "print('Length of first and fifth review after padding', len(X_train[0]) ,len(X_train[4]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### MODEL 1A : FEED-FORWARD NETWORKS WITHOUT EMBEDDINGS \n", "\n", "Let us build a single-layer feed-forward net with a hidden layer of 250 nodes. Each input would be a 500-dim vector of tokens since we padded all our sequences to size 500.\n", "\n", "
\n", "
\n", "EXERCISE: Calculate the number of parameters involved in this network and implement a feedforward net to do classification without looking at cells below.\n", "
" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model: \"sequential_2\"\n", "_________________________________________________________________\n", "Layer (type) Output Shape Param # \n", "=================================================================\n", "dense_3 (Dense) (None, 250) 125250 \n", "_________________________________________________________________\n", "dense_4 (Dense) (None, 1) 251 \n", "=================================================================\n", "Total params: 125,501\n", "Trainable params: 125,501\n", "Non-trainable params: 0\n", "_________________________________________________________________\n", "None\n", "Train on 25000 samples, validate on 25000 samples\n", "Epoch 1/10\n", " - 1s - loss: 176.8025 - accuracy: 0.5048 - val_loss: 98.0177 - val_accuracy: 0.4998\n", "Epoch 2/10\n", " - 1s - loss: 49.7650 - accuracy: 0.5830 - val_loss: 52.1167 - val_accuracy: 0.5024\n", "Epoch 3/10\n", " - 1s - loss: 17.7281 - accuracy: 0.6645 - val_loss: 32.8438 - val_accuracy: 0.5027\n", "Epoch 4/10\n", " - 1s - loss: 7.7892 - accuracy: 0.7189 - val_loss: 21.9369 - val_accuracy: 0.5012\n", "Epoch 5/10\n", " - 1s - loss: 3.8072 - accuracy: 0.7672 - val_loss: 16.1270 - val_accuracy: 0.5046\n", "Epoch 6/10\n", " - 1s - loss: 2.2670 - accuracy: 0.7916 - val_loss: 11.6964 - val_accuracy: 0.5064\n", "Epoch 7/10\n", " - 1s - loss: 1.4903 - accuracy: 0.8087 - val_loss: 9.6540 - val_accuracy: 0.5030\n", "Epoch 8/10\n", " - 2s - loss: 1.2193 - accuracy: 0.8132 - val_loss: 9.1117 - val_accuracy: 0.5051\n", "Epoch 9/10\n", " - 1s - loss: 0.9770 - accuracy: 0.8321 - val_loss: 8.4030 - val_accuracy: 0.5035\n", "Epoch 10/10\n", " - 1s - loss: 0.7273 - accuracy: 0.8542 - val_loss: 7.8003 - val_accuracy: 0.5035\n", "Accuracy: 50.35%\n" ] } ], "source": [ "model = Sequential()\n", "\n", "model.add(Dense(250, activation='relu',input_dim=max_review_length))\n", "model.add(Dense(1, activation='sigmoid'))\n", "model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n", "print(model.summary())\n", "\n", "model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=128, verbose=2)\n", "# Final evaluation of the model\n", "scores = model.evaluate(X_test, y_test, verbose=0)\n", "print(\"Accuracy: %.2f%%\" % (scores[1]*100))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Discussion: Why was the performance bad? What was wrong with tokenization?\n", "
\n", "\n", "### MODEL 1B : FEED-FORWARD NETWORKS WITH EMBEDDINGS\n", "\n", "#### What is an embedding layer ? \n", "\n", "An embedding is a \"distributed representation\" (e.g., vector) of a particular atomic item (e.g., word token, object, etc). When representing items by embeddings:\n", "- each distinct item should be represented by its own unique embedding\n", "- the semantic similarity between items should correspond to the similarity between their respective embeddings (i.e., words that are more similar to one another should have embeddings that are more similar to each other).\n", "\n", "There are essentially an infinite number of ways to create such embeddings, and since these representations have such a great influence on the performance of our models, there has been an incredible amount of research dedicated to this very aspect. If you are interested in learning more, start with the astromonically impactful papers of [word2vec](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) and [GloVe](https://www.aclweb.org/anthology/D14-1162.pdf).\n", "\n", "In general, though, one can view the embedding process as a linear projection from one vector space to another (e.g., a vector space of unique words being mapped to a world of fixed-length, dense vectors filled with continuous-valued numbers. For NLP, we usually use embeddings to project the one-hot encodings of words (i.e., a vector that is the length of the entire vocabulary, and it is filled with all zeros except for a single value of 1 that corresponds to the particular word) on to a lower-dimensional continuous space (e.g., vectors of size 100) so that the input surface is dense and possibly smooth. Thus, one can view this embedding layer process as just a transformation from $\\mathbb{R}^{inp}$ to $\\mathbb{R}^{emb}$" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "embedding_dim = 100" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model: \"sequential_3\"\n", "_________________________________________________________________\n", "Layer (type) Output Shape Param # \n", "=================================================================\n", "embedding_1 (Embedding) (None, 500, 100) 1000000 \n", "_________________________________________________________________\n", "flatten_1 (Flatten) (None, 50000) 0 \n", "_________________________________________________________________\n", "dense_5 (Dense) (None, 250) 12500250 \n", "_________________________________________________________________\n", "dense_6 (Dense) (None, 1) 251 \n", "=================================================================\n", "Total params: 13,500,501\n", "Trainable params: 13,500,501\n", "Non-trainable params: 0\n", "_________________________________________________________________\n", "None\n" ] } ], "source": [ "model = Sequential()\n", "\n", "# inputs will be converted from batch_size * sentence_length to batch_size*sentence_length*embedding _dim\n", "model.add(Embedding(vocabulary_size, embedding_dim, input_length=max_review_length))\n", "model.add(Flatten())\n", "model.add(Dense(250, activation='relu'))\n", "model.add(Dense(1, activation='sigmoid'))\n", "model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n", "print(model.summary())" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stderr", 
"output_type": "stream", "text": [ "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:424: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.\n", " \"Converting sparse IndexedSlices to a dense Tensor of unknown shape. \"\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Train on 25000 samples, validate on 25000 samples\n", "Epoch 1/2\n", " - 52s - loss: 0.4555 - accuracy: 0.7728 - val_loss: 0.2868 - val_accuracy: 0.8772\n", "Epoch 2/2\n", " - 50s - loss: 0.1172 - accuracy: 0.9584 - val_loss: 0.3433 - val_accuracy: 0.8594\n", "Accuracy: 85.94%\n" ] } ], "source": [ "# fit the model\n", "model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=2, batch_size=128, verbose=2)\n", "\n", "# evaluate the model on the test set\n", "scores = model.evaluate(X_test, y_test, verbose=0)\n", "print(\"Accuracy: %.2f%%\" % (scores[1]*100))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### MODEL 2 : CONVOLUTIONAL NEURAL NETWORKS (CNN)\n", "Text can be thought of as 1-dimensional sequence (a single, long vector) and we can apply 1D Convolutions over a set of word embeddings. Let us walk through convolutions on text data with [this blog](http://debajyotidatta.github.io/nlp/deep/learning/word-embeddings/2016/11/27/Understanding-Convolutions-In-Text/). If you want to learn more, read this [published and well-cited paper](https://www.aclweb.org/anthology/I17-1026.pdf) from my friend, Byron Wallace.\n", "\n", "Fit a 1D convolution with 200 filters, kernel size 3, followed by a feed-forward layer of 250 nodes, and ReLU and Sigmoid activations as appropriate." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model: \"sequential_4\"\n", "_________________________________________________________________\n", "Layer (type) Output Shape Param # \n", "=================================================================\n", "embedding_2 (Embedding) (None, 500, 100) 1000000 \n", "_________________________________________________________________\n", "conv1d_1 (Conv1D) (None, 500, 200) 60200 \n", "_________________________________________________________________\n", "max_pooling1d_1 (MaxPooling1 (None, 250, 200) 0 \n", "_________________________________________________________________\n", "flatten_2 (Flatten) (None, 50000) 0 \n", "_________________________________________________________________\n", "dense_7 (Dense) (None, 250) 12500250 \n", "_________________________________________________________________\n", "dense_8 (Dense) (None, 1) 251 \n", "=================================================================\n", "Total params: 13,560,701\n", "Trainable params: 13,560,701\n", "Non-trainable params: 0\n", "_________________________________________________________________\n", "None\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:424: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.\n", " \"Converting sparse IndexedSlices to a dense Tensor of unknown shape. \"\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/2\n", "15104/25000 [=================>............] 
- ETA: 38s - loss: 0.7021 - accuracy: 0.5011" ] }, { "ename": "KeyboardInterrupt", "evalue": "", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mKeyboardInterrupt\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 12\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 13\u001b[0m \u001b[0;31m# train the CNN\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 14\u001b[0;31m \u001b[0mmodel\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX_train\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my_train\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mepochs\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mbatch_size\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m128\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 15\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 16\u001b[0m \u001b[0;31m# evalute the CNN\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/usr/local/lib/python3.7/site-packages/keras/engine/training.py\u001b[0m in \u001b[0;36mfit\u001b[0;34m(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)\u001b[0m\n\u001b[1;32m 1237\u001b[0m \u001b[0msteps_per_epoch\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0msteps_per_epoch\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1238\u001b[0m \u001b[0mvalidation_steps\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mvalidation_steps\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1239\u001b[0;31m validation_freq=validation_freq)\n\u001b[0m\u001b[1;32m 1240\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1241\u001b[0m def evaluate(self,\n", "\u001b[0;32m/usr/local/lib/python3.7/site-packages/keras/engine/training_arrays.py\u001b[0m in \u001b[0;36mfit_loop\u001b[0;34m(model, fit_function, fit_inputs, out_labels, batch_size, epochs, verbose, callbacks, val_function, val_inputs, shuffle, initial_epoch, steps_per_epoch, validation_steps, validation_freq)\u001b[0m\n\u001b[1;32m 194\u001b[0m \u001b[0mins_batch\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mi\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mins_batch\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mi\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtoarray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 195\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 196\u001b[0;31m \u001b[0mouts\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfit_function\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mins_batch\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 197\u001b[0m \u001b[0mouts\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mto_list\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mouts\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 198\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0ml\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mo\u001b[0m \u001b[0;32min\u001b[0m 
\u001b[0mzip\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mout_labels\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mouts\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/usr/local/lib/python3.7/site-packages/tensorflow_core/python/keras/backend.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, inputs)\u001b[0m\n\u001b[1;32m 3738\u001b[0m \u001b[0mvalue\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmath_ops\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcast\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalue\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtensor\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3739\u001b[0m \u001b[0mconverted_inputs\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3740\u001b[0;31m \u001b[0moutputs\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_graph_fn\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0mconverted_inputs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3741\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3742\u001b[0m \u001b[0;31m# EagerTensor.numpy() will often make a copy to ensure memory safety.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/usr/local/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1079\u001b[0m \u001b[0mTypeError\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mFor\u001b[0m \u001b[0minvalid\u001b[0m \u001b[0mpositional\u001b[0m\u001b[0;34m/\u001b[0m\u001b[0mkeyword\u001b[0m \u001b[0margument\u001b[0m \u001b[0mcombinations\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1080\u001b[0m \"\"\"\n\u001b[0;32m-> 1081\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_call_impl\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1082\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1083\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_call_impl\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkwargs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcancellation_manager\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/usr/local/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py\u001b[0m in \u001b[0;36m_call_impl\u001b[0;34m(self, args, kwargs, cancellation_manager)\u001b[0m\n\u001b[1;32m 1119\u001b[0m raise TypeError(\"Keyword arguments {} unknown. 
Expected {}.\".format(\n\u001b[1;32m 1120\u001b[0m list(kwargs.keys()), list(self._arg_keywords)))\n\u001b[0;32m-> 1121\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_call_flat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcaptured_inputs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcancellation_manager\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1122\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1123\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_filtered_call\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/usr/local/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py\u001b[0m in \u001b[0;36m_call_flat\u001b[0;34m(self, args, captured_inputs, cancellation_manager)\u001b[0m\n\u001b[1;32m 1222\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mexecuting_eagerly\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1223\u001b[0m flat_outputs = forward_function.call(\n\u001b[0;32m-> 1224\u001b[0;31m ctx, args, cancellation_manager=cancellation_manager)\n\u001b[0m\u001b[1;32m 1225\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1226\u001b[0m \u001b[0mgradient_name\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_delayed_rewrite_functions\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mregister\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/usr/local/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py\u001b[0m in \u001b[0;36mcall\u001b[0;34m(self, ctx, args, cancellation_manager)\u001b[0m\n\u001b[1;32m 509\u001b[0m \u001b[0minputs\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 510\u001b[0m \u001b[0mattrs\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"executor_type\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mexecutor_type\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"config_proto\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mconfig\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 511\u001b[0;31m ctx=ctx)\n\u001b[0m\u001b[1;32m 512\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 513\u001b[0m outputs = execute.execute_with_cancellation(\n", "\u001b[0;32m/usr/local/lib/python3.7/site-packages/tensorflow_core/python/eager/execute.py\u001b[0m in \u001b[0;36mquick_execute\u001b[0;34m(op_name, num_outputs, inputs, attrs, ctx, name)\u001b[0m\n\u001b[1;32m 59\u001b[0m tensors = pywrap_tensorflow.TFE_Py_Execute(ctx._handle, device_name,\n\u001b[1;32m 60\u001b[0m \u001b[0mop_name\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minputs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mattrs\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 61\u001b[0;31m num_outputs)\n\u001b[0m\u001b[1;32m 62\u001b[0m \u001b[0;32mexcept\u001b[0m 
\u001b[0mcore\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_NotOkStatusException\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 63\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mname\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mKeyboardInterrupt\u001b[0m: " ] } ], "source": [ "# %load sol2.py\n", "# create the CNN\n", "model = Sequential()\n", "model.add(Embedding(vocabulary_size, embedding_dim, input_length=max_review_length))\n", "model.add(Conv1D(filters=200, kernel_size=3, padding='same', activation='relu'))\n", "model.add(MaxPooling1D(pool_size=2))\n", "model.add(Flatten())\n", "model.add(Dense(250, activation='relu'))\n", "model.add(Dense(1, activation='sigmoid'))\n", "model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n", "print(model.summary())\n", "\n", "# train the CNN\n", "model.fit(X_train, y_train, epochs=2, batch_size=128)\n", "\n", "# evalute the CNN\n", "scores = model.evaluate(X_test, y_test, verbose=0)\n", "print(\"Accuracy: %.2f%%\" % (scores[1]*100))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### MODEL 3 : Simple RNN\n", "\n", "Two great blogs that are helpful for understanding the workings of a RNN and LSTM are\n", "\n", "1. http://karpathy.github.io/2015/05/21/rnn-effectiveness/\n", "2. http://colah.github.io/posts/2015-08-Understanding-LSTMs/\n", "\n", "At a high-level, an RNN is similar to a feed-forward neural network (FFNN) in that there is an input layer, a hidden layer, and an output layer. The input layer is fully connected to the hidden layer, and the hidden layer is fully connected to the output layer. However, the crux of what makes it a **recurrent** neural network is that the hidden layer for a given time _t_ is not only based on the input layer at time _t_ but also the hidden layer from time _t-1_.\n", "\n", "Mathematically, a simpleRNN can be defined by the following recurrence relation.\n", "\n", "
$h_t = \\sigma(W([h_{t-1},x_{t}])+b)$\n", " \n", "If we extend this recurrence relation to the length of sequences we have in hand, our RNN architecture can be viewed as follows (this is also referred to as 'unrolling' the network):\n", "\n", "" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model: \"sequential_5\"\n", "_________________________________________________________________\n", "Layer (type) Output Shape Param # \n", "=================================================================\n", "embedding_3 (Embedding) (None, 500, 100) 1000000 \n", "_________________________________________________________________\n", "simple_rnn_1 (SimpleRNN) (None, 100) 20100 \n", "_________________________________________________________________\n", "dense_9 (Dense) (None, 1) 101 \n", "=================================================================\n", "Total params: 1,020,201\n", "Trainable params: 1,020,201\n", "Non-trainable params: 0\n", "_________________________________________________________________\n", "None\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:424: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.\n", " \"Converting sparse IndexedSlices to a dense Tensor of unknown shape. \"\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/3\n", "25000/25000 [==============================] - 75s 3ms/step - loss: 0.6572 - accuracy: 0.5923\n", "Epoch 2/3\n", "25000/25000 [==============================] - 59s 2ms/step - loss: 0.5133 - accuracy: 0.7533\n", "Epoch 3/3\n", "25000/25000 [==============================] - 66s 3ms/step - loss: 0.4463 - accuracy: 0.8013\n", "Accuracy: 71.55%\n" ] } ], "source": [ "# %load sol3.py\n", "model = Sequential()\n", "model.add(Embedding(vocabulary_size, embedding_dim, input_length=max_review_length))\n", "model.add(SimpleRNN(100))\n", "\n", "model.add(Dense(1, activation='sigmoid'))\n", "model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n", "print(model.summary())\n", "\n", "model.fit(X_train, y_train, epochs=20, batch_size=16)\n", "# Final evaluation of the model\n", "scores = model.evaluate(X_test, y_test, verbose=0)\n", "print(\"Accuracy: %.2f%%\" % (scores[1]*100))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### RNNs and vanishing/exploding gradients\n", "\n", "Let us use sigmoid activations as example. Derivative of a sigmoid can be written as \n", "
$\\sigma'(x) = \\sigma(x) \\cdot (1-\\sigma(x))$.
\n", "\n", "\n", "
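\n", "As a rough illustration of the scale involved: this derivative is at most $1/4$ (attained at $x = 0$), and backpropagating through $n$ time steps multiplies together one such factor per step, so over 50 steps the activation part of the gradient is bounded by $0.25^{50} \\approx 8 \\times 10^{-31}$ -- numerically zero, which is why the error signal reaching the earliest time steps all but disappears.\n", "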
\n", "\n", "Remember, an RNN is a \"really deep\" feed-forward-esque network (when unrolled in time). Hence, backpropagation happens from $h_t$ all the way back to $h_1$. Also realize that the sigmoid's gradient depends multiplicatively on the value of the sigmoid itself. Hence, if the pre-activation of any layer $h_l$ is strongly negative (or strongly positive), $\\sigma$ saturates towards 0 (or 1) and its derivative $\\sigma \\cdot (1-\\sigma)$ tends to 0, effectively \"vanishing\" the gradient. The earlier layers $h_{1:l-1}$ that the gradient must pass through then learn almost nothing useful from it.\n", "\n", "#### LSTMs and GRU\n", "LSTM and GRU are two more sophisticated RNN variants that rely on gates (one could say that their success hinges on using gates). A gate emits a value between 0 and 1. For instance, LSTM is built on these state updates:\n", "\n", "$L$ denotes a linear transformation, $L(x) = Wx + b$ (each gate below has its own $W$ and $b$).\n", "\n", "$f_t = \\sigma(L([h_{t-1},x_t]))$\n", "\n", "$i_t = \\sigma(L([h_{t-1},x_t]))$\n", "\n", "$o_t = \\sigma(L([h_{t-1},x_t]))$\n", "\n", "$\\hat{C}_t = \\tanh(L([h_{t-1},x_t]))$\n", "\n", "$C_t = f_t * C_{t-1} + i_t * \\hat{C}_t$ (using the forget gate $f_t$ and the input gate $i_t$, the network can learn how much of the old cell state to retain and how much new information to write)\n", "\n", "$h_t = o_t * \\tanh(C_t)$\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### MODEL 4 : LSTM\n", "\n", "Now, let's use an LSTM model to do classification! To make it a fair comparison to the SimpleRNN, let's start with the same architecture hyper-parameters (e.g., number of hidden nodes, epochs, and batch size). Then, experiment with increasing the number of nodes, stacking multiple layers, applying dropouts etc. Check the number of parameters that this model entails." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model: \"sequential_6\"\n", "_________________________________________________________________\n", "Layer (type) Output Shape Param # \n", "=================================================================\n", "embedding_4 (Embedding) (None, 500, 100) 1000000 \n", "_________________________________________________________________\n", "lstm_1 (LSTM) (None, 100) 80400 \n", "_________________________________________________________________\n", "dense_10 (Dense) (None, 1) 101 \n", "=================================================================\n", "Total params: 1,080,501\n", "Trainable params: 1,080,501\n", "Non-trainable params: 0\n", "_________________________________________________________________\n", "None\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:424: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.\n", " \"Converting sparse IndexedSlices to a dense Tensor of unknown shape. \"\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/3\n", " 2368/25000 [=>............................] 
- ETA: 3:34 - loss: 0.7128 - accuracy: 0.5752" ] }, { "ename": "KeyboardInterrupt", "evalue": "", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mKeyboardInterrupt\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 7\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmodel\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msummary\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 8\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 9\u001b[0;31m \u001b[0mmodel\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX_train\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my_train\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mepochs\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m3\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mbatch_size\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m64\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 10\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 11\u001b[0m \u001b[0;31m# evaluation of the LSTM's performance\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/usr/local/lib/python3.7/site-packages/keras/engine/training.py\u001b[0m in \u001b[0;36mfit\u001b[0;34m(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)\u001b[0m\n\u001b[1;32m 1237\u001b[0m \u001b[0msteps_per_epoch\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0msteps_per_epoch\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1238\u001b[0m \u001b[0mvalidation_steps\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mvalidation_steps\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1239\u001b[0;31m validation_freq=validation_freq)\n\u001b[0m\u001b[1;32m 1240\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1241\u001b[0m def evaluate(self,\n", "\u001b[0;32m/usr/local/lib/python3.7/site-packages/keras/engine/training_arrays.py\u001b[0m in \u001b[0;36mfit_loop\u001b[0;34m(model, fit_function, fit_inputs, out_labels, batch_size, epochs, verbose, callbacks, val_function, val_inputs, shuffle, initial_epoch, steps_per_epoch, validation_steps, validation_freq)\u001b[0m\n\u001b[1;32m 194\u001b[0m \u001b[0mins_batch\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mi\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mins_batch\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mi\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtoarray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 195\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 196\u001b[0;31m \u001b[0mouts\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfit_function\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mins_batch\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 197\u001b[0m \u001b[0mouts\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mto_list\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mouts\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 198\u001b[0m 
\u001b[0;32mfor\u001b[0m \u001b[0ml\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mo\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mzip\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mout_labels\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mouts\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/usr/local/lib/python3.7/site-packages/tensorflow_core/python/keras/backend.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, inputs)\u001b[0m\n\u001b[1;32m 3738\u001b[0m \u001b[0mvalue\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmath_ops\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcast\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalue\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtensor\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3739\u001b[0m \u001b[0mconverted_inputs\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3740\u001b[0;31m \u001b[0moutputs\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_graph_fn\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0mconverted_inputs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3741\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3742\u001b[0m \u001b[0;31m# EagerTensor.numpy() will often make a copy to ensure memory safety.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/usr/local/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1079\u001b[0m \u001b[0mTypeError\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mFor\u001b[0m \u001b[0minvalid\u001b[0m \u001b[0mpositional\u001b[0m\u001b[0;34m/\u001b[0m\u001b[0mkeyword\u001b[0m \u001b[0margument\u001b[0m \u001b[0mcombinations\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1080\u001b[0m \"\"\"\n\u001b[0;32m-> 1081\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_call_impl\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1082\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1083\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_call_impl\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkwargs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcancellation_manager\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/usr/local/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py\u001b[0m in \u001b[0;36m_call_impl\u001b[0;34m(self, args, kwargs, cancellation_manager)\u001b[0m\n\u001b[1;32m 1119\u001b[0m raise TypeError(\"Keyword arguments {} unknown. 
Expected {}.\".format(\n\u001b[1;32m 1120\u001b[0m list(kwargs.keys()), list(self._arg_keywords)))\n\u001b[0;32m-> 1121\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_call_flat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcaptured_inputs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcancellation_manager\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1122\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1123\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_filtered_call\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/usr/local/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py\u001b[0m in \u001b[0;36m_call_flat\u001b[0;34m(self, args, captured_inputs, cancellation_manager)\u001b[0m\n\u001b[1;32m 1222\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mexecuting_eagerly\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1223\u001b[0m flat_outputs = forward_function.call(\n\u001b[0;32m-> 1224\u001b[0;31m ctx, args, cancellation_manager=cancellation_manager)\n\u001b[0m\u001b[1;32m 1225\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1226\u001b[0m \u001b[0mgradient_name\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_delayed_rewrite_functions\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mregister\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/usr/local/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py\u001b[0m in \u001b[0;36mcall\u001b[0;34m(self, ctx, args, cancellation_manager)\u001b[0m\n\u001b[1;32m 509\u001b[0m \u001b[0minputs\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 510\u001b[0m \u001b[0mattrs\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"executor_type\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mexecutor_type\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"config_proto\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mconfig\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 511\u001b[0;31m ctx=ctx)\n\u001b[0m\u001b[1;32m 512\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 513\u001b[0m outputs = execute.execute_with_cancellation(\n", "\u001b[0;32m/usr/local/lib/python3.7/site-packages/tensorflow_core/python/eager/execute.py\u001b[0m in \u001b[0;36mquick_execute\u001b[0;34m(op_name, num_outputs, inputs, attrs, ctx, name)\u001b[0m\n\u001b[1;32m 59\u001b[0m tensors = pywrap_tensorflow.TFE_Py_Execute(ctx._handle, device_name,\n\u001b[1;32m 60\u001b[0m \u001b[0mop_name\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minputs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mattrs\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 61\u001b[0;31m num_outputs)\n\u001b[0m\u001b[1;32m 62\u001b[0m \u001b[0;32mexcept\u001b[0m 
\u001b[0mcore\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_NotOkStatusException\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 63\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mname\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mKeyboardInterrupt\u001b[0m: " ] } ], "source": [ "# %load sol4.py\n", "model = Sequential()\n", "model.add(Embedding(vocabulary_size, embedding_dim, input_length=max_review_length))\n", "model.add(LSTM(100))\n", "model.add(Dense(1, activation='sigmoid'))\n", "model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n", "print(model.summary())\n", "\n", "model.fit(X_train, y_train, epochs=3, batch_size=64)\n", "\n", "# evaluation of the LSTM's performance\n", "scores = model.evaluate(X_test, y_test, verbose=0)\n", "print(\"Accuracy: %.2f%%\" % (scores[1]*100))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### MODEL 5 : CNN + LSTM \n", "\n", "CNNs are good at learning spatial features, and sentences can be thought of as 1-D spatial vectors (dimensionality is determined by the number of words in the sentence). We apply an LSTM over the features learned by the CNN (after a maxpooling layer). This leverages the power of CNNs and LSTMs combined! We expect the CNN to be able to pick out invariant features across the 1-D spatial structure (i.e., sentence) that characterize good and bad sentiment. This learned spatial features may then be learned as sequences by an LSTM layer, and the final classification can be made via a feed-forward connection to a single node." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model: \"sequential_12\"\n", "_________________________________________________________________\n", "Layer (type) Output Shape Param # \n", "=================================================================\n", "embedding_8 (Embedding) (None, 500, 100) 1000000 \n", "_________________________________________________________________\n", "conv1d_5 (Conv1D) (None, 500, 32) 9632 \n", "_________________________________________________________________\n", "max_pooling1d_5 (MaxPooling1 (None, 250, 32) 0 \n", "_________________________________________________________________\n", "lstm_1 (LSTM) (None, 100) 53200 \n", "_________________________________________________________________\n", "dense_23 (Dense) (None, 1) 101 \n", "=================================================================\n", "Total params: 1,062,933\n", "Trainable params: 1,062,933\n", "Non-trainable params: 0\n", "_________________________________________________________________\n", "None\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:424: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.\n", " \"Converting sparse IndexedSlices to a dense Tensor of unknown shape. 
\"\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/3\n", "25000/25000 [==============================] - 90s 4ms/step - loss: 0.3942 - accuracy: 0.8035\n", "Epoch 2/3\n", "25000/25000 [==============================] - 88s 4ms/step - loss: 0.2084 - accuracy: 0.9208\n", "Epoch 3/3\n", "25000/25000 [==============================] - 89s 4ms/step - loss: 0.1347 - accuracy: 0.9522\n", "Accuracy: 87.59%\n" ] } ], "source": [ "model = Sequential()\n", "model.add(Embedding(vocabulary_size, embedding_dim, input_length=max_review_length))\n", "model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))\n", "model.add(MaxPooling1D(pool_size=2))\n", "model.add(LSTM(100))\n", "model.add(Dense(1, activation='sigmoid'))\n", "model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n", "print(model.summary())\n", "\n", "model.fit(X_train, y_train, epochs=3, batch_size=64)\n", "# Final evaluation of the model\n", "scores = model.evaluate(X_test, y_test, verbose=0)\n", "print(\"Accuracy: %.2f%%\" % (scores[1]*100))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### CONCLUSION\n", "\n", "We saw the power of sequence models and how they are useful in text classification. They give a solid performance, low memory footprint (thanks to shared parameters) and are able to understand and leverage the temporally connected information contained in the inputs. There is still an open debate about the performance vs memory benefits of CNNs vs RNNs in the research community.\n", "\n", "As a side-note and bit of history: how amazing is it that we can construct these neural networks, train them, and evaluate them in just a few lines of code?! Imagine if we didn't have deep learning libraries like Keras and Tensorflow; we'd have to write backpropagation and gradient descent by hand. Our last network could easily require thousands of lines of code and many hours of debugging. This is what many people did just 8 years ago, since deep learning wasn't common and the community hadn't yet written nicely packaged libraries. Many libraries have come and gone, but nowadays most people use either Tensorflow/Keras (by Google) or PyTorch (by Facebook). I expect them to remain as great libraries for the foreseeable future, so if you're interested in deep learning, it's worth the investment to learn one, or both, of them well." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. 231+432 = 665.... It's not ? Let's ask our LSTM\n", "\n", "In this exercise, we are going to teach addition to our model. Given two numbers (<999), the model outputs their sum (<9999). The input is provided as a string '231+432' and the model will provide its output as ' 663' (Here the empty space is the padding character). We are not going to use any external dataset and are going to construct our own dataset for this exercise.\n", "\n", "The exercise we attempt to do effectively \"translates\" a sequence of characters '231+432' to another sequence of characters ' 663' and hence, this class of models are called sequence-to-sequence models (aka seq2seq). Such architectures have profound applications in several real-life tasks such as machine translation, summarization, image captioning etc.\n", "\n", "To be clear, sequence-to-sequence (aka seq2seq) models take as input a sequence of length N and return a sequence of length M, where N and M may or may not differ, and every single observation/input may be of different values, too. 
For example, machine translation concerns converting text from one natural language to another (e.g., translating English to French). Google Translate is an example, and their system is a seq2seq model. The input (e.g., an English sentence) can be of any length, and the output (e.g., a French sentence) may be of any length.\n", "\n", "**Background knowledge:** The earliest and most simple seq2seq model works by having one RNN for the input, just like we've always done, and we refer to it as being an \"encoder.\" The final hidden state of the encoder RNN is fed as input to another RNN that we refer to as the \"decoder.\" The job of the decoder is to generate each token, one word at a time. This may seem really limiting, as it relies on the encoder encapsulating the entire input sequence with just 1 hidden layer. It seems unrealistic that we could encode an entire meaning of a sentence with just one hidden layer. Yet, results even in this simplistic manner can be quite impressive. In fact, these early results were compelling enough that these models immediately replaced the decades of earlier machine translation work." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "from __future__ import print_function\n", "from keras.models import Sequential\n", "from keras import layers\n", "from keras.layers import Dense, RepeatVector, TimeDistributed\n", "import numpy as np\n", "from six.moves import range" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The less interesting data generation and preprocessing" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "class CharacterTable(object):\n", " def __init__(self, chars): \n", " self.chars = sorted(set(chars))\n", " self.char_indices = dict((c, i) for i, c in enumerate(self.chars))\n", " self.indices_char = dict((i, c) for i, c in enumerate(self.chars))\n", "\n", " # converts a String of characters into a one-hot embedding/vector\n", " def encode(self, C, num_rows): \n", " x = np.zeros((num_rows, len(self.chars)))\n", " for i, c in enumerate(C):\n", " x[i, self.char_indices[c]] = 1\n", " return x\n", " \n", " # converts a one-hot embedding/vector into a String of characters\n", " def decode(self, x, calc_argmax=True): \n", " if calc_argmax:\n", " x = x.argmax(axis=-1)\n", " return ''.join(self.indices_char[x] for x in x)\n" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "TRAINING_SIZE = 50000\n", "DIGITS = 3\n", "MAXOUTPUTLEN = DIGITS + 1\n", "MAXLEN = DIGITS + 1 + DIGITS\n", "\n", "chars = '0123456789+ '\n", "ctable = CharacterTable(chars)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "def return_random_digit():\n", " return np.random.choice(list('0123456789')) \n", "\n", "# generate a new number of length `DIGITS`\n", "def generate_number():\n", " num_digits = np.random.randint(1, DIGITS + 1) \n", " return int(''.join( return_random_digit()\n", " for i in range(num_digits)))\n", "\n", "# generate `TRAINING_SIZE` # of pairs of random numbers\n", "def data_generate(num_examples):\n", " questions = []\n", " answers = []\n", " seen = set()\n", " print('Generating data...')\n", " while len(questions) < TRAINING_SIZE: \n", " a, b = generate_number(), generate_number()\n", " \n", " # don't allow duplicates; this is good practice for training,\n", " # as we will minimize memorizing seen examples\n", " key = tuple(sorted((a, b)))\n", " if key in seen:\n", " 
continue\n", " seen.add(key)\n", " \n", " # pad the data with spaces so that the length is always MAXLEN.\n", " q = '{}+{}'.format(a, b)\n", " query = q + ' ' * (MAXLEN - len(q))\n", " ans = str(a + b)\n", " \n", " # answers can be of maximum size DIGITS + 1.\n", " ans += ' ' * (MAXOUTPUTLEN - len(ans))\n", " questions.append(query)\n", " answers.append(ans)\n", " print('Total addition questions:', len(questions))\n", " return questions, answers\n", "\n", "def encode_examples(questions, answers):\n", " x = np.zeros((len(questions), MAXLEN, len(chars)), dtype=np.bool)\n", " y = np.zeros((len(questions), DIGITS + 1, len(chars)), dtype=np.bool)\n", " for i, sentence in enumerate(questions):\n", " x[i] = ctable.encode(sentence, MAXLEN)\n", " for i, sentence in enumerate(answers):\n", " y[i] = ctable.encode(sentence, DIGITS + 1)\n", "\n", " indices = np.arange(len(y))\n", " np.random.shuffle(indices)\n", " return x[indices],y[indices]" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Generating data...\n", "Total addition questions: 50000\n", "Training Data shape:\n", "X : (45000, 7, 12)\n", "Y : (45000, 4, 12)\n", "Sample Question(in encoded form) : [[False False False False True False False False False False False False]\n", " [False False False True False False False False False False False False]\n", " [False False False False False False False True False False False False]\n", " [False True False False False False False False False False False False]\n", " [False False False False False False False False False False True False]\n", " [False False False False False False True False False False False False]\n", " [ True False False False False False False False False False False False]] [[False False False False True False False False False False False False]\n", " [False False False False False False False False False False False True]\n", " [False False False False False False False False False False False True]\n", " [ True False False False False False False False False False False False]]\n", "Sample Question(in decoded form) : 215+84 Sample Output : 299 \n" ] } ], "source": [ "q,a = data_generate(TRAINING_SIZE)\n", "x,y = encode_examples(q,a)\n", "\n", "# divides our data into training and validation\n", "split_at = len(x) - len(x) // 10\n", "x_train, x_val, y_train, y_val = x[:split_at], x[split_at:],y[:split_at],y[split_at:]\n", "\n", "print('Training Data shape:')\n", "print('X : ', x_train.shape)\n", "print('Y : ', y_train.shape)\n", "\n", "print('Sample Question(in encoded form) : ', x_train[0], y_train[0])\n", "print('Sample Question(in decoded form) : ', ctable.decode(x_train[0]),'Sample Output : ', ctable.decode(y_train[0]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Let's learn two wrapper functions in Keras - TimeDistributed and RepeatVector with some dummy examples." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**TimeDistributed** is a wrapper function call that applies an input operation on all the timesteps of an input data. 
For instance, if we have a feed-forward layer that converts a 5-dim vector into an 8-dim vector (as in the example below), then wrapping it in TimeDistributed converts a batch_size \* sentence_len \* 5 input into a batch_size \* sentence_len \* 8 output." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Shape of input : (1, 3, 5)\n", "Shape of output : (1, 3, 8)\n" ] } ], "source": [ "model = Sequential()\n", "# Inputs will be batch_size * time_steps * input_vector_dim (fed to the Dense layer)\n", "# Outputs will be batch_size * time_steps * output_vector_dim\n", "# Here, Dense() converts a 5-dim input vector to an 8-dim vector.\n", "model.add(TimeDistributed(Dense(8), input_shape=(3, 5)))\n", "input_array = np.random.randint(10, size=(1,3,5))\n", "print(\"Shape of input : \", input_array.shape)\n", "\n", "model.compile('rmsprop', 'mse')\n", "output_array = model.predict(input_array)\n", "print(\"Shape of output : \", output_array.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**RepeatVector** repeats a vector a specified number of times. The dimension changes from batch_size \* number_of_elements to batch_size \* number_of_repetitions \* number_of_elements." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [], "source": [ "model = Sequential()\n", "# converts from 1*10 to 1*6\n", "model.add(Dense(6, input_dim=10))\n", "print(model.output_shape)\n", "\n", "# converts from 1*6 to 1*3*6\n", "model.add(RepeatVector(3))\n", "print(model.output_shape) \n", "\n", "input_array = np.random.randint(1000, size=(1, 10))\n", "print(\"Shape of input : \", input_array.shape)\n", "\n", "model.compile('rmsprop', 'mse')\n", "output_array = model.predict(input_array)\n", "\n", "print(\"Shape of output : \", output_array.shape)\n", "# note: `None` is the batch dimension\n", "print('Input : ', input_array[0])\n", "print('Output : ', output_array[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### MODEL ARCHITECTURE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** Whenever you initialize an LSTM in Keras, the option `return_sequences` defaults to `False`. This means the next component only gets to see the LSTM's hidden state from the final timestep. If you instead set `return_sequences = True`, the LSTM returns its hidden state at every timestep, so the next component must be able to consume inputs of that form. \n", "\n", "Think about how this is relevant to the model architecture below and to the TimeDistributed wrapper we just learned.\n", "\n", "Build an encoder and a decoder, each a single recurrent layer with 128 hidden units, plus the Dense layer the model needs to produce a character distribution at each output timestep."
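, "\n", "Before building it, the next cell is a quick shape check added for illustration (a minimal sketch, not part of the original lab); it reuses the `MAXLEN`, `chars`, `Sequential`, and `layers` names defined above, and the variable names in it are purely illustrative. It shows how `return_sequences` changes an LSTM's output shape." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Minimal sketch (illustration only): the same LSTM with and without return_sequences.\n", "# return_sequences=False (the default) -> only the final hidden state: (None, 128)\n", "shape_check = Sequential()\n", "shape_check.add(layers.LSTM(128, input_shape=(MAXLEN, len(chars))))\n", "print(shape_check.output_shape)\n", "\n", "# return_sequences=True -> one hidden state per timestep: (None, MAXLEN, 128)\n", "shape_check_seq = Sequential()\n", "shape_check_seq.add(layers.LSTM(128, return_sequences=True, input_shape=(MAXLEN, len(chars))))\n", "print(shape_check_seq.output_shape)"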
] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Build model...\n", "Model: \"sequential_17\"\n", "_________________________________________________________________\n", "Layer (type) Output Shape Param # \n", "=================================================================\n", "lstm_5 (LSTM) (None, 128) 72192 \n", "_________________________________________________________________\n", "repeat_vector_2 (RepeatVecto (None, 4, 128) 0 \n", "_________________________________________________________________\n", "lstm_6 (LSTM) (None, 4, 128) 131584 \n", "_________________________________________________________________\n", "time_distributed_3 (TimeDist (None, 4, 12) 1548 \n", "=================================================================\n", "Total params: 205,324\n", "Trainable params: 205,324\n", "Non-trainable params: 0\n", "_________________________________________________________________\n" ] } ], "source": [ "# Hyperaparams\n", "RNN = layers.LSTM\n", "HIDDEN_SIZE = 128\n", "BATCH_SIZE = 128\n", "LAYERS = 1\n", "\n", "print('Build model...')\n", "model = Sequential()\n", "\n", "#ENCODING\n", "model.add(RNN(HIDDEN_SIZE, input_shape=(MAXLEN, len(chars))))\n", "model.add(RepeatVector(MAXOUTPUTLEN))\n", "\n", "#DECODING\n", "for _ in range(LAYERS):\n", " # return hidden layer at each time step\n", " model.add(RNN(HIDDEN_SIZE, return_sequences=True)) \n", "\n", "model.add(TimeDistributed(layers.Dense(len(chars), activation='softmax')))\n", "model.compile(loss='categorical_crossentropy',\n", " optimizer='adam',\n", " metrics=['accuracy'])\n", "model.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check how well our model trained." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Train on 45000 samples, validate on 5000 samples\n", "Epoch 1/20\n", "45000/45000 [==============================] - 10s 213us/step - loss: 1.8913 - accuracy: 0.3215 - val_loss: 1.8148 - val_accuracy: 0.3392\n", "Epoch 2/20\n", "45000/45000 [==============================] - 9s 194us/step - loss: 1.7547 - accuracy: 0.3544 - val_loss: 1.6947 - val_accuracy: 0.3761\n", "Epoch 3/20\n", "45000/45000 [==============================] - 8s 188us/step - loss: 1.6045 - accuracy: 0.3990 - val_loss: 1.5127 - val_accuracy: 0.4290\n", "Epoch 4/20\n", "45000/45000 [==============================] - 10s 223us/step - loss: 1.4356 - accuracy: 0.4599 - val_loss: 1.3487 - val_accuracy: 0.4931\n", "Epoch 5/20\n", "45000/45000 [==============================] - 11s 239us/step - loss: 1.2919 - accuracy: 0.5167 - val_loss: 1.2355 - val_accuracy: 0.5404\n", "Epoch 6/20\n", "45000/45000 [==============================] - 9s 196us/step - loss: 1.1851 - accuracy: 0.5556 - val_loss: 1.1403 - val_accuracy: 0.5719\n", "Epoch 7/20\n", "45000/45000 [==============================] - 9s 191us/step - loss: 1.0988 - accuracy: 0.5872 - val_loss: 1.0659 - val_accuracy: 0.5950\n", "Epoch 8/20\n", "45000/45000 [==============================] - 9s 194us/step - loss: 1.0235 - accuracy: 0.6134 - val_loss: 0.9967 - val_accuracy: 0.6226\n", "Epoch 9/20\n", "45000/45000 [==============================] - 9s 198us/step - loss: 0.9560 - accuracy: 0.6388 - val_loss: 0.9391 - val_accuracy: 0.6388\n", "Epoch 10/20\n", "45000/45000 [==============================] - 9s 207us/step - loss: 0.8848 - accuracy: 0.6645 - val_loss: 0.8588 - val_accuracy: 0.6694\n", 
"Epoch 11/20\n", "45000/45000 [==============================] - 10s 217us/step - loss: 0.7927 - accuracy: 0.6988 - val_loss: 0.7477 - val_accuracy: 0.7097\n", "Epoch 12/20\n", "45000/45000 [==============================] - 9s 198us/step - loss: 0.6750 - accuracy: 0.7465 - val_loss: 0.6294 - val_accuracy: 0.7650\n", "Epoch 13/20\n", "45000/45000 [==============================] - 10s 223us/step - loss: 0.5542 - accuracy: 0.8015 - val_loss: 0.5125 - val_accuracy: 0.8166\n", "Epoch 14/20\n", "45000/45000 [==============================] - 10s 217us/step - loss: 0.4452 - accuracy: 0.8534 - val_loss: 0.4086 - val_accuracy: 0.8692\n", "Epoch 15/20\n", "45000/45000 [==============================] - 9s 197us/step - loss: 0.3586 - accuracy: 0.8935 - val_loss: 0.3416 - val_accuracy: 0.8946\n", "Epoch 16/20\n", "45000/45000 [==============================] - 9s 200us/step - loss: 0.2982 - accuracy: 0.9166 - val_loss: 0.2824 - val_accuracy: 0.9196\n", "Epoch 17/20\n", "45000/45000 [==============================] - 9s 199us/step - loss: 0.2379 - accuracy: 0.9403 - val_loss: 0.2302 - val_accuracy: 0.9381\n", "Epoch 18/20\n", "45000/45000 [==============================] - 9s 190us/step - loss: 0.1988 - accuracy: 0.9516 - val_loss: 0.1899 - val_accuracy: 0.9512\n", "Epoch 19/20\n", "45000/45000 [==============================] - 9s 203us/step - loss: 0.1605 - accuracy: 0.9644 - val_loss: 0.1664 - val_accuracy: 0.9562\n", "Epoch 20/20\n", "45000/45000 [==============================] - 9s 195us/step - loss: 0.1354 - accuracy: 0.9705 - val_loss: 0.1462 - val_accuracy: 0.9622\n", "Finished iteration 1\n", "Question 551+581 True 1132 Guess 1132 Good job\n", "Question 332+746 True 1078 Guess 1078 Good job\n", "Question 736+654 True 1390 Guess 1490 Fail\n", "Question 77+705 True 782 Guess 782 Good job\n", "Question 7+266 True 273 Guess 273 Good job\n", "Question 18+7 True 25 Guess 25 Good job\n", "Question 123+666 True 789 Guess 889 Fail\n", "Question 886+48 True 934 Guess 934 Good job\n", "Question 6+862 True 868 Guess 868 Good job\n", "Question 492+378 True 870 Guess 870 Good job\n", "Question 59+30 True 89 Guess 89 Good job\n", "Question 542+95 True 637 Guess 637 Good job\n", "Question 58+738 True 796 Guess 796 Good job\n", "Question 29+62 True 91 Guess 91 Good job\n", "Question 217+21 True 238 Guess 238 Good job\n", "Question 140+811 True 951 Guess 951 Good job\n", "Question 601+814 True 1415 Guess 1425 Fail\n", "Question 815+337 True 1152 Guess 1152 Good job\n", "Question 578+52 True 630 Guess 630 Good job\n", "Question 3+424 True 427 Guess 427 Good job\n", "The model scored 85.0 % in its test.\n" ] } ], "source": [ "for iteration in range(1, 2):\n", " print() \n", " model.fit(x_train, y_train,\n", " batch_size=BATCH_SIZE,\n", " epochs=20,\n", " validation_data=(x_val, y_val))\n", " # Select 10 samples from the validation set at random so\n", " # we can visualize errors.\n", " print('Finished iteration ', iteration)\n", " numcorrect = 0\n", " numtotal = 20\n", " \n", " for i in range(numtotal):\n", " ind = np.random.randint(0, len(x_val))\n", " rowx, rowy = x_val[np.array([ind])], y_val[np.array([ind])]\n", " preds = model.predict_classes(rowx, verbose=0)\n", " q = ctable.decode(rowx[0])\n", " correct = ctable.decode(rowy[0])\n", " guess = ctable.decode(preds[0], calc_argmax=False)\n", " print('Question', q, end=' ')\n", " print('True', correct, end=' ')\n", " print('Guess', guess, end=' ')\n", " if guess == correct :\n", " print('Good job')\n", " numcorrect += 1\n", " else:\n", " print('Fail')\n", " 
print('The model scored ', numcorrect*100/numtotal,' % in its test.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### EXERCISE\n", "\n", "* Try changing the hyperparameters: use other RNN types (e.g., SimpleRNN, GRU), add more layers, and check whether increasing the number of epochs helps.\n", "* Try reversing the operands in the validation inputs and check whether the model has learned the commutative property of addition.\n", "* Try printing the hidden representations for two commutative inputs (e.g., '231+432' and '432+231') and check whether they are the same or similar. Do we expect them to be? Why or why not? You can access a layer by index with model.layers, and layer.output gives the output of that layer.\n", "* Try doing addition the way we do it by hand: reverse the order of the digits and, at each time step, feed in one digit from each number (units in the first time step, tens in the second, and so on), producing an output and carrying information forward through the hidden state." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }