{
"cells": [
{
"cell_type": "markdown",
"id": "promising-photography",
"metadata": {},
"source": [
"# Data Science 2: Advanced Topics in Data Science \n",
"## Section 3: Recurrent Neural Networks\n",
"\n",
"\n",
"**Harvard University** \n",
"**Spring 2021** \n",
"**Instructors**: Mark Glickman, Pavlos Protopapas, and Chris Tanner \n",
"**Authors**: Chris Gumb and Eleni Kaxiras\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "central-speech",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES\n",
"import requests\n",
"from IPython.core.display import HTML\n",
"styles = requests.get(\"https://raw.githubusercontent.com/Harvard-IACS/2019-CS109B/master/content/styles/cs109.css\").text\n",
"HTML(styles)"
]
},
{
"cell_type": "markdown",
"id": "terminal-northern",
"metadata": {},
"source": [
"## Learning Objectives\n",
"\n",
"By the end of this lab, you should understand:\n",
"- how to perform basic preprocessing on text data\n",
"- the layers used in `keras` to construct RNNs and their variants (GRU, LSTM)\n",
"- how the model's task (e.g., many-to-one, many-to-many) affects architecture choices"
]
},
{
"cell_type": "markdown",
"id": "broken-brain",
"metadata": {},
"source": [
"\n",
"\n",
"## Notebook Contents\n",
"- [**IMDB Review Dataset**](#imdb)\n",
"- [**Preprocessing Text Data**](#prep)\n",
" - [Tokenization](#token)\n",
" - [Padding](#pad)\n",
" - [Numerical Encoding](#encode)\n",
"- [**Movie Review Sentiment Analysis**](#FFNN)\n",
" - [Naive FFNN](#FFNN)\n",
" - [Embedding Layer](#embed)\n",
" - [1D CNN](#cnn)\n",
" - [Vanilla RNN](#rnn)\n",
" - [Vanishing/Exploding Gradients](#vanish)\n",
" - [GRU](#gru)\n",
" - [LSTM](#lstm)\n",
" - [BiDirectional Layer](#bidir)\n",
" - [Deep RNNs](#deep)\n",
" - [TimeDistributed Layer](#timedis)\n",
" - [RepeatVector Layer](#repeatvec)\n",
" - [CNN + RNN](#cnnrnn)\n",
"- [**Heavy Metal Lyric Generator**](#metal)\n",
" - [Creating Input/Target Pairs](#pairs)\n",
" - [LambdaCallback](#lambdacall)\n",
"- [**Arithmetic w/ RNNs**](#math)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "skilled-yield",
"metadata": {},
"outputs": [],
"source": [
"import tensorflow as tf\n",
"from tensorflow.keras.datasets import imdb\n",
"from tensorflow.keras.models import Sequential, Model, load_model\n",
"from tensorflow.keras.layers import BatchNormalization, Bidirectional, Dense, Embedding, GRU, LSTM, SimpleRNN,\\\n",
" Input, TimeDistributed, Dropout, RepeatVector\n",
"from tensorflow.keras.layers import Conv1D, Conv2D, Flatten, MaxPool1D, MaxPool2D, Lambda\n",
"from tensorflow.keras.callbacks import EarlyStopping, LambdaCallback, ModelCheckpoint\n",
"from tensorflow.keras.initializers import Constant\n",
"from tensorflow.keras.preprocessing import sequence\n",
"from sklearn.model_selection import train_test_split\n",
"import tensorflow_datasets\n",
"from matplotlib import pyplot as plt\n",
"import numpy as np\n",
"import pandas as pd\n",
"import re, sys\n",
"# fix random seed for reproducibility\n",
"np.random.seed(109)"
]
},
{
"cell_type": "markdown",
"id": "capable-classics",
"metadata": {},
"source": [
"## Case Study: IMDB Review Classifier <a id='imdb'></a>\n",
"\n",
"\n",
"Let's frame our discussion of RNNs around the example of a text classifier. Specifically, we'll build and evaluate various models that all attempt to discriminate between positive and negative reviews from the Internet Movie Database (IMDB). The dataset is again made available to us through the TensorFlow Datasets API."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "coupled-alignment",
"metadata": {},
"outputs": [],
"source": [
"import tensorflow_datasets"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "infinite-consent",
"metadata": {},
"outputs": [],
"source": [
"(train, test), info = tensorflow_datasets.load('imdb_reviews', split=['train', 'test'], with_info=True)"
]
},
{
"cell_type": "markdown",
"id": "beneficial-course",
"metadata": {},
"source": [
"The helpful `info` object provides details about the dataset."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "athletic-scheme",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"tfds.core.DatasetInfo(\n",
" name='imdb_reviews',\n",
" full_name='imdb_reviews/plain_text/1.0.0',\n",
" description=\"\"\"\n",
" Large Movie Review Dataset.\n",
" This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.\n",
" \"\"\",\n",
" config_description=\"\"\"\n",
" Plain text\n",
" \"\"\",\n",
" homepage='http://ai.stanford.edu/~amaas/data/sentiment/',\n",
" data_path='/home/10914655/tensorflow_datasets/imdb_reviews/plain_text/1.0.0',\n",
" download_size=80.23 MiB,\n",
" dataset_size=129.83 MiB,\n",
" features=FeaturesDict({\n",
" 'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),\n",
" 'text': Text(shape=(), dtype=tf.string),\n",
" }),\n",
" supervised_keys=('text', 'label'),\n",
" splits={\n",
" 'test': ,\n",
" 'train': ,\n",
" 'unsupervised': ,\n",
" },\n",
" citation=\"\"\"@InProceedings{maas-EtAl:2011:ACL-HLT2011,\n",
" author = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},\n",
" title = {Learning Word Vectors for Sentiment Analysis},\n",
" booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},\n",
" month = {June},\n",
" year = {2011},\n",
" address = {Portland, Oregon, USA},\n",
" publisher = {Association for Computational Linguistics},\n",
" pages = {142--150},\n",
" url = {http://www.aclweb.org/anthology/P11-1015}\n",
" }\"\"\",\n",
")"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"info"
]
},
{
"cell_type": "markdown",
"id": "continuous-perfume",
"metadata": {},
"source": [
"We see that the dataset consists of text reviews and binary good/bad labels. Here are two examples:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "static-concern",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"text:\n",
"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.\n",
"\n",
"label: bad\n",
"\n",
"text:\n",
"This is the kind of film for a snowy Sunday afternoon when the rest of the world can go ahead with its own business as you descend into a big arm-chair and mellow for a couple of hours. Wonderful performances from Cher and Nicolas Cage (as always) gently row the plot along. There are no rapids to cross, no dangerous waters, just a warm and witty paddle through New York life at its best. A family film in every sense and one that deserves the praise it received.\n",
"\n",
"label: good\n",
"\n"
]
}
],
"source": [
"labels = {0: 'bad', 1: 'good'}\n",
"seen = {'bad': False, 'good': False}\n",
"for review in train:\n",
" label = review['label'].numpy()\n",
" if not seen[labels[label]]:\n",
" print(f\"text:\\n{review['text'].numpy().decode()}\\n\")\n",
" print(f\"label: {labels[label]}\\n\")\n",
" seen[labels[label]] = True\n",
" if all(val == True for val in seen.values()):\n",
" break"
]
},
{
"cell_type": "markdown",
"id": "nuclear-genealogy",
"metadata": {},
"source": [
"Great! But unfortunately, computers can't read! 📖--🤖❓"
]
},
{
"cell_type": "markdown",
"id": "meaning-default",
"metadata": {},
"source": [
"## Preprocessing Text Data <a id='prep'></a>\n",
"\n",
"Computers have no built-in knowledge of language and cannot understand text data in any rich way that humans do -- at least not without some help! The first crucial step in natural language processing is to clean and preprocess your data so that your algorithms and models can make use of it.\n",
" \n",
"We'll look at a few preprocessing steps:\n",
" - tokenization\n",
" - padding\n",
" - numerical encoding\n",
" \n",
"Depending on your NLP task, you may want to take additional preprocessing steps which we will not cover here. These can include:\n",
"- converting all characters to lowercase\n",
"- treating each punctuation mark as a token (e.g., `,` `.` `!` `?` are each separate tokens)\n",
"- removing punctuation altogether\n",
"- separating each sentence with unique boundary symbols (e.g., start- and end-of-sentence tokens)\n",
"- removing words that are incredibly common (e.g., function words, (in)definite articles). These are referred to as 'stopwords'.\n",
"- lemmatizing (replacing words with their 'dictionary entry form')\n",
"- stemming (removing grammatical morphemes)\n",
" \n",
"Useful NLP Python libraries such as [NLTK](https://www.nltk.org/) and [spaCy](https://spacy.io/) provide built-in methods for many of these preprocessing steps."
]
},
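{
"cell_type": "markdown",
"id": "added-nltk-sketch",
"metadata": {},
"source": [
"The cells below roll our own minimal pipeline (tokenization, padding, encoding). For the extra steps listed above, here is a small illustrative sketch using NLTK -- the specific choices (Porter stemmer, English stopword list, the toy sentence) are just assumptions for the example, not part of this lab's pipeline:\n",
"```\n",
"import nltk\n",
"from nltk.corpus import stopwords\n",
"from nltk.stem import PorterStemmer\n",
"\n",
"nltk.download('punkt')      # tokenizer models\n",
"nltk.download('stopwords')  # stopword list\n",
"\n",
"text = 'This was an absolutely terrible movie.'\n",
"tokens = nltk.word_tokenize(text.lower())     # lowercase, then split into word/punctuation tokens\n",
"tokens = [t for t in tokens if t not in stopwords.words('english')]  # drop very common words\n",
"stemmer = PorterStemmer()\n",
"tokens = [stemmer.stem(t) for t in tokens]    # strip grammatical morphemes\n",
"print(tokens)\n",
"```"
]
},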
{
"cell_type": "markdown",
"id": "parallel-huntington",
"metadata": {},
"source": [
"### Tokenization <a id='token'></a>\n",
"\n",
"**Tokens** are the atomic units of meaning which our model will be working with. What should these units be? These could be characters, words, or even sentences. For our movie review classifier we will be working at the word level."
]
},
{
"cell_type": "markdown",
"id": "utility-secret",
"metadata": {},
"source": [
"For this example we will process just a subset of the original dataset."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "bacterial-toolbox",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'label': ,\n",
" 'text': }"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"SAMPLE_SIZE = 10\n",
"subset = list(train.take(SAMPLE_SIZE))\n",
"subset[0]"
]
},
{
"cell_type": "markdown",
"id": "tested-greenhouse",
"metadata": {},
"source": [
"The TFDS format allows for the construction of efficient preprocessing pipelines. But for our own preprocessing example we will be primarily working with Python `list` objects. This gives us a chance to practice the Python **list comprehension** which is a powerful tool to have at your disposal. It will serve you well when processing arbitrary text which may not already be in a nice TFDS format (such as in the HW 😉).\n",
"\n",
"We'll convert our data subset into X and y lists."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "artificial-attraction",
"metadata": {},
"outputs": [],
"source": [
"X = [x['text'].numpy().decode() for x in subset]\n",
"y = [x['label'].numpy() for x in subset]"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "hispanic-pottery",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"X has 10 reviews\n",
"y has 10 labels\n"
]
}
],
"source": [
"print(f'X has {len(X)} reviews')\n",
"print(f'y has {len(y)} labels')"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "destroyed-hierarchy",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"First 20 characters of all reviews:\n",
"['This was an absolute...', 'I have been known to...', 'Mann photographs the...', 'This is the kind of ...', 'As others have menti...', 'This is a film which...', 'Okay, you have: ...', ...]\n"
]
}
],
"source": [
"print('First 20 characters of all reviews:')\n",
"print([x[:20] + '...' for x in X])"
]
},
{
"cell_type": "markdown",
"id": "word-level-note",
"metadata": {},
"source": [
"We'll see a bit later that you can in fact successfully train a neural network on text data at the character level.\n",
"\n",
"But for the moment we will work at the word level. This means our observations should be organized as **sequences of words** rather than sequences of characters."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "orange-event",
"metadata": {},
"outputs": [],
"source": [
"# list comprehensions again to the rescue!\n",
"X = [x.split() for x in X]\n",
"# The same thing can be accomplished with:\n",
"# list(map(str.split, X))\n",
"# but that is much harder to parse! O_o"
]
},
{
"cell_type": "markdown",
"id": "addressed-detective",
"metadata": {},
"source": [
"Now let's look at the first 10 **tokens** in the first 2 reviews."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "understanding-rabbit",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(['This',\n",
" 'was',\n",
" 'an',\n",
" 'absolutely',\n",
" 'terrible',\n",
" 'movie.',\n",
" \"Don't\",\n",
" 'be',\n",
" 'lured',\n",
" 'in'],\n",
" ['I',\n",
" 'have',\n",
" 'been',\n",
" 'known',\n",
" 'to',\n",
" 'fall',\n",
" 'asleep',\n",
" 'during',\n",
" 'films,',\n",
" 'but'])"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X[0][:10], X[1][:10]"
]
},
{
"cell_type": "markdown",
"id": "departmental-hello",
"metadata": {},
"source": [
"### Padding <a id='pad'></a>\n",
"\n",
"Let's take a look at the lengths of the reviews in our subset."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "experienced-order",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"[116, 112, 132, 88, 81, 289, 557, 111, 223, 127]"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[len(x) for x in X]"
]
},
{
"cell_type": "markdown",
"id": "described-mexico",
"metadata": {},
"source": [
"If we were training our RNN one sentence at a time, it would be okay to have sentences of varying lengths. However, as with any neural network, it can sometimes be advantageous to train on inputs in batches. When doing so with RNNs, our input tensors need to be of the same length/dimensions."
]
},
{
"cell_type": "markdown",
"id": "divided-venice",
"metadata": {},
"source": [
"Here are two examples of tokenized reviews padded to have a length of 5.\n",
"```\n",
"['I', 'loved', 'it', '<pad>', '<pad>']\n",
"['It', 'stinks', '<pad>', '<pad>', '<pad>']\n",
"```\n",
"Now let's pad our own examples. Note that 'padding' in this context also means truncating sequences that are longer than our specified max length."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "turkish-closing",
"metadata": {},
"outputs": [],
"source": [
"MAX_LEN = 500\n",
"PAD = '<pad>'\n",
"# truncate\n",
"X = [x[:MAX_LEN] for x in X]\n",
"# pad\n",
"for x in X:\n",
" while len(x) < MAX_LEN:\n",
" x.append(PAD)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "endangered-juvenile",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[500, 500, 500, 500, 500, 500, 500, 500, 500, 500]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[len(x) for x in X]"
]
},
{
"cell_type": "markdown",
"id": "understood-barcelona",
"metadata": {},
"source": [
"Now all reviews are of a uniform length!"
]
},
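{
"cell_type": "markdown",
"id": "added-pad-sequences-note",
"metadata": {},
"source": [
"As an aside, Keras provides a helper that does this padding/truncating for us: `sequence.pad_sequences` (we imported `sequence` above). It is typically applied to integer-encoded sequences like the ones we build next; a quick sketch on toy data:\n",
"```\n",
"from tensorflow.keras.preprocessing import sequence\n",
"\n",
"toy = [[1, 4, 2], [11, 2, 9, 8]]   # two integer-encoded 'reviews' of different lengths\n",
"padded = sequence.pad_sequences(toy, maxlen=5, padding='post', value=0)\n",
"print(padded)\n",
"# [[ 1  4  2  0  0]\n",
"#  [11  2  9  8  0]]\n",
"```"
]
},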
{
"cell_type": "markdown",
"id": "close-worse",
"metadata": {},
"source": [
"### Numerical Encoding <a id='encode'></a>\n",
"\n",
"If each review in our dataset is an observation, then the features of each observation are the tokens, in this case, words. But these words are still strings. Our machine learning methods require us to be able to multiply our features by weights. If we want to use these words as inputs for a neural network we'll have to convert them into some numerical representation.\n",
"\n",
"One solution is to create a one-to-one mapping between unique words and integers.\n",
"\n",
"If the six sentences below were our entire corpus, our conversion would look like this:\n",
"\n",
"1. i have books - [1, 4, 2]\n",
"2. interesting books are useful - [9, 2, 5, 7]\n",
"3. i have computers - [1, 4, 3]\n",
"4. computers are interesting and useful - [3, 5, 9, 8, 7]\n",
"5. books and computers are both valuable - [2, 8, 3, 5, 10, 11]\n",
"6. bye bye - [6, 6]\n",
"\n",
"i-1, books-2, computers-3, have-4, are-5, bye-6, useful-7, and-8, interesting-9, both-10, valuable-11\n",
"\n",
"To accomplish this we'll first need to know what all the unique words are in our dataset."
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "broke-oakland",
"metadata": {},
"outputs": [],
"source": [
"all_tokens = [word for review in X for word in review]"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "sensitive-transparency",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(5000, 5000)"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# sanity check\n",
"len(all_tokens), sum([len(x) for x in X])"
]
},
{
"cell_type": "markdown",
"id": "matched-information",
"metadata": {},
"source": [
"Casting our `list` of words into a `set` is a great way to get all the *unique* words in the data."
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "broad-section",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Unique Words: 892\n"
]
}
],
"source": [
"vocab = sorted(set(all_tokens))\n",
"print('Unique Words:', len(vocab))"
]
},
{
"cell_type": "markdown",
"id": "external-preliminary",
"metadata": {},
"source": [
"Now we need to create a mapping from words to integers. For this we will use a **dictionary comprehension**."
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "absent-coordinate",
"metadata": {},
"outputs": [],
"source": [
"word2idx = {word: idx for idx, word in enumerate(vocab)}"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "northern-teens",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'\"Absolute': 0,\n",
" '\"Bohlen\"-Fan': 1,\n",
" '\"Brideshead': 2,\n",
" '\"Candy\"?).': 3,\n",
" '\"City': 4,\n",
" '\"Dieter': 5,\n",
" '\"Dieter\"': 6,\n",
" '\"Dragonfly\"': 7,\n",
" '\"I\\'ve': 8,\n",
" ...}"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"word2idx"
]
},
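{
"cell_type": "markdown",
"id": "added-encode-sketch",
"metadata": {},
"source": [
"With this mapping in hand, integer-encoding our (padded) reviews is just one more list comprehension. A quick sketch to illustrate the idea on our small subset:\n",
"```\n",
"# map every token in every review to its integer index\n",
"X_encoded = [[word2idx[word] for word in review] for review in X]\n",
"print(X_encoded[0][:10])  # integer ids of the first 10 tokens of the first review\n",
"```"
]
},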
{
"cell_type": "markdown",
"id": "known-rover",
"metadata": {},
"source": [
"Let us build a single-layer feed-forward net with a hidden layer of 250 nodes. Each input would be a 500-dim vector of tokens since we padded all our sequences to size 500.\n",
"\n",
"<div class='exercise'><b>Exercise</b>\n",
"\n",
"Q: How would you calculate the number of parameters in this network?\n",
"</div>\n",
"\n",
"### Embedding Layer <a id='embed'></a>\n",
"\n",
"One can view the embedding process as a linear projection from one vector space to another. For NLP, we usually use embeddings to project the sparse one-hot encodings of words on to a lower-dimensional continuous space so that the input surface is 'dense' and possibly smooth. Thus, one can view this embedding layer process as just a transformation from $\\mathbb{R}^{inp}$ to $\\mathbb{R}^{emb}$.\n",
"\n",
"This not only reduces dimensionality but also allows semantic similarities between tokens to be captured by 'similarities' between the embedding vectors. This was not possible with one-hot encoding as all vectors there were orthogonal to one another. \n",
"\n",
"It is also possible to load pretrained embeddings that were learned from giant corpora. This would be an instance of transfer learning.\n",
"\n",
"If you are interested in learning more, start with the astronomically impactful papers of [word2vec](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) and [GloVe](https://www.aclweb.org/anthology/D14-1162.pdf).\n",
"\n",
"In Keras we use the [`Embedding`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) layer:\n",
"```\n",
"tf.keras.layers.Embedding(\n",
" input_dim, output_dim, embeddings_initializer='uniform',\n",
" embeddings_regularizer=None, activity_regularizer=None,\n",
" embeddings_constraint=None, mask_zero=False, input_length=None, **kwargs\n",
")\n",
"```\n",
"We'll need to specify the `input_dim` and `output_dim`. If working with sequences, as we are, you'll also need to set the `input_length`."
]
},
{
"cell_type": "code",
"execution_count": 42,
"id": "covered-automation",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model: \"FFNN_EMBED\"\n",
"_________________________________________________________________\n",
"Layer (type) Output Shape Param # \n",
"=================================================================\n",
"embedding_1 (Embedding) (None, 500, 100) 1000000 \n",
"_________________________________________________________________\n",
"flatten_1 (Flatten) (None, 50000) 0 \n",
"_________________________________________________________________\n",
"dense_6 (Dense) (None, 250) 12500250 \n",
"_________________________________________________________________\n",
"dense_7 (Dense) (None, 1) 251 \n",
"=================================================================\n",
"Total params: 13,500,501\n",
"Trainable params: 13,500,501\n",
"Non-trainable params: 0\n",
"_________________________________________________________________\n",
"None\n",
"Epoch 1/2\n",
"196/196 - 6s - loss: 0.6433 - accuracy: 0.6078 - val_loss: 0.3630 - val_accuracy: 0.8497\n",
"Epoch 2/2\n",
"196/196 - 6s - loss: 0.2349 - accuracy: 0.9025 - val_loss: 0.2977 - val_accuracy: 0.8747\n",
"Accuracy: 87.47%\n"
]
}
],
"source": [
"EMBED_DIM = 100\n",
"\n",
"model = Sequential(name='FFNN_EMBED')\n",
"model.add(Embedding(MAX_VOCAB, EMBED_DIM, input_length=MAX_LEN))\n",
"model.add(Flatten())\n",
"model.add(Dense(250, activation='relu'))\n",
"model.add(Dense(1, activation='sigmoid'))\n",
"model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n",
"print(model.summary())\n",
"\n",
"model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=2, batch_size=128, verbose=2)\n",
"\n",
"scores = model.evaluate(X_test, y_test, verbose=0)\n",
"print(\"Accuracy: %.2f%%\" % (scores[1]*100))"
]
},
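{
"cell_type": "markdown",
"id": "added-param-count",
"metadata": {},
"source": [
"A quick check of where those parameter counts come from (assuming `MAX_VOCAB` is 10,000, which is consistent with the 1,000,000 embedding parameters in the summary above):\n",
"\n",
"- `Embedding`: 10,000 × 100 = 1,000,000\n",
"- `Flatten`: 500 × 100 = 50,000 values per review (no parameters)\n",
"- `Dense(250)`: 50,000 × 250 + 250 = 12,500,250\n",
"- `Dense(1)`: 250 + 1 = 251\n",
"- Total: 13,500,501"
]
},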
{
"cell_type": "markdown",
"id": "silent-reporter",
"metadata": {},
"source": [
"### Model 3: 1-Dimensional Convolutional Network <a id='cnn'></a>"
]
},
{
"cell_type": "markdown",
"id": "continental-import",
"metadata": {},
"source": [
"Text can be thought of as a 1-dimensional sequence (a single, long vector) and we can apply 1D convolutions over a set of word embeddings. \n",
"\n",
"More information on convolutions on text data can be found on [this blog](http://debajyotidatta.github.io/nlp/deep/learning/word-embeddings/2016/11/27/Understanding-Convolutions-In-Text/). If you want to learn more, read this [published and well-cited paper](https://www.aclweb.org/anthology/I17-1026.pdf) from Eleni's friend, Byron Wallace."
]
},
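{
"cell_type": "markdown",
"id": "added-conv1d-sketch",
"metadata": {},
"source": [
"A minimal sketch of what a 1D-convolutional review classifier could look like using the layers we imported earlier -- the filter count, kernel size, and pooling choices are illustrative, not the lab's exact architecture:\n",
"```\n",
"model = Sequential(name='CNN1D')\n",
"model.add(Embedding(MAX_VOCAB, EMBED_DIM, input_length=MAX_LEN))\n",
"model.add(Conv1D(32, kernel_size=3, padding='same', activation='relu'))  # slide filters along the word dimension\n",
"model.add(MaxPool1D(pool_size=2))   # downsample the sequence\n",
"model.add(Flatten())\n",
"model.add(Dense(250, activation='relu'))\n",
"model.add(Dense(1, activation='sigmoid'))\n",
"model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n",
"```"
]
},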
{
"cell_type": "markdown",
"id": "adjacent-response",
"metadata": {},
"source": [
"<div class='exercise'><b>Exercise</b>\n",
"\n",
"Q: Why do we use Conv1D if our input, a sequence of word embeddings, is 2D?\n",
"</div>"
]
},
{
"cell_type": "markdown",
"id": "secret-atlas",
"metadata": {},
"source": [
"At a high-level, an RNN is similar to a feed-forward neural network (FFNN) in that there is an input layer, a hidden layer, and an output layer. The input layer is fully connected to the hidden layer, and the hidden layer is fully connected to the output layer. However, the crux of what makes it a **recurrent** neural network is that the hidden layer for a given time _t_ is not only based on the input layer at time _t_ but also the hidden layer from time _t-1_.\n",
"\n",
"Here's a popular blog post on [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).\n",
"\n",
"In Keras, the vanilla RNN unit is implemented as the `SimpleRNN` layer:\n",
"```\n",
"tf.keras.layers.SimpleRNN(\n",
" units, activation='tanh', use_bias=True,\n",
" kernel_initializer='glorot_uniform',\n",
" recurrent_initializer='orthogonal',\n",
" bias_initializer='zeros', kernel_regularizer=None,\n",
" recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None,\n",
" kernel_constraint=None, recurrent_constraint=None, bias_constraint=None,\n",
" dropout=0.0, recurrent_dropout=0.0, return_sequences=False, return_state=False,\n",
" go_backwards=False, stateful=False, unroll=False, **kwargs\n",
")\n",
"```\n",
"As you can see, recurrent layers in Keras take many arguments. We only need to be concerned with `units`, which specifies the size of the hidden state, and `return_sequences`, which will be discussed shortly. For the moment it is fine to leave this set to the default of `False`.\n",
"\n",
"Due to the limitations of the vanilla RNN unit (more on that next) it tends not to be used much in practice. For this reason it seems that the Keras developers neglected to implement GPU acceleration for this layer! Notice how much slower the training is even for a network with far fewer parameters."
]
},
{
"cell_type": "code",
"execution_count": 45,
"id": "stretch-andorra",
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model: \"SimpleRNN\"\n",
"_________________________________________________________________\n",
"Layer (type) Output Shape Param # \n",
"=================================================================\n",
"embedding_3 (Embedding) (None, 500, 100) 1000000 \n",
"_________________________________________________________________\n",
"simple_rnn (SimpleRNN) (None, 100) 20100 \n",
"_________________________________________________________________\n",
"dense_10 (Dense) (None, 1) 101 \n",
"=================================================================\n",
"Total params: 1,020,201\n",
"Trainable params: 1,020,201\n",
"Non-trainable params: 0\n",
"_________________________________________________________________\n",
"None\n",
"Epoch 1/3\n",
"196/196 [==============================] - 53s 267ms/step - loss: 0.6720 - accuracy: 0.5660\n",
"Epoch 2/3\n",
"196/196 [==============================] - 52s 266ms/step - loss: 0.5283 - accuracy: 0.7444\n",
"Epoch 3/3\n",
"196/196 [==============================] - 52s 265ms/step - loss: 0.3406 - accuracy: 0.8588\n",
"Accuracy: 83.13%\n"
]
}
],
"source": [
"model = Sequential(name='SimpleRNN')\n",
"model.add(Embedding(MAX_VOCAB, EMBED_DIM, input_length=MAX_LEN))\n",
"model.add(SimpleRNN(100))\n",
"model.add(Dense(1, activation='sigmoid'))\n",
"model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n",
"print(model.summary())\n",
"\n",
"model.fit(X_train, y_train, epochs=3, batch_size=128)\n",
"\n",
"scores = model.evaluate(X_test, y_test, verbose=0)\n",
"print(\"Accuracy: %.2f%%\" % (scores[1]*100))"
]
},
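{
"cell_type": "markdown",
"id": "added-rnn-param-count",
"metadata": {},
"source": [
"A quick check on the summary above: the `SimpleRNN` layer's 20,100 parameters come from units × (input features + units + 1) = 100 × (100 + 100 + 1) -- one weight matrix applied to the embedded input, one applied to the previous hidden state, and a bias vector. The hidden-to-hidden matrix is exactly what makes the layer recurrent."
]
},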
{
"cell_type": "markdown",
"id": "dependent-blast",
"metadata": {},
"source": [
"### Vanishing/Exploding Gradients <a id='vanish'></a>"
]
},
{
"cell_type": "markdown",
"id": "weekly-comparison",
"metadata": {},
"source": [
"\n",
"We need to backpropagate through every time step to calculate the gradients used for our weight updates.\n",
"\n",
"This requires the use of the chain rule, which amounts to repeated multiplications.\n",
"\n",
"This can cause two types of problems. First, this product can quickly 'explode,' becoming large, causing destructive updates to the model and numerical overflow. One hack to solve this problem is to **clip** the gradient at some threshold.\n",
"\n",
"Alternatively, the gradient can 'vanish,' getting smaller and smaller as the gradient moves backwards in time. Gradient clipping will not help us here. If we can't propagate gradients sufficiently far back in time then our network will be unable to learn long temporal dependencies. This problem motivates the architecture of the GRU and LSTM units as substitutes for the 'vanilla' RNN.\n",
"\n",
"For a more detailed look at the vanishing/exploding gradient problem, please see [Marios's excellent Advanced Section](https://edstem.org/us/courses/3773/lessons/11753/slides/56629). "
]
},
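{
"cell_type": "markdown",
"id": "added-clipping-sketch",
"metadata": {},
"source": [
"As an aside, Keras optimizers support gradient clipping directly via the `clipnorm`/`clipvalue` arguments. A minimal sketch (the threshold of 1.0 is an illustrative choice):\n",
"```\n",
"from tensorflow.keras.optimizers import Adam\n",
"\n",
"# clip the gradient norm at 1.0 before each weight update\n",
"clipped_adam = Adam(learning_rate=1e-3, clipnorm=1.0)\n",
"model.compile(loss='binary_crossentropy', optimizer=clipped_adam, metrics=['accuracy'])\n",
"```"
]
},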
{
"cell_type": "markdown",
"id": "annual-composition",
"metadata": {},
"source": [
"### Model 5: GRU <a id='gru'></a>\n",
"\n",
"$X_{t}$: input \n",
"$U$, $V$, and $\\beta$: parameter matrices and vector \n",
"$\\tilde{h_t}$: candidate activation vector \n",
"$h_{t}$: output vector \n",
"$R_t$: reset gate \n",
"$Z_t$: update gate \n",
"\n",
"The gates of the GRU allow for the gradients to flow more freely to previous time steps, helping to mitigate the vanishing gradient problem.\n",
"\n",
"In Keras, the [`GRU`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GRU) layer is used in exactly the same way as the `SimpleRNN` layer. \n",
"```\n",
"tf.keras.layers.GRU(\n",
" units, activation='tanh', recurrent_activation='sigmoid',\n",
" use_bias=True, kernel_initializer='glorot_uniform',\n",
" recurrent_initializer='orthogonal',\n",
" bias_initializer='zeros', kernel_regularizer=None,\n",
" recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None,\n",
" kernel_constraint=None, recurrent_constraint=None, bias_constraint=None,\n",
" dropout=0.0, recurrent_dropout=0.0, return_sequences=False, return_state=False,\n",
" go_backwards=False, stateful=False, unroll=False, time_major=False,\n",
" reset_after=True, **kwargs\n",
")\n",
"```\n",
"\n",
"Here we just swap it into the previous architecture. Note how much faster it trains with GPU acceleration than the SimpleRNN!"
]
},
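{
"cell_type": "markdown",
"id": "added-gru-equations",
"metadata": {},
"source": [
"For reference, one standard way to write the GRU updates in the notation above (the gate-specific subscripts on $U$, $V$, and $\\beta$ just mark separate copies of those parameters; whether $Z_t$ or $1-Z_t$ multiplies the previous state varies between sources, so take this as one common formulation rather than the lab's canonical one):\n",
"\n",
"$$Z_t = \\sigma(U_z X_t + V_z h_{t-1} + \\beta_z)$$\n",
"$$R_t = \\sigma(U_r X_t + V_r h_{t-1} + \\beta_r)$$\n",
"$$\\tilde{h_t} = \\tanh(U_h X_t + V_h (R_t \\odot h_{t-1}) + \\beta_h)$$\n",
"$$h_t = (1 - Z_t) \\odot h_{t-1} + Z_t \\odot \\tilde{h_t}$$"
]
},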
{
"cell_type": "code",
"execution_count": 48,
"id": "organizational-chosen",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model: \"GRU\"\n",
"_________________________________________________________________\n",
"Layer (type) Output Shape Param # \n",
"=================================================================\n",
"embedding_6 (Embedding) (None, 500, 100) 1000000 \n",
"_________________________________________________________________\n",
"gru_1 (GRU) (None, 100) 60600 \n",
"_________________________________________________________________\n",
"dense_13 (Dense) (None, 1) 101 \n",
"=================================================================\n",
"Total params: 1,060,701\n",
"Trainable params: 1,060,701\n",
"Non-trainable params: 0\n",
"_________________________________________________________________\n",
"None\n",
"Epoch 1/3\n",
"391/391 [==============================] - 13s 30ms/step - loss: 0.5626 - accuracy: 0.6781\n",
"Epoch 2/3\n",
"391/391 [==============================] - 12s 30ms/step - loss: 0.2510 - accuracy: 0.9011\n",
"Epoch 3/3\n",
"391/391 [==============================] - 12s 30ms/step - loss: 0.1757 - accuracy: 0.9349\n",
"Accuracy: 88.02%\n"
]
}
],
"source": [
"model = Sequential(name='GRU')\n",
"model.add(Embedding(MAX_VOCAB, EMBED_DIM, input_length=MAX_LEN))\n",
"model.add(GRU(100))\n",
"model.add(Dense(1, activation='sigmoid'))\n",
"model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n",
"print(model.summary())\n",
"\n",
"model.fit(X_train, y_train, epochs=3, batch_size=64)\n",
"\n",
"scores = model.evaluate(X_test, y_test, verbose=0)\n",
"print(\"Accuracy: %.2f%%\" % (scores[1]*100))"
]
},
{
"cell_type": "markdown",
"id": "adopted-anchor",
"metadata": {},
"source": [
"### Bidirectional Layer <a id='bidir'></a>\n",
"\n",
"We may want our model to learn dependencies in either direction. A **Bidirectional RNN** consists of two separate recurrent units. One processes the sequence from left to right; the other processes that same sequence in reverse, from right to left. The outputs of the two units are then merged together (typically concatenated) and fed to the next layer of the network. \n",
"\n",
"Creating a Bidirectional RNN in Keras is quite simple. We just 'wrap' a recurrent layer in the [`Bidirectional`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Bidirectional) layer. The default behavior is to concatenate the output from each direction.\n",
"\n",
"```\n",
"tf.keras.layers.Bidirectional(\n",
" layer, merge_mode='concat', weights=None, backward_layer=None,\n",
" **kwargs\n",
")\n",
"```\n",
"\n",
"Example:\n",
"```\n",
"model = Sequential()\n",
"...\n",
"model.add(Bidirectional(SimpleRNN(n_nodes)))\n",
"...\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "valuable-canvas",
"metadata": {},
"source": [
"### Deep RNNs <a id='deep'></a>\n",
"\n",
"We may want to stack RNN layers one after another. But there is a problem. A recurrent layer expects to be given a sequence as input, and yet we can see that the recurrent layer in each of our models above outputs a single vector. This is because the default behavior of Keras's recurrent layers is to suppress the output until the final time step. If we want to have two recurrent units in a row then the first will have to give an output after each time step, thus providing a sequence to the 2nd recurrent layer.\n",
"\n",
"We can have our recurrent layers output at each time step by setting `return_sequences=True`. \n",
"Example:\n",
"```\n",
"model = Sequential()\n",
"...\n",
"model.add(LSTM(100, return_sequences=True))\n",
"model.add(LSTM(100))\n",
"...\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "assisted-pollution",
"metadata": {},
"source": [
"### TimeDistributed Layer <a id='timedis'></a>"
]
},
{
"cell_type": "markdown",
"id": "loved-camcorder",
"metadata": {},
"source": [
"[`TimeDistributed`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TimeDistributed) is a 'wrapper' that applies a layer to all time steps of an input sequence.\n",
"```\n",
"tf.keras.layers.TimeDistributed(\n",
" layer, **kwargs\n",
")\n",
"```\n",
"We use `TimeDistributed` when we want to input a sequence into a layer that doesn't normally expect a time dimension, such as `Dense`."
]
},
{
"cell_type": "code",
"execution_count": 146,
"id": "noted-database",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Shape of input : (1, 3, 5)\n",
"Shape of output : (1, 3, 8)\n"
]
}
],
"source": [
"model = Sequential()\n",
"model.add(TimeDistributed(Dense(8), input_shape=(3, 5)))\n",
"input_array = np.random.randint(10, size=(1,3,5))\n",
"print(\"Shape of input : \", input_array.shape)\n",
"\n",
"model.compile('rmsprop', 'mse')\n",
"output_array = model.predict(input_array)\n",
"print(\"Shape of output : \", output_array.shape)"
]
},
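{
"cell_type": "markdown",
"id": "added-many-to-many-sketch",
"metadata": {},
"source": [
"Combining `return_sequences=True` with `TimeDistributed` gives a many-to-many architecture: one output per time step. A minimal sketch (the layer sizes are illustrative):\n",
"```\n",
"model = Sequential()\n",
"model.add(Embedding(MAX_VOCAB, EMBED_DIM, input_length=MAX_LEN))\n",
"model.add(LSTM(100, return_sequences=True))                 # a hidden state at every time step\n",
"model.add(TimeDistributed(Dense(1, activation='sigmoid')))  # same Dense applied to each step\n",
"# output shape: (batch, 500, 1) -- one prediction per input token\n",
"```"
]
},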
{
"cell_type": "markdown",
"id": "touched-occasions",
"metadata": {},
"source": [
"### CNN + RNN <a id='cnnrnn'></a>"
]
},
{
"cell_type": "markdown",
"id": "brilliant-welding",
"metadata": {},
"source": [
"CNNs are good at learning spatial features, and sentences can be thought of as 1-D spatial vectors (dimensionality is determined by the number of words in the sentence). We can then take the features learned by the CNN (after a maxpooling layer) and feed them into an RNN! We expect the CNN to be able to pick out invariant features across the 1-D spatial structure (i.e., sentence) that characterize good and bad sentiment. These learned spatial features may then be treated as sequences by a recurrent layer. The classification step is then performed by a final dense layer."
]
},
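{
"cell_type": "markdown",
"id": "added-cnn-rnn-sketch",
"metadata": {},
"source": [
"A minimal sketch of such a CNN + RNN stack, using the layers imported earlier (the filter count, kernel size, and unit sizes are illustrative choices, not the lab's exact architecture):\n",
"```\n",
"model = Sequential(name='CNN_RNN')\n",
"model.add(Embedding(MAX_VOCAB, EMBED_DIM, input_length=MAX_LEN))\n",
"model.add(Conv1D(32, kernel_size=3, padding='same', activation='relu'))  # learn local n-gram features\n",
"model.add(MaxPool1D(pool_size=2))          # downsample the feature sequence\n",
"model.add(LSTM(100))                       # read the feature sequence in order\n",
"model.add(Dense(1, activation='sigmoid'))  # good/bad sentiment\n",
"model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n",
"```"
]
},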
{
"cell_type": "markdown",
"id": "chicken-verse",
"metadata": {},
"source": [
"