Key Word(s): GRU, LSTM



Title:

LSTM vs. GRU

Description:

The goal of this exercise is to compare the performance of two popular gated recurrent architectures, i.e., LSTMs and GRUs.

Instructions:

  • Read the IMDB dataset from the helper code given.
  • Take a quick look at your training inputs and labels.
  • Pad the sequences to a fixed length max_words so that all inputs are the same size.
  • Build, compile and fit a GRU model.
  • Evaluate the model's performance on the test set and report the test set accuracy.
  • Build, compile and fit another model, this time using the LSTM architecture instead.
  • Evaluate the LSTM model's performance on the test set and report the test set accuracy.
  • Compare the performance of the two models.

Hints:

tf.keras.layers.Embedding() Turns positive integers (indexes) into dense vectors of fixed size.

tf.keras.layers.LSTM() Long Short-Term Memory layer - Hochreiter & Schmidhuber, 1997.

tf.keras.layers.Dense() Just your regular densely-connected NN layer.

LSTM vs. GRU

We will use both GRU and LSTM to perform sentiment analysis in tensorflow.keras and compare their performance using the custom IMDB dataset.

In [1]:
# Import necessary libraries
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import backend as K
from tensorflow.keras.layers import RNN
from tensorflow.keras.datasets import imdb
from tensorflow.keras.models import Model,Sequential
from tensorflow.keras.layers import Input,Dense,LSTM,GRU,Embedding
from tensorflow.keras.preprocessing import sequence
from prettytable import PrettyTable
import pickle
In [2]:
# We use the same dataset as the previous exercise 
with open('imdb_mini.pkl','rb') as f:
    X_train, y_train, X_test, y_test = pickle.load(f)
In [3]:
# Similar to the previous exercise, we will pre-process our review sequences
# We fix the vocabulary size to 5000 because our custom 
# dataset was curated with that
vocabulary_size = 5000
# Max word length for each review will be 200
max_words = 200
# We set the embedding size to 32
embedding_size = 32
# Pre-padding sequences to max_words length
X_train = sequence.pad_sequences(X_train, maxlen=max_words, padding='pre')
X_test = sequence.pad_sequences(X_test, maxlen=max_words, padding='pre')
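
Per the instructions above, a quick look at the training inputs and labels (an optional sanity check; the exact shapes depend on the pickled mini dataset):

# Optional sanity check: inspect the padded inputs and labels
print(np.shape(X_train), np.shape(y_train))  # e.g. (n_reviews, 200) after padding
print(X_train[0][:10])                       # first 10 token ids of the first (pre-padded) review
print(y_train[0])                            # binary sentiment label: 0 = negative, 1 = positive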
In [4]:
# We create the mapping between words and integer ids
word2id = imdb.get_word_index()
# We need to adjust the mapping by 3 because of tensorflow.keras preprocessing
# more here: https://stackoverflow.com/questions/42821330/restore-original-text-from-keras-s-imdb-dataset
word2id = {k: (v + 3) for k, v in word2id.items()}
word2id["<PAD>"] = 0
word2id["<START>"] = 1
word2id["<UNK>"] = 2
word2id["<UNUSED>"] = 3

# Reversing the key,value pair will give the id2word
id2word = {i: word for word, i in word2id.items()}
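
With id2word in hand, we can decode a padded review back to text to verify the mapping (a minimal sketch; the recovered text depends on the pickled dataset, and id 0 is the <PAD> token, which we skip):

# Decode the first training review, skipping <PAD> tokens (id 0)
decoded = ' '.join(id2word.get(i, '<UNK>') for i in X_train[0] if i != 0)
print(decoded[:200])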

⏸ For this problem with embedding_size=32 ($X_t$) and hidden_size=100 ($H_{t-1}$), how many trainable weights are associated with the GRU Cell (assuming use_bias=True)?

A. 39600

B. 39800

C. 40200

D. 40400

In [5]:
### edTest(test_chow1) ###
# Submit an answer choice as a string below (e.g. if you choose option A, put 'A')
answer1 = 'C'
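
A worked check of the arithmetic: a GRU has 3 gate blocks (update, reset, candidate), each with an input kernel, a recurrent kernel and, under TF2's default reset_after=True implementation, two bias vectors (with reset_after=False the count would be 39900 instead):

# 3 gates, each with an input kernel (e x h), a recurrent kernel (h x h)
# and, with reset_after=True, two bias vectors (2h)
e, h = 32, 100
print(3 * (e * h + h * h + 2 * h))  # 40200 -> option C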
In [6]:
# Comparing with GRU model
embedding_size=32
hidden_size = 100
gru_model=Sequential()
# Add Embedding, GRU and a Dense layer 
# Add Embedding layer with vocabulary_size, embedding_size and input_length
# Add GRU with hidden_size
# Add Dense layer with 1 unit and sigmoid activation
gru_model.add(Embedding(vocabulary_size, embedding_size, input_length=max_words))
gru_model.add(GRU(hidden_size))
gru_model.add(Dense(1, activation='sigmoid'))

gru_model.compile(loss='binary_crossentropy', optimizer='Adam', metrics=['accuracy'])
In [7]:
gru_model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 200, 32)           160000    
_________________________________________________________________
gru (GRU)                    (None, 100)               40200     
_________________________________________________________________
dense (Dense)                (None, 1)                 101       
=================================================================
Total params: 200,301
Trainable params: 200,301
Non-trainable params: 0
_________________________________________________________________
In [8]:
### edTest(test_chow2) ###
gru_cnt_params = gru_model.count_params()
In [9]:
batch_size = 256
num_epochs = 3
gru_model.fit(X_train, y_train, batch_size=batch_size, epochs=num_epochs)
gru_score = gru_model.evaluate(X_test,y_test)
print(f'Model accuracy on the test set is {gru_score[1]:.2f}')
Epoch 1/3
40/40 [==============================] - 31s 711ms/step - loss: 0.6919 - accuracy: 0.5424
Epoch 2/3
40/40 [==============================] - 35s 871ms/step - loss: 0.6306 - accuracy: 0.5924
Epoch 3/3
40/40 [==============================] - 46s 1s/step - loss: 0.5879 - accuracy: 0.7208
157/157 [==============================] - 20s 120ms/step - loss: 0.5664 - accuracy: 0.7166
Model accuracy on the test set is 0.72

⏸ For this problem with embedding_size=32 ($X_t$) and hidden_size=100 ($H_{t-1}$), how many trainable weights are associated with the LSTM Cell (assuming use_bias=True)?

A. 52800

B. 53200

C. 54200

D. 51400

In [10]:
### edTest(test_chow3) ###
# Submit an answer choice as a string below (e.g. if you choose option A, put 'A')
answer2 = 'B'
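
The count follows from the four LSTM gate blocks (input, forget, cell candidate, output), each with an input kernel, a recurrent kernel and one bias vector; a quick check:

# 4 gates, each with an input kernel (e x h), a recurrent kernel (h x h)
# and one bias vector (h)
e, h = 32, 100
print(4 * (e * h + h * h + h))  # 53200 -> option B, matching the summary below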
In [11]:
# Comparing with LSTM model
embedding_size=32
hidden_size = 100

lstm_model=Sequential()

# Add Embedding, LSTM and a Dense layer 
# Add Embedding layer with vocabulary_size, embedding_size and input_length
# Add LSTM with hidden_size
# Add Dense layer with 1 unit and sigmoid activation
lstm_model.add(Embedding(vocabulary_size, embedding_size, input_length=max_words))
lstm_model.add(LSTM(hidden_size))
lstm_model.add(Dense(1, activation='sigmoid'))

lstm_model.compile(loss='binary_crossentropy', optimizer='Adam', metrics=['accuracy'])
In [12]:
lstm_model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 200, 32)           160000    
_________________________________________________________________
lstm (LSTM)                  (None, 100)               53200     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
=================================================================
Total params: 213,301
Trainable params: 213,301
Non-trainable params: 0
_________________________________________________________________
In [13]:
### edTest(test_chow4) ###
lstm_cnt_params = lstm_model.count_params()
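
Since the LSTM uses four gate blocks to the GRU's three, it carries more weights for the same hidden size; we can compare the totals directly:

# LSTM has one more gate block than the GRU, hence more parameters
print(f'GRU total params: {gru_cnt_params}, LSTM total params: {lstm_cnt_params}')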
In [14]:
batch_size = 256
num_epochs = 3
lstm_model.fit(X_train, y_train, batch_size=batch_size, epochs=num_epochs)
lstm_score = lstm_model.evaluate(X_test,y_test)
print(f'Model accuracy on the test set is {lstm_score[1]:.2f}')
Epoch 1/3
40/40 [==============================] - 45s 1s/step - loss: 0.6872 - accuracy: 0.5577
Epoch 2/3
40/40 [==============================] - 38s 956ms/step - loss: 0.5472 - accuracy: 0.7252
Epoch 3/3
40/40 [==============================] - 30s 746ms/step - loss: 0.3207 - accuracy: 0.8693
157/157 [==============================] - 9s 56ms/step - loss: 0.3433 - accuracy: 0.8530
Model accuracy on the test set is 0.85
In [15]:
# Finally, we compare the results from the two implementations above

pt = PrettyTable()
pt.field_names = ["Strategy","Test set accuracy"]
pt.add_row(["GRU RNN",f'{gru_score[1]*100:.2f}%'])
pt.add_row(["LSTM RNN",f'{lstm_score[1]*100:.2f}%'])
print(pt)
+----------+-------------------+
| Strategy | Test set accuracy |
+----------+-------------------+
| GRU RNN  |       71.66%      |
| LSTM RNN |       85.30%      |
+----------+-------------------+

🍲 Which variant is better, LSTM or GRU?

Both LSTMs and GRUs mitigate the vanishing gradient problem of vanilla RNNs, but each has its own advantages and disadvantages (read this paper for a thorough analysis of the two methods). Based on your understanding, which architecture is more appropriate for the current analysis?

In [17]:
### edTest(test_chow5) ###
# Type your answer within the quotes given
answer3 = 'LSTM'