Key Word(s): RNN, LSTM, GRU, Keras


CS-109B Advanced Data Science

Lab 9: Recurrent Neural Networks (Part II)

Harvard University
Fall 2020
Instructors: Mark Glickman, Pavlos Protopapas, and Chris Tanner
Lab Instructors: Chris Tanner and Eleni Angelaki Kaxiras
Content: Srivatsan Srinivasan, Pavlos Protopapas, Chris Tanner

In [1]:
# RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2019-CS109B/master/content/styles/cs109.css").text
HTML(styles)
Out[1]:

Learning Goals

In this lab, we will continue where we left off in Lab 8. By the end of this lab, you should:

  • feel comfortable modeling sequences in Keras via RNNs and their variants (GRUs, LSTMs)
  • have a good understanding of how sequences -- any data with some temporal semantics (e.g., time series, natural language, images, etc.) -- fit into and benefit from a recurrent architecture
  • ask any other NLP questions that you're curious about. This lab is the closest one we'll ever have to being an Office Hour.

Seq2Seq Model: 231+432 = 665... It's not? Let's ask our LSTM

In this exercise, we are going to teach addition to our model. Given two numbers of at most three digits each, the model outputs their sum (at most four digits). The input is provided as a string such as '231+432', and the model produces its output as a string such as '663 ' (here the trailing space is the padding character). We are not going to use any external dataset; instead, we construct our own dataset for this exercise.

The exercise effectively "translates" a sequence of characters '231+432' into another sequence of characters '663 ', and hence this class of models is called sequence-to-sequence (aka seq2seq) models. Such architectures have profound applications in several real-life tasks such as machine translation, summarization, and image captioning.

To be clear, sequence-to-sequence (seq2seq) models take as input a sequence of length N and return a sequence of length M, where N and M may differ and may vary from one example to the next. For example, machine translation concerns converting text from one natural language to another (e.g., translating English to French). Google Translate is an example, and their system is a seq2seq model. The input (e.g., an English sentence) can be of any length, and the output (e.g., a French sentence) may be of any length.

Background knowledge: The earliest and simplest seq2seq model works by having one RNN for the input, just as we've always done, which we refer to as the "encoder." The final hidden state of the encoder RNN is fed as input to another RNN, which we refer to as the "decoder." The job of the decoder is to generate the output one token at a time. This may seem quite limiting, as it relies on the encoder condensing the entire input sequence into a single hidden state vector; it seems unrealistic that the entire meaning of a sentence could be captured in one fixed-size vector. Yet even this simplistic approach can yield quite impressive results. In fact, these early results were compelling enough that such models quickly displaced decades of earlier machine translation work.

In [5]:
from __future__ import print_function
from keras.models import Sequential
from keras import layers
from keras.layers import Dense, RepeatVector, TimeDistributed
import numpy as np
from six.moves import range

Data preprocessing and handling

In [6]:
class CharacterTable(object):
    def __init__(self, chars):        
        self.chars = sorted(set(chars))
        self.char_indices = dict((c, i) for i, c in enumerate(self.chars))
        self.indices_char = dict((i, c) for i, c in enumerate(self.chars))

    # converts a String of characters into a one-hot embedding/vector
    def encode(self, C, num_rows):        
        x = np.zeros((num_rows, len(self.chars)))
        for i, c in enumerate(C):
            x[i, self.char_indices[c]] = 1
        return x
    
    # converts a one-hot embedding/vector into a String of characters
    def decode(self, x, calc_argmax=True):        
        if calc_argmax:
            x = x.argmax(axis=-1)
        return ''.join(self.indices_char[i] for i in x)
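
For intuition, here is a small sketch (using a hypothetical table built from the same character set we define just below) of a round trip through CharacterTable:

# Sketch: one-hot encode a short string and decode it back.
ct = CharacterTable('0123456789+ ')       # 12 characters: the digits, '+', and the space used for padding
onehot = ct.encode('12+3 ', num_rows=5)   # shape (5, 12): one row per character
print(onehot.shape)                       # (5, 12)
print(ct.decode(onehot))                  # '12+3 '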
In [7]:
TRAINING_SIZE = 50000
DIGITS = 3
MAXOUTPUTLEN = DIGITS + 1
MAXLEN = DIGITS + 1 + DIGITS

chars = '0123456789+ '
ctable = CharacterTable(chars)
In [8]:
def return_random_digit():
  return np.random.choice(list('0123456789'))  

# generate a new random number with up to `DIGITS` digits
def generate_number():
  num_digits = np.random.randint(1, DIGITS + 1)  
  return int(''.join( return_random_digit()
                      for i in range(num_digits)))

# generate `num_examples` pairs of random numbers
def data_generate(num_examples):
  questions = []
  answers = []
  seen = set()
  print('Generating data...')
  while len(questions) < num_examples:
      a, b = generate_number(), generate_number()
        
      # don't allow duplicates; this is good practice for training,
      # as we will minimize memorizing seen examples
      key = tuple(sorted((a, b)))
      if key in seen:
          continue
      seen.add(key)
    
      # pad the data with spaces so that the length is always MAXLEN.
      q = '{}+{}'.format(a, b)
      query = q + ' ' * (MAXLEN - len(q))
      ans = str(a + b)
    
      # answers can be of maximum size DIGITS + 1.
      ans += ' ' * (MAXOUTPUTLEN - len(ans))
      questions.append(query)
      answers.append(ans)
  print('Total addition questions:', len(questions))
  return questions, answers

def encode_examples(questions, answers):
  x = np.zeros((len(questions), MAXLEN, len(chars)), dtype=bool)
  y = np.zeros((len(questions), MAXOUTPUTLEN, len(chars)), dtype=bool)
  for i, sentence in enumerate(questions):
      x[i] = ctable.encode(sentence, MAXLEN)
  for i, sentence in enumerate(answers):
      y[i] = ctable.encode(sentence, MAXOUTPUTLEN)

  indices = np.arange(len(y))
  np.random.shuffle(indices)
  return x[indices],y[indices]
In [9]:
q, a = data_generate(TRAINING_SIZE)
x, y = encode_examples(q,a)

# divides our data into training and validation
split_at = len(x) - len(x) // 10
x_train, x_val, y_train, y_val = x[:split_at], x[split_at:],y[:split_at],y[split_at:]

print('Training Data shape:')
print('X : ', x_train.shape)
print('Y : ', y_train.shape)

print('Sample Question(in encoded form) : ', x_train[0], y_train[0])
print('Sample Question(in decoded form) : ', ctable.decode(x_train[0]),'Sample Output : ', ctable.decode(y_train[0]))
Generating data...
Total addition questions: 50000
Training Data shape:
X :  (45000, 7, 12)
Y :  (45000, 4, 12)
Sample Question(in encoded form) :  [[False False False False False False False False False False  True False]
 [False False False  True False False False False False False False False]
 [False  True False False False False False False False False False False]
 [False False False False False False False False  True False False False]
 [False False False False False  True False False False False False False]
 [False False False False False False False False False False False  True]
 [ True False False False False False False False False False False False]] [[False False False False False False False False False  True False False]
 [False False False False  True False False False False False False False]
 [False False  True False False False False False False False False False]
 [ True False False False False False False False False False False False]]
Sample Question(in decoded form) :  81+639  Sample Output :  720 

Let's learn two useful Keras layers, TimeDistributed and RepeatVector, with some dummy examples.

TimeDistributed is a wrapper that applies a given operation (e.g., a NN layer) independently to every time step of an input. Read the documentation here. For instance, suppose each time step of your input is a 5-dimensional vector, and instead of simply outputting a single value per time step you want to emit an 8-dimensional output for each. We can do this simply by wrapping a fully-connected (Dense) layer in TimeDistributed, so that the same layer is applied to each of the 3 input time steps and emits 8 outputs for each. Below, we illustrate this:

In [10]:
# Inputs will be a tensor of size: batch_size * time_steps * input_vector_dim(to Dense)
# Output will be a tensor of size: batch_size * time_steps * output_vector_dim(i.e., 8)

model = Sequential()

# Here, Dense() converts a 5-dim input vector to an 8-dim vector.
model.add(TimeDistributed(Dense(8), input_shape=(3, 5)))
input_array = np.random.randint(10, size=(1,3,5))
print("Shape of input : ", input_array.shape)

model.compile('rmsprop', 'mse')
output_array = model.predict(input_array)
print("Shape of output : ", output_array.shape)
Shape of input :  (1, 3, 5)
Shape of output :  (1, 3, 8)

RepeatVector repeats a vector a specified number of times. The dimension changes from (batch_size, number of elements) to (batch_size, number of repetitions, number of elements).

In [12]:
model = Sequential()
# converts tensor from size of 1*10 to 1*6
model.add(Dense(6, input_dim=10))
print(model.output_shape)

# converts tensor from size of 1*6 to size of 1*3*6
model.add(RepeatVector(3))
print(model.output_shape) 

input_array = np.random.randint(1000, size=(1, 10))
print("Shape of input : ", input_array.shape)

model.compile('rmsprop', 'mse')
output_array = model.predict(input_array)

print("Shape of output : ", output_array.shape)
# note: `None` is the batch dimension
print('Input : ', input_array[0])
print('Output : ', output_array[0])
(None, 6)
(None, 3, 6)
Shape of input :  (1, 10)
Shape of output :  (1, 3, 6)
Input :  [522 685  47 661 971 923 871 316 646 799]
Output :  [[1105.8765  1320.0476  -234.5448   -67.07581 -165.83423  238.93486]
 [1105.8765  1320.0476  -234.5448   -67.07581 -165.83423  238.93486]
 [1105.8765  1320.0476  -234.5448   -67.07581 -165.83423  238.93486]]

MODEL ARCHITECTURE

Note: Whenever you initialize an LSTM in Keras, the option return_sequences = False is set by default. This means that the next component only gets to see the LSTM's final hidden state. On the other hand, if you set return_sequences = True, the LSTM returns its hidden state at every time step, which means the next component must be able to consume inputs in that form.

Think about how this is relevant to this model's architecture and to the TimeDistributed wrapper we just learned.
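
As a quick illustration (a minimal sketch with placeholder sizes, separate from the model we build below), compare an LSTM's output shapes with and without return_sequences:

# Sketch: an LSTM's output shape with and without return_sequences.
demo = Sequential()
demo.add(layers.LSTM(16, input_shape=(7, 12)))    # return_sequences=False (default)
print(demo.output_shape)                          # (None, 16): only the final hidden state

demo = Sequential()
demo.add(layers.LSTM(16, input_shape=(7, 12), return_sequences=True))
print(demo.output_shape)                          # (None, 7, 16): hidden state at every time step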

Build an encoder and a decoder, each a single layer with 128 hidden units, plus an appropriate dense layer as required by the model.

In [37]:
# Hyperparameters
RNN = layers.LSTM
HIDDEN_SIZE = 128
BATCH_SIZE = 128
LAYERS = 1

print('Build model...')
model = Sequential()

# ENCODING
model.add(RNN(HIDDEN_SIZE, input_shape=(MAXLEN, len(chars))))
model.add(RepeatVector(MAXOUTPUTLEN))

# DECODING
for _ in range(LAYERS):
    # return the hidden state at each time step
    model.add(RNN(HIDDEN_SIZE, return_sequences=True)) 

model.add(TimeDistributed(layers.Dense(len(chars), activation='softmax')))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.summary()
Build model...
Model: "sequential_17"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm_5 (LSTM)                (None, 128)               72192     
_________________________________________________________________
repeat_vector_2 (RepeatVecto (None, 4, 128)            0         
_________________________________________________________________
lstm_6 (LSTM)                (None, 4, 128)            131584    
_________________________________________________________________
time_distributed_3 (TimeDist (None, 4, 12)             1548      
=================================================================
Total params: 205,324
Trainable params: 205,324
Non-trainable params: 0
_________________________________________________________________

Let's train the model and check how well it learns.

In [38]:
for iteration in range(1, 2):
    print()  
    model.fit(x_train, y_train,
              batch_size=BATCH_SIZE,
              epochs=20,
              validation_data=(x_val, y_val))
    # Select 20 samples from the validation set at random so
    # we can visualize errors.
    print('Finished iteration ', iteration)
    numcorrect = 0
    numtotal = 20
    
    for i in range(numtotal):
        ind = np.random.randint(0, len(x_val))
        rowx, rowy = x_val[np.array([ind])], y_val[np.array([ind])]
        preds = model.predict_classes(rowx, verbose=0)
        q = ctable.decode(rowx[0])
        correct = ctable.decode(rowy[0])
        guess = ctable.decode(preds[0], calc_argmax=False)
        print('Question', q, end=' ')
        print('True', correct, end=' ')
        print('Guess', guess, end=' ')
        if guess == correct :
          print('Good job')
          numcorrect += 1
        else:
          print('Fail')
    print('The model scored ', numcorrect*100/numtotal,' % in its test.')
Train on 45000 samples, validate on 5000 samples
Epoch 1/20
45000/45000 [==============================] - 10s 213us/step - loss: 1.8913 - accuracy: 0.3215 - val_loss: 1.8148 - val_accuracy: 0.3392
Epoch 2/20
45000/45000 [==============================] - 9s 194us/step - loss: 1.7547 - accuracy: 0.3544 - val_loss: 1.6947 - val_accuracy: 0.3761
Epoch 3/20
45000/45000 [==============================] - 8s 188us/step - loss: 1.6045 - accuracy: 0.3990 - val_loss: 1.5127 - val_accuracy: 0.4290
Epoch 4/20
45000/45000 [==============================] - 10s 223us/step - loss: 1.4356 - accuracy: 0.4599 - val_loss: 1.3487 - val_accuracy: 0.4931
Epoch 5/20
45000/45000 [==============================] - 11s 239us/step - loss: 1.2919 - accuracy: 0.5167 - val_loss: 1.2355 - val_accuracy: 0.5404
Epoch 6/20
45000/45000 [==============================] - 9s 196us/step - loss: 1.1851 - accuracy: 0.5556 - val_loss: 1.1403 - val_accuracy: 0.5719
Epoch 7/20
45000/45000 [==============================] - 9s 191us/step - loss: 1.0988 - accuracy: 0.5872 - val_loss: 1.0659 - val_accuracy: 0.5950
Epoch 8/20
45000/45000 [==============================] - 9s 194us/step - loss: 1.0235 - accuracy: 0.6134 - val_loss: 0.9967 - val_accuracy: 0.6226
Epoch 9/20
45000/45000 [==============================] - 9s 198us/step - loss: 0.9560 - accuracy: 0.6388 - val_loss: 0.9391 - val_accuracy: 0.6388
Epoch 10/20
45000/45000 [==============================] - 9s 207us/step - loss: 0.8848 - accuracy: 0.6645 - val_loss: 0.8588 - val_accuracy: 0.6694
Epoch 11/20
45000/45000 [==============================] - 10s 217us/step - loss: 0.7927 - accuracy: 0.6988 - val_loss: 0.7477 - val_accuracy: 0.7097
Epoch 12/20
45000/45000 [==============================] - 9s 198us/step - loss: 0.6750 - accuracy: 0.7465 - val_loss: 0.6294 - val_accuracy: 0.7650
Epoch 13/20
45000/45000 [==============================] - 10s 223us/step - loss: 0.5542 - accuracy: 0.8015 - val_loss: 0.5125 - val_accuracy: 0.8166
Epoch 14/20
45000/45000 [==============================] - 10s 217us/step - loss: 0.4452 - accuracy: 0.8534 - val_loss: 0.4086 - val_accuracy: 0.8692
Epoch 15/20
45000/45000 [==============================] - 9s 197us/step - loss: 0.3586 - accuracy: 0.8935 - val_loss: 0.3416 - val_accuracy: 0.8946
Epoch 16/20
45000/45000 [==============================] - 9s 200us/step - loss: 0.2982 - accuracy: 0.9166 - val_loss: 0.2824 - val_accuracy: 0.9196
Epoch 17/20
45000/45000 [==============================] - 9s 199us/step - loss: 0.2379 - accuracy: 0.9403 - val_loss: 0.2302 - val_accuracy: 0.9381
Epoch 18/20
45000/45000 [==============================] - 9s 190us/step - loss: 0.1988 - accuracy: 0.9516 - val_loss: 0.1899 - val_accuracy: 0.9512
Epoch 19/20
45000/45000 [==============================] - 9s 203us/step - loss: 0.1605 - accuracy: 0.9644 - val_loss: 0.1664 - val_accuracy: 0.9562
Epoch 20/20
45000/45000 [==============================] - 9s 195us/step - loss: 0.1354 - accuracy: 0.9705 - val_loss: 0.1462 - val_accuracy: 0.9622
Finished iteration  1
Question 551+581 True 1132 Guess 1132 Good job
Question 332+746 True 1078 Guess 1078 Good job
Question 736+654 True 1390 Guess 1490 Fail
Question 77+705  True 782  Guess 782  Good job
Question 7+266   True 273  Guess 273  Good job
Question 18+7    True 25   Guess 25   Good job
Question 123+666 True 789  Guess 889  Fail
Question 886+48  True 934  Guess 934  Good job
Question 6+862   True 868  Guess 868  Good job
Question 492+378 True 870  Guess 870  Good job
Question 59+30   True 89   Guess 89   Good job
Question 542+95  True 637  Guess 637  Good job
Question 58+738  True 796  Guess 796  Good job
Question 29+62   True 91   Guess 91   Good job
Question 217+21  True 238  Guess 238  Good job
Question 140+811 True 951  Guess 951  Good job
Question 601+814 True 1415 Guess 1425 Fail
Question 815+337 True 1152 Guess 1152 Good job
Question 578+52  True 630  Guess 630  Good job
Question 3+424   True 427  Guess 427  Good job
The model scored  85.0  % in its test.

EXERCISE

  • Try changing the hyperparameters: use other RNN variants, add more layers, and check whether increasing the number of epochs helps.

  • Try reversing the operands in the validation set and check whether the model has learned the commutative property of addition (see the sketch after this list).

  • Try printing the hidden layer for two commutative inputs (e.g., 'a+b' and 'b+a') and check whether the learned hidden representations are the same or similar. Do we expect this to be true? If so, why? If not, why not? You can access a layer by index with model.layers, and layer.output gives the output of that layer.

  • Try doing addition in the RNN model the same way we do it by hand: reverse the order of the digits and, at each time step, feed in one pair of digits, get an output, carry the hidden state forward, then feed in the next pair, and so on (units in the first time step, tens in the second, etc.).
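
For the commutativity exercise above, a minimal sketch of one way to start (reusing the variables defined earlier in this lab) might look like this:

# Sketch: swap the operands of a few validation questions, re-encode them,
# and compare the model's predictions on the original and swapped inputs.
for _ in range(5):
    ind = np.random.randint(0, len(x_val))
    q = ctable.decode(x_val[ind]).strip()            # e.g., '81+639'
    a_str, b_str = q.split('+')
    q_swapped = '{}+{}'.format(b_str, a_str)
    q_swapped += ' ' * (MAXLEN - len(q_swapped))     # pad to MAXLEN, as in data_generate
    x_swapped = ctable.encode(q_swapped, MAXLEN)[np.newaxis, ...]
    pred_orig = ctable.decode(model.predict(x_val[ind:ind + 1])[0])
    pred_swap = ctable.decode(model.predict(x_swapped)[0])
    print(q, '->', pred_orig, '|', q_swapped.strip(), '->', pred_swap)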

Extra HW tidbits:

pad_sequences()

When working with sequences, it is usually important to ensure that our input sequences are of the same length -- especially when reading inputs in batches. A simple and common approach is to set the input length to that of our longest input. All inputs that are shorter than this are padded with a particular value of your choice, and any inputs that happen to be longer than the specified length are truncated to that length. Click here for the Keras documentation.
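
For example (a small sketch with arbitrary toy sequences), pad_sequences pads shorter sequences and truncates longer ones to a common length:

from keras.preprocessing.sequence import pad_sequences

seqs = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
# Pad/truncate every sequence to length 4: shorter sequences are pre-padded
# with 0, and longer sequences are truncated from the front.
padded = pad_sequences(seqs, maxlen=4, padding='pre', truncating='pre', value=0)
print(padded)
# [[ 0  1  2  3]
#  [ 0  0  4  5]
#  [ 7  8  9 10]]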

Embedding layers

Embedding layers, when used in Keras, must be the first layer of your network. They transform vectors from one space into another, and the learned embedding is typically of a smaller size than the original. The power of this layer is that the embeddings themselves are learned by the network during training.
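
For example (a small sketch with arbitrary sizes), an Embedding layer maps integer ids to learned dense vectors:

from keras.layers import Embedding

emb_model = Sequential()
# Maps integer ids in [0, 1000) to learned 64-dimensional vectors;
# the input is a batch of integer sequences of length 10.
emb_model.add(Embedding(input_dim=1000, output_dim=64, input_length=10))
emb_model.compile('rmsprop', 'mse')

ids = np.random.randint(1000, size=(2, 10))
print(emb_model.predict(ids).shape)   # (2, 10, 64)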

In [ ]: