CS109B Data Science 2: Advanced Topics in Data Science

Lecture 9: NLP example with CNN

Harvard University
Spring 2019
Instructors: Pavlos Protopapas and Mark Glickman


In [1]:
#RUN THIS CELL 
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
Out[1]:

In this example, we will try to implement a CNN for text.

We will use the task of IMDB movie review classification. The response variable is whether a review is positive or negative.

A sentence can be thought of as a sequence of words which have semantic connections across time.

By semantic connection, we mean that the words that occur earlier in a sentence influence the structure and meaning of the later part of the sentence.

Note: There are also semantic connections running backwards in a sentence. We will revisit this idea in the next few lectures, when we run RNNs over a sentence in both directions and combine their outputs.

In [30]:
import numpy as np
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM, SimpleRNN
from keras.layers import Flatten
from keras.preprocessing import sequence
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D
from keras.layers.embeddings import Embedding
# fix random seed for reproducibility
np.random.seed(1)

SEEDING - An important thing to do in many machine learning tasks that involve stochastic sampling (where random numbers are generated to draw different samples) is seeding, so that the results are reproducible.

WHY SEEDING? Most random number generators on computers are pseudo-random number generators: they generate random numbers starting from a seed, but internally use a deterministic formula to compute the next number. Thus, if you fix the seed, the same sequence of random numbers is produced in every run.
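
A minimal NumPy illustration: after resetting the seed, the generator emits the identical sequence of numbers.

import numpy as np

np.random.seed(1)
first_draw = np.random.rand(3)    # three "random" numbers

np.random.seed(1)                 # reset to the same seed
second_draw = np.random.rand(3)   # the exact same three numbers again

print(np.allclose(first_draw, second_draw))   # True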

STEP 1 : Load and visualize the data

In [43]:
# We want a finite vocabulary: keep roughly the 10,000 most frequent words and map everything else to a single out-of-vocabulary token
vocabulary_size = 10000

# We also want reviews of finite length, so that we do not have to process really long sentences.
# Anything longer will be chopped! 
max_review_length = 500

For practical data science applications, we need to convert text into numeric tokens, since the machine works with numbers rather than English words the way humans do. Here is a small example of tokenization.

Assume we have 5 sentences. Once we create a dictionary, this is how we tokenize them into numbers.

  1. I have books - [4, 5, 1]
  2. Interesting books are useful - [6, 1, 2, 7]
  3. I have computers - [4, 5, 3]
  4. Computers are interesting and useful - [3, 2, 6, 8, 7]
  5. Books and computers are both valuable - [1, 8, 3, 2, 9, 10]

We create vocabulary tokens based on frequency of occurrence (the most frequent words get the smallest tokens). Hence, we assign the following tokens:

books-1, are-2, computers-3, I-4, have-5, interesting-6, useful-7, and-8, both-9, valuable-10

Thankfully, this is handled internally in our dataset, and each review is already represented in this tokenized form.
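
A minimal sketch of this frequency-based tokenization in plain Python (the sentences and dictionary are just the toy example above, not the actual IMDB preprocessing):

from collections import Counter

sentences = ["i have books",
             "interesting books are useful",
             "i have computers",
             "computers are interesting and useful",
             "books and computers are both valuable"]

# count word frequencies; the most frequent words get the smallest tokens
word_counts = Counter(word for s in sentences for word in s.split())
vocab = {word: idx + 1 for idx, (word, _) in enumerate(word_counts.most_common())}

tokenized = [[vocab[word] for word in s.split()] for s in sentences]
print(vocab)       # {'books': 1, 'are': 2, 'computers': 3, 'i': 4, ...}
print(tokenized)   # [[4, 5, 1], [6, 1, 2, 7], ...]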

In [32]:
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=vocabulary_size)
In [33]:
print('Number of reviews', len(X_train))
print('Length of first and fifth review before padding', len(X_train[0]) ,len(X_train[4]))
print('First review', X_train[0])
print('First label', y_train[0])
Number of reviews 25000
Length of first and fifth review before padding 218 147
First review [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
First label 1

Pad the sequences to ensure that all inputs have the same length and dimensions.

DISCUSSION : Why are we padding here?

In [34]:
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)
print('Length of first and fifth review after padding', len(X_train[0]) ,len(X_train[4]))
Length of first and fifth review after padding 500 500

Is accuracy the right metric to look at?

Discuss : In what cases is accuracy a good metric for evaluating classification models?

What other metrics are useful in case accuracy proves to be an inadequate metric for our dataset? https://towardsdatascience.com/understanding-data-science-classification-metrics-in-scikit-learn-in-python-3bc336865019

In [35]:
from collections import Counter
counts = dict(Counter(y_train))
print('Number of zeroes : ', counts[0], ' and Number of ones : ', counts[1])
Number of zeroes :  12500  and Number of ones :  12500
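
The IMDB training split is perfectly balanced, so accuracy is a reasonable first metric here. For imbalanced problems, here is a minimal scikit-learn sketch of some alternatives (the y_true and y_pred arrays below are made up purely for illustration; in practice you would threshold model.predict(X_test)):

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [0, 0, 1, 1, 1]   # hypothetical labels
y_pred = [0, 1, 1, 1, 0]   # hypothetical predictions
print(confusion_matrix(y_true, y_pred))
print(precision_score(y_true, y_pred))   # of the predicted positives, how many are correct
print(recall_score(y_true, y_pred))      # of the true positives, how many were found
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall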

MODEL 1(a) : FEEDFORWARD NETWORKS WITHOUT EMBEDDINGS

Let us build a feedforward net with a single hidden layer of 250 nodes.

GOAL : Calculate the number of parameters involved in this network and implement a feedforward net to do classification.

In [36]:
model = Sequential()

model.add(Dense(250, activation='relu',input_dim=max_review_length))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())


model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=3, batch_size=128, verbose=2)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_16 (Dense)             (None, 250)               125250    
_________________________________________________________________
dense_17 (Dense)             (None, 1)                 251       
=================================================================
Total params: 125,501
Trainable params: 125,501
Non-trainable params: 0
_________________________________________________________________
None
Train on 25000 samples, validate on 25000 samples
Epoch 1/3
 - 1s - loss: 8.0534 - acc: 0.4990 - val_loss: 7.9612 - val_acc: 0.5053
Epoch 2/3
 - 1s - loss: 8.0372 - acc: 0.5012 - val_loss: 8.0608 - val_acc: 0.4998
Epoch 3/3
 - 1s - loss: 8.0574 - acc: 0.5001 - val_loss: 8.0603 - val_acc: 0.4999
Accuracy: 49.99%
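
A quick check of the parameter count reported in the summary above (each Dense layer has inputs × units weights plus one bias per unit):

dense_1_params = max_review_length * 250 + 250   # 500*250 + 250 = 125,250
dense_2_params = 250 * 1 + 1                     # 251
print(dense_1_params + dense_2_params)           # 125,501, matching model.summary()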

Any idea why the performance is terrible?

Hint : Tokenization.

Obvious Workaround : One-Hot Encodings

EMBEDDINGS - Sparse to Dense Transformations

We use embeddings to reduce the dimensionality of our data: the tokens we assign based on word frequency are arbitrary discrete IDs with no continuous structure, so the network cannot learn much from them directly.

What are embeddings?

Embeddings are functional transformations from a sparse, discrete vector representation of text (either as tokens or as one-hot encodings) into a dense vector representation of a fixed size (usually of much lower dimension than the vocabulary size of the text). The dense representations allow the neural network to generalize better.

Here we train our own embedding while training our neural network. To transfer "knowledge" from other sources, in more complicated projects we can also use pre-trained embeddings such as word2vec, GloVe, fastText, etc. https://nlpforhackers.io/word-embeddings/
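
As a rough sketch of how a pre-trained embedding matrix could be plugged in (here embedding_matrix is a random stand-in for a real matrix that you would build from downloaded GloVe or word2vec vectors; it is not part of this notebook):

embedding_dim = 100                                                  # dimension of the pre-trained vectors
embedding_matrix = np.random.rand(vocabulary_size, embedding_dim)    # stand-in for real pre-trained vectors

pretrained_model = Sequential()
pretrained_model.add(Embedding(vocabulary_size, embedding_dim,
                               weights=[embedding_matrix],     # initialize with the pre-trained vectors
                               input_length=max_review_length,
                               trainable=False))                # freeze them during training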

Example Embeddings Transformation

Let us first understand how the Keras embedding layer works through a dummy example, to see how the dimensions are transformed. In this example, each input token is mapped to a 64-dimensional vector (via the embedding layer).

EXERCISE : Manually calculate the number of parameters needed in the embedding layer before executing the code.

In [37]:
model = Sequential()
#input - Number of categorical inputs, embedding dimension, input length.
model.add(Embedding(1000, 64, input_length=10))
print(model.summary())

# the model will take as input an integer matrix of size (batch, input_length).
# the largest integer (i.e. word index) in the input should be
# no larger than 999 (vocabulary size).
# now model.output_shape == (None, 10, 64), where None is the batch dimension.

input_array = np.random.randint(1000, size=(32, 10))
print("Shape of input : ", input_array.shape)
model.compile('rmsprop', 'mse')
output_array = model.predict(input_array)
assert output_array.shape == (32, 10, 64)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_8 (Embedding)      (None, 10, 64)            64000     
=================================================================
Total params: 64,000
Trainable params: 64,000
Non-trainable params: 0
_________________________________________________________________
None
Shape of input :  (32, 10)
In [44]:
print(input_array[0])
print(output_array[0].shape)
[905 644  12 451 965 181  64 621 703 214]
(10, 64)
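
A quick check using the model and arrays from the cells above: the embedding layer is literally a lookup table, so row i of its weight matrix is the vector returned for token i, and its parameter count is just the size of that table.

lookup_table = model.layers[0].get_weights()[0]   # shape (1000, 64)
print(lookup_table.shape)
print(np.allclose(output_array[0, 0], lookup_table[input_array[0, 0]]))   # True
print(1000 * 64)   # 64,000 parameters, matching model.summary()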

MODEL 1(b) : FEEDFORWARD NETWORKS WITH EMBEDDINGS

EXERCISE : Implement the feedforward net combining the embedding layer and the feedforward layer (one layer, 250 nodes) without looking at the cells below. Manually calculate the number of parameters needed in the feedforward network before executing the code.

In [39]:
embedding_dim = 100
In [40]:
model = Sequential()
model.add(Embedding(vocabulary_size, embedding_dim, input_length=max_review_length))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_9 (Embedding)      (None, 500, 100)          1000000   
_________________________________________________________________
flatten_4 (Flatten)          (None, 50000)             0         
_________________________________________________________________
dense_18 (Dense)             (None, 250)               12500250  
_________________________________________________________________
dense_19 (Dense)             (None, 1)                 251       
=================================================================
Total params: 13,500,501
Trainable params: 13,500,501
Non-trainable params: 0
_________________________________________________________________
None
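
Checking the parameter count in the summary above:

embedding_params = vocabulary_size * embedding_dim                  # 10,000 * 100 = 1,000,000
dense_1_params = (max_review_length * embedding_dim) * 250 + 250    # 50,000 inputs after flattening: 12,500,250
dense_2_params = 250 * 1 + 1                                        # 251
print(embedding_params + dense_1_params + dense_2_params)           # 13,500,501, matching model.summary()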
In [41]:
# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=2, batch_size=128, verbose=2)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
Train on 25000 samples, validate on 25000 samples
Epoch 1/2
 - 89s - loss: 0.5811 - acc: 0.6663 - val_loss: 0.3381 - val_acc: 0.8514
Epoch 2/2
 - 84s - loss: 0.2002 - acc: 0.9222 - val_loss: 0.3017 - val_acc: 0.8726
Accuracy: 87.26%

MODEL 2 : Convolutional Nets

Text can be thought of as a 1-dimensional sequence, and we can apply 1-D convolutions over a set of words. Let us walk through convolutions on text data with this blog post:

http://debajyotidatta.github.io/nlp/deep/learning/word-embeddings/2016/11/27/Understanding-Convolutions-In-Text/
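
As a toy illustration of what a single kernel_size=3 filter does, here is the sliding dot product in plain NumPy (2-dimensional embeddings, one filter, and no 'same' padding, so the feature map is slightly shorter than the input):

# a "sentence" of 5 words, each embedded in 2 dimensions
sentence = np.array([[0.1, 0.3],
                     [0.2, 0.1],
                     [0.4, 0.5],
                     [0.0, 0.2],
                     [0.3, 0.3]])

# one filter spanning 3 consecutive words, shape (kernel_size, embedding_dim) = (3, 2)
kernel = np.array([[ 1.0, -1.0],
                   [ 0.5,  0.5],
                   [-1.0,  1.0]])

# slide the filter over every window of 3 consecutive words and take an elementwise product-sum
feature_map = [np.sum(sentence[i:i + 3] * kernel) for i in range(len(sentence) - 2)]
print(feature_map)   # one number per window; Conv1D computes many such filters in parallel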

EXERCISE : Manually calculate the number of parameters needed in this convolutional network before executing the code.

In [20]:
# create the model
model = Sequential()
model.add(Embedding(vocabulary_size, embedding_dim, input_length=max_review_length))
model.add(Conv1D(filters=embedding_dim, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_4 (Embedding)      (None, 500, 100)          1000000   
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 500, 100)          30100     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 250, 100)          0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 25000)             0         
_________________________________________________________________
dense_11 (Dense)             (None, 250)               6250250   
_________________________________________________________________
dense_12 (Dense)             (None, 1)                 251       
=================================================================
Total params: 7,280,601
Trainable params: 7,280,601
Non-trainable params: 0
_________________________________________________________________
None
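
Checking the parameter count in the summary above (the Conv1D layer has kernel_size × input_channels weights per filter, plus one bias per filter):

embedding_params = vocabulary_size * embedding_dim                        # 1,000,000
conv_params = 3 * embedding_dim * embedding_dim + embedding_dim           # 3*100*100 + 100 = 30,100
dense_1_params = (250 * embedding_dim) * 250 + 250                        # pooling halves 500 -> 250 positions: 6,250,250
dense_2_params = 250 * 1 + 1                                              # 251
print(embedding_params + conv_params + dense_1_params + dense_2_params)   # 7,280,601, matching model.summary()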
In [21]:
# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=2, batch_size=128, verbose=2)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
Train on 25000 samples, validate on 25000 samples
Epoch 1/2
 - 314s - loss: 0.4952 - acc: 0.7055 - val_loss: 0.2996 - val_acc: 0.8768
Epoch 2/2
 - 147s - loss: 0.2043 - acc: 0.9216 - val_loss: 0.2849 - val_acc: 0.8819
Accuracy: 88.19%

EXERCISE

Try other CNNs with

  1. Different kernel sizes
  2. Different pooling operations (AveragePooling1D); see the sketch after the discussion question below

DISCUSSION : What do max and average pooling mean in terms of processing text sequences?
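
One possible variant to get you started (kernel_size=5 and average pooling instead of max pooling; everything else as in Model 2 above, and it is not trained here):

from keras.layers import AveragePooling1D

variant = Sequential()
variant.add(Embedding(vocabulary_size, embedding_dim, input_length=max_review_length))
variant.add(Conv1D(filters=embedding_dim, kernel_size=5, padding='same', activation='relu'))
variant.add(AveragePooling1D(pool_size=2))   # average each window of 2 positions instead of taking the max
variant.add(Flatten())
variant.add(Dense(250, activation='relu'))
variant.add(Dense(1, activation='sigmoid'))
variant.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(variant.summary())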