Key Word(s): NLP, Embedding, tokenize, CNN, Convolutional Neural Network
CS109B Data Science 2: Advanced Topics in Data Science
Lecture 9: NLP example with CNN¶
Harvard University
Spring 2019
Instructors: Pavlos Protopapas and Mark Glickman
#RUN THIS CELL
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
In this example, we will implement a CNN for text.
We will use the task of IMDB movie review classification, where the response variable is whether a review is positive or negative.
A sentence can be thought of as a sequence of words which have semantic connections across time.
By semantic connection, we mean that words occurring earlier in a sentence influence the structure and meaning of its later parts.
Note: There are also semantic connections running backwards through a sentence (we will revisit this idea in the next few lectures, when we run RNNs over the sentence in both directions and combine their outputs).
import numpy as np
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM, SimpleRNN
from keras.layers import Flatten
from keras.preprocessing import sequence
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D
from keras.layers.embeddings import Embedding
# fix random seed for reproducibility
np.random.seed(1)
SEEDING - In machine learning tasks that involve stochastic sampling (where random numbers are drawn at various stages), it is important to set a random seed so that the results are reproducible.
WHY SEEDING ? Most random number generators on computers are pseudo-random number generators: starting from a seed, they use a deterministic formula to compute each successive "random" number. If you fix the seed, the sequence of random numbers produced is identical in every run.
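A quick toy illustration (the same seed always reproduces the same draws; we restore the notebook's seed afterwards):
# Toy illustration: fixing the seed makes the generated "random" numbers repeatable.
np.random.seed(42)
first_run = np.random.rand(3)
np.random.seed(42)
second_run = np.random.rand(3)
print(first_run)
print(second_run)                            # identical to first_run
print(np.allclose(first_run, second_run))    # True
np.random.seed(1)                            # restore the seed used in this notebook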
STEP 1 : Load and visualize the data¶
# We want a finite vocabulary: keep only the most frequent words
# and map everything else to a single out-of-vocabulary token.
vocabulary_size = 10000
# We also cap the length of each review so we do not have to process very long sequences.
# Anything longer will be chopped!
max_review_length = 500
For practical data science applications, we need to convert text into numeric tokens, since the machine works with numbers rather than with English words the way humans do. Here is a small example of tokenization.
Assume we have 5 sentences. Once we build a dictionary mapping each word to an integer, the sentences are tokenized as follows:
- I have books - [4, 5, 1]
- Interesting books are useful - [6, 1, 2, 7]
- I have computers - [4, 5, 3]
- Computers are interesting and useful - [3, 2, 6, 8, 7]
- Books and computers are both valuable - [1, 8, 3, 2, 9, 10]
We create tokens for the vocabulary based on frequency of occurrence: more frequent words get smaller indices, with ties broken by order of first appearance. Hence, we assign the following tokens:
books-1, are-2, computers-3, I-4, have-5, interesting-6, useful-7, and-8, both-9, valuable-10
Thankfully, our dataset already comes in this tokenized form: each review is provided as a sequence of integer word tokens.
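For reference, here is a small sketch of how such a mapping could be built with Keras's Tokenizer on the toy sentences above (illustrative only; the IMDB data is already tokenized for us):
from keras.preprocessing.text import Tokenizer
# Illustrative sketch: build a word-to-index mapping by frequency on the toy sentences.
toy_sentences = ['I have books',
                 'Interesting books are useful',
                 'I have computers',
                 'Computers are interesting and useful',
                 'Books and computers are both valuable']
toy_tokenizer = Tokenizer()
toy_tokenizer.fit_on_texts(toy_sentences)            # more frequent words get smaller indices
print(toy_tokenizer.word_index)
print(toy_tokenizer.texts_to_sequences(toy_sentences))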
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=vocabulary_size)
print('Number of reviews', len(X_train))
print('Length of first and fifth review before padding', len(X_train[0]) ,len(X_train[4]))
print('First review', X_train[0])
print('First label', y_train[0])
We pad the sequences to ensure that all inputs have the same length and dimensions.
DISCUSSION : Why are we padding here?
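As a quick sketch of what pad_sequences does (by default it pads, and truncates, at the front of each sequence):
# Toy illustration: shorter sequences are pre-padded with zeros,
# longer ones are truncated (both at the front by default).
toy_seqs = [[1, 2, 3], [4, 5, 6, 7, 8, 9]]
print(sequence.pad_sequences(toy_seqs, maxlen=5))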
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)
print('Length of first and fifth review after padding', len(X_train[0]) ,len(X_train[4]))
Is Accuracy the right metric to look at?¶
DISCUSSION : In what cases is accuracy a good metric for evaluating classification models?
What other metrics are useful in case accuracy proves to be an inadequate metric for our dataset? https://towardsdatascience.com/understanding-data-science-classification-metrics-in-scikit-learn-in-python-3bc336865019
from collections import Counter
counts = dict(Counter(y_train))
print('Number of zeroes : ', counts[0], ' and Number of ones : ', counts[1])
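The counts above show the two classes are balanced, so accuracy is a reasonable summary here. As a toy sketch (assuming scikit-learn is installed) of why accuracy can mislead when classes are imbalanced:
from sklearn.metrics import accuracy_score, recall_score
# Toy illustration: with 95% negatives, always predicting the majority class
# scores 95% accuracy yet catches none of the positive examples.
y_true_toy = np.array([0] * 95 + [1] * 5)
y_pred_toy = np.zeros(100, dtype=int)
print('Accuracy :', accuracy_score(y_true_toy, y_pred_toy))   # 0.95
print('Recall   :', recall_score(y_true_toy, y_pred_toy))     # 0.0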
MODEL 1(a) : FEEDFORWARD NETWORKS WITHOUT EMBEDDINGS¶
Let us build a single layer feedforward net with 250 nodes.
GOAL : Calculate the number of parameters involved in this network and implement a feedforward net to do classification.
model = Sequential()
model.add(Dense(250, activation='relu',input_dim=max_review_length))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=3, batch_size=128, verbose=2)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
Any idea why the performance is terrible?¶
Hint : Tokenization. The token indices are arbitrary IDs assigned by word frequency, yet this network treats them as meaningful continuous feature values.
Obvious Workaround : One-Hot Encodings
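A sketch of what such an encoding could look like: each review becomes a multi-hot "bag of words" vector of length vocabulary_size (illustrative only; note the memory cost, which is one motivation for embeddings below):
# Illustrative sketch: multi-hot encode each (padded) review as a 0/1 vector
# over the vocabulary. Index 0 is just the padding token here.
def to_multi_hot(sequences, dimension=vocabulary_size):
    encoded = np.zeros((len(sequences), dimension))
    for i, seq in enumerate(sequences):
        encoded[i, seq] = 1.0
    return encoded

X_train_hot = to_multi_hot(X_train[:1000])   # small slice to keep memory modest
print(X_train_hot.shape)                     # (1000, 10000)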
EMBEDDINGS - Sparse to Dense Transformations¶
We use embeddings to give our inputs a useful dense representation: the frequency-based tokens are discrete IDs with no continuous structure, and their one-hot equivalents are extremely high-dimensional.
What are embeddings ?¶
Embeddings are functional transformations from a sparse, discrete vector representation of text (either as tokens or as one-hot encodings) into a dense vector representation of fixed size (usually of much lower dimension than the vocabulary size). The dense representation allows the neural network to generalize better.
Here we train our own embedding jointly with the rest of the network. To transfer "knowledge" from other sources, more complicated projects can also use pre-trained embeddings such as word2vec, GloVe, fastText, etc. https://nlpforhackers.io/word-embeddings/
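As a rough sketch (not run here) of how a pre-trained embedding could be plugged in, assuming you have downloaded 100-dimensional GloVe vectors as glove.6B.100d.txt:
# Sketch only (assumes glove.6B.100d.txt has been downloaded).
# Note: imdb.load_data() shifts word indices by 3 to make room for reserved
# tokens, which a careful implementation would need to account for.
glove_vectors = {}
with open('glove.6B.100d.txt') as f:
    for line in f:
        parts = line.split()
        glove_vectors[parts[0]] = np.asarray(parts[1:], dtype='float32')

word_index = imdb.get_word_index()
embedding_matrix = np.zeros((vocabulary_size, 100))
for word, i in word_index.items():
    if i < vocabulary_size and word in glove_vectors:
        embedding_matrix[i] = glove_vectors[word]

pretrained_embedding = Embedding(vocabulary_size, 100,
                                 weights=[embedding_matrix],   # initialize with GloVe
                                 input_length=max_review_length,
                                 trainable=False)              # freeze during training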
Example Embeddings Transformation¶
Let us first understand how the Keras Embedding layer works through a dummy example, to see how the dimensions are transformed. In this example, each input token is mapped to a 64-dimensional vector (via the embedding layer).
EXERCISE : Manually calculate the number of parameters needed in the embedding layer before executing the code.
model = Sequential()
# arguments: vocabulary size (number of distinct tokens), embedding dimension, input length
model.add(Embedding(1000, 64, input_length=10))
print(model.summary())
# the model will take as input an integer matrix of size (batch, input_length).
# the largest integer (i.e. word index) in the input should be
# no larger than 999 (vocabulary size).
# now model.output_shape == (None, 10, 64), where None is the batch dimension.
input_array = np.random.randint(1000, size=(32, 10))
print("Shape of input : ", input_array.shape)
model.compile('rmsprop', 'mse')
output_array = model.predict(input_array)
assert output_array.shape == (32, 10, 64)
print(input_array[0])
print(output_array[0].shape)
MODEL 1(b) : FEEDFORWARD NETWORKS WITH EMBEDDINGS¶
EXERCISE : Implement the feedforward net combining the embedding layer and the feedforward layer (one layer, 250 nodes) without looking at the cells below. Manually calculate the number of parameters needed in the network before executing the code.
embedding_dim = 100
model = Sequential()
model.add(Embedding(vocabulary_size, embedding_dim, input_length=max_review_length))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=2, batch_size=128, verbose=2)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
MODEL 2 : Convolutional Nets¶
Text can be thought of as a 1-dimensional sequence of words, and we can apply 1-D convolutions over windows of consecutive words. Let us walk through convolutions on text data with this blog.
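To see the shapes involved, here is a small dummy example of a 1-D convolution plus pooling over a sequence of word vectors (illustrative numbers only):
# Dummy illustration: 32 "sentences" of 10 tokens, each token embedded into 64 dims.
demo = Sequential()
demo.add(Embedding(1000, 64, input_length=10))
demo.add(Conv1D(filters=16, kernel_size=3, padding='same', activation='relu'))
demo.add(MaxPooling1D(pool_size=2))
demo.compile('rmsprop', 'mse')
dummy_input = np.random.randint(1000, size=(32, 10))
# Each filter looks at kernel_size=3 consecutive word vectors at a time;
# pooling then halves the sequence length: expected output shape (32, 5, 16).
print(demo.predict(dummy_input).shape)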
EXERCISE : Manually calculate the number of parameters needed in this convolutional network before executing the code.
# create the model
model = Sequential()
model.add(Embedding(vocabulary_size, embedding_dim, input_length=max_review_length))
model.add(Conv1D(filters=embedding_dim, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=2, batch_size=128, verbose=2)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
EXERCISE¶
Try other CNNs with
- Different kernel sizes
- Different pooling operations (e.g., AveragePooling1D)
DISCUSSION : What do max pooling and average pooling mean in terms of processing text sequences?
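As a starting point, one possible variant might look like the sketch below (a wider kernel and average pooling; the choices here are just examples to modify, not a tuned model):
from keras.layers import AveragePooling1D
# Sketch of one possible variant to experiment with.
variant = Sequential()
variant.add(Embedding(vocabulary_size, embedding_dim, input_length=max_review_length))
variant.add(Conv1D(filters=embedding_dim, kernel_size=5, padding='same', activation='relu'))
variant.add(AveragePooling1D(pool_size=2))
variant.add(Flatten())
variant.add(Dense(250, activation='relu'))
variant.add(Dense(1, activation='sigmoid'))
variant.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(variant.summary())
variant.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=2, batch_size=128, verbose=2)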