Key Word(s): NLP
import numpy as np
import pandas as pd
import re
from sklearn.model_selection import train_test_split
np.random.seed(0)
Movie Review Classifier 🍿📽️
In this exercise we'll be training a model to classify movie reviews as 'good' or 'bad.' The data consists of 50,000 real movie reviews from IMDB. Obligatory Disclaimer: This is real-world data and so it's possible that it contains language or topics that some may find offensive. 🙈
We'll load the data, which is hosted on the course Github repo as a zipped csv (it's too big to upload to Ed). Notice that pd.read_csv() can take a URL as the path argument, and that we can read in a compressed file without first expanding it if we specify the compression format!
data_url = 'https://github.com/Harvard-IACS/2021-CS109A/raw/master/content/lectures/lecture23/data/movie_reviews.zip'
df = pd.read_csv(data_url, compression='zip')
df.head()
df.shape
df.label.unique()
We see that the dataset consists of text reviews and binary labels. Intuitively, the positive class is "good" while the negative is "bad."
Here are two examples from the dataset:
labels = {0: 'bad', 1: 'good'}
seen = {'bad': False, 'good': False}
for i in range(df.shape[0]):
    label = df.loc[i, 'label']
    if not seen[labels[label]]:
        # display/print combination used to appease Ed's strange output behavior
        display(df.loc[i, 'text'])
        print()
        display(f"label: {labels[label]}")
        print()
        seen[labels[label]] = True
    if all(seen.values()):
        break
Some Preprocessing
In the 2nd example, we can see some html tags inside the review text.
Complete the remove_br() function by providing its call to re.sub() with a regex that removes those pesky "<br />" tags from an input string, x. Specifically, we should replace 2 consecutive occurrences of "<br />" with a single space (can you see why?).
Hint: It is good practice to use a 'raw' string when writing regular expressions to ensure that special characters are treated correctly. Raw strings are prefixed with an 'r' like this: r'this is a raw string'
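If re.sub() is new to you, here's a tiny, self-contained illustration (deliberately not the exercise answer) of substituting a raw-string pattern:
import re
# Illustrative only: collapse runs of 2 or more '!' characters into one.
# The raw string r'!{2,}' means "two or more exclamation marks".
print(re.sub(r'!{2,}', '!', 'Great movie!!! Loved it!!'))  # Great movie! Loved it!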
### edTest(test_remove_br) ###
# fill in the regular expression
remove_br = lambda x: re.sub(___, ' ', x)
Use the dataframe's apply() method to apply remove_br to each review in the dataset.
df['text'] = df.text.apply(___)
And we can see that the tags have been removed!
df.loc[4,'text']
Don't worry about any newline characters or backslashes you may see before apostrophes in the examples above. This is just a quirk of how Jupyter displays strings by default. We don't see these characters if we explicitly print the string.
example_str = df.loc[4,'text']
print(example_str)
Next we'll continue our preprocessing by removing punctuation. But first, let's keep a copy of the data with punctuation. This will be useful at the end of the notebook when we want to display the original text of specific observations.
# store copy of data with punctuation
df_raw = df.copy()
The next regex we need is a bit more involved. It should match any character that is neither whitespace nor alphanumeric, as well as underscores (strangely, underscores are not covered by the first 2 conditions).
Hints:
- \w matches alphanumeric characters
- \s matches whitespace
- [] can be used to denote a set of characters. ex: r'[ab]' will match on 'a' or 'b'
- ^ at the beginning of a character set denotes negation. ex: r'[^0-9]' will match any non-digit
- | is the logical or operator. ex: r'cat|dog' will match the strings 'cat' or 'dog'
- There are many helpful sites online for testing regexes.
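To see how a negated character set behaves, here's a small illustration (not the exercise answer):
import re
# Illustrative only: r'[^a-z ]' matches any character that is NOT a
# lowercase letter or a space, so substituting '' strips everything else
print(re.sub(r'[^a-z ]', '', 'Hello, World! 123'))  # 'ello orld '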
### edTest(test_punc_regex) ###
# create a regex that will match the characters described above
punc_regex = ___
Here we'll use an alternative to the apply approach we saw above. Pandas has its own set of built-in string methods, which includes a version of replace. But unlike Python's str.replace(), this one can actually use regexes!
df['text'] = df.text.str.replace(punc_regex, '', regex=True) # remove punctuation
If all went well we can see that punctuation has been removed from our dataset.
example_str = df.loc[4,'text']
print(example_str)
Train/Test Split
Rather than splitting the data directly with train_test_split, we'll instead use it to generate indices for the train and test data. This may seem strange, but there is a good reason for it: these indices will later allow us to recover the original, unprocessed text from df_raw for any given training or test observation.
Notice too that we are stratifying on the label. This will help ensure that good and bad reviews appear in the same proportions in both train and test.
# generate indices to designate train and test observations
train_idx, test_idx = train_test_split(range(df.shape[0]), test_size=0.2, random_state=0, stratify=df['label'])
# Separate the predictor from the response
x = df.text.values
y = df.label.values
# Create train and test sets using the generated indices
x_train = x[train_idx]
y_train = y[train_idx]
x_test = x[test_idx]
y_test = y[test_idx]
Building the Classifier Pipeline
Step 1: Vectorizer
It's true that there are still several preprocessing steps to be done, such as converting to lowercase and tokenizing the reviews, but these can be done for us using sklearn's TfidfVectorizer.
from sklearn.feature_extraction.text import TfidfVectorizer
Instantiate a TfidfVectorizer
with parameters such that it will:
- set all reviews to lowercase
- remove english stopwords
- exclude words that occur in fewer than 1 review in 10,000
- exclude words that occur in more than 90% of reviews
Hint: Reading the documentation, you'll see the arguments you need are lowercase, stop_words, min_df, and max_df.
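If the min_df and max_df semantics feel fuzzy, here's a tiny illustration on made-up documents (the values below are arbitrary, not the exercise answer). An int is an absolute document count, while a float is a proportion of documents.
# Illustrative only: min_df=2 drops words appearing in fewer than 2 documents;
# max_df=0.75 drops words appearing in more than 75% of documents
demo_docs = ['red apple', 'red banana', 'green banana', 'green grape']
demo_vec = TfidfVectorizer(min_df=2, max_df=0.75)
demo_vec.fit(demo_docs)
print(demo_vec.get_feature_names())  # ['banana', 'green', 'red']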
### edTest(test_tfidf) ###
vec = TfidfVectorizer(___)
Step 2: Classifier
We'll use logistic regression with l2 regularization as our classifier model. The LogisticRegressionCV object allows us to easily tune for the best regularization parameter.
from sklearn.linear_model import LogisticRegressionCV
With 40,000 training observations and each word in the vectorizer's vocabulary acting as a predictor, training could be slow. This issue is exacerbated when using cross-validation, as we need to fit the model multiple times! We'll set our classifier's CV parameters so as to help keep the training time down to around 30 seconds or so.
- l2 penalty (i.e., Ridge)
- 10 iterations per fit (remember, logistic regression has no closed form solution for the betas!)
- 5-fold CV
- random state of 0 (the fitting can be stochastic)
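If you haven't used LogisticRegressionCV before, here's a minimal sketch of the API on a toy dataset (the argument values are placeholders for illustration, not the exercise answer):
# Illustrative only: LogisticRegressionCV tunes the regularization strength C
# internally via cross-validation; C_ stores the winning value per class
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
X_demo, y_demo = make_classification(n_samples=200, random_state=0)
demo_clf = LogisticRegressionCV(cv=3, max_iter=100, random_state=0)
demo_clf.fit(X_demo, y_demo)
print(demo_clf.C_)  # best C found by cross-validation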
### edTest(test_clf) ###
# Instantiate our Classifier
clf = LogisticRegressionCV(___)
Step 3: Pipeline
Any text data going into our classifier will have to first be converted to numerical data by our vectorizer. One way to do this would be to:
1. fit the vectorizer on the training data
2. transform a dataset with the fitted vectorizer
3. pass the transformed data to the classifier

(1) only needs to be done once, but (2) & (3) would need to be done manually for train, test, and any other data we want to give the model. This would be tedious! Luckily, sklearn's Pipeline object allows us to connect one or more 'transformers' (such as a scaler or vectorizer) with a model.
from sklearn.pipeline import make_pipeline
Use make_pipeline() to connect the vectorizer, vec, and our classifier, clf, into a single pipeline.
Hint: You can set verbose=True to see the individual steps during the fit process later.
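To get a feel for what make_pipeline() produces, here's a small sketch with stand-in steps (a scaler and a plain logistic regression, purely for illustration):
# Illustrative only: make_pipeline chains transformers with a final estimator
# and auto-names each step after its lowercased class name
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
demo_pipe = make_pipeline(StandardScaler(), LogisticRegression())
print(demo_pipe.steps)  # [('standardscaler', ...), ('logisticregression', ...)]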
### edTest(test_pipeline) ###
# Construct the pipeline
pipe = make_pipeline(___)
Step 4: Fitting
When it comes to fitting, we can treat the pipeline object as if it were the classifier object itself,
and simply call fit
on the pipeline.
# For the sake of time, we are fitting quickly and we may not converge
# We'll suppress those pesky warnings
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
# We also ignore FutureWarnings due to version issues on Ed
simplefilter("ignore", category=(ConvergenceWarning, FutureWarning))
### edTest(test_fit) ###
# Fit the model via the pipeline
pipe.___(___,___)
We can inspect the steps of the pipeline.
pipe.get_params()['steps']
By default the steps are named using the all-lowercase class name of each object. We can use these names to access the fitted objects inside. Here we see the size of our vectorizer's vocabulary.
features = pipe.get_params()['tfidfvectorizer'].get_feature_names()  # renamed get_feature_names_out() in newer sklearn versions
print('# of features:', len(features))
There are too many to print, but we can peek at a random sample.
sample_size = 40
feature_sample_idx = np.random.choice(len(features), size=sample_size, replace=False)
print(np.array(features)[feature_sample_idx])
Similarly, we can access the fitted logistic model and see what regularization parameter was used.
best_C = pipe.get_params()['logisticregressioncv'].C_[0]
print(f'Best C from cross-validation: {best_C:.4f}')
Step 5: Prediction
Just like we did when fitting, we can treat the pipeline object as the classifier when making predictions. Predict on the test data to get:
- class labels
- probabilities of being the positive class (i.e., 'good' reviews)
- test accuracy
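As a reminder of predict_proba's output shape, here's a standalone illustration with made-up numbers:
import numpy as np
# Illustrative only: predict_proba returns one column per class, ordered by
# the classifier's classes_ attribute; for 0/1 labels, column 1 is the
# positive class
demo_proba = np.array([[0.8, 0.2],
                       [0.1, 0.9]])  # shape (n_samples, n_classes)
print(demo_proba[:, 1])  # P(label 1) for each sample -> [0.2 0.9]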
### edTest(test_pred) ###
# Predict class labels on test data
y_pred = pipe.___(___)
# Predict probabilities of the positive on the test data
y_pred_proba = pipe.___(___)[___,___]
# Calculate test accuracy (there are several ways to do this)
test_acc = ___
print(f"test accuracy: {test_acc:0.3f}")
Can you get better than 0.896 by tweaking the preprocessing, or the vectorizer and classifier parameters? Perhaps inspecting how our model makes its predictions may help us decide how we might improve the model in the future.
Step 6: Interpretation
Below we'll use the eli5
library we saw in Model Interpretation Lab (#11) to have some fun
interpreting what is driving our model's predictions on specific test observations.
# For interpretation
import eli5
# for parsing/formatting eli5's HTML output
from bs4 import BeautifulSoup
# for displaying formatted HTML output
from IPython.display import HTML
Here are the words driving positive class predictions.
eli5.show_weights(clf, vec=vec, top=25)
Hmm, those digits like 710, 810, and 410 driving predictions seem strange. What might they represent?
We'll use the 'raw' data with punctuation when inspecting the data (See! It is coming in handy!)
x_train_raw = df_raw.text[train_idx].values
x_test_raw = df_raw.text[test_idx].values
df_raw[df.text.str.contains(' 710 ')].iloc[0].text
These are actually numerical ratings embedded in the reviews! Looking at the text without the punctuation made it hard for us to see this at first.
Here's a helper function used to remove some extraneous things from eli5's output. We just want to see the highlighted text. You don't need to read through the function, but it is here as a nice resource/example. 🤓
def eli5_html(clf, vec, observation):
    """
    helper function for nicely formatting and displaying eli5 output
    """
    # Get info on what is driving a given observation's predictions
    eli5_results = eli5.show_prediction(estimator=clf, doc=observation, vec=vec, targets=[True], target_names=['bad', 'good'])
    # Convert eli5's HTML data to a BS object for parsing/formatting
    soup = BeautifulSoup(eli5_results.data, 'html.parser')
    # Remove a table we don't want
    soup.table.decompose()
    # Remove the first <p> tag with unwanted text
    soup.p.decompose()
    # Display the newly formatted HTML!
    display(HTML(str(soup)))
Now all you need to do is find the specific observations requested. You'll need your y_pred_proba values for this section to find which elements from x_test_raw to select.
Hint: np.argsort(), np.flip(), and np.abs() may be useful here.
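In case those NumPy helpers are rusty, here's a tiny illustration of picking the k smallest and largest entries by index (made-up numbers):
import numpy as np
# Illustrative only: np.argsort returns indices ordered smallest to largest
demo_scores = np.array([0.9, 0.1, 0.5, 0.3])
order = np.argsort(demo_scores)
print(order[:2])           # indices of the 2 smallest -> [1 3]
print(np.flip(order)[:2])  # indices of the 2 largest  -> [0 2]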
What are the 5 worst movie reviews in the test set according to your model? 🍅
# Find indices of 5 worst reviews
worst5 = x_test_raw[___]
for i, review in enumerate(worst5):
    style = 'background-color:black;color:white;font-weight:bold;padding:4px'
    display(HTML(f"<p style={style}>Bad Movie #{i+1} 🍅</p>"))
    eli5_html(clf, vec, review)
What are the 5 best movie reviews in the test set according to your model? 🏆
# Find indices of 5 best reviews
best5 = x_test_raw[___]
for i, review in enumerate(best5):
    display(HTML(f"<p style={style}>Good Movie #{i+1} 🏆</p>"))
    eli5_html(clf, vec, review)
What are the 5 most 'meh' movie reviews in the test set according to your model? 😐
That is, which reviews are the most neutral according to your model? Upon reading some of these reviews you may find their sentiment is actually not very ambiguous. What might be confusing our model?
# Find indices of the 5 most neutral reviews
meh5 = x_test_raw[___]
for i, review in enumerate(meh5):
    display(HTML(f"<p style={style}>'Meh' Movie #{i+1} 😐</p>"))
    eli5_html(clf, vec, review)
Despite some difficulties with a few of the 'meh' movies, our model is actually pretty good! In fact, it works so well you can actually use it to find mistakes in the manually labeled data! This can be done by inspecting which training observation predictions differ the most from the provided labels. (But if you do decide to explore this, just remember the disclaimer at the top of the notebook!)
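If you're curious, here's a minimal sketch of that mislabel-hunting idea, reusing the fitted pipe and the training arrays from above (the top-5 cutoff is an arbitrary choice):
# Sketch only: flag training reviews whose predicted probability disagrees
# most with the provided label -- these are candidate labeling mistakes
train_proba = pipe.predict_proba(x_train)[:, 1]
disagreement = np.abs(train_proba - y_train)
suspect_idx = np.flip(np.argsort(disagreement))[:5]
for i in suspect_idx:
    print(f"label: {y_train[i]}, predicted P(good): {train_proba[i]:.3f}")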
Write your own review
Finally, you can try writing a review of your own and see what your model does with it!
my_review = """
your review here
"""
# Remove punctuation using your regex from earlier
my_review = re.sub(punc_regex, '', my_review)
# Remove leading & trailing whitespace
# and put into a numpy array (which the model expects)
my_review = np.array([my_review.strip()])
my_review
my_review_proba = pipe.predict_proba(my_review)[:,1][0]
my_review_label = pipe.predict(my_review)[0]
print('predicted class:', my_review_label)
print('predicted probability:', my_review_proba)
display(HTML(f"<p style={style}>My Review 🍿</p>"))
eli5_html(clf, vec, my_review[0])