Key Word(s): data, datasets, bias, regular expressions, regex

CS109A Introduction to Data Science

Lecture 2, Exercise 1: RegEx

Harvard University
Fall 2020
Instructors: Pavlos Protopapas, Kevin Rader, and Chris Tanner

Title

Exercise 1

Description

Introduction

Regular Expressions (RegEx) are a mechanism for defining a pattern to search for in text. They are not just a Python concept or library; they extend beyond the scope of any one programming language, and fortunately many programming languages support them. This makes string processing (namely, parsing) incredibly convenient, as it is vastly easier than writing our own search logic (e.g., via a series of if-statements).
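To make the comparison with hand-written search logic concrete, here is a minimal sketch that collects the numbers in a string twice: once with an explicit loop and if-statements, and once with a single pattern. The example string and both function names are hypothetical and are not part of the exercise.

```python
import re

# Hypothetical example string (not the exercise's sample_string).
text = "The course ran in 2019 and again in 2020."

def find_numbers_manually(s):
    """Collect runs of digits by walking the string character by character."""
    numbers, current = [], ""
    for ch in s:
        if ch.isdigit():
            current += ch            # still inside a run of digits
        elif current:
            numbers.append(current)  # the run just ended
            current = ""
    if current:                      # a run that reaches the end of the string
        numbers.append(current)
    return numbers

def find_numbers_with_regex(s):
    """The same search as a pattern: one or more digits in a row."""
    return re.findall(r"\d+", s)

manual_result = find_numbers_manually(text)
regex_result = find_numbers_with_regex(text)
print(manual_result)
print(regex_result)
```

Both functions return the same list here; the pattern version simply states what to look for rather than how to scan for it.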

Programming languages differ slightly in their RegEx syntax. Regular expressions can quickly get tedious, and no student is expected to memorize every possible character sequence or to master them. We do expect you to know what they are and to have basic exposure to them, so that you are aware of their incredible utility and can use them in the future when you want to parse text. It is perfectly permissible (and expected) for students to reference the syntax guide while creating RegEx. You will quickly learn that a few powerful character sequences get used over and over, and these are the ones worth committing to memory:

any whitespace
non-whitespace
any character
non-character
digit
non-digit
occurs 0 or more times
occurs 1 or more times
start of a string
end of a string
the difference between greedy and non-greedy (aka reluctant) searching
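The list above names each idea without showing its symbol, so the sketch below pairs the first ten items with the tokens that Python's `re` module uses for them. Two readings are assumptions on my part: "any character" is taken to mean `.` (any character except a newline), and "non-character" is taken to mean `\W` (any non-word character). The strings are hypothetical and unrelated to the exercise.

```python
import re

demo = "Room 42 opens at 9am"  # hypothetical example string

whitespace     = re.findall(r"\s", demo)       # \s  any whitespace
first_char     = re.findall(r"^\S", demo)      # ^   start of string, \S non-whitespace
last_char      = re.findall(r"\S$", demo)      # $   end of string
nine_plus_two  = re.findall(r"9..", demo)      # .   any character (except newline)
non_word       = re.findall(r"\W", "a-b c!")   # \W  non-word character (assumed meaning)
digits         = re.findall(r"\d", demo)       # \d  digit
non_digit_runs = re.findall(r"\D+", demo)      # \D  non-digit, + one or more times

# * allows zero occurrences, + requires at least one.
zero_or_more = re.findall(r"Ro*m", "Rm Rom Room")
one_or_more  = re.findall(r"Ro+m", "Rm Rom Room")

print(whitespace, first_char, last_char, nine_plus_two)
print(non_word, digits, non_digit_runs)
print(zero_or_more, one_or_more)
```

The only difference between the last two patterns is the quantifier, which is why `Rm` appears in one result and not the other.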

Resources

You may find this "cheatsheet" useful.
As part of your post-class assignment, you should look through the official Python documentation. It's okay to skim most sections (skip .split() and .sub()), but you should pay particular attention to:

the Performing Matches section (.match())

Grouping section (.group())

Greedy vs Non-Greedy
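As a quick preview of those three documentation sections, the following sketch shows `.match()`, `.group()`, and the greedy versus non-greedy distinction on a single hypothetical string. It is only an illustration under that assumption, not a substitute for the documentation.

```python
import re

line = "key=value=more"  # hypothetical example string

# .match() only succeeds if the pattern matches at the start of the string.
# It returns a Match object, or None when there is no match.
m = re.match(r"(\w+)=(\w+)", line)
whole  = m.group(0)   # the entire matched text
first  = m.group(1)   # text captured by the first pair of parentheses
second = m.group(2)   # text captured by the second pair of parentheses

# "value" is in the string but not at the start, so .match() fails
# while .search() (which scans the whole string) succeeds.
at_start = re.match(r"value", line)
anywhere = re.search(r"value", line)

# Greedy: .+ grabs as much as possible, so the group runs to the LAST "=".
greedy = re.match(r"(.+)=", line).group(1)
# Non-greedy: .+? grabs as little as possible, so the group stops at the FIRST "=".
non_greedy = re.match(r"(.+?)=", line).group(1)

print(whole, first, second)
print(at_start, anywhere.group(0))
print(greedy, non_greedy)
```

Adding `?` after a quantifier is the only change between the greedy and non-greedy patterns.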

For immediate, visual feedback on whether your regular expression matches the way you want, I highly recommend websites that provide online regular expression testing, e.g.:

Pythex <-- my favorite (it includes a cheatsheet button)
Regex101

Exercise

In this exercise, we want you to practice using Regular Expressions. If you have used them many times in the past, this will likely be a breeze for you. If this is your first time, please do not be intimidated by the unusual-looking syntax. After writing a few correct RegEx, you will be a pro in no time.

In [ ]:

import re
import requests

For the three parts to this Exercise, you will be extracting contents from the sample_string provided below.

In [ ]:

# she was elected as State Representative of my district (Somerville) this week
# (I am not blasting contact information of random people)
sample_string = "Hello, my name is Erika Uyterhoeven, and my email is erika@electerika.com!!"

Part A: Your first RegEx (or at least in CS109A)

Write code (fill in the blank) that returns a list of all "words" in sample_string. NOTE: here, we consider a "word" to be any contiguous group of characters delimited by whitespace. Thus, words include any attached punctuation. For example, the very first "word" is Hello, (with the comma attached), and the very last "word" is erika@electerika.com!!

In [ ]:

### edTest(test_a) ###
words = re.findall('___',sample_string)
words

Part B: No punctuation

In the cell below, write a very similar RegEx, but now return all words excluding any attached punctuation marks, even if there is more than one punctuation mark. For example, the first word is now Hello (without the ,) and the last word is erika@electerika.com (without the !!). NOTE: For this part, let's assume there are only 2 types of punctuation in the world, ! and ,. Thus, do not worry about properly treating others (e.g., . ; [ ]).

In [ ]:

### edTest(test_b) ###
words = re.findall('___',sample_string)
words

Part C: E-mail only

In the cell below, write a RegEx to extract just her e-mail address, excluding the exclamation points. Thus, you should return erika@electerika.com. To be clear, you do not need to write a robust RegEx that properly matches all patterns of ____@___.__. It is fine to target just this one e-mail by assuming the template is ____@____ (without the attached !!).

In [ ]:

### edTest(test_c) ###
words = re.findall('___',sample_string)
words