Chapter 17 Regular Expression

17.1 Comparison of R and Python

R                   Python
str_extract()       re.search()
str_extract_all()   re.findall()
str_match_all()     re.finditer()
str_replace_all()   re.sub()
str_split()         re.split()
?                   re.subn()
?                   re.match()
str_detect()        ?
str_subset()        ?

The table above maps stringr's regular expression functions in R to their closest counterparts in Python's re module; a ? marks a cell for which no direct counterpart is listed. The two sets of functions are broadly similar, so these mappings can help R users find their way around re in Python.
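For the two stringr functions without a listed counterpart, one common approximation is sketched below. This is an assumption added here, not a mapping given in the table: str_detect() is imitated by wrapping re.search() in bool(), and str_subset() by filtering a list with it. The helper names and the sample list are hypothetical.

```python
import re

def str_detect(strings, pattern):
    """Hypothetical helper: one True/False per string, like stringr's str_detect()."""
    return [bool(re.search(pattern, s)) for s in strings]

def str_subset(strings, pattern):
    """Hypothetical helper: keep only the strings that contain a match."""
    return [s for s in strings if re.search(pattern, s)]

fruits = ["apple", "banana", "cherry"]  # made-up sample data
detected = str_detect(fruits, r"an")
subset = str_subset(fruits, r"an")
print(detected)  # [False, True, False]
print(subset)    # ['banana']
```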

17.2 Structure of Regular Expression Usage

17.2.1 re.search()

  • Import the regex module with import re
  • Create a Regex object by compiling a regular expression pattern (re.compile()). Remember to use a raw string.
  • Use the pattern for search (re.search()) by passing the string you want to search into the Regex object’s search method. This returns a Match object.
  • Call the Match object’s group() method to return a string of the actual matched text.
import re

text_to_search = r'''
abcdefghijklmnopqurtuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890
Ha HaHa
MetaCharacters (Need to be escaped):
. ^ $ * + ? { } [ ] \ | ( )
coreyms.com
321-555-4321
123.555.1234
123*555*1234
800-555-1234
900-555-1234
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
'''

sentence = 'Start a sentence and then bring it to an end'

pattern = re.compile(r'\d{3}-\d{3}-\d{4}', re.I)
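The code above stops after compiling the pattern. The remaining two steps, searching and calling group(), can be sketched as follows. The short sample string is made up for illustration so that the block runs on its own.

```python
import re

# Hypothetical sample text, used here instead of text_to_search
sample = "Call 321-555-4321 or 800-555-1234."

pattern = re.compile(r'\d{3}-\d{3}-\d{4}')

m = pattern.search(sample)    # Match object for the first match, or None
if m:
    first_number = m.group()  # the matched text
    first_span = m.span()     # (start, end) offsets of the match
    print(first_number, first_span)  # 321-555-4321 (5, 17)
```

Because search() returns None when nothing matches, it is safer to test the result before calling group().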

17.2.2 re.findall()

  • While search() returns a Match object for only the first match in the searched string, findall() returns the strings of every match. (findall() does not return Match objects but a list of strings, as long as there are no groups in the regular expression.)
## perform a search
matches = pattern.findall(text_to_search)
matches
['321-555-4321', '800-555-1234', '900-555-1234']
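  • If the regular expression contains groups, findall() returns the groups instead of the whole match: a list of tuples when there are several groups, or a list of strings when there is exactly one.
pattern2 = re.compile(r'(\d{3})-(\d{3})-(\d{4})', re.I)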
pattern2.findall(text_to_search)
[('321', '555', '4321'), ('800', '555', '1234'), ('900', '555', '1234')]
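The effect of groups on findall() can be seen side by side in the sketch below, which uses a made-up sample string: no groups gives whole matches, one group gives that group's strings, and several groups give tuples.

```python
import re

sample = "321-555-4321 and 800-555-1234"  # hypothetical sample text

no_groups = re.findall(r'\d{3}-\d{3}-\d{4}', sample)
one_group = re.findall(r'(\d{3})-\d{3}-\d{4}', sample)
many_groups = re.findall(r'(\d{3})-(\d{3})-(\d{4})', sample)

print(no_groups)    # ['321-555-4321', '800-555-1234']
print(one_group)    # ['321', '800']
print(many_groups)  # [('321', '555', '4321'), ('800', '555', '1234')]
```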

17.2.3 re.finditer()

## find all matches
## finditer() returns an iterator of Match objects (always truthy, so no check is needed)
matches = pattern.finditer(text_to_search)
for m in matches:
    print("%02d-%02d: %s" % (m.start(), m.end(), m.group()))
151-163: 321-555-4321
190-202: 800-555-1234
203-215: 900-555-1234

17.3 Special Flags/Settings for Regular Expressions

  • re.IGNORECASE: case-insensitive for pattern matching
  • re.DOTALL: allows the wildcard . to match line breaks as well
  • re.VERBOSE: allows complex regular expressions to be written across multiple lines, with comments (#)
pattern3 = re.compile(r'''
  (\d{3})       # area code
  -             # delimiter
  (\d{3})       # first 3 digits
  -             # delimiter
  (\d{4})       # last 4 digits
''', re.VERBOSE)

pattern3.findall(text_to_search)
[('321', '555', '4321'), ('800', '555', '1234'), ('900', '555', '1234')]
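The other two flags can be illustrated in the same way. The following is a minimal sketch; the two-line sample string is an assumption made for illustration.

```python
import re

# re.IGNORECASE: 'ha' also matches 'Ha'
ci = re.findall(r'ha', 'Ha HaHa', re.IGNORECASE)
print(ci)  # ['Ha', 'Ha', 'Ha']

two_lines = 'first line\nsecond line'  # hypothetical sample text

# Without re.DOTALL, '.' stops at the line break
without_dotall = re.search(r'first.*', two_lines).group()
print(repr(without_dotall))  # 'first line'

# With re.DOTALL, '.' also matches '\n', so the match runs to the end
with_dotall = re.search(r'first.*', two_lines, re.DOTALL).group()
print(repr(with_dotall))     # 'first line\nsecond line'
```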

Exercise 17.1 Using text_to_search, write a more complete regular expression that extracts all the phone numbers, including those that use other delimiters (e.g., . and *).

[('321', '555', '4321'), ('123', '555', '1234'), ('123', '555', '1234'), ('800', '555', '1234'), ('900', '555', '1234')]

17.4 Regular Expression in Python

17.4.1 Raw String Notation

Raw string notation (r'text') keeps regular expressions sane. Without it, every backslash ('\') in a regular expression would have to be prefixed with another one to escape it. For example, r'\d{3}' and '\\d{3}' denote exactly the same pattern.
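A minimal sketch of this equivalence, using patterns chosen here for illustration, together with one case where forgetting the raw string silently changes the pattern:

```python
import re

raw = r'\d{3}'      # raw string: the backslash is kept as typed
escaped = '\\d{3}'  # ordinary string: the backslash must be doubled

same = (raw == escaped)
print(same)      # True
print(len(raw))  # 5 characters: \ d { 3 }

# In an ordinary string, '\b' is a backspace character, not a word boundary
with_raw = re.search(r'\bcat\b', 'a cat')
without_raw = re.search('\bcat\b', 'a cat')
print(with_raw is not None)  # True
print(without_raw)           # None
```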

17.4.2 Find all matches

  • re.findall(): matches all occurrences of a pattern, not just the first one as re.search() does.

  • re.finditer(): If one wants more information about all matches of a pattern than the matched text, re.finditer() is useful as it provides match objects instead of strings.

17.4.3 group() vs. groups()

  • group(): by default, returns the whole match of the pattern
  • groups(): by default, returns all capturing groups
m = re.match("a(.)(.)","abcedf")

print(m.group(0)) # return the whole match
abc
print(m.group()) # return the whole match, same as above
abc
print(m.groups()) # return each capturing group match
('b', 'c')
print(m.group(1)) # return first capturing group match
b
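Two further cases may help to separate the two methods; this is a small sketch built on the same example string. group() accepts several group numbers at once, and groups() is empty when the pattern has no capturing groups.

```python
import re

m = re.match(r"a(.)(.)", "abcedf")

pair = m.group(1, 2)  # several groups at once, returned as a tuple
print(pair)           # ('b', 'c')

# A pattern without capturing groups still has group 0, but groups() is empty
m2 = re.match(r"a..", "abcedf")
whole = m2.group()
no_groups = m2.groups()
print(whole)      # abc
print(no_groups)  # ()
```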

17.4.4 String Format Validation

valid = re.compile(r"^[a-z]+@[a-z]+\.[a-z]{3}$")
print(valid.match('alvin@ntnu.edu'))
<re.Match object; span=(0, 14), match='alvin@ntnu.edu'>
print(valid.match('alvin123@ntnu.edu'))
None
print(valid.match('alvin@ntnu.homeschool'))
None
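One caveat, added here as a side note rather than taken from the example above: without re.MULTILINE, $ also matches just before a newline at the very end of the string, so match() accepts an input with a trailing line break. If that matters, fullmatch() is a stricter alternative. The input below is hypothetical.

```python
import re

valid = re.compile(r"^[a-z]+@[a-z]+\.[a-z]{3}$")

candidate = 'alvin@ntnu.edu\n'  # hypothetical input with a trailing newline

loose = bool(valid.match(candidate))       # $ tolerates the final '\n'
strict = bool(valid.fullmatch(candidate))  # the whole string must match
print(loose)   # True
print(strict)  # False
```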

17.4.5 re.match() vs. re.search()

Python offers two different primitive operations based on regular expressions:

  • re.match() checks for a match only at the beginning of the string
  • re.search() checks for a match anywhere in the string (this is what Perl does by default).
print(re.match("c", "abcdef"))    # No match
None
print(re.search("^c", "abcdef"))  # No match, same as above
None
print(re.search("c", "abcdef"))   # Match
<re.Match object; span=(2, 3), match='c'>
  • re.match() only matches at the beginning of the input string, even in MULTILINE mode.
  • re.search(), however, can match at the beginning of every line in MULTILINE mode when the pattern starts with ^.
print(re.match('X', 'A\nB\nX', re.MULTILINE))  # No match
None
print(re.search('^X', 'A\nB\nX', re.MULTILINE))  # Match
<re.Match object; span=(4, 5), match='X'>
print(re.search('^X', 'A\nB\nX')) # No match
None

17.4.6 re.split()

text = """Ross McFluff: 834.345.1254 155 Elm Street

Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger: 925.541.7625 662 South Dogwood Way


Heather Albrecht: 548.326.4584 919 Park Place"""
# split text into lines
re.split(r'\n',text)
['Ross McFluff: 834.345.1254 155 Elm Street', '', 'Ronald Heathmore: 892.345.3428 436 Finley Avenue', 'Frank Burger: 925.541.7625 662 South Dogwood Way', '', '', 'Heather Albrecht: 548.326.4584 919 Park Place']
re.split(r'\n+', text)
['Ross McFluff: 834.345.1254 155 Elm Street', 'Ronald Heathmore: 892.345.3428 436 Finley Avenue', 'Frank Burger: 925.541.7625 662 South Dogwood Way', 'Heather Albrecht: 548.326.4584 919 Park Place']
entries = re.split(r'\n+', text)
[re.split(r'\s', entry) for entry in entries]
[['Ross', 'McFluff:', '834.345.1254', '155', 'Elm', 'Street'], ['Ronald', 'Heathmore:', '892.345.3428', '436', 'Finley', 'Avenue'], ['Frank', 'Burger:', '925.541.7625', '662', 'South', 'Dogwood', 'Way'], ['Heather', 'Albrecht:', '548.326.4584', '919', 'Park', 'Place']]
[re.split(r':?\s', entry, maxsplit=3) for entry in entries]
[['Ross', 'McFluff', '834.345.1254', '155 Elm Street'], ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'], ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'], ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]

17.5 Text Munging

  • re.sub()
text = '''Peter Piper picked a peck of pickled peppers
A peck of pickled peppers Peter Piper picked
If Peter Piper picked a peck of pickled peppers
Where’s the peck of pickled peppers Peter Piper picked?'''
print(re.sub(r'[aeiou]','_', text))
P_t_r P_p_r p_ck_d _ p_ck _f p_ckl_d p_pp_rs
A p_ck _f p_ckl_d p_pp_rs P_t_r P_p_r p_ck_d
If P_t_r P_p_r p_ck_d _ p_ck _f p_ckl_d p_pp_rs
Wh_r_’s th_ p_ck _f p_ckl_d p_pp_rs P_t_r P_p_r p_ck_d?
print(re.sub(r'([aeiou])',r'[\1]', text))
P[e]t[e]r P[i]p[e]r p[i]ck[e]d [a] p[e]ck [o]f p[i]ckl[e]d p[e]pp[e]rs
A p[e]ck [o]f p[i]ckl[e]d p[e]pp[e]rs P[e]t[e]r P[i]p[e]r p[i]ck[e]d
If P[e]t[e]r P[i]p[e]r p[i]ck[e]d [a] p[e]ck [o]f p[i]ckl[e]d p[e]pp[e]rs
Wh[e]r[e]’s th[e] p[e]ck [o]f p[i]ckl[e]d p[e]pp[e]rs P[e]t[e]r P[i]p[e]r p[i]ck[e]d?
American_dates = ["7/31/1976", "02.15.1970", "11-31-1986", "04/01.2020"]
print(American_dates)
['7/31/1976', '02.15.1970', '11-31-1986', '04/01.2020']
print([re.sub(r'(\d+)(\D)(\d+)(\D)(\d+)', r'\3\2\1\4\5', date) for date in American_dates])
['31/7/1976', '15.02.1970', '31-11-1986', '01/04.2020']
  • In re.sub(pattern, repl, string), the repl argument can be a function. If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single Match object as its argument and returns the replacement string.
s = "This is a simple sentence."

pat_vowels = re.compile(r'[aeiou]')

def replaceVowels(m):
    c = m.group(0)
    c2 = ""
    if c in "ie":
        c2 = "F"
    else:
        c2 = "B"
    return c2
pat_vowels.sub(replaceVowels, s)
'ThFs Fs B sFmplF sFntFncF.'
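The table in 17.1 also lists re.subn(). As a brief sketch on the same sentence, it performs the same substitution as re.sub() but returns the new string together with the number of replacements made.

```python
import re

s = "This is a simple sentence."

new_s, n = re.subn(r'[aeiou]', '_', s)
print(new_s)  # Th_s _s _ s_mpl_ s_nt_nc_.
print(n)      # 8
```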

Exercise 17.2 Create a small program to extract both emails and phone numbers from the text on this faculty page: Department of English, NTNU.

All Phone Numbers:
['02-7749-1801', '02-7749-1772', '02-7749-1775', '02-7749-1783', '02-7749-1767', '02-7749-1757', '02-7749-1781', '02-7749-1777', '02-7749-1773', '02-7749-1759', '02-7749-1768', '02-7749-1822', '02-7749-1760', '02-7749-1764', '02-7749-1769', '02-7749-1817', '02-7749-1756', '02-7749-1788', '02-7749-1758', '02-7749-1754', '02-7749-1821', '02-7749-1819', '02-7749-1790', '02-7749-1761', '02-7749-1761', '02-7749-1774', '02-7749-1763', '02-7749-1776', '02-7749-1770', '02-7749-1786', '02-7749-1816', '02-7749-1765', '02-7749-1778', '02-7749-1541', '02-7749-1785', '02-7749-1821', '02-7749-1811', '02-7749-1779', '02-7749-1820', '02-7749-1766', '02-7749-1782', '02-7749-1762', '02-7749-1800', '02-2363-4793']
All Emails:
['chunyin@ntnu.edu.tw', 'mhchang@ntnu.edu.tw', 'clchern@ntnu.edu.tw', 'joanchang@ntnu.edu.tw', 'hjchen@ntnu.edu.tw', 'tcsu@ntnu.edu.tw', 't22028@ntnu.edu.tw', 'lip@ntnu.edu.tw', 'ting@ntnu.edu.tw', 'hclee@ntnu.edu.tw', 'hslin@ntnu.edu.tw', 'chyhuang@ntnu.edu.tw', 'profgood@ntnu.edu.tw', 'cclin@ntnu.edu.tw', 'iriswu@ntnu.edu.tw', 'yeutingliu@ntnu.edu.tw', 'ioana.luca@ntnu.edu.tw', 'lindsey@ntnu.edu.tw', 'jprystash@ntnu.edu.tw', 'hsysu@ntnu.edu.tw', 'hannes.bergthaller@ntnu.edu.tw', 'peichinchang@ntnu.edu.tw', 'shiaohui@ntnu.edu.tw', 'mlhsieh@ntnu.edu.tw', 'lihsin@ntnu.edu.tw', 'lijeni@ntnu.edu.tw', 'ykhsu@ntnu.edu.tw', 'ycshao@ntnu.edu.tw', 'jjwu@ntnu.edu.tw', 'jungsu@ntnu.edu.tw', 't22001@ntnu.edu.tw', 'jjtseng@ntnu.edu.tw', 'wanghc@ntnu.edu.tw', 'alvinchen@ntnu.edu.tw', 'gfsayang@ntnu.edu.tw', 'yuchentai@ntnu.edu.tw', 'jiaqiwu8@ntnu.edu.tw', 'angelawu@ntnu.edu.tw', 'yichien@ntnu.edu.tw', 'fwkung@ntnu.edu.tw', 'english@ntnu.edu.tw']