Chapter 17 Regular Expression
17.1 Comparison of R and Python
R | Python |
---|---|
str_extract() | re.search() |
str_extract_all() | re.findall() |
str_match_all() | re.finditer() |
str_replace_all() | re.sub() |
str_split() | re.split() |
? | re.subn() |
? | re.match() |
str_detect() | ? |
str_subset() | ? |
The table above maps the regular expression functions of R (stringr) to their closest counterparts in Python's re module. A ? marks a cell for which no direct counterpart is listed. Most functions pair up one-to-one, so these mappings can help R users get oriented in re.
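For the two rows where the Python column shows ?, similar behaviour can be approximated with re.search() and a list comprehension. The sketch below is only one possible approach, not an official equivalent; the list words and the variable names are hypothetical.

```python
import re

# Hypothetical sample data for this sketch
words = ["apple", "banana", "cherry"]
pattern = re.compile(r'an')

# Roughly what str_detect() does: one True/False per string
detected = [bool(pattern.search(w)) for w in words]
print(detected)   # [False, True, False]

# Roughly what str_subset() does: keep only the strings that match
subset = [w for w in words if pattern.search(w)]
print(subset)     # ['banana']
```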
17.2 Structure of Regular Expression Usage
17.2.1 re.search()
- Import the regex module with import re.
- Create a Regex object by compiling a regular expression pattern (re.compile()). Remember to use a raw string.
- Use the pattern for search (re.search()) by passing the string you want to search into the Regex object's search method. This returns a Match object.
- Call the Match object's group() method to return a string of the actual matched text.
import re
text_to_search = '''
abcdefghijklmnopqurtuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890
Ha HaHa
MetaCharacters (Need to be escaped):
. ^ $ * + ? { } [ ] \ | ( )
coreyms.com
321-555-4321
123.555.1234
123*555*1234
800-555-1234
900-555-1234
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
'''
sentence = 'Start a sentence and then bring it to an end'
pattern = re.compile(r'\d{3}-\d{3}-\d{4}', re.I)
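The four steps listed at the start of this section can be sketched as follows. This is a minimal illustration under stated assumptions: the one-line string sample stands in for text_to_search, and the names phone_pattern and first_match are chosen for this example only.

```python
import re

# A short stand-in for text_to_search (assumed for this sketch)
sample = "Call 321-555-4321 or 800-555-1234 for details."

# Compile the pattern, written as a raw string
phone_pattern = re.compile(r'\d{3}-\d{3}-\d{4}')

# search() returns a Match object for the first match, or None if there is none
first_match = phone_pattern.search(sample)

# group() returns the matched text; span() returns its start and end positions
if first_match is not None:
    print(first_match.group())   # 321-555-4321
    print(first_match.span())    # (5, 17)
```

Checking for None before calling group() avoids an AttributeError when the pattern does not occur in the string.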
17.2.2 re.findall()
- While search() will return a Match object of the first matched text in the searched string, the findall() method will return the strings of every match in the searched string. (re.findall() will not return a Match object but a list of strings, as long as there are no groups in the regular expression.)
['321-555-4321', '800-555-1234', '900-555-1234']
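- If there are two or more groups in the regular expression, then re.findall() will return a list of tuples, one tuple per match. (With exactly one group, it returns a list of the strings captured by that group.)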
[('321', '555', '4321'), ('800', '555', '1234'), ('900', '555', '1234')]
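The effect of groups on the return value of re.findall() can be seen in a small sketch. The three-line string sample is an assumption used to keep the example short; it is not the full text_to_search.

```python
import re

# A short stand-in for text_to_search (assumed for this sketch)
sample = "321-555-4321\n123.555.1234\n800-555-1234"

# No groups: a list of the matched strings
no_groups = re.findall(r'\d{3}-\d{3}-\d{4}', sample)
print(no_groups)     # ['321-555-4321', '800-555-1234']

# Several groups: a list of tuples, one element per group
with_groups = re.findall(r'(\d{3})-(\d{3})-(\d{4})', sample)
print(with_groups)   # [('321', '555', '4321'), ('800', '555', '1234')]

# Exactly one group: a list of strings holding only that group's text
one_group = re.findall(r'(\d{3})-\d{3}-\d{4}', sample)
print(one_group)     # ['321', '800']
```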
17.3 Special Flags/Settings for Regular Expressions
- re.IGNORECASE: makes pattern matching case-insensitive
- re.DOTALL: allows the wildcard . to match line breaks as well
- re.VERBOSE: allows complex regular expressions to be written across multiple lines with comments (#)
pattern3 = re.compile(r'''
(\d{3}) # area code
- # delimiter
(\d{3}) # first 3 digits
- # delimiter
(\d{4}) # last 4 digits
''', re.VERBOSE)
pattern3.findall(text_to_search)
[('321', '555', '4321'), ('800', '555', '1234'), ('900', '555', '1234')]
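The other two flags, re.IGNORECASE and re.DOTALL, can be illustrated in the same way. The strings sample and two_lines below are assumptions made for this sketch.

```python
import re

# Assumed sample strings for this sketch
sample = "Ha HaHa\nha"
two_lines = "first line\nsecond line"

# Without a flag, matching is case-sensitive
case_sensitive = re.findall(r'ha', sample)
print(case_sensitive)     # ['ha']

# With re.IGNORECASE, 'Ha' and 'ha' both match
case_insensitive = re.findall(r'ha', sample, re.IGNORECASE)
print(case_insensitive)   # ['Ha', 'Ha', 'Ha', 'ha']

# By default, . does not match a newline, so .* stops at the end of the line
default_dot = re.search(r'first.*', two_lines).group()
print(repr(default_dot))  # 'first line'

# With re.DOTALL, . also matches the newline, so .* runs to the end of the string
dotall_dot = re.search(r'first.*', two_lines, re.DOTALL).group()
print(repr(dotall_dot))   # 'first line\nsecond line'
```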
Exercise 17.1 With the text_to_search, how would you create a more complete regular expression to extract all the phone numbers, including those that use other delimiters (e.g., . or *)?
[('321', '555', '4321'), ('123', '555', '1234'), ('123', '555', '1234'), ('800', '555', '1234'), ('900', '555', '1234')]
17.4 Regular Expression in Python
17.4.1 Raw String Notation
Raw string notation (r'text') keeps regular expressions sane. Without it, every backslash ('\') in a regular expression would have to be prefixed with another one to escape it. For example, the two following lines of code are functionally identical:
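A minimal sketch of this equivalence is given below. The pattern and the test string ' ff ' are assumptions for illustration, since the original lines are not shown here.

```python
import re

# The same pattern written with and without raw string notation
raw_pattern = r"\W(.)\1\W"
escaped_pattern = "\\W(.)\\1\\W"

# Both literals produce exactly the same string
print(raw_pattern == escaped_pattern)   # True

# So both calls find the same match
raw_result = re.match(raw_pattern, " ff ")
escaped_result = re.match(escaped_pattern, " ff ")
print(repr(raw_result.group()))         # ' ff '
print(repr(escaped_result.group()))     # ' ff '

# In a raw string, \n is two characters (backslash and n), not a newline
print(len(r"\n"), len("\n"))            # 2 1
```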
17.4.2 Find all matches
- re.findall(): matches all occurrences of a pattern, not just the first one as re.search() does.
- re.finditer(): If one wants more information about all matches of a pattern than the matched text, re.finditer() is useful as it provides match objects instead of strings.
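The difference between the two functions can be sketched as follows, assuming a short sample string and the variable names shown; neither comes from the original chapter.

```python
import re

# Assumed sample string for this sketch
sample = "321-555-4321 and 800-555-1234"
pattern = re.compile(r'\d{3}-\d{3}-\d{4}')

# findall() gives only the matched strings
matched_strings = pattern.findall(sample)
print(matched_strings)   # ['321-555-4321', '800-555-1234']

# finditer() yields Match objects, so positions are available as well
match_info = [(m.group(), m.start(), m.end()) for m in pattern.finditer(sample)]
print(match_info)        # [('321-555-4321', 0, 12), ('800-555-1234', 17, 29)]
```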
17.4.3 group() vs. groups()
- group(): by default, returns the whole match of the pattern
- groups(): by default, returns all capturing groups
abc
abc
('b', 'c')
b
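The four outputs above are consistent with a sketch like the following. The pattern a(b)(c) and the string 'abc' are assumptions chosen because they reproduce those outputs; the code that originally generated them is not shown.

```python
import re

# Assumed pattern with two capturing groups
m = re.search(r'a(b)(c)', 'abc')

print(m.group())    # abc          (whole match)
print(m.group(0))   # abc          (group 0 is also the whole match)
print(m.groups())   # ('b', 'c')   (all capturing groups as a tuple)
print(m.group(1))   # b            (first capturing group only)
```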
17.4.5 re.match() vs. re.search()
Python offers two different primitive operations based on regular expressions:
- re.match() checks for a match only at the beginning of the string
- re.search() checks for a match anywhere in the string (this is what Perl does by default).
None
None
<re.Match object; span=(2, 3), match='c'>
- re.match always matches at the beginning of the input string, even if it is in the MULTILINE mode.
- re.search, however, when in MULTILINE mode, is able to search at the beginning of every line if used in combination with ^.
None
<re.Match object; span=(4, 5), match='X'>
None
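The outputs in this section can be reproduced with a sketch like the one below. The strings 'abcdef' and 'A\nB\nX' are assumptions that fit the spans shown above, and the last call (the same search without re.MULTILINE) is a guess at what produced the final None.

```python
import re

# match() only looks at the start of the string; search() scans the whole string
r1 = re.match("c", "abcdef")
r2 = re.search("^c", "abcdef")
r3 = re.search("c", "abcdef")
print(r1)   # None
print(r2)   # None
print(r3)   # <re.Match object; span=(2, 3), match='c'>

# Assumed three-line string for the MULTILINE comparison
lines = "A\nB\nX"
r4 = re.match("X", lines, re.MULTILINE)     # still anchored at position 0
r5 = re.search("^X", lines, re.MULTILINE)   # ^ matches at the start of each line
r6 = re.search("^X", lines)                 # without the flag, ^ is start of string only
print(r4)   # None
print(r5)   # <re.Match object; span=(4, 5), match='X'>
print(r6)   # None
```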
17.4.6 re.split()
text = """Ross McFluff: 834.345.1254 155 Elm Street

Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger: 925.541.7625 662 South Dogwood Way


Heather Albrecht: 548.326.4584 919 Park Place"""
['Ross McFluff: 834.345.1254 155 Elm Street', '', 'Ronald Heathmore: 892.345.3428 436 Finley Avenue', 'Frank Burger: 925.541.7625 662 South Dogwood Way', '', '', 'Heather Albrecht: 548.326.4584 919 Park Place']
['Ross McFluff: 834.345.1254 155 Elm Street', 'Ronald Heathmore: 892.345.3428 436 Finley Avenue', 'Frank Burger: 925.541.7625 662 South Dogwood Way', 'Heather Albrecht: 548.326.4584 919 Park Place']
[['Ross', 'McFluff:', '834.345.1254', '155', 'Elm', 'Street'], ['Ronald', 'Heathmore:', '892.345.3428', '436', 'Finley', 'Avenue'], ['Frank', 'Burger:', '925.541.7625', '662', 'South', 'Dogwood', 'Way'], ['Heather', 'Albrecht:', '548.326.4584', '919', 'Park', 'Place']]
[['Ross', 'McFluff', '834.345.1254', '155 Elm Street'], ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'], ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'], ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]
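The four outputs above follow the pattern of the calls sketched below. The patterns are assumptions inferred from the outputs, and text_sample is a shortened two-entry version of the text, with one blank line between the entries, used only to keep the sketch compact.

```python
import re

# Shortened sample with one blank line between entries (assumed for this sketch)
text_sample = ("Ross McFluff: 834.345.1254 155 Elm Street\n\n"
               "Frank Burger: 925.541.7625 662 South Dogwood Way")

# Splitting on a single newline leaves an empty string for each blank line
raw_lines = re.split(r'\n', text_sample)
print(raw_lines)

# Splitting on one or more newlines removes the empty strings
entries = re.split(r'\n+', text_sample)
print(entries)

# Splitting every entry on spaces breaks the address into separate words
by_space = [re.split(r' ', entry) for entry in entries]
print(by_space[0])   # ['Ross', 'McFluff:', '834.345.1254', '155', 'Elm', 'Street']

# maxsplit=3 keeps the address together; ':? ' also drops the colon after the surname
fields = [re.split(r':? ', entry, maxsplit=3) for entry in entries]
print(fields[0])     # ['Ross', 'McFluff', '834.345.1254', '155 Elm Street']
```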
17.5 Text Munging
re.sub()
text = '''Peter Piper picked a peck of pickled peppers
A peck of pickled peppers Peter Piper picked
If Peter Piper picked a peck of pickled peppers
Where’s the peck of pickled peppers Peter Piper picked?'''
P_t_r P_p_r p_ck_d _ p_ck _f p_ckl_d p_pp_rs
A p_ck _f p_ckl_d p_pp_rs P_t_r P_p_r p_ck_d
If P_t_r P_p_r p_ck_d _ p_ck _f p_ckl_d p_pp_rs
Wh_r_’s th_ p_ck _f p_ckl_d p_pp_rs P_t_r P_p_r p_ck_d?
P[e]t[e]r P[i]p[e]r p[i]ck[e]d [a] p[e]ck [o]f p[i]ckl[e]d p[e]pp[e]rs
A p[e]ck [o]f p[i]ckl[e]d p[e]pp[e]rs P[e]t[e]r P[i]p[e]r p[i]ck[e]d
If P[e]t[e]r P[i]p[e]r p[i]ck[e]d [a] p[e]ck [o]f p[i]ckl[e]d p[e]pp[e]rs
Wh[e]r[e]’s th[e] p[e]ck [o]f p[i]ckl[e]d p[e]pp[e]rs P[e]t[e]r P[i]p[e]r p[i]ck[e]d?
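Both transformations above can be produced with re.sub(). The sketch below applies them to the first line of the tongue twister only; the two patterns are assumptions that reproduce the outputs shown, not necessarily the code originally used.

```python
import re

line = "Peter Piper picked a peck of pickled peppers"

# Replace every lowercase vowel with an underscore
masked = re.sub(r'[aeiou]', '_', line)
print(masked)     # P_t_r P_p_r p_ck_d _ p_ck _f p_ckl_d p_pp_rs

# Capture each vowel and refer back to it with \1 in the replacement
bracketed = re.sub(r'([aeiou])', r'[\1]', line)
print(bracketed)  # P[e]t[e]r P[i]p[e]r p[i]ck[e]d [a] p[e]ck [o]f p[i]ckl[e]d p[e]pp[e]rs
```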
['7/31/1976', '02.15.1970', '11-31-1986', '04/01.2020']
['31/7/1976', '15.02.1970', '31-11-1986', '01/04.2020']
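The second list above swaps the first two numeric fields of each date while keeping the delimiters in place. One way to get that result, assuming the pattern and the name date_pattern below, is to capture the numbers and delimiters as separate groups and reorder them in the replacement string.

```python
import re

dates = ['7/31/1976', '02.15.1970', '11-31-1986', '04/01.2020']

# Groups: 1 = first number, 2 = first delimiter, 3 = second number,
#         4 = second delimiter, 5 = year
date_pattern = re.compile(r'(\d+)([/.-])(\d+)([/.-])(\d+)')

# Swap groups 1 and 3, leaving the delimiters and the year where they were
swapped = [date_pattern.sub(r'\3\2\1\4\5', d) for d in dates]
print(swapped)   # ['31/7/1976', '15.02.1970', '31-11-1986', '01/04.2020']
```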
- In re.sub(pattern, repl, string), the repl argument can be a function. If repl is a function, it is called for every non-overlapping occurrence of the pattern. The function takes a single match object argument and returns the replacement string.
s = "This is a simple sentence."
pat_vowels = re.compile(r'[aeiou]')
def replaceVowels(m):
    c = m.group(0)
    c2 = ""
    if c in "ie":
        c2 = "F"
    else:
        c2 = "B"
    return c2
pat_vowels.sub(replaceVowels, s)
'ThFs Fs B sFmplF sFntFncF.'
Exercise 17.2 Create a small program to extract both emails and phone numbers from the texts on this faculty page: Department of English, NTNU.
All Phone Numbers:
['02-7749-1801', '02-7749-1772', '02-7749-1775', '02-7749-1783', '02-7749-1767', '02-7749-1757', '02-7749-1781', '02-7749-1777', '02-7749-1773', '02-7749-1759', '02-7749-1768', '02-7749-1822', '02-7749-1760', '02-7749-1764', '02-7749-1769', '02-7749-1817', '02-7749-1756', '02-7749-1788', '02-7749-1758', '02-7749-1754', '02-7749-1821', '02-7749-1819', '02-7749-1790', '02-7749-1761', '02-7749-1761', '02-7749-1774', '02-7749-1763', '02-7749-1776', '02-7749-1770', '02-7749-1786', '02-7749-1816', '02-7749-1765', '02-7749-1778', '02-7749-1541', '02-7749-1785', '02-7749-1821', '02-7749-1811', '02-7749-1779', '02-7749-1820', '02-7749-1766', '02-7749-1782', '02-7749-1762', '02-7749-1800', '02-2363-4793']
All Emails:
['chunyin@ntnu.edu.tw', 'mhchang@ntnu.edu.tw', 'clchern@ntnu.edu.tw', 'joanchang@ntnu.edu.tw', 'hjchen@ntnu.edu.tw', 'tcsu@ntnu.edu.tw', 't22028@ntnu.edu.tw', 'lip@ntnu.edu.tw', 'ting@ntnu.edu.tw', 'hclee@ntnu.edu.tw', 'hslin@ntnu.edu.tw', 'chyhuang@ntnu.edu.tw', 'profgood@ntnu.edu.tw', 'cclin@ntnu.edu.tw', 'iriswu@ntnu.edu.tw', 'yeutingliu@ntnu.edu.tw', 'ioana.luca@ntnu.edu.tw', 'lindsey@ntnu.edu.tw', 'jprystash@ntnu.edu.tw', 'hsysu@ntnu.edu.tw', 'hannes.bergthaller@ntnu.edu.tw', 'peichinchang@ntnu.edu.tw', 'shiaohui@ntnu.edu.tw', 'mlhsieh@ntnu.edu.tw', 'lihsin@ntnu.edu.tw', 'lijeni@ntnu.edu.tw', 'ykhsu@ntnu.edu.tw', 'ycshao@ntnu.edu.tw', 'jjwu@ntnu.edu.tw', 'jungsu@ntnu.edu.tw', 't22001@ntnu.edu.tw', 'jjtseng@ntnu.edu.tw', 'wanghc@ntnu.edu.tw', 'alvinchen@ntnu.edu.tw', 'gfsayang@ntnu.edu.tw', 'yuchentai@ntnu.edu.tw', 'jiaqiwu8@ntnu.edu.tw', 'angelawu@ntnu.edu.tw', 'yichien@ntnu.edu.tw', 'fwkung@ntnu.edu.tw', 'english@ntnu.edu.tw']