Regular Expression#

Comparison of Python and R#

| Python        | R                 |
|---------------|-------------------|
| re.search()   | str_extract()     |
| re.findall()  | str_extract_all() |
| re.finditer() | str_match_all()   |
| re.sub()      | str_replace_all() |
| re.split()    | str_split()       |
| re.subn()     | ?                 |
| re.match()    | ?                 |
| ?             | str_detect()      |
| ?             | str_subset()      |

The table above maps the regular expression functions in Python's re module to their closest counterparts in R's stringr package. Most functions have a near-equivalent on the other side; a ? marks a cell where no direct counterpart is listed. These mappings can help R users find their way around re in Python.
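
For the rows marked ? on the Python side, re has no single dedicated function, but short idioms come close. The following is a minimal sketch rather than an official mapping: the helper names str_detect and str_subset are hypothetical Python functions named after the stringr ones, and the list fruits is made-up sample data.

```python
import re

def str_detect(strings, pattern):
    """Rough analogue of stringr's str_detect(): one boolean per string."""
    return [re.search(pattern, s) is not None for s in strings]

def str_subset(strings, pattern):
    """Rough analogue of stringr's str_subset(): keep only matching strings."""
    return [s for s in strings if re.search(pattern, s)]

fruits = ["apple", "banana", "pear", "kiwi"]  # made-up sample data

print(str_detect(fruits, r"an"))  # [False, True, False, False]
print(str_subset(fruits, r"p"))   # ['apple', 'pear']

# re.subn() works like re.sub() but also returns the number of replacements.
print(re.subn(r"a", "_", "banana"))  # ('b_n_n_', 3)
```

Both helpers are built on re.search() because, unlike re.match(), it looks for the pattern anywhere in the string.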

Regular Expression Syntax#

import re

text_to_search = r'''
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890
Ha HaHa
MetaCharacters (Need to be escaped):
. ^ $ * + ? { } [ ] \ | ( )
coreyms.com
321-555-4321
123.555.1234
123*555*1234
800-555-1234
900-555-1234
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
'''

sentence = 'Start a sentence and then bring it to an end'

pattern = re.compile(r'\d{3}-\d{3}-\d{4}', re.I)
## search for the first match
matches = re.search(pattern, text_to_search)
if matches:
    print(matches.group())
321-555-4321
## find all matches
matches = re.findall(pattern, text_to_search)
if matches:
    for m in matches:
        print(m.strip())
321-555-4321
800-555-1234
900-555-1234
## iterate over all matches as match objects
matches = re.finditer(pattern, text_to_search)
# finditer() returns an iterator, which is always truthy, so no `if` check is needed
for m in matches:
    print("%02d-%02d: %s" % (m.start(), m.end(), m.group()))
151-163: 321-555-4321
190-202: 800-555-1234
203-215: 900-555-1234

Regular Expression in Python#

Raw String Notation#

Raw string notation (r"text") keeps regular expressions sane. Without it, every backslash (\) in a regular expression would have to be prefixed with another one to escape it. For example, the patterns r"\d+" and "\\d+" are functionally identical.
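
A quick check makes this concrete. The sketch below compares a raw-string pattern with its doubled-backslash spelling; the variable names and the test string " ff " are illustrative choices, not part of the original notes.

```python
import re

raw = r"\W(.)\1\W"        # raw string: backslashes are kept as written
escaped = "\\W(.)\\1\\W"  # regular string: every backslash must be doubled

# Both spellings produce exactly the same pattern string.
print(raw == escaped)  # True

# Consequently both match the same text.
print(re.match(raw, " ff ").group() == re.match(escaped, " ff ").group())  # True
```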

Find all matches#

  • re.findall(): matches all occurrences of a pattern, not just the first one as search() does.

  • re.finditer(): returns an iterator of match objects instead of strings, which is useful when you need more than the matched text, such as the start and end position of each match.
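
The difference between the two functions shows up most clearly when the pattern contains capturing groups. A minimal sketch, assuming a made-up sample string and the hypothetical names phone and grouped:

```python
import re

sample = "call 321-555-4321 or 800-555-1234"  # made-up sample text
phone = re.compile(r"\d{3}-\d{3}-\d{4}")
grouped = re.compile(r"(\d{3})-(\d{3})-(\d{4})")

# findall() returns plain strings when the pattern has no groups ...
numbers = phone.findall(sample)
print(numbers)  # ['321-555-4321', '800-555-1234']

# ... and tuples of group contents when it has several groups.
parts = grouped.findall(sample)
print(parts)  # [('321', '555', '4321'), ('800', '555', '1234')]

# finditer() yields match objects, so positions are available as well.
spans = [(m.start(), m.end(), m.group()) for m in phone.finditer(sample)]
print(spans)  # [(5, 17, '321-555-4321'), (21, 33, '800-555-1234')]
```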

group() vs. groups()#

  • group(): with no argument (or with 0), returns the whole match; with an index n, returns the n-th capturing group

  • groups(): returns a tuple of all capturing groups

m = re.match("a(.)(.)", "abcedf")

print(m.group(0)) # returns the whole match
print(m.group()) # returns the whole match, same as above
print(m.groups()) # returns a tuple of all capturing-group matches
print(m.group(1)) # returns the first capturing-group match
abc
abc
('b', 'c')
b

String Format Validation#

valid = re.compile(r"^[a-z]+@[a-z]+\.[a-z]{3}$")
print(valid.match('alvin@ntnu.edu'))
print(valid.match('alvin123@ntnu.edu'))
print(valid.match('alvin@ntnu.homeschool'))
<re.Match object; span=(0, 14), match='alvin@ntnu.edu'>
None
None
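
One caveat with anchoring by ^ and $: in Python, $ also matches just before a newline at the very end of the string, so the pattern above accepts 'alvin@ntnu.edu\n'. A minimal sketch of a stricter check, assuming trailing newlines should be rejected, uses re.fullmatch() (available since Python 3.4) with the anchors dropped. The names strict and is_valid are hypothetical.

```python
import re

valid = re.compile(r"^[a-z]+@[a-z]+\.[a-z]{3}$")  # pattern from the text above
strict = re.compile(r"[a-z]+@[a-z]+\.[a-z]{3}")   # same body, no anchors

def is_valid(address):
    """Return True only if the entire string, newline included, matches."""
    return strict.fullmatch(address) is not None

print(valid.match("alvin@ntnu.edu\n") is not None)  # True: $ tolerates the newline
print(is_valid("alvin@ntnu.edu\n"))                 # False
print(is_valid("alvin@ntnu.edu"))                   # True
```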

re.split()#

text = """Ross McFluff: 834.345.1254 155 Elm Street

Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger: 925.541.7625 662 South Dogwood Way


Heather Albrecht: 548.326.4584 919 Park Place"""
# split text into lines
re.split(r'\n',text)
['Ross McFluff: 834.345.1254 155 Elm Street',
 '',
 'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
 'Frank Burger: 925.541.7625 662 South Dogwood Way',
 '',
 '',
 'Heather Albrecht: 548.326.4584 919 Park Place']
re.split(r'\n+', text)
['Ross McFluff: 834.345.1254 155 Elm Street',
 'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
 'Frank Burger: 925.541.7625 662 South Dogwood Way',
 'Heather Albrecht: 548.326.4584 919 Park Place']
entries = re.split(r'\n+', text)
[re.split(r'\s', entry) for entry in entries]
[['Ross', 'McFluff:', '834.345.1254', '155', 'Elm', 'Street'],
 ['Ronald', 'Heathmore:', '892.345.3428', '436', 'Finley', 'Avenue'],
 ['Frank', 'Burger:', '925.541.7625', '662', 'South', 'Dogwood', 'Way'],
 ['Heather', 'Albrecht:', '548.326.4584', '919', 'Park', 'Place']]
[re.split(r':?\s', entry, maxsplit=3) for entry in entries]
[['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
 ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
 ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
 ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]
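
In the last call, maxsplit=3 stops after three splits so the address stays in one piece, and the optional colon in :?\s removes the colon after the last name. As a rough sketch of how the resulting lists might be used, the snippet below turns each entry into a dictionary; the field names in fields are hypothetical labels, and the text is shortened to two entries.

```python
import re

text = """Ross McFluff: 834.345.1254 155 Elm Street

Ronald Heathmore: 892.345.3428 436 Finley Avenue"""

fields = ("first", "last", "phone", "address")  # hypothetical field names

records = []
for entry in re.split(r"\n+", text):
    # At most three splits: first name, last name, phone, then the rest.
    parts = re.split(r":?\s", entry, maxsplit=3)
    records.append(dict(zip(fields, parts)))

print(records[0]["phone"])    # 834.345.1254
print(records[1]["address"])  # 436 Finley Avenue
```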

Text Munging#

  • re.sub()

text = '''Peter Piper picked a peck of pickled peppers
A peck of pickled peppers Peter Piper picked
If Peter Piper picked a peck of pickled peppers
Where’s the peck of pickled peppers Peter Piper picked?'''
print(re.sub(r'[aeiou]','_', text))
P_t_r P_p_r p_ck_d _ p_ck _f p_ckl_d p_pp_rs
A p_ck _f p_ckl_d p_pp_rs P_t_r P_p_r p_ck_d
If P_t_r P_p_r p_ck_d _ p_ck _f p_ckl_d p_pp_rs
Wh_r_’s th_ p_ck _f p_ckl_d p_pp_rs P_t_r P_p_r p_ck_d?
print(re.sub(r'([aeiou])',r'[\1]', text))
P[e]t[e]r P[i]p[e]r p[i]ck[e]d [a] p[e]ck [o]f p[i]ckl[e]d p[e]pp[e]rs
A p[e]ck [o]f p[i]ckl[e]d p[e]pp[e]rs P[e]t[e]r P[i]p[e]r p[i]ck[e]d
If P[e]t[e]r P[i]p[e]r p[i]ck[e]d [a] p[e]ck [o]f p[i]ckl[e]d p[e]pp[e]rs
Wh[e]r[e]’s th[e] p[e]ck [o]f p[i]ckl[e]d p[e]pp[e]rs P[e]t[e]r P[i]p[e]r p[i]ck[e]d?
American_dates = ["7/31/1976", "02.15.1970", "11-31-1986", "04/01.2020"]
print(American_dates)
print([re.sub(r'(\d+)(\D)(\d+)(\D)(\d+)', r'\3\2\1\4\5', date) for date in American_dates])
['7/31/1976', '02.15.1970', '11-31-1986', '04/01.2020']
['31/7/1976', '15.02.1970', '31-11-1986', '01/04.2020']
  • In re.sub(pattern, repl, string), the repl argument can be a function. If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument and returns the replacement string.

s = "This is a simple sentence."

pat_vowels = re.compile(r'[aeiou]')

def replace_vowels(m):
    # Replace "i" and "e" with "F", and every other vowel with "B".
    return "F" if m.group(0) in "ie" else "B"

pat_vowels.sub(replace_vowels, s)
'ThFs Fs B sFmplF sFntFncF.'
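
A function is also handy when the replacement needs computation that a template such as r'\3\2\1\4\5' cannot express. The following sketch, assuming the same date pattern as in the earlier date example, swaps month and day and zero-pads both; the function name to_day_first is hypothetical.

```python
import re

date_pat = re.compile(r"(\d+)(\D)(\d+)(\D)(\d+)")

def to_day_first(m):
    """Swap month and day, zero-padding each to two digits."""
    month, sep1, day, sep2, year = m.groups()
    return f"{int(day):02d}{sep1}{int(month):02d}{sep2}{year}"

print(date_pat.sub(to_day_first, "7/31/1976"))   # 31/07/1976
print(date_pat.sub(to_day_first, "02.15.1970"))  # 15.02.1970
```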

References#