Regular Expression Basics
List of Keys
Anchors
- at the beginning of line: `^`
    import re
    p = re.compile('^T', re.I)
    line = "The email address is this this the do you see"
    result = p.findall(line)
    print(result)  # ['T']
- at the end of the line: `$`
    import re
    p = re.compile('e$', re.I)
    line = "The email address is this this the do you see"
    result = p.findall(line)
    print(result)  # ['e']
Character Classes
Printable Characters
- any character: `.`
    import re
    p = re.compile('^T.', re.I)
    line = "The email address is this this the do you see"
    result = p.findall(line)
    print(result)  # ['Th']
- single digit: `\d`
- word character (alphanumeric characters and underscore): `\w`
    import re
    line = "The email address is this this the do you see"
    # raw string so that the backslash reaches the regex engine unchanged
    result = re.findall(r'\w', line)
    print(result)
    # ['T', 'h', 'e', 'e', 'm', 'a', 'i', 'l', 'a', 'd', 'd', 'r', 'e', 's', 's', 'i', 's', 't', 'h', 'i', 's', 't', 'h', 'i', 's', 't', 'h', 'e', 'd', 'o', 'y', 'o', 'u', 's', 'e', 'e']
- whitespace: `\s`
    import re
    line = "The email address is this this the do you see"
    # [ei] matches either e or i; inside [] a | would be a literal character
    result = re.findall(r'\sth[ei]', line)
    print(result)  # [' thi', ' thi', ' the']
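The list above gives no example for `\d`. A minimal sketch, assuming a made-up sample string that is not part of the original examples:

```python
import re

# Hypothetical sample text, chosen only because it contains digits.
line = "Order 66 shipped on 2018-04-06"

# \d matches exactly one digit per match.
digits = re.findall(r'\d', line)
print(digits)  # ['6', '6', '2', '0', '1', '8', '0', '4', '0', '6']
```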
Non-printable Characters
- tab: `\t`
- new line: `\n`
- carriage return: `\r`
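As a rough illustration of the non-printable characters, the sketch below counts tabs and splits on line endings. The sample text is hypothetical.

```python
import re

# Hypothetical tab-separated text with one Windows and one Unix line ending.
text = "name\tvalue\r\nfoo\t42\n"

# \t matches a single tab character.
tabs = re.findall(r'\t', text)
print(len(tabs))  # 2

# \r?\n matches a newline, optionally preceded by a carriage return.
rows = re.split(r'\r?\n', text)
print(rows)  # ['name\tvalue', 'foo\t42', '']
```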
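Capitalization
The uppercase form of a character class matches everything its lowercase form does not.
- non-digit: `\D`
- non-word character: `\W`
- non-whitespace character: `\S`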
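The negated classes can be tried on the same sample line used elsewhere on this page. This is only a sketch; the phone-like string is made up.

```python
import re

line = "The email address is this this the do you see"

# \S+ matches runs of non-whitespace characters, i.e. the words.
words = re.findall(r'\S+', line)
print(words)
# ['The', 'email', 'address', 'is', 'this', 'this', 'the', 'do', 'you', 'see']

# \W matches every non-word character; here those are the 9 spaces.
gaps = re.findall(r'\W', line)
print(len(gaps))  # 9

# \D matches non-digits; removing them keeps only the digits.
only_digits = re.sub(r'\D', '', "tel: 555-0100")  # hypothetical input
print(only_digits)  # 5550100
```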
Quantifiers
- 0 or more times: `*`
    import re
    # raw string so that \w is passed to the regex engine unchanged
    p = re.compile(r'^T\w*', re.I)
    line = "The email address is this this the do you see"
    result = p.findall(line)
    print(result)  # ['The']
- 1 or more times: `+`
- n times: `{n}`
- n1 to n2 times: `{n1,n2}`
- n or more times: `{n,}`
    import re
    p = re.compile('(?:d){2,}', re.I)
    line = "The email address is this this the do you see"
    result = p.findall(line)
    print(result)  # ['dd']
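The quantifiers `+`, `{n}` and `{n1,n2}` have no example above. A small sketch on the same sample line (the `\b` word boundary is used here only to delimit whole words):

```python
import re

line = "The email address is this this the do you see"

# + : one or more consecutive 's'
runs_of_s = re.findall(r's+', line)
print(runs_of_s)  # ['ss', 's', 's', 's', 's']

# {3} : words of exactly three word characters
three_letter = re.findall(r'\b\w{3}\b', line)
print(three_letter)  # ['The', 'the', 'you', 'see']

# {2,3} : words of two or three word characters
short_words = re.findall(r'\b\w{2,3}\b', line)
print(short_words)  # ['The', 'is', 'the', 'do', 'you', 'see']
```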
Flags
Regex engines provide flags that change how a pattern is searched. In the Python `re` module, flags are passed as an optional argument of the `compile` function.
- ignoring case: `i` (`re.I` in Python)
    import re
    p = re.compile('the', re.I)
    line = "The email address is this this the do you see"
    result = p.findall(line)
    print(result)  # ['The', 'the']
- multiline: `m` (`re.M` in Python)
- global: `g` (the Python `re` module has no such flag; `findall` already returns every match)
- The Python `re` module also provides some other flags.
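A short sketch of the multiline flag, assuming a made-up three-line string: without `re.M`, `^` matches only at the start of the whole string; with it, `^` also matches at the start of every line.

```python
import re

# Hypothetical multi-line text.
text = "The first line\nthe second line\nA third line"

# Without re.M, ^ anchors only at the very beginning of the string.
without_m = re.findall(r'^the', text, re.I)
print(without_m)  # ['The']

# Flags are combined with |; with re.M, ^ anchors at each line start.
with_m = re.findall(r'^the', text, re.I | re.M)
print(with_m)  # ['The', 'the']
```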
Greedy Search
- Don’t be greedy: `?`. Without `?`, a quantifier matches the longest string that fits the pattern; with `?`, it matches the shortest.
    import re
    p1 = re.compile('^T.*e', re.I)   # greedy
    p2 = re.compile('^T.*?e', re.I)  # not greedy
    line = "The email address is this this the do you see"
    result1 = p1.findall(line)
    result2 = p2.findall(line)
    print(result1)  # ['The email address is this this the do you see']
    print(result2)  # ['The']
Grouping and Capturing
- capturing: `()` matches the whole expression, including the keys outside the parentheses, but returns only the part inside `()`
    import re
    p = re.compile('th(e|i)', re.I)
    line = "The email address is this this the do you see"
    result = p.findall(line)
    print(result)  # ['e', 'i', 'i', 'e']
- grouping: `(?:)`, where `?:` disables capturing so that the parentheses indicate only grouping
    import re
    p = re.compile('th(?:e|i)', re.I)
    line = "The email address is this this the do you see"
    result = p.findall(line)
    print(result)  # ['The', 'thi', 'thi', 'the']
- either character: `[]`, e.g. `[aeiou]`, `[a-z]`
    import re
    p = re.compile('th[ei]', re.I)
    line = "The email address is this this the do you see"
    result = p.findall(line)
    print(result)  # ['The', 'thi', 'thi', 'the']
- group name: `T(?<groupname>he)` (the Python `re` module writes this as `T(?P<groupname>he)`)
- referencing nth group: `\n`, where n is the group number, e.g. `\1`
- referencing group by name: `\k<groupname>` (the Python `re` module writes this as `(?P=groupname)`)
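Named groups and back-references in the Python spelling, as a minimal sketch on the sample line. The group name `word` is arbitrary.

```python
import re

line = "The email address is this this the do you see"

# Named group: Python uses (?P<name>...).
m = re.search(r'T(?P<groupname>he)', line)
print(m.group('groupname'))  # he

# Numbered back-reference: \1 must repeat what group 1 matched,
# so this finds a word that is immediately repeated.
repeated = re.findall(r'\b(\w+) \1\b', line)
print(repeated)  # ['this']

# Named back-reference: Python uses (?P=name).
m2 = re.search(r'\b(?P<word>\w+) (?P=word)\b', line)
print(m2.group(0))  # this this
```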
Special Characters
- escape: `\` is used to escape special characters
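A brief sketch of escaping, using made-up input strings: an unescaped `.` matches any character, while `\.` matches only a literal dot. `re.escape` builds the escaped form of a literal string.

```python
import re

# Unescaped . matches any character.
any_char = re.findall(r'.', "a.b")
print(any_char)  # ['a', '.', 'b']

# Escaped \. matches only the literal dot.
literal_dot = re.findall(r'\.', "a.b")
print(literal_dot)  # ['.']

# $ and . both need escaping to match a price literally (hypothetical text).
prices = re.findall(r'\$\d+\.\d+', "Price: $5.00 (approx.)")
print(prices)  # ['$5.00']

# re.escape adds the backslashes for you.
escaped = re.escape("5.00")
print(escaped)  # 5\.00
```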
Boundaries
- boundaries of words: `\b` (depends on the locale)
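To see what `\b` changes, compare a bare pattern with a bounded one on the sample line. This is a sketch only:

```python
import re

line = "The email address is this this the do you see"

# Without boundaries, 'is' is also found inside both occurrences of 'this'.
unbounded = re.findall(r'is', line)
print(unbounded)  # ['is', 'is', 'is']

# With \b on both sides, only the standalone word matches.
bounded = re.findall(r'\bis\b', line)
print(bounded)  # ['is']

# \b at the start only: whole words beginning with lowercase 'th'.
th_words = re.findall(r'\bth\w*', line)
print(th_words)  # ['this', 'this', 'the']
```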
Python
- `compile`
- `search`
- `findall`
- `match`
- …
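The functions named above differ mainly in where they look and what they return. A rough comparison on the sample line:

```python
import re

line = "The email address is this this the do you see"

# compile builds a reusable pattern object.
p = re.compile(r'th\w+', re.I)

# match only tries at the beginning of the string.
m = p.match(line)
print(m.group())  # The

# search scans forward and returns the first match anywhere.
s = re.search(r'this', line)
print(s.group(), s.start())  # this 21

# match returns None when the pattern does not fit at the start.
no_match = re.match(r'this', line)
print(no_match)  # None

# findall returns all non-overlapping matches as a list of strings.
all_matches = p.findall(line)
print(all_matches)  # ['The', 'this', 'this', 'the']
```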
Useful expressions
- `^X-.*:`: `X-` is at the beginning of the line, followed by 0 or more characters and `:`
- `^X-\S+:`: `X-` is at the beginning of the line, followed by 1 or more non-blank characters and `:`
- `^X-\S+?:`: `?` means “don’t be greedy”
- `\S+@\S+`: finds email addresses
- `^Email (\S+@\S+)`: finds the pattern but returns only the part in `()`, which should be the email address
- `[^ ]`: not space; `^` means not
- `[a-zA-Z0-9]` means all the letters and numbers
- `[^a-zA-Z0-9]` means neither letters nor numbers
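The expressions above can be compared side by side. The header-like text below is hypothetical, and `re.M` is an assumption of this sketch so that `^` applies to each line of a multi-line string.

```python
import re

# Hypothetical input; the address and header names are made up.
text = ("X-Mailer: demo: v1\n"
        "X-Time:12:30\n"
        "Email alice@example.com\n")

# Greedy .* runs to the last colon on each line.
greedy_any = re.findall(r'^X-.*:', text, re.M)
print(greedy_any)  # ['X-Mailer: demo:', 'X-Time:12:']

# \S+ stops at whitespace but is still greedy within the first token.
greedy_nonblank = re.findall(r'^X-\S+:', text, re.M)
print(greedy_nonblank)  # ['X-Mailer:', 'X-Time:12:']

# \S+? stops at the first colon.
lazy_nonblank = re.findall(r'^X-\S+?:', text, re.M)
print(lazy_nonblank)  # ['X-Mailer:', 'X-Time:']

# Only the captured group (the address) is returned.
emails = re.findall(r'^Email (\S+@\S+)', text, re.M)
print(emails)  # ['alice@example.com']

# Characters that are neither letters nor digits.
others = re.findall(r'[^a-zA-Z0-9]', "a-b c")
print(others)  # ['-', ' ']
```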
Links
Regex tutorial — A quick cheatsheet by examples by Jonny Fox
regex101 is a useful website for testing regular expressions.
Practice on repl.it
extendsclass is an online regex tester with a regular expression visualizer.
References
Planted by L Ma.
L Ma (2018). 'Regular Expression Basics', Datumorphism, 06 April. Available at: https://datumorphism.leima.is/wiki/sugar/regular-experssions/.