Regular Expression Basics
List of Keys
Anchors
- `^`: at the beginning of the line
  ```python
  import re

  p = re.compile('^T', re.I)
  line = "The email address is this this the do you see"
  result = p.findall(line)
  print(result)
  # ['T']
  ```
- `$`: at the end of the line
  ```python
  import re

  p = re.compile('e$', re.I)
  line = "The email address is this this the do you see"
  result = p.findall(line)
  print(result)
  # ['e']
  ```
Character Classes
Printable Characters
- `.`: any character
  ```python
  import re

  p = re.compile('^T.', re.I)
  line = "The email address is this this the do you see"
  result = p.findall(line)
  print(result)
  # ['Th']
  ```
- `\d`: a single digit
- `\w`: word character (alphanumeric characters and the underscore)
  ```python
  import re

  line = "The email address is this this the do you see"
  # Raw string so that the backslash reaches the regex engine unchanged
  result = re.findall(r'\w', line)
  print(result)
  # ['T', 'h', 'e', 'e', 'm', 'a', 'i', 'l', 'a', 'd', 'd', 'r', 'e', 's', 's',
  #  'i', 's', 't', 'h', 'i', 's', 't', 'h', 'i', 's', 't', 'h', 'e', 'd', 'o',
  #  'y', 'o', 'u', 's', 'e', 'e']
  ```
- `\s`: whitespace
  ```python
  import re

  line = "The email address is this this the do you see"
  # [ei] matches either character; a | inside [] would be a literal pipe
  result = re.findall(r'\sth[ei]', line)
  print(result)
  # [' thi', ' thi', ' the']
  ```
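The `\d` key has no example above, so here is a minimal sketch. The sample line is hypothetical and chosen only for illustration.

```python
import re

# Hypothetical sample line, not from the examples above
line = "Order 66 shipped on 2018-04-06"

# \d matches one digit at a time
digits = re.findall(r'\d', line)
print(digits)
# ['6', '6', '2', '0', '1', '8', '0', '4', '0', '6']

# \d+ (one or more, see Quantifiers) keeps consecutive digits together
numbers = re.findall(r'\d+', line)
print(numbers)
# ['66', '2018', '04', '06']
```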
Non-printable Characters
- `\t`: tab
- `\n`: new line
- `\r`: carriage return
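As a rough illustration of the non-printable keys, the sketch below splits a small tab-separated text with Windows-style line endings. The text and the names `rows` and `fields` are made up for this example.

```python
import re

# Hypothetical tab-separated text with \r\n line endings
text = "name\tvalue\r\nfoo\t42"

# Count the tab characters
tabs = re.findall(r'\t', text)
print(len(tabs))
# 2

# Split into rows on carriage return + new line, then into fields on tabs
rows = re.split(r'\r\n', text)
fields = [re.split(r'\t', row) for row in rows]
print(fields)
# [['name', 'value'], ['foo', '42']]
```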
Capitalization
- `\D`: non-digit
- `\W`: non-word character
- `\S`: non-blank character
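The capitalized keys negate their lowercase counterparts. A small sketch, using a hypothetical sample line:

```python
import re

# Hypothetical sample line
line = "Call 555-1234 now!"

# \D+ : runs of non-digits
non_digits = re.findall(r'\D+', line)
print(non_digits)
# ['Call ', '-', ' now!']

# \W : single characters that are not letters, digits, or underscore
non_word = re.findall(r'\W', line)
print(non_word)
# [' ', '-', ' ', '!']

# \S+ : runs of non-blank characters
non_blank = re.findall(r'\S+', line)
print(non_blank)
# ['Call', '555-1234', 'now!']
```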
Quantifiers
- `*`: 0 or more times
  ```python
  import re

  p = re.compile(r'^T\w*', re.I)
  line = "The email address is this this the do you see"
  result = p.findall(line)
  print(result)
  # ['The']
  ```
- `+`: 1 or more times
- `{n}`: n times
- `{n1,n2}`: n1 to n2 times
- `{n,}`: n or more times
  ```python
  import re

  p = re.compile('(?:d){2,}', re.I)
  line = "The email address is this this the do you see"
  result = p.findall(line)
  print(result)
  # ['dd']
  ```
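The remaining quantifiers can be compared side by side on one string. This is only a sketch with a made-up line; note that `findall` scans left to right and returns non-overlapping matches, which is why a long run of digits can produce more than one match.

```python
import re

# Hypothetical sample line with runs of 1 to 5 digits
line = "1 22 333 4444 55555"

one_or_more = re.findall(r'\d+', line)
print(one_or_more)
# ['1', '22', '333', '4444', '55555']

exactly_three = re.findall(r'\d{3}', line)
print(exactly_three)
# ['333', '444', '555']

two_to_three = re.findall(r'\d{2,3}', line)
print(two_to_three)
# ['22', '333', '444', '555', '55']

four_or_more = re.findall(r'\d{4,}', line)
print(four_or_more)
# ['4444', '55555']
```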
Flags
Regex engines provide several flags that change how a pattern is matched. In Python's `re` module, flags are passed as an argument to functions such as `compile`.
- `i`: ignore case; `re.I` in Python
  ```python
  import re

  p = re.compile('the', re.I)
  line = "The email address is this this the do you see"
  result = p.findall(line)
  print(result)
  # ['The', 'the']
  ```
- `m`: multiline; `re.M` in Python
- `g`: global; Python's `re` module has no such flag, and `findall` already returns all matches
- Python's `re` module also provides some other flags.
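The multiline flag changes what the anchors `^` and `$` mean. A minimal sketch with a hypothetical three-line string, combining flags with `|`:

```python
import re

# Hypothetical multi-line text
text = "The first line\nthe second line\nA third line"

# Without re.M, ^ matches only at the start of the whole string
start_only = re.findall(r'^the', text, re.I)
print(start_only)
# ['The']

# With re.M, ^ also matches right after each new line
each_line = re.findall(r'^the', text, re.I | re.M)
print(each_line)
# ['The', 'the']

# Likewise, $ matches at the end of every line with re.M
line_ends = re.findall(r'line$', text, re.M)
print(line_ends)
# ['line', 'line', 'line']
```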
Greedy Search
- `?`: don't be greedy. Without `?`, a quantifier matches the longest string that fits the pattern; with `?` appended, it matches the shortest.
  ```python
  import re

  p1 = re.compile('^T.*e', re.I)   # greedy
  p2 = re.compile('^T.*?e', re.I)  # not greedy
  line = "The email address is this this the do you see"
  result1 = p1.findall(line)
  result2 = p2.findall(line)
  print(result1)
  # ['The email address is this this the do you see']
  print(result2)
  # ['The']
  ```
Grouping and Capturing
- `()`: capturing. The whole expression has to match, including the keys outside the parentheses, but only the part inside `()` is returned.
  ```python
  import re

  p = re.compile('th(e|i)', re.I)
  line = "The email address is this this the do you see"
  result = p.findall(line)
  print(result)
  # ['e', 'i', 'i', 'e']
  ```
- `(?:)`: grouping. `?:` disables capturing so that the parentheses indicate grouping only.
  ```python
  import re

  p = re.compile('th(?:e|i)', re.I)
  line = "The email address is this this the do you see"
  result = p.findall(line)
  print(result)
  # ['The', 'thi', 'thi', 'the']
  ```
- `[]`: either character, e.g. `[aeiou]`, `[a-z]`
  ```python
  import re

  p = re.compile('th[ei]', re.I)
  line = "The email address is this this the do you see"
  result = p.findall(line)
  print(result)
  # ['The', 'thi', 'thi', 'the']
  ```
- `T(?<groupname>he)`: group name (Python's `re` module writes this as `T(?P<groupname>he)`)
- `\n`: referencing the nth group, where `n` is the group number, e.g. `\1`
- `\k<groupname>`: referencing a group by name (Python's `re` module writes this as `(?P=groupname)`)
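Group names and back-references have no example above. The sketch below uses Python's spelling of the syntax, `(?P<name>...)` and `(?P=name)`, since the `re` module does not accept `(?<name>...)` or `\k<name>`. The group name `word` is an arbitrary choice, and `\b` is the word boundary described under Boundaries.

```python
import re

line = "The email address is this this the do you see"

# Named group: retrieve the captured part by name
m = re.search(r'T(?P<groupname>he)', line)
print(m.group('groupname'))
# he

# \1 refers back to whatever the first group captured,
# so this pattern finds a word that is immediately repeated
doubled = re.search(r'\b(\w+) \1\b', line)
print(doubled.group(0))
# this this

# Same idea, referencing the group by name instead of by number
doubled_named = re.search(r'\b(?P<word>\w+) (?P=word)\b', line)
print(doubled_named.group('word'))
# this
```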
Special Characters
- `\`: escape; placed before a special character so that it is matched literally, e.g. `\.` matches a dot
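A short sketch of why escaping matters, using a hypothetical line. `re.escape` is a helper in Python's `re` module that escapes the special characters in a string for you.

```python
import re

# Hypothetical sample line
line = "Total: 3.50 or 3x50 (approx.)"

# Unescaped, the dot matches any character
loose = re.findall(r'3.50', line)
print(loose)
# ['3.50', '3x50']

# Escaped, the dot matches only a literal dot
strict = re.findall(r'3\.50', line)
print(strict)
# ['3.50']

# re.escape builds a pattern that matches the given text literally
literal = re.findall(re.escape('(approx.)'), line)
print(literal)
# ['(approx.)']
```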
Boundaries
- `\b`: boundaries of words (depends on the locale)
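To see the effect of `\b`, the following sketch searches the sample line used throughout this page with and without word boundaries.

```python
import re

line = "The email address is this this the do you see"

# Without boundaries, 'is' is also found inside 'this'
anywhere = re.findall(r'is', line)
print(anywhere)
# ['is', 'is', 'is']

# With boundaries, only the standalone word matches
whole_word = re.findall(r'\bis\b', line)
print(whole_word)
# ['is']

# Words that start with 'th', ignoring case
th_words = re.findall(r'\bth\w*', line, re.I)
print(th_words)
# ['The', 'this', 'this', 'the']
```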
Python
- `compile`
- `search`
- `findall`
- `match`
- …
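The functions listed above differ mainly in where they look and what they return. A minimal sketch on the same sample line; the variable names are arbitrary.

```python
import re

line = "The email address is this this the do you see"

# compile: build a reusable pattern object
p = re.compile(r'this')

# match: succeeds only if the pattern matches at the start of the string
at_start = p.match(line)
print(at_start)
# None

# search: returns the first match anywhere in the string
first = p.search(line)
print(first.group(), first.span())
# this (21, 25)

# findall: returns all non-overlapping matches as a list of strings
every = p.findall(line)
print(every)
# ['this', 'this']
```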
Useful expressions
- `^X-.*:`: `X-` is at the beginning of the line, followed by 0 or more characters and `:`
- `^X-\S+:`: `X-` is at the beginning of the line, followed by 1 or more non-blank characters and `:`
- `^X-\S+?:`: `?` means “don’t be greedy”
- `\S+@\S+`: finds email addresses
- `^Email (\S+@\S+)`: finds the pattern but returns only the part in `()`, which should be the email address
- `[^ ]`: not space; `^` means not
- `[a-zA-Z0-9]` means all the letters and numbers
- `[^a-zA-Z0-9]` means neither letters nor numbers
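These expressions can be tried out on a couple of sample lines. The header and email lines below are hypothetical, invented only to show how the patterns differ.

```python
import re

# Hypothetical header line containing more than one colon
header = "X-Plane:Mode: on"

greedy_any = re.findall(r'^X-.*:', header)
print(greedy_any)
# ['X-Plane:Mode:']

greedy_non_blank = re.findall(r'^X-\S+:', header)
print(greedy_non_blank)
# ['X-Plane:Mode:']

lazy_non_blank = re.findall(r'^X-\S+?:', header)
print(lazy_non_blank)
# ['X-Plane:']

# Hypothetical line with two email addresses
mail = "Email alice@example.com replied to bob@example.org"

addresses = re.findall(r'\S+@\S+', mail)
print(addresses)
# ['alice@example.com', 'bob@example.org']

# Only the captured part of the anchored pattern is returned
first_address = re.findall(r'^Email (\S+@\S+)', mail)
print(first_address)
# ['alice@example.com']

# Characters that are neither letters nor numbers
others = re.findall(r'[^a-zA-Z0-9]', "a-b_c 1!")
print(others)
# ['-', '_', ' ', '!']
```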
Links
Regex tutorial — A quick cheatsheet by examples by Jonny Fox
regex101 is a useful website for regex.
Practice on repl.it
extendsclass is an online regex tester with a regular expression visualizer.
References
Planted:
by L Ma;
wiki/sugar/regular-experssions: L Ma (2018). 'Regular Expression Basics', Datumorphism, 06 April. Available at: https://datumorphism.leima.is/wiki/sugar/regular-experssions/.