To perform regular expression processing in Python, we use the re module from the standard library. It allows you to extract, replace, and split strings using regular expression patterns.
- re — Regular expression operations — Python 3.10.0 Documentation
- Regular Expression HOWTO — Python 3.10.0 Documentation
In this section, we will first explain the functions and methods of the re module.
- Compile regular expression patterns: compile()
- Match object
- Check if the beginning of the string matches, extract: match()
- Check for matches not limited to the beginning: search()
- Check if the entire string matches: fullmatch()
- Get a list of all matching parts: findall()
- Get all matching parts as an iterator: finditer()
- Replace the matching parts: sub(), subn()
- Split strings with regular expression patterns: split()
After that, I will explain the meta characters (special characters) and special sequences of regular expressions that can be used in the re module. Basically, it is the standard regular expression syntax, but be careful about setting flags (especially re.ASCII).
- Regular expression metacharacters, special sequences, and caveats in Python
- Setting the flag
  - Limited to ASCII characters: re.ASCII
  - Not case-sensitive: re.IGNORECASE
  - Match the beginning and end of each line: re.MULTILINE
  - Specify multiple flags
- Greedy and non-greedy matches
Compile the regular expression pattern: compile()
There are two ways to perform regular expression processing in the re module.
Run with functions
The first way is to use functions. Functions such as re.match() and re.sub() perform extraction, replacement, and other processing with regular expression patterns.
The details of each function are described later, but in all of them the first argument is the regular expression pattern string, followed by the string to be processed and other arguments. For example, in re.sub(), which performs substitution, the second argument is the replacement string and the third argument is the string to be processed.
import re
s = 'aaa@xxx.com, bbb@yyy.com, ccc@zzz.net'
m = re.match(r'([a-z]+)@([a-z]+)\.com', s)
print(m)
# <re.Match object; span=(0, 11), match='aaa@xxx.com'>
result = re.sub(r'([a-z]+)@([a-z]+)\.com', 'new-address', s)
print(result)
# new-address, new-address, ccc@zzz.net
Note that [a-z] in the regular expression pattern in this example means any one character from a to z (i.e. a lowercase letter), and + means repeat the previous pattern (in this case [a-z]) one or more times. So [a-z]+ matches any string of one or more lowercase letters.
The dot . is a metacharacter (a character with special meaning), so it must be escaped with a backslash (\.) to match a literal dot.
Since regular expression pattern strings often contain many backslashes, it is convenient to use raw strings (r'...') as in the example.
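To illustrate the point about raw strings, here is a small sketch comparing a raw string with the same pattern written as a normal string. The variable names m_raw and m_escaped are just for this example.

```python
import re

# A raw string keeps backslashes exactly as written
same = r'\.com' == '\\.com'
print(same)
# True

# In a normal string '\n' is one newline character; in a raw string it is two characters
print(len('\n'), len(r'\n'))
# 1 2

s = 'aaa@xxx.com'
m_raw = re.search(r'\.com', s)      # raw string
m_escaped = re.search('\\.com', s)  # normal string with a doubled backslash
print(m_raw.span(), m_escaped.span())
# (7, 11) (7, 11)
```

Both forms produce the same pattern, but the raw string is easier to read once the pattern contains several backslashes.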
Run with methods of a regular expression pattern object
The second way to process regular expressions in the re module is to use the methods of a regular expression pattern object.
Using re.compile(), you can compile a regular expression pattern string to create a regular expression pattern object.
p = re.compile(r'([a-z]+)@([a-z]+)\.com')
print(p)
# re.compile('([a-z]+)@([a-z]+)\\.com')
print(type(p))
# <class 're.Pattern'>
The same processing as functions such as re.match() and re.sub() can be executed with the match() and sub() methods of the regular expression object.
m = p.match(s)
print(m)
# <re.Match object; span=(0, 11), match='aaa@xxx.com'>
result = p.sub('new-address', s)
print(result)
# new-address, new-address, ccc@zzz.net
All of the re.xxx() functions described below are also provided as methods of the regular expression object.
If you repeat a process that uses the same pattern, it is more efficient to create a regular expression object with re.compile() and reuse it.
In the following sample code, the functions are used without compiling for convenience, but if you want to use the same pattern repeatedly, it is recommended to compile it in advance and call the methods of the regular expression object.
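As a minimal sketch of this kind of reuse, the following compiles a pattern once and applies it to several strings in a loop. The list lines and its contents are made up for illustration.

```python
import re

# Hypothetical input data for illustration
lines = ['aaa@xxx.com', 'not an address', 'bbb@yyy.com']

# Compile once, outside the loop
p = re.compile(r'([a-z]+)@([a-z]+)\.com')

matched = []
for line in lines:
    m = p.match(line)  # reuse the compiled pattern object
    if m:
        matched.append(m.group())

print(matched)
# ['aaa@xxx.com', 'bbb@yyy.com']
```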
match object
match(), search(), etc. return a match object.
s = 'aaa@xxx.com'
m = re.match(r'[a-z]+@[a-z]+\.[a-z]+', s)
print(m)
# <re.Match object; span=(0, 11), match='aaa@xxx.com'>
print(type(m))
# <class 're.Match'>
The matched string and its position are obtained using the following methods of the match object.
- Get the position of the match: start(), end(), span()
- Get the matched string: group()
- Get the string for each group: groups()
print(m.start())
# 0
print(m.end())
# 11
print(m.span())
# (0, 11)
print(m.group())
# aaa@xxx.com
If you enclose part of a regular expression pattern in parentheses (), that part is treated as a group. In this case, groups() returns the strings matched by each group as a tuple.
m = re.match(r'([a-z]+)@([a-z]+)\.([a-z]+)', s)
print(m)
# <re.Match object; span=(0, 11), match='aaa@xxx.com'>
print(m.groups())
# ('aaa', 'xxx', 'com')
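Individual groups can also be picked out one at a time. The sketch below passes a group number or name to group() and span(), and uses groupdict() for named groups. The group names local and domain are chosen only for this example.

```python
import re

s = 'aaa@xxx.com'
m = re.match(r'(?P<local>[a-z]+)@(?P<domain>[a-z]+)\.([a-z]+)', s)

print(m.group(0))  # group 0 is the whole match, same as group()
# aaa@xxx.com
print(m.group(1))  # first group, by number
# aaa
print(m.group('domain'))  # second group, by name
# xxx
print(m.group(1, 3))  # several groups at once, as a tuple
# ('aaa', 'com')
print(m.span(2))  # position of the second group
# (4, 7)
print(m.groupdict())  # only named groups appear here
# {'local': 'aaa', 'domain': 'xxx'}
```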
Check if the beginning of a string matches, extract: match()
match() returns a match object if the beginning of the string matches the pattern.
As mentioned above, the match object can be used to extract the matched substring, or simply to check if a match was made.
match() will only check the beginning. If there is no matching string at the beginning, it returns None.
s = 'aaa@xxx.com, bbb@yyy.com, ccc@zzz.net'
m = re.match(r'[a-z]+@[a-z]+\.com', s)
print(m)
# <re.Match object; span=(0, 11), match='aaa@xxx.com'>
m = re.match(r'[a-z]+@[a-z]+\.net', s)
print(m)
# None
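Because match() returns None when there is no match, calling a method such as group() directly on the result raises AttributeError. A common way to handle this, sketched here, is to test the result first. The variable result and the message 'no match' are just for illustration.

```python
import re

s = 'aaa@xxx.com, bbb@yyy.com, ccc@zzz.net'

m = re.match(r'[a-z]+@[a-z]+\.net', s)

# A match object is always truthy and None is falsy
if m:
    result = m.group()
else:
    result = 'no match'

print(result)
# no match
```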
Check for matches not limited to the beginning, extract: search()
search() checks for a match anywhere in the string, not just at the beginning. Like match(), it returns a match object if it matches.
If there are multiple matching parts, only the first one is returned.
s = 'aaa@xxx.com, bbb@yyy.com, ccc@zzz.net'
m = re.search(r'[a-z]+@[a-z]+\.net', s)
print(m)
# <re.Match object; span=(26, 37), match='ccc@zzz.net'>
m = re.search(r'[a-z]+@[a-z]+\.com', s)
print(m)
# <re.Match object; span=(0, 11), match='aaa@xxx.com'>
If you want to get all matching parts, use findall() or finditer() as described below.
Check if the whole string matches: fullmatch()
To check if the whole string matches the regular expression pattern, use fullmatch(). This is useful, for example, to check whether a string is valid as an email address or not.
If the entire string matches, a match object is returned.
s = 'aaa@xxx.com'
m = re.fullmatch(r'[a-z]+@[a-z]+\.com', s)
print(m)
# <re.Match object; span=(0, 11), match='aaa@xxx.com'>
If there are unmatched parts (only partial matches or no matches at all), None is returned.
s = '!!!aaa@xxx.com!!!'
m = re.fullmatch(r'[a-z]+@[a-z]+\.com', s)
print(m)
# None
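fullmatch() was added in Python 3.4. To do the same in earlier versions, use match() with the metacharacter $, which matches the end of the string, at the end of the pattern. If the string does not match from beginning to end, None is returned. Strictly speaking, $ also matches just before a newline at the very end of the string, so the two approaches differ for strings that end with a newline.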
s = '!!!aaa@xxx.com!!!'
m = re.match(r'[a-z]+@[a-z]+\.com$', s)
print(m)
# None
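One caveat with the $ approach is that $ also matches just before a newline at the end of the string. The sketch below compares it with fullmatch() and with \Z, a special sequence of the re module that matches only at the very end of the string. The variable names are made up for this example.

```python
import re

s = 'aaa@xxx.com\n'  # note the trailing newline

m_dollar = re.match(r'[a-z]+@[a-z]+\.com$', s)
print(m_dollar)
# <re.Match object; span=(0, 11), match='aaa@xxx.com'>

m_full = re.fullmatch(r'[a-z]+@[a-z]+\.com', s)
print(m_full)
# None

m_z = re.match(r'[a-z]+@[a-z]+\.com\Z', s)
print(m_z)
# None
```

If a trailing newline should make the check fail, fullmatch() or \Z is the safer choice.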
Get a list of all matching parts: findall()
findall() returns a list of all matching substrings. Note that the elements of the list are not match objects but strings.
s = 'aaa@xxx.com, bbb@yyy.com, ccc@zzz.net'
result = re.findall(r'[a-z]+@[a-z]+\.[a-z]+', s)
print(result)
# ['aaa@xxx.com', 'bbb@yyy.com', 'ccc@zzz.net']
The number of matched parts can be checked using the built-in function len(), which returns the number of elements in the list.
print(len(result))
# 3
If the pattern contains groups in parentheses (), findall() returns a list of tuples whose elements are the strings of each group. Each tuple is equivalent to groups() of the match object.
result = re.findall(r'([a-z]+)@([a-z]+)\.([a-z]+)', s)
print(result)
# [('aaa', 'xxx', 'com'), ('bbb', 'yyy', 'com'), ('ccc', 'zzz', 'net')]
Group parentheses () can be nested, so if you want to get the whole match as well, just enclose the whole pattern in parentheses ().
result = re.findall(r'(([a-z]+)@([a-z]+)\.([a-z]+))', s)
print(result)
# [('aaa@xxx.com', 'aaa', 'xxx', 'com'), ('bbb@yyy.com', 'bbb', 'yyy', 'com'), ('ccc@zzz.net', 'ccc', 'zzz', 'net')]
If no match is found, an empty list is returned.
result = re.findall('[0-9]+', s)
print(result)
# []
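One case not shown above is a pattern with exactly one group. In that case findall() returns a list of strings (the contents of that group) rather than a list of tuples. The sketch below also uses a non-capturing group (?:...), which groups part of the pattern without affecting what findall() returns. The variable names are just for this example.

```python
import re

s = 'aaa@xxx.com, bbb@yyy.com, ccc@zzz.net'

# Exactly one group: a list of strings, not tuples
one_group = re.findall(r'[a-z]+@([a-z]+)\.[a-z]+', s)
print(one_group)
# ['xxx', 'yyy', 'zzz']

# A capturing group used only for alternation changes the result
capturing = re.findall(r'[a-z]+@[a-z]+\.(com|net)', s)
print(capturing)
# ['com', 'com', 'net']

# A non-capturing group keeps the whole match as the result
non_capturing = re.findall(r'[a-z]+@[a-z]+\.(?:com|net)', s)
print(non_capturing)
# ['aaa@xxx.com', 'bbb@yyy.com', 'ccc@zzz.net']
```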
Get all matching parts as an iterator: finditer()
finditer() returns all matching parts as an iterator. The elements are not strings like findall(), but match objects, so you can get the position (index) of the matched parts.
The iterator itself cannot be printed out with print() to get its contents. If you use the built-in function next() or the for statement, you can get the contents one by one.
s = 'aaa@xxx.com, bbb@yyy.com, ccc@zzz.net'
result = re.finditer(r'[a-z]+@[a-z]+\.[a-z]+', s)
print(result)
# <callable_iterator object at 0x10b0efa90>
print(type(result))
# <class 'callable_iterator'>
for m in result:
    print(m)
# <re.Match object; span=(0, 11), match='aaa@xxx.com'>
# <re.Match object; span=(13, 24), match='bbb@yyy.com'>
# <re.Match object; span=(26, 37), match='ccc@zzz.net'>
It can also be converted to a list with list().
l = list(re.finditer(r'[a-z]+@[a-z]+\.[a-z]+', s))
print(l)
# [<re.Match object; span=(0, 11), match='aaa@xxx.com'>, <re.Match object; span=(13, 24), match='bbb@yyy.com'>, <re.Match object; span=(26, 37), match='ccc@zzz.net'>]
print(l[0])
# <re.Match object; span=(0, 11), match='aaa@xxx.com'>
print(type(l[0]))
# <class 're.Match'>
print(l[0].span())
# (0, 11)
If you want to get the position of all matching parts, the list comprehension notation is more convenient than list().
print([m.span() for m in re.finditer(r'[a-z]+@[a-z]+\.[a-z]+', s)])
# [(0, 11), (13, 24), (26, 37)]
The iterator yields elements in order. Note that once it reaches the end it is exhausted, so trying to extract more elements yields nothing.
result = re.finditer(r'[a-z]+@[a-z]+\.[a-z]+', s)
for m in result:
    print(m)
# <re.Match object; span=(0, 11), match='aaa@xxx.com'>
# <re.Match object; span=(13, 24), match='bbb@yyy.com'>
# <re.Match object; span=(26, 37), match='ccc@zzz.net'>
print(list(result))
# []
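Since finditer() yields match objects, positions and group contents can be collected together in a single pass. The following is a sketch of one way to do that. The group names local and domain and the dictionary layout are assumptions made for this example.

```python
import re

s = 'aaa@xxx.com, bbb@yyy.com, ccc@zzz.net'
p = re.compile(r'(?P<local>[a-z]+)@(?P<domain>[a-z]+\.[a-z]+)')

# Build one dictionary per match: its position plus the named groups
records = [
    {'span': m.span(), **m.groupdict()}
    for m in p.finditer(s)
]

for r in records:
    print(r)
# {'span': (0, 11), 'local': 'aaa', 'domain': 'xxx.com'}
# {'span': (13, 24), 'local': 'bbb', 'domain': 'yyy.com'}
# {'span': (26, 37), 'local': 'ccc', 'domain': 'zzz.net'}
```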
Replace the matching parts: sub(), subn()
Using sub(), you can replace the matched part with another string. The substituted string will be returned.
s = 'aaa@xxx.com, bbb@yyy.com, ccc@zzz.net'
result = re.sub(r'[a-z]+@[a-z]+\.com', 'new-address', s)
print(result)
# new-address, new-address, ccc@zzz.net
print(type(result))
# <class 'str'>
When the pattern contains groups in parentheses (), the matched strings can be used in the replacement string.
The following backreferences are supported by default. Note that in a normal string that is not a raw string, each backslash must itself be escaped with another backslash (for example, '\\1').
backreference | contents |
---|---|
\1 | The string matched by the first group |
\2 | The string matched by the second group |
\3 | The string matched by the third group |
result = re.sub(r'([a-z]+)@([a-z]+)\.com', r'\1@\2.net', s)
print(result)
# aaa@xxx.net, bbb@yyy.net, ccc@zzz.net
If you name a group by writing ?P<xxx> at the beginning of its parentheses, you can refer to it in the replacement string by name with \g<xxx> instead of by number.
result = re.sub(r'(?P<local>[a-z]+)@(?P<SLD>[a-z]+)\.com', r'\g<local>@\g<SLD>.net', s)
print(result)
# aaa@xxx.net, bbb@yyy.net, ccc@zzz.net
The count argument specifies the maximum number of replacements. Only the first count matches from the left are replaced.
result = re.sub(r'[a-z]+@[a-z]+\.com', 'new-address', s, count=1)
print(result)
# new-address, bbb@yyy.com, ccc@zzz.net
subn() returns a tuple of the substituted string (same as the return value of sub()) and the number of substituted parts (the number that matched the pattern).
result = re.subn(r'[a-z]+@[a-z]+\.com', 'new-address', s)
print(result)
# ('new-address, new-address, ccc@zzz.net', 2)
The method of specifying arguments is the same as sub(). You can use the part grouped by parentheses, or specify the argument count.
result = re.subn(r'(?P<local>[a-z]+)@(?P<SLD>[a-z]+)\.com', r'\g<local>@\g<SLD>.net', s)
print(result)
# ('aaa@xxx.net, bbb@yyy.net, ccc@zzz.net', 2)
result = re.subn(r'[a-z]+@[a-z]+\.com', 'new-address', s, count=1)
print(result)
# ('new-address, bbb@yyy.com, ccc@zzz.net', 1)
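The replacement passed to sub() and subn() can also be a function instead of a string. The function receives each match object and returns the replacement string. The sketch below masks the part before the @ sign. The function name mask_local and the masking rule are made up for illustration.

```python
import re

s = 'aaa@xxx.com, bbb@yyy.com, ccc@zzz.net'

def mask_local(m):
    # Keep the first character of the local part and mask the rest
    local = m.group(1)
    return local[0] + '*' * (len(local) - 1) + '@' + m.group(2)

result = re.sub(r'([a-z]+)@([a-z]+\.[a-z]+)', mask_local, s)
print(result)
# a**@xxx.com, b**@yyy.com, c**@zzz.net
```

A function is useful when the replacement depends on the matched text in a way that backreferences alone cannot express.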
Splitting strings with regular expression patterns: split()
split() splits the string at the parts that match the pattern and returns the result as a list.
Note that if the pattern matches at the very beginning or end of the string, the resulting list contains an empty string at the beginning or end.
s = '111aaa222bbb333'
result = re.split('[a-z]+', s)
print(result)
# ['111', '222', '333']
result = re.split('[0-9]+', s)
print(result)
# ['', 'aaa', 'bbb', '']
The maxsplit argument specifies the maximum number of splits. Only the first maxsplit matches from the left are used to split, so the result has at most maxsplit + 1 elements.
result = re.split('[a-z]+', s, maxsplit=1)
print(result)
# ['111', '222bbb333']
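Two related techniques are sketched below: enclosing the pattern in parentheses so that the delimiters are kept in the result, and filtering out the empty strings that appear when the pattern matches at the beginning or end. The variable names are just for this example.

```python
import re

s = '111aaa222bbb333'

# With a capturing group, the matched delimiters are included in the list
with_delimiters = re.split('([a-z]+)', s)
print(with_delimiters)
# ['111', 'aaa', '222', 'bbb', '333']

# Drop the empty strings at the beginning and end
parts = [x for x in re.split('[0-9]+', s) if x]
print(parts)
# ['aaa', 'bbb']
```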
Regular expression metacharacters, special sequences, and caveats in Python
The main regular expression metacharacters (special characters) and special sequences that can be used in the Python 3 re module are as follows:
metacharacter | contents |
---|---|
. | Any single character other than a newline (including a newline with the DOTALL flag) |
^ | The beginning of the string (also matches the beginning of each line with the MULTILINE flag) |
$ | The end of the string (also matches the end of each line with the MULTILINE flag) |
* | Repeat the previous pattern 0 or more times |
+ | Repeat the previous pattern 1 or more times |
? | Repeat the previous pattern 0 or 1 times |
{m} | Repeat the previous pattern m times |
{m,n} | Repeat the previous pattern from m to n times |
[] | A set of characters; matches any one of the characters inside [] |
| | OR; A|B matches either pattern A or pattern B |
special sequence | contents |
---|---|
\d | Unicode decimal digits (limited to ASCII digits by the ASCII flag) |
\D | The opposite of \d (any character that is not a decimal digit) |
\s | Unicode whitespace characters (limited to ASCII whitespace characters by the ASCII flag) |
\S | The opposite of \s (any character that is not whitespace) |
\w | Unicode word characters and underscores (limited to ASCII alphanumeric characters and underscores by the ASCII flag) |
\W | The opposite of \w (any character that is not a word character) |
Not all of them are listed in this table. See the official documentation for a complete list.
Also note that some of the meanings are different in Python 2.
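A few of the metacharacters in the table can be tried out as follows. This is only a sketch with made-up sample strings, using findall() to show what each pattern picks up.

```python
import re

# . matches any character except a newline, unless re.DOTALL is given
dot_default = re.findall(r'a.c', 'abc a-c a\nc')
print(dot_default)
# ['abc', 'a-c']
dot_all = re.findall(r'a.c', 'abc a-c a\nc', flags=re.DOTALL)
print(dot_all)
# ['abc', 'a-c', 'a\nc']

# {m} and {m,n} control the number of repetitions
exactly_three = re.findall(r'\d{3}', '12 345 6789')
print(exactly_three)
# ['345', '678']
two_to_three = re.findall(r'\d{2,3}', '12 345 6789')
print(two_to_three)
# ['12', '345', '678']

# ? makes the previous pattern optional
optional = re.findall(r'colou?r', 'color colour')
print(optional)
# ['color', 'colour']

# | matches either alternative, [] matches any one character in the set
either = re.findall(r'cat|dog', 'cat dog bird')
print(either)
# ['cat', 'dog']
char_set = re.findall(r'[abc]+', 'aabbcc dd cab')
print(char_set)
# ['aabbcc', 'cab']
```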
Setting the flag
As shown in the tables above, some metacharacters and special sequences change their behavior depending on the flags.
Only the main flags are covered here. See the official documentation for the rest.
Limited to ASCII characters: re.ASCII
By default, for Python 3 strings (str), \w also matches kanji, full-width alphanumeric characters, and so on. It is therefore not equivalent to [a-zA-Z0-9_], as it is in many other regular expression implementations.
m = re.match(r'\w+', '漢字ABC123')
print(m)
# <re.Match object; span=(0, 8), match='漢字ABC123'>
m = re.match('[a-zA-Z0-9_]+', '漢字ABC123')
print(m)
# None
If you specify re.ASCII for the flags argument of each function, or add the inline flag (?a) to the beginning of the regular expression pattern string, \w matches only ASCII characters (it does not match Japanese, full-width alphanumeric characters, and so on).
In this case, \w is equivalent to [a-zA-Z0-9_].
m = re.match(r'\w+', '漢字ABC123', flags=re.ASCII)
print(m)
# None
m = re.match(r'(?a)\w+', '漢字ABC123')
print(m)
# None
The same applies when compiling with re.compile(). Use the argument flags or inline flags.
p = re.compile(r'\w+', flags=re.ASCII)
print(p)
# re.compile('\\w+', re.ASCII)
print(p.match('漢字ABC123'))
# None
p = re.compile(r'(?a)\w+')
print(p)
# re.compile('(?a)\\w+', re.ASCII)
print(p.match('漢字ABC123'))
# None
re.ASCII is also available in the short form re.A. You can use either.
print(re.ASCII is re.A)
# True
\W, the opposite of \w, is also affected by re.ASCII and the inline flag.
m = re.match(r'\W+', '漢字ABC123')
print(m)
# None
m = re.match(r'\W+', '漢字ABC123', flags=re.ASCII)
print(m)
# <re.Match object; span=(0, 2), match='漢字'>
As with \w, the following special sequences match both half-width (ASCII) and full-width characters by default, but are limited to ASCII characters if re.ASCII or the inline flag is specified.
- Match digits: \d
- Match whitespace: \s
- Match non-digits: \D
- Match non-whitespace: \S
m = re.match(r'\d+', '123')
print(m)
# <re.Match object; span=(0, 3), match='123'>
m = re.match(r'\d+', '１２３')  # full-width digits
print(m)
# <re.Match object; span=(0, 3), match='１２３'>
m = re.match(r'\d+', '123', flags=re.ASCII)
print(m)
# <re.Match object; span=(0, 3), match='123'>
m = re.match(r'\d+', '１２３', flags=re.ASCII)  # full-width digits
print(m)
# None
m = re.match(r'\s+', '\u3000')  # full-width space
print(m)
# <re.Match object; span=(0, 1), match='\u3000'>
m = re.match(r'\s+', '\u3000', flags=re.ASCII)  # full-width space
print(m)
# None
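This difference matters when validating input. As a rough sketch, the following filters a list of candidate strings with and without re.ASCII and compares the result with the explicit character class [0-9]. The list candidates is made up, and the full-width digits are written as escapes so the example does not depend on how the source file is encoded.

```python
import re

# The second candidate is the full-width digits 1, 2, 3 written as escapes
candidates = ['123', '\uff11\uff12\uff13']

unicode_digits = [c for c in candidates if re.fullmatch(r'\d+', c)]
ascii_digits = [c for c in candidates if re.fullmatch(r'\d+', c, flags=re.ASCII)]
class_digits = [c for c in candidates if re.fullmatch(r'[0-9]+', c)]

print(len(unicode_digits))  # both candidates match by default
# 2
print(ascii_digits)
# ['123']
print(ascii_digits == class_digits)
# True
```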
Not case-sensitive: re.IGNORECASE
By default, matching is case-sensitive. To match both, you need to include both uppercase and lowercase letters in the pattern.
If re.IGNORECASE is specified, matching is case-insensitive. This is equivalent to the i flag in standard regular expressions.
m = re.match('[a-zA-Z]+', 'abcABC')
print(m)
# <re.Match object; span=(0, 6), match='abcABC'>
m = re.match('[a-z]+', 'abcABC', flags=re.IGNORECASE)
print(m)
# <re.Match object; span=(0, 6), match='abcABC'>
m = re.match('[A-Z]+', 'abcABC', flags=re.IGNORECASE)
print(m)
# <re.Match object; span=(0, 6), match='abcABC'>
You can also use the following.
- Inline flag: (?i)
- Abbreviation: re.I
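The inline flag and the abbreviation can be used as in the following sketch, which repeats the earlier example with each form. The variable names are just for this example.

```python
import re

# Inline flag at the beginning of the pattern
m_inline = re.match('(?i)[a-z]+', 'abcABC')
print(m_inline)
# <re.Match object; span=(0, 6), match='abcABC'>

# Abbreviated flag name
m_short = re.match('[a-z]+', 'abcABC', flags=re.I)
print(m_short)
# <re.Match object; span=(0, 6), match='abcABC'>

print(re.IGNORECASE is re.I)
# True
```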
Match the beginning and end of each line: re.MULTILINE
The metacharacter ^ matches the beginning of the string.
By default, only the beginning of the whole string is matched, but if re.MULTILINE is specified, ^ also matches the beginning of each line. This is equivalent to the m flag in standard regular expressions.
s = '''aaa-xxx
bbb-yyy
ccc-zzz'''
print(s)
# aaa-xxx
# bbb-yyy
# ccc-zzz
result = re.findall('[a-z]+', s)
print(result)
# ['aaa', 'xxx', 'bbb', 'yyy', 'ccc', 'zzz']
result = re.findall('^[a-z]+', s)
print(result)
# ['aaa']
result = re.findall('^[a-z]+', s, flags=re.MULTILINE)
print(result)
# ['aaa', 'bbb', 'ccc']
The metacharacter $ matches the end of the string. By default, only the end of the entire string is matched.
If re.MULTILINE is specified, $ also matches the end of each line.
result = re.findall('[a-z]+$', s)
print(result)
# ['zzz']
result = re.findall('[a-z]+$', s, flags=re.MULTILINE)
print(result)
# ['xxx', 'yyy', 'zzz']
You can also use the following.
- Inline flag: (?m)
- Abbreviation: re.M
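The following sketch uses the inline flag and the abbreviation with the same three-line string. It also shows \A, a special sequence of the re module that is not listed in the table above: it matches only at the beginning of the whole string, even when re.MULTILINE is set. The variable names are just for this example.

```python
import re

s = 'aaa-xxx\nbbb-yyy\nccc-zzz'

# Inline flag
line_starts = re.findall('(?m)^[a-z]+', s)
print(line_starts)
# ['aaa', 'bbb', 'ccc']

# Abbreviated flag name
line_ends = re.findall('[a-z]+$', s, flags=re.M)
print(line_ends)
# ['xxx', 'yyy', 'zzz']

# \A is not affected by re.MULTILINE
string_start = re.findall(r'\A[a-z]+', s, flags=re.M)
print(string_start)
# ['aaa']
```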
Specify multiple flags
To enable multiple flags at the same time, combine them with | (bitwise OR). For inline flags, write the flag letters one after another, as in (?am).
s = '''aaa-xxx
漢漢漢-字字字
bbb-zzz'''
print(s)
# aaa-xxx
# 漢漢漢-字字字
# bbb-zzz
result = re.findall(r'^\w+', s, flags=re.M)
print(result)
# ['aaa', '漢漢漢', 'bbb']
result = re.findall(r'^\w+', s, flags=re.M | re.A)
print(result)
# ['aaa', 'bbb']
result = re.findall(r'(?am)^\w+', s)
print(result)
# ['aaa', 'bbb']
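Combined flags work the same way with re.compile(). The following sketch combines re.I and re.M. The sample string is made up for illustration.

```python
import re

s = 'AAA-xxx\nbbb-YYY'

# Combine flags with | when compiling
p = re.compile('^[a-z]+', flags=re.I | re.M)
combined = p.findall(s)
print(combined)
# ['AAA', 'bbb']

# The same with inline flags
inline = re.findall('(?im)^[a-z]+', s)
print(inline)
# ['AAA', 'bbb']

# With re.I only, ^ matches just the beginning of the whole string
ignorecase_only = re.findall('^[a-z]+', s, flags=re.I)
print(ignorecase_only)
# ['AAA']
```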
Greedy and non-greedy matches
This is a general issue with regular expressions, not something specific to Python, but I cover it here because it is an easy place to get tripped up.
By default, the quantifiers *, +, and ? are greedy: they match the longest possible string.
s = 'aaa@xxx.com, bbb@yyy.com'
m = re.match(r'.+com', s)
print(m)
# <re.Match object; span=(0, 24), match='aaa@xxx.com, bbb@yyy.com'>
print(m.group())
# aaa@xxx.com, bbb@yyy.com
Adding ? after a quantifier, as in *?, +?, and ??, makes it non-greedy (minimal), so it matches the shortest possible string.
m = re.match(r'.+?com', s)
print(m)
# <re.Match object; span=(0, 11), match='aaa@xxx.com'>
print(m.group())
# aaa@xxx.com
Note that the default greedy match may match unexpected strings.
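A typical case where greedy matching surprises people is extracting delimited parts such as tags. The sketch below uses a made-up string and shows two common fixes: a non-greedy quantifier, and a negated character set that cannot run past the closing delimiter.

```python
import re

s = '<a>link</a><b>bold</b>'

# Greedy: .+ runs to the last > in the string
greedy = re.findall(r'<.+>', s)
print(greedy)
# ['<a>link</a><b>bold</b>']

# Non-greedy: .+? stops at the first > that allows a match
non_greedy = re.findall(r'<.+?>', s)
print(non_greedy)
# ['<a>', '</a>', '<b>', '</b>']

# Negated set: [^>]+ matches anything except >
negated = re.findall(r'<[^>]+>', s)
print(negated)
# ['<a>', '</a>', '<b>', '</b>']
```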