Mitch's Technical Blog

Python's Regular Expression Library (re)

August 25, 2018
notes, python, regexp

Putting aside for a moment how much I value clean, readable, well-tested code, I admit I'm envious of those regular expression aces on Stack Overflow who seem to read (and type!) regular expressions as a second language.

Regular expressions don't come naturally to me, not yet anyway, perhaps in part because I don't use them on a regular (pun acknowledged) basis.

Hence these notes, which I update as needed.

Quick tip: I recommend making your own notes on this or any other topic you're learning. They help greatly with retention and quick reference.

Example: Find contents within matching pairs of angle brackets

The goal is to extract the text enclosed by each pair of matching angle brackets.

<hello><world>
 ^^^^^  ^^^^^

In the example above, we want to retrieve 'hello' and 'world'.

Build up a regular expression in conceptual pieces.

[^<>] means any single character except < or >. The ^ character means "not" when it is the first character inside square brackets.

[^<>]+ means the same as the above, but stipulates one or more such characters.

([^<>]+) adds enclosing parentheses, which turn the match into a capturing group (a subexpression) whose text can be retrieved separately from the rest of the match.

Use findall() to return all matches. Notice that although the full pattern matches the bounding < and > too, findall() returns only the text captured by the group.

Note also the r prefix before the opening quote. It marks a raw string literal, which ensures that e.g. '\n' is treated as two separate characters (a backslash and an n), not a single newline character.

>>> import re
>>> exp = r'<([^<>]+)>'
>>> re.findall(exp, '<hello><world>')
['hello', 'world']
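The raw-string note above can be checked directly. This is a minimal sketch; the \b word-boundary pattern and the sample text are my own illustration, not part of the example above.

```python
import re

# A normal string literal turns \n into one newline character;
# a raw string literal keeps the backslash and the n as two characters.
assert len('\n') == 1
assert len(r'\n') == 2

# The difference matters for regex escapes such as \b (word boundary):
# in a normal string literal, '\b' is a backspace character instead.
text = 'cat concat'
raw_hits = re.findall(r'\bcat\b', text)    # word-boundary pattern
plain_hits = re.findall('\bcat\b', text)   # backspace + 'cat' + backspace
print(raw_hits, plain_hits)
```

The raw pattern finds the standalone word, while the non-raw one silently looks for backspace characters and finds nothing.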

On the other hand, without a capturing group findall() returns each whole match, including the bounding < and >.

>>> import re
>>> exp = r'<[^<>]+>'
>>> re.findall(exp, '<hello><world>')
['<hello>', '<world>']
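To see the whole match and the captured group side by side, one option is re.finditer(), which yields match objects rather than strings. A small sketch reusing the same pattern (the variable name pairs is just for illustration):

```python
import re

exp = r'<([^<>]+)>'

# group(0) is the entire match; group(1) is the first capturing group.
pairs = [(m.group(0), m.group(1)) for m in re.finditer(exp, '<hello><world>')]
print(pairs)  # [('<hello>', 'hello'), ('<world>', 'world')]
```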

Cheat Sheet

*   0 or more

+   1 or more

?   0 or 1
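A quick way to compare the three quantifiers is to run them against the same text. The sample string below is made up for illustration:

```python
import re

text = 'a ab abb'

star = re.findall(r'ab*', text)      # 'a' followed by 0 or more 'b'
plus = re.findall(r'ab+', text)      # 'a' followed by 1 or more 'b'
optional = re.findall(r'ab?', text)  # 'a' followed by 0 or 1 'b'

print(star)      # ['a', 'ab', 'abb']
print(plus)      # ['ab', 'abb']
print(optional)  # ['a', 'ab', 'ab']
```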

Links

Python library re docs

Mentioned in several places: "note that this is different from finding a zero-length match". Huh?

Python docs Regular Expression HOWTO

Differences between findall, match, search

search() vs match()

re.findall() returns a list of all non-overlapping matches:

>>> re.findall(r'\d+', 'a1b2c3')
['1', '2', '3']

re.match() requires the match to start at the beginning of the string. Otherwise it returns None, which the interactive prompt displays as nothing.

>>> re.match(r'\d+', '123abc')
<_sre.SRE_Match object; span=(0, 3), match='123'>
>>>
>>> re.match(r'\d+', 'foo123abc')
>>>
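Since re.match() returns either a match object or None, code that uses it usually checks the result before calling methods on it. A minimal sketch, where leading_digits is a hypothetical helper name of my own:

```python
import re

def leading_digits(text):
    """Return the run of digits at the start of text, or None if there is none."""
    m = re.match(r'\d+', text)
    if m is None:
        return None
    return m.group()  # the matched text

print(leading_digits('123abc'))     # 123
print(leading_digits('foo123abc'))  # None
```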

re.search() checks for a match anywhere in the string. It returns only the first result.

>>> re.search(r'\d+', 'a1b2c3')
<_sre.SRE_Match object; span=(1, 2), match='1'>
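The docs' remark quoted under Links, that returning None "is different from finding a zero-length match", seems to be about cases like the following. This is my own sketch of the distinction, not an example from the docs:

```python
import re

no_match = re.search(r'\d+', 'abc')     # + needs at least one digit
empty_match = re.search(r'\d*', 'abc')  # * is satisfied by zero digits

print(no_match)             # None
print(empty_match.group())  # prints an empty line: the match is ''
print(empty_match.span())   # (0, 0)
print(bool(empty_match))    # True
```

The second call succeeds: it returns a match object covering zero characters at position 0. A test like `if m:` therefore treats it as a match even though the matched text is empty.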

Contact: hello at escapefromsql.net