Word Boundaries and Lookahead Assertions
As I was trying to improve the lexer for postcss-calc, I learnt about two regular expression features: the word boundary anchor \b and lookahead assertions.
Word boundary anchor character: \b
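In a regular expression, \b specifies that the expression has to match at a word boundary: a position between a word character (a letter, a digit or an underscore) and a non-word character, or at the start or end of the string next to a word character. It matches a position, not a character.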
For example, in JavaScript, /123\b/.test('123 456') returns true because the space is a non-word character, so there is a word boundary right after 123.
/123\b/.test('123456') returns false, because in 123456 the 123 is followed by another digit, so there is no word boundary after it.
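Here is a minimal sketch of the two cases above. The last two inputs ('123' and '123_456') are not part of the original example; I added them to illustrate the end of the string and the underscore.

```javascript
// \b is a zero-width assertion: it matches a position, not a character.
const pattern = /123\b/;

// true: there is a boundary between "3" (word character) and " " (non-word character)
const followedBySpace = pattern.test('123 456');

// false: "3" and "4" are both word characters, so there is no boundary
const followedByDigit = pattern.test('123456');

// true: the end of the string counts as a boundary after a word character
const atEndOfString = pattern.test('123');

// false: "_" is a word character too
const followedByUnderscore = pattern.test('123_456');

console.log(followedBySpace, followedByDigit, atEndOfString, followedByUnderscore);
```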
\b in combination with digits and units can be treacherous, because the decimal point . is a non-word character, so there is a word boundary between a digit and the decimal point that follows it. Say you are expecting only whole numbers, but the input also contains decimal numbers with units. /[0-9]+\b/ matches 123 in 123.45deg, leaving the .45deg string behind, which can give the illusion that the input matches expectations.
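The pitfall can be reproduced with a few lines. The variable names are mine, and the anchored pattern at the end is only one possible way to notice the leftover input, not something postcss-calc is known to do.

```javascript
const wholeNumber = /[0-9]+\b/;
const input = '123.45deg';

// The pattern matches, but only the part before the decimal point.
const match = input.match(wholeNumber);
const matchedText = match[0];

// Everything after the match is silently left behind.
const leftover = input.slice(match.index + matchedText.length);

// Assumption for illustration: anchoring with ^ and $ forces the whole
// input to be a whole number, so the decimal input is rejected.
const isWholeNumberOnly = /^[0-9]+$/.test(input);

console.log(matchedText, leftover, isWholeNumberOnly);
```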
Lookahead assertions in regular expressions
While I was looking for how to stop \b from treating the position before a decimal point as a word boundary, I came across lookahead assertions. A lookahead assertion makes a match depend on what follows the current position, without consuming those characters.
The syntax for lookahead assertions can be confusing, as it looks like the syntax for non-capturing groups, (?:...).
For example, appending the negative lookahead assertion (?!\.) to a pattern makes it match only if it is not followed by a decimal point. So /[0-9]+(?!\.)\b/ does not match any part of 123.45deg: 123 is followed by a decimal point, and 45 runs straight into deg, so there is no word boundary after it.
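A small sketch of the negative lookahead in action. The inputs other than 123.45deg are my own additions; the '123.45' case shows that it is the missing word boundary before deg, not the lookahead alone, that rejects the fractional digits.

```javascript
const wholeNumberNoDot = /[0-9]+(?!\.)\b/;

// false: "123" is followed by ".", and "45" is followed by "d" with no boundary
const matchesDecimalWithUnit = wholeNumberNoDot.test('123.45deg');

// "123": followed by a space, so the lookahead passes and \b matches
const plainMatch = '123 456'.match(wholeNumberNoDot);
const plainText = plainMatch ? plainMatch[0] : null;

// Caveat: without a unit, the digits after the decimal point still match,
// because "45" is followed by the end of the string.
const fractionMatch = '123.45'.match(wholeNumberNoDot);
const fractionText = fractionMatch ? fractionMatch[0] : null;

console.log(matchesDecimalWithUnit, plainText, fractionText);
```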
In a similar fashion, the positive lookahead assertion (?=\.) requires a decimal point after the pattern. Java and JavaScript (since the 2018 edition of the ECMAScript standard) also have lookbehind assertions: (?<=\.) requires a decimal point before the pattern.
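To make the two assertions concrete, here is a sketch that splits 123.45deg with them. The pattern and variable names are hypothetical, and the lookbehind needs a JavaScript engine that implements ES2018.

```javascript
// Digits that are followed by a decimal point (the point is not consumed).
const integerPart = /[0-9]+(?=\.)/;

// Digits that are preceded by a decimal point (ES2018 lookbehind).
const fractionPart = /(?<=\.)[0-9]+/;

const input = '123.45deg';

const integerText = input.match(integerPart)[0];
const fractionText = input.match(fractionPart)[0];

// null: there is no decimal point after the digits
const noDecimalPoint = '123deg'.match(integerPart);

console.log(integerText, fractionText, noDecimalPoint);
```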