Word Boundaries and Lookahead Assertions

Wednesday, Jan 12 2022 in JavaScript Java

As I was trying to improve the lexer for postcss-calc, I learnt about two regular expression features: the word boundary anchor and lookahead assertions.

Word boundary anchor character: `\b`

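In a regular expression, \b is a zero-width assertion that matches at a word boundary: a position with a word character (a letter, a digit or an underscore) on one side and a non-word character, or the start or end of the string, on the other. For example, in JavaScript,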

/123\b/.test('123 456')

returns true because the space after 123 is a non-word character, so there is a word boundary between the 3 and the space.

/123\b/.test('123456')

returns false because in 123456 the 3 is followed by 4, another word character, so there is no word boundary after 123.

\b in combination with digits and units can be treacherous, because the decimal point . is a non-word character, so \b also matches between a digit and a decimal point. Say you are expecting only whole numbers, but the input also contains decimal numbers with units. /[0-9]+\b/ matches 123 in 123.45deg, leaving the .45deg string behind, which can give the illusion that the input matches expectations.
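Here is a small sketch of that pitfall, using the same input as above. The variable names are only illustrative, and the anchored pattern at the end is just one possible way to check the whole string, not necessarily what postcss-calc does.

```javascript
// Illustrative input: a decimal number with a unit.
const sample = '123.45deg';

// \b succeeds between "3" (a word character) and "." (a non-word character),
// so the pattern matches only the integer part.
const match = sample.match(/[0-9]+\b/);
const matched = match[0]; // '123'

// Whatever follows the match is silently left behind.
const rest = sample.slice(match.index + matched.length); // '.45deg'

// Anchoring the pattern to the whole string exposes the mismatch.
const isWholeNumber = /^[0-9]+$/.test(sample); // false
```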

Lookahead assertions in regular expressions

While I was looking for a way to exclude the \. character from word boundaries, I came across lookahead assertions. A lookahead assertion makes a match depend on what follows the pattern, without consuming the characters that follow. The syntax can be confusing, as (?=...) and (?!...) look a lot like the (?:...) syntax for non-capturing groups.

For example, appending the negative lookahead assertion (?!\.) to a pattern makes it match only when it is not followed by a decimal point. So

/[0-9]+(?!\.)\b/

does not match any part of 123.45deg.
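The sketch below tries that pattern on a few inputs. The constant names and the extra inputs are my own additions for illustration. The last case is a caveat worth knowing: the lookahead only inspects the character after the digits, so a fractional part that is followed by a word boundary can still match.

```javascript
// Digits not followed by a decimal point, ending at a word boundary.
const integerOnly = /[0-9]+(?!\.)\b/;

// "123" is rejected by the lookahead; shorter candidates such as "12"
// pass the lookahead but fail \b, and "45" is followed by "d" (no boundary).
const decimalWithUnit = integerOnly.test('123.45deg'); // false

// "123" is followed by a space: not a decimal point, and a word boundary.
const integerThenSpace = integerOnly.test('123 456'); // true

// Caveat: here "45" is followed by a space, so the fractional part matches.
const fractionThenSpace = '123.45 deg'.match(integerOnly)[0]; // '45'
```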

In a similar fashion, the positive lookahead assertion (?=\.) requires a decimal point after the pattern. In Java and in the 2018 edition of the ECMAScript standard, there are also lookbehind assertions: (?<=\.) requires a decimal point before the pattern.
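As a rough illustration (variable names are mine, and the lookbehind line assumes a JavaScript engine with ES2018 support), the following compares a positive lookahead, a non-capturing group and a positive lookbehind on the same input. It also shows why the similar-looking syntax matters: the group consumes the decimal point, while the assertions do not.

```javascript
const value = '123.45deg';

// Positive lookahead: digits immediately followed by a decimal point.
// The decimal point is checked but not included in the match.
const integerPart = value.match(/[0-9]+(?=\.)/)[0]; // '123'

// Non-capturing group: looks similar, but the decimal point is consumed.
const withPoint = value.match(/[0-9]+(?:\.)/)[0]; // '123.'

// Positive lookbehind (ES2018+): digits immediately preceded by a decimal point.
const fractionPart = value.match(/(?<=\.)[0-9]+/)[0]; // '45'
```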
