Regular Expression Rabbit Hole
Wed, Jan 24, 2024
Today I wanted to write a bash script that checks if an input matches an AWS UserPool ClientId.
Since the AWS documentation kindly provides [\w_]+
as the regular expression pattern, I took the shortcut and used it as is.
After cooking up the bash script, I wondered why my test string didn't match, even though every online regular expression tool said it should.
#!/bin/bash
why_is_this_regex_not_working="^[\w_]+$"
testStr="abc123"
if [[ ! "$testStr" =~ $why_is_this_regex_not_working ]]; then
  echo "${testStr} NOT MATCHING ${why_is_this_regex_not_working}"
fi
After some research I found this statement in the MDN docs:
\w matches any alphanumeric character from the basic Latin alphabet, including the underscore. Equivalent to [A-Za-z0-9_].
So I implemented the equivalent [A-Za-z0-9_]
and it worked like a charm.
But why was \w
not working?
It turns out that the \w shorthand
is an "enhanced" feature described in re_format(7). It is not part of the POSIX standard,
so not every regex engine implements it.
So which engine does bash use?
Reading man bash
An additional binary operator, =~, is available, with the same precedence as == and !=. When it is used, the string to the right of the operator is considered an extended regular expression and matched accordingly (as in regex(3)).
Reading man regex
These routines implement IEEE Std 1003.2 (“POSIX.2”) regular expressions (“RE”s); see re_format(7).
Reading man re_format
Regular expressions (“REs”), as defined in IEEE Std 1003.2 (“POSIX.2”), come in two forms: modern REs (roughly those of egrep(1); 1003.2 calls these “extended” REs) and obsolete REs (roughly those of ed(1); 1003.2 “basic” REs)
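To make the two forms a bit more tangible, here is a small sketch using grep, which speaks basic REs by default and extended REs with -E. The test string and variable names are just my own illustration. It also runs the same pattern through =~, which treats + as a quantifier, consistent with the man bash quote above.

```bash
#!/bin/bash
# In a basic RE an unescaped "+" is a literal plus sign;
# in an extended RE it means "one or more of the preceding item".
basic_count=$(printf 'aaa\n' | grep -c '^a+$')      # basic RE
extended_count=$(printf 'aaa\n' | grep -Ec '^a+$')  # extended RE
echo "lines matched as basic RE:    ${basic_count}"
echo "lines matched as extended RE: ${extended_count}"

# The same pattern with bash's =~ operator.
re='^a+$'
if [[ "aaa" =~ $re ]]; then
  echo "=~ treats + as a quantifier"
fi
```

If =~ used basic REs, the last check would only match the literal string "a+".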
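So it appears that the regex engine used by bash on my M1 Mac implements 1003.2 "extended" REs without the enhanced features, and those do not define \w.
On top of that, POSIX treats a backslash inside a bracket expression as an ordinary character, so [\w_] only stands for a backslash, a "w" or an underscore.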
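A quick way to see what the pattern from the AWS docs really matches in bash is to feed it a few strings. This is only a sketch: the helper name matches_aws_pattern and the sample inputs are mine, and it assumes the underlying regex library follows the POSIX rule that a backslash is a literal character inside brackets.

```bash
#!/bin/bash
pattern='^[\w_]+$'

# Hypothetical helper: exit status 0 if the argument matches the pattern.
matches_aws_pattern() {
  [[ "$1" =~ $pattern ]]
}

# Under the POSIX rule only backslash, "w" and "_" are in the set,
# so "abc123" should be rejected while the other two are accepted.
for candidate in 'abc123' 'w_w' '\'; do
  if matches_aws_pattern "$candidate"; then
    echo "match:    ${candidate}"
  else
    echo "no match: ${candidate}"
  fi
done
```

So the script was not broken, it was simply testing for a very different set of characters than I had in mind.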
Perl obviously does:
$ perl -e 'print "Matched" if "abc123" =~ /^[\w_]+$/;'
Matched%
Using the POSIX character class [[:alnum:]_]
or the range notation [A-Za-z0-9_]
solved the issue.
#!/bin/bash
# \w is not a shorthand here: inside brackets the backslash is literal
not_working_regex="^[\w_]+$"
# explicit ranges plus underscore
working_regex="^[A-Za-z0-9_]+$"
# POSIX character class plus underscore
working_posix_regex="^[[:alnum:]_]+$"
testStr="abc123"
if [[ ! "$testStr" =~ $not_working_regex ]]; then
  echo "${testStr} NOT MATCHING ${not_working_regex}"
fi
if [[ ! "$testStr" =~ $working_regex ]]; then
  echo "${testStr} NOT MATCHING ${working_regex}"
fi
if [[ ! "$testStr" =~ $working_posix_regex ]]; then
  echo "${testStr} NOT MATCHING ${working_posix_regex}"
fi
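For the script I originally wanted, the working pattern can be wrapped in a small function. This is a minimal sketch, not an official AWS validation: the function name is_valid_client_id is hypothetical, and it only checks the character set from the documented pattern, not length or any other constraint.

```bash
#!/bin/bash
# Hypothetical helper: exit status 0 if $1 is non-empty and consists
# only of letters, digits and underscores.
is_valid_client_id() {
  local re='^[A-Za-z0-9_]+$'
  [[ "$1" =~ $re ]]
}

# Try a few sample inputs, including one with a dash and an empty string.
for candidate in "abc123" "abc_123" "abc-123" ""; do
  if is_valid_client_id "$candidate"; then
    echo "valid:   '${candidate}'"
  else
    echo "invalid: '${candidate}'"
  fi
done
```

Keeping the pattern in a variable and leaving it unquoted on the right side of =~ matters, because a quoted right-hand side is matched as a plain string.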