Issue
I would like to know why using the bracket expressions [\w] and [\d] in GNU sed produces different results:
$ echo '\weds4'|sed -E 's/[\w]/_/g'
__eds4
$ echo '\weds4'|sed -E 's/[\d]/_/g'
\we_s4
$ echo '\weds4'|sed -E 's/[\s]/_/g'
_wed_4
I expected that echo '\weds4'|sed -E 's/[\d]/_/g' would produce _we_s4 and not \we_s4. It is described here that [\d] should match both \ and d, as I'm expecting:
So in POSIX, the regular expression [\d] matches a \ or a d.
Why is it happening?
Demo here.
Side note: using BRE instead of ERE doesn't change anything.
Solution
From https://www.gnu.org/software/sed/manual/sed.html :
Regex syntax clashes (problems with backslashes)
....
In addition, this version of sed supports several escape characters (some of which are multi-character) to insert non-printable characters in scripts (\a, \c, \d, \o, \r, \t, \v, \x). These can cause similar problems with scripts written for other seds.
And
5.8 Escape Sequences - specifying special characters
[...]
\dxxx
Produces or matches a character whose decimal ASCII value is xxx.
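A small illustration of the documented form, assuming GNU sed is the sed on your PATH: \d065 is decimal 65, which is the ASCII code for A. The input string and the variable name are made up for this sketch.

```shell
# \d065 stands for the character with decimal ASCII value 65, i.e. 'A'.
# This is a GNU sed extension; other seds may treat it differently.
decimal_out=$(printf '%s\n' 'aAb' | sed 's/\d065/_/')
printf '%s\n' "$decimal_out"   # prints a_b with GNU sed
```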
Why is it happening?
When you write \d with nothing after it, the case at https://github.com/mirror/sed/blob/master/sed/compile.c#L1345 matches and calls convert_number() (https://github.com/mirror/sed/blob/master/sed/compile.c#L1356). When there are no digits to convert, convert_number() just assigns the character itself to the result value (*result = *buf, see https://github.com/mirror/sed/blob/master/sed/compile.c#L275) instead of converting the digits after d.
The same happens for every case in that switch, so \d, \x and \o with nothing after them will match d, x and o. I would count /\d/ as undefined behavior in GNU sed: \d has to be followed by three decimal digits. As far as I can tell, the GNU sed documentation does not specify what should happen when \d, \x, \c or \o is not followed by digits, or is followed by invalid characters.
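The effect can be reproduced with the commands from the question. This is only a sketch: it assumes GNU sed, and since the behavior is undocumented it may differ between versions. printf is used instead of echo so that the shell itself does not interpret the backslash; the variable names are made up.

```shell
# A bare \d is rewritten to a plain d before the regex is compiled,
# so [\d] behaves like [d] and the backslash in the input is left alone.
bare_d=$(printf '%s\n' '\weds4' | sed -E 's/[\d]/_/g')
printf '%s\n' "$bare_d"   # \we_s4, as reported in the question

# \w is not one of those escapes, so [\w] keeps both \ and w in the set.
bare_w=$(printf '%s\n' '\weds4' | sed -E 's/[\w]/_/g')
printf '%s\n' "$bare_w"   # __eds4, as reported in the question
```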
Why do I need a second slash?
In POSIX sed (https://pubs.opengroup.org/onlinepubs/9699919799/utilities/sed.html), I think all three of your commands are invalid / undefined behavior. POSIX sed does not specify what should happen on \d, \s or \w; these are invalid escape sequences, so you can't expect them to work. If you want to match \, you have to escape it as \\; see https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap05.html#tagtcjh_2 .
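A sketch of the escaped form, written outside a bracket expression so that it does not depend on how a particular sed treats backslashes inside [...]. The variable names are illustrative only.

```shell
# \\ is an escaped backslash, so this replaces only the literal \ in the input.
only_backslash=$(printf '%s\n' '\weds4' | sed 's/\\/_/g')
printf '%s\n' "$only_backslash"   # _weds4

# With -E, alternation matches either a literal backslash or a d,
# which gives the result the question was expecting.
backslash_or_d=$(printf '%s\n' '\weds4' | sed -E 's/\\|d/_/g')
printf '%s\n' "$backslash_or_d"   # _we_s4
```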
But it would be nicer to get an error message from GNU sed, like in the case of \c.
Answered By - KamilCuk Answer Checked By - Timothy Miller (WPSolving Admin)