Tuesday, November 16, 2021

[SOLVED] Difference in the output of commands in two different environments

Issue

I'm just curious why the output of the sed command in these two environments is different:

  1. command:

echo "xxx-MNP_ISS_DE-5.12.0.37-quality.zip"|sed 's#^[a-z,A-Z,-.\_]*##'

5.12.0.37-quality.zip

System info: i) echo $0

bash

ii) uname -a

Linux xxxx 3.10.0-1062.1.1.el7.x86_64 #1 SMP Fri Sep 13 22:55:44 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

  2. command:

echo "xxx-MNP_ISS_DE-5.12.0.37-quality.zip"|sed 's#^[a-z,A-Z,-.\_]*##'

-MNP_ISS_DE-5.12.0.37-quality.zip

System info: i) echo $0

-bash

ii) uname -a

Linux xxxx 5.10.16.3-microsoft-standard-WSL2 #1 SMP Fri Apr 2 22:23:49 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Thanks in advance!


Solution

Short answer: the character range is malformed, and it is running into what appears to be a bug in GNU sed v4.2.2's Unicode character range handling. Use [-[:alpha:].\_] instead (assuming the backslash is actually supposed to be one of the characters to trim; if not, remove it from the bracket expression).
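As a minimal sketch of that fix applied to the file name from the question (this variant assumes the backslash is not wanted, so it is left out; the variable names are only for illustration):

```shell
name='xxx-MNP_ISS_DE-5.12.0.37-quality.zip'

# The dash comes first so it is taken literally; [:alpha:] stands in for
# the a-z and A-Z ranges, so no range expression is left in the brackets.
trimmed=$(printf '%s\n' "$name" | sed 's#^[-[:alpha:]._]*##')

printf '%s\n' "$trimmed"   # should print 5.12.0.37-quality.zip
```

Because the bracket expression no longer contains any range, the result should not depend on how a given sed version orders punctuation.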

Long answer, part 1: In a regex bracket expression (like [some characters]), commas are not needed to separate entries, and will instead be treated as part of the list of characters to match. On the other hand, dashes in most contexts are treated as part of a range (e.g. a-z) rather than as literal characters themselves. Thus, the bracket expression [a-z,A-Z,-.\_] is parsed as including:

  • The character range a through z
  • The character ,
  • The character range A through Z
  • The character range , through .
  • The characters \ and _
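The parsing described above can be checked with a short sketch. It runs under LC_ALL=C so that locale-specific range handling stays out of the picture; the sample string and variable names are made up for illustration.

```shell
sample='a,b-c.d'

# A comma inside [...] is an ordinary member, so it is replaced as well.
with_comma=$(printf '%s\n' "$sample" | LC_ALL=C sed 's/[a-z,A-Z]/X/g')
printf '%s\n' "$with_comma"    # XXX-X.X

# A dash placed first in the brackets is literal, not a range operator.
with_dash=$(printf '%s\n' "$sample" | LC_ALL=C sed 's/[-a-zA-Z]/X/g')
printf '%s\n' "$with_dash"     # X,XXX.X
```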

This isn't quite what's intended, but it is mostly close enough to what's expected. Mostly. Except for the , through . range.

Long answer, part 2: The character , is hex 2C (decimal 44) in ASCII, and U+002C in Unicode. The . is hex 2E (46) in ASCII and U+002E in Unicode. In both ASCII and Unicode, the character between them happens to be -. This means that if character ranges follow the order of the character codes, the range ,-. just happens to correspond to those same three characters: ,, -, and .. The POSIX locale just uses the ASCII order, so that's exactly what the range corresponds to, but Unicode locales can be more complicated.
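A quick way to confirm both points, sketched with plain printf and sed (the variable names are illustrative; the second command forces the C locale, which is the case where the range follows ASCII order):

```shell
# In POSIX printf, a numeric argument that starts with a quote yields the
# code of the character that follows the quote.
codes=$(printf '%d %d %d' "'," "'-" "'.")
printf '%s\n' "$codes"      # 44 45 46

# In the C locale the range ,-. covers exactly , - and . and nothing else.
c_range=$(printf ' !,-./:;?\n' | LC_ALL=C sed 's/[,-.]/X/g')
printf '%s\n' "$c_range"    # " !XXX/:;?" (leading space kept)
```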

I do not understand Unicode's collation (sorting) rules properly, but with everything I've tested except GNU sed v4.2.2, the character range corresponds to just ,, -, and . in the en_US.UTF-8 locale (and that includes testing with GNU sed v4.7). So I'm fairly sure that's what it should correspond to.

But not with GNU sed v4.2.2. In that, ,-. corresponds to a different bunch of punctuation characters:

$ chars=' !"#$%&'\''()*+,-./0123456789:;<=>?'
$ range=',-.'
$ echo "$chars"; echo "$chars" | LC_ALL=en_US.UTF-8 sed "s#[$range]#X#g"
 !"#$%&'()*+,-./0123456789:;<=>?
 X"#$%&'()*+X-XX0123456789XX<=>X
$ sed --version
sed (GNU sed) 4.2.2
Copyright (C) 2012 Free Software Foundation, Inc.
[...]

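...so with this sed version and locale, it's interpreting the range ,-. as including !, /, :, ;, and ? (in addition to the , and . that are the endpoints of the range). Note that the - itself is left untouched in the output above, so the range does not even include the one character that sits between its endpoints in ASCII order.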

I have no idea why it does this, but it appears to be a bug that got fixed somewhere between versions 4.2.2 and 4.7.
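If upgrading sed is not an option, one possible workaround (a sketch, not something tested on the affected machines from the question) is to force the POSIX/C locale for that one command, since ranges there follow ASCII order as described earlier. The original bracket expression is kept unchanged here on purpose:

```shell
# Show which sed is installed. --version is a GNU extension, so other
# sed implementations may reject it; errors are discarded.
sed --version 2>/dev/null | head -n 1

name='xxx-MNP_ISS_DE-5.12.0.37-quality.zip'

# LC_ALL=C applies only to this sed invocation, so ,-. means , - and .
c_trimmed=$(printf '%s\n' "$name" | LC_ALL=C sed 's#^[a-z,A-Z,-.\_]*##')
printf '%s\n' "$c_trimmed"   # should print 5.12.0.37-quality.zip
```

Rewriting the bracket expression without ranges is still the more robust fix, because it does not rely on the locale at all.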

BTW, weirdness like this is why it's generally better to use [[:alpha:]] instead of [a-zA-Z] -- depending on the locale, those ranges might correspond to something quite different from what you expect (see this question for an example).



Answered By - Gordon Davisson