Issue
I do some regex checking in Bash to make sure that a string contains only sane characters (only lowercase a-z in this case) and I encountered this strange behavior.
It looks the same in grep and sed.
Python 3.9 behaves as I would expect it.
Am I doing something wrong, or is it a bug? If it is a bug, where should I report it?
Lowercase š is wrongly detected as a character between a-z:
[[ 'š' =~ ^[a-z]$ ]] && echo sane || echo nope
sane
[[ 'š' =~ [a-z] ]] && echo sane || echo nope
sane
grep '^[a-z]$' <<<'š' && echo sane || echo nope
š
sane
sed 's/^[a-z]$/a/' <<<'š'
a
Lowercase ž is correctly detected as not a character between a-z:
EDIT: That is because ž sorts right after z, which puts it outside of a-z.
[[ 'ž' =~ ^[a-z]$ ]] && echo sane || echo nope
nope
[[ 'ž' =~ [a-z] ]] && echo sane || echo nope
nope
grep '^[a-z]$' <<<'ž' && echo sane || echo nope
nope
sed 's/^[a-z]$/a/' <<<'ž'
ž
Capital Š is correctly detected as not a character between a-z:
[[ 'Š' =~ ^[a-z]$ ]] && echo sane || echo nope
nope
[[ 'Š' =~ [a-z] ]] && echo sane || echo nope
nope
grep '^[a-z]$' <<<'Š' && echo sane || echo nope
nope
sed 's/^[a-z]$/a/' <<<'Š'
Š
Capital Š is wrongly detected as a character between A-Z:
[[ 'Š' =~ ^[A-Z]$ ]] && echo sane || echo nope
sane
[[ 'Š' =~ [A-Z] ]] && echo sane || echo nope
sane
grep '^[A-Z]$' <<<'Š' && echo sane || echo nope
Š
sane
sed 's/^[A-Z]$/A/' <<<'Š'
A
My bash version:
GNU bash, version 5.1.8(1)-release (x86_64-redhat-linux-gnu)
My grep version:
grep (GNU grep) 3.6
My sed version:
sed (GNU sed) 4.8
My locale:
locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
Python:
python3 -c 'import re ;print("sane" if re.match(r"^[a-z]$", "š") else "nope")'
nope
python3 -c 'import re ;print("sane" if re.match(r"^[a-z]$", "s") else "nope")'
sane
EDIT:
As @oguz-ismail pointed out, ž was just a badly chosen outlier (literally), as it sorts after z.
The behavior looks consistent for characters that fall between a and z in alphabetical order, like š and č.
To get rid of them all, I had to set LC_ALL=C.
[[ 'č' =~ ^[a-z]$ ]] && echo sane || echo nope
sane
LC_CTYPE=C
# or stronger: LC_ALL=C
[[ 'č' =~ ^[a-z]$ ]] && echo sane || echo nope
nope
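The override above changes the locale for the rest of the shell session. As a minimal sketch, assuming GNU grep as in the versions listed earlier, the C locale can instead be scoped to a single command. The helper name only_ascii_lower is made up for illustration.

```bash
#!/usr/bin/env bash
# only_ascii_lower is a hypothetical helper name, not a Bash builtin.
# Assumes single-line input: grep -q succeeds if any one line matches.
only_ascii_lower() {
    # LC_ALL=C applies to this grep invocation only. In the C locale,
    # [a-z] covers the ASCII letters a through z and nothing else.
    LC_ALL=C grep -q '^[a-z][a-z]*$' <<<"$1"
}

only_ascii_lower 'abc' && echo sane || echo nope   # prints: sane
only_ascii_lower 'č'   && echo sane || echo nope   # prints: nope
```

The locale variables of the calling shell are left untouched, so other commands keep their normal sorting and messages.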
My last question is whether the [a-z] range is expected to match letters with diacritics.
(Definitely not expected by me.)
Solution
Am I doing something wrong or is it a bug?
You are doing something that is locale-sensitive, and whose behavior may not be specified by POSIX. The observed behavior probably is not buggy.
Bash's =~ operator uses POSIX extended regular expressions, and POSIX leaves the behavior of range expressions inside bracket expressions unspecified except in the POSIX locale. In the POSIX locale (and maybe elsewhere), the meaning of a range expression depends on the collation order in effect. It is my understanding that in locales for languages and regions where letters with diacritical marks are in common use, such letters are often collated together with the corresponding base letter. The behavior you describe is consistent with such a collation order: if š collates between a and z, then [a-z] can match it.
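One way to inspect the collation order a locale actually uses is to sort the letters in question. This is only a sketch: show_order is a made-up helper name, and the second call assumes the en_US.UTF-8 locale from the question is installed.

```bash
#!/usr/bin/env bash
# Four letters from the question, deliberately out of order.
letters=$'z\nš\na\nž'

# show_order is a hypothetical helper: sort the letters under the given
# locale and print them on one line.
show_order() {
    LC_ALL=$1 sort <<<"$letters" | tr '\n' ' '
    echo
}

# Byte order: both two-byte UTF-8 letters land after z.
show_order C

# Assumption: en_US.UTF-8 is installed. If it is not, sort falls back to
# the C locale and both lines look the same.
show_order en_US.UTF-8
```

If the second line places š before z, that matches the results in the question, where [a-z] accepts š but rejects ž.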
If you want to match against the characters mapped by ASCII (and Unicode) to code points 0x61 - 0x7A, and only those, regardless of locale, then the most reliable way to spell that is to list all the matching characters individually:
[[ 'š' =~ ^[abcdefghijklmnopqrstuvwxyz]$ ]] && echo sane || echo nope
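For checking whole strings rather than a single character, the same explicit list can be wrapped in a small function. This is a sketch, not a definitive implementation: the name is_ascii_lower is hypothetical, and the + quantifier (one or more letters) is an assumption about what counts as a sane string.

```bash
#!/usr/bin/env bash
# is_ascii_lower is a hypothetical helper name, not part of Bash.
# Succeeds only if the argument consists of one or more of the 26 ASCII
# lowercase letters.
is_ascii_lower() {
    # Listing every letter avoids a range expression, so the collation
    # order of the current locale does not matter.
    local re='^[abcdefghijklmnopqrstuvwxyz]+$'
    [[ $1 =~ $re ]]
}

for s in abc 'š' 'č' Abc; do
    if is_ascii_lower "$s"; then
        printf '%s: sane\n' "$s"
    else
        printf '%s: nope\n' "$s"
    fi
done
```

Keeping the pattern in a variable avoids quoting surprises on the right-hand side of =~ and makes the letter list easy to extend, for example with digits.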
Answered By - John Bollinger Answer Checked By - Cary Denson (WPSolving Admin)