Issue
I had a special character (Â) which caused an error when using sed.
echo $'H\xc3\x82Bnc' | sed -E 's/[A-Z]*/`&`/g'
sed: RE error: illegal byte sequence
$ locale
LANG=""
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
Now if I set LC_COLLATE, then the command works.
echo $'H\xc3\x82Bnc' | LC_COLLATE="en_US.UTF-8" sed -E 's/[A-Z]*/`&`/g'
`HÂB`n``c``
Is LC_COLLATE affecting the character range? Why after LC_COLLATE was set, did the code produce no error of illegal byte sequence? I'm using macOS, and its sed comes from FreeBSD.
Solution
LC_COLLATE. Is it also used for setting character range?
Yes.
Is LC_COLLATE affecting the character range?
Yes.
Why after LC_COLLATE was set, did the code produce no error of illegal byte sequence?
Because you set LC_COLLATE to a UTF-8 locale (en_US.UTF-8) instead of C.
LC_COLLATE sets the collation order, i.e. how characters and strings are compared, and a range such as [A-Z] is evaluated against that order. Earlier LC_COLLATE was set to C, so sed couldn't compare Â with other characters; after setting LC_COLLATE to en_US.UTF-8, it could compare them successfully.
Also, the backticks (`) wrapped the whole of HÂB because accented characters are sometimes treated the same as their base characters for collation (so A and Â were the same here).
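If you would rather have [A-Z] mean only the 26 ASCII capitals, a common workaround is to force the C locale for the sed call alone. This is a minimal sketch, not part of the original question: `wrap_upper` is a hypothetical helper name, and I use `+` instead of `*` so that no empty matches are produced. LC_ALL takes precedence over both LC_CTYPE and LC_COLLATE, so sed should compare plain bytes and the two bytes of Â are passed through untouched.

```shell
# Hypothetical helper: wrap each run of ASCII capitals in backticks.
# LC_ALL=C overrides LC_CTYPE and LC_COLLATE for this one command only,
# so sed works on single bytes and A-Z covers just the ASCII capitals.
wrap_upper() {
    LC_ALL=C sed -E 's/[A-Z]+/`&`/g'
}

# \303\202 is the octal form of the UTF-8 bytes of Â
# (the same bytes as \xc3\x82 in the question).
printf 'H\303\202Bnc\n' | wrap_upper
```

With this, H and B are each wrapped on their own and Â is left alone, so the result is `H`Â`B`nc rather than `HÂB`. Whether you want that or the en_US.UTF-8 behaviour depends on whether accented letters should count as part of the range.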
Answered By - KamilCuk
Answer Checked By - Katrina (WPSolving Volunteer)