Issue
I have run grep with the following regex:
grep -e "^[a-zA-Z]" file.txt
The point is to get only lines that start with an alphabetic character in the ASCII range. That works if I explicitly type out the alphabet, like
grep -e "^[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]" file.txt
which is odd already, because that's what [a-zA-Z] is supposed to specify. When I look at the matches the first regex produces on my input data, I get lines starting with characters like:
ﬁ
ﬂ
🅱
Notice that ﬁ and ﬂ are each a single character (a ligature) in these cases.
Technically, typing out the alphabet explicitly is a solution, but I'd rather
- know why [a-zA-Z] doesn't work
- if a sensible solution exists, see what that'd look like.
Solution
grep is locale-aware. Depending on your locale, the range expression [a-zA-Z] can match non-ASCII characters (e.g. á, ä, ø, æ). To force ASCII-only matching (treating the input as bytes rather than multibyte characters), set the C locale:
LC_ALL=C grep -e '^[a-zA-Z]' file.txt
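To check this on your own machine, here is a minimal sketch. The sample file contents and the variable names (sample, matches) are made up for illustration and are not from the question; it also assumes a reasonably recent grep that treats every byte as a valid character in the C locale.

```shell
# Build a small sample file (hypothetical contents, for illustration only).
sample=$(mktemp)
printf '%s\n' 'apple' 'Zebra' 'ﬁsh' 'ﬂag' '🅱eta' '1234' > "$sample"

# In the C locale grep compares raw bytes, so a-z and A-Z cover only the
# ASCII letters; lines starting with a multibyte character are skipped.
matches=$(LC_ALL=C grep -e '^[a-zA-Z]' "$sample")
printf '%s\n' "$matches"

# Clean up the temporary file.
rm -f "$sample"
```

With that input, only apple and Zebra should be printed. Setting LC_ALL on the command line affects just that one grep invocation, so the rest of your shell session keeps its usual locale.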
Answered By - knittl Answer Checked By - David Marino (WPSolving Volunteer)