Issue
I'm building a wordlist and wanted to remove words containing diacritics (other than the German umlauts and ß) from the file using
sed -i -E '/[^a-zA-ZäöüÄÖÜß]/d' wordlist.txt
However, that does not remove e.g. André, and I fail to understand why. Also, grep
does not output this line. What am I missing?
Solution
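What you are fundamentally missing is that this depends first on your locale and on Unicode normalization, and secondly on whether your sed supports those two facilities. The locale decides whether a non-ASCII letter is treated as one character or as a sequence of bytes, and normalization matters because é can be stored either as a single code point (U+00E9) or as a plain e followed by a combining acute accent (U+0301).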
To reliably filter out lines which contain characters other than the ones in your list, consider switching to a tool which portably and reliably supports all of these Unicode concepts.
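perl -CSD -Mutf8 -nle 'print unless /[^a-zA-ZäöüÄÖÜß]/' wordlist.txt
Here -CSD makes Perl treat the standard streams and any files it opens as UTF-8, and -Mutf8 tells it that the script itself, including the non-ASCII letters in the regex, is written in UTF-8.
Somewhat paradoxically, Perl is almost certain to be installed, whereas your system-installed sed may or may not support these facilities.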
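To make the normalization point concrete, here is a minimal sketch in Python rather than sed or Perl. It is an illustration only, not the answer's method, and it assumes the wordlist is UTF-8 text with one word per line. The names ALLOWED, keep_word and filter_words are made up for this example. Each word is composed to NFC before it is checked, so a decomposed "a + combining diaeresis" is still recognized as ä.

```python
import unicodedata

# Allowed letters: ASCII letters plus the German umlauts and ß.
# The literal is normalized to NFC in case an editor saved it decomposed.
ALLOWED = set(unicodedata.normalize(
    "NFC",
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZäöüÄÖÜß",
))


def keep_word(word: str) -> bool:
    """Return True if, after NFC composition, every character is allowed."""
    composed = unicodedata.normalize("NFC", word)
    return all(ch in ALLOWED for ch in composed)


def filter_words(lines):
    """Yield the NFC-normalized words that contain only allowed letters."""
    for line in lines:
        word = unicodedata.normalize("NFC", line.rstrip("\n"))
        if keep_word(word):
            yield word


if __name__ == "__main__":
    # "Ba\u0308r" is "Bär" in decomposed form; "Andre\u0301" is decomposed "André".
    sample = ["Haus\n", "Andr\u00e9\n", "Ba\u0308r\n", "Andre\u0301\n"]
    for word in filter_words(sample):
        print(word)
```

Without the normalize call, the decomposed "Bär" would be rejected, because the combining diaeresis U+0308 is not in the allowed set. To run this over a real file, you could pass open("wordlist.txt", encoding="utf-8") to filter_words, assuming the file really is UTF-8.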
Answered By - tripleee Answer Checked By - David Goodson (WPSolving Volunteer)