Issue
I need to read all the words from a file into a variable. In addition, I need to store each word only once. The selection is not case sensitive, so "Hello", "hello", "hElLo" and "HELLO" count as the same word. If a word has an apostrophe, like "it's", the "'s" must be ignored and only "it" counted as a word.
To do that I used the following command:
# Store the words of the file without duplicates
WORDS=`grep -o -E '\w+' "$1" | sort -u -f`
The first two criteria are met, but this method counts words like "it's" as two separate words, "it" and "s".
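The behaviour can be reproduced with a small sketch. The temporary file and its contents are assumptions made up for illustration, and $(...) is used in place of backticks:

```shell
# Hypothetical sample input, written to a temporary file for illustration only
sample=$(mktemp)
printf "Hello hello it's HELLO\n" > "$sample"

# The original pipeline: \w+ cannot match an apostrophe,
# so "it's" is split into the two tokens "it" and "s"
WORDS=$(grep -o -E '\w+' "$sample" | sort -u -f)
printf '%s\n' "$WORDS"

rm -f "$sample"
```

The result holds three entries: one spelling of "hello", plus "it" and "s". Which spelling of "hello" survives depends on sort and on the input order.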
Solution
Maybe something like this:
WORDS=$(grep -o -E "(\w|')+" words.txt | sed -e "s/'.*\$//" | sort -u -f)
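As a quick check, here is the same pipeline run against a made-up input. The temporary file and its contents are assumptions for illustration, not part of the original question:

```shell
# Hypothetical sample input; contents are chosen only to exercise the pipeline
sample=$(mktemp)
printf "It's a test. Hello, hello! don't stop\n" > "$sample"

# Same pipeline as above, reading the sample file instead of words.txt
WORDS=$(grep -o -E "(\w|')+" "$sample" | sed -e "s/'.*\$//" | sort -u -f)
printf '%s\n' "$WORDS"

rm -f "$sample"
```

Here "It's" is reduced to "It" and "don't" to "don", and no stray "s" or "t" entry appears. One caveat of this approach: a word wrapped in single quotes, such as 'hello', is reduced to an empty line, because the sed step removes everything from the first apostrophe onward.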
UPDATE
Explanations:
- var=$(...command...): executes the command and stores its standard output in the variable var. This is the newer and better form of `...command...`.
- grep -o -E "(\w|')+" words.txt: reads the file words.txt and applies the grep filter. The filter prints only the matched tokens (-o) of the extended (-E) regular expression (\w|')+. This expression matches word characters (\w, a synonym of [_[:alnum:]]; alnum stands for alphanumeric characters, such as [0-9a-zA-Z] in English, extended to many other characters for other languages) or (|) a single quote ('), one or more times (+). See man grep.
- The standard output of grep becomes the standard input of the next command, sed, through the pipe (|).
- sed -e "s/'.*\$//": executes (-e) the expression s/'.*\$//. This sed expression substitutes (s/) '.*\$ (a single quote followed by zero or more characters up to the end of the line) with the empty string (between the last two slashes, //). The backslash only protects $ from the shell inside double quotes, so sed receives s/'.*$//. See man sed.
- The standard output of sed becomes the standard input of the next command, sort, through the pipe (|).
- sort -u -f: sorts the output of sed, removes duplicates (-u: unique) and makes no difference between upper and lower case (-f). See man sort.
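The stages described above can be observed one at a time. This is only a sketch: the input string and the stage1, stage2 and stage3 variable names are made up for illustration.

```shell
# Hypothetical one-line input, used only to show each stage of the pipeline
input="Hello it's hello"

# Stage 1: grep prints one token per line and keeps the apostrophe inside "it's"
stage1=$(printf '%s\n' "$input" | grep -o -E "(\w|')+")

# Stage 2: sed removes the apostrophe and everything after it on each line
stage2=$(printf '%s\n' "$stage1" | sed -e "s/'.*\$//")

# Stage 3: sort orders the lines and drops duplicates, ignoring case
stage3=$(printf '%s\n' "$stage2" | sort -u -f)

printf 'after grep:\n%s\n' "$stage1"
printf 'after sed:\n%s\n' "$stage2"
printf 'after sort:\n%s\n' "$stage3"
```

After the last stage only two entries remain: one spelling of "hello" and "it".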
Answered By - Arnaud Valmary