Issue
I'm running Cygwin under Windows 10.
I have a dictionary file (1-dictionary.txt) that looks like this:
labelling labeling
flavour flavor
colour color
organisations organizations
végétales v&#x00E9;g&#x00E9;tales
contrôlée contr&#x00F4;l&#x00E9;e
" &#x0022;
The separators between the two columns are TABs (\t).
The dictionary file is encoded as UTF-8.
I want to replace the words and symbols in the first column with the words and HTML entities in the second column.
My source file (2-source.txt) contains the target UTF-8 and ASCII symbols. The source file is also encoded as UTF-8.
Sample text looks like this:
Cultivar was coined by Bailey and it is generally regarded as a portmanteau of "cultivated" and "variety" ... The International Union for the Protection of New Varieties of Plants (UPOV - French: Union internationale pour la protection des obtentions végétales) offers legal protection of plant cultivars ...Terroir is the basis of the French wine appellation d'origine contrôlée (AOC) system
I run the following sed one-liner in a shell script (./3-script.sh):
sed -f <(sed -E 's_(.+)\t(.+)_s/\1/\2/g_' 1-dictionary.txt) 2-source.txt > 3-translation.txt
The substitution of English (en-GB) words with American (en-US) words in 3-translation.txt is successful.
However, the substitution of ASCII symbols, such as the quote symbol, and of UTF-8 words produces this result:
vvégétales#x00E9;gvégétales#x00E9;tales)
contrcontrôlée#x00F4;lcontrôlée#x00E9;e (AOC)
If I use only the specific symbol (not the full word), I get results like this:
vé#x00E9;gé#x00E9;tales
"#x0022cultivated"#x0022
contrô#x00F4;lé#x00E9;e
The ASCII quote symbol has &#x0022; appended to it; it is not replaced.
Similarly, each UTF-8 symbol has its HTML entity appended to it instead of being replaced by that entity.
The expected output would look like this:
v&#x00E9;g&#x00E9;tales
&#x0022;cultivated&#x0022;
contr&#x00F4;l&#x00E9;e
How can I modify the sed script so that the target ASCII and UTF-8 symbols are replaced with their HTML entity equivalents as defined in the dictionary file?
Solution
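I tried it: replacing every & with \& in your 1-dictionary.txt solves the problem.
In the replacement part of sed's s command, an unescaped & stands for the whole matched text, which is why the match was being inserted in front of #x00E9; instead of a literal ampersand.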
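If you would rather not edit the dictionary by hand, the escaping can probably be done on the fly while the rules are generated. The following is only a minimal sketch, not a tested drop-in replacement for 3-script.sh: the temporary directory, the two sample rows and the variable names are assumptions made up for illustration. It uses a temporary rules file and a literal TAB instead of process substitution and \t, so it should also run in a plain POSIX shell.

```shell
#!/bin/sh
# Sketch: escape '&' in the dictionary on the fly, then build the sed rules.
# All file and variable names below are hypothetical.
tab=$(printf '\t')
workdir=$(mktemp -d)

# Two sample dictionary rows: first column, TAB, second column.
printf '%s\t%s\n' 'colour' 'color' '"' '&#x0022;' > "$workdir/dict.txt"

# Step 1: turn every '&' into '\&' so sed treats it as a literal ampersand
#         (like the manual edit, this touches both columns).
# Step 2: turn each "from<TAB>to" row into an "s/from/to/g" command.
sed -e 's/&/\\&/g' \
    -e "s/^\(.*\)${tab}\(.*\)\$/s\/\1\/\2\/g/" \
    "$workdir/dict.txt" > "$workdir/rules.sed"

# Apply the generated rules to one sample line.
result=$(printf '%s\n' 'the "colour" red' | sed -f "$workdir/rules.sed")
printf '%s\n' "$result"

rm -r "$workdir"
```

With these two sample rows the script should print: the &#x0022;color&#x0022; red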
sed's s command treats the from part as a regular expression, so watch for regex metacharacters in the first column and put a \ in front of them so that they match literally.
The to part has special characters too, mainly \ and &; put an extra \ in front of those as well.
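To illustrate escaping both parts, here is a rough sketch that escapes each column separately before building the rules. The helper names escape_from and escape_to and the sample rows are hypothetical, and the character sets assume basic regular expressions (sed without -E) with / as the delimiter; treat it as a starting point under those assumptions rather than a complete solution.

```shell
#!/bin/sh
# Sketch: escape both dictionary columns before turning them into sed rules.
# escape_from / escape_to are hypothetical helper names.

# Search column: escape BRE metacharacters and the '/' delimiter.
escape_from() {
    printf '%s\n' "$1" | sed 's|[[\.*^$/]|\\&|g'
}

# Replacement column: escape '\', '&' and the '/' delimiter.
escape_to() {
    printf '%s\n' "$1" | sed 's|[\&/]|\\&|g'
}

rules=$(mktemp)

# Sample rows; a real run would read 1-dictionary.txt instead.
printf '%s\t%s\n' 'e.g.' 'for example' 'R&D' 'R&amp;D' |
while IFS="$(printf '\t')" read -r from to; do
    printf 's/%s/%s/g\n' "$(escape_from "$from")" "$(escape_to "$to")"
done > "$rules"

# Without the escaping, 'e.g.' would also match 'eggs'.
escaped_output=$(printf '%s\n' 'R&D, e.g. eggs' | sed -f "$rules")
printf '%s\n' "$escaped_output"

rm -f "$rules"
```

For the sample line this should print: R&amp;D, for example eggs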
The above refers to the GNU sed documentation; for other sed versions, you can also check man sed.
Answered By - Tiw Answer Checked By - Mary Flores (WPSolving Volunteer)