Issue
I have a WMT17 training dataset with 3,961,179 lines.
From these lines I would like to augment 198,058 random lines, e.g. by inserting \tbewegen (\t is a tab character) at the end of each line containing the word "move".
The word "move" can be anywhere in the sentence, and it is a substring of sentences like
1. There was more behind this move than simply wishing to expand their product portfolio .
2. move and collect miles
3. January 16 - Pro@@ hi@@ bition begins in USA . Many li@@ qu@@ or @-@ lo@@ ving Americans move to France .
.
.
.
If the substring "move" appears in a line, the augmented line should look like this:
1. There was more behind this move than simply wishing to expand their product portfolio .\tbewegen
2. move and collect miles\tbewegen
3. January 16 - Pro@@ hi@@ bition begins in USA . Many li@@ qu@@ or @-@ lo@@ ving Americans move to France .\tbewegen
.
.
.
For this I already made a script, but I found that augmenting 10 lines takes about 2 minutes, so 198,058 lines would take about 39,611 minutes.
Here is my bash script:
sed -n '=' train.de | shuf | head -198058 > lines
cat lines | while IFS= read -r line; do
    sed -i.bak "${line}{/move/s/\$/\tbewegen/}" train.de
done
Is there a way to shorten the process so that I don't have to wait several days?
Update: Suppose I want to apply the insert-before/after operations from https://www.golinuxhub.com/2017/06/sed-insert-word-after-match-in-middle/. How would I rewrite the awk code in the solution?
Edit:
You can randomly insert a word before or after a matched word with these commands:
awk -i inplace '(NR==FNR){a[$1];next}
(FNR in a){gsub(/\<the\>/,"Before &")}
1
' <(shuf -n 198058 -i 1-$(wc -l < n_train)) n_train
awk -i inplace '(NR==FNR){a[$1];next}
(FNR in a){gsub(/\<the\>/,"& After")}
1
' <(shuf -n 198058 -i 1-$(wc -l < n_train)) n_train
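As a quick self-contained check of the before-insertion variant (writing to stdout instead of in place; the file names and contents below are invented for illustration, and plain /the/ stands in for the GNU-awk-specific \<the\> word boundary):

```shell
# Toy stand-ins for n_train and the file of random line numbers.
printf 'the cat sat\nthe dog ran\n' > toy.txt
printf '1\n' > nums.txt

# Line 1 is in the lookup table, so every occurrence of "the" on it
# gets "Before " inserted in front; line 2 is printed unchanged.
awk '(NR==FNR){a[$1];next}
(FNR in a){gsub(/the/,"Before &")}
1
' nums.txt toy.txt > toy.out
```

Redirecting to toy.out rather than using -i inplace makes it easy to diff against the original.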
Solution
The following command should help you here. It first reads a sequence of random line numbers and then processes the file. It does not modify the file in place but prints the result to standard output; a redirection will save it to a file.
awk '(NR==FNR){a[$1];next}
(FNR in a) && /\<move\>/ {$0=$0 "\tbewegen"}
1
' <(shuf -n 198058 -i 1-$(wc -l < train.de)) train.de
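To try the command on a toy file first (sample.txt and its contents are made up; plain /move/ replaces the GNU-awk-specific \<move\> so the sketch also runs under mawk, and all 3 line numbers are drawn so the result is reproducible):

```shell
# A three-line stand-in for train.de.
printf 'move and collect miles\nno match here\nplease move along\n' > sample.txt

# Draw line numbers into a file; -n 3 out of 3 lines selects every line,
# which makes this demo deterministic.
shuf -n 3 -i 1-$(wc -l < sample.txt) > picks.txt

# Append a tab plus "bewegen" to selected lines containing "move";
# print all other lines unchanged.
awk '(NR==FNR){a[$1];next}
(FNR in a) && /move/ {$0=$0 "\tbewegen"}
1' picks.txt sample.txt > sample.aug.txt
```

Lines 1 and 3 come out tagged with a trailing tab and "bewegen"; line 2, which does not contain "move", is left as-is.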
This consists of a couple of commands:
1. get a random selection of line numbers:
shuf -n 198058 -i 1-$(wc -l < train.de)
This generates a random selection of 198,058 distinct numbers in the range 1-N, where N is the total number of lines in the file train.de, as given by wc -l < train.de (equivalently awk 'END{print NR}' train.de). This replaces the first line of your script:
sed -n '=' train.de | shuf | head -198058 > lines
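For instance, drawing 2 distinct numbers from the smaller range 1-5:

```shell
# shuf without -r samples without replacement, so the two numbers are distinct.
shuf -n 2 -i 1-5
```

The exact numbers differ per run, but there are always 2 of them and no duplicates.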
2. use awk to do the rest:
awk '(NR==FNR){a[$1];next}(FNR in a) && /\<move\>/{$0=$0 "\tbewegen"}1' file1 file2
We use awk to read the input of file1 (the output of shuf) and store all of its values in an array a, which is used as a lookup table. Once the first file has been read, awk moves on to the second file; for each of its lines we take the record number within that file, FNR, and check whether it is in the lookup table a. If it is, we further check whether the line contains the word "move". If both conditions are met, we append \tbewegen to that line.
You can now store this output in a new file.
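The two-file lookup idiom can also be seen in isolation on a toy example (file names and contents invented):

```shell
printf '2\n' > keep.txt
printf 'alpha\nbeta\ngamma\n' > data.txt

# While keep.txt is being read, NR==FNR holds and its values fill array a.
# For data.txt, FNR restarts at 1, so (FNR in a) is true only for line 2.
awk '(NR==FNR){a[$1];next}
(FNR in a){print FNR": "$0}' keep.txt data.txt
```

This prints only the selected line, prefixed with its line number ("2: beta"), which makes the selection mechanism easy to inspect before adding the /move/ condition.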
This will run much faster than the previous version, as it reads the file only twice, whereas your script read it 198,059 times (once to number the lines and once per sed -i invocation).
Answered By - kvantour
Answer Checked By - Marilyn (WPSolving Volunteer)