Issue
I have a WMT17 training dataset with 3,961,179 lines.
From these lines I would like to augment 198,058 random lines, e.g. by inserting \tbewegen (\t is a tab character) at the end of each line containing the word "move".
The word "move" can be anywhere in the sentence, and it is a substring of sentences like
1. There was more behind this move than simply wishing to expand their product portfolio .
2. move and collect miles
3. January 16 - Pro@@ hi@@ bition begins in USA . Many li@@ qu@@ or @-@ lo@@ ving Americans move to France .
.
.
.
If the substring "move" appears in a line, the augmented line should look like this:
1. There was more behind this move than simply wishing to expand their product portfolio .\tbewegen
2. move and collect miles\tbewegen
3. January 16 - Pro@@ hi@@ bition begins in USA . Many li@@ qu@@ or @-@ lo@@ ving Americans move to France .\tbewegen
.
.
.
For this I already made a script, but I found that augmenting 10 lines takes about 2 minutes, so 198,058 lines would take about 39,611 minutes.
Here is my bash script:
sed -n '=' train.de | shuf | head -198058 > lines
cat lines | while IFS= read -r line; do
    sed -i.bak "${line}{/move/s/\$/\tbewegen/}" train.de
done
Is there a way to shorten the process so that I don't have to wait several days?
Update: Suppose I want to apply the insert-before/after operations from https://www.golinuxhub.com/2017/06/sed-insert-word-after-match-in-middle/. How would I rewrite the awk code in the solution?
Edit:
You can randomly insert a word before or after a matched word with these commands:
awk -i inplace '(NR==FNR){a[$1];next}
(FNR in a){gsub(/\<the\>/,"Before &")}
1
' <(shuf -n 198058 -i 1-$(wc -l < n_train)) n_train
awk -i inplace '(NR==FNR){a[$1];next}
(FNR in a){gsub(/\<the\>/,"& After")}
1
' <(shuf -n 198058 -i 1-$(wc -l < n_train)) n_train
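As a quick self-contained check of the before-insertion variant (writing to stdout instead of in place; the file names and contents below are invented for illustration, and plain /the/ stands in for the GNU-awk-specific \<the\> word boundary):

```shell
# Toy stand-ins for n_train and the file of random line numbers.
printf 'the cat sat\nthe dog ran\n' > toy.txt
printf '1\n' > nums.txt

# Line 1 is in the lookup table, so every occurrence of "the" on it
# gets "Before " inserted in front; line 2 is printed unchanged.
awk '(NR==FNR){a[$1];next}
(FNR in a){gsub(/the/,"Before &")}
1
' nums.txt toy.txt > toy.out
```

Redirecting to toy.out rather than using -i inplace makes it easy to diff against the original.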
Solution
The following command should help you here. It first reads a sequence of random line numbers and then processes the file. It does not modify the file in place but prints the result to standard output; a redirection will save it to a file.
awk '(NR==FNR){a[$1];next}
(FNR in a) && /\<move\>/ {$0=$0 "\tbewegen"}
1
' <(shuf -n 198058 -i 1-$(wc -l < train.de)) train.de
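To try the command on a toy file first (sample.txt and its contents are made up; plain /move/ replaces the GNU-awk-specific \<move\> so the sketch also runs under mawk, and all 3 line numbers are drawn so the result is reproducible):

```shell
# A three-line stand-in for train.de.
printf 'move and collect miles\nno match here\nplease move along\n' > sample.txt

# Draw line numbers into a file; -n 3 out of 3 lines selects every line,
# which makes this demo deterministic.
shuf -n 3 -i 1-$(wc -l < sample.txt) > picks.txt

# Append a tab plus "bewegen" to selected lines containing "move";
# print all other lines unchanged.
awk '(NR==FNR){a[$1];next}
(FNR in a) && /move/ {$0=$0 "\tbewegen"}
1' picks.txt sample.txt > sample.aug.txt
```

Lines 1 and 3 come out tagged with a trailing tab and "bewegen"; line 2, which does not contain "move", is left as-is.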
This consists of a couple of commands:
1. get a random selection of line numbers:
shuf -n 198058 -i 1-$(wc -l < train.de)
This generates a random selection of 198,058 distinct numbers in the range 1-N, where N is the total number of lines in the file train.de, as given by wc -l < train.de (equivalently awk 'END{print NR}' train.de). This replaces the first line of your script:
sed -n '=' train.de | shuf | head -198058 > lines
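For instance, drawing 2 distinct numbers from the smaller range 1-5:

```shell
# shuf without -r samples without replacement, so the two numbers are distinct.
shuf -n 2 -i 1-5
```

The exact numbers differ per run, but there are always 2 of them and no duplicates.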
2. use awk to do the rest:
awk '(NR==FNR){a[$1];next}(FNR in a) && /\<move\>/{$0=$0 "\tbewegen"}1' file1 file2
We use awk to read the input of file1 (the output of shuf) and store all of its values in an array a, which is used as a lookup table. Once the first file has been read, awk moves on to the second file; for each of its lines we take the record number within that file, FNR, and check whether it is in the lookup table a. If it is, we further check whether the line contains the word "move". If both conditions are met, we append \tbewegen to that line.
You can now store this output in a new file.
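The two-file lookup idiom can also be seen in isolation on a toy example (file names and contents invented):

```shell
printf '2\n' > keep.txt
printf 'alpha\nbeta\ngamma\n' > data.txt

# While keep.txt is being read, NR==FNR holds and its values fill array a.
# For data.txt, FNR restarts at 1, so (FNR in a) is true only for line 2.
awk '(NR==FNR){a[$1];next}
(FNR in a){print FNR": "$0}' keep.txt data.txt
```

This prints only the selected line, prefixed with its line number ("2: beta"), which makes the selection mechanism easy to inspect before adding the /move/ condition.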
This will run much faster than the previous version, as it reads the file only twice, whereas your script read it 198,059 times (once to number the lines and once per sed -i invocation).
Answered By - kvantour
Answer Checked By - Marilyn (WPSolving Volunteer)