Issue
I'm trying to find a way to speed up a delete process.
Currently I have two files, file1.txt and file2.txt.
file1 contains 20-digit records, roughly 10,000 lines. file2 contains records 6,500 characters long, roughly 2 million lines.
My goal is to delete the lines in file2 that match records in file1.
To do this I create a sed script from the record lines of the first file, like this:
File1:
/^20606516000100070004/d
/^20630555000100030001/d
/^20636222000800050001/d
Command used: sed -i -f file1 file2
The command works fine, but it takes about 4 hours to delete the 10,000 lines from file2.
I'm looking for a solution that can speed up the delete process.
Additional information:
Each record of file1 is in file2 for sure! Lines of file2 always start with a 20-digit number that may or may not match one of the records contained in file1.
To illustrate the point above, here is a line from file2 (this is not the entire line; as explained, each record of file2 is 6,500 characters long):
20606516000100070004XXXXXXX19.202107.04.202105.03.202101.11.202001.11.2020WWREABBXBOU
Thanks in advance.
Solution
All you need is this, using any awk in any shell on every Unix box:
awk 'NR==FNR{a[$0]; next} !(substr($0,1,20) in a)' file1 file2
and with files such as you described, on a reasonable processor it'll run in a couple of seconds rather than 4 hours.
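For readers unfamiliar with the `NR==FNR` idiom, here is the same command expanded with comments (functionally identical to the one-liner above, just formatted for readability):

```shell
# NR is the total line count across all input files; FNR resets per file,
# so NR==FNR is true only while reading the first file (file1).
# For each file1 line, create a key in array "a" and skip the rest.
# For file2 lines, the pattern is true (and the line is printed, awk's
# default action) only when its first 20 characters are NOT a key in "a".
awk '
    NR == FNR { a[$0]; next }
    !(substr($0, 1, 20) in a)
' file1 file2
```

This is fast because the array lookup is a hash-table test per line, instead of sed's 10,000 regex matches per line.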
Just make sure file1 only contains the numbers you want to match on, not a sed script using those numbers, e.g.:
$ head file?
==> file1 <==
20606516000100070004
20630555000100030001
20636222000800050001
==> file2 <==
20606516000100070004XXXXXXX19.202107.04.202105.03.202101.11.202001.11.2020WWREABBXBOU
99906516000100070004XXXXXXX19.202107.04.202105.03.202101.11.202001.11.2020WWREABBXBOU
$ awk 'NR==FNR{a[$0]; next} !(substr($0,1,20) in a)' file1 file2
99906516000100070004XXXXXXX19.202107.04.202105.03.202101.11.202001.11.2020WWREABBXBOU
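Note that unlike `sed -i`, awk prints the result to stdout rather than editing in place. To replace file2 with the filtered output, redirect to a temporary file and move it over the original (a sketch; `file2.new` is an arbitrary name):

```shell
# Write the surviving lines to a temp file, then atomically replace
# file2 only if awk succeeded.
awk 'NR==FNR{a[$0]; next} !(substr($0,1,20) in a)' file1 file2 > file2.new &&
mv file2.new file2
```

(GNU awk 4.1+ also offers `gawk -i inplace` if you prefer in-place editing.)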
Answered By - Ed Morton