Issue
I have to extract lines from file1 corresponding to a list of words in file2.
I'm wondering what's the difference between doing:
while read line; do grep "${line}" file1; done < file2 > output
while read line; do grep "${line}" file1 >> output; done < file2
Which one is the correct and fastest?
Is there any other faster way of doing this than a loop?
Both of the files I'm working with are huge: 536864856 and 1947 lines for file1 and file2, respectively.
file1 (look at $7)
NC_045027.1 29500101 T/A NC_045027.1:29500101 A 101232882 XM_032744187.1 Transcript 3_prime_UTR_variant 2764 - - - - - MODIFIER - -1 - ARL14EPL - - - ZFgenomic_tabixprep_nomiRNA.gff.gz - -
NC_045027.1 29500102 G/A NC_045027.1:29500102 A 101232882 XM_032744187.1 Transcript 3_prime_UTR_variant 2763 - - - - - MODIFIER - -1 - ARL14EPL - - - ZFgenomic_tabixprep_nomiRNA.gff.gz - -
NC_045027.1 29500103 C/A NC_045027.1:29500103 A 101232882 XM_032744187.1 Transcript 3_prime_UTR_variant 2762 - - - - - MODIFIER - -1 - ARL14EPL - - - ZFgenomic_tabixprep_nomiRNA.gff.gz - -
NC_045027.1 29500104 C/A NC_045027.1:29500104 A 101232882 XM_032744187.1 Transcript 3_prime_UTR_variant 2761 - - - - - MODIFIER - -1 - ARL14EPL - - - ZFgenomic_tabixprep_nomiRNA.gff.gz - -
NC_045027.1 29500105 A/C NC_045027.1:29500105 C 101232882 XM_032744187.1 Transcript 3_prime_UTR_variant 2760 - - - - - MODIFIER - -1 - ARL14EPL - - - ZFgenomic_tabixprep_nomiRNA.gff.gz - -
NC_045027.1 29500106 A/C NC_045027.1:29500106 C 101232882 XM_032744187.1 Transcript 3_prime_UTR_variant 2759 - - - - - MODIFIER - -1 - ARL14EPL - - - ZFgenomic_tabixprep_nomiRNA.gff.gz - -
NC_045027.1 29500107 G/A NC_045027.1:29500107 A 101232882 XM_032744187.1 Transcript 3_prime_UTR_variant 2758 - - - - - MODIFIER - -1 - ARL14EPL - - - ZFgenomic_tabixprep_nomiRNA.gff.gz - -
NC_045027.1 29500108 T/A NC_045027.1:29500108 A 101232882 XM_032744187.1 Transcript 3_prime_UTR_variant 2757 - - - - - MODIFIER - -1 - ARL14EPL - - - ZFgenomic_tabixprep_nomiRNA.gff.gz - -
NC_045027.1 29500109 G/A NC_045027.1:29500109 A 101232882 XM_032744187.1 Transcript 3_prime_UTR_variant 2756 - - - - - MODIFIER - -1 - ARL14EPL - - - ZFgenomic_tabixprep_nomiRNA.gff.gz - -
NC_045027.1 29500110 G/A NC_045027.1:29500110 A 101232882 XM_032744187.1 Transcript 3_prime_UTR_variant 2755 - - - - - MODIFIER - -1 - ARL14EPL - - - ZFgenomic_tabixprep_nomiRNA.gff.gz - -
NC_044998.1 16147 C/A NC_044998.1:16147 A 100221041 XM_030285707.2 Transcript 3_prime_UTR_variant 7416 - - - - - MODIFIER - -1 - LOC100221041 - - - ZFgenomic_tabixprep_nomiRNA.gff.gz - -
NC_044998.1 16147 C/A NC_044998.1:16147 A 100221041 XM_030285715.2 Transcript 3_prime_UTR_variant 7234 - - - - - MODIFIER - -1 - LOC100221041 - - - ZFgenomic_tabixprep_nomiRNA.gff.gz - -
NC_044998.1 16147 C/A NC_044998.1:16147 A 100221041 XM_030285720.2 Transcript 3_prime_UTR_variant 7110 - - - - - MODIFIER - -1 - LOC100221041 - - - ZFgenomic_tabixprep_nomiRNA.gff.gz - -
NC_044998.1 16147 C/A NC_044998.1:16147 A 100221041 XM_030285728.2 Transcript 3_prime_UTR_variant 6856 - - - - - MODIFIER - -1 - LOC100221041 - - - ZFgenomic_tabixprep_nomiRNA.gff.gz - -
NC_044998.1 16147 C/A NC_044998.1:16147 A 100221041 XM_030285733.2 Transcript intron_variant -- - - - - MODIFIER - -1 - LOC100221041 - - - ZFgenomic_tabixprep_nomiRNA.gff.gz --
NC_044998.1 16147 C/A NC_044998.1:16147 A 100221041 XM_030285738.2 Transcript 3_prime_UTR_variant 6637 - - - - - MODIFIER - -1 - LOC100221041 - - - ZFgenomic_tabixprep_nomiRNA.gff.gz - -
NC_044998.1 16147 C/A NC_044998.1:16147 A 100221041 XM_030285750.2 Transcript 3_prime_UTR_variant 6348 - - - - - MODIFIER - -1 - LOC100221041 - - - ZFgenomic_tabixprep_nomiRNA.gff.gz - -
NC_044998.1 16147 C/A NC_044998.1:16147 A 100221041 XM_030285760.2 Transcript 3_prime_UTR_variant 7209 - - - - - MODIFIER - -1 - LOC100221041 - - - ZFgenomic_tabixprep_nomiRNA.gff.gz - -
NC_044998.1 16148 A/C NC_044998.1:16148 C 100221041 XM_030285707.2 Transcript 3_prime_UTR_variant 7415 - - - - - MODIFIER - -1 - LOC100221041 - - - ZFgenomic_tabixprep_nomiRNA.gff.gz - -
NC_044998.1 16148 A/C NC_044998.1:16148 C 100221041 XM_030285715.2 Transcript 3_prime_UTR_variant 7233 - - - - - MODIFIER - -1 - LOC100221041 - - - ZFgenomic_tabixprep_nomiRNA.gff.gz - -
file2
XM_032744187.1
XM_030272916.2
XM_032747381.1
XM_030265061.2
XM_030271469.2
XM_030272412.2
XM_032747456.1
Solution
while read line; do grep "${line}" file1; done < file2 > output
while read line; do grep "${line}" file1 >> output; done < file2
Which one is the correct and fastest?
The first one, as it opens the output file only once, whereas >> output inside the loop reopens the output file for each line in file2.
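A minimal sketch of the difference, run in a scratch directory with made-up miniature files (the sample contents and the names out1 and out2 are assumptions for illustration only). Besides the number of times the file is opened, the two forms also differ in that > truncates the output file at the start, while >> keeps whatever an earlier run left behind:

```shell
# Work in a scratch directory so nothing real is overwritten (hypothetical sample data).
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf 'a XM_1\nb XM_2\nc XM_3\n' > file1
printf 'XM_1\nXM_3\n' > file2

# Variant 1: the output file is opened (and truncated) once for the whole loop.
while IFS= read -r line; do grep -F -- "$line" file1; done < file2 > out1

# Variant 2: the output file is reopened in append mode on every iteration.
# Content from an earlier run survives unless the file is removed first.
echo 'stale line' > out2
while IFS= read -r line; do grep -F -- "$line" file1 >> out2; done < file2

wc -l < out1   # 2 lines: the two matches
wc -l < out2   # 3 lines: the stale line plus the two matches
```

Either way, the loop still starts one grep process per line of file2, and each of those scans all of file1, so the redirection choice is a small cost compared with the repeated scans.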
Is there any other faster way of doing this than a loop?
Based on the updated information in the question, this awk command will produce an accurate match, which isn't possible with grep -Ff, because grep matches the word anywhere on the line rather than only in $7. awk will be pretty fast too, as we read only the smaller file's first column into memory before doing a non-regex string comparison against $7 of the second file (file1):
awk 'FNR == NR {seen[$1]; next} $7 in seen' file2 file1 > output
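To see why the exact field comparison matters, here is a small sketch on hypothetical seven-column data (the IDs XM_1.1, XM_1.10 and XM_9.2 and the output names exact and substring are invented for the example). It runs the awk command above next to a fixed-string grep for comparison:

```shell
# Scratch directory with hypothetical miniature data: 7 fields, ID in field 7.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf 'c1 1 T/A c1:1 A 100 XM_1.1\nc1 2 G/A c1:2 A 100 XM_1.10\nc2 3 C/A c2:3 A 200 XM_9.2\n' > file1
printf 'XM_1.1\n' > file2

# While reading file2 (FNR == NR holds only for the first file argument),
# store each ID as an array key. While reading file1, print the line when
# field 7 is exactly one of the stored keys.
awk 'FNR == NR { seen[$1]; next } $7 in seen' file2 file1 > exact

# For comparison: fixed-string grep matches the ID as a substring anywhere on the line.
grep -F -f file2 file1 > substring

cat exact       # only the XM_1.1 line
cat substring   # the XM_1.1 line and the XM_1.10 line
```

The awk version also reads file1 once in total, instead of once per word in file2 as the loop does.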
Answered By - anubhava