Issue
I have a large, tab-delimited file (technically a VCF of genetic variants), call it file.vcf, with millions of lines that look something like this:
locus1 1 15 0 0/0,21,2,2,;0
locus1 2 17 0 0/0,21,2,1,;0
locus2 1 10 0 0/1,21,2,2,;0
locus3 1 2 0 0/1,21,2,1,;0
...
locus123929 1 3 0 1/0,22,2,1,;0
locus123929 2 4 0 1/2,1,1,3,;0
I'd like to subset this original file to include all lines from the loci listed in another file (search-file.txt). For example, if search-file.txt were:
locus1
locus3
locus123929
Then the final output would be:
locus1 1 15 0 0/0,21,2,2,;0
locus1 2 17 0 0/0,21,2,1,;0
locus3 1 2 0 0/1,21,2,1,;0
locus123929 1 3 0 1/0,22,2,1,;0
locus123929 2 4 0 1/2,1,1,3,;0
What is the most efficient way to subset a file this large using either bash or R? (Note: reading the entire file into memory, as in R, is very slow and often crashes the system.)
Solution
I'd use awk:
awk -F'\t' '
NR == FNR { a[$0]; next }   # first file (search-file.txt): store each locus as an array key
$1 in a                     # second file (file.vcf): print lines whose first field is a stored locus
' search-file.txt file.vcf > filtered_file
A pure bash loop would be too slow for this job.
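Large VCFs are often stored compressed; assuming a hypothetical file.vcf.gz, the same awk filter can read the decompressed stream from standard input (the - argument):
# file.vcf.gz is a hypothetical compressed version of file.vcf
zcat file.vcf.gz | awk -F'\t' '
NR == FNR { a[$0]; next }   # search-file.txt is still read first, directly from disk
$1 in a                     # the decompressed VCF stream arrives on stdin (the - argument)
' search-file.txt - > filtered_file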
Note: Make sure the file search-file.txt doesn't have DOS line endings.
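If you're not sure about the line endings, here is a quick check-and-fix sketch (it assumes GNU sed; BSD sed needs an argument after -i):
# count lines in search-file.txt containing a carriage return (0 means the file is clean)
grep -c $'\r' search-file.txt
# strip trailing carriage returns in place (GNU sed; on BSD use: sed -i '' 's/\r$//' ...)
sed -i 's/\r$//' search-file.txt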
Alternatively,
LC_ALL=C sort search-file.txt file.vcf |
awk '
NF == 1 { loc = $1; next }   # a bare locus name from search-file.txt: remember it
$1 == loc                    # a VCF data line: print it if its locus matches the remembered one
' > filtered_file
but note that this version outputs lines in sorted order rather than in the original file order.
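Whichever version you use, a cheap sanity check on the result (using the filtered_file name from above) is to confirm that the number of distinct loci in the output does not exceed the number of lines in search-file.txt:
wc -l filtered_file                       # total lines kept
cut -f1 filtered_file | sort -u | wc -l   # distinct loci kept; should not exceed wc -l search-file.txt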
Answered By - M. Nejat Aydin