Issue
I have three very large files and I just want to grab the matching ID2s that come up for the first 20 ID2s for file 1 (in the example, I listed only 5). I want it to search file 2 and file 3 in its entirety and pull out the lines that match to the 20 ID2s from file1. I also want the ID1- even if it's blank (as well as other columns that are on the same matching ID2 line).
I know how to do this in python, but since the files are so large, my computer can't handle it. I have been trying to do it in the terminal (unix) with no luck. I tried the following command but 1)it doesn't deal with three files and 2) it is also throwing me an error (maybe I'm grabbing the wrong columns)?
awk ‘NR==FNR{a[$1][$0];next} $0 in a {for (i in a[$0]) print i}’ file1.tsv file2.tsv
Error:
zsh: parse error near `}'
Another question: How do I skip the first five rows in unix when doing the command (because these files have some notes in the beginning that are read in as rows)?
file1
id 1 id2 name
A00000004 B00000004 emily
B00000005 joe
A00000006 B00000006 jack
B00000007 john
A00000008 B00000008 sally
file2
id1 id2 source source_code
A00000001 B00000001 source1 321
A00000003 B00000003 source2 165
A00000004 B00000004 source3 481
A00000005 B00000005 source2 891
file 3
id2 code
B00000006 yes
B00000007 no
B00000005 yes
B00000004 yes
B00000012 yes
Desired:
id1 id2 name source source_code code
A00000004 B00000004 emily source3 481 yes
A00000005 B00000005 joe source2 891 yes
Solution
First, to start on the Nth line of a file:
tail -n +N infile
To match the id2
columns in the different files, I'd recommend the join
command. However, there are drawbacks, including the requirement to know which columns contain id2
and the need to sort on those columns. This method also won't recognize more than one identical column (e.g. the id1
column will be repeated for each file in which it appears). It's also unclear to me from the question if there are headers you want to keep. This method will lose the headers.
# Sort files on the id2 column
# First, drop the header line.
# Then sort on the common column (in this case the 2nd field).
# "-t $'\t'" means to use tabs as the field separator.
tail -n +2 file1 | sort -t $'\t' -k2,2 > file1.sort
tail -n +2 file2 | sort -t $'\t' -k2,2 > file2.sort
tail -n +2 file3 | sort -t $'\t' -k1,1 > file3.sort
First join file1.sort
to file2.sort
:
# The "-j 2" flag means that both files use the second column as the common field.
# One could also use "-1 2 -2 2" to specify the field for the first and second files.
join -t $'\t' -j 2 file1.sort file2.sort > file1-2
Then join that file to file3.sort
. Note that the output of join
places the joined column first, by default.
join -t $'\t' -j 1 file1-2 file3.sort > files1-3
You can save a step, and an intermediary file, by combining the last two commands:
join -t $'\t' -j 2 file1.sort file2.sort | join -t $'\t' -j 1 - file3.sort > files1-3
>cat files1-3
B00000004 A00000004 emily A00000004 source3 481 yes
B00000005 joe A00000005 source2 891 yes
If you know that the columns are arranged the way they are in the example, you can also use join
's output formatting to clean things up a bit:
join -t $'\t' -j 2 file1.sort file2.sort | join -t $'\t' -j 1 -o 1.4,1.1,1.3,1.5,1.6,2.2 - file3.sort > files1-3_format
>cat files1-3_format
A00000004 B00000004 emily source3 481 yes
A00000005 B00000005 joe source2 891 yes
Answered By - Darlingtonia Answer Checked By - Terry (WPSolving Volunteer)