Saturday, November 20, 2021

[SOLVED] Compare columns in two files and if match change string in another column

November 20, 2021 awk, grep, unix

Issue

I have two files

file1 
non-coding  X   FlyBase gene    20025099    20025170    .   +   .   gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
non-coding  X   FlyBase gene    19910168    19910521    .   -   .   gene_id "FBgn0052821"; gene_symbol "CR32821";
non-coding  X   FlyBase gene    476857  479309  .   -   .   gene_id "FBgn0029523"; gene_symbol "CR18275";
non-coding  X   FlyBase gene    15576355    15576964    .   +   .   gene_id "FBgn0262163"; gene_symbol "betaNACtes5";
non-coding  X   FlyBase gene    19910168    19910521    .   -   .   gene_id "FBgn0052821"; gene_symbol "CR32821";

file2
betaNACtes5
CR18275
28SrRNA-Psi:CR45859
CR32821

What I want: if any line in file2 matches column 13 of file1 (only a partial match, because the value in file1 is wrapped in double quotes and followed by a semicolon), change the string in column 4 to "pseudogene"; otherwise the line should be left unchanged.

Desired output

non-coding  X   FlyBase gene    20025099    20025170    .   +   .   gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
non-coding  X   FlyBase pseudogene  19910168    19910521    .   -   .   gene_id "FBgn0052821"; gene_symbol "CR32821";
non-coding  X   FlyBase pseudogene  476857  479309  .   -   .   gene_id "FBgn0029523"; gene_symbol "CR18275";
non-coding  X   FlyBase pseudogene  15576355    15576964    .   +   .   gene_id "FBgn0262163"; gene_symbol "betaNACtes5";
non-coding  X   FlyBase pseudogene  19910168    19910521    .   -   .   gene_id "FBgn0052821"; gene_symbol "CR32821";

So far I can get the matches, but I can't do the rest.

grep -Ff file2 file1
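
For reference, the "partial match" mentioned above can be seen by printing column 13 the way awk splits it on whitespace. This is only a small sketch on one sample line (the variable name line is illustrative, and the spacing in the sample is assumed): the symbol arrives with its double quotes and trailing semicolon attached, so it never equals a bare line of file2.

```shell
# One sample line in the same layout as file1 (spacing is illustrative).
line='non-coding  X   FlyBase gene    476857  479309  .   -   .   gene_id "FBgn0029523"; gene_symbol "CR18275";'

# Print the number of whitespace-separated fields and the raw 13th field.
printf '%s\n' "$line" | awk '{ print NF, $13 }'

# Strip the quotes and the semicolon to recover the bare symbol as it appears in file2.
printf '%s\n' "$line" | awk '{ sym = $13; gsub(/[";]/, "", sym); print sym }'
```

The first command shows the field as awk sees it, the second shows the bare symbol that would have to be compared against file2.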

Solution

Using GNU awk for the 3rd arg to match() and \s/\S shorthand:

$ cat tst.awk
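# First file (file2): store each symbol in the quoted form used in file1, e.g. "CR18275";
NR==FNR {
    genes["\""$1"\";"]
    next
}
# Second file (file1): the last field is the quoted gene symbol
$NF in genes {
    # a[1] = first 3 fields plus their trailing whitespace, a[3] = everything after field 4
    match($0,/((\S+\s+){3})\S+(.*)/,a)
    $0 = a[1] "pseudogene" a[3]
}
{ print }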

$ awk -f tst.awk file2 file1
non-coding  X   FlyBase gene    20025099    20025170    .   +   .   gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
non-coding  X   FlyBase pseudogene    19910168    19910521    .   -   .   gene_id "FBgn0052821"; gene_symbol "CR32821";
non-coding  X   FlyBase pseudogene    476857  479309  .   -   .   gene_id "FBgn0029523"; gene_symbol "CR18275";
non-coding  X   FlyBase pseudogene    15576355    15576964    .   +   .   gene_id "FBgn0262163"; gene_symbol "betaNACtes5";
non-coding  X   FlyBase pseudogene    19910168    19910521    .   -   .   gene_id "FBgn0052821"; gene_symbol "CR32821";

Alternatively, using any POSIX awk:

$ cat tst.awk
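# First file (file2): store each symbol in the quoted form used in file1, e.g. "CR18275";
NR==FNR {
    genes["\""$1"\";"]
    next
}
# Second file (file1): the last field is the quoted gene symbol
$NF in genes {
    # RLENGTH = length of the first 3 fields plus their trailing whitespace
    match($0,/([^[:space:]]+[[:space:]]+){3}/)
    tail = substr($0,RLENGTH+1)
    # drop the old 4th field from the front of the remainder
    sub(/[^[:space:]]+/,"",tail)
    $0 = substr($0,1,RLENGTH) "pseudogene" tail
}
{ print }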

$ awk -f tst.awk file2 file1
non-coding  X   FlyBase gene    20025099    20025170    .   +   .   gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
non-coding  X   FlyBase pseudogene    19910168    19910521    .   -   .   gene_id "FBgn0052821"; gene_symbol "CR32821";
non-coding  X   FlyBase pseudogene    476857  479309  .   -   .   gene_id "FBgn0029523"; gene_symbol "CR18275";
non-coding  X   FlyBase pseudogene    15576355    15576964    .   +   .   gene_id "FBgn0262163"; gene_symbol "betaNACtes5";
non-coding  X   FlyBase pseudogene    19910168    19910521    .   -   .   gene_id "FBgn0052821"; gene_symbol "CR32821";
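
Both scripts above rewrite $0 through match() rather than assigning to $4. A likely reason (an assumption here, the answer does not say so) is that assigning to a field makes awk rebuild the record with OFS, a single space by default, so the original runs of blanks or tabs between fields are lost. If collapsed spacing is acceptable, a shorter variant is possible. The following is a minimal sketch that runs on temporary copies of two sample lines; tmpdir, f1.txt and f2.txt are illustrative names, not part of the original question.

```shell
# Work in a scratch directory so the real file1/file2 are untouched.
tmpdir=$(mktemp -d)

cat > "$tmpdir/f1.txt" <<'EOF'
non-coding  X   FlyBase gene    20025099    20025170    .   +   .   gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
non-coding  X   FlyBase gene    476857  479309  .   -   .   gene_id "FBgn0029523"; gene_symbol "CR18275";
EOF

printf '%s\n' betaNACtes5 CR18275 > "$tmpdir/f2.txt"

# First file: store each symbol in the quoted form it has in the last column.
# Second file: overwrite $4 on a match; this rebuilds $0 using OFS.
result=$(awk '
NR==FNR { genes["\"" $1 "\";"]; next }
$NF in genes { $4 = "pseudogene" }
{ print }
' "$tmpdir/f2.txt" "$tmpdir/f1.txt")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

In this sketch the non-matching line is printed with its spacing intact, while the matching line comes out with single spaces between all fields. When the original tabs or column alignment must survive, the match()-based scripts above are the safer choice.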

Answered By - Ed Morton

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
