Issue
I have two files
file1
non-coding X FlyBase gene 20025099 20025170 . + . gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
non-coding X FlyBase gene 19910168 19910521 . - . gene_id "FBgn0052821"; gene_symbol "CR32821";
non-coding X FlyBase gene 476857 479309 . - . gene_id "FBgn0029523"; gene_symbol "CR18275";
non-coding X FlyBase gene 15576355 15576964 . + . gene_id "FBgn0262163"; gene_symbol "betaNACtes5";
non-coding X FlyBase gene 19910168 19910521 . - . gene_id "FBgn0052821"; gene_symbol "CR32821";
file2
betaNACtes5
CR18275
28SrRNA-Psi:CR45859
CR32821
What I want: if any line of file2 matches column 13 of file1 (a partial match, because the value in column 13 is wrapped in double quotes and followed by a semicolon), change the string in column 4 to "pseudogene"; otherwise leave the line unchanged.
Desired output
non-coding X FlyBase gene 20025099 20025170 . + . gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
non-coding X FlyBase pseudogene 19910168 19910521 . - . gene_id "FBgn0052821"; gene_symbol "CR32821";
non-coding X FlyBase pseudogene 476857 479309 . - . gene_id "FBgn0029523"; gene_symbol "CR18275";
non-coding X FlyBase pseudogene 15576355 15576964 . + . gene_id "FBgn0262163"; gene_symbol "betaNACtes5";
non-coding X FlyBase pseudogene 19910168 19910521 . - . gene_id "FBgn0052821"; gene_symbol "CR32821";
So far I can get the matches, but I can't do the rest.
grep -Ff file2 file1
Solution
Using GNU awk for the 3rd arg to match() and the \s/\S shorthand:
$ cat tst.awk
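NR==FNR {
    # First file (file2): store each name in the form it has in file1, e.g. "CR32821";
    genes["\""$1"\";"]
    next
}
$NF in genes {
    # a[1] = first 3 fields with their whitespace, a[3] = everything after field 4
    match($0,/((\S+\s+){3})\S+(.*)/,a)
    $0 = a[1] "pseudogene" a[3]
}
{ print }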
$ awk -f tst.awk file2 file1
non-coding X FlyBase gene 20025099 20025170 . + . gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
non-coding X FlyBase pseudogene 19910168 19910521 . - . gene_id "FBgn0052821"; gene_symbol "CR32821";
non-coding X FlyBase pseudogene 476857 479309 . - . gene_id "FBgn0029523"; gene_symbol "CR18275";
non-coding X FlyBase pseudogene 15576355 15576964 . + . gene_id "FBgn0262163"; gene_symbol "betaNACtes5";
non-coding X FlyBase pseudogene 19910168 19910521 . - . gene_id "FBgn0052821"; gene_symbol "CR32821";
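The lookup works because each name from file2 is stored with the same surrounding characters that the last field of file1 carries (an opening quote, a closing quote and a semicolon), so a plain "in" test is enough and no regexp match is needed. A minimal sketch to check that, where key_vs_last_field is a hypothetical helper name used only for this illustration:

```shell
# Print the key built from a file2 line, then the last field of a file1 line,
# so the two strings can be compared by eye.
key_vs_last_field() {
    printf '%s\n' 'CR32821' |
        awk '{ print "\"" $1 "\";" }'
    printf '%s\n' 'non-coding X FlyBase gene 19910168 19910521 . - . gene_id "FBgn0052821"; gene_symbol "CR32821";' |
        awk '{ print $NF }'
}

key_vs_last_field
```

Both lines printed should be the same string, which is why $NF in genes succeeds for that record.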
Alternatively, the same field replacement using any POSIX awk:
$ cat tst.awk
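NR==FNR {
    # First file (file2): store each name in the form it has in file1, e.g. "CR32821";
    genes["\""$1"\";"]
    next
}
$NF in genes {
    # RLENGTH = length of the first 3 fields plus the whitespace after each
    match($0,/([^[:space:]]+[[:space:]]+){3}/)
    tail = substr($0,RLENGTH+1)
    # Drop the old 4th field from the front of the remainder
    sub(/[^[:space:]]+/,"",tail)
    $0 = substr($0,1,RLENGTH) "pseudogene" tail
}
{ print }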
$ awk -f tst.awk file2 file1
non-coding X FlyBase gene 20025099 20025170 . + . gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
non-coding X FlyBase pseudogene 19910168 19910521 . - . gene_id "FBgn0052821"; gene_symbol "CR32821";
non-coding X FlyBase pseudogene 476857 479309 . - . gene_id "FBgn0029523"; gene_symbol "CR18275";
non-coding X FlyBase pseudogene 15576355 15576964 . + . gene_id "FBgn0262163"; gene_symbol "betaNACtes5";
non-coding X FlyBase pseudogene 19910168 19910521 . - . gene_id "FBgn0052821"; gene_symbol "CR32821";
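Both scripts above edit $0 as a string instead of assigning to $4, presumably because assigning to a field makes awk rebuild the record with OFS, which replaces every run of blanks or tabs with a single space. If that normalisation is acceptable (an assumption, e.g. the file really is single-space separated as it appears in the question), a shorter variant might look like the sketch below. The relabel function name and the scratch directory are hypothetical and only serve the demo:

```shell
# Sketch only: relabel NAMES_FILE GFF_FILE
relabel() {
    awk '
        NR==FNR { genes["\"" $1 "\";"]; next }   # keys look like "CR32821";
        $NF in genes { $4 = "pseudogene" }        # rebuilds $0 using OFS
        { print }
    ' "$1" "$2"
}

# Small demo on two lines taken from file1 in the question.
tmp=$(mktemp -d)
printf '%s\n' betaNACtes5 CR18275 '28SrRNA-Psi:CR45859' CR32821 > "$tmp/file2"
cat > "$tmp/file1" <<'EOF'
non-coding X FlyBase gene 20025099 20025170 . + . gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
non-coding X FlyBase gene 19910168 19910521 . - . gene_id "FBgn0052821"; gene_symbol "CR32821";
EOF

relabel "$tmp/file2" "$tmp/file1"
```

On input that mixes tabs and multiple spaces, this variant changes the spacing of every relabelled line, so the match()/substr() versions remain the safer choice when the original layout has to survive.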
Answered By - Ed Morton