Wednesday, November 9, 2022

[SOLVED] Making a KO annotation table from eggnog mapper output

November 09, 2022 awk, bash

Issue

I have an annotation table from eggnog mapper and need to make this KO table:

Gene    evalue    KO
Gene1   0.00003   KO0000
Gene2   0.00005   KO0001
Gene2   0.00005   KO0003
Gene3   0.000005  KO0002

This is the table I have (test.txt):

Gene    evalue    KO
Gene1   0.00003   KO0000
Gene2   0.00005   KO0001,KO0003
Gene3   0.000005  KO0002

I have ~17,000 rows and the output is in xlsx format. The first issue I am having is that when I save the output file as a txt and view in linux (head test.txt) some of the columns look like this:

Gene,evalue,KO
Gene1  0.00003  KO0000
Gene2  0.0005   "KO0001,KO0003"
Gene3  0.00005  KO0002

How can I remove the quotes around these values? And how can I make the annotation table above?

I have tried this script from this thread (How can I split comma separated values into multiple rows?)

awk ' BEGIN { OFS="\t" }
     { $1=$1;t=$0; }
     { while(index($0,",")) {
         gsub(/,[[:alnum:],]*/,""); print;
         $0=t; gsub(OFS "[[:alnum:]]*,",OFS); t=$0;
       }
       print t
     }' file

But it seems to get stuck in an infinite loop because of the quotes around the values in the third column.

Thanks

Solution

I could not reproduce your 'infinite loop' issue using your awk code. (In fact, there was no output generated at all from the code). Here is an alternative awk solution:

awk 'BEGIN{OFS="\t"} \
NR==1{gsub(",",OFS,$0); print} \
(NR>1 && $0 ~ /,/) {gsub("\"","",$3); split($3,a,","); $3=""; for (i in a) {print $0 a[i]}} \
(NR>1 && $3!=""){print $0}' input.txt

Output:

Gene    evalue  KO
Gene1  0.00003  KO0000
Gene2   0.0005  KO0003
Gene2   0.0005  KO0001
Gene3  0.00005  KO0002

To match your expected output formatting more exactly, pipe the awk output to column -t:

awk 'BEGIN{OFS="\t"} \
NR==1{gsub(",",OFS,$0); print} \
(NR>1 && $0 ~ /,/) {gsub("\"","",$3); split($3,a,","); $3=""; for (i in a) {print $0 a[i]}} \
(NR>1 && $3!=""){print $0}' tt.txt | column -t

Output:

Gene   evalue   KO
Gene1  0.00003  KO0000
Gene2  0.0005   KO0003
Gene2  0.0005   KO0001
Gene3  0.00005  KO0002

Explanation:

Set output field separator to tab character

BEGIN{OFS="\t"}

Change commas in first record to output field separator (tab character) and print the first record

NR==1{gsub(",",OFS,$0); print}

For each row after the first row that has commas in the row: 1) remove all double quotes, 2) split the third column into an array on comma characters, 3) set the 3 column to empty string, 4) print output for each of the items in the array.

(NR>1 && $0 ~ /,/) {gsub("\"","",$3); split($3,a,","); $3=""; for (i in a) {print $0 a[i]}}

Print all rows after first row where 3rd column is not empty.

(NR>1 && $3!=""){print $0}'

Answered By - j_b

Answer Checked By - Clifford M. (WPSolving Volunteer)

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Wednesday, November 9, 2022

[SOLVED] Making a KO annotation table from eggnog mapper output

Issue

Solution

Popular Posts

Labels