Issue
I have an annotation table from eggnog mapper and need to make this KO table:
Gene evalue KO
Gene1 0.00003 KO0000
Gene2 0.00005 KO0001
Gene2 0.00005 KO0003
Gene3 0.000005 KO0002
This is the table I have (test.txt):
Gene evalue KO
Gene1 0.00003 KO0000
Gene2 0.00005 KO0001,KO0003
Gene3 0.000005 KO0002
I have ~17,000 rows and the output is in xlsx format. The first issue I am having is that when I save the output file as a txt and view in linux (head test.txt
) some of the columns look like this:
Gene,evalue,KO
Gene1 0.00003 KO0000
Gene2 0.0005 "KO0001,KO0003"
Gene3 0.00005 KO0002
How can I remove the quotes around these values? And how can I make the annotation table above?
I have tried this script from this thread (How can I split comma separated values into multiple rows?)
awk '
BEGIN { OFS="\t" }
{ $1=$1;t=$0; }
{ while(index($0,",")) {
gsub(/,[[:alnum:],]*/,""); print;
$0=t; gsub(OFS "[[:alnum:]]*,",OFS); t=$0;
}
print t
}' file
But it seems to get stuck in an infinite loop because of the quotes around the values in the third column.
Thanks
Solution
I could not reproduce your 'infinite loop' issue using your awk
code. (In fact, there was no output generated at all from the code). Here is an alternative awk
solution:
awk 'BEGIN{OFS="\t"} \
NR==1{gsub(",",OFS,$0); print} \
(NR>1 && $0 ~ /,/) {gsub("\"","",$3); split($3,a,","); $3=""; for (i in a) {print $0 a[i]}} \
(NR>1 && $3!=""){print $0}' input.txt
Output:
Gene evalue KO
Gene1 0.00003 KO0000
Gene2 0.0005 KO0003
Gene2 0.0005 KO0001
Gene3 0.00005 KO0002
To match your expected output formatting more exactly, pipe the awk
output to column -t
:
awk 'BEGIN{OFS="\t"} \
NR==1{gsub(",",OFS,$0); print} \
(NR>1 && $0 ~ /,/) {gsub("\"","",$3); split($3,a,","); $3=""; for (i in a) {print $0 a[i]}} \
(NR>1 && $3!=""){print $0}' tt.txt | column -t
Output:
Gene evalue KO
Gene1 0.00003 KO0000
Gene2 0.0005 KO0003
Gene2 0.0005 KO0001
Gene3 0.00005 KO0002
Explanation:
Set output field separator to tab character
BEGIN{OFS="\t"}
Change commas in first record to output field separator (tab character) and print the first record
NR==1{gsub(",",OFS,$0); print}
For each row after the first row that has commas in the row: 1) remove all double quotes, 2) split the third column into an array on comma characters, 3) set the 3 column to empty string, 4) print output for each of the items in the array.
(NR>1 && $0 ~ /,/) {gsub("\"","",$3); split($3,a,","); $3=""; for (i in a) {print $0 a[i]}}
Print all rows after first row where 3rd column is not empty.
(NR>1 && $3!=""){print $0}'
Answered By - j_b Answer Checked By - Clifford M. (WPSolving Volunteer)