Issue
I have a very simple one-liner that works almost perfectly. I want to add a new column to a file that says "non-coding" or "coding" depending on conditions on columns 12 and 3: if column 12 contains the substring RNA or mir-, and/or column 3 == "pseudogene", then column 1 should read non-coding, else coding.
#file
X FlyBase pseudogene 19910168 19910521 . - . gene_id "FBgn0052821"; gene_symbol "CR32821"; transcript_id "FBtr0307588"; transcript_symbol "CR32821-RB";
X FlyBase pseudogene 476857 479309 . - . gene_id "FBgn0029523"; gene_symbol "CR18275"; transcript_id "FBtr0070097"; transcript_symbol "CR18275-RA";
X FlyBase pseudogene 5832298 5832368 . + . gene_id "FBgn0052761"; gene_symbol "tRNA:Glu-CTC-6-1Psi"; transcript_id "FBtr0070818"; transcript_symbol "tRNA:Glu-CTC-6-1Psi-RA";
X FlyBase pseudogene 6361496 6362960 . - . gene_id "FBgn0016974"; gene_symbol "swaPsi"; transcript_id "FBtr0070923"; transcript_symbol "swaPsi-RA";
X FlyBase pseudogene 6361496 6363310 . - . gene_id "FBgn0016974"; gene_symbol "swaPsi"; transcript_id "FBtr0334014"; transcript_symbol "swaPsi-RB";
X FlyBase gene 20025099 20025170 . + . gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
X FlyBase gene 1482492 1482590 . - . gene_id "FBgn0044508"; gene_symbol "snoRNA:M";
X FlyBase gene 2330159 2330826 . + . gene_id "FBgn0053218"; gene_symbol "lncRNA:CR33218";
X FlyBase gene 3427452 3427523 . - . gene_id "FBgn0052493"; gene_symbol "tRNA:Gln-TTG-2-1";
X FlyBase gene 3819699 3819770 . + . gene_id "FBgn0052785"; gene_symbol "tRNA:Gln-CTG-2-1";
X FlyBase gene 3827622 3827693 . + . gene_id "FBgn0025118"; gene_symbol "tRNA:Pro-CGG-3-1";
2L FlyBase gene 825969 833241 . + . gene_id "FBgn0010583"; gene_symbol "dock";
2L FlyBase gene 852768 854539 . + . gene_id "FBgn0020545"; gene_symbol "kraken";
2L FlyBase gene 855337 856639 . + . gene_id "FBgn0031288"; gene_symbol "CG13949";
2L FlyBase gene 860197 861806 . + . gene_id "FBgn0031289"; gene_symbol "CG13950";
2L FlyBase gene 877302 878270 . + . gene_id "FBgn0002936"; gene_symbol "ninaA";
#command
awk '{ if($12 ~ /RNA/ || $12 ~ /mir-/ || $3 == "pseudogene") $1="non-coding"; else $1="coding"; print }' a.gene-pseudogene_all_dmel-all-r6.40.gtf
The code works, but it replaces column 1, which is not what I want. I want to add this new column before column 1 (so it becomes the new column 1).
How can I adjust it?
Solution
You can (effectively) prepend a new column by converting $1 into something OFS $1. You aren't really creating a new column ($2 still refers to the original second column and $1 refers to "both" new columns) but that is not important in this case:
awk '{
    x = ( $12~/RNA|mir-/ || $3=="pseudogene" ) ? "non-" : ""
    $1 = x "coding" OFS $1
    print
}' a.gene-pseudogene_all_dmel-all-r6.40.gtf
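To see the point about field numbering, you can run the same kind of assignment on a single made-up line (a shortened sample, not a full record from the real file) and print NF and $2 afterwards. This is only an illustrative sketch:

```shell
# Hypothetical three-field sample line, used only to show the field numbering.
out=$(printf 'X FlyBase pseudogene\n' |
  awk '{ $1 = "non-coding" OFS $1; print NF ": " $2 " | " $0 }')
echo "$out"
```

NF is still 3 and $2 is still FlyBase, even though the printed record now starts with the extra word: awk rebuilds $0 after the assignment but does not re-split it into fields.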
The technique above can be used to insert before (or after) any column. Because we are prepending before the first column (or if we were appending after the final one), the code can be made more efficient by avoiding the assignment:
awk '{
    x = ( $12~/RNA|mir-/ || $3=="pseudogene" ) ? "non-" : ""
    print x "coding", $0
}' a.gene-pseudogene_all_dmel-all-r6.40.gtf
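For a position other than the first, the assignment form is the one to use. As a rough sketch, here is the label placed just before the original column 3; the function name label_before_col3 is made up for this example and the input is one line from the sample above:

```shell
# label_before_col3 is a hypothetical helper name used only for this sketch.
label_before_col3() {
  awk '{
    x = ( $12 ~ /RNA|mir-/ || $3 == "pseudogene" ) ? "non-" : ""
    $3 = x "coding" OFS $3   # label goes in front of the original column 3
    print
  }' "$@"
}

printf '%s\n' '2L FlyBase gene 825969 833241 . + . gene_id "FBgn0010583"; gene_symbol "dock";' |
  label_before_col3
```

Note that assigning to any field makes awk rebuild the record with OFS between fields, so tabs or runs of spaces in the input come out as single spaces unless you set OFS yourself. The print x "coding", $0 form does not have this side effect, because $0 is never modified.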
Answered By - jhnc