Saturday, January 27, 2024

[SOLVED] Replace single character in fasta header with awk or sed

Issue

I am working in bash with a fasta file with headers that begin with a ">" and end with either a "C" or a "+". Like so:

>chr1:35031657-35037706+
GGTGGACTAGCCAGTGAATGTCAACGCGTCCCTA
CCTAAGGCGATATCCGCAGCCGCCCGCGTCCCTA
>chr1:71979382-71985425C
agattaaatgaactattacacataaagtgcttac
ttacacataaagtgcttacgaactattacaggga

I'd like to use awk (gsub?) or sed to change the last character of the header to a "+" if it is a "C". Basically I want all of the sequences to end in "+". No C's.

Desired output:

>chr1:35031657-35037706+
GGTGGACTAGCCAGTGAATGTCAACGCGTCCCTA
CCTAAGGCGATATCCGCAGCCGCCCGCGTCCCTA
>chr1:71979382-71985425+
agattaaatgaactattacacataaagtgcttac
ttacacataaagtgcttacgaactattacaggga

Nothing needs to change with the sequences. I think this is pretty straightforward, but I'm struggling to use other posts to do this myself. I know that awk '/^>/ && /C$/{print $0}' will print the headers that begin with ">" and end with "C", but I'm not sure how to replace all of those "C"s with "+"s.

Thanks for your help!


Solution

I think this would be easier to do in sed:

sed '/^>/ s/C$/+/'

Translation: on lines starting with ">", replace a "C" at the end of the line with "+". If the "C" isn't matched, there is no error; sed simply leaves the line unchanged. Also, unlike awk, sed automatically prints each line after processing it.
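To see the address-plus-substitution behavior on a small input, here is a minimal sketch. The sample records are made up for illustration, and the last sequence line deliberately ends in an uppercase "C" to show that only header lines are rewritten. The variable name result is just a placeholder.

```shell
# Made-up sample records; the final sequence line ends in "C" on purpose.
result=$(printf '%s\n' \
  '>chr1:35031657-35037706+' \
  'GGTGGACTAGCCAGTGAATGTCAACGCGTCCCTA' \
  '>chr1:71979382-71985425C' \
  'AGATTAAATGAACTATTACACATAAAGTGCTTAC' |
  sed '/^>/ s/C$/+/')   # the substitution only runs on lines matching /^>/

# Print the transformed records for inspection.
printf '%s\n' "$result"
```

With a real file, the same command would typically be run as sed '/^>/ s/C$/+/' input.fa > output.fa, where input.fa and output.fa are hypothetical file names. Writing to a new file keeps the original intact until the result has been checked.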

If you really want to use awk, the equivalent would be:

awk '/^>/ {sub(/C$/, "+")} {print}'
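As a quick sanity check that the two approaches agree, the sketch below runs both on the same made-up sample and compares the results. The awk call is written with a regex literal, sub(/C$/, "+"), which should behave the same here as passing the pattern as the string "C$". The variable names sample, sed_out and awk_out are placeholders.

```shell
# Made-up sample: one header already ending in "+", one ending in "C",
# and a sequence line ending in "C" that must stay unchanged.
sample=$(printf '%s\n' \
  '>chr1:35031657-35037706+' \
  'GGTGGACTAGCCAGTGAATGTCAACGCGTCCCTA' \
  '>chr1:71979382-71985425C' \
  'AGATTAAATGAACTATTACACATAAAGTGCTTAC')

# sed prints every line by default; awk needs the explicit {print} rule.
sed_out=$(printf '%s\n' "$sample" | sed '/^>/ s/C$/+/')
awk_out=$(printf '%s\n' "$sample" | awk '/^>/ {sub(/C$/, "+")} {print}')

# Report whether the two tools produced the same text.
if [ "$sed_out" = "$awk_out" ]; then
  echo "sed and awk outputs match"
else
  echo "sed and awk outputs differ"
fi
```

Dropping the {print} rule from the awk program would produce no output at all, which is the difference from sed noted above.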


Answered By - Gordon Davisson
Answer Checked By - Terry (WPSolving Volunteer)