Issue
I am working in bash with a fasta file with headers that begin with a ">" and end with either a "C" or a "+". Like so:
>chr1:35031657-35037706+
GGTGGACTAGCCAGTGAATGTCAACGCGTCCCTA
CCTAAGGCGATATCCGCAGCCGCCCGCGTCCCTA
>chr1:71979382-71985425C
agattaaatgaactattacacataaagtgcttac
ttacacataaagtgcttacgaactattacaggga
I'd like to use awk (gsub?) or sed to change the last character of the header to a "+" if it is a "C". Basically, I want all of the headers to end in "+". No C's.
Desired output:
>chr1:35031657-35037706+
GGTGGACTAGCCAGTGAATGTCAACGCGTCCCTA
CCTAAGGCGATATCCGCAGCCGCCCGCGTCCCTA
>chr1:71979382-71985425+
agattaaatgaactattacacataaagtgcttac
ttacacataaagtgcttacgaactattacaggga
Nothing needs to change with the sequences. I think this is pretty straightforward, but I'm struggling to adapt other posts to do this myself. I know that awk '/^>/ && /C$/{print $0}' will print the headers that begin with ">" and end with "C", but I'm not sure how to replace all of those "C"s with "+"s.
Thanks for your help!
Solution
I think this would be easier to do in sed:
sed '/^>/ s/C$/+/'
Translation: on lines starting with ">", replace a "C" at the end of the line with "+". Note that if the "C" isn't matched, there is no error; sed simply doesn't replace anything. Also, unlike awk, sed automatically prints each line after processing it.
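To see the address restriction at work, here is a minimal sketch that feeds sample data through the same sed command. The variable names (input, result) are hypothetical, and the last sequence line has been altered to end in "C" purely for illustration, to show that non-header lines are left alone:

```shell
# Sample input based on the question. The final sequence line is changed
# to end in "C" (an assumption for illustration only).
input='>chr1:35031657-35037706+
GGTGGACTAGCCAGTGAATGTCAACGCGTCCCTA
>chr1:71979382-71985425C
agattaaatgaactattacacataaagtgcttaC'

# /^>/ limits the substitution to header lines; s/C$/+/ swaps a trailing C.
result=$(printf '%s\n' "$input" | sed '/^>/ s/C$/+/')

# The header is rewritten; the sequence line ending in "C" is untouched.
printf '%s\n' "$result"
```

For a real file you would typically pass the file name to sed and redirect the output to a new file. In-place editing with -i is also possible, but its syntax differs between GNU sed and BSD/macOS sed, so check your local man page first.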
If you really want to use awk, the equivalent would be:
awk '/^>/ {sub("C$","+",$0)}; {print}'
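As a quick sanity check, the sketch below runs an awk variant and the sed command on the same sample and compares the results. It uses a regex literal (/C$/) and relies on sub() defaulting to $0, which should behave the same as the string form above. The variable names are hypothetical:

```shell
# Sample input from the question (two headers, one ending in "C").
input='>chr1:35031657-35037706+
GGTGGACTAGCCAGTGAATGTCAACGCGTCCCTA
>chr1:71979382-71985425C
agattaaatgaactattacacataaagtgcttac'

# awk: on header lines, replace a trailing C in $0; then print every line.
awk_out=$(printf '%s\n' "$input" | awk '/^>/ { sub(/C$/, "+") } { print }')

# sed: same edit, with printing handled automatically.
sed_out=$(printf '%s\n' "$input" | sed '/^>/ s/C$/+/')

# Report whether the two approaches agree on this sample.
if [ "$awk_out" = "$sed_out" ]; then
    echo "awk and sed outputs match"
else
    echo "outputs differ"
fi
```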
Answered By - Gordon Davisson