Issue
I've been reading lots of helpful posts about reverse complementing sequences, but I've got what seems to be an unusual request. I'm working in bash and I have DNA sequences in fasta format in my stdout that I'd like to pass on down the pipe. The seemingly unusual bit is that I'm trying to reverse complement SOME of those sequences, so that the output has all the sequences in the same direction (for multiple sequence alignment later).
My fasta headers end in either "C" or "+". I'd like to reverse complement the ones that end in "C". Here's a little subset:
>chr1:86214203-86220231+
CTGGTGGTACAGCTACATTGTACCATAAAACTTATTCATATTAAAACTTA
TTTATATGTACCTCAAAAGATTAAACTGGGAGATAAGGTGTGGCATTTTT
>chr1:84518073-84524089C
caccttagagataatgaagtatattcagaatgtagaacattctataagac
aactgacccaatatcttttaaaaagtcaatgccatgttaaaaataaaaag
I know there are lots of ways to reverse complement out there, like:
echo ACCTTGAAA | tr ACGTacgt TGCAtgca | rev
and
seqtk seq -r in.fa > out.fa
But I'm not sure how to do this for only those sequences that have a C at the end of the header. I think awk or sed is probably the ticket, but I'm at a loss as to how to actually code it. I can get the sequence headers with awk, like:
awk '/^>/ { print $0 }'
>chr1:84518073-84524089C
>chr1:86214203-86220231+
But if someone could help me figure out how to turn that awk statement into one that asks "if the last character in the header has a C, do this!" that would be great!
Edited to add: I was so tired when I made this post, I apologize for not including my desired output. Here is what I'd like to output to look like, using my little example:
>chr1:86214203-86220231+
CTGGTGGTACAGCTACATTGTACCATAAAACTTATTCATATTAAAACTTA
TTTATATGTACCTCAAAAGATTAAACTGGGAGATAAGGTGTGGCATTTTT
>chr1:84518073-84524089C
ctttttatttttaacatggcattgactttttaaaagatattgggtcagtt
gtcttatagaatgttctacattctgaatatacttcattatctctaaggtg
You can see the sequence that ends in + is unchanged, but the sequence with a header that ends in C is reverse complemented.
Thanks!
Solution
An earlier answer (by Ed Morton) uses a self-contained awk
procedure to selectively reverse-complement sequences following a comment line ending with "C". Although I think that to be the best approach, I will offer an alternative approach that might have wider applicability.
The procedure here uses awk
's system()
function to send data extracted from the fasta file in awk
to the shell where the sequence can be processed by any of the many shell applications existing for sequence manipulation.
I have defined an awk
user function to pass the isolated sequence from awk
to the shell. It can be called from any part of the awk procedure:
function processSeq(s)
{system("echo \"" s "\" | tr ACGTacgt TGCAtgca | rev ");}
The argument of the system
function is a string containing the command you would type into terminal to achieve the desired outcome (in this case I've used one of the example reverse-complement routines mentioned in the question). The parts to note are the correct escaping of quote marks that are to appear in the shell command, and the variable s
that will be substituted for the sequence string assigned to it when the function is called. The value of s
is concatenated with the strings quoted before and after it in the argument to system()
shown above.
isolating the required sequences
The rest of the procedure addresses how to achieve:
"if the last character in the header has a C, do this"
Before making use of shell applications, awk
needs to isolate the part(s) of the file to process. In general terms, awk
employs one or more pattern
/action
blocks where only records
(lines by default) that match a given pattern
are processed by the subsequent action
commands. For example, the following illustrative procedure performs the action
of printing the whole line print $0
if the pattern
/^>/ && /C$/
is true for that line (where /^>/
looks for ">" at the start of a line and /C$/
looks for "C" at the end of the same line.:
/^>/ && /C$/{ print $0 }
For the current needs, the sequence begins on the next record (line) after any record beginning with >
and ending with C
. One way of referencing that next line is to set a variable (named line
in my example) when the C
line is encountered and establishing a later pattern for the record with numerical value one more than line
variable.
Because fasta sequences may extend over several lines, we have to accumulate several successive lines following a C
title line. I have achieved this by concatenating each line following the C
title line until a record beginning with >
is encountered again (or until the end of the file is reached, using the END
block).
In order that sequence lines following a non-C
title line are ignored, I have used a variable named flag
with values of either "do"
or "ignore"
set when a title record is encountered.
The call to a the custom function processSeq()
that employs the system()
command, is made at the beginning of a C
title action
block if the variable seq
holds an accumulated sequence (and in the END
block for relevant sequences that occur at the end of the file where there will be no title line).
Test file and procedure
A modified version of your example fasta was used to test the procedure. It contains an extra relevant C
record with three and-a-bit lines instead of two, and an extra irrelevant +
record.
seq.fasta:
>chr1:86214203-86220231+
CTGGTGGTACAGCTACATTGTACCATAAAACTTATTCATATTAAAACTTA
TTTATATGTACCTCAAAAGATTAAACTGGGAGATAAGGTGTGGCATTTTT
>chr1:84518073-84524089C
caccttagagataatgaagtatattcagaatgtagaacattctataagac
aactgacccaatatcttttaaaaagtcaatgccatgttaaaaataaaaag
>chr1:86214203-86220231+
CTGGTGGTACAGCTACATTGTACCATAAAACTTATTCATATTAAAACTTA
TTTATATGTACCTCAAAAGATTAAACTGGGAGATAAGGTGTGGCATTTTT
>chranotherC
aatgaagtatattcagaatgtagaacattaactgacccgccatgttaatc
aatatctataagaccttttaaaaagcaccttagagattcaataaagtcag
gaagtatattcagaatgtagaacattaactgactaagaccttttaacatg
gcattgact
procedure
awk '
/^>/ && /C$/{
if (length(seq)>0) {processSeq(seq); seq="";}
line=NR; print $0; flag="do"; next;
}
/^>/ {line=NR; flag="ignore"}
NR>1 && NR==(line+1) && (flag=="do"){seq=seq $0; line=NR; next}
function processSeq(s)
{system("echo \"" s "\" | tr ACGTacgt TGCAtgca | rev ");}
END { if (length(seq)>0) processSeq(seq);}
' seq.fasta
output
>chr1:84518073-84524089C
ctttttatttttaacatggcattgactttttaaaagatattgggtcagttgtcttatagaatgttctacattctgaatatacttcattatctctaaggtg
>chranotherC
agtcaatgccatgttaaaaggtcttagtcagttaatgttctacattctgaatatacttcctgactttattgaatctctaaggtgctttttaaaaggtcttatagatattgattaacatggcgggtcagttaatgttctacattctgaatatacttcatt
Tested using GNU Awk 5.1.0 on a Raspberry Pi 400.
performance note
Because calling sytstem()
creates a sub shell, this process will be slower than a self-contained awk
procedure. It might be useful where existing shell routines are available or tricky to reproduce with custom awk routines.
Edit: modification to include unaltered +
records
This version has some repetition of earlier blocks, with minor changes, to handle printing of the lines that are not to be reverse-complemented (the changes should be self-explanatory if the main explanations were understood)
awk '
/^>/ && /C$/{
if (length(seq)>0 && flag=="do") {processSeq(seq)} else {print seq} seq="";line=NR; print $0; flag="do"; next;
}
/^>/ {if (length(seq)>0 && flag=="do") {processSeq(seq)} else {print seq} seq=""; print $0; line=NR; flag="ignore"}
NR>1 && NR==(line+1){seq=seq $0; line=NR; next}
function processSeq(s)
{system("echo \"" s "\" | tr ACGTacgt TGCAtgca | rev ");}
END { if (length(seq)>0 && flag=="do") {processSeq(seq)} else {print seq}}
' seq.fasta
Answered By - Dave Pritlove Answer Checked By - Gilberto Lyons (WPSolving Admin)