Issue
I want to count the occurrences of a character in each sequence of a FASTA file that contains multiple sequences, but the method I use counts the character across the whole file:
grep -o 'G' my_sequence.fasta | wc -l
Is there a way to get the count for each sequence separately when the FASTA file contains multiple sequences?
The FASTA file looks like this:
>sequence1
CCGTGGGTCAATCCCGTA
>sequence2
CCGTGGGGCACTCCCGTA
>sequence3
TTGTGGGTCAATCCCGTC
>sequence4
CCCGGGTGCACTCCCGTA
Solution
Here's an awk script that counts the number of G in each sequence; it discards any header that precedes the first sequence in the FASTA file and supports multi-line sequences. Also, the description lines in the FASTA file might contain more than just the sequence ID, for example >MCHU - Calmodulin …. The code only outputs the ID, i.e. >MCHU.
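awk -v char=G '
    # Description line: report the previous record, then start a new one.
    /^>/ {
        if (label != "")
            print label, gsub(char, "", sequence)   # gsub returns the number of matches
        sequence = ""   # also drops any text that precedes the first record
        label = $1      # keep only the ID, e.g. ">MCHU"
        next
    }
    # Sequence line: append it, so multi-line sequences are joined.
    { sequence = sequence $0 }
    # Report the last record.
    END {
        if (label != "")
            print label, gsub(char, "", sequence)
    }
' file.fasta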
Output for the sample file:
>sequence1 5
>sequence2 6
>sequence3 5
>sequence4 5
Remark: be aware that the char value passed to awk with -v is used as a regex by gsub, and that C-style escape sequences in it are unescaped. This doesn't really matter for your use case (i.e. for characters that have no special meaning in regex or in C escaping).
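If the character to count could ever be special in a regex (for example . or *), one possible workaround is to match it literally. The sketch below is not part of the original answer: the function name count_char and the environment variable needle are made up for illustration. It passes the string through the environment, so awk does not process escape sequences in it, and it counts with index() and substr() instead of gsub(), so no regex is involved.

```shell
# count_char STRING FILE
# Print "ID count" for each FASTA record, matching STRING literally.
# Hypothetical helper for illustration; the names are assumptions.
count_char() {
    needle=$1 awk '
        # Count literal, non-overlapping occurrences of needle in str.
        function count(str, needle,    n, pos) {
            if (needle == "")
                return 0
            n = 0
            while ((pos = index(str, needle)) > 0) {
                n++
                str = substr(str, pos + length(needle))
            }
            return n
        }
        # ENVIRON values are taken as-is: no escape processing, no regex.
        BEGIN { needle = ENVIRON["needle"] }
        /^>/ {
            if (label != "")
                print label, count(sequence, needle)
            sequence = ""
            label = $1
            next
        }
        { sequence = sequence $0 }
        END {
            if (label != "")
                print label, count(sequence, needle)
        }
    ' "$2"
}
```

Running count_char G my_sequence.fasta on the sample file from the question should give the same four lines as the gsub version, since G has no special meaning in a regex.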
Answered By - Fravadona Answer Checked By - Pedro (WPSolving Volunteer)