Issue
I have a FASTA file in which the headers (the lines starting with ">") are not in the right order. I want to take the part of each header that matches a certain pattern (ctg.*,) and move it to the start of the header. For example:
head seq.fasta -n 3
>JACEFZ010000001.1 Cepaea nemoralis isolate C981 ctg35418, whole genome shotgun sequence
cctcctcctccctcctcccctttttCCCTccttcccctttcccccctcctcttcctccccccctcctcccccccctcctc
cttcctccgccctctcctcctcctcactcctcctcctccctcctcctcctccctctacctcctacccCCTCCTCCCGTCA
I want to "move" the ctg35418 string to the start, so that the new file becomes:
>ctg35418 JACEFZ010000001.1 Cepaea nemoralis isolate C981, whole genome shotgun sequence
cctcctcctccctcctcccctttttCCCTccttcccctttcccccctcctcttcctccccccctcctcccccccctcctc
cttcctccgccctctcctcctcctcactcctcctcctccctcctcctcctccctctacctcctacccCCTCCTCCCGTCA
I'm fairly new to shell scripting, so I did something like this:
while read line; do
  if [[ $line =~ ">" ]]; then
    id=$(echo $line | grep -oe "ctg.*," | sed 's/,//g')
    line2=$(echo $line | sed 's/>//g' | sed "s| ${id}||g")
    sed -i "s|$line|>${id} ${line2}|g" seq.fasta
  fi
done < seq.fasta
I would love to get your input on how to reduce the complexity of this, let's call it, code.
Solution
This sed command should do the job:
sed 's/>\(.*\)[[:blank:]]\(ctg[^,]*\)/>\2 \1/' seq.fasta > newseq.fasta
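The \(.*\) captures the text between > and the blank just before ctg (both exclusive), and \(ctg[^,]*\) captures the text from ctg (inclusive) up to the , (exclusive). The replacement >\2 \1 then writes the second group first, followed by a space and the first group; the comma and the rest of the header are not part of the match, so they stay in place. In the given sample, \(.*\) captures JACEFZ010000001.1 Cepaea nemoralis isolate C981 and \(ctg[^,]*\) captures ctg35418. Because .* is greedy, if a header contained more than one blank followed by ctg, the last one would be the one moved.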
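To check the substitution without touching seq.fasta, the sample header from the question can be piped through the same expression. This is only a sketch: the function name reorder_header is made up for illustration, and the /^>/ address is an optional addition that restricts the substitution to header lines (the pattern already requires a ">", so it mainly makes the intent explicit).

```shell
# Hypothetical helper (name chosen for illustration): read FASTA text on stdin
# and move the ctg ID to the front of each header line.
reorder_header() {
  # /^>/ limits the substitution to header lines; the s/// expression is
  # the one from the answer above.
  sed '/^>/ s/>\(.*\)[[:blank:]]\(ctg[^,]*\)/>\2 \1/'
}

# Sample header from the question plus a shortened sequence line.
printf '%s\n' \
  '>JACEFZ010000001.1 Cepaea nemoralis isolate C981 ctg35418, whole genome shotgun sequence' \
  'cctcctcctccctcctcccctttttCCCT' | reorder_header
```

The header line should come out with ctg35418 in front, and the sequence line should pass through unchanged. Once the result looks right, the same function can be run on the real file with reorder_header < seq.fasta > newseq.fasta.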
Answered By - M. Nejat Aydin