Wednesday, August 31, 2022

[SOLVED] Rearranging specific lines of text, based on a pattern

Issue

I have a FASTA file in which the headers (the lines starting with ">") are not in the right order. I want to take the part of each header that matches a certain pattern (ctg.*,) and move it to the start of the header. For example:

head seq.fasta -n 3
>JACEFZ010000001.1 Cepaea nemoralis isolate C981 ctg35418, whole genome shotgun sequence
cctcctcctccctcctcccctttttCCCTccttcccctttcccccctcctcttcctccccccctcctcccccccctcctc
cttcctccgccctctcctcctcctcactcctcctcctccctcctcctcctccctctacctcctacccCCTCCTCCCGTCA

I want to move the ctg35418 string to the start of the header, so that the new file looks like this:

>ctg35418 JACEFZ010000001.1 Cepaea nemoralis isolate C981, whole genome shotgun sequence
cctcctcctccctcctcccctttttCCCTccttcccctttcccccctcctcttcctccccccctcctcccccccctcctc
cttcctccgccctctcctcctcctcactcctcctcctccctcctcctcctccctctacctcctacccCCTCCTCCCGTCA

I'm fairly new to shell scripting, so I did something like this:

while read line; do if [[ $line =~ ">" ]]; then 
 id=$(echo $line | grep -oe "ctg.*," | sed 's/,//g')
 line2=$(echo $line | sed 's/>//g' | sed "s| ${id}||g")
 sed -i "s|$line|>${id} ${line2}|g" seq.fasta
 fi
 done < seq.fasta

I would love your input on how to reduce the complexity of this, let's call it, code.


Solution

This sed command should do the job:

sed 's/>\(.*\)[[:blank:]]\(ctg[^,]*\)/>\2 \1/' seq.fasta > newseq.fasta

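The \(.*\) captures the text between > and the blank just before ctg (both exclusive), and \(ctg[^,]*\) captures the text from ctg (inclusive) up to the next , (exclusive). In the given sample, \(.*\) captures JACEFZ010000001.1 Cepaea nemoralis isolate C981 and \(ctg[^,]*\) captures ctg35418. The replacement >\2 \1 then writes the ctg identifier first, followed by the rest of the header; the text after the match (", whole genome shotgun sequence") is left untouched.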
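
To check the substitution before running it on the whole file, it can be tried on the sample header alone. The sketch below is only an illustration: the function name move_ctg_first is hypothetical, and the /^>/ address and the ^ anchor are small additions (not part of the answer's command) that restrict the substitution to header lines, on the assumption that sequence lines should always pass through unchanged.

```shell
# Hypothetical helper (the name is an assumption, not from the answer).
# Reads FASTA text on stdin and writes the rewritten text to stdout.
move_ctg_first() {
  # /^>/ applies the substitution to header lines only.
  # \1 = header text up to the blank before "ctg"; \2 = "ctg" up to the next comma.
  sed '/^>/ s/^>\(.*\)[[:blank:]]\(ctg[^,]*\)/>\2 \1/'
}

# Try it on the sample header plus a short, made-up sequence line.
printf '%s\n' \
  '>JACEFZ010000001.1 Cepaea nemoralis isolate C981 ctg35418, whole genome shotgun sequence' \
  'cctcctcctccctcc' | move_ctg_first
```

For the sample header this prints the header with ctg35418 moved to the front, followed by the sequence line as it was. Headers that contain no ctg token do not match and are printed unchanged.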



Answered By - M. Nejat Aydin
Answer Checked By - Gilberto Lyons (WPSolving Admin)