Sunday, December 24, 2023

[SOLVED] Replacing numerical values in a FASTA file with their index in a different file. (Bash preferred)

December 24, 2023 awk, bash, character-replacement, fasta, sed

Issue

I have a folder full of FASTA files in the following format, where each line beginning with > is the read name of a DNA sequence and the next line is the sequence itself. This pattern repeats for the entire file:

> 887_ENCFF899MTI.fastq.gz_seq1
GGCCCGCCTCCCGTCGGCCGGTGCGAGCGGCTCCGCGA
> 55_ENCFF899MTI.fastq.gz_seq2
GGGGGGGGCGTCTCGCGCAAACGTCCATAAC
> ...
...

In the read names, the leading number (887 in the first read above) is the index of the query sequence I used to find this read, stored in a different file (e.g. SequenceNames.txt). That file can be assumed to have this format:

SequenceA
SequenceB
...

I want to replace only the number between > and _ (avoiding incidental matches with the filename) with the name found on the corresponding line of the SequenceNames file. For example, I would want

> 1_ENCFF899MTI.fastq.gz_seq1
ACTATC
> 2_ENCFF899MTI.fastq.gz_seq1

to become

> SequenceA_ENCFF899MTI.fastq.gz_seq1
ACTATC
> SequenceB_ENCFF899MTI.fastq.gz_seq1

I can make these replacements in general, but I am unsure how to restrict the index replacement to the regex match between > and _ without running a file-wide dictionary replacement of these numbers. I am also struggling with awk array indexing to get something like

gawk '{print gensub(/^> ([0-9]*)_/,array[pattern],"\\1")}'

to produce what I'm looking for.

Solution

Using gawk:

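gawk '
NR==FNR { ar["> " NR "_"] = $0 }   # names file: key "> N_" holds the name on line N
NR>FNR  { if (match($0, /^> [0-9]+_/, m) && (m[0] in ar)) gsub(/^> [0-9]+_/, "> " ar[m[0]] "_", $0); print }   # FASTA file: swap known indices, print every line
' SequenceNames.txt matches.fasta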

The NR==FNR block runs only while the first file (the sequence name file) is being read. It stores each line in the array ar, keyed by a string built from "> " (including the space), the line number, and a trailing "_", for example "> 1_".
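To see what that first block builds, the lookup array can be dumped on its own. This is only a minimal sketch: the temporary directory and the two-line names file are made up for the demo, and it sticks to POSIX awk features so it should not need gawk.

```shell
# Build a tiny stand-in for SequenceNames.txt (contents are made up for the demo).
tmpdir=$(mktemp -d)
printf 'SequenceA\nSequenceB\n' > "$tmpdir/SequenceNames.txt"

# Same keying scheme as the answer: "> N_" maps to the Nth line of the file.
# Keys are printed in line order instead of with for-in, whose order is unspecified.
keys=$(awk '{ ar["> " NR "_"] = $0 }
END { for (i = 1; i <= NR; i++) { k = "> " i "_"; print "[" k "] = " ar[k] } }' "$tmpdir/SequenceNames.txt")
printf '%s\n' "$keys"
rm -rf "$tmpdir"
```

Each key is exactly the prefix that a FASTA header line starts with, which is why the second block can look up the matched text directly.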

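The NR>FNR block handles the second file (the FASTA file). match() stores the text matched by a regex requiring a line-start ">" followed by a space, a number, and an underscore in the array m; the three-argument form of match() is a gawk extension, so plain awk will not accept it. gsub() then replaces that match with "> ", the corresponding name held in ar, and "_". Lines that do not match, such as the sequence lines, are printed unchanged.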
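If gawk is not available, the same idea can probably be expressed with the two-argument match(), which sets RSTART and RLENGTH in any POSIX awk. The sketch below is not the answerer's code: rename_headers is a hypothetical helper name, the folder layout and sample files are invented for the demo, and headers whose index has no line in the names file are deliberately left untouched. It also loops over a folder, since the question mentions many FASTA files.

```shell
# rename_headers NAMES FASTA: print FASTA with numeric header indices replaced
# by the matching line of NAMES (hypothetical helper name, POSIX awk only).
rename_headers() {
    awk '
    NR == FNR { name["> " NR "_"] = $0; next }    # names file: "> N_" maps to the Nth name
    match($0, /^> [0-9]+_/) {                      # header line, sets RSTART and RLENGTH
        key = substr($0, 1, RLENGTH)               # the matched "> N_" prefix
        if (key in name)                           # unknown indices are left untouched
            $0 = "> " name[key] "_" substr($0, RLENGTH + 1)
    }
    { print }
    ' "$1" "$2"
}

# Throwaway demo data: folder names, file names and contents are made up.
workdir=$(mktemp -d)
mkdir "$workdir/fasta" "$workdir/renamed"
printf 'SequenceA\nSequenceB\n' > "$workdir/SequenceNames.txt"
printf '> 1_ENCFF899MTI.fastq.gz_seq1\nACTATC\n> 2_ENCFF899MTI.fastq.gz_seq1\nGGCC\n' > "$workdir/fasta/one.fasta"
printf '> 9_ENCFF899MTI.fastq.gz_seq2\nTTAA\n' > "$workdir/fasta/two.fasta"

# Process every FASTA file in the folder, writing results to a separate folder
# so the originals stay as they were.
for f in "$workdir"/fasta/*.fasta; do
    rename_headers "$workdir/SequenceNames.txt" "$f" > "$workdir/renamed/$(basename "$f")"
done

one=$(cat "$workdir/renamed/one.fasta")
two=$(cat "$workdir/renamed/two.fasta")
printf '%s\n%s\n' "$one" "$two"
rm -rf "$workdir"
```

Writing to a separate output folder is a cautious default; whether to overwrite the originals afterwards is left to the reader.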

Tested using GNU Awk 5.1.0

Answered By - Dave Pritlove

Answer Checked By - Cary Denson (WPSolving Admin)

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
