Issue
I want to replace all the headers (lines starting with >) with >{filename} in all *.fasta files inside my directory, AND concatenate them afterwards.
Content of my directory:
speciesA.fasta
speciesB.fasta
speciesC.fasta
Example of a file, speciesA.fasta:
>protein1 description
MJSUNDKFJSKFJSKFJ
>protein2 anothername
KEFJKSDJFKSDJFKSJFLSJDFLKSJF
>protein3 somewordshere
KSDAFJLASDJFKLAJFL
My desired output (only for speciesA.fasta for now):
>speciesA
MJSUNDKFJSKFJSKFJ
>speciesA
KEFJKSDJFKSDJFKSJFLSJDFLKSJF
>speciesA
KSDAFJLASDJFKLAJFL
This is my code:
for file in *.fasta; do var=$(basename $file .fasta) | sed 's/>.*/>$var/' $var.fasta >>$var.outfile.fasta; done
but all I get is
>$var
MJSUNDKFJSKFJSKFJ
>$var
KEFJKSDJFKSDJFKSJFLSJDFLKSJF
[and so on ...]
Where did I make a mistake?
Solution
The bash loop is superfluous. Try:
awk '/^>/{print ">" substr(FILENAME,1,length(FILENAME)-6); next} 1' *.fasta
This approach is safe even if the file names contain special or regex-active characters.
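As for the mistake in the original loop: the literal >$var in the output comes from the single quotes around the sed script, which stop the shell from expanding $var. In addition, the assignment is followed by a pipe, so it runs in a subshell and var is not set for sed. If you prefer to keep the loop, a minimal corrected sketch could look like the following. The scratch directory, the sample data and the output name combined.txt are made up for illustration, and unlike the awk command this version breaks if a file name contains characters that are special to sed, such as / or &.

```shell
# Scratch directory with two small sample files (hypothetical data).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>protein1 description\nMJSUNDKFJSKFJSKFJ\n' > speciesA.fasta
printf '>protein1 description\nKSDAFJLASDJFKLAJFL\n' > speciesB.fasta

for file in *.fasta; do
    # Plain assignment with no pipe, so var is set in the current shell.
    var=$(basename "$file" .fasta)
    # Double quotes let the shell expand $var before sed sees the script.
    sed "s/^>.*/>$var/" "$file"
done > combined.txt   # one redirection for the whole loop concatenates everything

cat combined.txt
```

Redirecting once after done, rather than appending inside the loop, also avoids piling up duplicate output when the loop is run a second time.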
How it works
/^>/ {print ">" substr(FILENAME, 1, length(FILENAME)-6); next}
For any line that begins with >, the commands in curly braces are executed. The first command prints > followed by all but the last 6 characters of the filename, which removes the .fasta extension. The second command, next, skips the remaining rules for the current line and starts over with the next line.
1
This is awk's cryptic shorthand for print-the-line.
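The hard-coded 6 is simply the length of .fasta. If your files have mixed extensions, a variant along these lines may be easier to adapt. It is only a sketch: the file names speciesB.fa and relabeled.txt and the variable name are my own, and it assumes each file name contains exactly one extension you want dropped.

```shell
# Scratch directory with sample files that use different extensions (hypothetical data).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>protein1 description\nMJSUNDKFJSKFJSKFJ\n' > speciesA.fasta
printf '>protein2 anothername\nKSDAFJLASDJFKLAJFL\n' > speciesB.fa

awk '
    FNR == 1 {                      # first line of each input file
        name = FILENAME
        sub(/\.[^.]*$/, "", name)   # drop the last dot and everything after it
    }
    /^>/ { print ">" name; next }   # header line: print the new header, skip the other rules
    { print }                       # any other line: print unchanged (same as the bare 1)
' speciesA.fasta speciesB.fa > relabeled.txt

cat relabeled.txt
```

The regular expression here is a constant, not built from the file name, so odd characters in the name are still handled literally.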
Example
Let's consider a directory with two (identical) test files:
$ cat speciesA.fasta
>protein1 description
MJSUNDKFJSKFJSKFJ
>protein2 anothername
KEFJKSDJFKSDJFKSJFLSJDFLKSJF
>protein3 somewordshere
KSDAFJLASDJFKLAJFL
$ cat speciesB.fasta
>protein1 description
MJSUNDKFJSKFJSKFJ
>protein2 anothername
KEFJKSDJFKSDJFKSJFLSJDFLKSJF
>protein3 somewordshere
KSDAFJLASDJFKLAJFL
The output of our command is:
$ awk '/^>/{print ">" substr(FILENAME,1,length(FILENAME)-6); next} 1' *.fasta
>speciesA
MJSUNDKFJSKFJSKFJ
>speciesA
KEFJKSDJFKSDJFKSJFLSJDFLKSJF
>speciesA
KSDAFJLASDJFKLAJFL
>speciesB
MJSUNDKFJSKFJSKFJ
>speciesB
KEFJKSDJFKSDJFKSJFLSJDFLKSJF
>speciesB
KSDAFJLASDJFKLAJFL
The output has the substitutions and concatenates all the input files.
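To keep the result, redirect the output to a file. One thing to watch: if the output file itself matches *.fasta and sits in the same directory, a second run would pick it up as input. A simple way around that, sketched below, is to write into a subdirectory; the names combined and all_species.fasta are placeholders, as is the sample data.

```shell
# Scratch directory with two small sample files (hypothetical data).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>protein1 description\nMJSUNDKFJSKFJSKFJ\n>protein2 anothername\nKEFJKSDJFKSDJFKSJFLSJDFLKSJF\n' > speciesA.fasta
printf '>protein1 description\nKSDAFJLASDJFKLAJFL\n' > speciesB.fasta

# Write the result outside the set of files matched by *.fasta.
mkdir -p combined
awk '/^>/{print ">" substr(FILENAME,1,length(FILENAME)-6); next} 1' *.fasta > combined/all_species.fasta

# Quick sanity check: count the rewritten headers per species.
grep '^>' combined/all_species.fasta | sort | uniq -c
```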
Answered By - John1024 Answer Checked By - Candace Johnson (WPSolving Volunteer)