Issue
I have a large multi-line FASTA file that looks like the following:
>NWQ47741.1 CLTR1 protein, partial [Melospiza_melodia] Vertebrate
CLSQGTMTALSPNLSCHNPSIDDFRNSVYSTLYSMISIMGFVGNGVVLYVLIRTYRQKTA
FQIYMLNLALSDFLCVLTLPLRVIYYVHKGHWFFSDFLCRLSSYALYVNLYCSIFFMTAM
SFFRCIAIVFPVRNISLVSEKKAKFLCVGIWVFVTLTSAPFLRNGTYQHGNKTKCFEPPE
NSQKTNMVVILDFIALFVGFIFPFVIITICYTMIIRTLLRNSLRKNEANRRKAVWMIVIV
TATFLVSFTPYHVLRTVHLHALRLRGPGCADTVFLQKAVIVTLPLAAANCCFDPLLYFFS
GGNFRQRLTTLRKASSSSLSQAFRKKISVKEKEEEPFGE
>XP_002763076.2 cysteinyl leukotriene receptor 1 [Callithrix_jacchus] Vertebrate
MDGTGNLTVSSATCHDTIDEFRNQVYSTLYSMISVVGFFGNGFVLYVLIKTYHEKSAFQI
YMINLAIADLLCVCTLPLRVVYYVHKGIWFFGDFLCRLSTYALYVNLYCSIFFMTAMSFF
RCIAIVFPVQNINLVTQKKARFVCVGIWIFVILASSPFLITKSYKDEKNNTKCFEPPQDN
QTKNHVLILHYVSLFLGFIIPFVIIIVCYTMIILTLLKKSMKKNLSSHKKAIRMIMVVTA
AFLVSFMPYHIQRTIHLHFLHNETKPCDSVLRMQKSVVITLSLAASNCCFDPLLYFFSGG
NFRRRLSTFRKHSLSSMTYVPRKKASLPEKGEEICKV
>XP_036988076.1 cysteinyl leukotriene receptor 1 [Artibeus_jamaicensis] Vertebrate
MDGTGNLTASSASNNMCNSSIDDFRNQVYSTMYSMISIVGFFGNGFVLYVLIRTYHEKSA
FQIYMINLAVSDLLCVCTLPLRVVYYVHKGMWFFGDILCRLSTYALYVNLYCSIFFMTAM
SFFRCIAIVFPVKNINLVTEKKARFVCASIWVFVILTSSPFLMSKSYKDEKNNTKCFEPP
QDNETKNHIFILHYVSLLVGFLIPFIIIIVCYTMIIFTLLKNSMQKNVPSRKKAVGMIII
VTAAFLISFMPYHIQRTIHLHFLYNETKPCDSVLRMQKSVVITLSLAASNCCFDPLLYFF
SGGNFRRRLSTFRKHSLSSMTYVPKKKVSLPEKEDEVCK
I need to replace a specific substring in each header line of the file: the part after the accession ID and before the first '[' character. For example, in the first header, I want to replace "CLTR1 protein, partial" with "new_string", resulting in:
>NWQ47741.1 new_string [Melospiza_melodia] Vertebrate
CLSQGTMTALSPNLSCHNPSIDDFRNSVYSTLYSMISIMGFVGNGVVLYVLIRTYRQKTA
FQIYMLNLALSDFLCVLTLPLRVIYYVHKGHWFFSDFLCRLSSYALYVNLYCSIFFMTAM
SFFRCIAIVFPVRNISLVSEKKAKFLCVGIWVFVTLTSAPFLRNGTYQHGNKTKCFEPPE
NSQKTNMVVILDFIALFVGFIFPFVIITICYTMIIRTLLRNSLRKNEANRRKAVWMIVIV
TATFLVSFTPYHVLRTVHLHALRLRGPGCADTVFLQKAVIVTLPLAAANCCFDPLLYFFS
GGNFRQRLTTLRKASSSSLSQAFRKKISVKEKEEEPFGE
I'm looking for a way to achieve this using AWK or Sed, as the file is large, and I want an efficient solution. Can someone provide a sample script or command for this task?
I tried the following script:
awk '/^>/ {gsub(/\[[^[]*/, "new_string"); print; next} 1' your_file.fasta > modified_file.fasta
But got the following output:
>NWQ47741.1 CLTR1 protein, partial new_string
CLSQGTMTALSPNLSCHNPSIDDFRNSVYSTLYSMISIMGFVGNGVVLYVLIRTYRQKTA
FQIYMLNLALSDFLCVLTLPLRVIYYVHKGHWFFSDFLCRLSSYALYVNLYCSIFFMTAM
SFFRCIAIVFPVRNISLVSEKKAKFLCVGIWVFVTLTSAPFLRNGTYQHGNKTKCFEPPE
NSQKTNMVVILDFIALFVGFIFPFVIITICYTMIIRTLLRNSLRKNEANRRKAVWMIVIV
TATFLVSFTPYHVLRTVHLHALRLRGPGCADTVFLQKAVIVTLPLAAANCCFDPLLYFFS
GGNFRQRLTTLRKASSSSLSQAFRKKISVKEKEEEPFGE
>XP_002763076.2 cysteinyl leukotriene receptor 1 new_string
MDGTGNLTVSSATCHDTIDEFRNQVYSTLYSMISVVGFFGNGFVLYVLIKTYHEKSAFQI
YMINLAIADLLCVCTLPLRVVYYVHKGIWFFGDFLCRLSTYALYVNLYCSIFFMTAMSFF
RCIAIVFPVQNINLVTQKKARFVCVGIWIFVILASSPFLITKSYKDEKNNTKCFEPPQDN
QTKNHVLILHYVSLFLGFIIPFVIIIVCYTMIILTLLKKSMKKNLSSHKKAIRMIMVVTA
AFLVSFMPYHIQRTIHLHFLHNETKPCDSVLRMQKSVVITLSLAASNCCFDPLLYFFSGG
NFRRRLSTFRKHSLSSMTYVPRKKASLPEKGEEICKV
>XP_036988076.1 cysteinyl leukotriene receptor 1 new_string
MDGTGNLTASSASNNMCNSSIDDFRNQVYSTMYSMISIVGFFGNGFVLYVLIRTYHEKSA
FQIYMINLAVSDLLCVCTLPLRVVYYVHKGMWFFGDILCRLSTYALYVNLYCSIFFMTAM
SFFRCIAIVFPVKNINLVTEKKARFVCASIWVFVILTSSPFLMSKSYKDEKNNTKCFEPP
QDNETKNHIFILHYVSLLVGFLIPFIIIIVCYTMIIFTLLKNSMQKNVPSRKKAVGMIII
VTAAFLISFMPYHIQRTIHLHFLYNETKPCDSVLRMQKSVVITLSLAASNCCFDPLLYFF
SGGNFRRRLSTFRKHSLSSMTYVPKKKVSLPEKEDEVCK
Solution
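Using sed
Your awk attempt fails because the pattern \[[^[]* starts matching at the '[' and then consumes every following character that is not '[', so it replaces the bracketed organism name and everything after it instead of the description before it. Match instead from the first space (the one right after the accession ID) up to, but not including, the first '[', and restrict the substitution to header lines with the /^>/ address: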
$ sed -E '/^>/s/ [^[]*/ new_string /' input_file
>NWQ47741.1 new_string [Melospiza_melodia] Vertebrate
CLSQGTMTALSPNLSCHNPSIDDFRNSVYSTLYSMISIMGFVGNGVVLYVLIRTYRQKTA
FQIYMLNLALSDFLCVLTLPLRVIYYVHKGHWFFSDFLCRLSSYALYVNLYCSIFFMTAM
SFFRCIAIVFPVRNISLVSEKKAKFLCVGIWVFVTLTSAPFLRNGTYQHGNKTKCFEPPE
NSQKTNMVVILDFIALFVGFIFPFVIITICYTMIIRTLLRNSLRKNEANRRKAVWMIVIV
TATFLVSFTPYHVLRTVHLHALRLRGPGCADTVFLQKAVIVTLPLAAANCCFDPLLYFFS
GGNFRQRLTTLRKASSSSLSQAFRKKISVKEKEEEPFGE
>XP_002763076.2 new_string [Callithrix_jacchus] Vertebrate
MDGTGNLTVSSATCHDTIDEFRNQVYSTLYSMISVVGFFGNGFVLYVLIKTYHEKSAFQI
YMINLAIADLLCVCTLPLRVVYYVHKGIWFFGDFLCRLSTYALYVNLYCSIFFMTAMSFF
RCIAIVFPVQNINLVTQKKARFVCVGIWIFVILASSPFLITKSYKDEKNNTKCFEPPQDN
QTKNHVLILHYVSLFLGFIIPFVIIIVCYTMIILTLLKKSMKKNLSSHKKAIRMIMVVTA
AFLVSFMPYHIQRTIHLHFLHNETKPCDSVLRMQKSVVITLSLAASNCCFDPLLYFFSGG
NFRRRLSTFRKHSLSSMTYVPRKKASLPEKGEEICKV
>XP_036988076.1 new_string [Artibeus_jamaicensis] Vertebrate
MDGTGNLTASSASNNMCNSSIDDFRNQVYSTMYSMISIVGFFGNGFVLYVLIRTYHEKSA
FQIYMINLAVSDLLCVCTLPLRVVYYVHKGMWFFGDILCRLSTYALYVNLYCSIFFMTAM
SFFRCIAIVFPVKNINLVTEKKARFVCASIWVFVILTSSPFLMSKSYKDEKNNTKCFEPP
QDNETKNHIFILHYVSLLVGFLIPFIIIIVCYTMIIFTLLKNSMQKNVPSRKKAVGMIII
VTAAFLISFMPYHIQRTIHLHFLYNETKPCDSVLRMQKSVVITLSLAASNCCFDPLLYFF
SGGNFRRRLSTFRKHSLSSMTYVPKKKVSLPEKEDEVCK
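Since the question also asked about awk, the same substitution can be sketched there. This is a minimal sketch, not a tested solution for every FASTA variant: the file names sample.fasta and modified.fasta and the variable name new are placeholders chosen for this example, and the sample file holds only the first header and one sequence line from the question.

```shell
# Build a tiny sample file (placeholder name) from the first record in the question
printf '%s\n' \
  '>NWQ47741.1 CLTR1 protein, partial [Melospiza_melodia] Vertebrate' \
  'CLSQGTMTALSPNLSCHNPSIDDFRNSVYSTLYSMISIMGFVGNGVVLYVLIRTYRQKTA' > sample.fasta

# On header lines only, sub() replaces the first match of " [^[]*":
# the first space plus everything up to (not including) the first "[".
# The trailing "1" prints every line, changed or not.
awk -v new="new_string" '/^>/ { sub(/ [^[]*/, " " new " ") } 1' sample.fasta > modified.fasta

cat modified.fasta
```

Note that sub() is used rather than gsub(), because only the first match on each header should change. As with the sed command, a header that contains no '[' would have everything after the first space replaced; changing the condition to /^>.*\[/ would skip such headers. If the replacement text may contain '&' or backslashes, it needs escaping, since awk treats those specially in the replacement string.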
Answered By - sseLtaH
Answer Checked By - Mary Flores (WPSolving Volunteer)