Issue
A FASTQ file stores each read as a record of four lines:
Line starting with @ contains the sequence identifier.
Line containing the DNA sequence.
Line starting with + (plus sign) indicating the beginning of the quality score line.
Line containing the quality scores corresponding to the DNA sequence.
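The four-line layout above can be checked mechanically. The following is a minimal sketch, not part of the original post: the function name check_fastq is hypothetical, and it assumes every record occupies exactly four lines (no wrapped sequences). It reports any line that breaks the expected pattern; once one line is missing, all later records are shifted, so the first reported line is the one to look at.

```shell
# check_fastq (hypothetical helper): read FASTQ text on stdin and report
# lines that do not fit the 4-line record layout.
check_fastq() {
  awk '
    # 1st line of each group of four must be the @ header
    NR % 4 == 1 && !/^@/  { bad = 1; print "line " NR ": expected @ header" }
    # 3rd line of each group of four must be the + separator
    NR % 4 == 3 && !/^\+/ { bad = 1; print "line " NR ": expected + separator" }
    END {
      if (NR % 4 != 0) { bad = 1; print "line count " NR " is not a multiple of 4" }
      exit bad + 0   # non-zero exit status when anything was reported
    }
  '
}
```

For a gzipped file it could be used as: zcat sample.fastq.gz | check_fastq | head -n 1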
I have a corrupted fastq.gz file in which a + line is missing. For example, this is the output of zcat sample.fastq.gz:
@E00592:278:HC7KLCCX2:2:1101:3539:1502 1:N:0:NCATCCTC
NTTCTAAATTGAAGGAAGAACAAGACAAAGAAATACTGGAGACAGAAATTGAATCAAACCATCCTAGAGTGGCTTCTGCTTTACAAGACCA
+
#AAAAJJJJFJJJAJFFAFAJJJJJ-<JJJJJJA-7JFJJJJAJFJFFFJAJJ----JJ-----FJFF-<AF<FFJ-FFFJJJ<FJA<J-F
@E00592:278:HC7KLCCX2:2:1101:5751:1502 1:N:0:NCATCCTC
NAAAATTCATTCTCTAGGTTCATTACTTGAAGGCCCCTTATGATTAGCAAGGACTTGATTGTCTCAGGACACGGTTTCAATAATAAGTAGC
+
#A<<A-AJJJJFFJJJJJ<7JFFFFAJ-7JJJJJJJ7FF-F-<F<JF77AJJ7FAA-F-<<--<JAJJFJJJFJJ<<JAF<JJFJJFAJFA
@E00592:278:HC7KLCCX2:2:1101:6461:1502 1:N:0:NCATCCTC
NGGGCTGTGAACCAGGCCATCGGGCAAGTGATCTGGCACAGCCAGGACAACAGAGCAGTCTTCCTCTGTGACCACAGGGTTGCCTTTGCAG
#A<<A-AJJJJFFJJJJJ<7JFFFFAJ-7JJJJJJJ7FF-F-<F<JF77AJJ7FAA-F-<<--<JAJJFJJJFJJ<<JAF<JJFJJFAJFA
@E00592:278:HC7KLCCX2:2:1101:5751:1502 1:N:0:NCATCCTC
NAAAATTCATTCTCTAGGTTCATTACTTGAAGGCCCCTTATGATTAGCAAGGACTTGATTGTCTCAGGACACGGTTTCAATAATAAGTAGC
+
#A<<A-AJJJJFFJJJJJ<7JFFFFAJ-7JJJJJJJ7FF-F-<F<JF77AJJ7FAA-F-<<--<JAJJFJJJFJJ<<JAF<JJFJJFAJFA
The third read in the file is missing its + line.
The expected output is:
@E00592:278:HC7KLCCX2:2:1101:3539:1502 1:N:0:NCATCCTC
NTTCTAAATTGAAGGAAGAACAAGACAAAGAAATACTGGAGACAGAAATTGAATCAAACCATCCTAGAGTGGCTTCTGCTTTACAAGACCA
+
#AAAAJJJJFJJJAJFFAFAJJJJJ-<JJJJJJA-7JFJJJJAJFJFFFJAJJ----JJ-----FJFF-<AF<FFJ-FFFJJJ<FJA<J-F
@E00592:278:HC7KLCCX2:2:1101:5751:1502 1:N:0:NCATCCTC
NAAAATTCATTCTCTAGGTTCATTACTTGAAGGCCCCTTATGATTAGCAAGGACTTGATTGTCTCAGGACACGGTTTCAATAATAAGTAGC
+
#A<<A-AJJJJFFJJJJJ<7JFFFFAJ-7JJJJJJJ7FF-F-<F<JF77AJJ7FAA-F-<<--<JAJJFJJJFJJ<<JAF<JJFJJFAJFA
@E00592:278:HC7KLCCX2:2:1101:6461:1502 1:N:0:NCATCCTC
NGGGCTGTGAACCAGGCCATCGGGCAAGTGATCTGGCACAGCCAGGACAACAGAGCAGTCTTCCTCTGTGACCACAGGGTTGCCTTTGCAG
+
#A<<A-AJJJJFFJJJJJ<7JFFFFAJ-7JJJJJJJ7FF-F-<F<JF77AJJ7FAA-F-<<--<JAJJFJJJFJJ<<JAF<JJFJJFAJFA
@E00592:278:HC7KLCCX2:2:1101:5751:1502 1:N:0:NCATCCTC
NAAAATTCATTCTCTAGGTTCATTACTTGAAGGCCCCTTATGATTAGCAAGGACTTGATTGTCTCAGGACACGGTTTCAATAATAAGTAGC
+
#A<<A-AJJJJFFJJJJJ<7JFFFFAJ-7JJJJJJJ7FF-F-<F<JF77AJJ7FAA-F-<<--<JAJJFJJJFJJ<<JAF<JJFJJFAJFA
I tried:
zcat sample.fastq.gz | awk 'NR%4==0 {print "+"} {print}' | sed 's/^\+$/+/g' > corrected_file.fastq
But it gave me:
@E00592:278:HC7KLCCX2:2:1101:3539:1502 1:N:0:NCATCCTC
NTTCTAAATTGAAGGAAGAACAAGACAAAGAAATACTGGAGACAGAAATTGAATCAAACCATCCTAGAGTGGCTTCTGCTTTACAAGACCA
+
+
#AAAAJJJJFJJJAJFFAFAJJJJJ-<JJJJJJA-7JFJJJJAJFJFFFJAJJ----JJ-----FJFF-<AF<FFJ-FFFJJJ<FJA<J-F
@E00592:278:HC7KLCCX2:2:1101:5751:1502 1:N:0:NCATCCTC
NAAAATTCATTCTCTAGGTTCATTACTTGAAGGCCCCTTATGATTAGCAAGGACTTGATTGTCTCAGGACACGGTTTCAATAATAAGTAGC
+
+
#A<<A-AJJJJFFJJJJJ<7JFFFFAJ-7JJJJJJJ7FF-F-<F<JF77AJJ7FAA-F-<<--<JAJJFJJJFJJ<<JAF<JJFJJFAJFA
@E00592:278:HC7KLCCX2:2:1101:6461:1502 1:N:0:NCATCCTC
NGGGCTGTGAACCAGGCCATCGGGCAAGTGATCTGGCACAGCCAGGACAACAGAGCAGTCTTCCTCTGTGACCACAGGGTTGCCTTTGCAG
#A<<A-AJJJJFFJJJJJ<7JFFFFAJ-7JJJJJJJ7FF-F-<F<JF77AJJ7FAA-F-<<--<JAJJFJJJFJJ<<JAF<JJFJJFAJFA
+
@E00592:278:HC7KLCCX2:2:1101:5751:1502 1:N:0:NCATCCTC
NAAAATTCATTCTCTAGGTTCATTACTTGAAGGCCCCTTATGATTAGCAAGGACTTGATTGTCTCAGGACACGGTTTCAATAATAAGTAGC
+
#A<<A-AJJJJFFJJJJJ<7JFFFFAJ-7JJJJJJJ7FF-F-<F<JF77AJJ7FAA-F-<<--<JAJJFJJJFJJ<<JAF<JJFJJFAJFA
Solution
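Using sed: after each @ header line and the sequence line that follows it, insert a + line if the next line does not already start with +. Here input_file is the uncompressed FASTQ; for the gzipped file, pipe the output of zcat sample.fastq.gz into sed instead.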
$ sed -e '/^@/{n;/^[[:alpha:]]/{n;/^+/!{i\+' -e '}}}' input_file
@E00592:278:HC7KLCCX2:2:1101:3539:1502 1:N:0:NCATCCTC
NTTCTAAATTGAAGGAAGAACAAGACAAAGAAATACTGGAGACAGAAATTGAATCAAACCATCCTAGAGTGGCTTCTGCTTTACAAGACCA
+
#AAAAJJJJFJJJAJFFAFAJJJJJ-<JJJJJJA-7JFJJJJAJFJFFFJAJJ----JJ-----FJFF-<AF<FFJ-FFFJJJ<FJA<J-F
@E00592:278:HC7KLCCX2:2:1101:5751:1502 1:N:0:NCATCCTC
NAAAATTCATTCTCTAGGTTCATTACTTGAAGGCCCCTTATGATTAGCAAGGACTTGATTGTCTCAGGACACGGTTTCAATAATAAGTAGC
+
#A<<A-AJJJJFFJJJJJ<7JFFFFAJ-7JJJJJJJ7FF-F-<F<JF77AJJ7FAA-F-<<--<JAJJFJJJFJJ<<JAF<JJFJJFAJFA
@E00592:278:HC7KLCCX2:2:1101:6461:1502 1:N:0:NCATCCTC
NGGGCTGTGAACCAGGCCATCGGGCAAGTGATCTGGCACAGCCAGGACAACAGAGCAGTCTTCCTCTGTGACCACAGGGTTGCCTTTGCAG
+
#A<<A-AJJJJFFJJJJJ<7JFFFFAJ-7JJJJJJJ7FF-F-<F<JF77AJJ7FAA-F-<<--<JAJJFJJJFJJ<<JAF<JJFJJFAJFA
@E00592:278:HC7KLCCX2:2:1101:5751:1502 1:N:0:NCATCCTC
NAAAATTCATTCTCTAGGTTCATTACTTGAAGGCCCCTTATGATTAGCAAGGACTTGATTGTCTCAGGACACGGTTTCAATAATAAGTAGC
+
#A<<A-AJJJJFFJJJJJ<7JFFFFAJ-7JJJJJJJ7FF-F-<F<JF77AJJ7FAA-F-<<--<JAJJFJJJFJJ<<JAF<JJFJJFAJFA
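The same repair can also be sketched in awk by tracking the position inside each record rather than matching on the @ prefix. This is an alternative sketch, not the answerer's method: the function name fix_fastq is hypothetical, and it assumes that only + lines are missing and that the quality line of an affected read does not itself start with +.

```shell
# fix_fastq (hypothetical helper): read FASTQ text on stdin and write it to
# stdout with any missing + separator line restored.
fix_fastq() {
  awk '
    { pos++ }                                  # position within the current record (1..4)
    pos == 3 && !/^\+/ { print "+"; pos++ }    # separator missing: emit it and treat this line as quality
    { print }                                  # always pass the input line through
    pos == 4 { pos = 0 }                       # quality line written, start the next record
  '
}
```

With the files from the question it could be used as: zcat sample.fastq.gz | fix_fastq > corrected_file.fastq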
Answered By - sseLtaH Answer Checked By - Mildred Charles (WPSolving Admin)