Issue
f1:
>seq11
TCAGATGTGTATAAGAGACAGGATTCTTCCACGGTTATTGAGAGTANGCGAGAA
>seq95
TCAGATGTGTATAAGAGACAGTACGTCTTGGTGACTATATCGAGGCNGAATGAA
>seq11
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGACTGCGCTAAGCGGCTACTTCGCATACT
>seq11
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGCAAGACCACTTGTGGCCGTTCGCATACT
>seq95
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGTCCGAGCTTGCCGAACAGTATTCAGCCT
>seq95
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGCGGAGAACCGTGCTCTCATATTCAGCCT
f2:
>seq11
TCAGATGTGTATAAGAGACAGGATTCTTCCACGGTTATTGAGAGTATGCGAGAA
>seq95
TCAGATGTGTATAAGAGACAGTACGTCTTGGTGACTATATCGAGGCTGAATGAA
>seq11
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGACTTCAACGTGTCAGGCCGTCGCATACT
>seq11
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGACTGCGCTAAGCGGCTACTTCGCATACT
>seq11
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGCAAGACCACTTGTGGCCGTTCGCATACT
>seq95
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGCCTCTGCGCTAACGAGAGTATTCAGCCT
>seq95
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGTCCGAGCTTGCCGAACAGTATTCAGCCT
>seq95
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGCGGAGAACCGTGCTCTCATATTCAGCCT
>seq95
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGGTAAGACCGTCTGCACCGTATTCAGCCT
f1 is a subset of f2. I want to print every record of f1, but the shorter sequences in f1 contain an N
character that should be replaced with the original base taken from the matching sequence in f2. The desired output is:
>seq11
TCAGATGTGTATAAGAGACAGGATTCTTCCACGGTTATTGAGAGTATGCGAGAA
>seq95
TCAGATGTGTATAAGAGACAGTACGTCTTGGTGACTATATCGAGGCTGAATGAA
>seq11
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGACTGCGCTAAGCGGCTACTTCGCATACT
>seq11
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGCAAGACCACTTGTGGCCGTTCGCATACT
>seq95
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGTCCGAGCTTGCCGAACAGTATTCAGCCT
>seq95
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGCGGAGAACCGTGCTCTCATATTCAGCCT
I know about grep -f f2.fa f1.fa, but I have not been able to make it ignore the N mismatch.
How can I do this?
Thank you in advance.
Solution
Try this using GNU awk for arrays of arrays:
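$ cat tst.awk
BEGIN {
    # The four bases that an N may stand for.
    split("T C A G",tmp)
    for ( i in tmp ) {
        chars[tmp[i]]
    }
    # Length of a full-length read; anything shorter counts as a short read.
    fullLength = length("GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGACTGCGCTAAGCGGCTACTTCGCATACT")
}
# Header line: remember the sequence ID for the read that follows.
/>/ {
    key = $1
    next
}
{ currLength = length($0) }
# First file (f2): record each short read under its ID.
NR == FNR {
    if ( currLength < fullLength ) {
        shortStrings[key][$0]
    }
    next
}
# Second file (f1): full-length reads are printed unchanged.
currLength == fullLength {
    print key ORS $0
    next
}
# Second file (f1): a short read is printed only if it, or a copy with its
# first N replaced by one of the four bases, was seen in f2 under the same ID.
key in shortStrings {
    delete currStrings
    currStrings[$0]
    if ( pos = index($0,"N") ) {
        for ( char in chars ) {
            currStrings[substr($0,1,pos-1) char substr($0,pos+1)]
        }
    }
    for ( string in currStrings ) {
        if ( string in shortStrings[key] ) {
            print key ORS string
        }
    }
}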
$ awk -f tst.awk f2 f1
>seq11
TCAGATGTGTATAAGAGACAGGATTCTTCCACGGTTATTGAGAGTATGCGAGAA
>seq95
TCAGATGTGTATAAGAGACAGTACGTCTTGGTGACTATATCGAGGCTGAATGAA
>seq11
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGACTGCGCTAAGCGGCTACTTCGCATACT
>seq11
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGCAAGACCACTTGTGGCCGTTCGCATACT
>seq95
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGTCCGAGCTTGCCGAACAGTATTCAGCCT
>seq95
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGCGGAGAACCGTGCTCTCATATTCAGCCT
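The script above replaces only the first N in a line and relies on a hard-coded full length. As a rough sketch of one alternative (not the method used in the answer), each N can be turned into a regular-expression wildcard and matched against the f2 sequences stored under the same header. The file name fill_n.awk, the scratch directory and the shortened sample reads are all made up for illustration. Unlike the script above, this version tries to fill N in lines of any length and prints an f1 line unchanged when nothing in f2 matches it. It does not use arrays of arrays, so it should also run in a POSIX awk.

```shell
# Hypothetical scratch directory and file names, used only for this sketch.
dir=$(mktemp -d)

# Shortened, made-up reference reads standing in for f2.
cat > "$dir/f2" <<'EOF'
>seq11
ACGTAC
>seq95
TTGACA
>seq95
GGGTTTAAACCC
EOF

# Shortened, made-up reads standing in for f1, each with one N.
cat > "$dir/f1" <<'EOF'
>seq11
ACNTAC
>seq95
TTGNCA
EOF

cat > "$dir/fill_n.awk" <<'EOF'
# Header line: remember the current sequence ID.
/^>/ { key = $1; next }

# First file (f2): store every sequence under its ID, in input order.
NR == FNR {
    ref[key, ++count[key]] = $0
    next
}

# Second file (f1): fill in N characters where f2 has a matching sequence.
{
    out = $0
    if ( index($0, "N") ) {
        pattern = $0
        gsub(/N/, ".", pattern)        # each N may stand for any one character
        pattern = "^" pattern "$"      # the whole line has to match
        for ( i = 1; i <= count[key]; i++ ) {
            if ( ref[key, i] ~ pattern ) {
                out = ref[key, i]      # take the first f2 sequence that fits
                break
            }
        }
    }
    print key ORS out
}
EOF

awk -f "$dir/fill_n.awk" "$dir/f2" "$dir/f1"
# prints:
# >seq11
# ACGTAC
# >seq95
# TTGACA
```

If several f2 sequences fit the same pattern, the first one read wins, so ambiguous reads may need extra handling.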
Answered By - Ed Morton
Answer Checked By - Robin (WPSolving Admin)