Issue
I'm trying to find matching words in a file, allowing at most one mismatch between words. Below is part of the file and the output I expect.
The file I want to parse looks like this:
CTAGGA
TTAGCT
CGTACA
ACAGTG
ACACTG
And the output I want to obtain is something similar to this:
CTAGGA: CTAGGA
TTAGCT: TTAGCT
CGTACA: CGTACA
ACAGTG: ACAGTG, ACACTG
ACACTG: ACAGTG, ACACTG
The output doesn't need to look exactly like this, as long as it makes clear which words have <=1 mismatch. I DON'T want a match between words like CTAGGA and CTGGAC; they would match if the second word were something like CTAGGAC.
Thank you very much
Solution
Let's build a solution step-by-step by solving the subproblems.
Problem 1: Levenshtein distance (edit distance). agrep has it built in; -1 allows at most one error.
agrep -1 "ACATTG" dna.file
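If agrep is not installed, or you want to see what edit distance actually counts, here is a minimal sketch of the standard dynamic-programming Levenshtein distance in plain bash. The function name levenshtein is my own and not part of agrep. Note that agrep, like grep, reports lines that contain an approximate match of the pattern, so its results can differ from a whole-word comparison like this one.

```shell
#!/bin/bash
# Sketch: whole-word Levenshtein distance (insertions, deletions, substitutions).
# "levenshtein" is a hypothetical helper name, not an agrep feature.
levenshtein() {
    local a=$1 b=$2 i j cost del ins sub min
    local -a prev cur
    # Distance from the empty prefix of a to every prefix of b.
    for (( j = 0; j <= ${#b}; j++ )); do prev[j]=$j; done
    for (( i = 1; i <= ${#a}; i++ )); do
        cur=( "$i" )
        for (( j = 1; j <= ${#b}; j++ )); do
            if [[ ${a:i-1:1} == "${b:j-1:1}" ]]; then cost=0; else cost=1; fi
            del=$(( prev[j] + 1 ))        # delete a character of a
            ins=$(( cur[j-1] + 1 ))       # insert a character of b
            sub=$(( prev[j-1] + cost ))   # substitute (or keep) a character
            min=$del
            if (( ins < min )); then min=$ins; fi
            if (( sub < min )); then min=$sub; fi
            cur[j]=$min
        done
        prev=( "${cur[@]}" )
    done
    echo "${prev[${#b}]}"
}
```

With this, levenshtein ACAGTG ACACTG prints 1, levenshtein CTAGGA CTAGGAC prints 1 (one insertion), and levenshtein CTAGGA CTGGAC prints 2.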
Problem 2: Reading the file line by line
#!/bin/bash
#pass file as argument
while IFS='' read -r LINE || [ -n "${LINE}" ]; do
echo "processing line: ${LINE}"
done < "$1"
Call it with: ./script.sh <path to your genome file>
Problem 3: Combining it together and building the output.
#!/bin/bash
#pass file as argument
file=$1
while IFS='' read -r LINE || [ -n "${LINE}" ]; do
# the command substitution is left unquoted on purpose so that all matches end up on one line
echo "${LINE}:" $(agrep -1 "${LINE}" "$file")
done < "$file"
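If you want the comma-separated format from the question (ACAGTG: ACAGTG, ACACTG), one option is to join the matches explicitly instead of relying on word splitting. This is only a sketch; join_lines is a hypothetical helper name, and the loop runs only when a file argument is given.

```shell
#!/bin/bash
# join_lines (hypothetical helper): joins the lines on stdin with ", ".
join_lines() {
    paste -sd ',' - | sed 's/,/, /g'
}

# Same loop as above, but with comma-separated matches.
if [[ -n ${1:-} ]]; then
    file=$1
    while IFS='' read -r LINE || [ -n "${LINE}" ]; do
        echo "${LINE}: $(agrep -1 "${LINE}" "$file" | join_lines)"
    done < "$file"
fi
```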
Example:
Input file /tmp/genome.txt:
CTAGGA
TTAGCT
CGTACA
ACAGTG
ACACTG
TCAGGA
TTAAGG
TTGGAA
TTAGCA
TTGGAA
TTAGGA
Run script:
$ ./script.sh /tmp/genome.txt
CTAGGA: CTAGGA TCAGGA TTAGGA
TTAGCT: TTAGCT TTAGCA
CGTACA: CGTACA
ACAGTG: ACAGTG ACACTG
ACACTG: ACAGTG ACACTG
TCAGGA: CTAGGA TCAGGA TTAGGA
TTAAGG: TTAAGG TTAGGA
TTGGAA: TTGGAA TTGGAA
TTAGCA: TTAGCT TTAGCA TTAGGA
TTGGAA: TTGGAA TTGGAA
TTAGGA: CTAGGA TCAGGA TTGGAA TTAGCA TTGGAA TTAGGA
Note that 'one mismatch' is highly ambiguous. Which metric do you use to define 'one' mismatch? agrep counts an insertion, a deletion, or a substitution as one error each.
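If 'one mismatch' is meant to cover substitutions only (Hamming distance), so that words of different lengths never match, a plain bash sketch without agrep could look like the following. This is just one possible reading of the question; hamming_le1 and match_file are names I made up, and mapfile needs bash 4 or newer.

```shell
#!/bin/bash
# Sketch: substitution-only matching (Hamming distance <= 1).
# hamming_le1 and match_file are hypothetical helper names.

# Succeeds when both words have the same length and differ in at most one position.
hamming_le1() {
    local a=$1 b=$2 i diff=0
    if (( ${#a} != ${#b} )); then return 1; fi
    for (( i = 0; i < ${#a}; i++ )); do
        if [[ ${a:i:1} != "${b:i:1}" ]]; then
            diff=$(( diff + 1 ))
        fi
        if (( diff > 1 )); then return 1; fi
    done
    return 0
}

# Compares every line of the file against every line and prints "WORD: MATCHES".
match_file() {
    local file=$1 line other out
    local -a words
    mapfile -t words < "$file"
    for line in "${words[@]}"; do
        out=""
        for other in "${words[@]}"; do
            if hamming_le1 "$line" "$other"; then out+=" $other"; fi
        done
        echo "${line}:${out}"
    done
}

# Run only when a file is passed as the first argument.
if [[ -n ${1:-} ]]; then match_file "$1"; fi
```

On the five-line file from the question, this pairs only ACAGTG with ACACTG; every other word matches just itself. It compares all pairs of lines, so expect it to be slow on large files.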
Does that answer your question?
Answered By - kaiya Answer Checked By - David Marino (WPSolving Volunteer)