Wednesday, February 7, 2024

[SOLVED] Finding matches between words in file allowing one mismatch

Issue

I'm trying to find matching words in a file, allowing one mismatch between words. Below is part of the file and the output I expect.

The file I want to parse looks like this:

CTAGGA
TTAGCT
CGTACA
ACAGTG
ACACTG

And the output I want to obtain is something similar to this:

CTAGGA: CTAGGA
TTAGCT: TTAGCT
CGTACA: CGTACA
ACAGTG: ACAGTG, ACACTG
ACACTG: ACAGTG, ACACTG

The output doesn't need to look exactly like this, as long as it shows clearly which words have <=1 mismatch. I DON'T want a match between something like CTAGGA and CTGGAC, where they would match if the second word was something like CTAGGAC.

Thank you very much


Solution

Let's build a solution step-by-step by solving the subproblems.

Problem 1: Levenshtein distance (edit distance). agrep has it built in.

agrep -1 "ACATTG" dna.file
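Because edit distance also counts insertions and deletions, it can be useful to see what a substitution-only count looks like. The following is a minimal sketch, not part of agrep: hamming_distance is a hypothetical helper name, and it assumes both words are plain strings without spaces.

```bash
#!/bin/bash
# Hypothetical helper (not part of agrep): count the positions at which two
# equal-length words differ. Insertions and deletions are never considered.
hamming_distance() {
    local a=$1 b=$2 i count=0
    # Words of different length cannot be compared position by position.
    if [ "${#a}" -ne "${#b}" ]; then
        return 1
    fi
    for (( i = 0; i < ${#a}; i++ )); do
        if [ "${a:i:1}" != "${b:i:1}" ]; then
            count=$(( count + 1 ))
        fi
    done
    echo "${count}"
}
```

For example, hamming_distance ACAGTG ACACTG prints 1, while hamming_distance CTAGGA CTAGGAC prints nothing and returns a non-zero status because the lengths differ.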

Problem 2: Reading the file line by line

#!/bin/bash
# Pass the file as the first argument.

# The "|| [ -n ... ]" part also handles a last line without a trailing newline.
while IFS='' read -r LINE || [ -n "${LINE}" ]; do
    echo "processing line: ${LINE}"
done < "$1"

Call it with: ./script.sh <absolutepathtoyourgenomefile>

Problem 3: Combining it together and building the output.

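#!/bin/bash
# Pass the file as the first argument.

file=$1

# For each line, print it followed by every line agrep matches with at most one error.
# The command substitution is left unquoted on purpose so the matches end up on one line.
while IFS='' read -r LINE || [ -n "${LINE}" ]; do
    echo "${LINE}:" $(agrep -1 "${LINE}" "${file}")
done < "${file}"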

Example:

Input file /tmp/genome.txt:

CTAGGA
TTAGCT
CGTACA
ACAGTG
ACACTG
TCAGGA
TTAAGG
TTGGAA
TTAGCA
TTGGAA
TTAGGA

Run script:

$ ./script.sh /tmp/genome.txt 

CTAGGA: CTAGGA TCAGGA TTAGGA
TTAGCT: TTAGCT TTAGCA
CGTACA: CGTACA
ACAGTG: ACAGTG ACACTG
ACACTG: ACAGTG ACACTG
TCAGGA: CTAGGA TCAGGA TTAGGA
TTAAGG: TTAAGG TTAGGA
TTGGAA: TTGGAA TTGGAA
TTAGCA: TTAGCT TTAGCA TTAGGA
TTGGAA: TTGGAA TTGGAA
TTAGGA: CTAGGA TCAGGA TTGGAA TTAGCA TTGGAA TTAGGA

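Note that 'one mismatch' is highly ambiguous. What metric do you use to define what is 'one' mismatch? agrep -1 counts an insertion or a deletion as one error, just like a substitution, which is why TTAAGG is listed with TTAGGA above even though the two words differ in two positions when compared letter by letter.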
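If 'one mismatch' is meant as substitutions only (no insertions or deletions), the same loop can be written without agrep. This is a rough sketch under that assumption, not a drop-in replacement: within_one_mismatch and print_matches are hypothetical names, every word is compared with every other word (so it gets slow on large files), and empty lines are skipped.

```bash
#!/bin/bash
# Sketch: substitution-only matching (no insertions or deletions).

# Succeeds when $1 and $2 have equal length and differ in at most one position.
within_one_mismatch() {
    local a=$1 b=$2 i mismatches=0
    [ "${#a}" -eq "${#b}" ] || return 1
    for (( i = 0; i < ${#a}; i++ )); do
        if [ "${a:i:1}" != "${b:i:1}" ]; then
            mismatches=$(( mismatches + 1 ))
            if [ "${mismatches}" -gt 1 ]; then
                return 1
            fi
        fi
    done
    return 0
}

# Prints "WORD: MATCH MATCH ..." for every non-empty line of the file in $1.
print_matches() {
    local file=$1 word other matches
    local -a words=()
    while IFS='' read -r word || [ -n "${word}" ]; do
        if [ -n "${word}" ]; then
            words+=("${word}")
        fi
    done < "${file}"
    for word in "${words[@]}"; do
        matches=""
        for other in "${words[@]}"; do
            if within_one_mismatch "${word}" "${other}"; then
                matches+=" ${other}"
            fi
        done
        echo "${word}:${matches}"
    done
}

# Process a file only when one is given as an argument, e.g. ./script.sh /tmp/genome.txt
if [ "$#" -ge 1 ]; then
    print_matches "$1"
fi
```

With the five words from the question, this pairs ACAGTG with ACACTG and lists every other word only with itself.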

Does that solve your question?



Answered By - kaiya
Answer Checked By - David Marino (WPSolving Volunteer)