Issue
Sorry if this is just a simple regex question. I'm trying to grep for the first X characters in a line. My original idea was something like:
#!/usr/bin/env bash
X=$1
cat filename | ... stream ... | grep -r '\w{0,$X}'
Although I don't actually think this would work...
Basically, suppose I had the following:
ATGGCGAGAGAGGTGCCTATAGAGAAATTGAGAAACATAGGTATAGTTGCTCACATTGACGCGGGTAAAACTACGACTAC
CGAGAG**A**ATTCTCTATTACACGGGTAAGACTTACAAGATAGGTGAAGTTCACGAAGGTGCTGCAACGATGGACTGGATGC
CCCAGGAAAAGGAAAGAGGTATAACCATAACCGTTGCAACGACCGCATGTTATTGGACGAGAAACGGGGAGAGGTATCAA
If I wanted to grep to the 7th place on the 2nd line, how would I do this? What regex would work to only get the following:
ATGGCGAGAGAGGTGCCTATAGAGAAATTGAGAAACATAGGTATAGTTGCTCACATTGACGCGGGTAAAACTACGACTAC
CGAGAGA
More detailed explanation:
What I currently have prints the line that contains the indicated position, plus the line before it, but it doesn't show where the position falls within that line (I cat my FASTA file into this):
#!/usr/bin/env bash
spot=$1
myvar=`expr $spot / 81`
#later, I awk in the line number as a column, to navigate
X=$(($spot % 81))
#since each line in the file I'm looking at has 81 characters per line (not including the newline character), this gives the spot I'm looking for
grep -v '>' | awk 'BEGIN{t=-1}{t = t + 1; print t, $0}' | grep -B 1 "$myvar" | head
Basically I'm trying to whip up an easy command-line FASTA file navigator (nucleotide sequence, protein sequence), and I want to view the sequence up to a designated spot (I don't use $X here yet).
So, for example, suppose I want to read up to the 7th position on the second line of the following sequence (the bolded A here). Overall that might be something like position 10051, which is on the 124th line at position 7:
>NC_000918.1 Aquifex aeolicus VF5, complete sequence
...
ATGGCGAGAGAGGTGCCTATAGAGAAATTGAGAAACATAGGTATAGTTGCTCACATTGACGCGGGTAAAACTACGACTAC
CGAGAG**A**ATTCTCTATTACACGGGTAAGACTTACAAGATAGGTGAAGTTCACGAAGGTGCTGCAACGATGGACTGGATGC
CCCAGGAAAAGGAAAGAGGTATAACCATAACCGTTGCAACGACCGCATGTTATTGGACGAGAAACGGGGAGAGGTATCAA
I want my read to include both the previous line and the "current" line up to the 7th position, so (based on the script I currently have) I want something like:
ATGGCGAGAGAGGTGCCTATAGAGAAATTGAGAAACATAGGTATAGTTGCTCACATTGACGCGGGTAAAACTACGACTAC
CGAGAGA
Currently the script gives the following (the leading 123 and 124 are line-number columns, in case wrapping makes that unclear):
123 ATGGCGAGAGAGGTGCCTATAGAGAAATTGAGAAACATAGGTATAGTTGCTCACATTGACGCGGGTAAAACTACGACTAC
124 CGAGAGAATTCTCTATTACACGGGTAAGACTTACAAGATAGGTGAAGTTCACGAAGGTGCTGCAACGATGGACTGGATGC
It also prints other lines that have 123 or 124 in them. I don't mind the line numbers (although they are easy to get rid of, I guess); I just want a more precise view here.
I'm fairly new to bash scripting, so let me know if there is anything weird I wrote as well!
(Note: the lines I show are actually the first three lines of the VF5 FASTA file; I just pretend they are lines 123, 124, etc. to illustrate the point.)
Solution
Assuming:
- You want to print the line that includes the specified position, where the position is the character count from the beginning of the sequence.
- You want to terminate that line at the specified position, not print the whole matched line.
- You want to include the previous line.
Then would you please try the following awk solution:
#!/bin/bash
spot=$1 # assigned to "10051" or whatever
awk -v spot="$spot" '!/^>/ {
    amount += length
    if (amount >= spot) {
        print(prev substr($0, 1, spot - (amount - length)))
        exit
    }
    prev = $0 RS
}' file.fasta
- The -v spot="$spot" option passes the value of the bash variable $spot to the awk variable spot.
- The pattern !/^>/ skips the header line.
- The variable amount accumulates the character length.
- The variable prev keeps the previous line (with RS, the record separator, appended).
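As a quick check of the points above, here is a minimal sketch that wraps the same awk program in a function and runs it on a toy file. The function name read_upto, the temporary file, and the 10-character toy lines are assumptions made for illustration only; they are not part of the original script or of any real FASTA file.

```shell
#!/usr/bin/env bash
# read_upto SPOT FILE  (hypothetical helper name, not from the original script)
# Prints the line before the one containing SPOT, then the line containing
# SPOT cut off at SPOT. Header lines starting with ">" are not counted.
read_upto() {
    awk -v spot="$1" '!/^>/ {
        amount += length                 # running total of sequence characters
        if (amount >= spot) {
            # spot - (amount - length) is the column within the current line
            print(prev substr($0, 1, spot - (amount - length)))
            exit
        }
        prev = $0 RS                     # remember this line plus a newline
    }' "$2"
}

# Toy data: one header and three 10-character lines (made up for the demo).
demo=$(mktemp)
printf '%s\n' '>toy header' 'AAAAACCCCC' 'GGGGGTTTTT' 'ACGTACGTAC' > "$demo"

# Position 17 is column 7 of the second sequence line.
read_upto 17 "$demo"
# prints:
# AAAAACCCCC
# GGGGGTT
```

If the position falls on the first sequence line, prev is still empty, so only the truncated line is printed.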
Please note that the line length in FASTA format is not fixed to a specific value such as 80. The format description only says:
It is recommended that all lines of text be shorter than 80 characters in length.
So it is better to simply count the length of each line rather than divide by an assumed width.
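To illustrate why counting is safer than dividing by an assumed width, the sketch below reports the sequence line number and column of a position by accumulating line lengths. The function name locate_spot and the toy data (with one deliberately short line) are hypothetical and only serve the example.

```shell
#!/usr/bin/env bash
# locate_spot SPOT FILE  (hypothetical helper name)
# Prints "LINE COLUMN" for the SPOT-th sequence character, where LINE counts
# only non-header lines. Nothing is printed if SPOT is past the end.
locate_spot() {
    awk -v spot="$1" '!/^>/ {
        n++                              # sequence line number (headers skipped)
        if (total + length >= spot) {
            print n, spot - total        # column within this line
            exit
        }
        total += length                  # characters seen on earlier lines
    }' "$2"
}

# Toy data with unequal line lengths: 10, 5 and 10 characters.
demo=$(mktemp)
printf '%s\n' '>toy header' 'AAAAACCCCC' 'GGGGG' 'ACGTACGTAC' > "$demo"

locate_spot 17 "$demo"
# prints: 3 2
```

With a fixed width of 10, integer division would have put position 17 on the second line at column 7, which is wrong here because the second line holds only 5 characters.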
Answered By - tshiono