Monday, November 1, 2021

[SOLVED] grep for the first X amount of characters in a line in a stream

November 01, 2021 bash, grep

Issue

Sorry if this is just simple regex; I'm trying to grep for the first X characters in a line. My original idea was something like

#!/usr/bin/env bash
X=$1    
cat filename | ... stream ... | grep -r '\w{0,$X}'

Although I don't actually think this would work...

Basically, suppose I had the following:

ATGGCGAGAGAGGTGCCTATAGAGAAATTGAGAAACATAGGTATAGTTGCTCACATTGACGCGGGTAAAACTACGACTAC
CGAGAG**A**ATTCTCTATTACACGGGTAAGACTTACAAGATAGGTGAAGTTCACGAAGGTGCTGCAACGATGGACTGGATGC
CCCAGGAAAAGGAAAGAGGTATAACCATAACCGTTGCAACGACCGCATGTTATTGGACGAGAAACGGGGAGAGGTATCAA

If I wanted to grep to the 7th place on the 2nd line, how would I do this? What regex would work to only get the following:

ATGGCGAGAGAGGTGCCTATAGAGAAATTGAGAAACATAGGTATAGTTGCTCACATTGACGCGGGTAAAACTACGACTAC

CGAGAGA

More detailed explanation:

What I currently have prints the line that the indicated position is on, and the line before it, but doesn't indicate the exact location of the position in the output (I cat my FASTA file into this):

#!/usr/bin/env bash
spot=$1
myvar=`expr $spot / 81`
#later, I awk in the line number as a column, to navigate
X=$(($spot % 81))
#since each line in the file I'm looking at has 81 characters per line (not including the newline character), this gives the spot I'm looking for

grep -v '>' | awk 'BEGIN{t=-1}{t = t + 1; print t, $0}' | grep -B 1 "$myvar" | head

Basically I'm trying to whip up an easy command-line FASTA file navigator (nucleotide sequence, protein sequence), and want to view the sequence up to a designated spot (I don't use $X here yet).

So, for example, suppose I want to read up to the 7th position in the following sequence (the bolded A here); say this is position 10051 overall, which is on the 124th line at position 7:

\>NC_000918.1 Aquifex aeolicus VF5, complete sequence

...

ATGGCGAGAGAGGTGCCTATAGAGAAATTGAGAAACATAGGTATAGTTGCTCACATTGACGCGGGTAAAACTACGACTAC
CGAGAG**A**ATTCTCTATTACACGGGTAAGACTTACAAGATAGGTGAAGTTCACGAAGGTGCTGCAACGATGGACTGGATGC
CCCAGGAAAAGGAAAGAGGTATAACCATAACCGTTGCAACGACCGCATGTTATTGGACGAGAAACGGGGAGAGGTATCAA

I want my read to include both the previous line and the "current" line up to the 7th position, so (based on the script I currently have) I want something like

ATGGCGAGAGAGGTGCCTATAGAGAAATTGAGAAACATAGGTATAGTTGCTCACATTGACGCGGGTAAAACTACGACTAC

CGAGAGA

Currently the script gives the following (there are two columns, the line number and the sequence, in case wrapping makes that unclear):

123    ATGGCGAGAGAGGTGCCTATAGAGAAATTGAGAAACATAGGTATAGTTGCTCACATTGACGCGGGTAAAACTACGACTAC

124    CGAGAGAATTCTCTATTACACGGGTAAGACTTACAAGATAGGTGAAGTTCACGAAGGTGCTGCAACGATGGACTGGATGC

It also prints other lines that have 123 or 124 in them. I don't mind the line numbers (although they are easy to get rid of, I guess); I just want a more specific view here.

I'm fairly new to bash scripting, so let me know if there is anything weird I wrote as well!

(Note: the lines I show are actually the first three lines from the VF5 FASTA file; I just pretend they are lines 123, 124, etc. to illustrate the point.)

Solution

Assuming:

You want to print the line which includes the specified position, where the position is the character count from the beginning of the sequence.
You want to terminate the line at the specified position, not printing the whole matched line.
You want to include the previous line.

Then please try this awk solution:

#!/bin/bash

spot=$1                         # assigned to "10051" or whatever
awk -v spot="$spot" '!/^>/ {
    amount += length
    if (amount >= spot) {
        print(prev substr($0, 1, spot - (amount - length)))
        exit
    }
    prev = $0 RS
}' file.fasta

The -v spot="$spot" option assigns the value of the bash variable $spot to the awk variable spot.
The pattern !/^>/ skips header lines (those starting with >).
The variable amount accumulates the character length.
The variable prev keeps the previous line (appended with RS, record separator).
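
To see the arithmetic on something small, here is a minimal sketch that runs the same awk program against a made-up FASTA file with 10-character lines. The demo file, its contents, and the function name fasta_upto are assumptions for illustration only; they are not part of the original answer.

```shell
#!/usr/bin/env bash
# Hypothetical demo file: one header and three 10-character sequence lines.
demo=$(mktemp)
printf '%s\n' '>demo sequence' 'AAAAAAAAAA' 'CCCCCCCCCC' 'GGGGGGGGGG' > "$demo"

# Same awk program as in the answer, wrapped in a function
# (the name fasta_upto is illustrative).
fasta_upto() {
    awk -v spot="$1" '!/^>/ {
        amount += length($0)          # running total of sequence characters
        if (amount >= spot) {
            # spot - (amount - length($0)) is the offset within this line
            print(prev substr($0, 1, spot - (amount - length($0))))
            exit
        }
        prev = $0 RS                  # remember this line for the next round
    }' "$2"
}

# Position 13 falls on the second sequence line, 3 characters in.
result=$(fasta_upto 13 "$demo")
printf '%s\n' "$result"
rm -f "$demo"
```

With spot set to 13, the first sequence line contributes 10 characters, so the script prints that whole line followed by the first 3 characters of the next one (AAAAAAAAAA, then CCC).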

Please note that the line length in FASTA format is not fixed to a specific value such as 80. The documentation only states:

It is recommended that all lines of text be shorter than 80 characters in length.

So it is better to simply count the lengths of the lines.
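
If you want to check how uniform the line lengths actually are in a given file, a rough sketch like the following tallies them. The demo file here is made up for illustration; in practice you would point awk at your own FASTA file.

```shell
#!/usr/bin/env bash
# Hypothetical demo file whose last sequence line is shorter than the others.
demo=$(mktemp)
printf '%s\n' '>demo sequence' 'ACGTACGTAC' 'ACGTACGTAC' 'ACGT' > "$demo"

# Count how many sequence lines have each length (header lines are skipped),
# then sort numerically by length.
lengths=$(awk '!/^>/ { count[length($0)]++ }
               END   { for (len in count) print len, count[len] }' "$demo" | sort -n)
printf '%s\n' "$lengths"
rm -f "$demo"
```

Each output line is a length followed by the number of sequence lines with that length. If more than one length shows up before the final line of a record, a fixed divisor such as 81 will drift, which is why the awk solution sums the real lengths instead.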

Answered By - tshiono

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
