Issue
I need to extract several lines of text (the blocks vary in length throughout a 500 MB document) between a line that starts with Query # and two consecutive carriage returns. This is being done on a Mac. For example, the document format is:
Query #1: 020.1-Bni_its1_2019_envio1set1
lines I need to extract
Alignments (the following lines I don't need)
xyz
xyx
Query #2: This and the following lines I need. And so on.
There are always exactly two carriage returns before the word "Alignments". So basically I need all the lines from each Query # line up to Alignments.
I tried the following regex, but I only recover the first line of each block.
ggrep -P 'Query #.*?(?:[\r\n]{2}|\Z)'
I have tested multiple iterations of the regex on regex101, but have not yet found the answer.
The expected output is:
Query #1. Text.
Lines I need to extract
Query #2: This and following lines I need.
Lines I need.
Query #....
Thanks in advance for any pointers.
Solution
With pcregrep, you can use
pcregrep -oM 'Query #.*(?:\R(?!\R{2}).*)*' file.txt > results.txt
Here,
o - outputs matched texts
M - enables matching across lines (puts line endings into "pattern space")
Query #.*(?:\R(?!\R{2}).*)* matches
Query # - literal text
.* - the rest of the line
(?:\R(?!\R{2}).*)* - zero or more sequences of a line break sequence (\R) not immediately followed by two line break sequences ((?!\R{2})) and then the rest of the line.
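If pcregrep is not available, the same extraction can be sketched as a line-by-line scan in Python, which also avoids holding the whole 500 MB file in memory. This is only a rough sketch and not part of the original answer. The names extract_query_blocks and write_blocks are made up for illustration. It assumes every block to keep starts with a line beginning with "Query #" and ends at a line beginning with "Alignments", and, unlike the regex, it simply drops blank lines inside a block.

```python
from typing import Iterable, Iterator, List


def extract_query_blocks(lines: Iterable[str]) -> Iterator[str]:
    """Yield each block from a 'Query #' line up to the next 'Alignments' line."""
    block: List[str] = []
    capturing = False
    for raw in lines:
        line = raw.rstrip("\r\n")  # tolerate both \n and \r\n line endings
        if line.startswith("Query #"):
            if block:  # a previous block was never closed by 'Alignments'
                yield "\n".join(block)
            block = [line]
            capturing = True
        elif line.startswith("Alignments"):
            if block:
                yield "\n".join(block)
            block = []
            capturing = False
        elif capturing and line.strip():
            # Assumption: blank lines are never wanted in the output.
            block.append(line)
    if block:  # the file ended while still inside a block
        yield "\n".join(block)


def write_blocks(in_path: str, out_path: str) -> None:
    """Stream in_path and write the extracted blocks to out_path."""
    with open(in_path, encoding="utf-8", errors="replace") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for block in extract_query_blocks(src):
            dst.write(block + "\n")
```

Calling write_blocks("file.txt", "results.txt") would mirror the file names in the pcregrep command above. The UTF-8 encoding is also an assumption, so adjust it if the document uses something else.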
Answered By - Wiktor Stribiżew Answer Checked By - Timothy Miller (WPSolving Admin)