Saturday, April 9, 2022

[SOLVED] Regex select several lines until two consecutive new lines not working on Mac

Issue

I need to extract several lines of text (the blocks vary in length throughout a 500 MB document) between a line that starts with Query # and two consecutive carriage returns (blank lines). This is being done on a Mac. For example, the document format is:

Query #1: 020.1-Bni_its1_2019_envio1set1 

lines I need to extract


Alignments (the following lines I don't need)

xyz
xyx

Query #2: This and the following lines I need. And so on.

There are always exactly two carriage returns before the word "Alignments". So basically I need all the lines from Query #.: until Alignments.

I tried the following regex, but it only returns the first line of each block.

ggrep -P 'Query #.*?(?:[\r\n]{2}|\Z)' 

I have tested several variations of the regex on regex101, but have not yet found the answer.

The expected output is:

Query #1.   Text.

Lines I need to extract

Query #2: This and following lines I need.

Lines I need.

Query #....

Thanks in advance for any pointers.


Solution

With pcregrep, you can use

pcregrep -oM 'Query #.*(?:\R(?!\R{2}).*)*' file.txt > results.txt

Here,

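  • o - outputs only the matched text rather than the whole line
  • M - enables matching across lines, so a single match can span line breaks (grep -P tests one line at a time, which is why the original attempt returned only the first line)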
  • Query #.*(?:\R(?!\R{2}).*)* matches
    • Query # - literal text
    • .* - the rest of the line
    • (?:\R(?!\R{2}).*)* - zero or more repetitions of a line break sequence (\R) that is not immediately followed by two more line break sequences ((?!\R{2})), and then the rest of that line. The match therefore stops at the last line before two consecutive blank lines.
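
If pcregrep is not available, the same rule (start at a line beginning with Query #, stop before two consecutive blank lines) can be approximated with a line-by-line awk filter. This is only a sketch and not part of the original answer: it assumes LF line endings and that Query # appears at the very start of the line, and the sample file contents below are made up for illustration.

```shell
# Hypothetical sample input that mirrors the layout described in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Query #1: first record

lines I need to extract


Alignments (not needed)

xyz
xyx

Query #2: second record

more lines I need


Alignments (not needed)
EOF

result=$(awk '
  # A line starting with "Query #" switches capturing on.
  /^Query #/ { capturing = 1 }

  # Outside a block, skip everything.
  !capturing { next }

  # Hold back blank lines; a second consecutive blank line ends the block.
  /^$/ {
    if (++blanks == 2) { capturing = 0; blanks = 0 }
    next
  }

  # A non-blank line inside a block: emit any single held-back blank line first.
  {
    while (blanks > 0) { print ""; blanks-- }
    print
  }
' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

Because awk reads one line at a time, this approach does not need to hold the whole 500 MB file in memory. To write the result to a file, redirect the awk output as in the pcregrep command above, for example to results.txt.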



Answered By - Wiktor Stribiżew
Answer Checked By - Timothy Miller (WPSolving Admin)