Tuesday, February 1, 2022

[SOLVED] How do I match the last occurrence of Pattern before another Pattern with REGEX

Issue

I have a huge XML file and I need to extract the content of a whole tag that contains a sequence of numbers. Everything is on one line in my file; I added line breaks here to make it more readable.

Here is a simplified example.

The file:

<ORDERS>
<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>12345</tag3><tag4>ccc</tag4></IDOC>
<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>23456</tag3><tag4>ccc</tag4></IDOC>
<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>0007537181</tag3><tag4>ccc</tag4></IDOC>
<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>34567</tag3><tag4>ccc</tag4></IDOC>
</ORDERS>

I want to match the IDOC BEGIN tag that contains the sequence 0007537181, so the result would be:

<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>0007537181</tag3><tag4>ccc</tag4></IDOC>

So far I got this regex:

cat myfile | grep -oP '<IDOC BEGIN.*?0007537181.*?</IDOC>'

This returns everything from the start of the first tag with that name up to the end of the one I want:

<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>12345</tag3><tag4>ccc</tag4></IDOC>
<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>23456</tag3><tag4>ccc</tag4></IDOC>
<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>0007537181</tag3><tag4>ccc</tag4></IDOC>
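The over-match can be reproduced outside grep. The following is a minimal Python sketch, assuming the sample data above is joined into a single line (the variable names are illustrative only). It shows that the lazy pattern still starts at the first `<IDOC BEGIN`, because a regex engine reports the leftmost match and `.*?` is free to run across element boundaries.

```python
import re

# Sample data from the question, joined into one line as in the real file.
data = (
    "<ORDERS>"
    "<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>12345</tag3><tag4>ccc</tag4></IDOC>"
    "<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>23456</tag3><tag4>ccc</tag4></IDOC>"
    "<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>0007537181</tag3><tag4>ccc</tag4></IDOC>"
    "<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>34567</tag3><tag4>ccc</tag4></IDOC>"
    "</ORDERS>"
)

lazy = re.compile(r"<IDOC BEGIN.*?0007537181.*?</IDOC>")
match = lazy.search(data)

# The match begins at the very first "<IDOC BEGIN" in the string ...
starts_at_first_idoc = match.start() == data.index("<IDOC BEGIN")
# ... so it spans every IDOC element up to and including the wanted one.
idoc_count = match.group(0).count("<IDOC BEGIN")

print(starts_at_first_idoc, idoc_count)
```

Laziness only shortens the match from the right; it does not move the starting point to the right.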

I managed to work around this by piping the output into a second regex that keeps only the last occurrence of IDOC BEGIN:

cat myfile | grep -oP '<IDOC BEGIN.*?0007537181.*?</IDOC>' | grep -oP '<IDOC BEGIN(?!.*<IDOC BEGIN).*?</IDOC>'

To summarize, I need to get the last IDOC BEGIN before the sequence of numbers.

Please keep in mind that the original file does not have line breaks, everything is in one line.


Solution

The regex you could use is based either on a greedy dot pattern placed at the start and followed by the \K match reset operator, or on a tempered greedy token. Both can be very inefficient on large strings that contain partial matches but no full match, since the engine may have to backtrack heavily before giving up.

The two regexes are:

.*\K<IDOC BEGIN.*?0007537181.*?</IDOC>
<IDOC BEGIN(?:(?!<IDOC BEGIN).)*?0007537181(?:(?!<IDOC BEGIN).)*?</IDOC>
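To see both ideas at work, here is a hedged Python sketch that uses the number from the sample data (0007537181). One assumption to flag: Python's standard `re` module does not support `\K`, so the first variant swaps it for a greedy `.*` followed by a capture group, which is a stand-in and not the exact pattern above. The tempered greedy token is used unchanged. The variable names are illustrative.

```python
import re

data = (
    "<ORDERS>"
    "<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>12345</tag3><tag4>ccc</tag4></IDOC>"
    "<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>23456</tag3><tag4>ccc</tag4></IDOC>"
    "<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>0007537181</tag3><tag4>ccc</tag4></IDOC>"
    "<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>34567</tag3><tag4>ccc</tag4></IDOC>"
    "</ORDERS>"
)
target = "<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>0007537181</tag3><tag4>ccc</tag4></IDOC>"

# Stand-in for ".*\K": the greedy ".*" grabs the whole line and backtracks
# to the last "<IDOC BEGIN" from which the rest of the pattern can match.
# The capture group plays the role of the text kept after \K.
greedy = re.compile(r".*(<IDOC BEGIN.*?0007537181.*?</IDOC>)")
greedy_result = greedy.search(data).group(1)

# Tempered greedy token: each "." is consumed only if a new "<IDOC BEGIN"
# does not start at that position, so the match cannot cross into another element.
tempered = re.compile(
    r"<IDOC BEGIN(?:(?!<IDOC BEGIN).)*?0007537181(?:(?!<IDOC BEGIN).)*?</IDOC>"
)
tempered_result = tempered.search(data).group(0)

print(greedy_result == target, tempered_result == target)
```

With `grep -oP`, which uses PCRE, the `\K` form can be used directly and no capture group is needed.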

The best idea is to unroll the tempered greedy token in these cases:

<IDOC BEGIN[^<]*(?:<(?!IDOC BEGIN)[^<]*?)*0007537181.*?</IDOC>

The first .*? is replaced with [^<]*(?:<(?!IDOC BEGIN)[^<]*?)*:

  • [^<]* - a negated character class matching 0 or more chars other than <, as many as possible
  • (?:<(?!IDOC BEGIN)[^<]*?)* - 0 or more repetitions of
    • <(?!IDOC BEGIN) - a < char that is not immediately followed by the string IDOC BEGIN
    • [^<]*? - a negated character class matching 0 or more chars other than <, as few as possible
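The breakdown above can be written out as a commented pattern. This is a sketch in Python's `re.VERBOSE` mode, not part of the original answer. In verbose mode unescaped whitespace is ignored, so the space in `IDOC BEGIN` is written as `[ ]`. The variable names are illustrative.

```python
import re

data = (
    "<ORDERS>"
    "<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>12345</tag3><tag4>ccc</tag4></IDOC>"
    "<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>23456</tag3><tag4>ccc</tag4></IDOC>"
    "<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>0007537181</tag3><tag4>ccc</tag4></IDOC>"
    "<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>34567</tag3><tag4>ccc</tag4></IDOC>"
    "</ORDERS>"
)
target = "<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>0007537181</tag3><tag4>ccc</tag4></IDOC>"

unrolled = re.compile(
    r"""
    <IDOC[ ]BEGIN          # literal start of the element
    [^<]*                  # chars other than "<", as many as possible
    (?:                    # then, zero or more times:
        <(?!IDOC[ ]BEGIN)  #   a "<" that does not begin another IDOC BEGIN
        [^<]*?             #   chars other than "<", as few as possible
    )*
    0007537181             # the number being searched for
    .*?</IDOC>             # shortest run up to the closing tag
    """,
    re.VERBOSE,
)

found = unrolled.search(data)
result = found.group(0) if found else None

# A number that is absent from the data gives no match at all.
absent = re.search(
    r"<IDOC BEGIN[^<]*(?:<(?!IDOC BEGIN)[^<]*?)*9999999999.*?</IDOC>", data
)

print(result == target, absent is None)
```

Because the lookahead is checked only at `<` characters rather than at every position, the engine does less work per character than with the tempered greedy token.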


Answered By - Wiktor Stribiżew
Answer Checked By - Gilberto Lyons (WPSolving Admin)