Issue
I have a huge XML file and I need to extract the whole tag that contains a given sequence of digits. Everything is on one line in my file; I added line breaks here to make it more readable.
Here is a simplified example.
The file:
<ORDERS>
<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>12345</tag3><tag4>ccc</tag4></IDOC>
<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>23456</tag3><tag4>ccc</tag4></IDOC>
<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>0007537181</tag3><tag4>ccc</tag4></IDOC>
<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>34567</tag3><tag4>ccc</tag4></IDOC>
</ORDERS>
I want to match the IDOC BEGIN tag that contains the sequence 0007537181. The expected result is:
<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>0007537181</tag3><tag4>ccc</tag4></IDOC>
So far I got this regex:
cat myfile | grep -oP '<IDOC BEGIN.*?0007537181.*?</IDOC>'
Which results in everything from the beginning of the first tag with the same name until the one that I want:
<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>12345</tag3><tag4>ccc</tag4></IDOC>
<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>23456</tag3><tag4>ccc</tag4></IDOC>
<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>0007537181</tag3><tag4>ccc</tag4></IDOC>
I managed to work around this by piping the output into a second regex that keeps only the last occurrence of IDOC BEGIN:
cat myfile | grep -oP '<IDOC BEGIN.*?0007537181.*?</IDOC>' | grep -oP '<IDOC BEGIN(?!.*<IDOC BEGIN).*?</IDOC>'
To summarize, I need to get the last IDOC BEGIN before the sequence of numbers.
Please keep in mind that the original file does not have line breaks, everything is in one line.
Solution
The regex you could use is either based on a greedy dot pattern placed at the start and followed by a \K
match reset operator, or based on a tempered greedy token. Both are risky on large strings: when the input contains partial matches that ultimately fail, they can backtrack heavily.
The two regexes are:
.*\K<IDOC BEGIN.*?0007537181.*?</IDOC>
<IDOC BEGIN(?:(?!<IDOC BEGIN).)*?0007537181(?:(?!<IDOC BEGIN).)*?</IDOC>
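To see the difference on the sample data, here is a minimal Python sketch. It is only an illustration, not part of the original answer: the data is rebuilt from the question's example as a single line, the ID used is the one from that sample (0007537181), and Python's re module is used instead of grep -P. Python's re does not support \K, so only the original lazy pattern and the tempered greedy token are compared.

```python
import re

# Sample data from the question, joined into one line as in the real file.
ids = ["12345", "23456", "0007537181", "34567"]
data = "<ORDERS>" + "".join(
    f"<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>{n}</tag3><tag4>ccc</tag4></IDOC>"
    for n in ids
) + "</ORDERS>"

# Original attempt: .*? is lazy, but the match still begins at the
# leftmost <IDOC BEGIN, so earlier records are swallowed.
lazy = re.compile(r"<IDOC BEGIN.*?0007537181.*?</IDOC>")

# Tempered greedy token: a char is consumed only if no new
# <IDOC BEGIN starts at that position.
tempered = re.compile(
    r"<IDOC BEGIN(?:(?!<IDOC BEGIN).)*?0007537181(?:(?!<IDOC BEGIN).)*?</IDOC>"
)

lazy_match = lazy.search(data).group(0)
tempered_match = tempered.search(data).group(0)

print(lazy_match.count("<IDOC BEGIN"))  # how many records the lazy match spans
print(tempered_match)                   # the single record holding the ID
```

The lazy pattern spans several records because a regex engine reports the leftmost possible match; laziness only shortens the match on the right-hand side.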
The best idea is to unroll the tempered greedy token in these cases:
<IDOC BEGIN[^<]*(?:<(?!IDOC BEGIN)[^<]*?)*0007537181.*?</IDOC>
The first .*? is replaced with [^<]*(?:<(?!IDOC BEGIN)[^<]*?)*:
- [^<]* - a negated character class matching 0 or more chars other than <, as many as possible
- (?:<(?!IDOC BEGIN)[^<]*?)* - 0 or more repetitions of:
  - <(?!IDOC BEGIN) - a < char that is not immediately followed with the IDOC BEGIN string
  - [^<]*? - a negated character class matching 0 or more chars other than <, as few as possible
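The unrolled pattern can be checked against the question's sample with a short Python sketch. This is an illustration under assumptions: the data is rebuilt from the example as one line, and Python's re module stands in for grep -P (the syntax used by this pattern is supported by both).

```python
import re

# One record template from the question; the whole input is a single line.
record = "<IDOC BEGIN><tag1>aaa</tag1><tag2>bbb</tag2><tag3>{}</tag3><tag4>ccc</tag4></IDOC>"
ids = ("12345", "23456", "0007537181", "34567")
data = "<ORDERS>" + "".join(record.format(n) for n in ids) + "</ORDERS>"

# Unrolled tempered greedy token: runs of non-< chars are consumed in bulk,
# and each < is accepted only if it does not start another <IDOC BEGIN.
unrolled = re.compile(
    r"<IDOC BEGIN[^<]*(?:<(?!IDOC BEGIN)[^<]*?)*0007537181.*?</IDOC>"
)

# The pattern has no capturing groups, so findall returns whole matches.
matches = unrolled.findall(data)
print(matches)
```

With grep, the same pattern would be passed to grep -oP in place of the original one, so the second grep in the workaround should no longer be needed.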
Answered By - Wiktor Stribiżew Answer Checked By - Gilberto Lyons (WPSolving Admin)