Issue
I have the following badly formatted text:
<h1 id="page-title">ABCD TEXT TEXT ( QQQ-10-123-01)</h1>
<h1 id="page-title">ABCD TEXT TEXT (QQQ-10-123-02)</h1>
<h1 id="page-title">ABCD TEXT TEXT (QQQ-10-123-03 (QWERTY))</h1>
and need to get from it:
QQQ-10-123-01
QQQ-10-123-02
QQQ-10-123-03 (QWERTY)
That is, I need only the text between the first "(" and the last ")". At the moment I am doing the following:
sed -n "s/.*<h1 id=\"page-title\">.*(\(.*\))<\/h1>.*/\1/p" ./file.txt
and get:
QQQ-10-123-01
QQQ-10-123-02
QWERTY)
As you can see, only the second line is processed properly, because it is the cleanest of the three. The problems are ignoring possible whitespace after the opening "(" and handling nested "(" and ")". Can somebody point me in the right direction?
P.S. I need to parse over 2k lines; would there be a big difference in performance between sed and awk? From what I have read, sed should have a slight speed advantage. Is that really so?
Solution
Using sed
$ sed 's/[^(]*([[:space:]]\?\([^)]*)\?\)).*/\1/' input_file
QQQ-10-123-01
QQQ-10-123-02
QQQ-10-123-03 (QWERTY)
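Why this works where the original command did not: the original's `.*` before `(` is greedy and runs on to the last "(" in the line, while `[^(]*` has to stop at the first one. The sketch below reproduces both behaviours on the sample lines. The helper names `sample`, `extract_original` and `extract_fixed` are made up for illustration, and since `\?` is a GNU extension in basic regular expressions, the sketch assumes GNU sed.

```shell
#!/bin/sh
# Sample lines from the question (hypothetical helper, for illustration only).
sample() {
cat <<'EOF'
<h1 id="page-title">ABCD TEXT TEXT ( QQQ-10-123-01)</h1>
<h1 id="page-title">ABCD TEXT TEXT (QQQ-10-123-02)</h1>
<h1 id="page-title">ABCD TEXT TEXT (QQQ-10-123-03 (QWERTY))</h1>
EOF
}

# Original attempt: the greedy ".*" before "(" skips ahead to the LAST "(".
extract_original() {
    sed -n "s/.*<h1 id=\"page-title\">.*(\(.*\))<\/h1>.*/\1/p"
}

# Pattern from the answer (GNU sed, because of "\?"):
#   [^(]*            everything before the first "("
#   (                the first "(" itself
#   [[:space:]]\?    at most one whitespace character after it
#   \([^)]*)\?\)     captured: text up to the next ")", plus that ")"
#                    when it is immediately followed by another ")"
#   ).*              the closing ")" and the rest of the line
extract_fixed() {
    sed 's/[^(]*([[:space:]]\?\([^)]*)\?\)).*/\1/'
}

sample | extract_original
sample | extract_fixed
```

Note that the pattern only covers a nested pair that closes right before the outer ")", as in the third sample line. A line such as `(A (B) C)` would, as far as I can tell, be cut at the first ")".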
$ sed -E 's/[^(]*\([[:space:]]?([^)]*\)?)\).*/\1/' input_file
QQQ-10-123-01
QQQ-10-123-02
QQQ-10-123-03 (QWERTY)
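On the P.S. about sed versus awk: the answer does not measure this, so the following is only a sketch of how one might check it. It builds a hypothetical input of about 2k lines from the three sample lines, runs the `-E` command from above and an awk equivalent, and confirms that both give the same result. The awk function is an assumption of mine and not part of the answer; `make_input`, `extract_sed` and `extract_awk` are illustrative names.

```shell
#!/bin/sh
# Build a hypothetical 2001-line input by repeating the three sample lines.
make_input() {
    i=0
    while [ "$i" -lt 667 ]; do
        printf '%s\n' \
            '<h1 id="page-title">ABCD TEXT TEXT ( QQQ-10-123-01)</h1>' \
            '<h1 id="page-title">ABCD TEXT TEXT (QQQ-10-123-02)</h1>' \
            '<h1 id="page-title">ABCD TEXT TEXT (QQQ-10-123-03 (QWERTY))</h1>'
        i=$((i + 1))
    done
}

# The -E command from the answer, reading standard input.
extract_sed() {
    sed -E 's/[^(]*\([[:space:]]?([^)]*\)?)\).*/\1/'
}

# awk sketch (not from the answer): keep the text from the first "("
# up to the last ")" and print only lines that contain a "(".
extract_awk() {
    awk '{
        start = index($0, "(")        # position of the first "("
        if (start == 0) next          # no "(" on this line: skip it
        rest = substr($0, start + 1)  # text after the first "("
        sub(/^[ \t]+/, "", rest)      # drop whitespace after the "("
        sub(/\)[^)]*$/, "", rest)     # drop the last ")" and what follows
        print rest
    }'
}

sed_out=$(make_input | extract_sed)
awk_out=$(make_input | extract_awk)

if [ "$sed_out" = "$awk_out" ]; then
    echo "sed and awk agree"
else
    echo "sed and awk differ"
fi
```

To compare speed in bash, something like `time extract_sed < file.txt` and `time extract_awk < file.txt` gives a rough figure. At around 2k lines the difference between the two tools is unlikely to matter in practice.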
Answered By - HatLess
Answer Checked By - Robin (WPSolving Admin)