Issue
I have a file with the following text:
<div>
<b>a:</b> <a class='a' href='/a/1'>a1</a><br>
<b>b:</b> <a class='b' href='/b/2'>b2</a><br>
<b>c:</b> <a class='c' href='/c/3/'>c3</a><br>
<b>d:</b> "ef"<br><br><div class='start'>123
<br>ghij.
<br>klmn
<br><br><b>end</b>
</div>
</div>
I want to do the following:
- Whenever a line starts with <b>a:</b> <a class='a', I want to copy the text between the > symbol after <a class='a' and </a> (it must be stored in a[1]);
- Similarly, whenever a line starts with <b>b:</b> <a class='b', I want to copy the text between the > symbol after <a class='b' and </a> (it must be stored in b[1]);
- Whenever a line contains <div class='start'>, I want to create the variable t whose value starts with the text that occurs between <div class='start'> and the end of this line, then set flag to 1;
- If the value of flag is already 1 and the current line does not start with <br><br><b>end</b>, I want to append the current line to the current value of the variable t (using the space symbol as separator);
- If the value of flag is already 1 and the current line starts with <br><br><b>end</b>, I want to concatenate the three current values of a[1], b[1] and t (using ; as separator) and print the result to the output file, then set flag to 0, then clear the variable t.
I used the following code (for gawk 4.0.1):
gawk 'BEGIN {flag = 0; t = ""; }
{
    if ($0 ~ /^<b>a:<\/b> <a class=\x27a\x27/ ) {
        match($0, /^<b>a:<\/b> <a class=\x27a\x27 href=\x27\/a\/[0-9]{1,}\x27>(.*)<\/a>/, a) };
    if ($0 ~ /^<b>b:<\/b> <a class=\x27b\x27/ ) {
        match($0, /^<b>b:<\/b> <a class=\x27b\x27 href=\x27\/b\/[0-9]{1,}\x27>(.*)<\/a>/, b) };
    if ($0 ~ /<div class=\x27start\x27>/ ) {
        match($0, /^.*<div class=\x27start\x27>(.*)$/, s);
        t = s[1];
        flag = 1 };
    if (flag == 1) {
        if ($0 ~ /^<br><br><b>end<\/b>/) {
            str = a[1] ";" b[1] ";" t;
            print(str) > "output.txt";
            flag = 0; str = ""; t = "" }
        else {
            t = t " " $0 }
    }
}' input.txt
I was expecting the following output:
a1;b2;123 <br>ghij. <br>klmn
But the output is:
;;123 <b>d:</b> "ef"<br><br><div class='start'>123 <br>ghij. <br>klmn
Why are a[1] and b[1] empty? Why does <b>d:</b> "ef"<br><br><div class='start'> occur in the output? How can I fix the code to obtain the expected output?
Solution
Here are the answers to your specific questions:
Q) Why are a[1] and b[1] empty?
A) They aren't when I try your script with gawk 5.1.1, so most likely either there's a bug in your awk version, or some of the white space in your input isn't blanks as your script requires (maybe it's tabs), or you have some control chars, or your awk version doesn't like using \x27 instead of \047 for 's. The last one is a likely culprit: the gawk manual notes that versions before 4.2 kept consuming hex digits after \x, so \x27a and \x27b could be read as one escape each rather than as ' followed by a or b.
Q) Why does <b>d:</b> "ef"<br><br><div class='start'> occur in the output?
A) Because you forgot a next in the block that matches on div, so the next block is also executing and saving $0 from the div line.
Q) How to fix the code to obtain the expected output?
A) Here's how I'd approach your problem, using GNU awk for the 3rd arg to match() and \s shorthand for [[:space:]]:
$ cat tst.sh
#!/usr/bin/env bash
gawk '
    BEGIN { OFS=";" }
    match($0, /^<b>(.):<\/b>\s+<a\s+class=\047.\047\s+href=\047\/.\/[0-9]+\/?\047>(.*)<\/a>/, arr) {
        vals[arr[1]] = arr[2]
    }
    match($0, /^.*<div\s+class=\047start\047>(.*)/, arr) {
        vals["div"] = arr[1]
        inDiv = 1
        next
    }
    inDiv {
        if ( /^<br><br><b>end<\/b>/ ) {
            print vals["a"], vals["b"], vals["div"]
            delete vals
            inDiv = 0
        }
        else {
            vals["div"] = vals["div"] " " $0
        }
    }
' 'input.txt' > 'output.txt'
$ ./tst.sh
$ cat output.txt
a1;b2;123 <br>ghij. <br>klmn
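To see the effect of the missing next in isolation, here is a minimal sketch. It is not part of the script above: the join_block function, its with_next switch, and the START/END markers are made up for illustration, and it uses only POSIX awk features.

```shell
#!/usr/bin/env bash
# Hypothetical helper (illustration only): collects the lines between a
# START line and an END line, joined with blanks.
# Pass 1 to run `next` after the START rule, or 0 to fall through.
join_block() {
    awk -v with_next="$1" '
        /START/ {
            buf = "first"               # seed the buffer on the opening line
            inBlock = 1
            if (with_next == 1) next    # skip the remaining rules for this line
        }
        inBlock && /END/ { print buf; inBlock = 0; next }
        inBlock { buf = buf " " $0 }    # also runs on the START line without next
    '
}

printf '%s\n' 'x START' 'middle' 'END' | join_block 1   # prints: first middle
printf '%s\n' 'x START' 'middle' 'END' | join_block 0   # prints: first x START middle
```

Without next, the opening line itself reaches the appending rule, which is the same way the <div class='start'> line ended up in t in the original script.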
So, in this script:
- I'm using a single match() to capture all values for lines that look like your a, b, c lines, for consistency, conciseness, and maintainability.
- I'm always saving the match results in an array named arr rather than in different arrays per occurrence, so I don't have to remember to keep deleting those arrays, and the code that uses the matches can all be homogenized.
- I'm using a single associative array vals[] to hold all values, indexed by the letter after <b>, so we don't need to test those letters and create separate variables; it's easy to clear the data by just deleting the array rather than having to set multiple variables to null, and it's easy to add the c or any other similar values to the output later if desired.
- I'm using \s+ instead of a single blank char for every space in the input, to be agnostic about the actual space char(s) and number of spaces used.
- I'm using \047 instead of \x27 to match 's for portability and robustness, see http://awk.freeshell.org/PrintASingleQuote.
- I'm letting the shell handle all input/output rather than including output redirection in the awk script, for consistency and improved robustness in error scenarios like files that can't be opened.
- I named my flag variable inDiv rather than flag so it tells us what it means, i.e. that we're in the div block of the input, for improved clarity and ease of future maintenance. Naming a flag variable flag is like naming a numeric variable number instead of sum, count, ave, tot, diff or something else meaningful that'd improve your script. When you see people use f for the name of a flag variable, that f is shorthand for found, not for flag.
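As a small illustration of the \047 point, the sketch below matches single quotes with the octal escape inside a single-quoted awk script. The link_text_a function is a made-up name, and it deliberately sticks to POSIX awk (sub() rather than the 3-argument match()), so it should behave the same in awks other than gawk.

```shell
#!/usr/bin/env bash
# Hypothetical helper (illustration only): prints the link text of
# anchors whose class attribute is 'a'.
link_text_a() {
    awk '
        /class=\047a\047/ {        # \047 is the octal escape for a single quote
            sub(/^.*\047>/, "")    # drop everything up to the quote-plus-> that ends the tag
            sub(/<\/a>.*$/, "")    # drop the closing tag and whatever follows it
            print
        }
    '
}

printf '%s\n' \
    "<b>a:</b> <a class='a' href='/a/1'>a1</a><br>" \
    "<b>b:</b> <a class='b' href='/b/2'>b2</a><br>" | link_text_a   # prints: a1
```

Because an octal escape is limited to three digits, the a that follows \047 cannot be absorbed into the escape.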
Answered By - Ed Morton Answer Checked By - Gilberto Lyons (WPSolving Admin)