Wednesday, November 16, 2022

[SOLVED] How to extract a fragment of text from file?

November 16, 2022 awk, shell

Issue

I have a file with the following text:

<div>
<b>a:</b> <a class='a' href='/a/1'>a1</a><br>
<b>b:</b> <a class='b' href='/b/2'>b2</a><br>
<b>c:</b> <a class='c' href='/c/3/'>c3</a><br>
<b>d:</b> "ef"<br><br><div class='start'>123
<br>ghij.
<br>klmn
<br><br><b>end</b>
</div>
</div>

I want to do the following:

Whenever a line starts with <b>a:</b> <a class='a', I want to copy the text between the > that closes that <a class='a' ...> tag and </a>, and store it in a[1];
Similarly, whenever a line starts with <b>b:</b> <a class='b', I want to copy the text between the > that closes that <a class='b' ...> tag and </a>, and store it in b[1];
Whenever a line contains <div class='start'>, I want to create the variable t whose value starts with the text that occurs between <div class='start'> and the end of this line, then set flag to 1;
If the value of flag is already 1 and the current line does not start with <br><br><b>end</b>, I want to append the current line to the current value of the variable t (using a space as the separator);
If the value of flag is already 1 and the current line starts with <br><br><b>end</b>, I want to concatenate the current values of a[1], b[1] and t (using ; as the separator) and print the result to the output file, then set flag to 0, then clear the variable t.

I used the following code (for gawk 4.0.1):

gawk 'BEGIN {flag = 0; t = ""; } 
{  
if ($0 ~ /^<b>a:<\/b> <a class=\x27a\x27/ ) {
    match($0, /^<b>a:<\/b> <a class=\x27a\x27 href=\x27\/a\/[0-9]{1,}\x27>(.*)<\/a>/, a) };
if ($0 ~ /^<b>b:<\/b> <a class=\x27b\x27/ ) {
    match($0, /^<b>b:<\/b> <a class=\x27b\x27 href=\x27\/b\/[0-9]{1,}\x27>(.*)<\/a>/, b) };
if ($0 ~ /<div class=\x27start\x27>/ ) {
    match($0, /^.*<div class=\x27start\x27>(.*)$/, s);
    t = s[1];
    flag = 1 };
if (flag == 1) {
    if ($0 ~ /^<br><br><b>end<\/b>/) {
        str = a[1] ";" b[1] ";" t;
        print(str) > "output.txt";
        flag = 0; str = ""; t = "" } 
    else {
    t = t " " $0 }
}
}' input.txt

I was expecting the following output:

a1;b2;123 <br>ghij. <br>klmn

But the output is:

;;123 <b>d:</b> "ef"<br><br><div class='start'>123 <br>ghij. <br>klmn

Why are a[1] and b[1] empty? Why does <b>d:</b> "ef"<br><br><div class='start'> occur in the output? How can I fix the code to obtain the expected output?

Solution

Here are the answers to your specific questions:

Q) Why are a[1] and b[1] empty?

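A) They aren't empty when I run your script with gawk 5.1.1, so most likely one of the following applies: there's a bug in your awk version; some of the white space in your input isn't blanks as your script requires (maybe it's tabs); your input contains control chars; or your awk version doesn't like \x27 instead of \047 for 's. The last one would fit your output: per the gawk manual, versions before 4.2 kept consuming hex digits after \x, so \x27a and \x27b can be read as one long escape rather than as ' followed by a or b, while \x27s in the start test is unaffected because s isn't a hex digit.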
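One way to check the white space and control-character possibilities is to dump a suspect line character by character. This is a minimal sketch, assuming a hypothetical sample line in which a tab sits where the script expects a blank; od -c is a POSIX tool that shows tabs as \t and carriage returns as \r.

```shell
# Hypothetical sample line: a tab (not a blank) follows </b>.
# \047 in the printf format is the octal escape for a single quote.
sample=$(printf '<b>a:</b>\t<a class=\047a\047 href=\047/a/1\047>a1</a>')

# od -c prints each character in its own column, showing the tab as \t.
dump=$(printf '%s\n' "$sample" | od -c)
printf '%s\n' "$dump"
```

To inspect the real file instead, od -c input.txt | head does the same job; a \r before each \n would indicate DOS line endings.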

Q) Why does <b>d:</b> "ef"<br><br><div class='start'> occur in the output?

A) Because you forgot a next statement in the block that matches the div line, so the block after it also executes for that same line and appends its $0 to t.
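The effect of the missing next can be reproduced on just two lines. This is a small sketch using plain awk rather than your full script; the variable names t and inDiv and the two-line sample are chosen for illustration only.

```shell
# Two lines shaped like the div block in the question.
input=$(printf '%s\n' "<div class='start'>123" "<br>ghij.")

# Without next: the start line also falls through to the append rule.
without_next=$(printf '%s\n' "$input" | awk '
    /<div class=\047start\047>/ { t = $0; sub(/^.*<div class=\047start\047>/, "", t); inDiv = 1 }
    inDiv                        { t = t " " $0 }
    END                          { print t }
')

# With next: awk skips the remaining rules for the start line.
with_next=$(printf '%s\n' "$input" | awk '
    /<div class=\047start\047>/ { t = $0; sub(/^.*<div class=\047start\047>/, "", t); inDiv = 1; next }
    inDiv                        { t = t " " $0 }
    END                          { print t }
')

printf '%s\n' "$without_next" "$with_next"
```

The first result repeats the whole start line after 123, which is the same symptom as in your output; the second contains only 123 followed by the continuation line.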

Q) How to fix the code to obtain the expected output?

A) Here's how I'd approach your problem, using GNU awk for the 3rd arg to match() and the \s shorthand for [[:space:]]:

$ cat tst.sh
#!/usr/bin/env bash

gawk '
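    BEGIN { OFS=";" }

    # a:, b:, c: lines: store the link text in vals[], keyed by the letter.
    match($0, /^<b>(.):<\/b>\s+<a\s+class=\047.\047\s+href=\047\/.\/[0-9]+\/?\047>(.*)<\/a>/, arr) {
        vals[arr[1]] = arr[2]
    }

    # Start of the div block: keep the rest of the line, then skip the rules below.
    match($0, /^.*<div\s+class=\047start\047>(.*)/, arr) {
        vals["div"] = arr[1]
        inDiv = 1
        next
    }

    # Inside the div block: print and reset at the end marker, else append the line.
    inDiv {
        if ( /^<br><br><b>end<\/b>/ ) {
            print vals["a"], vals["b"], vals["div"]
            delete vals
            inDiv = 0
        }
        else {
            vals["div"] = vals["div"] " " $0
        }
    }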

' 'input.txt' > 'output.txt'

$ ./tst.sh

$ cat output.txt
a1;b2;123 <br>ghij. <br>klmn

I'm using a single match() to capture all values for lines that look like your a, b, c lines for consistency, conciseness, and maintainability.
I'm always saving the match results in an array named arr rather than different arrays per occurrence so I don't have to remember to keep deleting those arrays and the code that uses the matches can all be homogenized.
I'm using a single associative array vals[] to hold all values, indexed by the letter after <b>, so we don't need to test those letters and create separate variables. It's easy to clear the data by just deleting the array rather than having to set multiple variables to null, and it's easy to add the c or any other similar values to the output later if desired.
I'm using \s+ instead of a single blank char for every space in the input to be agnostic about the actual space char(s) and number of spaces used.
I'm using \047 instead of \x27 to match 's for portability and robustness, see http://awk.freeshell.org/PrintASingleQuote.
I'm letting the shell handle all input/output rather than including output redirection in the awk script for consistency and improved robustness in error scenarios like files that can't be opened.
I named my flag variable inDiv rather than flag so it tells us what it means, i.e. that we're in the div block of the input, for improved clarity and ease of future maintenance. Naming a flag variable flag is like naming a numeric variable number instead of sum, count, ave, tot, diff or something else meaningful that'd improve your script. When you see people use f for the name of a flag variable, that f is shorthand for found, not for flag.
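To illustrate the point about indexing by letter and adding c to the output, here is a rough sketch. It is not the gawk script above: it assumes only POSIX awk is available, so it replaces the 3-argument match() with substr() and sub(), and uses [ \t] where the script above uses \s. The names key and val are hypothetical.

```shell
# Sample lines copied from the input shown in the question.
input=$(printf '%s\n' \
    "<b>a:</b> <a class='a' href='/a/1'>a1</a><br>" \
    "<b>b:</b> <a class='b' href='/b/2'>b2</a><br>" \
    "<b>c:</b> <a class='c' href='/c/3/'>c3</a><br>")

result=$(printf '%s\n' "$input" | awk '
    BEGIN { OFS = ";" }

    # Lines shaped like <b>X:</b> <a ...>text</a>, X being one lowercase letter.
    /^<b>[a-z]:<\/b>[ \t]+<a[ \t]/ {
        key = substr($0, 4, 1)      # the letter right after <b>
        val = $0
        sub(/<\/a>.*$/, "", val)    # drop </a> and everything after it
        sub(/^.*>/, "", val)        # drop everything up to the last remaining >
        vals[key] = val
    }

    # Adding c to the output only means listing one more index here.
    END { print vals["a"], vals["b"], vals["c"] }
')

printf '%s\n' "$result"
```

Because every value lives in vals[] under its letter, no new variable or test is needed for c; the same would hold for any further letters.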

Answered By - Ed Morton

Answer Checked By - Gilberto Lyons (WPSolving Admin)

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
