Sunday, March 13, 2022

[SOLVED] Can head, sed and regex work together in one bash script?

Issue

I have MyInitialTextFile.txt, whose lines all have this structure: <p><nsup></nsup> <b>Abc 1:2<sup>varied text.

  • every line starts with: <p><nsup></nsup> <b>
  • followed by an expression like Abc 1:2 or 2Ab 1:2
  • always followed by: <sup>
  • followed by varied text.

<p><nsup></nsup> <b>Abc 1:2<sup>varied text

<p><nsup></nsup> <b>Abc 1:2<sup>varied text

<p><nsup></nsup> <b>Abc 1:3<sup>varied text

<p><nsup></nsup> <b>Abc 1:4<sup>varied text

<p><nsup></nsup> <b>Abc 1:4<sup>varied text

<p><nsup></nsup> <b>Abc 1:4<sup>varied text

I need to:

  1. Select the first line(s) of MyInitialTextFile.txt that start the same way (in my case, the first two lines) and transfer them into TransitionalTextFile.txt. For this I used head in bash:

    head -n 2 MyInitialTextFile.txt > TransitionalTextFile.txt

  2. Apply two regex find-and-replace operations to them manually. For the regex I used:

Find1: (\n) #that is, find the Line Feed (Enter on the keyboard)

Replace1: "     " #that is, replace with 5 spaces

Find2: (.*) #that is, select the entire string

Replace2: $1\n #that is, replace with everything selected (the entire string) and add a Line Feed at the end.

  3. Append the content of TransitionalTextFile.txt to the end of a new text file named after the expression found in the first line, Abc 1:2. For this I used:

    head -n 1 TransitionalTextFile.txt >> 'Abc 1:2.txt'

This is always -n 1 because, after the regex step, all the text is on one line, even if two lines were selected initially.

  4. Delete from MyInitialTextFile.txt the lines that I transferred (two in my case). For this I used sed in bash:

    sed -i '1,2d' MyInitialTextFile.txt

The process then continues with the next line: <p><nsup></nsup> <b>Abc 1:3<sup>varied text

I got all four steps above to work manually, but my problem is how to combine them into one script: select the lines from the initial file and transfer them to another file via regex, removing the line feed between them and adding one at the end, so that the result looks like this:

<p><nsup></nsup> <b>Abc 1:2<sup>varied text     <p><nsup></nsup> <b>Abc 1:2<sup>varied text

Finally, I have to delete these two lines from my initial file. I would appreciate any help combining these four steps into one script. Thank you.
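For illustration, here is a minimal sketch of how the four manual steps might be chained literally in one bash script. It assumes GNU sed (for `sed -i`, as used above), an input file without blank lines, and that the text between <b> and <sup> is usable as a file name. The function name `merge_groups`, its output-directory argument, and the `merge_demo` directory are hypothetical names chosen for this example; a pipe stands in for TransitionalTextFile.txt.

```shell
#!/usr/bin/env bash
set -eu

# merge_groups SRC OUTDIR  (hypothetical helper, not from the original post)
# Repeats steps 1-4 until SRC is empty. SRC is modified in place.
merge_groups() {
    local src=$1 outdir=$2 first key count
    while [ -s "$src" ]; do
        # File name: the text between <b> and <sup> on the first line.
        first=$(head -n 1 "$src")
        key=${first%%'<sup>'*}      # drop <sup> and everything after it
        key=${key##*'<b>'}          # drop everything up to and including <b>

        # Step 1: count the leading lines whose text before <sup> matches line 1.
        count=$(awk -F'<sup>' '
            NR == 1 { k = $1 }
            $1 != k { exit }
                    { n++ }
            END     { print n + 0 }' "$src")

        # Steps 2-3: join those lines with five spaces, end with one newline,
        # and append the result to "<key>.txt".
        head -n "$count" "$src" |
            awk 'NR > 1 { printf "     " } { printf "%s", $0 } END { print "" }' \
            >> "$outdir/$key.txt"

        # Step 4: delete the transferred lines from the source (GNU sed assumed).
        sed -i "1,${count}d" "$src"
    done
}

# Demo on a throwaway copy of the sample data (merge_demo is a hypothetical name).
demo=merge_demo
rm -rf "$demo"
mkdir "$demo"
printf '%s\n' \
    '<p><nsup></nsup> <b>Abc 1:2<sup>varied text' \
    '<p><nsup></nsup> <b>Abc 1:2<sup>varied text' \
    '<p><nsup></nsup> <b>Abc 1:3<sup>varied text' \
    '<p><nsup></nsup> <b>Abc 1:4<sup>varied text' \
    > "$demo/MyInitialTextFile.txt"
merge_groups "$demo/MyInitialTextFile.txt" "$demo"
cat "$demo/Abc 1:2.txt"
```

Because every round rewrites the source file with `sed -i`, this approach rereads the file once per group; it mirrors the manual procedure rather than aiming for speed.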


Solution

Like this (taking one for the team :)? Using awk (note: it creates files named Abc 1:2, or whatever is between <b> and <sup>):

$ awk '
BEGIN {
    FS="<sup>"                 # split at this delimiter
}
{
    if($1==p) {                # if first part equals first part of previous split
        b=b "     " $0         # append to the output buffer
    }
    else {                     # if first part differs, do stuff
        if(NR>1) {             # nothing is buffered before the first line
            print b >> t[n]
            # close(t[n])      # uncomment if needed
        }
        n=split($1,t,/<b>/)    # get the changing part
        b=$0                   # reset buffer
    }
    p=$1                       # create previous to compare on next round
}
END {
    print b >> t[n]            # flush the rest of the buffer
}' file

Output of cat Abc\ 1\:2:

<p><nsup></nsup> <b>Abc 1:2<sup>varied text     <p><nsup></nsup> <b>Abc 1:2<sup>varied text

Depending on the awk flavor used, if you start running out of file descriptors, add a close(t[n]) after each print >>.
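As a sketch of that suggestion, the variant below closes each output file as soon as its group has been written. It also makes two changes that are assumptions on my part, not part of the answer: it appends a ".txt" suffix to match the file name used in the question, and it skips the final flush when the input is empty. The variable names and the `awk_demo` directory are hypothetical. Like the script above, it only reads the input; it does not delete anything from it.

```shell
#!/usr/bin/env bash
set -eu

# Throwaway sample data (awk_demo is a hypothetical directory name).
demo=awk_demo
rm -rf "$demo"
mkdir "$demo"
printf '%s\n' \
    '<p><nsup></nsup> <b>Abc 1:2<sup>varied text' \
    '<p><nsup></nsup> <b>Abc 1:2<sup>varied text' \
    '<p><nsup></nsup> <b>Abc 1:3<sup>varied text' \
    > "$demo/file"

awk -F'<sup>' -v dir="$demo" '
NR == 1 || $1 != prev {             # first line, or the part before <sup> changed
    if (NR > 1) {
        print buf >> out            # flush the finished group
        close(out)                  # release the file descriptor right away
    }
    n = split($1, t, /<b>/)         # the text after <b> names the output file
    out = dir "/" t[n] ".txt"       # ".txt" suffix is an assumption (see above)
    buf = $0                        # start a new buffer
    prev = $1
    next
}
{ buf = buf "     " $0 }            # same group: append after five spaces
END {
    if (NR > 0) {                   # nothing to flush for empty input
        print buf >> out
        close(out)
    }
}' "$demo/file"

cat "$demo/Abc 1:2.txt"
```

Since the files are opened with >>, running the script twice appends each group a second time; remove the old output files first if that is not wanted.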



Answered By - James Brown
Answer Checked By - Mary Flores (WPSolving Volunteer)