Issue
I have MyInitialTextFile.txt, in which every line looks like this: <p><nsup></nsup> <b>Abc 1:2<sup>varied text
- every line starts with: <p><nsup></nsup> <b>
- this is followed by a reference such as Abc 1:2 or 2Ab 1:2
- which is always followed by: <sup>
- and then by varied text.
<p><nsup></nsup> <b>Abc 1:2<sup>varied text
<p><nsup></nsup> <b>Abc 1:2<sup>varied text
<p><nsup></nsup> <b>Abc 1:3<sup>varied text
<p><nsup></nsup> <b>Abc 1:4<sup>varied text
<p><nsup></nsup> <b>Abc 1:4<sup>varied text
<p><nsup></nsup> <b>Abc 1:4<sup>varied text
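A quick sanity check for this layout, as a minimal sketch: the sample file and the pattern below are assumptions based on the description above (fixed prefix, a reference such as Abc 1:2 or 2Ab 1:2, then <sup>), and the script counts the lines that do not follow it.

```shell
#!/bin/sh
# Hypothetical sample file; its contents are assumed for illustration only.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<p><nsup></nsup> <b>Abc 1:2<sup>varied text
<p><nsup></nsup> <b>2Ab 1:2<sup>varied text
this line does not follow the layout
EOF

# Assumed pattern: fixed prefix, an alphanumeric book code, chapter:verse, then <sup>.
pattern='^<p><nsup></nsup> <b>[[:alnum:]]+ [0-9]+:[0-9]+<sup>'

# -E extended regex, -v invert the match, -c count the selected lines.
# grep exits non-zero when the count is 0, hence the "|| true".
bad=$(grep -Evc "$pattern" "$sample" || true)
echo "non-conforming lines: $bad"
```

If the count is not zero, the grouping by reference may not behave as expected on those lines.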
I need to:
- Select the leading line(s) of MyInitialTextFile.txt that start with the same reference (in my case the first two lines) and transfer them to TransitionalTextFile.txt. For this I used head in bash:
head -n 2 MyInitialTextFile.txt > TransitionalTextFile.txt
- Apply two regex find-and-replace passes to them manually. For the regex I used:
Find1: (\n) # find the line feed (Enter on the keyboard)
Replace1: "     " # replace it with 5 spaces
Find2: (.*) # select the entire string
Replace2: $1\n # replace with the whole match and add a line feed at the end
- Append the content of TransitionalTextFile.txt to a text file named after the reference found in the first line, Abc 1:2. For this I used:
head -n 1 TransitionalTextFile.txt >> 'Abc 1:2.txt'
This is always -n 1 because after the regex step all the text is a single line, even if two lines were selected initially.
- Delete from MyInitialTextFile.txt the lines that I transferred (two in my case). For this I used sed in bash:
sed -i '1,2d' MyInitialTextFile.txt
The process then continues with the next line:
<p><nsup></nsup> <b>Abc 1:3<sup>varied text
I got all four steps working manually, but my problem is how to combine them into one script. That is, select the lines from an initial file and transfer them to another file via regex, removing the line feed between them and adding a line feed at the end, so that the result looks like this:
<p><nsup></nsup> <b>Abc 1:2<sup>varied text <p><nsup></nsup> <b>Abc 1:2<sup>varied text
Finally, I have to delete these two lines from my initial file. I would appreciate any help combining these four steps into one script. Thank you.
Solution
Like this (taking one for the team :)? Using awk (note: it creates files named like Abc 1:2, or whatever is between <b> and <sup>):
$ awk '
BEGIN {
    FS="<sup>"                  # split each line at this delimiter
}
{
    if($1==p) {                 # first part equals first part of the previous line
        b=b " " $0              # append the line to the output buffer
    }
    else {                      # first part differs: a new group starts
        if(NR>1) {              # nothing is buffered before the first line
            print b >> t[n]     # write the finished group to its file
            # close(t[n])       # uncomment if needed
        }
        n=split($1,t,/<b>/)     # t[n] is the changing part, e.g. Abc 1:2
        b=$0                    # reset the buffer
    }
    p=$1                        # remember the first part for the next line
}
END {
    if(NR>0)                    # skip when the input is empty
        print b >> t[n]         # flush the rest of the buffer
}' file
Output of cat Abc\ 1\:2:
<p><nsup></nsup> <b>Abc 1:2<sup>varied text <p><nsup></nsup> <b>Abc 1:2<sup>varied text
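To try this end to end, here is a minimal reproducible sketch. The scratch directory and the sample contents (first, second, ...) are assumptions made up for illustration; the awk program is the same grouping logic as above, condensed.

```shell
#!/bin/sh
# Work in a scratch directory so the generated files are easy to inspect.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample input modeled on the question (contents are assumed).
cat > MyInitialTextFile.txt <<'EOF'
<p><nsup></nsup> <b>Abc 1:2<sup>first
<p><nsup></nsup> <b>Abc 1:2<sup>second
<p><nsup></nsup> <b>Abc 1:3<sup>third
<p><nsup></nsup> <b>Abc 1:4<sup>fourth
<p><nsup></nsup> <b>Abc 1:4<sup>fifth
EOF

awk '
BEGIN { FS = "<sup>" }                    # $1 is everything before <sup>
{
    if ($1 == p) {
        b = b " " $0                      # same reference: join with a space
    } else {
        if (NR > 1) print b >> t[n]       # write the previous group
        n = split($1, t, /<b>/)           # t[n] is the reference, e.g. Abc 1:2
        b = $0
    }
    p = $1
}
END { if (NR > 0) print b >> t[n] }' MyInitialTextFile.txt

# List each generated file with its line count.
for f in Abc*; do
    printf '%s: %s line(s)\n' "$f" "$(wc -l < "$f" | tr -d ' ')"
done
```

With this input, three files are created (Abc 1:2, Abc 1:3 and Abc 1:4), each holding one joined line. Note that the output files are opened in append mode, so running the script twice in the same directory adds a second line to each file.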
Depending on the awk flavor used, if you start running out of file descriptors, add a close(t[n]) after each print >>.
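A sketch of that variant, under a few assumptions that are not part of the answer itself: emit() is a hypothetical helper name, the .txt suffix follows the file naming used in the question, and emptying the input file after a successful run stands in for the question's per-group sed -i deletion.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Assumed sample input, for illustration only.
printf '%s\n' \
  '<p><nsup></nsup> <b>Abc 1:2<sup>one' \
  '<p><nsup></nsup> <b>Abc 1:2<sup>two' \
  '<p><nsup></nsup> <b>2Ab 1:3<sup>three' > MyInitialTextFile.txt

if awk '
# emit() is a hypothetical helper: append one joined group to "<reference>.txt"
# and close the file at once so open descriptors do not accumulate.
function emit(name, text,    out) {
    out = name ".txt"
    print text >> out
    close(out)
}
BEGIN { FS = "<sup>" }
$1 == p { b = b " " $0; next }            # same reference: extend the buffer
{
    if (NR > 1) emit(t[n], b)             # a new reference: write the previous group
    n = split($1, t, /<b>/)
    b = $0
    p = $1
}
END { if (NR > 0) emit(t[n], b) }' MyInitialTextFile.txt
then
    # Assumption: all lines were transferred, so the input can be emptied.
    : > MyInitialTextFile.txt
fi
```

Closing after every write costs a reopen per group, which should only matter for very large numbers of groups; the input is emptied only when awk exits successfully.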
Answered By - James Brown Answer Checked By - Mary Flores (WPSolving Volunteer)