Issue
I have an HTML file containing a large number of <section>...</section> blocks.
The file has the following format.
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<section>
<div>
<header><h2>This is a title (RfQVthHm)</h2></header>
More HTML codes...
</div>
</section>
<section>
<div>
<header><h2>This is a title (UaHaZWvm)</h2></header>
More HTML codes...
</div>
</section>
<section>
<div>
<header><h2>This is a title (vxzbXEGq)</h2></header>
More HTML codes...
</div>
</section>
</body>
</html>
I need to extract the second <section>...</section> block, tags included.
This is the expected output.
<section>
<div>
<header><h2>This is a title (UaHaZWvm)</h2></header>
More HTML codes...
</div>
</section>
I noticed that I could search for the string UaHaZWvm first, start two lines before it (at the opening <section>), and continue until I reach the next </section>.
OP's efforts (mentioned in comments): grep -o "hi.*bye" file
Can this be done with awk, sed, or grep, please?
Solution
With the samples shown, please try the following. It was written and tested in GNU awk and should work in any awk.
awk '
/^<\/section>/{
  if(found1==2 && found2==1){
    print val ORS $0
    exit
  }
  found2++
}
/<section>/{
  found1++
}
found1==2{
  val=(val?val ORS:"")$0
}
' Input_file
Explanation: a detailed, line-by-line explanation of the program above.
awk '                            ##Start of the awk program.
/^<\/section>/{                  ##Check whether the line starts with </section>.
  if(found1==2 && found2==1){    ##If found1 is 2 AND found2 is 1, this line closes the 2nd section.
    print val ORS $0             ##Print val plus the current line, which has not been added to val yet.
    exit                         ##Exit the program, no further lines are needed.
  }
  found2++                       ##Otherwise increase found2 (count of closing tags seen) by 1.
}
/<section>/{                     ##Check whether the line contains <section>.
  found1++                       ##Increase found1 (count of opening tags seen) by 1.
}
found1==2{                       ##While found1 is 2, the line belongs to the 2nd section.
  val=(val?val ORS:"")$0         ##Append the current line to val, separated by ORS (a newline).
}
' Input_file                     ##Name of the input file.
Answered By - RavinderSingh13
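Not part of the answer above: a rough sketch of two shorter variants, in case they are useful. Both assume, as in the sample, that every <section> and </section> tag sits on its own line and that sections are not nested. The names n, buf, inside, id and the temporary file are only illustrative. Variant 1 picks the section by position; variant 2 follows the asker's idea of picking it by the ID in its title.

```shell
# Recreate the sample from the question in a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
<!DOCTYPE html>
<html>
<body>
<section>
<div>
<header><h2>This is a title (RfQVthHm)</h2></header>
More HTML codes...
</div>
</section>
<section>
<div>
<header><h2>This is a title (UaHaZWvm)</h2></header>
More HTML codes...
</div>
</section>
<section>
<div>
<header><h2>This is a title (vxzbXEGq)</h2></header>
More HTML codes...
</div>
</section>
</body>
</html>
EOF

# Variant 1: select by position (the 2nd section).
by_count=$(awk '
/<section>/             { n++ }    # count opening tags
n == 2                             # print every line while inside section 2
n == 2 && /<\/section>/ { exit }   # stop right after its closing tag
' "$input")

# Variant 2: select by ID. Buffer each section and print the one that
# contains the ID, so the lines before the title are not lost.
by_id=$(awk -v id="UaHaZWvm" '
/<section>/             { buf = ""; inside = 1 }    # start a fresh buffer
inside                  { buf = buf $0 ORS }        # collect the section
inside && /<\/section>/ {                           # section is complete
    if (index(buf, id)) { printf "%s", buf; exit }  # print it if it holds the ID
    inside = 0
}
' "$input")

printf '%s\n' "$by_count"
rm -f "$input"
```

On the sample, both variants should produce the same six lines, from <section> to </section>. Variant 2 is probably the safer choice when the position of the section can change but its ID is known. For HTML that is not laid out one tag per line, a real HTML parser is likely a better fit than awk.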