Issue
I know this is easy with awk, but for this type of problem I am often stuck with sed. I have run into this trap several times and have not found a solution anywhere yet.
The sample:
<!-- comment #1 --><p>useful text</p> <!-- comment #2 -->more useful text
How can I eliminate the comments with sed?
Solutions like this one
cat file.html | sed -e :a -re 's/<!--.*?-->//g;/<!--/N;//ba'
handle multiple lines quite well (so I have excluded that part of the problem), but fall into the trap of greedy regex matching. None of the solutions I found handles the case of two comment blocks on one line.
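The greedy behavior is easy to reproduce on the sample line. sed's regular expressions have no lazy quantifier, so a ? after .* does not make the match stop at the first -->. The sketch below uses the plain greedy form; the variable names are only for illustration.

```shell
# The sample line from the question.
sample='<!-- comment #1 --><p>useful text</p> <!-- comment #2 -->more useful text'

# ".*" is greedy: the match runs from the first "<!--" to the LAST "-->" on the
# line, so the useful text between the two comments is deleted as well.
greedy=$(printf '%s\n' "$sample" | sed -E 's/<!--.*-->//g')
printf '%s\n' "$greedy"
```

Only "more useful text" survives, which is exactly the trap described above.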
My idea for a solution looks like this, but it doesn't work:
sed -re 's/<!--[^(-->)]*-->//g' in.html > out.html
But all my efforts to negate the subexpression (-->) have failed.
I would appreciate a general solution for this type of issue, but I am also curious whether there is a way to negate a subexpression in sed (hence the subject line).
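For context on why the attempt above fails: inside [...] the parentheses do not group anything. [^(-->)] is a bracket expression that matches one character not in a listed set, and that set includes the hyphen. POSIX regular expressions, which sed uses, have no operator for negating a subexpression, so the exclusion has to be spelled out by hand. A small sketch, with a made-up input line and LC_ALL=C set only to keep the range inside the brackets predictable:

```shell
# The hyphen is one of the excluded characters, so the bracket expression
# cannot get past "multi-line" and the comment is never matched.
line='<!-- multi-line note -->kept text'
unchanged=$(printf '%s\n' "$line" | LC_ALL=C sed -E 's/<!--[^(-->)]*-->//g')
printf '%s\n' "$unchanged"
```

The line comes back unmodified, comment included.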
Used version: sed (GNU sed) 4.7
Solution
This might work for you (GNU sed):
sed -E 'H;1h;$!d;x;s/<!--([^>-]+|(-?>+)+|(-+[^->]+))*(-?>+)*--+>//g' file
Slurp the whole file into memory.
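The slurping is done by the H;1h;$!d;x prefix. A minimal sketch of what that prefix does on its own, using a throwaway three-line input and a final s command that only makes the embedded newlines visible:

```shell
# H    append the current line to the hold space (preceded by a newline)
# 1h   on line 1, overwrite the hold space instead, avoiding a leading newline
# $!d  on every line but the last, delete the pattern space and start the next cycle
# x    on the last line, swap the accumulated text into the pattern space
joined=$(printf 'a\nb\nc\n' | sed 'H;1h;$!d;x;s/\n/|/g')
printf '%s\n' "$joined"
```

This prints a|b|c: by the time the substitution runs, the pattern space holds the whole input as one string.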
If a string begins with <!--, followed by zero or more repetitions of any of three alternatives (one or more characters that are neither > nor -; an optional - followed by one or more >'s, repeated one or more times; or one or more -'s followed by one or more characters that are neither - nor >), then by zero or more repetitions of an optional - followed by one or more >'s, and finally by two or more -'s and a >, remove that string. The substitution is applied globally throughout the file.
N.B. This assumes the file is well formed.
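As a quick check, the command can be run on the sample line from the question plus a comment that spans two lines. The wrapper name strip_comments and the extra input lines are assumptions made for this sketch.

```shell
# Hypothetical wrapper around the sed command shown above (GNU sed).
strip_comments() {
  sed -E 'H;1h;$!d;x;s/<!--([^>-]+|(-?>+)+|(-+[^->]+))*(-?>+)*--+>//g'
}

result=$(printf '%s\n' \
  '<!-- comment #1 --><p>useful text</p> <!-- comment #2 -->more useful text' \
  '<!-- a multi-line' \
  '     comment --><p>kept</p>' | strip_comments)
printf '%s\n' "$result"
```

Both comments on the first line are removed separately, and the two-line comment is removed because the whole file is matched as one string.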
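Kudos to Renaud Pacalet for a more compact approach: a single alternation that refuses to step over a closing -->. One way to write it:
sed -E 'H;1h;$!d;x;s/<!--([^-]|-[^-]|--+[^>-])*--+>//g' file
Here a dash is consumed only when it cannot belong to a closing -->: a lone dash must be followed by a non-dash, and a run of two or more dashes by a character other than - or >. Avoid giving > an alternative of its own (such as [^-]?>), since the group would then match any character and the pattern would be greedy again. Like the first command, this handles the edge case <!-->-->.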
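The edge case and a few similar ones can be checked against the first, longer command with a small harness. The helper names strip_comments and check and the test inputs are made up for this sketch.

```shell
# Hypothetical wrapper around the first sed command from this answer (GNU sed).
strip_comments() {
  sed -E 'H;1h;$!d;x;s/<!--([^>-]+|(-?>+)+|(-+[^->]+))*(-?>+)*--+>//g'
}

# check INPUT EXPECTED: report whether stripping INPUT yields EXPECTED.
check() {
  actual=$(printf '%s\n' "$1" | strip_comments)
  if [ "$actual" = "$2" ]; then
    echo "ok:   $1"
  else
    echo "FAIL: $1 -> $actual"
  fi
}

check 'a<!-->-->b' 'ab'                    # comment body is a single ">"
check 'a<!-- x-y --z -->b' 'ab'            # dashes inside the comment
check 'a<!-- 1 > 0 -->b<!-- 2 -->c' 'abc'  # ">" inside, two comments on one line
check 'a<!-- x ---->b' 'ab'                # extra dashes before the closing ">"
```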
Answered By - potong