Issue
I want to remove XML comments in bash using regex tools (awk, sed, grep, ...). I have looked at other questions about this, but they are all missing something. Here's my XML:
<Table>
<!--
to be removed bla bla bla bla bla bl............
removeee
to be removeddddd
-->
<row>
<column name="example" value="1" ></column>
</row>
</Table>
I'm comparing two XML files, but I don't want the comparison to take the comments into account. I do this:
diff file1.xml file2.xml | sed '/<!--/,/-->/d'
but that only removes the line that starts with <!-- and the last line. It does not remove the lines in between.
Solution
In the end, you're going to have to recommend to your client/friend/instructor that they install some kind of XML processor. xmlstarlet is a good command-line tool, but there are any number (or at least some number greater than 2) of XSLT implementations which can be compiled for any standard Unix, and in most cases also for Windows. You really cannot do much XML processing with regex-based tools, and whatever you do will be hard to read, harder to maintain, and likely to fail on corner cases, sometimes with disastrous consequences.
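As a minimal sketch of the XML-processor route, assuming xmlstarlet is installed and on your PATH (some distributions install the binary as xml instead), the following deletes every comment node by XPath. The function name strip_comments_xmlstarlet is only illustrative.

```shell
# Assumes xmlstarlet is on PATH; the function name is hypothetical.
strip_comments_xmlstarlet() {
  # "ed" is xmlstarlet's edit mode; -d deletes every node matched by the
  # XPath expression, and //comment() matches all comment nodes.
  xmlstarlet ed -d '//comment()' "$1"
}
```

For example, strip_comments_xmlstarlet file1.xml > file1.nocomments.xml. The output may not be byte-for-byte identical to the input outside the comments (an XML declaration may be added, for instance), so run both files through the same command before comparing them.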
I haven't spent a lot of time polishing or reviewing the following little awk program. I think it will remove comments from compliant XML documents. Note that the following comment is not compliant:
<!-- XML comments cannot include -- so this comment is illegal -->
and it will not be treated correctly by my script.
The following is also illegal, but since I've seen it in the wild and it wasn't hard to deal with, I did so:
<!-------------- This comment is ill-formed but... -------------->
Here it is. No guarantees. I know that it's hard to read, and I wouldn't want to maintain it. It may well fail on arbitrary corner cases.
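awk '# Inside a multi-line comment and this line closes it: drop text up to and including -->
in_comment&&/-->/{sub(/([^-]|-[^-])*--+>/,"");in_comment=0}
# Still inside a comment: skip the whole line
in_comment{next}
# Remove complete comments on this line, then detect one that opens but does not close here
{gsub(/<!--+([^-]|-[^-])*--+>/,"");
in_comment=sub(/<!--+.*/,"");
print}'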
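To use this for the comparison in the question, one option is to strip the comments from each file first and diff the results, rather than filtering diff's output. The sketch below wraps the awk program in a function and does that with temporary files. The names strip_xml_comments and compare_xml_without_comments are made up for illustration, and -B is assumed to be supported by your diff (it is in GNU diff); it ignores the empty lines that the awk program leaves where a multi-line comment used to be.

```shell
# Hypothetical helper: runs the awk program from this answer on one file.
strip_xml_comments() {
  awk 'in_comment&&/-->/{sub(/([^-]|-[^-])*--+>/,"");in_comment=0}
in_comment{next}
{gsub(/<!--+([^-]|-[^-])*--+>/,"");
in_comment=sub(/<!--+.*/,"");
print}' "$1"
}

# Hypothetical helper: diff two files after removing comments from both.
# The body runs in a subshell so the temporary variables do not leak.
compare_xml_without_comments() (
  tmp1=$(mktemp) || exit 2
  tmp2=$(mktemp) || { rm -f "$tmp1"; exit 2; }
  strip_xml_comments "$1" >"$tmp1"
  strip_xml_comments "$2" >"$tmp2"
  # -B ignores changes that consist only of blank lines.
  diff -B "$tmp1" "$tmp2"
  status=$?
  rm -f "$tmp1" "$tmp2"
  exit "$status"
)
```

Call it as compare_xml_without_comments file1.xml file2.xml; the exit status is diff's own. In bash, diff -B <(strip_xml_comments file1.xml) <(strip_xml_comments file2.xml) does the same without temporary files.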
Answered By - rici Answer Checked By - Mary Flores (WPSolving Volunteer)