Issue
I have a problem. I want to extract two parts of this HTML into variables with the sed or grep command. How can I extract both of them?
test.html:
<html>
<body>
<div id="foo" class="foo">
Some Text.
<p id="author" class="author">
<br>
<a href="example.com">bar</a>
</p>
</div>
</body>
</html>
script.sh:
#!/bin/bash
author=$(sed 's/.*<p id="author" class="author"><br><a href="*">\(.*\)<\/a><\/p>.*/\1/p' test.html)
quote=$(sed 's/.*<div id="foo" class="foo">\(.*\)<\/div>.*/\1/p' test.html)
In the end I want only the text in the variables, without the HTML tags. But my script doesn't work.
Solution
The code:
text="$(sed 's:^ *::g' < test.html | tr -d \\n)"
author=$(sed 's:.*<p id="author" class="author"><br><a href="[^"]*">\([^<]*\)<.*:\1:' <<<"$text")
quote=$(sed 's:.*<div id="foo" class="foo">\([^<]*\)<.*:\1:' <<<"$text")
echo "'$author' '$quote'"
How it works:
- $text is assigned an unindented single-line representation of test.html; note that : is used as the delimiter for sed instead of /, since any character other than a backslash or newline can serve as a delimiter, and the text we are parsing contains slashes, so we don't have to escape them with backslashes when constructing the regex.
- $author is assumed to be between <p id="author" class="author"><br><a href="[^"]*"> (where [^"]* means «any characters except ", repeated N times, N ∈ [0, +∞)») and any tag that comes next.
- $quote is assumed to be between <div id="foo" class="foo"> and any tag that comes next.
- The rather obscure construct <<<"$text" is the so-called here-string, which is almost equivalent to echo "$text" | placed at the beginning.
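The points above can be checked with a small self-contained sketch. It assumes Bash (the here-string is not available in plain sh) and recreates the sample test.html in a temporary directory. The variable names pattern_colon, pattern_slash, author_colon, author_slash, and author_pipe are made up for this illustration and are not part of the original answer; the shortened author pattern is also only a demonstration variant.

```shell
#!/bin/bash
# Recreate the sample input in a temporary directory (setup for this demo only).
tmpdir=$(mktemp -d)
cat > "$tmpdir/test.html" <<'EOF'
<html>
<body>
<div id="foo" class="foo">
Some Text.
<p id="author" class="author">
<br>
<a href="example.com">bar</a>
</p>
</div>
</body>
</html>
EOF

# Strip leading spaces and join all lines, so one regex can span the whole document.
text="$(sed 's:^ *::' < "$tmpdir/test.html" | tr -d '\n')"

# The same substitution with ':' and with '/' as the delimiter.
# With '/', the literal slash in </a> has to be escaped as \/.
pattern_colon='s:.*<a href="[^"]*">\([^<]*\)</a>.*:\1:'
pattern_slash='s/.*<a href="[^"]*">\([^<]*\)<\/a>.*/\1/'

author_colon=$(sed "$pattern_colon" <<<"$text")      # here-string input
author_slash=$(sed "$pattern_slash" <<<"$text")      # same result, more escaping
author_pipe=$(echo "$text" | sed "$pattern_colon")   # pipe instead of here-string

quote=$(sed 's:.*<div id="foo" class="foo">\([^<]*\)<.*:\1:' <<<"$text")

echo "$author_colon|$author_slash|$author_pipe|$quote"
# prints: bar|bar|bar|Some Text.

rm -r "$tmpdir"
```

For this input all three ways of extracting the author give the same value, which suggests the delimiter and the here-string are conveniences rather than requirements. For anything beyond a fixed, known snippet like this one, a real HTML parser is usually the safer choice.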
Answered By - hidefromkgb