Tuesday, January 4, 2022

[SOLVED] Get content between html tags using grep

Issue

I have an HTML file that I am trying to extract data from. The website is https://www.tv2.no/nyheter. I am trying to get all the news articles from the website.

I run this command: wget -O news.html https://www.tv2.no/nyheter

This creates a local file for me.

Then I try to get all the articles that have the class article--nyheter by running this command:

tr '\n' ' ' < news.html | grep -E "^<article class="article-nyheter">.*$"

but I did not get any results. The HTML structure is like this:

<body>
<div>
    <article class="article column large-4 small-12">
        hello
    </article>
</div>

<article class="article column large-4 small-12 article--nyheter">

    <a class="article__link" href="/nyheter/14336304/">


        <figure class="image image__responsive" style="padding-bottom:51.312%;">

            <img class="image__img lazyload" itemprop="image" title="" alt=""
                src="data:image/gif;base64,R0lGODlhEAAJAIAAAP///wAAACH5BAEAAAAALAAAAAAQAAkAAAIKhI+py+0Po5yUFQA7"
                data-src="https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=177"
                data-srcset="https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=688&height=354&compression=92 2x,https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=516&height=265.5&compression=92 1.5x,https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=177&compression=92 1x">


        </figure>


        <div class="article__content">


            <h2 class="article__title t27 tm26">IEA: Mulig å nå 2-gradersmålet om løftene fra Glasgow holdes</h2>



        </div>
    </a>



</article>

<article class="article column large-4 small-12 article--nyheter">

    <a class="article__link" href="/nyheter/14336420/">


        <figure class="image image__responsive" style="padding-bottom:115.452%;">

            <img class="image__img lazyload" itemprop="image" title="" alt=""
                src="data:image/gif;base64,R0lGODlhEAAJAIAAAP///wAAACH5BAEAAAAALAAAAAAQAAkAAAIKhI+py+0Po5yUFQA7"
                data-src="https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=398"
                data-srcset="https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=688&height=796&compression=92 2x,https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=516&height=597&compression=92 1.5x,https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=398&compression=92 1x">


        </figure>


        <div class="article__content">


            <h2 class="article__title t26 tm20">Italienske jegere stoppet på vei ut av landet med 2.027 nedfryste
                troster</h2>



        </div>
    </a>



</article>

Sample output, as both of the articles below contain the class name article--nyheter:

<article class="article column large-4 small-12 article--nyheter">

    <a class="article__link" href="/nyheter/14336420/">


        <figure class="image image__responsive" style="padding-bottom:115.452%;">

            <img class="image__img lazyload" itemprop="image" title="" alt=""
                src="data:image/gif;base64,R0lGODlhEAAJAIAAAP///wAAACH5BAEAAAAALAAAAAAQAAkAAAIKhI+py+0Po5yUFQA7"
                data-src="https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=398"
                data-srcset="https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=688&height=796&compression=92 2x,https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=516&height=597&compression=92 1.5x,https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=398&compression=92 1x">


        </figure>


        <div class="article__content">


            <h2 class="article__title t26 tm20">Italienske jegere stoppet på vei ut av landet med 2.027 nedfryste
                troster</h2>



        </div>
    </a>



</article>


<article class="article column large-4 small-12 article--nyheter">

    <a class="article__link" href="/nyheter/14336304/">


        <figure class="image image__responsive" style="padding-bottom:51.312%;">

            <img class="image__img lazyload" itemprop="image" title="" alt=""
                src="data:image/gif;base64,R0lGODlhEAAJAIAAAP///wAAACH5BAEAAAAALAAAAAAQAAkAAAIKhI+py+0Po5yUFQA7"
                data-src="https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=177"
                data-srcset="https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=688&height=354&compression=92 2x,https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=516&height=265.5&compression=92 1.5x,https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=177&compression=92 1x">


        </figure>


        <div class="article__content">


            <h2 class="article__title t27 tm26">IEA: Mulig å nå 2-gradersmålet om løftene fra Glasgow holdes</h2>



        </div>
    </a>



</article>

I have to use grep, sed, curl, and awk for this; I cannot use any other parser.

So my expected output is every article tag that has a specific class. I want everything inside those article tags.


Solution

Assumptions:

  • there is some valid reason why an HTML-centric tool is not being used to parse out the desired sections
  • the input is formatted as in the question; otherwise the proposed sed solution will likely not work correctly
  • the goal is to extract the <article> ... </article> pairs where the article class entry contains the string article--nyheter
  • OP's expected output has the two article--nyheter sections listed in reverse order; for now I'm going to assume that was some sort of typo and that there is no requirement to sort the two sections

One sed idea using ranges to extract the desired data:

sed -n '/<article class.*article--nyheter/,/<\/article>/p' news.html

This generates:

<article class="article column large-4 small-12 article--nyheter">

    <a class="article__link" href="/nyheter/14336304/">


        <figure class="image image__responsive" style="padding-bottom:51.312%;">

            <img class="image__img lazyload" itemprop="image" title="" alt=""
                src="data:image/gif;base64,R0lGODlhEAAJAIAAAP///wAAACH5BAEAAAAALAAAAAAQAAkAAAIKhI+py+0Po5yUFQA7"
                data-src="https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=177"
                data-srcset="https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=688&height=354&compression=92 2x,https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=516&height=265.5&compression=92 1.5x,https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=177&compression=92 1x">


        </figure>


        <div class="article__content">


            <h2 class="article__title t27 tm26">IEA: Mulig å nå 2-gradersmålet om løftene fra Glasgow holdes</h2>



        </div>
    </a>



</article>
<article class="article column large-4 small-12 article--nyheter">

    <a class="article__link" href="/nyheter/14336420/">


        <figure class="image image__responsive" style="padding-bottom:115.452%;">

            <img class="image__img lazyload" itemprop="image" title="" alt=""
                src="data:image/gif;base64,R0lGODlhEAAJAIAAAP///wAAACH5BAEAAAAALAAAAAAQAAkAAAIKhI+py+0Po5yUFQA7"
                data-src="https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=398"
                data-srcset="https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=688&height=796&compression=92 2x,https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=516&height=597&compression=92 1.5x,https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=398&compression=92 1x">


        </figure>


        <div class="article__content">


            <h2 class="article__title t26 tm20">Italienske jegere stoppet på vei ut av landet med 2.027 nedfryste
                troster</h2>



        </div>
    </a>



</article>

If the input data is not formatted as presented in the question (eg, carriage returns/linefeeds are missing) then this sed solution likely will not work; a more 'robust' parser would need to be built (eg, via awk) ...
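As a rough illustration of what such an awk-based approach could look like, here is a minimal sketch that does not depend on where the line breaks fall. It reads the whole file into one string and then walks it with index() and substr(). The function name extract_articles and the file sample.html are made up for this example; the sketch assumes the class attribute uses double quotes, that <article> elements are not nested, and that the opening tag contains no ">" inside an attribute value. It is still plain string matching, not a real HTML parser.

```shell
# Hypothetical helper (not from the original answer): print every
# <article>...</article> block whose opening tag has a class attribute
# containing the string "article--nyheter".
extract_articles() {
    awk '
    { buf = buf $0 "\n" }                      # slurp the whole file into buf
    END {
        close_tag = "</article>"
        while ((s = index(buf, "<article")) > 0) {
            buf = substr(buf, s)               # drop everything before the tag
            e = index(buf, close_tag)
            if (e == 0) break                  # unterminated block: give up
            block = substr(buf, 1, e + length(close_tag) - 1)
            buf = substr(buf, e + length(close_tag))
            tag = substr(block, 1, index(block, ">"))   # opening tag only
            if (tag ~ /class="[^"]*article--nyheter/) print block
        }
    }' "$1"
}

# Made-up sample input with everything on a single line, the case where the
# sed range approach breaks down.
printf '%s' '<div><article class="article column">hello</article></div><article class="article column article--nyheter"><a href="/nyheter/1/">A</a></article><article class="x article--nyheter">B</article>' > sample.html

extract_articles sample.html
```

With the sample above, only the two article--nyheter blocks are printed, one per line; the plain article inside the div is skipped. Running extract_articles news.html would apply the same logic to the downloaded page.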



Answered By - markp-fuso