Issue
I want to download all of the files linked in this section of an HTML page:
<td><a class="xm" name="item_1" type="dd" href="/data/24765/dd">Item 1</a></td>
<td><a class="xm" name="item_2" type="dd" href="/data/12345/dd">Item 2</a></td>
<td><a class="xm" name="item_3" type="dd" href="/data/75239/dd">Item 3</a></td>
The download link for the first file is https://foo.bar/data/24765/dd, and since it's a zip file, I'd like to unzip it as well.
My script is this:
#!/bin/bash
curl -s "https://foo.bar/path/to/page" > data.html
gawk 'match($0, /href="\/(data\/[0-9]{5}\/dd)"/, m){print m[1]}' data.html > data.txt
for f in $(cat data.txt); do
    curl -s "https://foo.bar/$f" > data.zip
    unzip data.zip
done
Is there a more elegant way to write this script? I'd like to avoid saving the html, txt and zip files.
Solution
The bsdtar command can unzip archives from stdin, allowing you to do this:
curl -s "https://foo.bar/$f" | bsdtar -xf -
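To see the stdin extraction in isolation, here is a minimal local round trip, assuming bsdtar is installed (it ships with libarchive; package libarchive-tools on Debian/Ubuntu). It creates a zip on stdout and extracts it from stdin, the same way the download loop does:

```shell
# Create a zip archive on stdout, then extract it from stdin,
# without ever writing the .zip file to disk.
src=$(mktemp -d)
dst=$(mktemp -d)
echo "hello" > "$src/file.txt"
( cd "$src" && bsdtar -cf - --format zip file.txt ) | bsdtar -xf - -C "$dst"
cat "$dst/file.txt"    # prints "hello"
```

This works because bsdtar, unlike the unzip utility, can read the archive as a stream rather than requiring a seekable file.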
And of course you can pipe the first curl command directly into awk:
curl -s "https://foo.bar/path/to/page" |
gawk 'match($0, /href="\/(data\/[0-9]{5}\/dd)"/, m){print m[1]}' > data.txt
And in fact you might as well just pipe the output of that pipeline directly into a loop:
curl -s "https://foo.bar/path/to/page" |
gawk 'match($0, /href="\/(data\/[0-9]{5}\/dd)"/, m){print m[1]}' |
while read -r archive; do
    curl -s "https://foo.bar/$archive" | bsdtar -xf -
done
Answered By - larsks Answer Checked By - Clifford M. (WPSolving Volunteer)