Issue
I have some markdown files to process that contain links to images I wish to download. For example, one markdown file:
[![](https://imgs.xkcd.com/comics/git.png)](https://imgs.xkcd.com/comics/git.png)
a lot of text
some more text...
[![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhRoBwfdvhORbMBfBmMnk1DHa_MzAhu2sQqmrqkr2p55dwik-b-fj7ceOlMnAND3T1YTC8Ke_XWFZGog9fqsF_HC-oCXvh78ngMowryHWDrBTNuYtu0d8Rd796-fn3RSGBj9tjdhnn4qrM/s320/take_a_break_git.gif)](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhRoBwfdvhORbMBfBmMnk1DHa_MzAhu2sQqmrqkr2p55dwik-b-fj7ceOlMnAND3T1YTC8Ke_XWFZGog9fqsF_HC-oCXvh78ngMowryHWDrBTNuYtu0d8Rd796-fn3RSGBj9tjdhnn4qrM/s1600/take_a_break_git.gif)
some more text
another URL but not image
[https://github.com]
so on
I am trying to parse this file and extract the list of image URLs, which I can later pass to the wget command to download.
So far I have used grep and sed and have got these results:
$ sed -nE "/https?:\/\/[^ ]+.(jpg|png|gif)/p" $path
[![](https://imgs.xkcd.com/comics/git.png)](https://imgs.xkcd.com/comics/git.png)
[![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhRoBwfdvhORbMBfBmMnk1DHa_MzAhu2sQqmrqkr2p55dwik-b-fj7ceOlMnAND3T1YTC8Ke_XWFZGog9fqsF_HC-oCXvh78ngMowryHWDrBTNuYtu0d8Rd796-fn3RSGBj9tjdhnn4qrM/s320/take_a_break_git.gif)](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhRoBwfdvhORbMBfBmMnk1DHa_MzAhu2sQqmrqkr2p55dwik-b-fj7ceOlMnAND3T1YTC8Ke_XWFZGog9fqsF_HC-oCXvh78ngMowryHWDrBTNuYtu0d8Rd796-fn3RSGBj9tjdhnn4qrM/s1600/take_a_break_git.gif)
$ grep -Eo "https?://[^ ]+.(jpg|png|gif)" $path
https://imgs.xkcd.com/comics/git.png)](https://imgs.xkcd.com/comics/git.png
https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhRoBwfdvhORbMBfBmMnk1DHa_MzAhu2sQqmrqkr2p55dwik-b-fj7ceOlMnAND3T1YTC8Ke_XWFZGog9fqsF_HC-oCXvh78ngMowryHWDrBTNuYtu0d8Rd796-fn3RSGBj9tjdhnn4qrM/s320/take_a_break_git.gif)](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhRoBwfdvhORbMBfBmMnk1DHa_MzAhu2sQqmrqkr2p55dwik-b-fj7ceOlMnAND3T1YTC8Ke_XWFZGog9fqsF_HC-oCXvh78ngMowryHWDrBTNuYtu0d8Rd796-fn3RSGBj9tjdhnn4qrM/s1600/take_a_break_git.gif
The regex is essentially working fine, but because the same URL is present twice on the same line, the text selected runs from the first occurrence of https to the last occurrence of jpg|png|gif. I want it to run from the first occurrence of https to the first occurrence of jpg|png|gif.
How can I fix this?
P.S. I have also tried lynx -dump -image_links -listonly $path, but this prints the entire file.
I am also open to other options that serve the purpose, as long as I can hook the code up in my current shell script.
Solution
You may add square brackets to the negated bracket expression, so that a match cannot run past the ] and [ that separate the two URLs on a line:
grep -Eo "https?://[^][ ]+\.(jpg|png|gif)"
See the online demo. Details:
https?:// - http:// or https://
[^][ ]+ - one or more chars other than ], [ and space
\. - a dot
(jpg|png|gif) - any one of the three alternative substrings.
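As a minimal sketch of how this could be hooked into a shell script: the function name extract_image_urls and the temporary sample file below are hypothetical and used only for illustration. Because each image appears twice in the [![](url)](url) form, the pattern matches every URL twice, so the sketch pipes the matches through sort -u to drop the duplicates. It assumes a grep that supports -o (GNU and BSD grep do).

```shell
#!/bin/sh
# Hypothetical helper: print each unique image URL found in a markdown file.
extract_image_urls() {
    # Excluding ']' and '[' keeps a match from running past the markdown
    # delimiters; sort -u removes the repeated URL from [![](url)](url).
    grep -Eo 'https?://[^][ ]+\.(jpg|png|gif)' "$1" | sort -u
}

# Small sample file for illustration only.
sample=$(mktemp)
cat > "$sample" <<'EOF'
[![](https://imgs.xkcd.com/comics/git.png)](https://imgs.xkcd.com/comics/git.png)
a lot of text
[https://github.com]
EOF

extract_image_urls "$sample"
# prints: https://imgs.xkcd.com/comics/git.png
```

The resulting list can then be handed to wget, which reads URLs from standard input when given -i -, for example: extract_image_urls "$path" | wget -i -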
Answered By - Wiktor Stribiżew