Issue
I have a long list of URLs stored in a text file which I will go through and download. Before doing this I want to remove the duplicate URLs from the list. One thing to note is that some of the URLs look different but in fact lead to the same page. The unique elements in the URL (aside from the domain and path) are the first two parameters in the query string. For example, my text file would look like this:
https://www.example.com/page1.html?id=12345&key=dnks93jd&user=399494&group=23
https://www.example.com/page1.html?id=15645&key=fkldf032&user=250643&group=12
https://www.example.com/page1.html?id=26327&key=xkd9c03n&user=399494&group=15
https://www.example.com/page1.html?id=12345&key=dnks93jd&user=454665&group=12
If a unique URL is defined up to the second query string parameter (key), then lines 1 and 4 are duplicates. I would like to completely remove the duplicate lines, not even keeping one of them. In the example above, lines 2 and 3 would remain and lines 1 and 4 would be deleted.
How can I achieve this using basic command line tools?
Solution
To shorten the code from the other answer:
awk -F\& 'FNR == NR { url[$1,$2]++; next } url[$1,$2] == 1' urls.txt urls.txt
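This makes two passes over the same file: the first pass counts how many times each id/key combination occurs, and the second pass prints only the lines whose combination was seen exactly once, so every copy of a duplicate is dropped. A more verbose, commented sketch of the same idea (assuming the input file is still called urls.txt):

awk -F'&' '
    # Pass 1 (FNR == NR, i.e. first read of the file): count each
    # combination of $1 (domain, path and id parameter) and $2 (key
    # parameter), then move on to the next line.
    FNR == NR { url[$1,$2]++; next }
    # Pass 2: print only lines whose id/key combination occurred once.
    url[$1,$2] == 1
' urls.txt urls.txt

Run against the sample file above, this should leave only lines 2 and 3.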
Answered By - Romeo Ninov