Issue
I'm trying to download the RSS feed from "https://www.straitstimes.com/news/singapore/rss.xml" with the following Python script:
import requests
r = requests.get('https://www.straitstimes.com/news/singapore/rss.xml')
for k, v in r.headers.items():
    print("{}: {}".format(k, v))
print(r.content)
When I run this, I get the following response:
Cache-Control: max-age=0, no-cache, no-store
Content-Type: text/html
Date: Wed, 13 Dec 2023 03:06:00 GMT
Expires: Wed, 13 Dec 2023 03:05:59 GMT
Referrer-Policy: no-referrer-when-downgrade
Server: ECD (sgc/56B1)
Set-Cookie: sph_user_country=SG;Path=/;
X-EC-Security-Audit: 403
x-vmg-version: v10.5.70
Content-Length: 345
b'<?xml version="1.0" encoding="iso-8859-1"?>\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\n\t<head>\n\t\t<title>403 - Forbidden</title>\n\t</head>\n\t<body>\n\t\t<h1>403 - Forbidden</h1>\n\t</body>\n</html>\n'
When I fetch the same URL with curl using the command below (forcing HTTP/1.1 and removing the User-Agent and Accept headers from the request), I get the XML just fine. What am I doing wrong with requests?
curl https://www.straitstimes.com/news/singapore/rss.xml -v --http1.1 -H 'User-Agent:' -H 'Accept:'
Solution
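By default, requests sends its own User-Agent (python-requests/<version>) and Accept (*/*) headers, and the server appears to reject those with a 403. Try overriding both with empty values, mirroring the curl command: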
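import requests

# Override the default User-Agent and Accept headers with empty values
headers = {
    'User-Agent': '',
    'Accept': ''
}
url = 'https://www.straitstimes.com/news/singapore/rss.xml'
r = requests.get(url, headers=headers)
print(r.status_code)
# Only print the feed if the server accepted the request
if r.status_code == 200:
    print(r.text)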
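To check what requests would actually put on the wire, you can prepare a request without sending it and compare the default headers with the overridden ones. This is only a minimal sketch: `prepared_headers` is a hypothetical helper name, not part of requests, and it shows only what requests prepares, not how the server will respond.

```python
import requests


def prepared_headers(url, overrides=None):
    """Return the headers requests would prepare for a GET, without sending it."""
    session = requests.Session()
    request = requests.Request("GET", url, headers=overrides)
    # prepare_request merges the session's default headers with the overrides.
    return dict(session.prepare_request(request).headers)


url = "https://www.straitstimes.com/news/singapore/rss.xml"

defaults = prepared_headers(url)
blanked = prepared_headers(url, {"User-Agent": "", "Accept": ""})

# Print the two headers side by side; no network traffic happens here.
for name in ("User-Agent", "Accept"):
    print("{}: default={!r} overridden={!r}".format(
        name, defaults.get(name), blanked.get(name)))
```

If the 403 persists, comparing this output with the request headers shown by curl -v may help narrow down which header the server objects to.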
Answered By - Mahboob Nur
Answer Checked By - Marie Seifert (WPSolving Admin)