Issue
I'm trying to scrape some data from a web page that allows data scraping in its robots.txt file.
To get the data I want, I looked at the API requests the web page sends while loading and identified the ones of interest.
If I copy the request URL as seen in Brave's Network tab and paste it into a new tab, I get JSON data identical to what the web page gets when loading. However, if I copy that request as a cURL command, with the same headers, cookies etc., and execute it in my terminal, I get a Cloudflare HTML page that mentions captcha-bypass and shows a "Checking your browser..." message.
I tried exporting cookies from my browser to a file and then using them with cURL, but it doesn't work. I also tried comparing HTTP requests sent at different times, in case something like a timestamp gets added to the request, but they are the same. One more thing: when sending the API request from the browser, I don't get any Captcha challenge to solve, I just get the JSON back.
I would like to know the mechanism by which the server determines that I'm not using a browser just from the HTTP request.
UPDATE: I tried sending the request with Tor, and in this case I get back the same page as with the cURL request. After some time, the Captcha gets verified on its own and the JSON data gets loaded as in a regular browser.
Solution
As some of the comments suggested, the reason cURL didn't show the same output is that the server first serves an HTML page in which some JavaScript code is automatically executed. After the code has run, the actual data gets requested and shown. cURL does not execute JavaScript, so it never gets past that first page.
I'm not sure why this can't be seen in the Network tab in dev tools, nor why it isn't indicated anywhere in the browser while the response is being fetched.
I figured it out by trying to send the API request via Tor. That was the only browser that showed something was going on before the redirect request was sent.
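A rough way to check from a script which of the two responses came back is to look at the content type and at the text of the body. The sketch below is only a heuristic: the function name `classifyResponse` is made up for illustration, and the marker strings are simply the ones I saw in the page cURL returned, so another site (or another Cloudflare version) may use different wording.

```javascript
// Marker strings taken from the challenge page described in the question.
// Assumption: they are good enough to recognise that page; they are not an
// official or complete list.
const CHALLENGE_MARKERS = ["Checking your browser", "captcha-bypass"];

// Classifies a response as "json", "challenge", or "unknown" based on its
// Content-Type header value and its body text.
function classifyResponse(contentType, body) {
  const type = (contentType || "").toLowerCase();
  const text = body || "";

  if (type.includes("application/json")) {
    return "json";
  }
  // The interstitial is an HTML page, so only look for markers in HTML bodies.
  if (type.includes("text/html") && CHALLENGE_MARKERS.some((m) => text.includes(m))) {
    return "challenge";
  }
  return "unknown";
}
```

If a request made outside the browser is classified as "challenge", the JavaScript step has not run yet, and replaying headers or cookies alone is unlikely to change that.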
In the end I automated this with puppeteer, as suggested by skyez.
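For reference, a minimal sketch of that kind of automation is shown here. It is not my exact script: `API_URL`, `extractJson` and `fetchJsonThroughBrowser` are hypothetical names, the URL is a placeholder for the request copied from the Network tab, and it assumes the browser ends up displaying the raw JSON as the page text once the check has passed. A headless browser may still fail the check on some sites.

```javascript
// Placeholder; replace with the API request URL copied from the Network tab.
const API_URL = "https://example.com/api/data";

// Returns the parsed value if `text` is a JSON object or array, otherwise null.
// The HTML challenge page does not parse as JSON, so it yields null.
function extractJson(text) {
  try {
    const data = JSON.parse(text);
    return typeof data === "object" && data !== null ? data : null;
  } catch {
    return null;
  }
}

// Opens the URL in a real browser, lets the page's JavaScript run, and polls
// the page text until it parses as JSON or the timeout expires.
async function fetchJsonThroughBrowser(url, timeoutMs = 30000) {
  // Required lazily so extractJson can be used without launching a browser.
  const puppeteer = require("puppeteer");
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "domcontentloaded", timeout: timeoutMs });

    const deadline = Date.now() + timeoutMs;
    while (Date.now() < deadline) {
      try {
        const text = await page.evaluate(() => document.body.innerText);
        const data = extractJson(text);
        if (data !== null) {
          return data;
        }
      } catch {
        // The page may be navigating (the redirect after the check); retry.
      }
      await new Promise((resolve) => setTimeout(resolve, 1000));
    }
    throw new Error("Timed out waiting for the JSON response");
  } finally {
    await browser.close();
  }
}
```

Calling `fetchJsonThroughBrowser(API_URL).then(console.log)` would then print the parsed data once the check has passed. Polling the page text is a simple choice that avoids depending on how exactly the site performs the redirect.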
Answered By - honknoodle Answer Checked By - Pedro (WPSolving Volunteer)