Issue
I want to scrape this website in PHP using cURL.
I use similar web-scraping scripts in PHP, and they work well.
Nevertheless, I get the following error:
Fatal error: Uncaught ValueError: DOMDocument::loadHTML(): Argument #1 ($source) must not be empty
followed by the stack trace:
#0 [...](29): DOMDocument->loadHTML() #1 {main} thrown in [...] on line 29
The error message references line 29, i.e. $doc->loadHTML($html); (the final line in the following code):
<?php
ini_set('display_errors', '1');
ini_set('display_startup_errors', '1');
error_reporting(E_ALL);
$ch = curl_init();
// Set the cURL options
curl_setopt($ch, CURLOPT_URL, "https://link.springer.com/book/10.1007/978-3-031-10453-4");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)');
// Execute the cURL request and fetch the HTML source code
$html = curl_exec($ch);
if (curl_error($ch)) {
    $output = "\n" . curl_error($ch);
    echo $output;
    die();
}
// Close the cURL handle
curl_close($ch);
// Create a new DOMDocument object
$doc = new DOMDocument();
// Load the HTML source code
$doc->loadHTML($html);
I do not think that I am blocked by the website I want to reach -- it is my very first test with that website.
What could be the issue behind the error?
Solution
This seems to be a headers issue. I added a bunch of headers, and it works. I think it particularly needs a Cookie header, but I am not sure. This is what I did:
$ch = curl_init();
$headers = ['Host: link.springer.com',
'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0',
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
'Accept-Language: en-US,en;q=0.5',
'Accept-Encoding: gzip, deflate, br',
'Alt-Used: link.springer.com',
'Connection: keep-alive',
'Cookie: sim-inst-token=""; trackid="zx2x19io4tsuk4vmfxbyuzwmw"; idp_session=sVERSION_14a94b0be-f404-438f-a28a-7fa100f381d4; idp_session_http=hVERSION_1982d3b5c-a7fc-4ee4-a0a5-5cbfbc9f57fc; idp_marker=4b716e5d-4a3f-4bc4-bf74-be595a696d13; user.uuid.v2="2577d0e2-af55-4b19-b21b-60267616d034"; sncc=P%3D17%3AV%3D35.0.0%26C%3DC01',
'Upgrade-Insecure-Requests: 1',
'If-None-Match: "9887e0016a5f403a529a66099f5cc49d"',
'TE: trailers'];
// Set the cURL options
curl_setopt($ch, CURLOPT_URL, "https://link.springer.com/book/10.1007/978-3-031-10453-4");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
// Execute the cURL request and fetch the HTML source code
$html = curl_exec($ch);
More investigation is needed to know exactly what headers are required.
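One way to start that investigation is to look at the HTTP status code next to the body, since curl_error() stays empty when the server answers with a valid but bodiless response. The following is only a sketch: diagnoseResponse() and fetchPage() are hypothetical helper names, not part of the original script, and the status-code interpretations are general HTTP behaviour rather than anything confirmed for this particular site. In particular, a request carrying If-None-Match can legitimately come back as 304 with no body.

```php
<?php
// Hypothetical helper: explains why a cURL response cannot be parsed as HTML.
// Returns null when the body looks usable.
function diagnoseResponse(string|false $body, int $status, string $curlError = ''): ?string
{
    if ($body === false) {
        return "cURL request failed: $curlError";
    }
    if ($status === 304) {
        return 'HTTP 304 Not Modified: no body was sent (check the If-None-Match header)';
    }
    if ($status >= 400) {
        return "HTTP $status: the server rejected the request";
    }
    if ($status >= 300) {
        return "HTTP $status: redirect not followed (consider CURLOPT_FOLLOWLOCATION)";
    }
    if (trim($body) === '') {
        return "HTTP $status with an empty body";
    }
    return null;
}

// Hypothetical helper: fetches a page and returns [body, status, error].
function fetchPage(string $url, array $headers = []): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    $body   = curl_exec($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $error  = curl_error($ch);
    return [$body, $status, $error];
}
```

Calling fetchPage() with the URL and a reduced header list, then passing the result to diagnoseResponse() before $doc->loadHTML(), should show which headers actually change the status code or the body.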
Note that the response is compressed, because the request sends an Accept-Encoding header, which is why it does not look like HTML.
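To get readable HTML, cURL can be asked to handle the compression itself through CURLOPT_ENCODING instead of a hand-written Accept-Encoding header. This is a minimal sketch, assuming the rest of the header list stays as above; withoutAcceptEncoding() and fetchDecoded() are hypothetical helper names introduced only for illustration.

```php
<?php
// Hypothetical helper: drops any manually set Accept-Encoding header so that
// cURL can negotiate and decode the compression itself.
function withoutAcceptEncoding(array $headers): array
{
    return array_values(array_filter(
        $headers,
        fn (string $h): bool => stripos($h, 'Accept-Encoding:') !== 0
    ));
}

// Hypothetical helper: fetches a URL and returns the decompressed body,
// or false if the request fails.
function fetchDecoded(string $url, array $headers = []): string|false
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, withoutAcceptEncoding($headers));
    // An empty string asks cURL to advertise every encoding it supports
    // and to decompress the response before returning it.
    curl_setopt($ch, CURLOPT_ENCODING, '');
    return curl_exec($ch);
}
```

With this in place, $html = fetchDecoded($url, $headers); should yield plain HTML that can be passed to $doc->loadHTML(), provided the body is not empty.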
Answered By - KIKO Software
Answer Checked By - Cary Denson (WPSolving Admin)