Friday, December 1, 2023

[SOLVED] Fatal error: Uncaught ValueError: DOMDocument::loadHTML(): Argument #1 ($source) must not be empty

December 01, 2023 curl, domdocument, php, web-scraping

Issue

I want to scrape this website in PHP using cURL.

I use similar web-scraping scripts in PHP, and they work well.

Nevertheless, I get the following error:

Fatal error: Uncaught ValueError: DOMDocument::loadHTML(): Argument #1 ($source) must not be empty, followed by Stack trace: #0 [...](29): DOMDocument->loadHTML() #1 {main} thrown in [...] on line 29.

The error message references line 29, i.e. $doc->loadHTML($html); (the final line in the following code):

<?php
ini_set('display_errors', '1');
ini_set('display_startup_errors', '1');
error_reporting(E_ALL);

$ch = curl_init();

// Set the cURL options
curl_setopt($ch, CURLOPT_URL, "https://link.springer.com/book/10.1007/978-3-031-10453-4");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)');

// Execute the cURL request and fetch the HTML source code
$html = curl_exec($ch);

if (curl_error($ch)){
    $output = "\n". curl_error($ch);
    echo $output;
    die();
}

// Close the cURL handle
curl_close($ch);

// Create a new DOMDocument object
$doc = new DOMDocument();

// Load the HTML source code
$doc->loadHTML($html);

I do not think that I am blocked by the website I want to reach -- it is my very first test with that website.

What could be the issue behind the error?

Solution

This seems to be a headers issue: after I added a set of browser-like headers, the request worked. I suspect the Cookie header in particular is needed, but I am not sure. This is what I did:

$ch = curl_init();

$headers = ['Host: link.springer.com',
            'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0',
            'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
            'Accept-Language: en-US,en;q=0.5',
            'Accept-Encoding: gzip, deflate, br',
            'Alt-Used: link.springer.com',
            'Connection: keep-alive',
            'Cookie: sim-inst-token=""; trackid="zx2x19io4tsuk4vmfxbyuzwmw"; idp_session=sVERSION_14a94b0be-f404-438f-a28a-7fa100f381d4; idp_session_http=hVERSION_1982d3b5c-a7fc-4ee4-a0a5-5cbfbc9f57fc; idp_marker=4b716e5d-4a3f-4bc4-bf74-be595a696d13; user.uuid.v2="2577d0e2-af55-4b19-b21b-60267616d034"; sncc=P%3D17%3AV%3D35.0.0%26C%3DC01',
            'Upgrade-Insecure-Requests: 1',
            'If-None-Match: "9887e0016a5f403a529a66099f5cc49d"',
            'TE: trailers'];

// Set the cURL options
curl_setopt($ch, CURLOPT_URL, "https://link.springer.com/book/10.1007/978-3-031-10453-4");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);

// Execute the cURL request and fetch the HTML source code
$html = curl_exec($ch);

More investigation is needed to know exactly which headers are required; I have not tested them one by one.
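
One way to carry out that investigation is to look at the HTTP status code that comes back with the empty body, and to refuse to call loadHTML() when there is nothing to parse. The following is only a sketch: fetchWithDiagnostics() and describeEmptyBody() are hypothetical helper names, not part of the original answer, and I have not confirmed which status the site actually returns. A 3xx status would point to a redirect (cURL does not follow redirects unless CURLOPT_FOLLOWLOCATION is set), while a 403 would suggest the request is being rejected.

```php
<?php
// Hypothetical helper (an assumption, not from the original answer):
// fetch a URL and return the body together with the HTTP status and
// any cURL error text, so that an empty body can be diagnosed.
function fetchWithDiagnostics(string $url, array $headers = []): array
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    if ($headers !== []) {
        curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    }

    $body   = curl_exec($ch);                        // string on success, false on failure
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE); // 0 if no HTTP response was received
    $error  = curl_error($ch);                       // '' if no transport error occurred

    return ['body' => $body, 'status' => $status, 'error' => $error];
}

// Hypothetical guard: returns a human-readable reason when the body
// cannot be passed to DOMDocument::loadHTML(), or null when it is usable.
function describeEmptyBody($body, int $status, string $error): ?string
{
    if ($body === false) {
        return "cURL failed: $error";
    }
    if (trim($body) === '') {
        return "Empty body, HTTP status $status";
    }
    return null;
}

// Usage sketch (performs a network request, so it is left commented out):
// $r = fetchWithDiagnostics('https://link.springer.com/book/10.1007/978-3-031-10453-4');
// $reason = describeEmptyBody($r['body'], $r['status'], $r['error']);
// if ($reason !== null) { die($reason); }
// $doc = new DOMDocument();
// $doc->loadHTML($r['body']);
```

Calling this with different subsets of the headers and printing the status each time should show which header changes the response. If-None-Match is a conditional-request header that a server may answer with 304 Not Modified and no body, so it is probably worth dropping first.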

Note that the response is compressed, because the request sends an Accept-Encoding header, which is why the output doesn't look like HTML.
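
To get readable HTML back, one option is to let cURL handle the compression itself instead of sending Accept-Encoding by hand. Setting CURLOPT_ENCODING to an empty string makes cURL advertise every encoding it was built with and decode the body before returning it. This is a sketch under that assumption: withoutAcceptEncoding() and fetchDecoded() are hypothetical helper names, and I have not verified the result against the site in question.

```php
<?php
// Hypothetical helper: drop a hand-written Accept-Encoding header so that
// cURL can negotiate (and later decode) the content encoding itself.
function withoutAcceptEncoding(array $headers): array
{
    return array_values(array_filter(
        $headers,
        // Keep every header that does not start with "Accept-Encoding:".
        fn (string $h): bool => stripos($h, 'Accept-Encoding:') !== 0
    ));
}

// Hypothetical helper: fetch a page and let cURL decompress the response.
// Returns the decoded body as a string, or false on failure.
function fetchDecoded(string $url, array $headers)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, withoutAcceptEncoding($headers));
    // Empty string: send an Accept-Encoding header listing all encodings
    // this cURL build supports, and decode the response automatically.
    curl_setopt($ch, CURLOPT_ENCODING, '');

    return curl_exec($ch);
}

// Usage sketch (performs a network request, so it is left commented out):
// $html = fetchDecoded('https://link.springer.com/book/10.1007/978-3-031-10453-4', $headers);
```

The header is filtered out because a manually supplied Accept-Encoding can request an encoding such as br that the local cURL build may not be able to decode.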

Answered By - KIKO Software

Answer Checked By - Cary Denson (WPSolving Admin)

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
