Issue
I have a form that has fields for a couple of URLs. I wrote a Zend Framework validator that does a trivial preg_match to screen out ridiculous strings, and then does a curl HEAD request (CURLOPT_NOBODY) to screen out 404s and other connectivity issues. In testing I came across the mysterious return code 0 with "unknown SSL protocol error", so I added a check to accept as valid anything that gave a message with "SSL" in it, since that would suggest that the URL reached a webserver.
But one particular URL that our customers would likely use in practice redirects to an s3.amazonaws.com URL for a PDF file. In a browser, both the original URL, and the s3 URL it redirects to, display the PDF just fine. Since I used CURLOPT_FOLLOWLOCATION, I expected my validator would accept it. But instead it gave a 404. I then tried specifying the s3 URL directly, and that gave a 403(!). Thinking that possibly the 403 was triggered by the fact that I had specified a header of 'HTTP_X_REQUESTED_WITH: XMLHttpRequest', I commented out that line in the code. But it still gave a 403.
How can this happen? It seems to me that Amazon S3 would have to look for HEAD requests explicitly, and deliberately issue a 404 or 403 depending on whether the request came via a redirect?
I suppose I could delete the CURLOPT_NOBODY to have it send a GET request, but that seems silly since I don't care about the body.
Here is my complete code:
<?php
class Oshk_ZendX_Validate_Url {
    static $debug = true;
    // Based on https://stackoverflow.com/a/42619410/467590
    const PATTERN = '/^(https?:\/\/)?[^" ]+(\.[^" ]+)*$/';
    public static function isValid($value) {
        $STDERR = fopen("php://stderr", "w");
        $value = (string) $value;
        $matches = array();
        if (! preg_match(self::PATTERN, $value, $matches)) {
            fwrite($STDERR, sprintf("File '%s', line %d, value '%s' does not match pattern '%s'\n", __FILE__, __LINE__, $value, self::PATTERN));
            fclose($STDERR);
            return false;
        }
        if (! array_key_exists(1, $matches)) {
            $value = "https://$value";
        }
        if (self::$debug) {
            fwrite($STDERR, sprintf("File '%s', line %d, \$value = '%s', \$matches = %s", __FILE__, __LINE__, $value, print_r($matches, true)));
        }
        // URL looks well-formed. Ask curl to send a HEAD request to it
        $ch = curl_init($value);
        if ($ch === false) {
            throw new Exception("curl_init($value) failed!");
        }
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_HEADER, 0); // From https://www.php.net/manual/en/curl.examples-basic.php
        curl_setopt($ch, CURLOPT_HTTPHEADER, array('HTTP_X_REQUESTED_WITH: XMLHttpRequest'));
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36');
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
        curl_setopt($ch, CURLOPT_NOBODY, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_FAILONERROR, true);
        if (self::$debug) {
            curl_setopt($ch, CURLOPT_VERBOSE, true);
            curl_setopt($ch, CURLOPT_STDERR, $STDERR);
        }
        $data = curl_exec($ch);
        $msg = curl_error($ch);
        $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        if (self::$debug) {
            // https://stackoverflow.com/a/14436877/467590
            $allinfo = curl_getinfo($ch);
            fwrite($STDERR, sprintf("File '%s', line %d, \$allinfo = %s\n", __FILE__, __LINE__, print_r($allinfo, true)));
        }
        curl_close($ch);
        if (self::$debug) {
            fwrite($STDERR, sprintf("File '%s', line %d, data = '%s'\n", __FILE__, __LINE__, substr($data, 0, 255)));
        }
        if(! strlen($data) && $status != 0 && false === strpos($msg, 'SSL')) {
            fwrite($STDERR, sprintf("File '%s', line %d, '%s' gives bad status code %d when accessed, with message '%s'\n", __FILE__, __LINE__, $value, $status, $msg));
            fclose($STDERR);
            return false;
        }
        if (self::$debug) {
            fwrite($STDERR, sprintf("File '%s', line %d, url = '%s'\n", __FILE__, __LINE__, $value));
            fwrite($STDERR, sprintf("File '%s', line %d, data = '%s'\n", __FILE__, __LINE__, substr($data, 0, 255)));
        }
        unset($data);
        if (self::$debug) {
            fwrite($STDERR, sprintf("File '%s', line %d, \$msg = '%s'\n", __FILE__, __LINE__, $msg));
            fwrite($STDERR, sprintf("File '%s', line %d, \$status = '%s'\n", __FILE__, __LINE__, $status));
            fwrite($STDERR, sprintf("File '%s', line %d, \$value = '%s'\n", __FILE__, __LINE__, $value));
        }
        if (($status >= 100 && $status < 400) || false !== strpos($msg, 'SSL')) {
            fclose($STDERR);
            return true;
        }
        fwrite($STDERR, sprintf("File '%s', line %d, '%s' gives bad status code %d when accessed, with message '%s'\n", __FILE__, __LINE__, $value, $status, $msg));
        fclose($STDERR);
        return false;
    }
}
echo var_dump(Oshk_ZendX_Validate_Url::isValid($argv[1]));
Here is the bash shell session running it with the original URL:
$ php curltest.php 'https://americandrivingsociety.org/docs.ashx?id=1037680'
File 'C:\xampp1826\htdocs\OSH0\curltest.php', line 21, $value = 'https://americandrivingsociety.org/docs.ashx?id=1037680', $matches = Array
(
[0] => https://americandrivingsociety.org/docs.ashx?id=1037680
[1] => https://
)
* Trying 208.66.171.71:443...
* Connected to americandrivingsociety.org (208.66.171.71) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
* CAfile: \xampp7412\apache\bin\curl-ca-bundle.crt
CApath: none
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server did not agree to a protocol
* Server certificate:
* subject: CN=americandrivingsociety.org
* start date: Sep 2 00:00:00 2022 GMT
* expire date: Oct 3 23:59:59 2023 GMT
* subjectAltName: host "americandrivingsociety.org" matched cert's "americandrivingsociety.org"
* issuer: C=GB; ST=Greater Manchester; L=Salford; O=Sectigo Limited; CN=Sectigo RSA Domain Validation Secure Server CA
* SSL certificate verify ok.
> HEAD /docs.ashx?id=1037680 HTTP/1.1
Host: americandrivingsociety.org
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36
Accept: */*
HTTP_X_REQUESTED_WITH: XMLHttpRequest
* old SSL session ID is stale, removing
* Mark bundle as not supporting multiuse
* The requested URL returned error: 404 Not Found
* Closing connection 0
File 'C:\xampp1826\htdocs\OSH0\curltest.php', line 46, $allinfo = Array
(
[url] => https://americandrivingsociety.org/docs.ashx?id=1037680
[content_type] =>
[http_code] => 404
[header_size] => 0
[request_size] => 250
[filetime] => -1
[ssl_verify_result] => 0
[redirect_count] => 0
[total_time] => 0.132769
[namelookup_time] => 0.009406
[connect_time] => 0.035694
[pretransfer_time] => 0.090879
[size_upload] => 0
[size_download] => 0
[speed_download] => 0
[speed_upload] => 0
[download_content_length] => -1
[upload_content_length] => -1
[starttransfer_time] => 0.132714
[redirect_time] => 0
[redirect_url] =>
[primary_ip] => 208.66.171.71
[certinfo] => Array
(
)
[primary_port] => 443
[local_ip] => 16.1.1.151
[local_port] => 55977
[http_version] => 2
[protocol] => 2
[ssl_verifyresult] => 0
[scheme] => HTTPS
[appconnect_time_us] => 90757
[connect_time_us] => 35694
[namelookup_time_us] => 9406
[pretransfer_time_us] => 90879
[redirect_time_us] => 0
[starttransfer_time_us] => 132714
[total_time_us] => 132769
)
File 'C:\xampp1826\htdocs\OSH0\curltest.php', line 50, data = ''
File 'C:\xampp1826\htdocs\OSH0\curltest.php', line 53, 'https://americandrivingsociety.org/docs.ashx?id=1037680' gives bad status code 404 when accessed, with message 'The requested URL returned error: 404 Not Found'
C:\xampp1826\htdocs\OSH0\curltest.php:77:
bool(false)
repete@DESKTOP-CLQS7C1 /cygdrive/c/xampp1826/htdocs/OSH0
$
Here's the same thing using the s3 URL it redirects to:
$ php curltest.php 'https://s3.amazonaws.com/ClubExpressClubFiles/548049/documents/Omnibus_02-01-2023_Black_Prong_Driving_Derby_4_581817244.pdf?AWSAccessKeyId=AKIA6MYUE6DNNNCCDT4J&Expires=1683645984&response-content-disposition=inline%3B%20filename%3DOmnibus_02-01-2023_Black_Prong_Driving_Derby_4.pdf&Signature=YQGemVm9Gphf2EZ%2F4K%2FIyK%2Bmm7I%3D'
File 'C:\xampp1826\htdocs\OSH0\curltest.php', line 21, $value = 'https://s3.amazonaws.com/ClubExpressClubFiles/548049/documents/Omnibus_02-01-2023_Black_Prong_Driving_Derby_4_581817244.pdf?AWSAccessKeyId=AKIA6MYUE6DNNNCCDT4J&Expires=1683645984&response-content-disposition=inline%3B%20filename%3DOmnibus_02-01-2023_Black_Prong_Driving_Derby_4.pdf&Signature=YQGemVm9Gphf2EZ%2F4K%2FIyK%2Bmm7I%3D', $matches = Array
(
[0] => https://s3.amazonaws.com/ClubExpressClubFiles/548049/documents/Omnibus_02-01-2023_Black_Prong_Driving_Derby_4_581817244.pdf?AWSAccessKeyId=AKIA6MYUE6DNNNCCDT4J&Expires=1683645984&response-content-disposition=inline%3B%20filename%3DOmnibus_02-01-2023_Black_Prong_Driving_Derby_4.pdf&Signature=YQGemVm9Gphf2EZ%2F4K%2FIyK%2Bmm7I%3D
[1] => https://
)
* Trying 52.216.56.0:443...
* Connected to s3.amazonaws.com (52.216.56.0) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
* CAfile: \xampp7412\apache\bin\curl-ca-bundle.crt
CApath: none
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use http/1.1
* Server certificate:
* subject: CN=s3.amazonaws.com
* start date: Apr 11 00:00:00 2023 GMT
* expire date: Dec 20 23:59:59 2023 GMT
* subjectAltName: host "s3.amazonaws.com" matched cert's "s3.amazonaws.com"
* issuer: C=US; O=Amazon; CN=Amazon RSA 2048 M01
* SSL certificate verify ok.
> HEAD /ClubExpressClubFiles/548049/documents/Omnibus_02-01-2023_Black_Prong_Driving_Derby_4_581817244.pdf?AWSAccessKeyId=AKIA6MYUE6DNNNCCDT4J&Expires=1683645984&response-content-disposition=inline%3B%20filename%3DOmnibus_02-01-2023_Black_Prong_Driving_Derby_4.pdf&Signature=YQGemVm9Gphf2EZ%2F4K%2FIyK%2Bmm7I%3D HTTP/1.1
Host: s3.amazonaws.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36
Accept: */*
HTTP_X_REQUESTED_WITH: XMLHttpRequest
* Mark bundle as not supporting multiuse
* The requested URL returned error: 403 Forbidden
* Closing connection 0
File 'C:\xampp1826\htdocs\OSH0\curltest.php', line 46, $allinfo = Array
(
[url] => https://s3.amazonaws.com/ClubExpressClubFiles/548049/documents/Omnibus_02-01-2023_Black_Prong_Driving_Derby_4_581817244.pdf?AWSAccessKeyId=AKIA6MYUE6DNNNCCDT4J&Expires=1683645984&response-content-disposition=inline%3B%20filename%3DOmnibus_02-01-2023_Black_Prong_Driving_Derby_4.pdf&Signature=YQGemVm9Gphf2EZ%2F4K%2FIyK%2Bmm7I%3D
[content_type] =>
[http_code] => 403
[header_size] => 0
[request_size] => 523
[filetime] => -1
[ssl_verify_result] => 0
[redirect_count] => 0
[total_time] => 0.128771
[namelookup_time] => 0.027331
[connect_time] => 0.043198
[pretransfer_time] => 0.107906
[size_upload] => 0
[size_download] => 0
[speed_download] => 0
[speed_upload] => 0
[download_content_length] => -1
[upload_content_length] => -1
[starttransfer_time] => 0.128721
[redirect_time] => 0
[redirect_url] =>
[primary_ip] => 52.216.56.0
[certinfo] => Array
(
)
[primary_port] => 443
[local_ip] => 16.1.1.151
[local_port] => 56277
[http_version] => 2
[protocol] => 2
[ssl_verifyresult] => 0
[scheme] => HTTPS
[appconnect_time_us] => 107740
[connect_time_us] => 43198
[namelookup_time_us] => 27331
[pretransfer_time_us] => 107906
[redirect_time_us] => 0
[starttransfer_time_us] => 128721
[total_time_us] => 128771
)
File 'C:\xampp1826\htdocs\OSH0\curltest.php', line 50, data = ''
File 'C:\xampp1826\htdocs\OSH0\curltest.php', line 53, 'https://s3.amazonaws.com/ClubExpressClubFiles/548049/documents/Omnibus_02-01-2023_Black_Prong_Driving_Derby_4_581817244.pdf?AWSAccessKeyId=AKIA6MYUE6DNNNCCDT4J&Expires=1683645984&response-content-disposition=inline%3B%20filename%3DOmnibus_02-01-2023_Black_Prong_Driving_Derby_4.pdf&Signature=YQGemVm9Gphf2EZ%2F4K%2FIyK%2Bmm7I%3D' gives bad status code 403 when accessed, with message 'The requested URL returned error: 403 Forbidden'
C:\xampp1826\htdocs\OSH0\curltest.php:77:
bool(false)
repete@DESKTOP-CLQS7C1 /cygdrive/c/xampp1826/htdocs/OSH0
$
Solution
> I added a check to accept as valid anything that gave a message with "SSL" in it
This seems dangerous. What if the error message is "Invalid SSL certificate"?
> since that would suggest that the URL reached a webserver
This is true of any response -- 300, 400, 500, whatever. If your connection didn't time out, then you've successfully connected to something, regardless of the status code. I.e., by this logic, if "reaching a webserver" is what you're validating, then only a timeout should fail.
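One way to act on both points is to branch on the numeric value from curl_errno() rather than searching the text of curl_error() for "SSL". The following is only a sketch: the function name curlOutcome() and the three-way grouping are assumptions made for illustration, and the numbers are the error codes documented by libcurl.

```php
<?php
/**
 * Hypothetical helper: map a libcurl error number to a coarse outcome.
 * The grouping is an assumption of this sketch, not part of the original code.
 */
function curlOutcome(int $errno): string
{
    switch ($errno) {
        case 0:   // CURLE_OK
        case 22:  // CURLE_HTTP_RETURNED_ERROR (raised only with CURLOPT_FAILONERROR)
            return 'server-responded';
        case 6:   // CURLE_COULDNT_RESOLVE_HOST
        case 7:   // CURLE_COULDNT_CONNECT
        case 28:  // CURLE_OPERATION_TIMEDOUT
            return 'unreachable';
        case 35:  // CURLE_SSL_CONNECT_ERROR (handshake problem)
        case 60:  // certificate could not be verified (CURLE_SSL_CACERT in PHP)
            return 'tls-failure';
        default:
            return 'other-error';
    }
}
```

It would be called as curlOutcome(curl_errno($ch)) right after curl_exec(), so a certificate failure can be rejected while an HTTP error status still counts as a server having answered.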
> I suppose I could delete the CURLOPT_NOBODY to have it send a GET request, but that seems silly since I don't care about the body.
You can't expect that every URL will be successfully reachable via a HEAD request, or that the results of a HEAD request will always be the same as the results of a GET request.
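A possible explanation for the 403 in this case, which the original post does not confirm, is that a pre-signed S3 URL is signed for a specific HTTP method, so a URL signed for GET is refused when requested with HEAD. If the body is the only objection to GET, one option is to send a GET and close the connection once the headers have arrived. The sketch below does that with PHP's HTTP stream wrapper; the function names are hypothetical, and it assumes allow_url_fopen is enabled (plus the openssl extension for https).

```php
<?php
/**
 * Return the status code from the last "HTTP/x.y NNN" line in a header list.
 * After redirects the list holds one status line per hop, so the last one
 * describes the final response. Returns 0 if no status line is present.
 */
function lastStatusCode(array $headerLines): int
{
    $status = 0;
    foreach ($headerLines as $line) {
        if (preg_match('~^HTTP/\d+(?:\.\d+)?\s+(\d{3})~', $line, $m)) {
            $status = (int) $m[1];
        }
    }
    return $status;
}

/**
 * Send a GET request and close the stream without reading the body.
 * Returns the final status code, or 0 if no HTTP response was received.
 */
function probeWithGet(string $url, float $timeoutSeconds = 5.0): int
{
    $context = stream_context_create(['http' => [
        'method'        => 'GET',
        'timeout'       => $timeoutSeconds,
        'ignore_errors' => true, // open the stream even on 4xx/5xx responses
    ]]);
    $stream = @fopen($url, 'r', false, $context);
    if ($stream === false) {
        return 0; // DNS failure, refused connection, timeout, TLS failure, ...
    }
    // For http(s) streams, wrapper_data holds the response header lines.
    $meta = stream_get_meta_data($stream);
    fclose($stream); // this code never reads the body
    return lastStatusCode($meta['wrapper_data'] ?? []);
}
```

Calling probeWithGet() on the URL from the question would show whether GET and HEAD really produce different statuses for that server.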
> curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
Don't do this. If the verification fails, you want the request to fail; that's the whole point of SSL.
Overall, if you're not going to validate the actual content of the page, then I don't think it makes any sense to even make the request. Just validate the syntax of the URL. Otherwise, you're going to fail on things like transient network errors, maintenance downtimes, ad blockers, IP-based filtering, etc. You've got acres of code for what should just be one line:
class Oshk_ZendX_Validate_Url {
    public static function isValid(string $url): bool
    {
        return (bool) filter_var($url, FILTER_VALIDATE_URL);
    }
}
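Two caveats apply when swapping this in for the original validator. FILTER_VALIDATE_URL requires a scheme, whereas the code in the question accepted scheme-less input and prefixed it with https://, and the filter also accepts schemes other than http and https. A minimal sketch that covers both follows; the function name isValidHttpUrl() and the choice to allow only http and https are assumptions made for illustration.

```php
<?php
/**
 * Hypothetical helper: accept only http(s) URLs, adding "https://" when the
 * input has no scheme (as the validator in the question did).
 */
function isValidHttpUrl(string $url): bool
{
    $url = trim($url);
    // Prepend a default scheme if the input does not start with one.
    if (!preg_match('~^[a-z][a-z0-9+.-]*://~i', $url)) {
        $url = 'https://' . $url;
    }
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    // filter_var() accepts other schemes too, so check the scheme explicitly.
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    return in_array($scheme, ['http', 'https'], true);
}
```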
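If you also want to test the connection and make sure there's a live server answering the request at the time of form submission, then the status doesn't really matter, and you can just check for a non-false return value from the HTTP wrapper via file_get_contents(). Note that by default the HTTP wrapper treats 4xx and 5xx responses as failures and returns false, so set the ignore_errors context option if those should count as a live server: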
class Oshk_ZendX_Validate_Url {
    public static function isValid(string $url): bool
    {
        return filter_var($url, FILTER_VALIDATE_URL) &&
            file_get_contents($url) !== false;
    }
}
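As written, file_get_contents() downloads the whole document and has no explicit timeout. A hedged variant is sketched below: it passes a stream context with ignore_errors and timeout, and caps the read at one byte. The function name isReachableUrl() and the five-second default are assumptions, not part of the original answer.

```php
<?php
/**
 * Hypothetical helper: true if the URL is syntactically valid and some HTTP
 * server answers it, whatever the status code.
 */
function isReachableUrl(string $url, float $timeoutSeconds = 5.0): bool
{
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    $context = stream_context_create(['http' => [
        'timeout'       => $timeoutSeconds,
        'ignore_errors' => true, // treat 4xx/5xx responses as "a server answered"
    ]]);
    // Read at most one byte so a large document is not downloaded in full.
    return @file_get_contents($url, false, $context, 0, 1) !== false;
}
```

With these options a 403 or 404 still returns true, which matches the stated goal of only checking that a live server answers.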
Answered By - Alex Howansky Answer Checked By - Katrina (WPSolving Volunteer)