Issue
I am creating a web scraper for personal use that scrape car dealership sites based on my personal input but several of the sites that I attempting to collect data from a blocked by a redirected captcha page. The current site I am scraping with curl returns this HTML
<html>
<head>
<title>You have been blocked</title>
<style>#cmsg{animation: A 1.5s;}@keyframes A{0%{opacity:0;}99%{opacity:0;}100%{opacity:1;}}</style>
</head>
<body style="margin:0">
<p id="cmsg">Please enable JS and disable any ad blocker</p>
<script>
var dd={'cid':'AHrlqAAAAAMA1gZrYHNP4MIAAYhtzg==','hsh':'C0705ACD75EBF650A07FF8291D3528','t':'fe','host':'geo.captcha-delivery.com'}
</script>
<script src="https://ct.captcha-delivery.com/c.js"></script>
</body>
</html>
I am using this to scrape the page:
<?php
function web_scrape($url)
{
$ch = curl_init();
$imei = "013977000272744";
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_VERBOSE, 1);
curl_setopt($ch, CURLOPT_COOKIE, '_ym_uid=1460051101134309035; _ym_isad=1; cxx=80115415b122e7c81172a0c0ca1bde40; _ym_visorc_20293771=w');
curl_setopt($ch, CURLOPT_POSTFIELDS, array(
'imei' => $imei,
));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$server_output = curl_exec($ch);
return $server_output;
curl_close($ch);
}
echo web_scrape($url);
?>
And to reiterate what I want to do; I want to collect the Recaptcha from this page so when I want to view the page details on an external site I can fill in the Recaptcha on my external site and then scrape the page initially imputed. Any response would be great!
Solution
based on the high demand for code, HERE is my upgraded scraper that bypassed this specific issue. However my attempt to obtain the captcha did not work and I still have not solved how to obtain it.
include "simple_html_dom.php";
/**
* Get a web file (HTML, XHTML, XML, image, etc.) from a URL. Return an
* array containing the HTTP server response header fields and content.
*/
// This function is where the Magic comes from. It bypasses ever peice of security carsales.com.au can throw at me
function get_web_page( $url ) {
$options = array(
CURLOPT_RETURNTRANSFER => true, // return web page
CURLOPT_HEADER => false, // don't return headers
CURLOPT_FOLLOWLOCATION => true, // follow redirects
CURLOPT_ENCODING => "", // handle all encodings
CURLOPT_USERAGENT => "spider", // who am i
CURLOPT_AUTOREFERER => true, // set referer on redirect
CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
CURLOPT_TIMEOUT => 120, // timeout on response
CURLOPT_MAXREDIRS => 10, // stop after 10 redirects
CURLOPT_SSL_VERIFYPEER => false // Disabled SSL Cert checks
);
$ch = curl_init( $url ); //initiate the Curl program that we will use to scrape data off the webpage
curl_setopt_array( $ch, $options ); //set the data sent to the webpage to be readable by the webpage (JSON)
$content = curl_exec( $ch ); //creates function to read pages content. This variable will be used to hold the sites html
$err = curl_errno( $ch ); //errno function that saves all the locations our scraper is sent to. This is just for me so that in the case of a error,
//I can see what parts of the page has it seen and more importantly hasnt seen
$errmsg = curl_error( $ch ); //check error message function. for example if I am denied permission this string will be equal to: 404 access denied
$header = curl_getinfo( $ch ); //the information of the page stored in a array
curl_close( $ch ); //Closes the Curler to save site memory
$header['errno'] = $err; //sending the header data to the previously made errno, which contains a array path of all the places my scraper has been
$header['errmsg'] = $errmsg; //sending the header data to the previously made error message checker function.
$header['content'] = $content; //sending the header data to the previously made content checker that will be the variable holder of the webpages HTML.
return $header; //Return all the pages data and my identifying functions in a array. To be used in the presentation of the search results.
};
//using the function we just made, we use the url genorated by the form to get a developer view of the scraping.
$response_dev = get_web_page($url);
// print_r($response_dev);
$response = end($response_dev); //takes only the end of the developer response because the rest is for my eyes only in the case that the site runs into a issue
Answered By - Joel Mac Answer Checked By - Dawn Plyler (WPSolving Volunteer)