
Web Scraper API for PHP

To get started, download the Web Scraper API for PHP and look at the example handler included in it.

Process Scraped Data

The easiest way to process scraped data is to access it as a JSON or XML object, which makes the data easy to manipulate and query. The JSON has the following general structure: the dataset name is an attribute of the root object and holds an array of objects, each of which has one attribute per column name.

{
  "Dataset_Name": [
    {
      "Column_One": "https://grabz.it/",
      "Column_Two": "Found"
    },
    {
      "Column_One": "http://dfadsdsa.com/",
      "Column_Two": "Missing"
    }]
}
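
As a minimal sketch of how this structure maps onto PHP objects, the snippet below decodes the sample above with PHP's built-in json_decode and walks the rows. It is an illustration only: the variable names are hypothetical, and it assumes the scraped JSON reaches your code in exactly the shape shown above.

```php
<?php
// Sample data in the general format described above.
$rawJson = <<<'JSON'
{
  "Dataset_Name": [
    { "Column_One": "https://grabz.it/", "Column_Two": "Found" },
    { "Column_One": "http://dfadsdsa.com/", "Column_Two": "Missing" }
  ]
}
JSON;

// Decode into stdClass objects (the default when no second argument is given).
$data = json_decode($rawJson);

// Collect the Column_One value of every row whose Column_Two is "Found".
$foundUrls = [];
foreach ($data->Dataset_Name as $row) {
    if ($row->Column_Two === "Found") {
        $foundUrls[] = $row->Column_One;
    }
}

print_r($foundUrls);
```

The dataset name becomes a property of the decoded object, and each row is an object whose properties are the column names.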

Remember that the handler is sent all scraped data, which may include data that cannot be converted to JSON or XML objects. You must therefore check the type of data you have received before processing it.

$scrapeResult = new \GrabzIt\Scraper\ScrapeResult();

if ($scrapeResult->getExtension() == 'json')
{
    $json = $scrapeResult->toJSON();
    foreach ($json->Dataset_Name as $obj)
    {
        if ($obj->Column_Two == "Found")
        {
            //do something
        }
        else
        {
            //do something else
        }
    }
}
else
{
    //probably a binary file etc save it
    $scrapeResult->save("results/".$scrapeResult->getFilename());
}

The example above loops through all the results in the dataset Dataset_Name and performs a different action depending on the value of the Column_Two attribute. If the file received by the handler is not a JSON file, it is simply saved to the results directory. While the ScrapeResult class does attempt to ensure that all posted files originate from GrabzIt's servers, the extension of each file should also be checked before it is saved.
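
One possible way to perform that extension check is sketched below. The buildSavePath helper and the ALLOWED_EXTENSIONS list are hypothetical and not part of the GrabzIt library. The sketch assumes getExtension() returns the extension without a leading dot, as the comparison with 'json' in the example above suggests.

```php
<?php
// Hypothetical allow-list: adjust it to the file types your scrape should produce.
const ALLOWED_EXTENSIONS = ['json', 'xml', 'csv', 'png', 'jpg', 'pdf'];

/**
 * Returns a path inside $directory for the file, or null if the
 * extension is not on the allow-list or the filename is unusable.
 */
function buildSavePath(string $directory, string $filename, string $extension): ?string
{
    // Normalise the extension so "XML" and ".xml" are treated alike.
    $extension = strtolower(ltrim($extension, '.'));
    if (!in_array($extension, ALLOWED_EXTENSIONS, true)) {
        return null;
    }

    // basename() drops directory components, guarding against path traversal.
    $safeName = basename($filename);
    if ($safeName === '' || $safeName === '.' || $safeName === '..') {
        return null;
    }

    return rtrim($directory, '/') . '/' . $safeName;
}

// Possible use inside a handler (assumes $scrapeResult is a ScrapeResult):
// $path = buildSavePath('results', $scrapeResult->getFilename(), $scrapeResult->getExtension());
// if ($path !== null) {
//     $scrapeResult->save($path);
// }
```

An allow-list is used rather than a block-list so that unexpected file types, such as executable scripts, are rejected by default.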

ScrapeResult Methods

Listed below are all of the methods of the ScrapeResult class that can be used to process scrape results.

  • string getExtension() - gets the extension of any file resulting from the scrape.
  • string getFilename() - gets the filename of any file resulting from the scrape.
  • object toJSON() - converts any JSON file resulting from the scrape into an object.
  • string toString() - converts any file resulting from the scrape to a string.
  • SimpleXMLElement toXML() - converts any XML file resulting from the scrape to an XML Element.
  • boolean save($path) - saves any file resulting from the scrape, returns true if it succeeds.

Debugging

The best way to debug your PHP handler is to download the results for a scrape from the web scrapes page, save the file you are having an issue with to an accessible location and then pass the path of that file to the constructor of the ScrapeResult class. This allows you to debug your handler without having to run a new scrape each time, as shown below.

$scrapeResult = new \GrabzIt\Scraper\ScrapeResult("data.json");

//the rest of your handler code remains the same
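
If you want to switch between a saved file and live callbacks without editing the handler, one option is to read the file path from the environment. This is only a sketch: the debugResultPath helper and the SCRAPE_DEBUG_FILE environment variable are hypothetical names and are not part of the GrabzIt API.

```php
<?php
/**
 * Returns the path of a saved result file to replay, or null when the
 * handler should process a live callback instead.
 * SCRAPE_DEBUG_FILE is a hypothetical environment variable.
 */
function debugResultPath(): ?string
{
    $path = getenv('SCRAPE_DEBUG_FILE');
    // Fall back to live mode if the variable is unset, empty, or not a readable file.
    if ($path === false || $path === '' || !is_file($path) || !is_readable($path)) {
        return null;
    }
    return $path;
}

// Possible use at the top of a handler (assumes the GrabzIt library is loaded):
// $path = debugResultPath();
// $scrapeResult = ($path !== null)
//     ? new \GrabzIt\Scraper\ScrapeResult($path)
//     : new \GrabzIt\Scraper\ScrapeResult();
```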

Controlling a Scrape

With GrabzIt's Web Scraper API for PHP you can change the status of a scrape by remotely starting, stopping, enabling or disabling it as needed. The example below shows this by passing the ID of the scrape along with the desired scrape status to the SetScrapeStatus method.

$client = new \GrabzIt\Scraper\GrabzItScrapeClient("YOUR APPLICATION KEY", "YOUR APPLICATION SECRET");
//Get all of our scrapes
$myScrapes = $client->GetScrapes();
if (empty($myScrapes))
{
    throw new Exception("You haven't created any scrapes yet! Create one here: https://grabz.it/scraper/scrape.aspx");
}
//Start the first scrape
$client->SetScrapeStatus($myScrapes[0]->ID, "Start");
if (count($myScrapes[0]->Results) > 0)
{
    //re-send first scrape result if it exists
    $client->SendResult($myScrapes[0]->ID, $myScrapes[0]->Results[0]->ID);
}

GrabzItScrapeClient Methods and Properties

Listed below are all of the methods and properties of the GrabzItScrapeClient class that can be used to control the state of scrapes.

  • GrabzItScrape[] GetScrapes() - returns all of the user's scrapes as an array of GrabzItScrape objects.
  • GrabzItScrape GetScrape($id) - returns a GrabzItScrape object representing the desired scrape.
  • SetScrapeProperty($id, $property) - sets the property of a scrape and returns true if successful.
  • SetScrapeStatus($id, $status) - sets the status ("Start", "Stop", "Enable", "Disable") of a scrape and returns true if successful.
  • SendResult($id, $resultId) - resends the result of a scrape and returns true if successful.
    • The scrape id and result id can be found from the GetScrape method.
  • SetLocalProxy($proxyUrl) - sets the local proxy server to be used for all requests.
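
Because SetScrapeStatus takes the status as a plain string, it can help to validate that string before making the request. The sketch below checks a value against the four statuses listed above. The normalizeScrapeStatus helper is hypothetical and not part of the GrabzIt library.

```php
<?php
/**
 * Returns the status in its documented spelling, or throws if it is
 * not one of "Start", "Stop", "Enable" or "Disable".
 */
function normalizeScrapeStatus(string $status): string
{
    $allowed = ['Start', 'Stop', 'Enable', 'Disable'];

    // Trim whitespace and normalise the case, e.g. " stop " becomes "Stop".
    $candidate = ucfirst(strtolower(trim($status)));
    if (!in_array($candidate, $allowed, true)) {
        throw new InvalidArgumentException("Unknown scrape status: " . $status);
    }
    return $candidate;
}

// Possible use (assumes $client is a GrabzItScrapeClient and $scrapeId is a valid scrape ID):
// $client->SetScrapeStatus($scrapeId, normalizeScrapeStatus($requestedStatus));
```

Failing early with an exception keeps a misspelled status from being sent to the API at all.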