To get started, download the Web Scraper API for ASP.NET and inspect the handler.ashx file located in the sample web project.
The easiest way to process scraped data is to access it as a JSON or XML object, as this enables the data to be easily manipulated and queried. The JSON is structured in the following general format, with the dataset name as the object attribute, itself containing an array of objects with each column name as another attribute.
{ "Items": [ { "Column_One": "https://grabz.it/", "Column_Two": "Found" }, { "Column_One": "http://dfadsdsa.com/", "Column_Two": "Missing" }] }
Remember that the handler will be sent all scraped data, which may include data that cannot be converted to JSON or XML objects. Therefore the type of data you are receiving must be checked before it is processed.
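For example, once the request has been parsed by the ScrapeResult class (introduced below), its Extension property can act as a simple guard. This is a minimal sketch; the full handler example appears further down.

ScrapeResult scrapeResult = new ScrapeResult(context.Request);
//only attempt JSON conversion when the posted file is actually JSON
if (scrapeResult.Extension == "json")
{
    //safe to convert the data to objects here
}
else
{
    //binary and other file types should be saved instead
}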
However, with the ASP.NET API an extra step is required to read JSON or XML files: classes must be created that match the expected data structure. An example is shown below, where two class definitions have been created to hold the above JSON data structure.
public class DataSet
{
    public List<Item> Items;
}

public class Item
{
    public string Column_One;
    public string Column_Two;
}
These classes are now used to convert a JSON file into a usable object structure. In the example below the ScrapeResult constructor receives the HttpRequest class; however, it also accepts the HttpRequestBase class, which makes it compatible with ASP.NET MVC web projects.
ScrapeResult scrapeResult = new ScrapeResult(context.Request);

if (scrapeResult.Extension == "json")
{
    DataSet dataSet = scrapeResult.FromJSON<DataSet>();
    foreach (Item item in dataSet.Items)
    {
        if (item.Column_Two == "Found")
        {
            //do something
        }
        else
        {
            //do something else
        }
    }
}
else
{
    //probably a binary file etc save it
    scrapeResult.save(context.Server.MapPath("~/results/" + scrapeResult.Filename));
}
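If a scrape is configured to export XML instead, the same class definitions can be reused. The sketch below assumes a FromXML<T> method mirroring FromJSON<T>; consult the ScrapeResult reference listed below to confirm the exact member name.

ScrapeResult scrapeResult = new ScrapeResult(context.Request);
if (scrapeResult.Extension == "xml")
{
    //assumes a FromXML<T> method analogous to FromJSON<T>
    DataSet dataSet = scrapeResult.FromXML<DataSet>();
    foreach (Item item in dataSet.Items)
    {
        //process each row as before
    }
}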
The above example shows how to loop through all the results of the DataSet class and perform specific actions depending on the value of the Column_Two property. Also, if the file received by the handler is not a JSON file it is simply saved to the results directory. While the ScrapeResult class does attempt to ensure that all posted files originate from GrabzIt's servers, the extension of the files should also be checked before they are saved, as shown in the sketch below.
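For instance, a simple whitelist check inside the handler might look like the following sketch; the allowed extensions listed here are illustrative and should match the file types your scrape actually produces.

//illustrative whitelist of extensions this handler expects to receive
string[] allowedExtensions = { "json", "jpg", "png", "pdf" };
if (Array.IndexOf(allowedExtensions, scrapeResult.Extension) >= 0)
{
    scrapeResult.save(context.Server.MapPath("~/results/" + scrapeResult.Filename));
}
else
{
    //ignore unexpected file types rather than saving them
}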
Listed below are all of the methods and properties of the ScrapeResult class that can be used to process scrape results.
The best way to debug your ASP.NET handler is to download the results for a scrape from the web scrapes page, save the file you are having an issue with to an accessible location, and then pass the path of that file to the constructor of the ScrapeResult class. This allows you to debug your handler without having to run a new scrape each time, as shown below.
ScrapeResult scrapeResult = new ScrapeResult("data.json"); #the rest of your handler code remains the same
With GrabzIt's Web Scraper API you can also change the status of a scrape, starting, stopping or disabling it as needed. This is shown in the example below by passing the ID of the scrape, along with the desired scrape status provided by the ScrapeStatus enum, to the SetScrapeStatus method.
GrabzItScrapeClient client = new GrabzItScrapeClient("Sign in to view your Application Key", "Sign in to view your Application Secret");

//Get all of our scrapes
GrabzItScrape[] myScrapes = client.GetScrapes();
if (myScrapes.Length == 0)
{
    throw new Exception("You haven't created any scrapes yet! Create one here: https://grabz.it/scraper/scrape/");
}

//Start the first scrape
client.SetScrapeStatus(myScrapes[0].ID, ScrapeStatus.Start);

if (myScrapes[0].Results.Length > 0)
{
    //re-send first scrape result if it exists
    client.SendResult(myScrapes[0].ID, myScrapes[0].Results[0].ID);
}
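A scrape can be stopped or disabled in the same way. The sketch below assumes the ScrapeStatus enum exposes Stop and Disable values corresponding to the statuses described above; check the GrabzItScrapeClient reference below for the exact enum members.

//stop the first scrape if it is currently running
client.SetScrapeStatus(myScrapes[0].ID, ScrapeStatus.Stop);
//or disable it entirely (assumes a Disable value on the ScrapeStatus enum)
client.SetScrapeStatus(myScrapes[0].ID, ScrapeStatus.Disable);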
Listed below are all of the methods and properties of the GrabzItScrapeClient class that can be used to control scrapes.