
Web Scraper Documentation

In order to create a web scrape you have to specify five types of information, spread across the following tabs.

  1. Target Websites
  2. Scrape Instructions
  3. Scrape Options
  4. Export Options
  5. Schedule Scrape

Target Websites

In the Target Websites tab you specify the websites you want to extract data from. To tell the scrape tool to extract data from a website you first have to specify the main URL you are interested in, e.g. http://www.example.com/shop/. This is where the scraper will start its scrape; it can be a normal webpage, PDF document, RSS feed or a sitemap. If it is an RSS feed or sitemap the scraper will find all the links in the file and visit each one.

To follow only the links found at the target URL, and not those on any subsequent pages, you can set the Follow Links scrape option to first page. This will use the target URL only to seed the rest of the scrape.

By default, the web scraper follows every link it discovers on each web page it visits. If you want to restrict which links the Web Scraper follows, one simple way to do this is to specify a URL Pattern. This works by specifying a URL with an asterisk as a wildcard to denote that any characters can be present in that part of the pattern. For instance http://www.example.com/*/articles/* would scrape any URLs that have articles as the second directory from the root of the website.
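To make the wildcard behaviour concrete, each asterisk can be thought of as matching any run of characters, much like a regular expression. The snippet below is only a rough sketch of that matching idea in plain JavaScript, not GrabzIt's own implementation; the pattern and test URLs are just examples.

// Illustrative only: treat each * in the pattern as "any characters"
var pattern = "http://www.example.com/*/articles/*";
var regex = new RegExp("^" + pattern.replace(/[.?+^$]/g, "\\$&").replace(/\*/g, ".*") + "$");

regex.test("http://www.example.com/news/articles/today");   // true - this URL would be scraped
regex.test("http://www.example.com/shop/products/today");   // false - this URL would be ignored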

The target URL can also be a URL with parameters to POST to, for instance a login form. To do this, specify the form URL in the Target URL text box and add the required POST parameters.

Finally you can specify Seed URLs to ensure that those URLs are scraped.

Seed URLs

Seed URLs allow a user to specify a list of URLs that must be crawled by the Web Scraper. If you only want the Seed URLs scraped, set the Follow Links scrape option to no pages in the Scrape Options tab.

To set Seed URLs on the Target Websites tab, click the Add Target button then check the Set Seed URLs checkbox and specify each URL to scrape on a separate line.

Create seed URLs from a Template URL

Alternatively you can automatically generate seed URLs by using a Template URL, which is a single URL that includes a URL Variable. A URL Variable specifies a range of numbers to be iterated over.

{{start number|finish number|iterate number}}

  • start number - the number that the URL Variable starts at
  • finish number - the number that the URL Variable ends at
  • iterate number - the number that the URL Variable iterates by

The start number is the number that the URL Variable starts counting at, the finish number is the number that it stops counting at, and the iterate number is the amount the number increases by on each iteration of the URL Variable.

For instance, take the following Template URL: http://www.example.com/search?pageNo={{1|3|1}}

This will then create the following seed URLs:

  • http://www.example.com/search?pageNo=1
  • http://www.example.com/search?pageNo=2
  • http://www.example.com/search?pageNo=3
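
The same expansion can be reproduced in plain JavaScript, as a way of checking what seed URLs a Template URL will generate. This is only a sketch; the expandTemplateUrl function is hypothetical and is not part of GrabzIt's scrape instructions.

// Hypothetical helper: expand a {{start|finish|iterate}} URL Variable into seed URLs
function expandTemplateUrl(template) {
    var match = template.match(/\{\{(\d+)\|(\d+)\|(\d+)\}\}/);
    var urls = [];
    for (var i = parseInt(match[1], 10); i <= parseInt(match[2], 10); i += parseInt(match[3], 10)) {
        urls.push(template.replace(match[0], String(i)));
    }
    return urls;
}

expandTemplateUrl("http://www.example.com/search?pageNo={{1|3|1}}");
// ["http://www.example.com/search?pageNo=1", "...pageNo=2", "...pageNo=3"]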

Scrape Instructions

Scrape instructions tell the Web Scraper what actions to carry out when scraping the target website(s). The Scrape Instructions tab shows the scrape wizard by default, which makes it easy to add the scrape instructions you need. To get started press the Add New Scrape Instruction link.

This will open the wizard, which will automatically load the web page found at the target URL. The wizard shows the target webpage or PDF document and allows you to select what you want to be scraped. When the scrape wizard loads a PDF document, the PDF is converted into HTML so that images, hyperlinks, text and tables can be selected and scraped. As a PDF has no underlying structure, tables are identified using heuristics and so are not always accurate.

When the webpage or PDF document has loaded you can click on any link and it will act as normal, for instance navigating to another webpage. However, once you choose one of the actions at the bottom of the screen, any clicks on the webpage will instead select the HTML element you wish to extract or manipulate.

The first thing to understand about scrape instructions is that they are executed on every web page by default. The way to stop this is through the use of templates. A template can be assigned to a link, or to an event such as a click, so that whenever the scraper visits that link or clicks that button it recognises that the page belongs to the assigned template. This allows different page types to be defined. For instance you might have a product category page that contains some overview information and then a detail page that contains the product information. Both pages would probably need a different set of scrape instructions.

Scraper Template

To get started choose the Follow Links or Click action. Once you have selected the items you want to perform the action on and clicked the Finish Selection button, enter the name of the template in the Create a Template text box. Now, whenever the scraper executes these actions, the template returned will be the name you have supplied.

Then, to assign a particular template to a scrape instruction, select the desired template from the Execute In drop down list in the options window that appears just before the scrape instruction is added. The three main options when choosing a template are as follows:

  • All Pages - do not use a template for this scrape instruction. The scrape instruction will be executed on all web pages.
  • Default Template - do not use one of the user defined templates. The scrape instruction will be executed on any web page that doesn't have a template specified.
  • User Defined Template - one of the templates that have been defined by you to identify a particular web page or action.

Once you have selected one of these options, the scrape instruction will only be executed on the template specified.

Focusing the Web Scraper

As previously mentioned, by default the web scraper follows every link on every web page it visits. One way to limit which links the web scraper follows is to use the Follow Links action. To do this use the wizard to select the links you are interested in, then select the Follow Links action and click the Finish Selection button, which will make the options box appear. Here you need to check the Follow Only checkbox. Doing so means the web scraper will only follow links found in the locations you have defined. Once you have selected the Follow Only option, all other Follow Links actions will automatically be set to Follow Only as well. This is especially useful if you only want the scraper to visit certain parts of the target website, such as product pages, that are not easily identifiable in other ways, for example by using a URL pattern.

Note that for the Follow Links action to work you must ensure that you have allowed the following of links under the Scrape Options tab; this is enabled by default.

Extracting Data

You will notice that when you select the Extract Data action a series of data items to extract immediately becomes available in the bottom left hand corner of the screen. These are properties of the whole page that you can extract. To choose one, just select it from the list of options and click Finished to add the data to the dataset.

If you wish to extract data from specific HTML elements rather than from the whole page you need to click on the relevant HTML elements; you can select single or multiple items. However, if you are selecting multiple items try to select items that are the same, such as multiple rows in a column, because if the scraper can't create a rule that uniquely identifies the selected collection of data a scrape instruction can't be created. Furthermore, if the multiple items you are clicking have been identified as repeating data by our web scraper wizard, all repeating data in that same group will be automatically selected. Once you have selected all your single or multiple items, choose an attribute to extract from the bottom left of the screen and then click Finished.

Creating a Dataset

The dataset screen allows you to change how the data is processed. For instance you can rename the dataset and the columns within it by clicking on the relevant name. To add more data to the dataset click on the Add Column button, or click the Delete Column button to remove data from the dataset. The dataset also allows various criteria to be applied to the data; to do this select the desired action from the top and then click on the relevant column to apply the criteria to. If you make a mistake adding criteria just click the Reset button.

Here is the list of different criteria types and how to use them:

  • Limit Rows - this will limit the number of rows extracted from the web page to the number you define. To use, click Limit Rows and then click on the row beyond which the data should be cut off.
  • Make Unique - removes any duplicate values. To use, just click Make Unique and then click the column you wish to make unique.
  • Use Pattern - specify a pattern to extract only an item of data from a block of text. To use, just click Use Pattern, select the relevant column and then follow the instructions to create a pattern that will return the relevant data from the string.
  • Sort Ascending - sorts by the column, ascending. To use, click Sort Ascending and then choose the column to sort by.
  • Sort Descending - sorts by the column, descending. To use, click Sort Descending and then choose the column to sort by.
  • Equal To - only include values that are equal to the defined value. To use, click Equal To, select the desired column and then enter the value the column should be equal to.
  • Not Equal To - only include values that are not equal to the defined value. To use, click Not Equal To, select the desired column and then enter the value the column should not be equal to.
  • Less Than - only include values that are less than the defined value. To use, click Less Than, select the desired column and then enter the value the column should be less than.
  • Greater Than - only include values that are greater than the defined value. To use, click Greater Than, select the desired column and then enter the value the column should be greater than.

Once you have modified everything you want to, click Finished and, if required, choose a template to run the scrape instructions in; your scrape instructions will then be added to the scrape.

Manipulating a Webpage

A webpage can be manipulated before it is scraped by clicking, typing and selecting values from drop down lists. To do this choose either the Click Element, Type Text or Select Drop Down List Value action. If you are performing a click action you can click on any number of elements on a webpage. Otherwise you must select an appropriate HTML element, for instance text should be typed into a text box. Then click Finish Selection. This will open an options box that allows you to complete the action. When typing text or selecting from a drop down list, you must also specify the text to be typed or the value to be selected. Other than that the options are the same for all three actions.

If you wish, you can select the template this action should be executed in and, for the click action, the template that applies once the click action is complete. However, assigning a new template to a click action that performs multiple clicks on the same page, such as opening inline popups or making things appear on screen, is not a good idea. This is because if the click action only executes on certain templates, the new template assigned by the first click would not be reset, and depending on how the scrape was written this could stop future clicks on the same page from being executed. You can also specify that the action should only be executed once, which is useful if you are doing something like logging into a website.

After actions that manipulate a webpage have executed, it is useful to wait for a while if the actions initiate AJAX functionality, to allow the AJAX content to load before continuing with the scrape. You can do this by adding a delay in the After Execution Wait text box.

You may wish to jump straight to a different URL once some condition has been met. To do this use the Go To URL action, which will only appear once at least one template has been defined in the scrape and which, when created, must be assigned to a template to help avoid infinite loops.

Finally, you can use all of GrabzIt's capture APIs in your web scrapes; just choose the Capture Webpage action and select your desired capture. You can limit this to capture only certain web pages within the scrape by specifying a template to execute in once you select the Finish Selection button.

After every scrape instruction is added it can be seen in the scrape instructions panel; the cross next to each scrape instruction allows it to be deleted. If a scrape instruction that is required by other scrape instructions is deleted, those instructions are also deleted. You can change the order of scrape instructions by dragging any scrape instruction with the grab icon.

Writing Scrape Instructions Manually

If you need to customize the scrape instructions in a more specific way or if you wish to execute code before or after scrapes you will need to alter the scrape instructions manually.

The scrape instructions are JavaScript based and the code editor comes complete with a syntax checker, auto-complete and tooltips to make writing them as easy as possible.

The core functionality of the code editor is accessible through the menu options shown in the screenshot; the purpose of each is explained separately below. Any syntax errors in your scrape instructions are indicated in the left hand gutter of the code editor.

Wizard the wizard allows you to select parts of the page you wish to extract and perform other common tasks such as creating web captures.

Display Scrape Instructions displays the scrape instructions code to the user.

Delete All Instructions deletes all of the scrape instructions.

Webpage Functions will enter the Page keyword into the scrape instructions and open the auto-complete, which contains all the possible Page functions. The Page functions allow you to extract data from the web page.

Data Functions will enter the Data keyword into the scrape instructions. Data functions allow you to save information.

Navigation Functions enters the Navigation keyword into the code editor. The Navigation functions allow you to control how the Web Scraper navigates the target website(s).

Global Functions enters the Global keyword into the scrape instructions. This gives you access to functions that can store data between parsing different web pages. When writing scrape instructions it is important to remember that the state of JavaScript variables in the scrape instructions is not kept when the scraper moves between webpages, unless you use the Global functions to save variables, as shown below.

// save a value so it survives as the scraper moves to the next page
Global.set("myvariable", "hello");
// read the value back, on this or a later page
var myvar = Global.get("myvariable");

To create a persistent global variable pass true to the persist parameter in the Global.set method, as shown below.

Global.set("myvariable", "hello", true);
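
For example, because ordinary JavaScript variables are lost between pages, a running count of how many pages have been parsed can be kept with the Global functions alone. The snippet below is only a sketch using the Global.get and Global.set calls shown above, and it assumes that Global.get returns an empty value for a variable that has not yet been set.

// keep a running page count across web pages using only Global functions
var count = Global.get("pageCount");
if (!count) {
    count = "0";        // assumption: an unset variable comes back empty
}
Global.set("pageCount", parseInt(count, 10) + 1);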

Utility Functions enters the Utility keyword into the scrape instructions. This allows you to use common functions that make writing scrapes easier, such as adding or removing querystring parameters from URLs.

Criteria Functions enters the Criteria keyword into the scrape instructions. These functions allows you to refine the data extracted during your scrape, such as eliminating duplicates.

Filter allows you to easily create a filter, which is required by some functions to select a particular HTML element from within a web page. Simply select the attributes that your target element, and/or its parent(s), should have in order to select that element. Before you click this option, ensure that your cursor is in the correct place in the function to pass the filter to.

Screenshot Functions allows you to set screenshot options. Simply place the cursor in the correct part of the function, as identified by the tooltip and press the screenshot options. Then choose all the options you wish and insert the command.

Performing Actions Before or After a Scrape

You can run commands before or after a scrape by using the drop down list of options at the top of the Scrape Instructions tab. Any commands entered when Execute After Scrape is selected will be run after the scrape has finished, while any commands entered when Execute Before Scrape is selected will be run before the scrape has started.

However, in either of these two special modes only a subset of the scrape instructions is available. The available commands are the Data, Global and Navigation scrape instructions.
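
For example, a value that the rest of the scrape relies on could be stored once in Execute Before Scrape mode using the Global functions described above; the variable name here is just an example.

// Execute Before Scrape: store a value once, before any page is visited
Global.set("scrapeLabel", "spring-catalogue");

The value can then be read back with Global.get("scrapeLabel") from the normal scrape instructions.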

Strings

Strings are used in scrape instructions, when performing a web scrape, to define text. A string is delimited by double (") or single (') quotes. If a string is started with a double quote it must end with a double quote; if a string starts with a single quote it must end with a single quote. For instance:

"my-class" and 'my-class'

A common error is the unclosed string error, which occurs when a string does not have a matching closing quote or contains a line break. The following are illegal strings:

"my
class"

"my class

To fix this error, ensure strings do not contain line breaks and have matching quotes, like so:

"my class" and "my class"

Sometimes you want a single or double quote to appear in a string. The easiest way to do this is to put a single quote in a string delimited with double quotes and a double quote in a string delimited with single quotes, like so:

"Bob's shop" and '"The best store on the web"'

Alternatively you can use a backslash to escape a quote like so:

'test\'s'
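
Since the scrape instructions are JavaScript based, the same escaping should also work for a double quote inside a double quoted string, for example:

"The \"best\" store on the web"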

Common Manual Scrape Tasks

  • Create a custom link checker - find out how to create a custom link checker by following these simple instructions.
  • Download all images from a website - find out how to download all the images from an entire website.
  • Extract data and transform it into a dataset - find out how to create a dataset from the website you are scraping.
  • Extract links from a website - find out how to extract all the HTML links from an entire website and save them in the format you desire.
  • Extracting values from text using patterns - find out how to use patterns to extract values from blocks of text.
  • Extract text from images - find out how to extract text contained within images.
  • How to pad a dataset - format your extracted data better by using padding.
  • Manipulating Arrays - find out how to use the special array utility methods to easily handle arrays within scrapes.
  • Perform an action only once during a scrape - find out how to perform an action only once during an entire scrape.
  • Refining scraped data - discover how to remove non-required data from your scrapes.
  • Scrape email addresses from a website - find out how to scrape all email addresses from a website.
  • Convert an entire website to PDFs or Images - find out how to use GrabzIt's Web Scraper to capture every page of an entire website.
  • Extract structured information from unstructured text - use GrabzIt to extract sentiment, names, locations and organizations.

Scrape Options

All of the following features are available to customize a web scrape on the Scrape Options tab.

Scrape Name the name of the scrape.

Use Default Filename if this is not selected the specified name, rather than the automatically generated filename, will be used for the zipped scrape results.

Follow Links provides the following options on how the scraper should follow links:

  • all pages - the scraper will follow every link it finds
  • no pages - the scraper will only follow the links you have already specified
  • first page - only follow the links found on the first page, specified as the target
  • up to n pages from initial page - only follow links on pages the specified number of clicks from the first page
  • in frames - follow links found in frames and iframes

Ignore Duplicates if set the scraper will ignore pages whose similarity is equal to or greater than the level you set, for instance you could ignore pages that are 95% the same.

Limit Scraped Pages determines how many pages the web scraper should scrape before stopping.

Test Scrape when this is set the Web Scraper outputs verbose log messages and automatically stops scraping after 100 seconds, allowing you to easily test your scrape instructions.

Use my Time Zone if set it indicates that the Web Scraper should attempt to convert any dates it scrapes into your local time zone. Your time zone can be set on the account page.

Page Load Delay this is the time in milliseconds the Web Scraper should wait before parsing a page. This is very useful if a page contains a lot of AJAX or is slow to load.

Export Options

This tab allows you to choose how you wish to export your results; your options include Excel spreadsheets, XML, JSON, CSV, SQL commands, or HTML documents. If you are only downloading files or creating web captures then there is no need to choose an export option, as you will just receive a ZIP file containing the results. This tab also allows you to specify how you wish to send the results. You can send the results via Amazon S3, Dropbox, email notification, FTP and WebDAV.

The final option is a Callback URL, which allows the scrape results to be processed in your application by using our scrape API.

Schedule Scrape

When creating a web scrape the Schedule Scrape tab allows you to set when you want the scrape to start and, if you want it to repeat, how frequently it should do so.
