
Web Scraper Documentation

To create a web scrape you need to specify five types of information, spread across the following tabs.

  1. Scrape Options
  2. Target Website
  3. Scrape Instructions
  4. Export Options
  5. Schedule Scrape

Scrape Options

All of the following features are available to customize a web scrape on the Scrape Options tab.

Scrape Name: the name of the scrape.

Follow Links: provides the following options for how the scraper should follow links:

Ignore File Downloads: if set, any links that would cause a file download when visited are not downloaded.

Ignore Robots.txt File: if set, the scraper will visit web pages that the website owner has excluded from being crawled.

Ignore Error Pages: if set, the web scraper will skip any web pages that report an error, that is, any HTTP status code of 400 or above.

Ignore URL Fragments: if set, the web scraper will ignore the part of the URL after the #. This fragment is commonly used to denote a bookmark on the same page and so would normally result in needless pages being scraped. However, some websites use fragments to show different content, in which case this setting should be disabled. This option is only applicable when Follow Links is not set to as required.

Ignore Duplicates: if set, the scraper will ignore pages whose similarity to an already scraped page is equal to or greater than the threshold you set; for instance, you could ignore pages that are 95% the same.

Limit Scrape: allows you to specify how many pages the web scraper should scrape before stopping.

Use My Timezone: if set, the Web Scraper will attempt to convert any dates it scrapes into your local time zone. Your time zone can be set on the account page.

Location: the geographic location the Web Scraper will perform the scrape from. This can be useful if the target website has restrictions based on location.

Default Date Format: when converting dates where the date format cannot be determined, the Web Scraper will default to this chosen format.

Page Load Delay: the time in milliseconds the Web Scraper should wait before parsing a page. This is very useful if a page contains a lot of AJAX or is slow to load.

Target Website

In the Target Website tab you specify the websites you want to extract data from. To tell the scrape tool to extract data from a website, you first have to specify the main URL you are interested in, e.g. http://www.example.com/shop/. This is where the scraper will start its scrape. It can be a normal webpage, PDF document, XML document, JSON document, RSS feed or sitemap. If it isn't a web page or PDF document, the scraper will find all the links in the file and visit each one.

To only follow the links found in the target URL and not any subsequent pages, set the Follow Links scrape option to on first page. This will use the target URL only to seed the rest of the scrape.

URL Pattern

By default, the web scraper follows every link it discovers on each web page it visits. If you want to restrict which links the Web Scraper follows, one simple way to do this is to specify a URL Pattern. This technique works by specifying a URL with an asterisk as a wildcard to denote that any characters can be present in that part of the pattern. For instance, http://www.example.com/*/articles/* would scrape any URL that has articles as the second directory from the root of the website.
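
As a minimal sketch of how such a wildcard pattern behaves (an illustration only, not GrabzIt's actual implementation), the pattern can be thought of as a regular expression where each asterisk matches any sequence of characters:

// Illustrative only: convert a wildcard URL pattern into a regular
// expression, treating each * as "any sequence of characters".
function patternToRegExp(pattern) {
    // Escape regex metacharacters, then turn each * into .*
    var escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
    return new RegExp("^" + escaped.replace(/\*/g, ".*") + "$");
}

var pattern = patternToRegExp("http://www.example.com/*/articles/*");
pattern.test("http://www.example.com/news/articles/1"); // true
pattern.test("http://www.example.com/about/");          // false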

A more restrictive way to define a URL pattern is to define alternatives. For instance, this example will only match store or news: http://www.example.com/ /*

Therefore this would match http://www.example.com/store/products/1 but not http://www.example.com/about/.

Alternatively, it is possible to match everything but something. For instance, this example will not match store or news: http://www.example.com/ /*

Therefore this would match http://www.example.com/about/ but not http://www.example.com/store/products/1.

A URL pattern can also contain keywords. A keyword is anything contained in double square brackets. So [[URL_START]]www.example.com* will match against any valid start of a URL, such as http://www.example.com/, https://www.example.com/ or even ftp://www.example.com/.
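
As a rough sketch (again illustrative, not GrabzIt's implementation), the [[URL_START]] keyword can be thought of as matching any scheme prefix:

// Illustrative only: model [[URL_START]]www.example.com* as a regular
// expression in which the keyword matches any scheme followed by ://
var keywordPattern = /^[a-z]+:\/\/www\.example\.com.*$/i;
keywordPattern.test("http://www.example.com/");  // true
keywordPattern.test("https://www.example.com/"); // true
keywordPattern.test("ftp://www.example.com/");   // true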

Seed URLs

Seed URLs allow you to specify a list of URLs that must be crawled by the Web Scraper. If you only want the Seed URLs scraped, set the Follow Links scrape option to no pages in the Scrape Options tab.

To set Seed URLs on the Target Website tab, click the Add Target button then check the Set Seed URLs checkbox and specify each URL to scrape on a separate line.

Create Seed URLs from a Template URL

Alternatively, you can automatically generate seed URLs by using a Template URL. This is a single URL that includes a URL variable, which specifies a range of numbers to be iterated over.

The start number is the number the URL variable starts counting at, the finish number is the number it stops counting at, and the iterate number is the amount the number increases by on every iteration of the URL variable.

For instance, take the following Template URL: http://www.example.com/search?pageNo=

This will then create a seed URL for each number in the range, as sketched below.
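
As a sketch of the idea (illustrative JavaScript, not GrabzIt's URL variable syntax), assuming a start of 1, a finish of 3 and an iterate of 1 appended to pageNo=, the expansion would work like this:

// Hypothetical sketch: expand a Template URL into seed URLs using
// start, finish and iterate values.
function createSeedUrls(template, start, finish, iterate) {
    var urls = [];
    for (var i = start; i <= finish; i += iterate) {
        urls.push(template + i);
    }
    return urls;
}

createSeedUrls("http://www.example.com/search?pageNo=", 1, 3, 1);
// ["http://www.example.com/search?pageNo=1",
//  "http://www.example.com/search?pageNo=2",
//  "http://www.example.com/search?pageNo=3"]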

Perform Post

The Target URL can also be a URL with parameters to POST to, for instance a login form. To do so, specify the form URL in the Target URL text box and add the required post parameters. Post variable values can also include special GrabzIt variables.
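
Conceptually this is the same as submitting a form with those parameters, as in the following hedged sketch, where the login URL and the field names username and password are made up for illustration:

// Illustrative only: the scraper effectively submits the post
// parameters to the form URL, much as a browser submits a login form.
// The URL and field names here are hypothetical.
fetch("http://www.example.com/login", {
    method: "POST",
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
    body: new URLSearchParams({ username: "user", password: "pass" }).toString()
});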

Scrape Instructions

Scrape instructions tell the Web Scraper what actions to carry out when scraping the target website. The Scrape Instructions tab shows the scrape wizard by default, which makes it easy to add the scrape instructions you need. A good example of using this wizard is shown in the product list and detail scraping tutorial.

Once you are ready to start scraping, press the Add New Scrape Instruction link.

This will open the wizard and automatically load the target URL, allowing you to immediately select what you want to be scraped. While a webpage or PDF document is loaded, you can click on any link and it will act as normal, for instance navigating to another webpage, until you choose one of the actions at the bottom of the screen. At that point any click on the content will select the HTML element you wish to extract or manipulate.

The first thing to understand about scrape instructions is that, by default, they are executed on every web page. The way to change this is through the use of templates. A template can be assigned when performing an action such as clicking a link, so that whenever the scraper visits that link or clicks that button it recognises that the page belongs to the assigned template. This allows different page types to be defined. For instance, you might have a product category page that contains some overview information and a detail page that contains the product information. Both pages would probably need a different set of scrape instructions.

Scraper Template

To get started, choose the Click action. Once you have selected the items you want to perform the action on and clicked the Next button, enter the name of the template in the Create a Template text box. Now, whenever the scraper executes these actions, the template returned will be the name you supplied.

Then, to assign a particular template to a scrape instruction, select the desired template from the Execute In drop-down list, which appears in the options window just before the scrape instruction is added. The three main options when choosing a template are as follows:

Once you have selected one of these options, the scrape instruction will only be executed on the template specified.

Extracting Data

You will notice that when you select the Extract Data action, the bottom left-hand corner of the screen invites you to either select an HTML element in the window above or to choose a global page property.

To use a global page property, click the global page property link. Then confirm you want to continue. You will now have a list of properties that can be extracted straight from the page. For instance: Page Title.

To choose one, just select it from the list of options and click Next to add the data to the dataset.

If you wish to extract data from specific HTML elements rather than from the whole page, you need to click on the relevant HTML elements; you can select single or multiple items. However, if you are selecting multiple items, try to select items that are the same, such as multiple rows in a column, because if the scraper can't create a rule that uniquely identifies the selected collection of data, a scrape instruction can't be created. Furthermore, if the multiple items you are clicking have been identified as repeating data by our web scraper wizard, all repeating data in that same group will be automatically selected. Once you have selected all your single or multiple items, choose an attribute to extract from the bottom left of the screen and then click Next.

Creating a Dataset

The dataset screen allows you to change how the data is processed. For instance, you can rename the dataset and the columns within it; just click on a name to rename it. When you add a column to a dataset you also need to choose the template it should be executed in. You can alter this by clicking on the drop-down list located under the column name.

Often when extracting data, some repeating items repeat inconsistently. To ensure the correct rows are still associated with each other, use the Link Columns criteria to link the inconsistent columns with the most consistent column in the dataset.

To add more data to the dataset, click the add button; there are also buttons to remove data from the dataset or to delete the entire dataset. The dataset also allows various criteria to be applied to the data. To do this, select the desired action from the top and then click on the relevant column to apply the criteria. If you make a mistake adding criteria, it can be removed by clicking the corresponding button.

Here is the list of different criteria types and how to use them:

When you have selected one of the above operations, if it can affect multiple columns, you will be asked whether you want it to affect only a subset of the columns or all of them. In most cases you want it to affect all of the columns; however, in some circumstances it is useful to limit the columns affected. For instance, if you are selecting a series of labels and values that change position across web pages, you can select all labels and values. Then, in the dataset, use the equals operation to limit it to the desired label and specify that only the label and value columns should be affected. This ensures that the other columns are unaffected by rows being deleted; for completeness it would be useful to hide the label column.

Once you have modified everything you want to, click Next and your scrape instructions will be added to the scrape. You then have the option of adding further scrape instructions if you wish.

Manipulating a Webpage

A webpage can be manipulated before it is scraped by clicking, typing and selecting values from drop-downs. It is important to remember that, even though this can cause a new webpage to load, the scrape instructions will not restart until all applicable scrape instructions have executed.

To manipulate a webpage, choose the Click Element, Hover Element, Scroll, Type Text or Select Drop Down List Value action. If you are performing a click action you can click on any number of elements on a webpage. Otherwise you must select an appropriate HTML element; for instance, text should be typed in a text box. Then click Next. This will open an options box that allows you to complete the action. When typing text or selecting from a drop-down, the data to be typed or selected must be chosen respectively. Other than that, the options are the same for all of these actions.

If you wish, you can select the template this action should be executed in and, for the click action, what template applies once the click action is complete. However, assigning a new template to a click action that performs multiple clicks on the same page, such as opening inline popups or making things appear on screen, is not a good idea. This is because if the click action only executes on certain templates, the new template assigned by the first click would not be reset, and therefore, depending on how the scrape was written, this could stop future clicks on the same page from being executed. You can also specify that this action should only be executed once, which is useful if you are doing something like logging into a website.

The Type Text and Select Drop Down List Value actions allow you to type multiple items of text or make multiple select box selections, respectively. These can be edited by clicking on the scrape instruction's Alter or View Variables button.

This could be important if you want to type a list of names into a search box, for instance. To ensure a form is submitted only when there is a value in the search box, a template could be set each time text is successfully typed into the textbox, and the click action on a button not performed unless this template is set. After the click action has been performed, the template would then need to be changed to something else in order to reset the procedure.

After actions that manipulate websites have executed, it is useful to wait for a while if the actions initiate AJAX functionality, to allow the AJAX content to load before continuing with the scrape. You can do this by adding a delay in the After Execution Wait text box.

You may wish to jump straight to a different URL once some condition has been met. To do this, use the Go To URL action. This action only appears when at least one template has been defined in the scrape, and when created it must be assigned to a template, to help avoid infinite loops.

Finally, you can use all of GrabzIt's capture APIs in your web scrapes: just choose the Capture Webpage action and choose your desired capture. You can limit this to capture certain web pages within the scrape by specifying a template to execute in once you select the Next button.

After every scrape instruction is added it can be seen in the scrape instructions panel. The cross next to each scrape instruction allows it to be deleted; if a scrape instruction that is required by other scrape instructions is deleted, those instructions are also deleted. You can change the order of scrape instructions by dragging any scrape instruction with the grab icon.

Writing Scrape Instructions Manually

If you need to customize the scrape instructions in a more specific way, you will need to alter them manually.

The scrape instructions are JavaScript based, and the code editor comes complete with a syntax checker, auto-complete and tooltips to make writing them as easy as possible.

The core functionality of the code editor is accessible through the menu options; the purpose of each is explained separately below. Any syntax errors in your scrape instructions are indicated in the left-hand gutter of the code editor.

Wizard: allows you to select the parts of the page you wish to extract and to perform other common tasks such as creating web captures.

Display Scrape Instructions: displays the scrape instructions code.

Delete All Instructions: deletes all of the scrape instructions.

Webpage Functions: enters the Page keyword into the scrape instructions and opens the auto-complete, which contains all the possible Page functions. The Page functions allow you to extract data from the web page.

Data Functions: enters the Data keyword into the scrape instructions. Data functions allow you to save information.

Navigation Functions: enters the Navigation keyword into the code editor. The Navigation functions allow you to control how the Web Scraper navigates the target website.

Global Functions: enters the Global keyword into the scrape instructions. This gives you access to functions that can store data between parsing different web pages. When writing scrape instructions it is important to remember that the state of JavaScript variables is not kept when the scraper moves between webpages, unless you use the Global functions to save them, as shown below.

// Store a value so it is available after the scraper moves to another page
Global.set("myvariable", "hello");
// Retrieve the stored value
var myvar = Global.get("myvariable");

To create a persistent global variable pass true to the persist parameter in the Global.set method, as shown below.

Global.set("myvariable", "hello", true);
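
For example, because ordinary JavaScript variables are reset on every page, a value such as a page counter must be kept in a Global variable. This small sketch uses only the Global.set and Global.get functions shown above, and assumes Global.get returns a falsy value when nothing has been stored:

// Count the pages visited; a plain variable would reset on every page.
// Assumption: Global.get returns a falsy value when the variable is unset.
var count = Global.get("pageCount") || 0;
Global.set("pageCount", count + 1);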

Utility Functions: enters the Utility keyword into the scrape instructions. This allows you to use common functions that make writing scrapes easier, such as adding or removing querystring parameters from URLs.

Criteria Functions: enters the Criteria keyword into the scrape instructions. These functions allow you to refine the data extracted during your scrape, such as by eliminating duplicates.

Filter: allows you to easily create a filter, which is required by some functions to select a particular HTML element from within a web page. Simply select the attributes that your target element should have and/or that its parent(s) should have. Ensure that, before you click this option, your cursor is in the correct place in the function to pass the filter to.

Screenshot Functions: allows you to set screenshot options. Simply place the cursor in the correct part of the function, as identified by the tooltip, and press the screenshot options. Then choose all the options you wish and insert the command.

Strings

Strings are used in scrape instructions, when performing a web scrape, to define text. A string is delimited by double (") or single (') quotes. If a string starts with a double quote it must end with a double quote; if a string starts with a single quote it must end with a single quote. For instance:

"my-class" and 'my-class'

A common error that can occur is the unclosed string error: this is when a string does not have a matching closing quote or contains a line break. The following are illegal strings:

"my
class"

"my class

To fix this error, ensure strings do not contain line breaks and have matching quotes, like so:

"my class" and "my class"

Sometimes you want a single or double quote to appear in a string. The easiest way to do this is to put a single quote in a string delimited with double quotes and a double quote in a string delimited with single quotes, like so:

"Bob's shop" and '"The best store on the web"'

Alternatively you can use a backslash to escape a quote like so:

'test\'s'
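
The same escaping works for a double quote inside a double-quoted string:

"She said \"hello\""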

Common Manual Scrape Tasks

Create a custom link checker - find out how to create a custom link checker by following these simple instructions.
Download all images from a website - find out how to download all the images from an entire website.
Extract data and transform it into a dataset - find out how to create a dataset from the website you are scraping.
Extract links from a website - find out how to extract all the HTML links from an entire website and save them in the format you desire.
Extracting values from text using patterns - find out how to use patterns to extract values from blocks of text.
Extract text from images - find out how to extract text contained within images.
How to pad a dataset - format your extracted data better by using padding.
Manipulating Arrays - find out how to use the special array utility methods to easily handle arrays within scrapes.
Perform an action only once during a scrape - find out how to perform an action only once during an entire scrape.
Refining scraped data - discover how to remove non-required data from your scrapes.
Scrape email addresses from a website - find out how to scrape all email addresses from a website.
Screenshot entire website into PDFs or Images - find out how to use GrabzIt's Web Scraper to capture every page of an entire website.
Extract structured information from unstructured text - use GrabzIt to extract sentiment, names, locations and organizations.

Scraping Content other than HTML

When the Web Scraper comes across PDF, XML, JSON or RSS content it will convert it to an HTML approximation, which allows the Web Scraper to parse it correctly and allows you to select what content you wish to extract. For instance, if you wanted to parse JSON data, the data would be converted into a hierarchical HTML representation. This allows you to build scrape instructions as normal.
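
As a rough illustration of the idea (not GrabzIt's actual conversion), a JSON object can be represented as nested HTML elements so that each value becomes a selectable element, as in this sketch:

// Illustrative only: represent a JSON value as a nested HTML list so
// every value becomes an HTML element that can be selected.
function jsonToHtml(value) {
    if (value === null || typeof value !== "object") {
        return "<span>" + String(value) + "</span>";
    }
    var items = Object.keys(value).map(function (key) {
        return "<li>" + key + ": " + jsonToHtml(value[key]) + "</li>";
    });
    return "<ul>" + items.join("") + "</ul>";
}

jsonToHtml({ name: "Widget", price: 9.99 });
// "<ul><li>name: <span>Widget</span></li><li>price: <span>9.99</span></li></ul>"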

In a similar way, when the scraper loads a PDF document, the PDF is converted into HTML to allow images, hyperlinks, text and tables to be selected and scraped. However, as a PDF has no real structure, tables are identified using heuristics and so are not always accurate.

Export Options

This tab allows you to choose how you wish to export your results; the options include Excel spreadsheets, XML, JSON, CSV, SQL commands and HTML documents. Additionally, this tab allows the name of the zipped scrape results to be set. If you are only downloading files or creating web captures there is no need to choose an export option, as you will just receive a ZIP file containing the results. This tab also allows you to specify how you wish to send the results: via Amazon S3, Dropbox, email notification, FTP or WebDav.

The final option is a Callback URL, which allows the scrape results to be processed in your application by using our scrape API.

The filename of the zipped results, or of each data file if you request them to be sent separately, can be set by unchecking the Use Default Filename option and setting your desired filename. Additionally, a timestamp can be added to your filename by putting {GrabzIt_Timestamp_UTC+1} in the filename, where the +1 denotes the offset in hours from UTC.

You can also view the results of a scrape by clicking the View Results button next to your scrape. This will show any real-time scrape results, as well as previous ones carried out within the last 48 hours.

Schedule Scrape

When creating a web scrape, the Schedule Scrape tab allows you to set when you want the scrape to start and, if you want it to repeat, how frequently it should do so. The scrape can also be configured to run when a change on a web page is detected. To do this, check the Start When a web page changes checkbox, then enter the URL of the web page to monitor, along with the CSS selector of the part of the page you are interested in. It is important that a small part of the page is selected, to avoid false positives due to inconsequential changes.

Monitoring and Debugging Scrapes

Once the web scrape starts, the status icon will change to show it is running and the number of processed pages will increase over time. A real-time snapshot of the scrape's progress is regularly produced, with a log file being generated along with a regular screenshot of the last web page the scraper has encountered. This allows you to see what is happening during the scrape. To find this information, click on the expand icon next to your scrape and click Viewer for the scrape you are interested in. This should detail whether there have been any errors, such as problems with your scrape instructions.

Once the scrape has completed successfully, the status icon will change to show it is finished. If there is no result, opening the Viewer's log and last screenshot may tell you what went wrong.

One of the most common problems reported in the logs is that there isn't a sufficient rendering delay to scrape the page; often a small increase in the Page Load Delay, found in the Scrape Options tab, is enough for most websites.