To create a web scrape, you have to specify five types of information, spread across the following tabs.
All of the following features are available to customize a web scrape on the Scrape Options tab.
Scrape Name – the name of the scrape.
Follow Links – provides the following options for how the scraper should follow links:
Ignore File Downloads – if set, any links that would cause a file download when visited are not downloaded.
Ignore Robots.txt File – if set, the scraper will visit web pages that the website owner has excluded from being crawled.
Ignore Error Pages – if set, the Web Scraper will skip any web pages that report an error, that is any HTTP status code of 400 or above.
Ignore Previously Visited Pages – enables incremental scraping by skipping any URLs, other than Target URLs or Seed URLs, that have been visited by this scrape in previous runs.
Ignore URL Fragments – if set, the Web Scraper will ignore the part of a URL after the #. Fragments are commonly used to denote a bookmark on the same page, so following them would normally result in needless pages being scraped. However, some websites use fragments to show different content, in which case this setting needs to be disabled. This option only applies when Follow Links is not set to as required.
Ignore Duplicates – if set, pages whose similarity to an already scraped page equals or exceeds the threshold you set are ignored; for instance, you could ignore pages that are 95% the same.
Limit Scrape – allows you to specify how many pages the Web Scraper should follow before stopping.
Use My Timezone – if set, the Web Scraper will attempt to convert any dates it scrapes into your local time zone. Your time zone can be set on the account page.
Location – the geographic location the Web Scraper will perform the scrape from. This can be useful if the target website restricts content based on location.
Default Date Format – when converting dates whose format cannot be determined, the Web Scraper will fall back to this chosen format.
Page Load Delay – the time in milliseconds the Web Scraper should wait before parsing a page. This is very useful if a page contains a lot of AJAX content or is slow to load.
In the Target Website tab, you specify the websites you want to extract data from. To tell the scrape tool to extract data from a website, you first have to specify the main URL you are interested in, e.g. http://www.example.com/shop/
This is where the scraper will start its scrape. It can be a normal web page, PDF document, XML document, JSON document, RSS feed or sitemap. If it isn't a web page or PDF document, the scraper will find all the links in the file and visit each one.
To follow only the links found in the target URL, and not those on any subsequent pages, set the Follow Links scrape option to on first page. This will use the target URL only to seed the rest of the scrape.
By default, the Web Scraper follows every link it discovers on each web page it visits. If you want to restrict which links the Web Scraper follows, one simple way to do so is to specify a URL Pattern. A URL pattern uses the asterisk as a wildcard to denote that any characters can be present in that part of the pattern. For instance, http://www.example.com/*/articles/* would scrape any URL that has articles as the second directory from the root of the website.
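To make the wildcard behaviour concrete, the following is a minimal illustrative sketch, not GrabzIt's actual implementation, of how such a pattern could be tested against a URL by treating each asterisk as a match for any sequence of characters:

    // Illustrative sketch only, not GrabzIt's implementation: treat each * in a
    // URL pattern as a wildcard for any sequence of characters.
    function matchesPattern(pattern, url) {
        var escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, "\\$&"); // escape regex metacharacters, except *
        var regex = new RegExp("^" + escaped.replace(/\*/g, ".*") + "$");
        return regex.test(url);
    }

    matchesPattern("http://www.example.com/*/articles/*",
                   "http://www.example.com/news/articles/2024"); // true
    matchesPattern("http://www.example.com/*/articles/*",
                   "http://www.example.com/about/");             // false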
A more restrictive way to define a URL pattern is to define alternatives. For instance, a pattern that only matches store or news would match http://www.example.com/store/products/1 but not http://www.example.com/about/.
Alternatively, it is possible to match everything except something. For instance, a pattern that does not match store or news would match http://www.example.com/about/ but not http://www.example.com/store/products/1.
A URL pattern can also contain keywords. A keyword is anything contained in double square brackets. So [[URL_START]]www.example.com* will match against any valid start of a URL, such as http://www.example.com/, https://www.example.com/ or even ftp://www.example.com/.
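Again as an illustrative sketch only, not GrabzIt's implementation, the [[URL_START]] keyword can be thought of as matching any URL scheme prefix in front of the rest of the pattern:

    // Illustrative sketch only: interpret [[URL_START]] as matching any valid URL
    // scheme (http://, https://, ftp:// and so on) at the start of the pattern.
    function matchesUrlStart(pattern, url) {
        var body = pattern.replace("[[URL_START]]", "");
        var escaped = body.replace(/[.+?^${}()|[\]\\]/g, "\\$&").replace(/\*/g, ".*");
        return new RegExp("^[a-z]+://" + escaped + "$", "i").test(url);
    }

    matchesUrlStart("[[URL_START]]www.example.com*", "https://www.example.com/"); // true
    matchesUrlStart("[[URL_START]]www.example.com*", "ftp://www.example.com/");   // true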
Seed URLs allow you to specify a list of URLs that must be crawled by the Web Scraper. If you only want the Seed URLs scraped, set the Follow Links scrape option to no pages in the Scrape Options tab.
To set Seed URLs on the Target Website tab, click the Add Target button, then check the Set Seed URLs checkbox and specify each URL to scrape on a separate line.
Alternatively, you can automatically generate Seed URLs by using a Template URL; this is a single URL that includes a URL variable.
A URL Variable specifies a range of numbers to be iterated over:
The start number is where the URL variable begins counting, the finish number is where it stops counting, and the iterate number is how much the value increases on every iteration of the URL variable.
For instance, a Template URL such as http://www.example.com/search?pageNo= with a URL variable appended will create a seed URL for each number in the range, as sketched below.
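As a rough sketch of the idea, the function below expands a Template URL into seed URLs using a URL variable's start, finish and iterate numbers; the values chosen here (1 to 3, step 1) and the way the number is appended are assumptions for this example, not GrabzIt's exact placeholder syntax:

    // Illustrative sketch only: expand a Template URL into seed URLs using the
    // start, finish and iterate numbers of a URL variable.
    function expandTemplateUrl(template, start, finish, iterate) {
        var urls = [];
        for (var i = start; i <= finish; i += iterate) {
            urls.push(template + i);
        }
        return urls;
    }

    expandTemplateUrl("http://www.example.com/search?pageNo=", 1, 3, 1);
    // ["http://www.example.com/search?pageNo=1",
    //  "http://www.example.com/search?pageNo=2",
    //  "http://www.example.com/search?pageNo=3"]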
The Target URL can also specify a URL with parameters to POST to, for instance a login form. To do so, specify the form URL in the Target URL text box and add the required post parameters. Post variable values can also include special GrabzIt variables, such as:
– day as a two-digit value
– month as a two-digit value
– year as a four-digit value
– hour as a two-digit value
– minute as a two-digit value
– second as a two-digit value
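For illustration only, the following sketch shows how the zero-padded date and time values described above could be produced; how GrabzIt actually substitutes them into post variables is assumed rather than shown:

    // Illustrative sketch only: build the zero-padded date and time values described
    // above, as they might be substituted into post variable values.
    function dateComponents(now) {
        var pad = function (n, width) { return String(n).padStart(width, "0"); };
        return {
            day:    pad(now.getDate(), 2),
            month:  pad(now.getMonth() + 1, 2),
            year:   pad(now.getFullYear(), 4),
            hour:   pad(now.getHours(), 2),
            minute: pad(now.getMinutes(), 2),
            second: pad(now.getSeconds(), 2)
        };
    }

    dateComponents(new Date()); // e.g. { day: "07", month: "03", year: "2025", ... }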
Scrape instructions tell the Web Scraper what actions to carry out when scraping the target website. The Scrape Instructions tab shows the scrape wizard by default, which makes it easy to add the scrape instructions you need. A good example of using this wizard is shown in the product list and detail scraping tutorial.
Once you are ready to start scraping, press the Add New Scrape Instruction link.
This will open the wizard and automatically load the target URL, allowing you to immediately select what you want to be scraped. If a web page or PDF document has been loaded, you can click on any link and it will act as normal, for instance navigating to another web page. However, once you choose one of the actions at the bottom of the screen, any clicks on the content will instead select the HTML element you wish to extract or manipulate.
The first thing to understand about scrape instructions is that, by default, they are executed on every web page. The way to stop this is through the use of templates. A template can be assigned when performing an action such as clicking a link, so that whenever the scraper visits that link or clicks that button it recognises that the page belongs to the assigned template. This allows different page types to be defined. For instance, you might have a product category page that contains some overview information and a detail page that contains the product information. Both pages would probably need a different set of scrape instructions.
To get started, choose the Click action. Once you have selected the items you want to perform the action on and clicked the Next button, enter the name of the template in the Create a Template text box. Now, whenever the scraper executes these actions, the template returned will be the name you have supplied.
Then, to assign a particular template to a scrape instruction, select the desired template from the Execute In drop down list in the options window that appears just before the scrape instruction is added. The three main options when choosing a template are as follows:
Once you have selected one of these options, the scrape instruction will only be executed on the template specified.
You can either select the data you want to extract yourself, as explained here, or use AI to extract it. You will notice that when you select the Select Data to Scrape action, the bottom left-hand corner of the screen invites you either to select an HTML element in the window above or to choose a global page property.
To use a global page property, click the global page property link. Then confirm you want to continue. You will now have a list of properties that can be extracted straight from the page. For instance: Page Title.
To choose one, just select it from the list of options and click Next to add the data to the dataset.
If you wish to extract data from specific HTML elements, rather than properties belonging to the whole page, you need to click on the relevant HTML elements; you can select single or multiple items. However, if you are selecting multiple items, try to select items that are the same, such as multiple rows in a column, because if the scraper can't create a rule that uniquely identifies the selected collection of data, a scrape instruction can't be created. Furthermore, if the multiple items you are clicking have been identified as repeating data by our web scraper wizard, all repeating data in that same group will be automatically selected. Once you have selected all your single or multiple items, choose an attribute to extract from the bottom left of the screen and then click Next.
The dataset screen allows you to change how the data is processed. For instance, you can rename the dataset and the columns within it; just click on a name to rename it. When you add a column to a dataset, you also need to choose the template it should be executed in. You can alter this by clicking on the drop down list located under the column name.
When extracting data, it is common for some repeating items to repeat inconsistently. To ensure the correct rows are still associated with each other, use the Link Columns criteria to link the inconsistent columns with the most consistent column in the dataset.
To add more data to the dataset, to remove data from it, or to delete the entire dataset, click the relevant button. The dataset also allows various criteria to be applied to the data: select the desired action from the top and then click on the relevant column to apply the criteria. If you make a mistake adding criteria, just click the button to remove it.
When you have selected one of the above operations, if it can affect multiple columns, you will be asked whether you want it to affect only a subset of the columns or all of them. In most cases you will want it to affect all of the columns; however, in some circumstances it is useful to limit the columns affected. For instance, if you are selecting a series of labels and values that change position across web pages, you can select all labels and values. Then, in the dataset, use the equals operation to limit it to the desired label and specify that only the label and value columns should be affected. This ensures that the other columns are unaffected by rows being deleted; for completeness, it would be useful to hide the label column.
Once you have modified everything you want to, click Next and your scrape instructions will be added to the scrape. You then have the option of adding further scrape instructions if you wish.
A web page can be manipulated before it is scraped by clicking, typing and selecting values from drop down lists. It is important to remember that even though this can cause a new web page to load, the scrape instructions will not restart until all applicable scrape instructions have executed.
To manipulate a web page, choose the Click Element, Hover Element, Scroll, Type Text or Select Drop Down List Value action. If you are performing a click action, you can click on any number of elements on a web page. Otherwise you must select an appropriate HTML element; for instance, text should be typed in a text box. Then click Next. This will open an options box that allows you to complete the action. When typing text or selecting from a drop down list, the data to be typed or selected must be chosen respectively. Other than that, the options are the same for all three actions.
If you wish, you can select the template this action should be executed in and, for the click action, which template applies once the click action is complete. However, assigning a new template to a click action that performs multiple clicks on the same page, such as opening inline popups or making things appear on screen, is not a good idea. This is because if the click action only executes on certain templates, the new template assigned by the first click would not be reset, and therefore, depending on how the scrape was written, this could stop future clicks on the same page from being executed. You can also define whether you want this action to be executed only once, which is useful if you are doing something like logging into a website.
The Type Text and Select Drop Down List Value actions allow you to type multiple items of text or make multiple select box selections, respectively. These can be edited by clicking on the scrape instruction's Alter or View Variables button, as shown in the screenshot to the left.
This could be important if you want to type a list of names into a search box, for instance. To ensure a form is submitted only when there is a value in the search box, a template could be set each time text is successfully typed into the text box, and the click action on a button not performed unless this template is set. After the click action has been performed, the template would then need to be changed to something else in order to reset the procedure.
After actions that manipulate websites have executed, it is useful to wait for a while if the actions initiate AJAX functionality, to allow the AJAX content to load before continuing with the scrape. You can do this by adding a delay in the After Execution Wait text box.
You may wish to jump straight to a different URL once some condition has been met. To do this, use the Go To URL action, which will only appear when at least one template has been defined in the scrape and which, when created, must be assigned to a template to help avoid infinite loops.
Finally, you can use all of GrabzIt's capture APIs in your web scrapes: just choose the Capture Webpage action and then your desired capture. You can limit this to capturing certain web pages within the scrape by specifying a template to execute in once you select the Next button.
After every scrape instruction is added, it can be seen in the scrape instructions panel; the cross next to each scrape instruction allows it to be deleted. If a scrape instruction that is required by other scrape instructions is deleted, those instructions are also deleted. You can change the order of scrape instructions by dragging any scrape instruction with the grab icon.
If you need to customize the scrape instructions in a more specific way, you will need to alter them manually. The scrape instructions are JavaScript based, and the code editor comes complete with a syntax checker, auto-complete and tooltips to make this as easy as possible.
The core functionality of the code editor is accessible through the menu options, as shown in the screenshot; the purpose of each is explained separately below. Any syntax errors in your scrape instructions are indicated in the left hand gutter of the code editor.
The wizard option allows you to select the parts of the page you wish to extract and perform other common tasks, such as creating web captures.
The code option displays the scrape instructions code to the user.
The clear option deletes all of the scrape instructions.
The Page option enters the Page keyword into the scrape instructions and opens the auto-complete, which contains all the possible Page functions. The Page functions allow you to extract data from the web page.
The Data option enters the Data keyword into the scrape instructions. The Data functions allow you to save information.
The Navigation option enters the Navigation keyword into the code editor. The Navigation functions allow you to control how the Web Scraper navigates the target website.
The Global option enters the Global keyword into the scrape instructions. This gives you access to functions that can store data between parsing different web pages.
The Utility option enters the Utility keyword into the scrape instructions. This allows you to use common functions that make writing scrapes easier, such as adding or removing querystring parameters from URLs.
The Criteria option enters the Criteria keyword into the scrape instructions. These functions allow you to refine the data extracted during your scrape, such as eliminating duplicates.
The filter option allows you to easily create a filter, which is required by some functions to select a particular HTML element from within a web page. Simply select the attributes that the target element, and/or its parent(s), should have in order to identify that element. Ensure that before you click this option your cursor is in the correct place in the function to pass the filter to.
The screenshot options menu allows you to set screenshot options. Simply place the cursor in the correct part of the function, as identified by the tooltip, and press the screenshot options. Then choose all the options you wish and insert the command.
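As a purely hypothetical sketch of what manually written scrape instructions can look like, the snippet below reads a value from the page and saves it into a dataset. The function names and filter format used here (Page.getTagValue, Data.save and a simple attribute filter) are assumptions made for this example rather than confirmed parts of the GrabzIt API, so use the auto-complete in the code editor to find the real functions and their signatures.

    // Hypothetical sketch only: the function names and filter format below are
    // assumptions, not a confirmed part of the GrabzIt API.
    var filter = { "tag": { "equals": "h1" } };   // filter identifying the target element
    var title = Page.getTagValue(filter);         // a Page function reads data from the page
    Data.save(title, "Products", "Title");        // a Data function stores it in a dataset column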
When the Web Scraper comes across PDF, XML, JSON or RSS content, it converts it to an HTML approximation, which allows the Web Scraper to parse it correctly and allows you to select the content you wish to extract. For instance, if you wanted to parse JSON data, it would convert the data into a hierarchical HTML representation, as shown to the side. This allows you to build scrape instructions as normal.
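To give a feel for what such a conversion involves, the following is a rough sketch only; the actual markup GrabzIt generates may differ. It walks a JSON value recursively and renders it as nested HTML elements so that individual values can be selected like elements on a normal web page.

    // Illustrative sketch only: render a JSON value as a nested HTML approximation
    // so its parts can be selected like elements on a normal web page.
    function jsonToHtml(value) {
        if (value !== null && typeof value === "object") {
            var items = Object.keys(value).map(function (key) {
                return "<li><span>" + key + "</span>" + jsonToHtml(value[key]) + "</li>";
            });
            return "<ul>" + items.join("") + "</ul>";
        }
        return "<span>" + String(value) + "</span>";
    }

    jsonToHtml({ product: { name: "Widget", price: 9.99 } });
    // "<ul><li><span>product</span><ul><li><span>name</span><span>Widget</span></li>..."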
In a similar way, when the scraper loads a PDF document, the PDF is converted into HTML to allow images, hyperlinks, text and tables to be selected and scraped. However, as a PDF has no real structure, tables are identified using heuristics and so are not always accurate.
This tab allows you to choose how you wish to export your results; your options include Excel spreadsheets, XML, JSON, CSV, SQL commands or HTML documents. Additionally, this tab allows the name of the zipped scrape results to be set. If you are only downloading files or creating web captures, there is no need to choose an export option, as you will simply receive a ZIP file containing the results. This tab also allows you to specify how you wish to send the results: via Amazon S3, Dropbox, email notification, FTP or WebDav.
The final option is a Callback URL, which allows the scrape results to be processed in your application by using our scrape API.
The filename of the zipped results, or of each data file if you request them to be sent separately, can be set by unchecking the Use Default Filename option and setting your desired filename. Additionally, a timestamp can be added to your filename by putting {GrabzIt_Timestamp_UTC+1} in the filename, where the +1 denotes the offset in hours from UTC.
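As a rough sketch of how such a placeholder could be resolved (the timestamp format shown here is an assumption, not necessarily what GrabzIt writes), the offset is read from the placeholder and applied to the current UTC time:

    // Illustrative sketch only: replace a {GrabzIt_Timestamp_UTC+1} style placeholder
    // with the current time shifted by the stated number of hours from UTC.
    function resolveTimestamp(filename) {
        return filename.replace(/\{GrabzIt_Timestamp_UTC([+-]\d+)\}/, function (match, offset) {
            var shifted = new Date(Date.now() + Number(offset) * 3600 * 1000);
            return shifted.toISOString().replace(/[:.]/g, "-"); // example format only
        });
    }

    resolveTimestamp("scrape-results-{GrabzIt_Timestamp_UTC+1}.zip");
    // e.g. "scrape-results-2025-03-07T14-30-00-000Z.zip"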
You can also view the results of a scrape by clicking the View Results button next to your scrape; this will show any real-time scrape results, as well as previous results carried out within the last 48 hours.
When creating a web scrape, the Schedule Scrape tab allows you to set when you want the scrape to start and, if you want it to repeat, how frequently it should do so. The scrape can also be configured to run when a change on a web page is detected. To do this, check the Start When a web page changes checkbox, then enter the URL of the web page to monitor, along with the CSS selector of the part of the page you are interested in. It is important that only a small part of the page is selected, to avoid false positives due to inconsequential changes.
Once the web scrape starts, the status icon will change and the number of processed pages will start to increase over time. A real-time snapshot of the scrape's progress is regularly produced, with a log file being generated along with a regular screenshot of the last web page the scraper has encountered. This allows you to see what is happening during the scrape. To find this information, click on the expand icon next to your scrape and then click Viewer for the scrape you are interested in. This should detail whether there have been any errors, such as problems with your scrape instructions.
Once the scrape has completed successfully, the status icon will change again. If there is no result, opening the Viewer and checking the log and last screenshot may tell you what went wrong.
One of the most common problems reported in the logs is that there isn't a sufficient rendering delay to scrape the page; a small increase in the Page Load Delay, found in the Scrape Options tab, is often enough for most websites.
GrabzIt has some in-depth tutorials on a variety of web scraping topics, including AI-Powered Scraping, how to convert websites to PDF, downloading websites, scraping list detail pages, and scraping Schema.org data.