While many of the other articles deal with how to extract data, this article explains how the extracted data can be refined so that only the desired information remains. To do this, the special Criteria methods are used. In all of the following examples the data is extracted from an HTML table, but it could come from a variety of different sources, as long as each source of data (the content of divs, spans, images, etc.) is of the same length.
Below is the table being scraped in this example. It consists of four columns: title, author, book age and status.
title | author | book age | status |
---|---|---|---|
How to Garden | John | 5 | Published |
How to use a Camera | Sarah | 0 | Incomplete |
How to use a Camera | Sarah | 0 | Incomplete |
Astronomy made easy | Dominic | 1 | Under Review |
How to Iron | Paul | 1 | Under Review |
How to Draw | Mike | 3 | Published |
How to use a PC | Rachel | 4 | Published |
var titles = Page.getTagValues({"position":1,"tag":{"equals":"td"},"parent":{"tag":{"equals":"tr"}}});
var authors = Page.getTagValues({"position":2,"tag":{"equals":"td"},"parent":{"tag":{"equals":"tr"}}});
var ages = Page.getTagValues({"position":3,"tag":{"equals":"td"},"parent":{"tag":{"equals":"tr"}}});
var statuses = Page.getTagValues({"position":4,"tag":{"equals":"td"},"parent":{"tag":{"equals":"tr"}}});
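The Criteria methods described in this article rely on every column having the same number of entries, so it can be worth checking this before filtering. The sketch below is plain JavaScript with hard-coded arrays standing in for the scraped columns. `sameLength` is a hypothetical helper, not part of the scraper's API, and the sketch assumes the columns behave like ordinary arrays (ages are written as numbers for simplicity).

```javascript
// Hard-coded stand-ins for the scraped columns (values copied from the table above).
const titles = ["How to Garden", "How to use a Camera", "How to use a Camera",
                "Astronomy made easy", "How to Iron", "How to Draw", "How to use a PC"];
const authors = ["John", "Sarah", "Sarah", "Dominic", "Paul", "Mike", "Rachel"];
const ages = [5, 0, 0, 1, 1, 3, 4];
const statuses = ["Published", "Incomplete", "Incomplete", "Under Review",
                  "Under Review", "Published", "Published"];

// Hypothetical helper: true when every column has the same number of entries.
function sameLength(...columns) {
  return columns.every((column) => column.length === columns[0].length);
}

console.log(sameLength(titles, authors, ages, statuses)); // true
console.log(sameLength(titles, authors.slice(1)));        // false
```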
Scraped data often needs to be refined so that it contains only the required information. This is where the Criteria functions are used. For instance, if only published books are required, you would restrict the statuses column above to Published and then apply that change to the other columns, as shown below.
Criteria.create();
statuses = Criteria.equals(statuses, "Published");
titles = Criteria.apply(titles);
authors = Criteria.apply(authors);
ages = Criteria.apply(ages);
When using Criteria methods to reduce the data, all restrictions must be made on one single column first, before the apply method is used on the other columns that need the corresponding records removed. Once complete, the Criteria.create() method has to be called before criteria are set on another column. For this reason it is best practice to call Criteria.create() before any other Criteria methods.
In the example, the statuses column has been restricted to only include Published. The Criteria.apply method then removes the corresponding records from the three other columns to keep all of the columns consistent. Remember that the apply method is only useful if the different columns contain the same number of records.
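One way to picture the behaviour described above is as a list of surviving row positions that is recorded when a column is restricted and then reused by apply. The sketch below is a plain JavaScript model written for illustration only. `SketchCriteria` and its internals are hypothetical, model a single restriction, and make no claim about how the real Criteria methods are implemented.

```javascript
// Illustrative model only: SketchCriteria is a hypothetical stand-in,
// not the scraper's Criteria object.
const SketchCriteria = {
  kept: null, // row positions that survived the last restriction

  // Forget any earlier restriction.
  create() {
    this.kept = null;
  },

  // Keep only entries equal to `value` and remember which rows they came from.
  equals(column, value) {
    this.kept = column
      .map((entry, index) => (entry === value ? index : -1))
      .filter((index) => index !== -1);
    return this.kept.map((index) => column[index]);
  },

  // Keep the same row positions in another column of the same length.
  apply(column) {
    if (this.kept === null) return column.slice();
    return this.kept.map((index) => column[index]);
  },
};

// Hard-coded stand-ins for two of the scraped columns.
let titles = ["How to Garden", "How to use a Camera", "How to use a Camera",
              "Astronomy made easy", "How to Iron", "How to Draw", "How to use a PC"];
let statuses = ["Published", "Incomplete", "Incomplete", "Under Review",
                "Under Review", "Published", "Published"];

SketchCriteria.create();
statuses = SketchCriteria.equals(statuses, "Published");
titles = SketchCriteria.apply(titles);

console.log(statuses); // ["Published", "Published", "Published"]
console.log(titles);   // ["How to Garden", "How to Draw", "How to use a PC"]
```

In this model the row positions come from the restricted column, which is why a mismatch in column lengths would make apply pick the wrong records.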
Criteria can also be combined to restrict the data in multiple ways. The example below restricts the book age column to books that are more than one but less than five years old, using the Criteria.greaterThan() and Criteria.lessThan() methods.
Criteria.create();
ages = Criteria.greaterThan(ages, 1);
ages = Criteria.lessThan(ages, 5);
titles = Criteria.apply(titles);
authors = Criteria.apply(authors);
statuses = Criteria.apply(statuses);
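Reading the two criteria together, a row survives only if its age is both greater than 1 and less than 5. As a cross-check of what the example should leave behind, the plain JavaScript sketch below applies that combined condition to hard-coded copies of the table columns. It does not use the Criteria methods, and it assumes that both comparisons are strict and that the ages are compared as numbers.

```javascript
// Hard-coded stand-ins for two of the scraped columns.
const titles = ["How to Garden", "How to use a Camera", "How to use a Camera",
                "Astronomy made easy", "How to Iron", "How to Draw", "How to use a PC"];
const ages = [5, 0, 0, 1, 1, 3, 4];

// Row positions whose age satisfies both conditions (1 < age < 5).
const keptRows = ages
  .map((_, index) => index)
  .filter((index) => ages[index] > 1 && ages[index] < 5);

// Keep the same row positions in every column so the records stay aligned.
const filteredAges = keptRows.map((index) => ages[index]);
const filteredTitles = keptRows.map((index) => titles[index]);

console.log(filteredAges);   // [3, 4]
console.log(filteredTitles); // ["How to Draw", "How to use a PC"]
```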
Sometimes there is duplicate data that needs to be removed. To remove it, you can use the Criteria.unique method.
Criteria.create();
titles = Criteria.unique(titles);
authors = Criteria.apply(authors);
ages = Criteria.apply(ages);
statuses = Criteria.apply(statuses);
Now any duplicate rows, based on the title column, will be removed.
The next method is the Criteria.remove method. This removes items from a column if their values are found in the array parameter.
var authorsToRemove = ["Mike","Rachel"];
Criteria.create();
authors = Criteria.remove(authors, authorsToRemove);
titles = Criteria.apply(titles);
ages = Criteria.apply(ages);
statuses = Criteria.apply(statuses);
Here all records that equal Mike or Rachel in the authors column are removed. The apply method then removes the corresponding records from the other columns.
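The effect of the unique and remove examples on the sample table can be worked out by hand. The plain JavaScript sketch below does so with hard-coded copies of two columns. It does not call the Criteria methods, and it assumes that unique keeps the first occurrence of a repeated title, which the text does not state.

```javascript
// Hard-coded stand-ins for two of the scraped columns.
const titles = ["How to Garden", "How to use a Camera", "How to use a Camera",
                "Astronomy made easy", "How to Iron", "How to Draw", "How to use a PC"];
const authors = ["John", "Sarah", "Sarah", "Dominic", "Paul", "Mike", "Rachel"];

// Keep the listed row positions in a column.
const pick = (column, rows) => rows.map((index) => column[index]);
const allRows = titles.map((_, index) => index);

// unique: keep a row only if its title has not appeared earlier
// (assumption: the first occurrence wins).
const uniqueRows = allRows.filter((index) => titles.indexOf(titles[index]) === index);
console.log(pick(titles, uniqueRows).length); // 6
console.log(pick(authors, uniqueRows)); // ["John", "Sarah", "Dominic", "Paul", "Mike", "Rachel"]

// remove: drop rows whose author appears in the list.
const authorsToRemove = ["Mike", "Rachel"];
const remainingRows = allRows.filter((index) => !authorsToRemove.includes(authors[index]));
console.log(pick(authors, remainingRows)); // ["John", "Sarah", "Sarah", "Dominic", "Paul"]
```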