Refining scraped data

While many of the other articles deal with how to extract data, this article explains how extracted data can be refined so that only the desired information remains. This is done with the special Criteria methods. In all of the following examples the data is extracted from an HTML table, but it could come from a variety of different sources (divs, spans, images and so on) as long as each source yields the same number of items.

Example table: book list

Below is the table being scraped in this example. It consists of four columns: title, author, book age and status.

title                author   book age  status
How to Garden        John     5         Published
How to use a Camera  Sarah    0         Incomplete
How to use a Camera  Sarah    0         Incomplete
Astronomy made easy  Dominic  1         Under Review
How to Iron          Paul     1         Under Review
How to Draw          Mike     3         Published
How to use a PC      Rachel   4         Published

Each column is extracted into its own variable:

var titles = Page.getTagValues({"position":1,"tag":{"equals":"td"},"parent":{"tag":{"equals":"tr"}}});
var authors = Page.getTagValues({"position":2,"tag":{"equals":"td"},"parent":{"tag":{"equals":"tr"}}});
var ages = Page.getTagValues({"position":3,"tag":{"equals":"td"},"parent":{"tag":{"equals":"tr"}}});
var statuses = Page.getTagValues({"position":4,"tag":{"equals":"td"},"parent":{"tag":{"equals":"tr"}}});

Scraped data often needs to be refined so that it contains only the required information. This is where the Criteria functions are used. For instance, if only published books are required, you would restrict the statuses column above to Published and then apply that change to the other columns, as shown below.

Criteria.create();
statuses = Criteria.equals(statuses, "Published");
titles = Criteria.apply(titles);
authors = Criteria.apply(authors);
ages = Criteria.apply(ages);

When using Criteria methods to reduce the data, all criteria must be set on one single column at a time, before the apply method is used on the other columns that need the corresponding records removed. Once complete, Criteria.create() has to be called before criteria are set on another column. For this reason it is best practice to call Criteria.create() before any other Criteria methods.

In the example, the statuses column has been restricted to only include Published. The Criteria.apply method has then removed the corresponding records from the three other columns to keep all of the columns consistent. Remember that the apply method is only useful if the different columns contain the same number of records.
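
To make the relationship between the restricted column and the applied columns concrete, the following plain JavaScript sketch models the same idea outside the scraper. It does not use or reproduce the Criteria implementation: the helpers keptIndexes and pick are hypothetical names, and the arrays are typed in by hand from the example table, assuming the header row is not part of the scraped values.

```javascript
// Hand-typed copies of two example columns (assumed values, header row excluded).
var titles = ["How to Garden", "How to use a Camera", "How to use a Camera",
              "Astronomy made easy", "How to Iron", "How to Draw", "How to use a PC"];
var statuses = ["Published", "Incomplete", "Incomplete",
                "Under Review", "Under Review", "Published", "Published"];

// Hypothetical helper: positions of the records that pass a test.
function keptIndexes(column, test) {
    var kept = [];
    for (var i = 0; i < column.length; i++) {
        if (test(column[i])) {
            kept.push(i);
        }
    }
    return kept;
}

// Hypothetical helper: keep only the records at the given positions.
function pick(column, indexes) {
    return indexes.map(function (i) { return column[i]; });
}

// The columns must line up record for record, otherwise positions mean nothing.
if (titles.length !== statuses.length) {
    throw new Error("Columns have different lengths");
}

// Restrict one column, then reuse the same positions on the other column.
var kept = keptIndexes(statuses, function (s) { return s === "Published"; });
var publishedStatuses = pick(statuses, kept);
var publishedTitles = pick(titles, kept);

console.log(publishedTitles); // How to Garden, How to Draw, How to use a PC
```

The length check at the start mirrors the requirement that every column holds the same number of records: if one column were shorter, the kept positions would point at the wrong rows.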

Criteria can also be combined to restrict the data in multiple ways. The example below restricts the book age column to books that are more than one but less than five years old by using the Criteria.greaterThan() and Criteria.lessThan() methods.

Criteria.create();
ages = Criteria.greaterThan(ages, 1);
ages = Criteria.lessThan(ages, 5);
titles = Criteria.apply(titles);
authors = Criteria.apply(authors);
statuses = Criteria.apply(statuses);
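
As a rough illustration of what combining two criteria on one column means, here is a plain JavaScript sketch that runs outside the scraper. It is only a model under stated assumptions: the ages are typed in by hand as numbers, both comparisons are assumed to be strict, and nothing here reflects how the Criteria methods are actually implemented.

```javascript
// Hand-typed copies of two example columns (ages written as numbers for simplicity).
var titles = ["How to Garden", "How to use a Camera", "How to use a Camera",
              "Astronomy made easy", "How to Iron", "How to Draw", "How to use a PC"];
var ages = [5, 0, 0, 1, 1, 3, 4];

// A record is kept only if it passes every condition set on the ages column.
var kept = [];
ages.forEach(function (age, i) {
    if (age > 1 && age < 5) {
        kept.push(i);
    }
});

// Reuse the kept positions on every column so the records stay aligned.
var filteredAges = kept.map(function (i) { return ages[i]; });
var filteredTitles = kept.map(function (i) { return titles[i]; });

console.log(filteredAges);   // 3, 4
console.log(filteredTitles); // How to Draw, How to use a PC
```

Under these assumptions the five-year-old book is dropped as well, because an age of exactly five is not less than five.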

Sometimes there is duplicate data that needs to be removed. To remove it, use the Criteria.unique method.

Criteria.create();
titles = Criteria.unique(titles);
authors = Criteria.apply(authors);
ages = Criteria.apply(ages);
statuses = Criteria.apply(statuses);
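
The effect of removing duplicates from one column and keeping the others aligned can be modelled in plain JavaScript as below. This is a sketch, not the Criteria implementation: the arrays are typed in by hand, and keeping the first occurrence of each title is an assumption made for the illustration.

```javascript
// Hand-typed copies of two example columns, including the duplicated row.
var titles = ["How to Garden", "How to use a Camera", "How to use a Camera",
              "Astronomy made easy", "How to Iron", "How to Draw", "How to use a PC"];
var authors = ["John", "Sarah", "Sarah", "Dominic", "Paul", "Mike", "Rachel"];

// Keep a position only if it is the first place that title appears (assumed rule).
var kept = [];
titles.forEach(function (title, i) {
    if (titles.indexOf(title) === i) {
        kept.push(i);
    }
});

// Drop the same positions from the other column so the rows still match up.
var uniqueTitles = kept.map(function (i) { return titles[i]; });
var uniqueAuthors = kept.map(function (i) { return authors[i]; });

console.log(uniqueTitles.length); // 6
console.log(uniqueAuthors);       // John, Sarah, Dominic, Paul, Mike, Rachel
```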

Criteria.unique removes any duplicate rows based on the title column, so in the example table only one of the two "How to use a Camera" records remains. The next method is Criteria.remove, which removes items from a column if their values are found in the array parameter.

var authorsToRemove = ["Mike","Rachel"];
Criteria.create();
authors = Criteria.remove(authors, authorsToRemove);
titles = Criteria.apply(titles);
ages = Criteria.apply(ages);
statuses = Criteria.apply(statuses);

Here all records whose author is Mike or Rachel are removed from the authors column. The apply method then removes the corresponding records from the other columns.
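
For comparison, the same removal can be modelled in plain JavaScript outside the scraper. This is a minimal sketch with hand-typed arrays; it assumes values are matched exactly and is not meant to describe how Criteria.remove works internally.

```javascript
// Hand-typed copies of two example columns.
var authors = ["John", "Sarah", "Sarah", "Dominic", "Paul", "Mike", "Rachel"];
var titles = ["How to Garden", "How to use a Camera", "How to use a Camera",
              "Astronomy made easy", "How to Iron", "How to Draw", "How to use a PC"];
var authorsToRemove = ["Mike", "Rachel"];

// Keep the positions whose author is not in the removal list (exact match assumed).
var kept = [];
authors.forEach(function (author, i) {
    if (authorsToRemove.indexOf(author) === -1) {
        kept.push(i);
    }
});

// Reuse the kept positions on the other column so the rows still match up.
var remainingAuthors = kept.map(function (i) { return authors[i]; });
var remainingTitles = kept.map(function (i) { return titles[i]; });

console.log(remainingAuthors); // John, Sarah, Sarah, Dominic, Paul
console.log(remainingTitles.length); // 5
```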