Extract data and transform it into a dataset

One of the most common requirements is to extract data from a website and turn it into a tabular structure that can be exported for further processing. But just what is a dataset and how is it used in GrabzIt's Web Scraper?

Example dataset: price list

The table below shows the data held in the dataset price list. It has three columns: item label, item description and item price.

item label | item description     | item price
Camera     | Takes digital photos | $99.00

To create this dataset you will need to use the following scrape instructions.

Data.save('Camera', 'price list', 'item label');
Data.save('Takes digital photos', 'price list', 'item description');
Data.save('$99.00', 'price list', 'item price');

These instructions use the Data.save method to add a data value to a particular dataset and column. Every time Data.save is called with the same dataset and column names, a new row is added to that column. However, the scrape instructions above are not very useful, because they build the dataset from static values. The HTML below is from an example webpage; the scrape instructions that follow it extract the data from the page dynamically and save it into a dataset.

<html>
    <body>
        <span id="spnLabel">Nikon 1055</span>
        <span id="spnDescription">Great little camera, creates clear sharp images.</span>
        <span id="spnPrice">$99.99</span>
    </body>
</html>

We will now use the Page.getTagValue method to extract the values from the span tags.

Data.save(Page.getTagValue({"id":{"equals":"spnLabel"}}), 'price list', 'item label');
Data.save(Page.getTagValue({"id":{"equals":"spnDescription"}}), 'price list', 'item description');
Data.save(Page.getTagValue({"id":{"equals":"spnPrice"}}), 'price list', 'item price');

As you can see, each Page.getTagValue call takes a filter that uniquely identifies the HTML element from which the text should be extracted. Here the filters specify that the id HTML attribute should equal spnLabel, spnDescription or spnPrice respectively. You can easily generate a filter by clicking the Filter button icon, which opens a wizard that simplifies constructing the filter.
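To make the idea of a filter more concrete, the sketch below models locally how an "equals" filter could pick out one element by its attributes. It is only an illustration and not GrabzIt's implementation: matchesFilter, getTagValueSketch and the elements array are hypothetical names, only the equals operator is handled, and returning null when nothing matches is an assumption.

```javascript
// Hypothetical local model of an "equals" filter; not GrabzIt's implementation.
function matchesFilter(attributes, filter) {
  // Every attribute named in the filter must satisfy its condition.
  return Object.entries(filter).every(([name, condition]) => {
    if (!("equals" in condition)) {
      throw new Error("This sketch only supports the 'equals' operator");
    }
    return attributes[name] === condition.equals;
  });
}

// Attributes and text of the three spans from the example page.
const elements = [
  { attributes: { id: "spnLabel" }, text: "Nikon 1055" },
  { attributes: { id: "spnDescription" }, text: "Great little camera, creates clear sharp images." },
  { attributes: { id: "spnPrice" }, text: "$99.99" },
];

// Return the text of the first matching element, or null if none matches
// (the null fallback is an assumption made for this sketch).
function getTagValueSketch(filter) {
  const match = elements.find((el) => matchesFilter(el.attributes, filter));
  return match ? match.text : null;
}

console.log(getTagValueSketch({ id: { equals: "spnPrice" } })); // prints "$99.99"
```

Because each id appears only once on the example page, each filter matches exactly one span, which is what makes the filter a unique identifier for the element.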

Once you have constructed your dataset as we have shown here, you can decide how you want to export it on the Export Options tab.
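As a recap of how a dataset is built up, the sketch below is a small in-memory stand-in for the row-adding behaviour of Data.save described earlier. It is not GrabzIt's implementation: saveSketch, toRows and the datasets object are hypothetical names, the "Tripod" value is made up for illustration, and filling missing cells with an empty string is an assumption.

```javascript
// Hypothetical in-memory stand-in for Data.save; not GrabzIt's implementation.
const datasets = {};

function saveSketch(value, datasetName, columnName) {
  if (!datasets[datasetName]) datasets[datasetName] = {};
  const dataset = datasets[datasetName];
  if (!dataset[columnName]) dataset[columnName] = [];
  dataset[columnName].push(value); // each call appends one more row to this column
}

// Assemble a header row plus data rows by index across the columns.
// Assumption: a column with fewer values leaves its remaining cells empty.
function toRows(datasetName) {
  const dataset = datasets[datasetName] || {};
  const columns = Object.keys(dataset);
  const rowCount = Math.max(0, ...columns.map((c) => dataset[c].length));
  const rows = [];
  for (let i = 0; i < rowCount; i++) {
    rows.push(columns.map((c) => (i < dataset[c].length ? dataset[c][i] : "")));
  }
  return [columns, ...rows];
}

saveSketch("Camera", "price list", "item label");
saveSketch("Takes digital photos", "price list", "item description");
saveSketch("$99.00", "price list", "item price");
// A second call with the same dataset and column adds a second row
// ("Tripod" is an illustrative value, not from the original example).
saveSketch("Tripod", "price list", "item label");

console.log(toRows("price list"));
```

In this model the first three calls fill one row across the three columns, and the fourth call starts a second row in the item label column only.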