Many websites mark up their content with schema.org microdata. This metadata provides additional information about the items on a web page, such as prices, authors, publishers, reviews and much more. Because this information is designed to be machine readable, it is easy to scrape for your own use.
We have created the web scrape template below, which allows you to easily extract all schema.org metadata from a website.
To get started, load this template.
The template returns a CSV file and an Excel spreadsheet with three columns: the URL the schema microdata was found on, the microdata path and the value. In the example below, for instance, the microdata value would be James Cameron. The schema path is our own invention and represents the hierarchical structure of the schema data. Consider this snippet:
<div itemscope itemtype="https://schema.org/Movie">
  <h1 itemprop="name">Avatar</h1>
  <div itemprop="director" itemscope itemtype="https://schema.org/Person">
    Director: <span itemprop="name">James Cameron</span> (born August 16, 1954)
  </div>
</div>
Here the path would be movie/person/name. You can pass the schema path to the Page.getSchemaValues method to return just the matching values, like so:
var column_1 = Page.getSchemaValues('movie/person/name');
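To make the path convention more concrete, here is a minimal sketch of how a path such as movie/person/name could be assembled from the nested itemtype values and the final itemprop. It is inferred only from the single example above; buildSchemaPath is a hypothetical helper, not part of the template or the Page object, and the template's real logic may differ.

```javascript
// Hypothetical helper for illustration only: it is NOT part of the template's API.
// Assumption: each path segment is the lowercased last part of an itemtype URL,
// followed by the itemprop name of the value itself.
function buildSchemaPath(itemTypes, propertyName) {
  const segments = itemTypes.map(function (typeUrl) {
    // Take the last URL segment, e.g. "Movie" from "https://schema.org/Movie".
    const typeName = typeUrl.replace(/\/+$/, '').split('/').pop();
    return typeName.toLowerCase();
  });
  segments.push(propertyName);
  return segments.join('/');
}

// The nested item types from the snippet above, outermost first.
const path = buildSchemaPath(
  ['https://schema.org/Movie', 'https://schema.org/Person'],
  'name'
);
console.log(path); // movie/person/name
```

Under the same assumption, the film title Avatar would sit at movie/name, since it belongs directly to the Movie item.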
A wildcard can be used to match values regardless of the path leading up to them. The example below will scrape all microdata marked up as people's names.
var column_1 = Page.getSchemaValues('*/person/name');
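The wildcard behaviour can be pictured with a small standalone sketch. The function below is a hypothetical illustration and an assumption about the matching rule, not the template's actual implementation: it treats a leading * as standing for any number of preceding path segments, so the rest of the pattern only has to match the end of the path.

```javascript
// Hypothetical helper for illustration only: it is NOT part of the template's API.
// Assumption: a leading "*" matches zero or more leading path segments.
function matchesSchemaPath(pattern, path) {
  const patternParts = pattern.split('/');
  const pathParts = path.split('/');

  if (patternParts[0] === '*') {
    // The segments after the wildcard must equal the tail of the path.
    const suffix = patternParts.slice(1);
    if (suffix.length > pathParts.length) {
      return false;
    }
    const tail = pathParts.slice(pathParts.length - suffix.length);
    return suffix.every(function (part, i) {
      return part === tail[i];
    });
  }

  // Without a wildcard the whole path must match exactly.
  return pattern === path;
}

console.log(matchesSchemaPath('*/person/name', 'movie/person/name')); // true
console.log(matchesSchemaPath('*/person/name', 'movie/name'));        // false
```

Under this reading, */person/name would pick up James Cameron from the snippet above but not the film title Avatar, whose path ends in movie/name.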