Tools to Capture and Convert the Web

Extract links from a website

This example is also available as a template.

A common task is to extract links, specifically HTML links, from a website. Fortunately, this is easy with GrabzIt's Web Scraper. First, create a new scrape with the usual details, such as the starting page of the scrape and any other options you need.

Then go to the Scrape Instructions tab and click the Web page button. This enters the Page keyword into the scrape instructions and opens a drop-down list. Select getTagAttributes from the list. Next, add 'href' as the first parameter, which tells the Web Scraper to extract the href attribute, then type a comma.

Next, click the Filter button, which lets you tell the Web Scraper which elements to extract the href attribute from. In the filter window, ensure the type is set to 'Web Page' and the restriction is 'tag name' and 'equal to'. Then enter a in the text box, click the Add button and then the Insert Filter button. Finish the instruction by adding a semicolon to the end of the line.

You should be left with something like what is shown below.

Page.getTagAttributes('href', {"tag":{"equals":"a"}});
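To make clearer what this instruction asks for, the sketch below mimics it on a small HTML string in plain JavaScript. It is only an illustration and says nothing about how GrabzIt works internally: the standalone getTagAttributes function, its parameter order and the sample HTML are all assumptions made for this example, and a regular expression is used here as a simple stand-in for a real HTML parser.

```javascript
// Hypothetical helper, not part of GrabzIt: collects one attribute from
// every element with the given tag name found in an HTML string.
function getTagAttributes(html, attribute, tagName) {
  const values = [];
  // Match each opening tag of the requested name, e.g. <a ...>.
  const tagPattern = new RegExp(`<${tagName}\\b([^>]*)>`, 'gi');
  // Match attribute="value" or attribute='value' inside that tag.
  const attrPattern = new RegExp(
    `(?:^|\\s)${attribute}\\s*=\\s*(?:"([^"]*)"|'([^']*)')`, 'i');
  for (const tag of html.matchAll(tagPattern)) {
    const attr = attrPattern.exec(tag[1]);
    if (attr) {
      // Group 1 holds a double-quoted value, group 2 a single-quoted one.
      values.push(attr[1] !== undefined ? attr[1] : attr[2]);
    }
  }
  return values;
}

// Made-up sample page: two links, one anchor without href, one image.
const sampleHtml = `
  <p>See <a href="https://example.com/">Example</a> and
  <a class="nav" href='/about'>About</a>. <a name="top">No link here</a></p>
  <img src="logo.png">`;

console.log(getTagAttributes(sampleHtml, 'href', 'a'));
```

Only the two anchors that carry an href attribute contribute a value. The anchor without one and the img element are skipped, which mirrors the 'tag name' 'equal to' a filter described above.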

The above code will extract all link URLs from the web page, but we now need to save them. To do this, we will wrap this command, minus the semicolon, in a save command. Go to the beginning of the line and select the Data button. In the drop-down list select save, then go to the end of the line, just before the semicolon, and add a comma. Add the name you want to give the dataset, such as 'My Websites', then add another comma followed by a parameter naming the column, such as 'Links'. Finally, close the command with a ) before the semicolon.

You should now have the following scrape instructions.

Data.save(Page.getTagAttributes('href', {"tag":{"equals":"a"}}), 'My Websites', 'Links');

Now, if you run the scrape, you will extract all links from the website. This will create a table named My Websites with a column named Links, which can then be exported into many different formats such as XML, CSV or a spreadsheet. The same result could also have been achieved by using the wizard button in the Scrape Instructions toolbar.
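As a rough picture of what a one-column dataset looks like once exported to CSV, the sketch below serialises a list of link URLs under a Links header. The toCsv function and the sample URLs are hypothetical, and the exact layout of GrabzIt's own export is not described here, so treat the quoting style (every field quoted, as permitted by RFC 4180) as an assumption.

```javascript
// Hypothetical illustration, not GrabzIt code: writes a single-column
// dataset as CSV text with the column name as the header row.
function toCsv(columnName, values) {
  // Quote every field and double any embedded quotes, as RFC 4180 allows.
  const quote = (field) => `"${String(field).replace(/"/g, '""')}"`;
  return [columnName, ...values].map(quote).join('\r\n');
}

// Made-up link URLs standing in for the scraped Links column.
const links = ['https://example.com/', '/about'];
console.log(toCsv('Links', links));
```

Quoting every field keeps the output valid even when a URL contains a comma or a quotation mark.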