Create a custom link checker

This example is also available as a template.

GrabzIt's Web Scraper is flexible enough to perform a variety of online tasks, such as checking a website's links and reporting which ones are broken.

First, create a scrape and set the target website you want to check. Then use the code below as the scrape instructions.

        var urls = Page.getTagAttributes('href', {"tag":{"equals":"a"}});
        urls = Utility.Array.unique(urls);
        urls = Utility.Array.filter(urls, Data.readColumn("Links", "URL"));

        for (var i = 0; i < urls.length; i++)
        {
          var url = urls[i];

          Data.save(Page.getUrl(), "Links", "Found On");
          Data.save(url, "Links", "URL");

          if (Utility.URL.exists(url))
          {
            Data.save("Found", "Links", "Result");
          }
          else
          {
            Data.save("Missing", "Links", "Result");
          }
        }
    

The first line, var urls = Page.getTagAttributes('href', {"tag":{"equals":"a"}});, extracts all the hyperlink URLs and puts them in the urls variable. The next line uses the Utility.Array.unique method to remove duplicate URLs.

The third line ensures that links are not checked twice. To do this, it reads the URLs that have previously been saved and removes them from the extracted links. If you want every page on which a broken link appears to be recorded, delete this line.
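
To make the effect of the second and third lines concrete, here is a minimal plain JavaScript sketch that runs outside the scraper. It assumes that Utility.Array.unique drops duplicate entries and that Utility.Array.filter removes any entry that also appears in the second array. The helper names unique and removeKnown and the sample URLs are hypothetical stand-ins for illustration, not part of GrabzIt's API.

```javascript
// Hypothetical stand-in for Utility.Array.unique:
// keep only the first occurrence of each URL.
function unique(urls) {
  return urls.filter(function (url, index) {
    return urls.indexOf(url) === index;
  });
}

// Hypothetical stand-in for Utility.Array.filter:
// drop URLs that were already saved by an earlier page.
function removeKnown(urls, alreadySaved) {
  return urls.filter(function (url) {
    return alreadySaved.indexOf(url) === -1;
  });
}

// Sample data, made up for illustration.
var extracted = [
  "https://example.com/a",
  "https://example.com/b",
  "https://example.com/a"
];
var previouslySaved = ["https://example.com/b"];

// Deduplicate first, then remove links that have already been checked.
var remaining = removeKnown(unique(extracted), previouslySaved);
console.log(remaining);
```

In this sketch only the first URL is left to check, because its duplicate is dropped and the second URL has already been recorded.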

After the URL data has been cleaned, the code loops through each remaining URL, saving it in the dataset along with the current page, before checking whether the URL exists by using the Utility.URL.exists method. The result of this check is then also saved in the dataset.
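
The shape of the resulting dataset can be sketched in plain JavaScript as well. This is only an illustration under stated assumptions: checkLinks, fakeExists and the sample URLs are hypothetical, with an in-memory array standing in for the "Links" dataset and a simulated lookup standing in for Utility.URL.exists.

```javascript
// Hypothetical helper: builds one row per URL, mirroring the three
// columns the scrape instructions save ("Found On", "URL", "Result").
function checkLinks(pageUrl, urls, exists) {
  var rows = [];
  for (var i = 0; i < urls.length; i++) {
    rows.push({
      "Found On": pageUrl,
      "URL": urls[i],
      "Result": exists(urls[i]) ? "Found" : "Missing"
    });
  }
  return rows;
}

// Simulated existence check: only URLs in this made-up list "exist".
var liveUrls = ["https://example.com/about"];
function fakeExists(url) {
  return liveUrls.indexOf(url) !== -1;
}

var report = checkLinks(
  "https://example.com/",
  ["https://example.com/about", "https://example.com/old-page"],
  fakeExists
);
console.log(report);
```

Each row pairs a link with the page it was found on, so a "Missing" result can be traced back to the page that needs fixing.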

Alternatively, you can check whether a website's images exist by replacing the code Page.getTagAttributes('href', {"tag":{"equals":"a"}}); with Page.getTagAttributes('src', {"tag":{"equals":"img"}});.