
Extracting data from PDF documents

Scraping data from the content of PDF documents is not as flexible as scraping it from HTML documents; however, there are still a number of ways to do it using GrabzIt's Web Scraper. First, to scrape PDF content you use the PDF functions rather than the Page functions, but otherwise the functions work in much the same way.

A filter for a PDF document is much simpler than one for an HTML document. First of all, you must specify what type of content you want to extract: links, images or text.

//Extract images
PDF.getValue({"type":"image"});
//Extract links
PDF.getValue({"type":"link"});
//Extract text
PDF.getValue({"type":"text"});

For links and images, you can restrict which image or link is returned by specifying its position.

PDF.getValue({"type":"image","position":"2"});

This gets the second image in the document. For text, images and links, you can further restrict the data returned by specifying a page number.

PDF.getValue({"type":"image","position":"2","page":"5"});

This will return the second image from the fifth page. Text comes with the added option of a line number; however, text does not support position.

PDF.getValue({"type":"text","page":"5","line":"10"});

This gets the tenth line of text from the fifth page. Apart from these differences in filter options, scraping data from PDF documents works in much the same way as scraping data from HTML documents. However, because a PDF filter cannot be as specific about what it extracts, you may need to specify a pattern to extract the correct information from the text.
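The filter rules and the pattern step described above can be sketched in plain JavaScript. This is only an illustration under stated assumptions: buildPdfFilter and extractWithPattern are hypothetical helpers, not GrabzIt functions, and the sample line is made up to stand in for text that a filter like the one above might return. The pattern here is a standard JavaScript regular expression, which may differ from the Web Scraper's own pattern feature. The sketch also assumes that the line option applies only to text, which the description above implies but does not state outright.

```javascript
// Hypothetical helper (not part of GrabzIt): builds a PDF filter object and
// enforces the option rules described in this article.
function buildPdfFilter(type, options = {}) {
  const allowedTypes = ["image", "link", "text"];
  if (!allowedTypes.includes(type)) {
    throw new Error("type must be one of: " + allowedTypes.join(", "));
  }
  if (type === "text" && options.position !== undefined) {
    throw new Error("text filters do not support position");
  }
  // Assumption: line is an option for text filters only.
  if (type !== "text" && options.line !== undefined) {
    throw new Error("line is only available for text filters");
  }
  const filter = { type: type };
  for (const key of ["position", "page", "line"]) {
    if (options[key] !== undefined) {
      // The examples in this article pass numbers as strings, so do the same.
      filter[key] = String(options[key]);
    }
  }
  return filter;
}

// Hypothetical helper: returns the first capture group of a regular
// expression, or null when the text does not match.
function extractWithPattern(text, pattern) {
  const match = pattern.exec(text);
  return match ? match[1] : null;
}

// Same filter as the text example above: tenth line of the fifth page.
const filter = buildPdfFilter("text", { page: 5, line: 10 });

// Made-up sample standing in for the line of text such a filter might return.
const sampleLine = "Invoice total: $1,234.56 due 2024-01-31";
const total = extractWithPattern(sampleLine, /total:\s*\$([\d,]+\.\d{2})/i);

console.log(JSON.stringify(filter), total);
```

Checking the options before building the filter surfaces mistakes such as asking for a text position early. The capture group keeps only the part of the line that matters, which is the role a pattern plays when the filter itself cannot be more specific.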