How to scrape a website to extract web content with GrabzIt

10 October 2015

First, what is web scraping? Web scraping is the process of extracting information from typically unstructured data sources on the Internet, such as HTML and PDF documents.

Different ways to scrape websites

Any programming language that allows you to download and parse web content can be used to scrape the web. However, there are a few issues. The first is that, unless a browser is used, the web page will not be rendered correctly when it is read, because any JavaScript and other dynamic features will not have been run. Another issue is that any common scraping problems encountered, such as how to click on dynamic links, take screenshots of websites or extract text from one part of a web page, will have to be solved by a developer.
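For example, a naive scrape can be written with nothing more than Python's standard library. The URL and the tag being extracted below are purely illustrative, and note that the script only sees the raw HTML returned by the server, so anything generated by JavaScript afterwards is missing.

```python
# A minimal scrape using only Python's standard library.
# It fetches the raw HTML and pulls out the <title> text, but
# any content created by JavaScript after page load is invisible here.
from html.parser import HTMLParser
from urllib.request import urlopen

class TitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

html = urlopen("https://example.com").read().decode("utf-8")
parser = TitleParser()
parser.feed(html)
print(parser.title)  # only what was in the static HTML
```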

Of course, if you use a scraping tool like GrabzIt, these issues have already been solved for you.

GrabzIt's Web Scraper enables you to extract web content using a completely online tool, by creating a scrape that can be run once or at regular intervals.

Creating a Scrape

Before you can extract web content you need to identify what information you want to extract from a website. Then create a new scrape and enter the target website on the Target Websites Tab. Next, go to the Scrape Instructions Tab, select the Extract Web Content option and choose the parts of the website you want to extract. Set an appropriate Dataset and Column name for the extracted web content, add any extra columns you require, and press the Finished button to automatically create the commands and add them to the scrape instructions. While the wizard does not currently support generating scrape commands from PDF documents or images, this can still be done by writing the required scrape commands manually, as shown below.
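To give a flavour of what manual scrape commands look like, the snippet below uses GrabzIt's JavaScript based instruction syntax to save the value of a tag into a dataset column. The tag criteria, dataset name and column name are illustrative examples only, not taken from a real scrape.

```javascript
// Illustrative sketch of manually written scrape instructions.
// The criteria, dataset name and column name are examples only.
Data.save(Page.getTagValue({"tag":{"equals":"h1"}}), "Articles", "Title");
```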

Choose any options you need from the Scrape Options Tab, such as entering a title for this scrape. Now select the Export Options Tab and choose what format you want the data to be exported in, such as CSV, HTML or a Microsoft Excel document.
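Once exported, the dataset can be read with any standard tooling. Here is a minimal Python sketch, assuming the scrape above was exported as CSV to a file called articles.csv with a Title column; both names are illustrative assumptions.

```python
# Read an exported GrabzIt dataset from CSV.
# The file name and column name are illustrative assumptions.
import csv

with open("articles.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(row["Title"])  # one extracted value per scraped row
```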

You then need to choose what you want to happen when the scrape completes, such as being notified by email, sending the results to somewhere like a Dropbox or FTP account, or integrating the scrape with your application through our Scrape API by choosing the Callback URL option to send the results directly to your application.
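As a rough sketch of the Callback URL option, the tiny Python server below accepts whatever is posted to it and prints the body. The port and the payload handling are assumptions for illustration, so consult the Scrape API documentation for the exact parameters GrabzIt sends.

```python
# A minimal HTTP endpoint to receive scrape results at a Callback URL.
# The port and payload handling are assumptions for illustration.
from http.server import BaseHTTPRequestHandler, HTTPServer

class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        print("Scrape results received:", body[:200])  # inspect the payload
        self.send_response(200)
        self.end_headers()

HTTPServer(("", 8080), CallbackHandler).serve_forever()
```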

Finally, go to the Schedule Scrape Tab to set when the scrape should start and whether it should be run repeatedly. Then save the scrape to start extracting web data!