How to download a website and all of its content?

Sometimes it is important to download an entire website: not just the finished result, but the HTML web pages and resources such as CSS, scripts and images.

This may be because you want a backup of the code but can no longer get to the original source for some reason. Or perhaps you want a detailed record of how a website has changed over time.

Fortunately, GrabzIt’s Web Scraper can achieve this by crawling all of the web pages on a website. On each web page, the scraper downloads the HTML along with any resources referenced on the page.

Create a Scrape to Download an Entire Website

To make downloading your website as easy as possible, GrabzIt provides a scrape template. Just click on this template link to get started.

Once you click the link, your scrape will be created. Next, go to the Target Websites(s) tab and enter the URL of the website to download in the Target URL text box. Then click Assign Target and wait for a second or two.

Skip the Scrape Instructions and Export Options tabs and go straight to the Schedule Scrape tab. You can then click Update to start the scrape. However, if you want the scrape to run on a regular schedule, for instance to create regular backups of a website, simply click the Repeat Scrape checkbox and then select how frequently you want the scrape to repeat.

Using your Downloaded Website

Once the scrape has finished you will get a ZIP file. Extract the ZIP file and, inside a directory called Files, you will find all of the downloaded web pages and website resources. There will also be a special HTML page called data.html in the root of the directory. Open this file in a web browser and you will find an HTML table with three columns:

  • Resource URL - the URL the web scraper found the resource at. For example: http://www.example.com/logo.jpg
  • Resource Type - the type of resource that was downloaded. There are four types of resource:
    • Web Page
    • Image
    • External Resource - any resource downloaded from a Link tag
    • Script
  • New File Name - the new file name that the resource has been saved under. Note that this column also contains a link to the file, which makes inspecting all of the downloaded resources much easier.
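
The extraction step described above can also be scripted. The following is a minimal Python sketch, not an official GrabzIt tool: the function name `extract_scrape` is hypothetical, and it assumes that data.html sits at the root of the ZIP file and that the downloaded resources sit under a directory named Files, as described in this article.

```python
import zipfile
from pathlib import Path


def extract_scrape(zip_path, dest_dir):
    """Extract a scrape ZIP and list what it contains.

    Hypothetical helper. Assumes data.html is at the root of the archive
    and the downloaded resources are under a directory called Files.
    Returns (path to data.html, sorted list of files found under Files).
    """
    dest = Path(dest_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)

    files_dir = dest / "Files"
    downloaded = []
    if files_dir.is_dir():
        # Collect every regular file saved by the scrape.
        downloaded = sorted(p for p in files_dir.rglob("*") if p.is_file())
    return dest / "data.html", downloaded
```

For example, `extract_scrape("scrape.zip", "my-website")` would unpack an archive saved as scrape.zip (an assumed file name) and return the location of data.html together with the saved resources.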

This file is designed to help you map the new filenames to their old locations. This is needed because a URL can’t be directly mapped to a file structure, as a URL can be much too long to be stored directly in a file path.

Also, there can be many permutations, especially when a web page can represent a lot of different content by changing various query string parameters! So instead we store the website in a flat structure in the file folder and give you the data.html file to map these files to the original structure.

Of course, because of this you can’t open a downloaded HTML page and expect to see the web page you saw on the web. To do this you would need to rewrite the paths of the image, script, CSS and other resources so that the HTML file can find them in your local file structure.
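
One way to approach this path rewriting is sketched below in Python. It is only an illustration under stated assumptions: the function name `rewrite_links` and the `url_to_file` dictionary are hypothetical, the dictionary is assumed to map each original absolute Resource URL to its New File Name, and all files are assumed to sit side by side in the same flat folder. It uses a simple regular expression on `src` and `href` attributes, so it will not catch URLs inside CSS or inline scripts.

```python
import re
from urllib.parse import urljoin

# Matches src="..." or href="..." written with single or double quotes.
ATTR_PATTERN = re.compile(r'''\b(src|href)\s*=\s*(["'])(.*?)\2''', re.IGNORECASE)


def rewrite_links(html, page_url, url_to_file):
    """Point src/href attributes at locally saved files.

    html        -- contents of a downloaded web page
    page_url    -- original URL of that page, used to resolve relative links
    url_to_file -- assumed mapping of original absolute URL -> new file name
    """
    def replace(match):
        attr, quote, value = match.groups()
        # Resolve relative references against the page's original URL.
        absolute = urljoin(page_url, value)
        local = url_to_file.get(absolute)
        if local is None:
            # Not a downloaded resource, so leave the attribute unchanged.
            return match.group(0)
        return f"{attr}={quote}{local}{quote}"

    return ATTR_PATTERN.sub(replace, html)
```

Links that do not appear in the mapping, such as links to other websites, are left untouched, so the page still points to the live web for anything that was not downloaded.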

Another file included in the root of the ZIP file is called Website.csv. This contains exactly the same information as the data.html file. However, it is included in case you want to read and process the website download programmatically, perhaps using the mapping from the URLs to the files to recreate the downloaded website.
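
As a rough sketch of reading this file programmatically, the Python below builds a dictionary from original URL to saved file name. The function name `load_mapping` is hypothetical. The sketch also assumes that Website.csv starts with a header row whose column names match the data.html headings (Resource URL, Resource Type, New File Name) and that it is UTF-8 encoded; check the actual file and adjust the names if they differ.

```python
import csv


def load_mapping(csv_path, url_column="Resource URL", file_column="New File Name"):
    """Return a dict of original resource URL -> new file name.

    The default column names are an assumption based on the data.html
    headings; pass different names if the CSV header differs.
    """
    mapping = {}
    # utf-8-sig also copes with a byte order mark at the start of the file.
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        for row in csv.DictReader(handle):
            mapping[row[url_column]] = row[file_column]
    return mapping
```

Calling `load_mapping("Website.csv")` would then give a lookup table that tells you which saved file corresponds to a given original URL.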
