Tools to Capture and Convert the Web

Tips on writing HTML for conversion

GrabzIt's API allows you to convert any HTML into PDF, DOCX, images and more. To do so, pass regular HTML to our API, such as the HTML shown in the following example.

<html>
<body>
<h1>Hello World</h1>
</body>
</html>

Notice that this example includes the HTML and BODY tags, but they aren't required if you just want to convert a snippet of HTML. If you omit the HTML and BODY tags, they will be added automatically, just as in a normal browser, and the added BODY tag carries the browser's default spacing. To counteract this, you can specify some CSS to remove any extra padding and margins on the BODY tag, as shown below.

<style>
body{margin:0;padding:0}
</style>
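
If the HTML is assembled in code before it is sent for conversion, the reset can be added to each snippet automatically. The following is a minimal sketch in JavaScript; `withBodyReset` is a hypothetical helper name used only for illustration and is not part of GrabzIt's API.

```javascript
// Hypothetical helper: prefixes an HTML snippet with a style block that
// removes the margin and padding of the automatically added BODY tag.
function withBodyReset(snippet) {
  const reset = '<style>body{margin:0;padding:0}</style>';
  return reset + snippet;
}

// Usage: the returned string is the HTML that would be passed on for conversion.
const html = withBodyReset('<h1>Hello World</h1>');
console.log(html);
```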

If you want to include JavaScript, images or CSS in the HTML you are going to convert, you can provide these resources either inline or by reference. For instance, the code below shows how to include resources inline in the HTML.

<html>
<head>
<script>
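// Wait until the page has been parsed, otherwise the H1 does not exist yet.
document.addEventListener('DOMContentLoaded', function () {
document.getElementsByTagName('H1')[0].innerText = 'Goodbye';
});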
</script>
<style>
h1{
color:red;
}
</style>
</head>
<body>
<img width="16" height="16" alt="star" src="
SKudfOulrSOp3WOyDZu6QdvCchPGolfO0o/XBs/fNwfjZ0frl3/zy7////wAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAACH5BAkAABAALAAAAAAQABAAAAVVICSOZGlCQAosJ6mu7fiyZeKqNKToQGDsM8hBADgUXoGA
iqhSvp5QAnQKGIgUhwFUYLCVDFCrKUE1lBavAViFIDlTImbKC5Gm2hB0SlBCBMQiB0UjIQA7" />
<h1>Hello World</h1>
</body>
</html>

As you can see in the above example, the JavaScript and CSS are contained directly within the HTML page, and the image has been transformed into a data URL.
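
A data URL can be produced from a file's bytes by Base64 encoding them and prefixing the result with the MIME type. The following is a rough sketch that assumes Node.js, where `Buffer` is available globally; `toDataUrl` is a hypothetical helper name and not part of GrabzIt's API.

```javascript
// Hypothetical helper: encodes raw bytes as a data URL that can be used
// as the value of an inline src attribute. Assumes Node.js (global Buffer).
function toDataUrl(mimeType, bytes) {
  const base64 = Buffer.from(bytes).toString('base64');
  return 'data:' + mimeType + ';base64,' + base64;
}

// Usage: for an image, the bytes would normally be read from disk, for
// example with fs.readFileSync, and the MIME type would be 'image/gif'.
// A short text value is used here so the example has no file dependency.
console.log(toDataUrl('text/plain', 'Hi'));
```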

If we wanted to reference these resources instead, we would need to ensure that all URLs linking to these files are absolute URLs that are also publicly accessible. An absolute URL contains all the information necessary to locate a resource. Not using absolute URLs is the main reason images, CSS and JavaScript fail to render when converting HTML.

To do this, the JavaScript, CSS and image would need to be put into separate files and then referenced in the HTML, which would look something like the example below.

<html>
<head>
<script src="http://www.example.com/myscript.js"></script>
<link rel="stylesheet" type="text/css" href="http://www.example.com/mystyle.css">
</head>
<body>
<h1>Hello World</h1>
<img width="16" height="16" alt="star" src="http://www.example.com/star.gif" />
</body>
</html>
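
Before sending HTML like the example above, it can help to check that each resource URL is absolute, or to resolve relative references against the address where the files are hosted. The following sketch uses the standard `URL` class available in browsers and Node.js; the helper names are hypothetical, and neither helper can confirm that a URL is publicly accessible.

```javascript
// Hypothetical helper: returns true only for absolute http(s) URLs.
// new URL() throws when a relative reference is given without a base.
function isAbsoluteHttpUrl(value) {
  try {
    const url = new URL(value);
    return url.protocol === 'http:' || url.protocol === 'https:';
  } catch (err) {
    return false;
  }
}

// Hypothetical helper: resolves a reference against the page's own address.
function toAbsoluteUrl(reference, baseUrl) {
  return new URL(reference, baseUrl).href;
}

// Usage: a relative image path is resolved against an assumed page location.
console.log(isAbsoluteHttpUrl('star.gif'));
console.log(toAbsoluteUrl('star.gif', 'http://www.example.com/pages/index.html'));
```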

DOCX and PDF documents

When creating DOCX or PDF documents from HTML, there are a few other points to consider. The first is that you can manually define page breaks for PDF and DOCX. By default, relative links are also converted to absolute links in the documents, but this can be stopped if required using this technique.

DOCX supports only certain CSS properties, but it additionally supports a table of contents tag.