
How does GrabzIt convert HTML to DOCX?

HTML can represent many complicated structures, such as inline DIVs or SPANs placed side by side, overlapping HTML elements, and borders applied to different HTML elements. For the most part, reproducing these would not be a sensible approach in DOCX. While it would be possible to recreate floating HTML elements with text boxes, almost all of the content would end up inside text boxes, resulting in a very ugly and messy Word document.

Because of this, we ignore the floating of HTML elements and the borders of most HTML elements. However, we do respect some of these properties, for example borders on table cells and alignment on image elements.

Does this mean you can't place content side by side? No. This is still possible by using CSS column properties, HTML tables, or tab stops, as outlined below.
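As a rough illustration of the table option, the sketch below places two pieces of content side by side in a single table row, with borders on the cells since table cell borders are respected. The second snippet uses the standard CSS column-count property; that this is one of the column properties the converter honours is an assumption, so check the result in the generated DOCX.

```html
<!-- Side-by-side content using a one-row table -->
<table style="width:100%">
   <tr>
      <td style="border:1px solid #000000">Left content</td>
      <td style="border:1px solid #000000">Right content</td>
   </tr>
</table>

<!-- Assumption: column-count is among the supported CSS column properties -->
<div style="column-count:2">
   This text is intended to flow across two columns in the Word document.
</div>
```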

If you want an HTML document to be captured exactly as shown on screen, it would be better to convert the HTML to PDF, as the PDF file format uses absolute positioning.

Tab Stops

Tab stops are a special DOCX feature that is activated when floating HTML elements that have a text alignment are contained within a 100% width HTML element that has no specific text alignment itself. The lack of alignment on the containing element is important, as it signals that normal alignment should not be applied to the child elements. This is done by setting text-align:start on the containing element. Note that tab stops won't work within a table or list.

An example of this is shown below.

<div style="width:100%;text-align:start">
   <div style="width:50%;text-align:left;float:left">Aligned One</div>
   <div style="width:50%;text-align:left;float:left">Aligned Two</div>
</div>

Text Language

To give text in the DOCX document a particular language, the root HTML element of the HTML document needs to have a lang attribute. Alternatively, another HTML element inside the document, such as a P tag, can have its own lang attribute.

If a child HTML element does not have a lang attribute specified, its language falls back to the document default. If no language is specified anywhere, English is used.
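The fallback behaviour described above can be sketched as follows. The language codes are chosen purely for illustration, and the assumption here is that standard two-letter codes such as fr and de are accepted.

```html
<!-- Document default language: French -->
<html lang="fr">
<body>
   <!-- No lang attribute, so this paragraph falls back to the document default -->
   <p>Ce paragraphe utilise la langue du document.</p>
   <!-- This paragraph overrides the default with its own lang attribute -->
   <p lang="de">Dieser Absatz ist auf Deutsch.</p>
</body>
</html>
```

If the lang attribute were removed from both the html element and the second paragraph, all of the text would be treated as English.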