Tools to Capture and Convert the Web

Capture HTML Tables from Websites with Python

Python API

There are multiple ways of converting HTML tables into CSV's and Excel spreadsheets using GrabzIt's Python API, detailed here are some of the most useful techniques. However before you start remember that after calling the URLToTable, HTMLToTable or FileToTable methods the Save or SaveTo method must be called to capture the table. If you want to quickly see if this service is right for you, you can try a live demo of capturing HTML tables from a URL.

Basic Options

The below code snippet automatically converts the first HTML table in a specified webpage into a CSV document that can then be downloaded or parsed.

grabzIt.URLToTable("https://www.tesla.com")
# Then call the Save or SaveTo method
grabzIt.HTMLToTable("<html><body><table><tr><th>Name</th><th>Age</th></tr>
    <tr><td>Tom</td><td>23</td></tr><tr><td>Nicola</td><td>26</td></tr>
    </table></body></html>")
# Then call the Save or SaveTo method
grabzIt.FileToTable("tables.html")
# Then call the Save or SaveTo method

By default this will convert the first table it identifies into a table. However the the second table in a web page could be converted by passing a 2 to the tableNumberToInclude attribute.

from GrabzIt import GrabzItTableOptions
from GrabzIt import GrabzItClient

grabzIt = GrabzItClient.GrabzItClient("Sign in to view your Application Key", "Sign in to view your Application Secret")

options = GrabzItTableOptions.GrabzItTableOptions()
options.tableNumberToInclude = 2

grabzIt.URLToTable("https://www.tesla.com", options)
# Then call the Save or SaveTo method
grabzIt.SaveTo("result.csv")
from GrabzIt import GrabzItTableOptions
from GrabzIt import GrabzItClient

grabzIt = GrabzItClient.GrabzItClient("Sign in to view your Application Key", "Sign in to view your Application Secret")

options = GrabzItTableOptions.GrabzItTableOptions()
options.tableNumberToInclude = 2

grabzIt.HTMLToTable("<html><body><table><tr><th>Name</th><th>Age</th></tr>
    <tr><td>Tom</td><td>23</td></tr><tr><td>Nicola</td><td>26</td></tr>
    </table></body></html>", options)
# Then call the Save or SaveTo method
grabzIt.SaveTo("result.csv")
from GrabzIt import GrabzItTableOptions
from GrabzIt import GrabzItClient

grabzIt = GrabzItClient.GrabzItClient("Sign in to view your Application Key", "Sign in to view your Application Secret")

options = GrabzItTableOptions.GrabzItTableOptions()
options.tableNumberToInclude = 2

grabzIt.FileToTable("tables.html", options)
# Then call the Save or SaveTo method
grabzIt.SaveTo("result.csv")

You can also specify the targetElement attribute that will ensure only tables within the specified element id will be converted.

from GrabzIt import GrabzItTableOptions
from GrabzIt import GrabzItClient

grabzIt = GrabzItClient.GrabzItClient("Sign in to view your Application Key", "Sign in to view your Application Secret")

options = GrabzItTableOptions.GrabzItTableOptions()
options.targetElement = "stocks_table"

grabzIt.URLToTable("https://www.tesla.com", options)
# Then call the Save or SaveTo method
grabzIt.SaveTo("result.csv")
from GrabzIt import GrabzItTableOptions
from GrabzIt import GrabzItClient

grabzIt = GrabzItClient.GrabzItClient("Sign in to view your Application Key", "Sign in to view your Application Secret")

options = GrabzItTableOptions.GrabzItTableOptions()
options.targetElement = "stocks_table"

grabzIt.HTMLToTable("<html><body><table id='stocks_table'><tr><th>Name</th><th>Age</th></tr>
    <tr><td>Tom</td><td>23</td></tr><tr><td>Nicola</td><td>26</td></tr>
    </table></body></html>", options)
# Then call the Save or SaveTo method
grabzIt.SaveTo("result.csv")
from GrabzIt import GrabzItTableOptions
from GrabzIt import GrabzItClient

grabzIt = GrabzItClient.GrabzItClient("Sign in to view your Application Key", "Sign in to view your Application Secret")

options = GrabzItTableOptions.GrabzItTableOptions()
options.targetElement = "stocks_table"

grabzIt.FileToTable("tables.html", options)
# Then call the Save or SaveTo method
grabzIt.SaveTo("result.csv")

Alternatively you can capture all tables on a web page by passing true to the includeAllTables attribute, however this will only work with the XLSX and JSON formats. This option will put each table in a new sheet within the generated spreadsheet workbook.

from GrabzIt import GrabzItTableOptions
from GrabzIt import GrabzItClient

grabzIt = GrabzItClient.GrabzItClient("Sign in to view your Application Key", "Sign in to view your Application Secret")

options = GrabzItTableOptions.GrabzItTableOptions()
options.format = 'xlsx'
options.includeAllTables = True

grabzIt.URLToTable("https://www.tesla.com", options)
# Then call the Save or SaveTo method
grabzIt.SaveTo("result.xlsx")
from GrabzIt import GrabzItTableOptions
from GrabzIt import GrabzItClient

grabzIt = GrabzItClient.GrabzItClient("Sign in to view your Application Key", "Sign in to view your Application Secret")

options = GrabzItTableOptions.GrabzItTableOptions()
options.format = 'xlsx'
options.includeAllTables = True

grabzIt.HTMLToTable("<html><body><table><tr><th>Name</th><th>Age</th></tr>
    <tr><td>Tom</td><td>23</td></tr><tr><td>Nicola</td><td>26</td></tr>
    </table></body></html>", options)
# Then call the Save or SaveTo method
grabzIt.SaveTo("result.xlsx")
from GrabzIt import GrabzItTableOptions
from GrabzIt import GrabzItClient

grabzIt = GrabzItClient.GrabzItClient("Sign in to view your Application Key", "Sign in to view your Application Secret")

options = GrabzItTableOptions.GrabzItTableOptions()
options.format = 'xlsx'
options.includeAllTables = True

grabzIt.FileToTable("tables.html", options)
# Then call the Save or SaveTo method
grabzIt.SaveTo("result.xlsx")

Convert HTML Tables to JSON

Using Python and GrabzIt's HTML table conversion service enables you to convert HTML tables into JSON. The first step as shown below is to specify json in the format parameter. We then get the JSON string synchronously with the SaveTo method, you can then use your favourite JSON parser for Python to convert the JSON string into a object.

from GrabzIt import GrabzItTableOptions
from GrabzIt import GrabzItClient

grabzIt = GrabzItClient.GrabzItClient("Sign in to view your Application Key", "Sign in to view your Application Secret")

options = GrabzItTableOptions.GrabzItTableOptions()
options.format = "json"
options.tableNumberToInclude = 1

grabzIt.URLToTable("https://www.tesla.com", options)

json = grabzIt.SaveTo()
from GrabzIt import GrabzItTableOptions
from GrabzIt import GrabzItClient

grabzIt = GrabzItClient.GrabzItClient("Sign in to view your Application Key", "Sign in to view your Application Secret")

options = GrabzItTableOptions.GrabzItTableOptions()
options.format = "json"
options.tableNumberToInclude = 1

grabzIt.HTMLToTable("<html><body><table><tr><th>Name</th><th>Age</th></tr>
    <tr><td>Tom</td><td>23</td></tr><tr><td>Nicola</td><td>26</td></tr>
    </table></body></html>", options)

json = grabzIt.SaveTo()
from GrabzIt import GrabzItTableOptions
from GrabzIt import GrabzItClient

grabzIt = GrabzItClient.GrabzItClient("Sign in to view your Application Key", "Sign in to view your Application Secret")

options = GrabzItTableOptions.GrabzItTableOptions()
options.format = "json"
options.tableNumberToInclude = 1

grabzIt.FileToTable("tables.html", options)

json = grabzIt.SaveTo()

Custom Identifier

You can pass a custom identifier to the table methods as shown below, this value is then returned to your GrabzIt Python handler. For instance this custom identifier could be a database identifier, allowing a screenshot to be associated with a particular database record.

from GrabzIt import GrabzItTableOptions
from GrabzIt import GrabzItClient

grabzIt = GrabzItClient.GrabzItClient("Sign in to view your Application Key", "Sign in to view your Application Secret")

options = GrabzItTableOptions.GrabzItTableOptions()
options.customId = "123456"

grabzIt.URLToTable("https://www.tesla.com", options)
# Then call the Save method
grabzIt.Save("http://www.example.com/handler.py")
from GrabzIt import GrabzItTableOptions
from GrabzIt import GrabzItClient

grabzIt = GrabzItClient.GrabzItClient("Sign in to view your Application Key", "Sign in to view your Application Secret")

options = GrabzItTableOptions.GrabzItTableOptions()
options.customId = "123456"

grabzIt.HTMLToTable("<html><body><h1>Hello World!</h1></body></html>", options)
# Then call the Save method
grabzIt.Save("http://www.example.com/handler.py")
from GrabzIt import GrabzItTableOptions
from GrabzIt import GrabzItClient

grabzIt = GrabzItClient.GrabzItClient("Sign in to view your Application Key", "Sign in to view your Application Secret")

options = GrabzItTableOptions.GrabzItTableOptions()
options.customId = "123456"

grabzIt.FileToTable("example.html", options)
# Then call the Save method
grabzIt.Save("http://www.example.com/handler.py")