Applies to:
PDFTextStream
Extracting tables from PDF documents
One of the most common reasons software teams adopt PDFxStream is its tooling for extracting tabular data from PDF documents.
The specifics of how this is done depend on how the tables you are interested in are rendered in your source PDF documents. Keep in mind that, internally, PDF documents are described entirely visually and not structurally (i.e. nothing in a PDF indicates that some content constitutes a heading, for example, while other content constitutes a table row). In general, there are two broad categories of table rendering:
- Explicit tables, where table structures (rows and columns) are laid out using visible graphical lines.
- Implicit tables, where table structures (rows and columns) are distinguished only by patterns of whitespace.
These examples will be used throughout this page to demonstrate how PDFxStream supports the extraction of data from both kinds of tables.
Extracting data from explicit tables
The visible graphical lines used to render explicit tables give PDFxStream a great deal of
information about how to split tables into rows and columns. This identification is fast and
entirely automatic, and produces com.snowtide.pdf.layout.Table objects (a subtype of the common
com.snowtide.pdf.layout.Block class) within the document model PDFxStream provides for each
com.snowtide.pdf.Page.
The data that has been restructured into those Table objects can be accessed in a few different
ways. We'll go through all of them, starting with the most flexible (and thus most complicated)
and moving through to more convenient turnkey approaches.
"Manually" walking the document model
As you traverse the document model tree (starting from the root available via
com.snowtide.pdf.Page.getTextContent()), you can check each child Block to see if it is a Table.
For each table, you can then use its methods to access each row (which contains a series of
Blocks, one per cell within the row). This approach requires more code and is somewhat more
complex than the next example, but it gives you the opportunity to fine-tune exactly which
tables to extract (e.g. you could extract only those that are on the lower half of a page, or
that have a particular heading above them).
For example, this code extracts only the first column of data from the first table found on a
given Page:
    // Collects the text of the first cell in each row of the given table.
    private static List<String> getFirstColumnData (Table t) throws IOException {
        ArrayList<String> columnData = new ArrayList<String>();
        for (int i = 0; i < t.getRowCnt(); i++) {
            BlockParent row = t.getRow(i);
            if (row.getChildCnt() > 0) {
                Block cell = row.getChild(0);
                // Pipe the cell's content into a StringBuilder to get its text.
                StringBuilder sb = new StringBuilder();
                OutputTarget tgt = new OutputTarget(sb);
                cell.pipe(tgt);
                columnData.add(sb.toString());
            }
        }
        return columnData;
    }

    // Searches the block tree depth-first for the first Table;
    // returns null if no table is found.
    private static List<String> getFirstTableData (BlockParent bp) throws IOException {
        for (int i = 0; i < bp.getChildCnt(); i++) {
            Block b = bp.getChild(i);
            if (b instanceof Table) {
                return getFirstColumnData((Table)b);
            } else if (b instanceof BlockParent) {
                List<String> data = getFirstTableData((BlockParent)b);
                if (data != null) return data;
            } else {
                // leaf non-table block, nothing to do
            }
        }
        return null;
    }

    private static List<String> getFirstTableData (Page p) throws IOException {
        return getFirstTableData(p.getTextContent());
    }

    public static void main (String[] args) throws IOException {
        Document pdf = com.snowtide.PDF.open("/path/to/file.pdf");
        System.out.println(getFirstTableData(pdf.getPage(0)));
    }
Applying the code above to a page containing one or more explicitly rendered tables returns
the data in the first column of the first table as a list of Strings:

> [MEAT 2 OZ, MEAT 40Z, WRAP "AMBURGER", WRAP CHICKEN, SAUCE "" AGED, CHICKEN DICED, CHICKEN NUGGETS, LABEL SPINACH SALAD, BACON PRECOOKED]
Using TableUtils makes extracting explicit table data easy
Using the com.snowtide.pdf.util.TableUtils class and its collection of convenience methods for
comprehensively capturing explicitly rendered table data is much easier than manually walking
the PDFxStream document model. With TableUtils, the example presented above can be replaced
with this much shorter, simpler code:
    private static List<String> getFirstTableData (Page p) throws IOException {
        List<Table> tables = TableUtils.getAllTables(p);
        if (tables.size() == 0) {
            return null;
        } else {
            String[][] firstTable = TableUtils.tableToStrings(tables.get(0));
            ArrayList<String> firstColumn = new ArrayList<String>();
            for (String[] row : firstTable) {
                if (row.length > 0) firstColumn.add(row[0]);
            }
            return firstColumn;
        }
    }

    public static void main (String[] args) throws IOException {
        Document pdf = com.snowtide.PDF.open("/path/to/file.pdf");
        System.out.println(getFirstTableData(pdf.getPage(0)));
    }
As an added bonus for those who need to "export" PDF tables to Excel, Google Sheets, or any
other downstream program or process that works well with CSV data,
com.snowtide.pdf.util.TableUtils.convertToCSV(Table,char) is a convenient table-to-CSV export
method. With it, you can dump e.g. the first table on the first page to CSV with a single
expression:
    TableUtils.convertToCSV(TableUtils.getAllTables(pdf.getPage(0)).get(0), ',');

    "MEAT 2 OZ","4/10 LBS (CASE)","2.7","11/20/06","1.0","11.2","","10.2","",""
    "MEAT 40Z","4/10 LBS (CASE)","3.2","11/20/06","-0.1","24.2","","24.2","",""
    "WRAP ""AMBURGER""","1000/CS (CASE)","0.8","11/5/06","2.7","1.4","","0.0","",""
    "WRAP CHICKEN","1000/CS (CASE)","1.3","11/5/06","2.2","0.8","","0.0","",""
    "SAUCE """" AGED","6/#10 (CASE)","0.0","11/19/06","0.0","0.0","","0.0","",""
    "CHICKEN DICED","4/4LB (CASE)","2.8","11/19/06","2.3","2.1","","0.0","",""
    "CHICKEN NUGGETS","15/2 LBS (CASE)","3.3","11/20/06","1.3","15.2","","13.8","",""
    "LABEL SPINACH SALAD","1/1000 CS (CASE)","0.0","11/5/06","0.0","0.6","","0.6","",""
    "BACON PRECOOKED","6/400 (CASE)","0.4","11/20/06","0.2","1.6","","1.4","",""
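If a downstream Java process needs to read such CSV rows back in, note that quotes inside a cell are escaped by doubling them (as in the "WRAP" and "SAUCE" rows above). The following is a minimal sketch of a reader for a single CSV row; the CsvLine class and its parse method are hypothetical helpers written for illustration and are not part of PDFxStream. It assumes each record sits on one line, so a full-featured CSV library is likely a better fit for production use.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFxStream.
final class CsvLine {
    // Splits one CSV record into fields. Handles quoted fields and the
    // doubled-quote escape (""), but not records spanning multiple lines.
    static List<String> parse(String line, char separator) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        field.append('"'); // doubled quote: one literal quote
                        i++;
                    } else {
                        inQuotes = false;  // closing quote of the field
                    }
                } else {
                    field.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;           // opening quote of a field
            } else if (c == separator) {
                fields.add(field.toString());
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString());      // last field has no trailing separator
        return fields;
    }

    public static void main(String[] args) {
        // Beginning of the third row of the sample CSV output.
        String row = "\"WRAP \"\"AMBURGER\"\"\",\"1000/CS (CASE)\",\"0.8\",\"11/5/06\"";
        System.out.println(parse(row, ','));
    }
}
```

The separator is a parameter so that the reader can match whatever character was passed to convertToCSV.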
Hopefully it's clear that using TableUtils in this way makes it extremely easy to access any
part of any explicit table found in your PDF documents.
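Once a table is available as a String[][] (as returned by TableUtils.tableToStrings in the example above), picking out an arbitrary column is ordinary array handling. Here is a minimal sketch; the TableColumns class and its column method are hypothetical names introduced for illustration, and the hard-coded array stands in for real tableToStrings output so that the sketch runs without a PDF.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFxStream.
final class TableColumns {
    // Returns the cells of one column, skipping rows too short to contain it.
    static List<String> column(String[][] rows, int index) {
        List<String> cells = new ArrayList<>();
        for (String[] row : rows) {
            if (index >= 0 && index < row.length) {
                cells.add(row[index]);
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        // Stand-in for TableUtils.tableToStrings(table); values are copied
        // from the first cells of two rows of the sample table.
        String[][] rows = {
            {"MEAT 2 OZ", "4/10 LBS (CASE)", "2.7", "11/20/06"},
            {"MEAT 40Z", "4/10 LBS (CASE)", "3.2", "11/20/06"},
        };
        System.out.println(column(rows, 1));
    }
}
```

Skipping short rows rather than throwing mirrors the row.length check in the TableUtils example above.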
Extracting data from implicit tables
In contrast to explicit tables that use visible lines to demarcate the beginning and end of rows and columns, implicit tables rely on lanes and other patterns of whitespace to convey the impression of tabular structure:

Searching for PDF content arranged to suit the kinds of whitespace patterns associated with
tabular data cannot reasonably be performed automatically: while there are very reliable ways
to identify table structure using implicit whitespace, they are too computationally costly to
include in the page segmentation and document model PDFxStream builds for each Page. So, for
implicit tables, we recommend extracting them as text using com.snowtide.pdf.VisualOutputTarget
and then using familiar text-processing utilities to break the resulting text up into rows and
columns.
VisualOutputTarget is an alternative com.snowtide.pdf.OutputHandler implementation included in
PDFxStream that aims to retain the visual layout of the document’s content as accurately as
possible, using plain text spaces and linebreaks to simulate the spatial positioning of each
character in the source PDF. (See 'Controlling the formatting of extracted text' for more
about VisualOutputTarget.)
Thus, VisualOutputTarget produces text extracts in which the whitespace separating table
columns and rows is usefully reproduced in plain text. Here is example VisualOutputTarget
output for the implicit table shown above:
    369  03/20  08:45P  DETROIT,MI  313-310-6623
    370  03/20  08:49P  DETROIT,MI  313-310-6623
    371  03/20  08:52P  Incoming    734-642-7532
    372  03/20  08:57P  Incoming    734-642-7532
    373  03/20  09:03P  TRENTON,MI  734-642-7532
    374  03/20  09:32P  DETROIT,MI  313-310-6623
    375  03/20  09:41P  DETROIT,MI  313-310-6623
    376  03/20  09:59P  TRENTON,MI  734-341-2297
Tabular text extracts like this can be sliced and diced quite easily using standard-library
tools like regular expressions and textual pattern matching libraries, as well as common Unix
standbys like cut and awk.
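As one illustration of the regular-expression route, the sketch below splits whitespace-separated text into rows and columns using only the Java standard library. The VisualRows class is a hypothetical helper, not part of PDFxStream, and the input string is a hand-copied stand-in for VisualOutputTarget output; the exact amount of whitespace between columns in real output will vary from document to document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFxStream.
final class VisualRows {
    // Splits each non-blank line into columns on runs of whitespace.
    static List<String[]> parse(String text) {
        List<String[]> rows = new ArrayList<>();
        for (String line : text.split("\\R")) {   // \R matches any linebreak
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                rows.add(trimmed.split("\\s+"));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Stand-in for text captured from a VisualOutputTarget.
        String text =
            "369  03/20  08:45P  DETROIT,MI  313-310-6623\n" +
            "371  03/20  08:52P  Incoming    734-642-7532\n";
        for (String[] row : parse(text)) {
            // Print the origin and phone number columns of each call record.
            System.out.println(row[3] + " -> " + row[4]);
        }
    }
}
```

Splitting on any run of whitespace works here because no cell in the sample contains a space. For tables whose cells do contain single spaces, splitting on two or more spaces (the pattern "\\s{2,}") may be a better starting point, assuming the gaps between columns are always at least that wide.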