Controlling the formatting of extracted text

The formatting of text extracts provided by PDFxStream is defined by the com.snowtide.pdf.OutputHandler that you use to "collect" the PDF text events generated by PDFxStream.

The default OutputHandler, com.snowtide.pdf.OutputTarget, is optimized for performance and use in semantically-sensitive environments: search, indexing, summarization, etc. In these kinds of environments, maintaining the spacing of text elements (including table columns and such) is mostly unnecessary, but it is important to ensure that logically contiguous text remains contiguous in the linear representation that is plain text. Given columnated content in a PDF that looks like this:

I celebrate myself, and sing myself,       I loafe and invite my soul,
And what I assume you shall assume,        I lean and loafe at my ease observing a
For every atom belonging to me as good     spear of summer grass.
belongs to you.

A naive text extraction process may well fail to recognize the columnation here, and glom together lines from disparate columns (e.g. extracting "I celebrate myself, and sing myself, I loafe and invite my soul," as the first line). PDFxStream takes pains to recognize columns, though, so OutputTarget will produce text formatted like this, with the contents of each column (and other elements) in "natural" reading order:

I celebrate myself, and sing myself,
And what I assume you shall assume,
For every atom belonging to me as good
belongs to you.

I loafe and invite my soul,
I lean and loafe at my ease observing a
spear of summer grass.
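
To make the difference between the two orderings concrete, here is a minimal, self-contained sketch in plain Java. It does not use PDFxStream and is not how PDFxStream detects columns: the `Fragment` class, the coordinates, and the hard-coded column boundary are all hypothetical, chosen only to reproduce the sample above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class ReadingOrderSketch {

    // A positioned line of text. x and y are made-up page coordinates
    // (y grows downward); this is not a PDFxStream type.
    static final class Fragment {
        final int x;
        final int y;
        final String text;

        Fragment(int x, int y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // The two-column sample, with hypothetical coordinates.
    static List<Fragment> sample() {
        return Arrays.asList(
            new Fragment(0, 0, "I celebrate myself, and sing myself,"),
            new Fragment(43, 0, "I loafe and invite my soul,"),
            new Fragment(0, 1, "And what I assume you shall assume,"),
            new Fragment(43, 1, "I lean and loafe at my ease observing a"),
            new Fragment(0, 2, "For every atom belonging to me as good"),
            new Fragment(43, 2, "spear of summer grass."),
            new Fragment(0, 3, "belongs to you."));
    }

    // Naive ordering: top to bottom, then left to right, so fragments that
    // share a y coordinate are glued into a single output line.
    static List<String> naiveLines(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingInt((Fragment f) -> f.y)
                              .thenComparingInt(f -> f.x));
        List<String> lines = new ArrayList<>();
        int currentY = Integer.MIN_VALUE;
        for (Fragment f : sorted) {
            if (!lines.isEmpty() && f.y == currentY) {
                int last = lines.size() - 1;
                lines.set(last, lines.get(last) + " " + f.text);
            } else {
                lines.add(f.text);
                currentY = f.y;
            }
        }
        return lines;
    }

    // Column-aware ordering: the whole left column first, then the right.
    // The boundary is supplied by the caller; finding it automatically is
    // the hard part and is not attempted here.
    static List<String> columnLines(List<Fragment> fragments, int boundaryX) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparing((Fragment f) -> f.x >= boundaryX)
                              .thenComparingInt(f -> f.y));
        List<String> lines = new ArrayList<>();
        for (Fragment f : sorted) {
            lines.add(f.text);
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println("Naive first line: " + naiveLines(sample()).get(0));
        System.out.println("Column-aware order:");
        for (String line : columnLines(sample(), 40)) {
            System.out.println(line);
        }
    }
}
```

The naive pass merges both columns into one line per row, while the column-aware pass keeps each column's lines together. A real extractor has to infer the column boundaries from the page itself rather than receive them as a parameter.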

This content-oriented ordering is good for prose, and generally for any content, but sometimes you are more interested in data held in a PDF document. For example, PDF content like this:

                                         2011      2010
Cash flow from operating activities:
Net income                             $9,126    $8,987
Depreciation                            2,816     1,223
Amortization of intangibles               719       971

would likely be extracted by OutputTarget like so:

Cash flow from operating activities:
Net income
Depreciation
Amortization of intangibles

2011
$9,126
2,816
719

2010
$8,987
1,223
971

If your application were interested in reliably extracting that tabular data, this content-oriented layout would not work very well. In such a context, it is important for plain text extracts to mirror the "visual" layout of the pages in the source PDF document as much as possible, such that the relative horizontal and vertical positioning of each character and word reflects as accurately as possible their positioning in the source PDF document. To do this, one can use one of the many alternate OutputHandler implementations included with PDFxStream, com.snowtide.pdf.VisualOutputTarget.

Making this change is simple in most cases -- just replace all of your references to OutputTarget with VisualOutputTarget, and the resulting text extract will be a close approximation to the physical layout of the content in the PDF:

                                      2011      2010
Cash flow from operating activities:
Net income                          $9,126    $8,987
Depreciation                         2,816     1,223
Amortization of intangibles            719       971
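
Once the extract preserves horizontal spacing, pulling the table apart becomes an ordinary string-processing task. The following is a rough sketch in plain Java, independent of PDFxStream: it assumes the visually laid-out text is already in a string, and it treats any run of two or more spaces as a cell separator. Both the `VisualTableSketch` class and that splitting heuristic are illustrative assumptions, not part of the library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class VisualTableSketch {

    // Sample text shaped like a visual-layout extract (hand-written here,
    // not produced by running PDFxStream).
    static String sample() {
        return String.join("\n",
            "                                      2011      2010",
            "Cash flow from operating activities:",
            "Net income                          $9,126    $8,987",
            "Depreciation                         2,816     1,223",
            "Amortization of intangibles            719       971");
    }

    // Splits each non-blank line into cells wherever two or more
    // consecutive spaces occur. Single spaces inside a label such as
    // "Net income" are kept, so the label stays in one cell.
    static List<List<String>> parseRows(String visualText) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : visualText.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            rows.add(Arrays.asList(trimmed.split(" {2,}")));
        }
        return rows;
    }

    public static void main(String[] args) {
        for (List<String> row : parseRows(sample())) {
            System.out.println(row);
        }
    }
}
```

This heuristic is deliberately simple: it cannot represent empty cells, and a label containing a wide internal gap would be split in two. For documents with a stable layout, slicing each line at fixed character offsets is likely to be more dependable.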

There are other OutputHandler implementations included with PDFxStream, including pdfts.examples.XMLOutputTarget, com.snowtide.pdf.RegionOutputTarget, and pdfts.examples.GoogleHTMLOutputHandler. Of course, if you need to, you can build an OutputHandler implementation of your own to perform custom formatting of PDF text extracts based on the observed structure of each page and your domain-specific understanding of the type(s) of documents your application needs to process.