Controlling the formatting of extracted text

The formatting of text extracts provided by PDFTextStream is defined by the com.snowtide.pdf.OutputHandler that you use to "collect" the PDF text events generated by PDFTextStream.

The default OutputHandler, com.snowtide.pdf.OutputTarget, is optimized for performance and for use in semantically sensitive environments: search, indexing, summarization, etc. In these environments, preserving the spacing of text elements (including table columns and the like) is mostly unnecessary, but it is important that logically contiguous text remains contiguous in the linear, plain-text representation. Given columnar content in a PDF that looks like this:

I celebrate myself, and sing myself,                     I loafe and invite my soul,
And what I assume you shall assume,                      I lean and loafe at my ease observing a spear of summer grass.
For every atom belonging to me as good belongs to you.

OutputTarget will produce text formatted like this, with the contents of each column (and other elements) in "natural" read ordering:

I celebrate myself, and sing myself,
And what I assume you shall assume,
For every atom belonging to me as good belongs to you.

I loafe and invite my soul,
I lean and loafe at my ease observing a spear of summer grass.

That works well for prose, and for most content in general, but sometimes you are more interested in the data held in a PDF document. For example, PDF content like this:

                                          2011      2010
Cash flow from operating activities:
Net income                               $9,126    $8,987
Depreciation                              2,816     1,223
Amortization of intangibles                 719       971

would likely be extracted by OutputTarget like so:

Cash flow from operating activities:
Net income                               
Depreciation                             
Amortization of intangibles              

2011
$9,126
2,816
719

2010
$8,987
1,223
971

If your application needs to extract that tabular data reliably, this content-oriented layout will not work very well. In such a context, plain-text extracts should mirror the "visual" layout of the pages in the source PDF document, so that the relative horizontal and vertical position of each character and word matches its position on the page as closely as possible. To do this, you can use one of the many alternate OutputHandler implementations included with PDFTextStream, com.snowtide.pdf.VisualOutputTarget.

Making this change is simple in most cases: replace all of your references to OutputTarget with VisualOutputTarget, and the resulting text extract will be a close approximation of the physical layout of the content in the PDF:

                                          2011      2010
Cash flow from operating activities:
Net income                              $9,126     $8,987
Depreciation                              2,816    1,223
Amortization of intangibles               719       971
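
As a rough sketch of why a layout-preserving extract is easier to consume as data, the following Java splits each line of such text into cells wherever two or more spaces occur. It operates on an already-extracted string and uses no PDFTextStream API. The class and method names (VisualTableRows, cells) are hypothetical, and the two-space threshold is an assumption that will not suit every document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFTextStream: works on plain text only.
class VisualTableRows {

    /** Splits one line of layout-preserving text into cells at runs of two or more whitespace characters. */
    static List<String> cells(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<String>();
        }
        // Single spaces stay inside a cell ("Net income"); wider gaps separate cells.
        return Arrays.asList(trimmed.split("\\s{2,}"));
    }

    public static void main(String[] args) {
        // Sample text copied from the layout-preserving extract shown above.
        String extract =
              "                                          2011      2010\n"
            + "Cash flow from operating activities:\n"
            + "Net income                              $9,126     $8,987\n"
            + "Depreciation                              2,816    1,223\n"
            + "Amortization of intangibles               719       971\n";

        for (String line : extract.split("\n")) {
            System.out.println(cells(line));
        }
    }
}
```

For the line beginning "Net income", this yields three cells: the label and the two amounts. Because leading whitespace is trimmed, the header line yields only two cells, so an application that needs to match headers to columns would have to compare character offsets rather than cell counts.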

Other OutputHandler implementations are included with PDFTextStream, among them pdfts.examples.XMLOutputTarget, com.snowtide.pdf.RegionOutputTarget, and pdfts.examples.GoogleHTMLOutputHandler. If you need to, you can also build an OutputHandler implementation of your own to perform custom formatting of PDF text extracts, based on the observed structure of each page and your domain-specific understanding of the type(s) of documents your application needs to process.
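
The kind of positional logic a custom formatter might apply can be sketched independently of any library. The Java below is not PDFTextStream API: GridFormatter, place, and render are hypothetical names, and the fixed character width and line height are assumed values. It places each run of text on a character grid according to its x/y position, assuming that runs on a given line arrive left to right and that y grows upward, as in PDF page coordinates.

```java
import java.util.Collections;
import java.util.TreeMap;

// Hypothetical illustration only; a real OutputHandler would receive its
// text events and coordinates from PDFTextStream rather than from place().
class GridFormatter {
    private final double charWidth;   // assumed width of one character cell, in points
    private final double lineHeight;  // assumed height of one text line, in points

    // Lines keyed by row index, in descending order: larger y is higher on the page.
    private final TreeMap<Integer, StringBuilder> rows =
            new TreeMap<Integer, StringBuilder>(Collections.<Integer>reverseOrder());

    GridFormatter(double charWidth, double lineHeight) {
        this.charWidth = charWidth;
        this.lineHeight = lineHeight;
    }

    /** Adds a run of text whose left edge is at (x, y), measured in points. */
    void place(double x, double y, String text) {
        int row = (int) Math.round(y / lineHeight);
        int col = (int) Math.round(x / charWidth);
        StringBuilder line = rows.computeIfAbsent(row, k -> new StringBuilder());
        if (line.length() > col && line.length() > 0) {
            // The target column is already occupied: keep the runs apart with one space.
            line.append(' ');
        }
        while (line.length() < col) {
            line.append(' ');  // pad out to the run's column
        }
        line.append(text);
    }

    /** Returns the accumulated lines, top of the page first. */
    String render() {
        StringBuilder out = new StringBuilder();
        for (StringBuilder line : rows.values()) {
            out.append(line).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        GridFormatter grid = new GridFormatter(6.0, 12.0);
        grid.place(120, 712, "2011");
        grid.place(0, 700, "Net income");
        grid.place(120, 700, "$9,126");
        System.out.print(grid.render());
    }
}
```

In this sketch the year and the amount share the same x position, so they start in the same output column. Real pages use proportional fonts and varying font sizes, so a production formatter would need a more careful mapping from page coordinates to character columns.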

PDFxStream v3.3.1 Technical Documentation
