Controlling the formatting of extracted text

The formatting of the text extracts provided by PDFTextStream is defined by the com.snowtide.pdf.OutputHandler that you use to "collect" the PDF text events generated by PDFTextStream.

The default OutputHandler, com.snowtide.pdf.OutputTarget, is optimized for performance and use in semantically-sensitive environments: search, indexing, summarization, etc. In these kinds of environments, maintaining the spacing of text elements (including table columns and such) is mostly unnecessary, but it is important to ensure that logically contiguous text remains contiguous in the linear representation that is plain text. Given columnated content in a PDF that looks like this:

I celebrate myself, and sing myself,                     I loafe and invite my soul,
And what I assume you shall assume,                      I lean and loafe at my ease observing a spear of summer grass.
For every atom belonging to me as good belongs to you.

OutputTarget will produce text formatted like this, with the contents of each column (and other elements) in "natural" read ordering:

I celebrate myself, and sing myself,
And what I assume you shall assume,
For every atom belonging to me as good belongs to you.

I loafe and invite my soul,
I lean and loafe at my ease observing a spear of summer grass.
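The difference between the two orderings can be sketched with a toy model. The Java below is only an illustration and is not PDFTextStream code: the class ReadOrderSketch, its readOrder method, and the assumption that column boundaries are already known are all hypothetical, since how OutputTarget detects columns is not described here.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of read-order linearization; this is NOT PDFTextStream code. */
class ReadOrderSketch {

    /**
     * rows[r][c] holds the text found on visual row r in column c,
     * or null when that column is empty on that row.
     * Returns each column's lines top to bottom, one column after another.
     */
    static String readOrder(String[][] rows, int columnCount) {
        List<String> blocks = new ArrayList<>();
        for (int c = 0; c < columnCount; c++) {
            List<String> lines = new ArrayList<>();
            for (String[] row : rows) {
                if (c < row.length && row[c] != null) {
                    lines.add(row[c]);
                }
            }
            if (!lines.isEmpty()) {
                blocks.add(String.join("\n", lines));
            }
        }
        // A blank line separates one column's block from the next.
        return String.join("\n\n", blocks);
    }

    public static void main(String[] args) {
        String[][] page = {
            {"I celebrate myself, and sing myself,",
             "I loafe and invite my soul,"},
            {"And what I assume you shall assume,",
             "I lean and loafe at my ease observing a spear of summer grass."},
            {"For every atom belonging to me as good belongs to you.", null},
        };
        System.out.println(readOrder(page, 2));
    }
}
```

Walking the columns first and the rows second is what keeps each column's text contiguous, at the cost of discarding the horizontal spacing between columns.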

This read-ordered output works well for prose and most other content, but sometimes you are more interested in the data held in a PDF document. For example, PDF content like this:

                                          2011      2010
Cash flow from operating activities:
Net income                               $9,126    $8,987
Depreciation                              2,816     1,223
Amortization of intangibles                 719       971

would likely be extracted by OutputTarget like so:

Cash flow from operating activities:
Net income
Depreciation
Amortization of intangibles

2011
$9,126
2,816
719

2010
$8,987
1,223
971

If your application needs to extract that tabular data reliably, this text layout does not work very well. In such a context, it is important for plain text extracts to mirror the "visual" layout of the pages in the source PDF document as closely as possible. For that, you can use one of the alternate OutputHandler implementations included with PDFTextStream, com.snowtide.pdf.VisualOutputTarget.

Making this change is simple in most cases: replace all of your references to OutputTarget with VisualOutputTarget, and the resulting text extract will be a close approximation of the physical layout of the content in the PDF:

                                          2011      2010
Cash flow from operating activities:
Net income                              $9,126     $8,987
Depreciation                              2,816    1,223
Amortization of intangibles               719       971
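The idea behind layout-preserving output can be sketched with another toy model. Again, this is not PDFTextStream code: VisualLayoutSketch, Fragment, and render are hypothetical names, and the sketch assumes each text run already carries a row index and a starting character cell, and that runs on a row arrive left to right without overlapping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of layout-preserving output; this is NOT PDFTextStream code. */
class VisualLayoutSketch {

    /** A run of text and the character cell where it starts (hypothetical type). */
    static final class Fragment {
        final int row;
        final int col;
        final String text;

        Fragment(int row, int col, String text) {
            this.row = row;
            this.col = col;
            this.text = text;
        }
    }

    /** Places each fragment on its row, padding with spaces up to its start cell. */
    static String render(List<Fragment> fragments) {
        List<StringBuilder> lines = new ArrayList<>();
        for (Fragment f : fragments) {
            while (lines.size() <= f.row) {
                lines.add(new StringBuilder());
            }
            StringBuilder line = lines.get(f.row);
            // If the line is already past f.col, no padding is added
            // and the text is simply appended.
            while (line.length() < f.col) {
                line.append(' ');
            }
            line.append(f.text);
        }
        return String.join("\n", lines);
    }

    public static void main(String[] args) {
        List<Fragment> page = Arrays.asList(
            new Fragment(0, 0, "Net income"),
            new Fragment(0, 41, "$9,126"),
            new Fragment(0, 51, "$8,987"),
            new Fragment(1, 0, "Depreciation"),
            new Fragment(1, 42, "2,816"),
            new Fragment(1, 52, "1,223"));
        System.out.println(render(page));
    }
}
```

Because each label stays on the same line as its figures, a downstream parser can split every line into a row of the table, which is not possible once the columns have been separated into read order.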

Other OutputHandler implementations are included with PDFTextStream, among them pdfts.examples.XMLOutputTarget, com.snowtide.pdf.RegionOutputTarget, and pdfts.examples.GoogleHTMLOutputHandler. If you need to, you can also build an OutputHandler implementation of your own to perform custom formatting of PDF text extracts, based on the observed structure of each page and your domain-specific understanding of the type(s) of documents your application needs to process.