Controlling the formatting of extracted text
The formatting of the text extracts provided by PDFTextStream is defined by
the com.snowtide.pdf.OutputHandler that you use to "collect" the PDF text
events generated by PDFTextStream. The default OutputHandler,
com.snowtide.pdf.OutputTarget, is optimized for performance and use in
semantically-sensitive environments: search, indexing, summarization, etc.
In these kinds of environments, maintaining the spacing of text elements
(including table columns and such) is mostly unnecessary, but it is
important to ensure that logically contiguous text remains contiguous in the
linear representation that is plain text. Given columnated content in a PDF
that looks like this:
I celebrate myself, and sing myself,                      I loafe and invite my soul,
And what I assume you shall assume,                       I lean and loafe at my ease observing a spear of summer grass.
For every atom belonging to me as good belongs to you.
OutputTarget will produce text formatted like this, with the contents of
each column (and other elements) in "natural" reading order:
I celebrate myself, and sing myself,
And what I assume you shall assume,
For every atom belonging to me as good belongs to you.
I loafe and invite my soul,
I lean and loafe at my ease observing a spear of summer grass.
That's good for prose, and for most content in general, but sometimes you are more interested in the data held in a PDF document. For example, PDF content like this:
                                          2011      2010
Cash flow from operating activities:
Net income                              $9,126    $8,987
Depreciation                             2,816     1,223
Amortization of intangibles                719       971
would likely be extracted by OutputTarget like so:
Cash flow from operating activities: Net income Depreciation Amortization of intangibles 2011 $9,126 2,816 719 2010 $8,987 1,223 971
If your application were interested in reliably extracting that tabular
data, this text layout would not work very well. In such a context, it is
important for plain text extracts to mirror the "visual" layout of the pages
in the source PDF document as much as possible. For that, you can use one of
the alternate OutputHandler implementations included with PDFTextStream:
com.snowtide.pdf.VisualOutputTarget.
Making this change is simple in most cases: just replace all of your
references to OutputTarget with VisualOutputTarget, and the resulting text
extract will be a close approximation of the physical layout of the content
in the PDF:
                                          2011      2010
Cash flow from operating activities:
Net income                              $9,126    $8,987
Depreciation                             2,816     1,223
Amortization of intangibles                719       971
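To make the difference between the two output styles concrete, here is a small, self-contained toy model. It is only a sketch and does not use PDFTextStream or reflect its internal algorithms: the LayoutDemo class, its Fragment type, and the character-grid coordinates are all assumptions made for illustration. One method reads text column by column (similar in spirit to the reading-order output), and the other places each fragment at its position on a character grid (similar in spirit to the visual output).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Toy model for illustration; not PDFTextStream's actual algorithm or API. */
final class LayoutDemo {

    /** Hypothetical text fragment: x is a character column, y is a line index. */
    static final class Fragment {
        final int x;
        final int y;
        final String text;

        Fragment(int x, int y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Column-first order: left to right by column, top to bottom within a column. */
    static String readingOrder(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingInt((Fragment f) -> f.x).thenComparingInt(f -> f.y));
        StringBuilder out = new StringBuilder();
        for (Fragment f : sorted) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(f.text);
        }
        return out.toString();
    }

    /** Visual order: copy every fragment onto a character grid at its (x, y) position. */
    static String visualLayout(List<Fragment> fragments) {
        int rows = 0;
        int width = 0;
        for (Fragment f : fragments) {
            rows = Math.max(rows, f.y + 1);
            width = Math.max(width, f.x + f.text.length());
        }
        char[][] grid = new char[rows][width];
        for (char[] line : grid) {
            Arrays.fill(line, ' ');
        }
        for (Fragment f : fragments) {
            f.text.getChars(0, f.text.length(), grid[f.y], f.x);
        }
        StringBuilder out = new StringBuilder();
        for (char[] line : grid) {
            // Drop trailing padding so short rows do not end in spaces.
            out.append(new String(line).replaceAll("\\s+$", "")).append('\n');
        }
        return out.toString();
    }

    /** A cut-down version of the cash flow table, with made-up grid positions. */
    static List<Fragment> samplePage() {
        return Arrays.asList(
            new Fragment(0, 1, "Net income"),
            new Fragment(0, 2, "Depreciation"),
            new Fragment(14, 0, "2011"),
            new Fragment(14, 1, "$9,126"),
            new Fragment(14, 2, "2,816"),
            new Fragment(22, 0, "2010"),
            new Fragment(22, 1, "$8,987"),
            new Fragment(22, 2, "1,223"));
    }

    public static void main(String[] args) {
        System.out.println(readingOrder(samplePage()));
        System.out.print(visualLayout(samplePage()));
    }
}
```

In the column-first result the row labels are separated from their values, which is why that style is a poor fit for tabular data. The grid-based result keeps each label on the same line as its figures.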
PDFTextStream ships with other OutputHandler implementations, including
pdfts.examples.XMLOutputTarget, com.snowtide.pdf.RegionOutputTarget, and
pdfts.examples.GoogleHTMLOutputHandler. Of course, if you need to, you can
build an OutputHandler implementation of your own to perform custom
formatting of PDF text extracts
based on the observed structure of each page and your domain-specific
understanding of the type(s) of documents your application needs to process.
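As a rough illustration of what "domain-specific" formatting can look like, the sketch below collects text events line by line and keeps only the lines that look like a label followed by amounts. Note the assumptions: TextEventSink is a hypothetical stand-in defined here so the example is self-contained, and it is not the real OutputHandler API; AmountRowCollector and its parsing rule are likewise invented for this example. A real implementation would use the callbacks documented for com.snowtide.pdf.OutputHandler.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-in for an event-style handler; NOT the PDFTextStream API. */
interface TextEventSink {
    void text(String chars);

    void endLine();
}

/** Keeps lines shaped like "label amount amount ..." as rows of cleaned cells. */
final class AmountRowCollector implements TextEventSink {
    private static final Pattern AMOUNT = Pattern.compile("\\$?\\d[\\d,]*");

    private final StringBuilder currentLine = new StringBuilder();
    private final List<List<String>> rows = new ArrayList<>();

    @Override
    public void text(String chars) {
        currentLine.append(chars);
    }

    @Override
    public void endLine() {
        String[] tokens = currentLine.toString().trim().split("\\s+");
        currentLine.setLength(0);

        // Walk backwards over the trailing tokens that look like amounts.
        int firstAmount = tokens.length;
        while (firstAmount > 0 && AMOUNT.matcher(tokens[firstAmount - 1]).matches()) {
            firstAmount--;
        }
        // Skip lines with no amounts (headings) or no label (e.g. a row of years).
        if (firstAmount == tokens.length || firstAmount == 0) {
            return;
        }

        List<String> row = new ArrayList<>();
        row.add(String.join(" ", Arrays.copyOfRange(tokens, 0, firstAmount)));
        for (int i = firstAmount; i < tokens.length; i++) {
            row.add(tokens[i].replaceAll("[$,]", "")); // "$9,126" becomes "9126"
        }
        rows.add(row);
    }

    List<List<String>> rows() {
        return rows;
    }

    public static void main(String[] args) {
        AmountRowCollector collector = new AmountRowCollector();
        collector.text("Cash flow from operating activities:");
        collector.endLine();
        collector.text("Net income");
        collector.text("    $9,126    $8,987");
        collector.endLine();
        System.out.println(collector.rows());
    }
}
```

This kind of row-oriented parsing only works when each label arrives on the same line as its values, so it assumes the text is delivered in a layout-preserving order.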