Applies to: PDFTextStream
Controlling the formatting of extracted text
The formatting of text extracts provided by PDFTextStream is defined by the
com.snowtide.pdf.OutputHandler that you use to "collect" the PDF text events
generated by PDFTextStream. The default OutputHandler,
com.snowtide.pdf.OutputTarget, is optimized for performance and use in
semantically-sensitive environments: search, indexing, summarization, etc.
In these kinds of environments, maintaining the spacing of text elements
(including table columns and such) is mostly unnecessary, but it is
important to ensure that logically contiguous text remains contiguous in the
linear representation that is plain text. Given columnated content in a PDF
that looks like this:
I celebrate myself, and sing myself,                      I loafe and invite my soul,
And what I assume you shall assume,                       I lean and loafe at my ease observing a spear of summer grass.
For every atom belonging to me as good belongs to you.
OutputTarget will produce text formatted like this, with the contents of
each column (and other elements) in "natural" read ordering:
I celebrate myself, and sing myself,
And what I assume you shall assume,
For every atom belonging to me as good belongs to you.
I loafe and invite my soul,
I lean and loafe at my ease observing a spear of summer grass.
That's good for prose, and for most content generally, but sometimes you are more interested in the data held in a PDF document. For example, PDF content like this:
                                          2011      2010
Cash flow from operating activities:
Net income                              $9,126    $8,987
Depreciation                             2,816     1,223
Amortization of intangibles                719       971
would likely be extracted by OutputTarget like so:
Cash flow from operating activities:
Net income
Depreciation
Amortization of intangibles
2011
$9,126
2,816
719
2010
$8,987
1,223
971
If your application were interested in reliably extracting that tabular
data, this content-oriented layout would not work very well. In such a
context, it is important for plain text extracts to mirror the "visual"
layout of the pages in the source PDF document as much as possible, such
that the relative horizontal and vertical positioning of each character and
word reflects as accurately as possible their positioning in the source PDF
document. To do this, one can use one of the many alternate OutputHandler
implementations included with PDFTextStream:
com.snowtide.pdf.VisualOutputTarget.
Making this change is simple in most cases: just replace all of your
references to OutputTarget with VisualOutputTarget, and the resulting text
extract will be a close approximation to the physical layout of the content
in the PDF:
                                          2011      2010
Cash flow from operating activities:
Net income                              $9,126    $8,987
Depreciation                             2,816     1,223
Amortization of intangibles                719       971
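Once each row of the table stays on a single line, pulling the values out becomes a simple text-processing step. The following is a minimal sketch of that idea, not part of PDFTextStream: the LayoutRowParser class, its Row type, and the "two or more spaces separate cells" rule are all assumptions made for illustration, and real documents may need a more careful column-detection rule.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Hypothetical helper (not a PDFTextStream class): splits lines of a
 * layout-preserving text extract into a label and its column values.
 */
public class LayoutRowParser {

    /** One parsed row: the leading label plus the remaining cells. */
    static final class Row {
        final String label;
        final List<String> values;

        Row(String label, List<String> values) {
            this.label = label;
            this.values = values;
        }
    }

    /**
     * Assumption: cells are separated by runs of two or more spaces,
     * while single spaces belong to the text inside a cell.
     */
    static Row parseLine(String line) {
        String[] cells = line.trim().split(" {2,}");
        List<String> values =
                new ArrayList<>(Arrays.asList(cells).subList(1, cells.length));
        return new Row(cells[0], values);
    }

    public static void main(String[] args) {
        // Data rows as they might appear in a layout-preserving extract.
        String[] lines = {
            "Net income                              $9,126    $8,987",
            "Depreciation                             2,816     1,223",
            "Amortization of intangibles                719       971"
        };
        for (String line : lines) {
            Row row = parseLine(line);
            System.out.println(row.label + " -> " + row.values);
        }
    }
}
```

Header lines such as the row of years have no label cell, so an application would likely need to handle them separately.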
There are other OutputHandler implementations included with PDFTextStream,
including pdfts.examples.XMLOutputTarget,
com.snowtide.pdf.RegionOutputTarget, and
pdfts.examples.GoogleHTMLOutputHandler. Of course, if you need to, you can
build an OutputHandler implementation of your own to perform custom
formatting of PDF text extracts based on the observed structure of each page
and your domain-specific understanding of the type(s) of documents your
application needs to process.
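To give a feel for what a custom collector of text events might do, here is a small self-contained sketch. It deliberately does not use the PDFTextStream API: TextEventSink is a hypothetical stand-in interface, not com.snowtide.pdf.OutputHandler, and the integer column/row coordinates are an assumption made to keep the example short. The real OutputHandler callbacks are described in the PDFTextStream Javadoc.

```java
import java.util.ArrayList;
import java.util.List;

public class GridCollectorSketch {

    /**
     * Hypothetical stand-in for a text-event callback.
     * This is NOT the PDFTextStream OutputHandler API.
     */
    interface TextEventSink {
        void text(int column, int row, String content);
    }

    /** Collects positioned text and places it on a character grid. */
    static final class GridCollector implements TextEventSink {
        private final List<StringBuilder> rows = new ArrayList<>();

        @Override
        public void text(int column, int row, String content) {
            // Grow the grid downward until the target row exists.
            while (rows.size() <= row) {
                rows.add(new StringBuilder());
            }
            StringBuilder line = rows.get(row);
            // Pad with spaces so the text starts at the requested column.
            while (line.length() < column) {
                line.append(' ');
            }
            // Overwrite any existing characters, or append at the end.
            int end = Math.min(line.length(), column + content.length());
            line.replace(column, end, content);
        }

        /** Returns the grid as plain text, one line per row. */
        String render() {
            StringBuilder out = new StringBuilder();
            for (StringBuilder line : rows) {
                out.append(line).append('\n');
            }
            return out.toString();
        }
    }

    public static void main(String[] args) {
        GridCollector collector = new GridCollector();
        // Events may arrive in any order; position decides placement.
        collector.text(0, 1, "Net income");
        collector.text(40, 1, "$9,126");
        collector.text(50, 1, "$8,987");
        collector.text(42, 0, "2011");
        collector.text(52, 0, "2010");
        System.out.print(collector.render());
    }
}
```

The design choice here is to let position, rather than arrival order, decide where text lands, which is the same general idea behind a layout-preserving extract.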