Controlling the formatting of extracted text
The formatting of text extracts provided by PDFxStream is defined by
the com.snowtide.pdf.OutputHandler that you use to "collect"
the PDF text events generated by PDFxStream.
The default OutputHandler, com.snowtide.pdf.OutputTarget, is optimized for performance
and use in semantically-sensitive environments: search, indexing,
summarization, etc. In these kinds of environments, maintaining the
spacing of text elements (including table columns and such) is mostly
unnecessary, but it is important to ensure that logically contiguous
text remains contiguous in the linear representation that is plain text.
Given columnated content in a PDF that looks like this:
I celebrate myself, and sing myself,      I loafe and invite my soul,
And what I assume you shall assume,       I lean and loafe at my ease observing a
For every atom belonging to me as good    spear of summer grass.
belongs to you.
A naive text extraction process may well fail to recognize the columnation here
and run together lines from different columns (e.g. extracting
"I celebrate myself, and sing myself, I loafe and invite my soul," as the
first line). PDFxStream takes pains to recognize columns, though, so
OutputTarget will produce text formatted like
this, with the contents of each column (and other elements) in
"natural" reading order:
I celebrate myself, and sing myself,
And what I assume you shall assume,
For every atom belonging to me as good
belongs to you.
I loafe and invite my soul,
I lean and loafe at my ease observing a
spear of summer grass.
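To make the difference between those two orderings concrete, here is a minimal, self-contained sketch in Java. It is not PDFxStream's column-detection algorithm, and none of the names below (ReadingOrderSketch, Fragment, naive, columnAware) are part of the PDFxStream API; they are hypothetical. The sketch also assumes that each fragment already carries a known left edge and row, and that fragments sharing a left edge belong to the same column.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ReadingOrderSketch {

    /** Hypothetical positioned line of text; not a PDFxStream type. */
    static final class Fragment {
        final String text;
        final int x;    // left edge of the fragment
        final int row;  // 0-based visual row on the page

        Fragment(String text, int x, int row) {
            this.text = text;
            this.x = x;
            this.row = row;
        }
    }

    /** Naive extraction: read each visual row left to right, ignoring columns. */
    static List<String> naive(List<Fragment> fragments) {
        Map<Integer, List<Fragment>> rows = new TreeMap<>();
        for (Fragment f : fragments) {
            rows.computeIfAbsent(f.row, k -> new ArrayList<>()).add(f);
        }
        List<String> lines = new ArrayList<>();
        for (List<Fragment> row : rows.values()) {
            row.sort(Comparator.comparingInt((Fragment f) -> f.x));
            StringBuilder line = new StringBuilder();
            for (Fragment f : row) {
                if (line.length() > 0) line.append(' ');
                line.append(f.text);
            }
            lines.add(line.toString());
        }
        return lines;
    }

    /** Column-aware extraction: emit each column top to bottom, left column first. */
    static List<String> columnAware(List<Fragment> fragments) {
        // Simplifying assumption: fragments sharing a left edge form one column.
        Map<Integer, List<Fragment>> columns = new TreeMap<>();
        for (Fragment f : fragments) {
            columns.computeIfAbsent(f.x, k -> new ArrayList<>()).add(f);
        }
        List<String> lines = new ArrayList<>();
        for (List<Fragment> column : columns.values()) {
            column.sort(Comparator.comparingInt((Fragment f) -> f.row));
            for (Fragment f : column) lines.add(f.text);
        }
        return lines;
    }

    /** The two-column sample page, with made-up coordinates. */
    static List<Fragment> sample() {
        List<Fragment> page = new ArrayList<>();
        page.add(new Fragment("I celebrate myself, and sing myself,", 0, 0));
        page.add(new Fragment("I loafe and invite my soul,", 42, 0));
        page.add(new Fragment("And what I assume you shall assume,", 0, 1));
        page.add(new Fragment("I lean and loafe at my ease observing a", 42, 1));
        page.add(new Fragment("For every atom belonging to me as good", 0, 2));
        page.add(new Fragment("spear of summer grass.", 42, 2));
        page.add(new Fragment("belongs to you.", 0, 3));
        return page;
    }

    public static void main(String[] args) {
        naive(sample()).forEach(System.out::println);
        System.out.println("---");
        columnAware(sample()).forEach(System.out::println);
    }
}
```

The naive pass glues the first lines of both columns together, while the column-aware pass keeps each column's lines contiguous. Real pages need far more robust column detection than a shared left edge.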
That works well for prose, and for most content generally, but sometimes you are more interested in the data held in a PDF document. For example, PDF content like this:
                                          2011      2010
Cash flow from operating activities:
Net income                              $9,126    $8,987
Depreciation                             2,816     1,223
Amortization of intangibles                719       971
would likely be extracted by OutputTarget like so:
Cash flow from operating activities:
Net income
Depreciation
Amortization of intangibles
2011
$9,126
2,816
719
2010
$8,987
1,223
971
If your application were interested in reliably extracting that tabular
data, this content-oriented layout would not work very well. In such a
context, it is important for plain text extracts to mirror the
"visual" layout of the pages in the source PDF document as much as
possible, such that the relative horizontal and vertical positioning of
each character and word reflects as accurately as possible their
positioning in the source PDF document. To do this, you can use
com.snowtide.pdf.VisualOutputTarget, one of the alternate OutputHandler
implementations included with PDFxStream.
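As a rough illustration of what mirroring the visual layout means for plain text, the following sketch maps horizontal positions onto character columns. It shows the general idea only and should not be read as how VisualOutputTarget is implemented. The Run type, the render method, the coordinates, and the fixed glyph width of 6 points are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class VisualLayoutSketch {

    /** Hypothetical positioned run of text; not a PDFxStream type. */
    static final class Run {
        final String text;
        final double x;  // left edge, in points
        final int line;  // 0-based output line

        Run(String text, double x, int line) {
            this.text = text;
            this.x = x;
            this.line = line;
        }
    }

    /**
     * Places each run at the character column nearest to x / glyphWidth.
     * Assumes the runs of each line are supplied in left-to-right order.
     */
    static String render(List<Run> runs, double glyphWidth) {
        List<StringBuilder> grid = new ArrayList<>();
        for (Run r : runs) {
            while (grid.size() <= r.line) grid.add(new StringBuilder());
            StringBuilder row = grid.get(r.line);
            int col = (int) Math.round(r.x / glyphWidth);
            if (row.length() > 0 && row.length() >= col) {
                row.append(' ');  // runs would collide: keep one separating space
            }
            while (row.length() < col) row.append(' ');  // pad out to the target column
            row.append(r.text);
        }
        StringBuilder out = new StringBuilder();
        for (StringBuilder row : grid) out.append(row).append('\n');
        return out.toString();
    }

    /** Two lines of a table, with made-up coordinates. */
    static List<Run> sample() {
        List<Run> runs = new ArrayList<>();
        runs.add(new Run("2011", 240, 0));
        runs.add(new Run("2010", 300, 0));
        runs.add(new Run("Net income", 0, 1));
        runs.add(new Run("$9,126", 228, 1));
        runs.add(new Run("$8,987", 288, 1));
        return runs;
    }

    public static void main(String[] args) {
        System.out.print(render(sample(), 6.0));
    }
}
```

Because each run keeps its horizontal offset, the year headings end up directly above the figures they describe, which is the property a layout-preserving extract is after.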
Making this change is simple in most cases -- just replace all of your
references to OutputTarget with VisualOutputTarget, and the resulting text
extract will be a close approximation to the physical layout of the
content in the PDF:
                                          2011      2010
Cash flow from operating activities:
Net income                              $9,126    $8,987
Depreciation                             2,816     1,223
Amortization of intangibles                719       971
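One reason this layout suits tabular data is that each row's label and values stay on the same line, so simple line-oriented parsing becomes possible. The sketch below assumes that cells are separated by runs of two or more spaces and that the first non-blank line is the column header; both are properties of this sample rather than anything VisualOutputTarget guarantees. VisualTableSketch and its parse method are hypothetical names, not part of PDFxStream.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class VisualTableSketch {

    /** Sample text shaped like a layout-preserving extract of the table. */
    static final String SAMPLE = String.join("\n",
        "                                          2011      2010",
        "Cash flow from operating activities:",
        "Net income                              $9,126    $8,987",
        "Depreciation                             2,816     1,223",
        "Amortization of intangibles                719       971");

    /**
     * Returns row label -> (column header -> cell text).
     * Lines whose cell count does not match the header (such as section
     * titles) are skipped.
     */
    static Map<String, Map<String, String>> parse(String text) {
        Map<String, Map<String, String>> table = new LinkedHashMap<>();
        String[] header = null;
        for (String line : text.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) continue;
            // Two or more spaces separate cells; single spaces stay inside a cell.
            String[] cells = trimmed.split(" {2,}");
            if (header == null) {
                header = cells;
                continue;
            }
            if (cells.length != header.length + 1) continue;
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < header.length; i++) {
                row.put(header[i], cells[i + 1]);
            }
            table.put(cells[0], row);
        }
        return table;
    }

    public static void main(String[] args) {
        parse(SAMPLE).forEach((label, row) -> System.out.println(label + " -> " + row));
    }
}
```

Doing the same with the content-oriented extract would mean counting lines to pair each label with its figures, which breaks as soon as a row is missing a value.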
There are other OutputHandler implementations available with PDFxStream,
including pdfts.examples.XMLOutputTarget,
com.snowtide.pdf.RegionOutputTarget, and
pdfts.examples.GoogleHTMLOutputHandler. Of course, if you need
to, you can build an OutputHandler
implementation of your own to perform custom formatting of PDF text
extracts based on the observed structure of each page and your
domain-specific understanding of the type(s) of documents your
application needs to process.