Applies to: PDFTextStream
Restricting PDF text extraction to only specific coordinates
PDFTextStream can be used to extract content from particular coordinates on one or many pages in a document.
com.snowtide.pdf.RegionOutputTarget is a com.snowtide.pdf.OutputHandler implementation that lets you define the areas on a page where the desired content is positioned. Once a page is piped through the RegionOutputTarget instance, the content found in the specified regions is available from com.snowtide.pdf.RegionOutputTarget.getRegionText(int), or mapped to user-defined "field" names via RegionOutputTarget.getRegionText(int).
This is particularly useful when extracting form data where each field is known to be positioned in a specific location (perhaps on a specific page in each document of a particular type): each form field's coordinates can be defined in the RegionOutputTarget and given a name ("address", "ID No.", "field-553", or whatever is appropriate for your application). That RegionOutputTarget can then be queried for the text values of those fields after a PDF page has been piped through it.
Some example code is available on RegionOutputTarget's API reference page.
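To make the region idea concrete without depending on the library, here is a minimal, self-contained Java sketch of the underlying technique: named rectangles are registered, positioned text is fed through, and each region collects the text whose coordinates fall inside it. Everything in it (RegionSketch, PositionedText, addRegion, accept, textFor, textAt, and the sample coordinates) is a hypothetical stand-in written for illustration; it is not PDFTextStream's API, and the actual RegionOutputTarget signatures should be taken from its API reference page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of region-based extraction. NOT the PDFTextStream API.
class RegionSketch {

    // A run of text anchored at a point on the page (assumed representation).
    static final class PositionedText {
        final String text;
        final float x, y;
        PositionedText(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // A named rectangle plus the text collected inside it so far.
    private static final class Region {
        final String name;
        final float x, y, w, h;
        final StringBuilder text = new StringBuilder();
        Region(String name, float x, float y, float w, float h) {
            this.name = name;
            this.x = x;
            this.y = y;
            this.w = w;
            this.h = h;
        }
        boolean contains(float px, float py) {
            return px >= x && px < x + w && py >= y && py < y + h;
        }
    }

    private final List<Region> regions = new ArrayList<>();

    // Register a rectangle (origin plus width and height) under a field name.
    void addRegion(float x, float y, float w, float h, String name) {
        regions.add(new Region(name, x, y, w, h));
    }

    // Stands in for piping a page through the target: route each text run
    // to every region that contains its anchor point.
    void accept(List<PositionedText> pageContent) {
        for (PositionedText t : pageContent) {
            for (Region r : regions) {
                if (r.contains(t.x, t.y)) {
                    if (r.text.length() > 0) {
                        r.text.append(' ');
                    }
                    r.text.append(t.text);
                }
            }
        }
    }

    // Look up collected text by registration order.
    String textAt(int index) {
        return regions.get(index).text.toString();
    }

    // Look up collected text by field name; null if no such region exists.
    String textFor(String name) {
        for (Region r : regions) {
            if (r.name.equals(name)) {
                return r.text.toString();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        RegionSketch target = new RegionSketch();
        target.addRegion(50, 600, 300, 40, "address");
        target.addRegion(400, 700, 100, 20, "ID No.");

        target.accept(Arrays.asList(
                new PositionedText("12 Main St", 60, 625),
                new PositionedText("Springfield", 60, 610),
                new PositionedText("A-1234", 410, 705),
                new PositionedText("Page 1", 300, 20))); // outside every region

        System.out.println(target.textFor("address")); // prints: 12 Main St Springfield
        System.out.println(target.textAt(1));          // prints: A-1234
    }
}
```

The sketch tests only each text run's anchor point against the rectangle, which is a simplification chosen to keep the example short; how the real library decides whether content belongs to a region is not described here.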