Extracting PDF text from only specific areas
With some minimal configuration, you can extract text content from only specific areas of a source PDF document.
com.snowtide.pdf.RegionOutputTarget is a com.snowtide.pdf.OutputHandler implementation that allows you to define areas on a page where the desired content is positioned. Once a page is piped through the RegionOutputTarget instance, the content found in the specified regions is available by region index from com.snowtide.pdf.RegionOutputTarget.getRegionText(int), or mapped to user-defined "field" names via
RegionOutputTarget.getValueByName(String).
This is particularly useful when extracting form data where each field is known to be positioned in a specific location (perhaps on a specific page in each document of a particular type). Each form field's coordinates can be defined in the RegionOutputTarget and given a name ("address", "ID No.", "field-553", or whatever is appropriate for your application). That RegionOutputTarget can then be queried for the text values of those fields after a PDF page has been piped through it.
Some example code is available in the RegionOutputTarget API reference.
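
To make the named-region idea concrete, here is a minimal, library-independent sketch in plain Java. It is a toy model only: RegionSketch, addRegion, offer, textAt and textFor are hypothetical names invented for this illustration and are not part of PDFxStream, and the coordinates are arbitrary. It assumes text arrives as positioned pieces and that a piece belongs to a region when its anchor point falls inside that region's rectangle; the real library may decide membership differently.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: none of these types or methods belong to PDFxStream.
class RegionSketch {
    // A named rectangle on the page plus the text collected inside it.
    private static final class Region {
        final String name;
        final double x, y, width, height;
        final StringBuilder text = new StringBuilder();

        Region(String name, double x, double y, double width, double height) {
            this.name = name;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }

        // True when the point lies inside the rectangle (edges included).
        boolean contains(double px, double py) {
            return px >= x && px <= x + width && py >= y && py <= y + height;
        }
    }

    private final List<Region> regions = new ArrayList<>();

    // Registers a field name together with the area where its value is expected.
    void addRegion(String name, double x, double y, double width, double height) {
        regions.add(new Region(name, x, y, width, height));
    }

    // Stand-in for piping a page: offers one positioned piece of text.
    void offer(double x, double y, String text) {
        for (Region r : regions) {
            if (r.contains(x, y)) {
                if (r.text.length() > 0) {
                    r.text.append(' ');
                }
                r.text.append(text);
            }
        }
    }

    // Lookup by region index, in the order regions were added.
    String textAt(int index) {
        return regions.get(index).text.toString();
    }

    // Lookup by field name; returns null when no such region was defined.
    String textFor(String name) {
        for (Region r : regions) {
            if (r.name.equals(name)) {
                return r.text.toString();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        RegionSketch form = new RegionSketch();
        form.addRegion("address", 50, 600, 200, 40);
        form.addRegion("ID No.", 400, 700, 100, 20);

        form.offer(60, 620, "12 Elm St");
        form.offer(120, 620, "Springfield");
        form.offer(410, 705, "A-553");
        form.offer(300, 100, "Page 1 of 3"); // outside every region, so ignored

        System.out.println(form.textFor("address")); // prints "12 Elm St Springfield"
        System.out.println(form.textFor("ID No."));  // prints "A-553"
    }
}
```

The sketch mirrors the workflow described above: define the regions once per document type, feed each page through, then read the values back by index or by field name. Text that falls outside every region is simply dropped.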