Extracting PDF text from only specific areas
With some minimal configuration, you can extract text content from only specific areas of a source PDF document.
com.snowtide.pdf.RegionOutputTarget is a
com.snowtide.pdf.OutputHandler implementation that allows you
to define areas on a page where the desired content is positioned. Once
a page is piped through the
RegionOutputTarget instance, the content
found in each specified region is available by region index from
com.snowtide.pdf.RegionOutputTarget.getRegionText(int), or
by a user-defined "field" name via
RegionOutputTarget.getRegionText(String).
This is particularly useful when extracting form data where each field
is known to be positioned in a specific location (perhaps on a specific
page in each document of a particular type): each form field's
coordinates can be defined in the
RegionOutputTarget and given a name
("address" or "ID No." or "field-553" -- whatever is appropriate
for your application). Then that
RegionOutputTarget can be queried for the
text values of those fields after a PDF Page has been piped through it.
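The idea of named regions can be illustrated with a small, self-contained sketch. It does not use PDFTextStream: NamedRegionSketch, its addRegion and accept methods, the coordinates, and the sample text are all hypothetical stand-ins, chosen only to show how positioned text might be sorted into named rectangular areas and then queried by field name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Conceptual sketch only: these types are hypothetical and are NOT part of
 * PDFTextStream. They mimic the idea of collecting text by named region.
 */
final class NamedRegionSketch {

    /** A rectangular area in page coordinates plus the text collected in it. */
    private static final class Region {
        final double x, y, width, height;
        final StringBuilder text = new StringBuilder();

        Region(double x, double y, double width, double height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }

        boolean contains(double px, double py) {
            return px >= x && px <= x + width && py >= y && py <= y + height;
        }
    }

    // Insertion-ordered so regions keep the order in which they were defined.
    private final Map<String, Region> regions = new LinkedHashMap<>();

    /** Defines a named region, e.g. "address" or "ID No.". */
    void addRegion(String name, double x, double y, double width, double height) {
        regions.put(name, new Region(x, y, width, height));
    }

    /** Receives one positioned piece of page text; text outside every region is dropped. */
    void accept(double x, double y, String text) {
        for (Region r : regions.values()) {
            if (r.contains(x, y)) {
                if (r.text.length() > 0) {
                    r.text.append(' ');
                }
                r.text.append(text);
            }
        }
    }

    /** Returns the text collected so far for the named region. */
    String getRegionText(String name) {
        Region r = regions.get(name);
        if (r == null) {
            throw new IllegalArgumentException("Unknown region: " + name);
        }
        return r.text.toString();
    }

    public static void main(String[] args) {
        NamedRegionSketch form = new NamedRegionSketch();
        form.addRegion("address", 50, 700, 200, 40);  // made-up coordinates
        form.addRegion("ID No.", 400, 700, 100, 20);

        // Simulated page content; a real page would supply these positions.
        form.accept(60, 710, "12 Example");
        form.accept(120, 710, "Street");
        form.accept(410, 705, "A-1001");
        form.accept(300, 300, "text outside every region");

        System.out.println(form.getRegionText("address")); // 12 Example Street
        System.out.println(form.getRegionText("ID No."));  // A-1001
    }
}
```

In this sketch the regions are defined once, up front, and each piece of text is tested against every region as it arrives, so the same set of region definitions could be reused for every document of a given form type.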
Some example code is available in RegionOutputTarget's API
reference.