Restricting PDF text extraction to only specific coordinates
Yes, PDFTextStream can be used to extract content from specified coordinates on one or many pages in a document.
com.snowtide.pdf.RegionOutputTarget is a
com.snowtide.pdf.OutputHandler implementation that allows you to define
areas on a page where the desired content is positioned. Once a page is
piped through the RegionOutputTarget instance, the content found in the
specified regions is then available from that instance, optionally mapped
to user-defined names.
This is particularly handy for extracting form data where each field is
known to be positioned in a specific location (perhaps on a specific page in
each document of a particular type): each form field's coordinates can be
defined in the RegionOutputTarget and given a name ("ID No.", for example,
or whatever is appropriate for your application). Then that
RegionOutputTarget can be queried for the text values of
those fields after a PDF Page has been piped through it.
Some example code is available on RegionOutputTarget's API reference page.
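
The region idea described above can also be pictured with a small self-contained Java sketch. It does not use PDFTextStream: the RegionSketch, TextUnit, and NamedRegions names, their methods, the "Surname" field, and the coordinate convention (origin at the bottom-left of the page) are all hypothetical stand-ins chosen for illustration, and the real RegionOutputTarget API may differ. It only models the behaviour: regions are registered by name, page content is passed through, and each region's text is then looked up by name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of named-region extraction; NOT the PDFTextStream API.
class RegionSketch {

    /** A piece of text anchored at a page coordinate (assumed: origin at bottom-left). */
    static final class TextUnit {
        final float x, y;
        final String text;
        TextUnit(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Collects text that falls inside named rectangular regions. */
    static final class NamedRegions {
        private static final class Region {
            final float x, y, w, h;
            final StringBuilder text = new StringBuilder();
            Region(float x, float y, float w, float h) {
                this.x = x; this.y = y; this.w = w; this.h = h;
            }
            boolean contains(TextUnit u) {
                return u.x >= x && u.x <= x + w && u.y >= y && u.y <= y + h;
            }
        }

        private final Map<String, Region> regions = new LinkedHashMap<>();

        /** Registers a rectangle (lower-left corner, width, height) under a name. */
        void addRegion(float x, float y, float w, float h, String name) {
            regions.put(name, new Region(x, y, w, h));
        }

        /** Stand-in for piping a page through the target. */
        void accept(List<TextUnit> pageContent) {
            for (TextUnit u : pageContent) {
                for (Region r : regions.values()) {
                    if (r.contains(u)) {
                        if (r.text.length() > 0) r.text.append(' ');
                        r.text.append(u.text);
                    }
                }
            }
        }

        /** Returns the text collected for a region, or null if the name is unknown. */
        String textOf(String name) {
            Region r = regions.get(name);
            return r == null ? null : r.text.toString();
        }
    }

    public static void main(String[] args) {
        NamedRegions fields = new NamedRegions();
        fields.addRegion(100, 700, 200, 20, "ID No.");
        fields.addRegion(100, 650, 200, 20, "Surname"); // hypothetical second field

        List<TextUnit> page = new ArrayList<>();
        page.add(new TextUnit(110, 705, "A-1234"));
        page.add(new TextUnit(110, 655, "Garcia"));
        page.add(new TextUnit(400, 100, "Page 1 of 3")); // outside every region

        fields.accept(page);
        System.out.println(fields.textOf("ID No."));
        System.out.println(fields.textOf("Surname"));
    }
}
```

Keeping the regions keyed by name means the calling code never has to repeat coordinates once a form layout has been described; it only asks for "ID No." or whichever names suit the application.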