Restricting PDF text extraction to only specific coordinates

PDFTextStream can be used to extract content from particular coordinates on one or many pages in a document.

com.snowtide.pdf.RegionOutputTarget is a com.snowtide.pdf.OutputHandler implementation that lets you define the areas on a page where the desired content is positioned. Once a page has been piped through the RegionOutputTarget instance, the content found in each region is available by index from com.snowtide.pdf.RegionOutputTarget.getRegionText(int), or by user-defined "field" name from RegionOutputTarget.getRegionText(String).

This is particularly useful when extracting form data where each field is known to sit at a specific location (perhaps on a specific page in each document of a particular type). Each form field's coordinates can be defined in the RegionOutputTarget and given a name ("address", "ID No.", "field-553", or whatever is appropriate for your application). That RegionOutputTarget can then be queried for the text values of those fields after a PDF Page has been piped through it.
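
To make the mechanism concrete, here is a minimal, self-contained Java sketch of region-based extraction. It does not use or reproduce the PDFTextStream API: RegionCollector, TextUnit, addRegion, accept, textFor, and textAt are hypothetical names invented for illustration, and the coordinates and sample text are made up. It only models the idea described above: named rectangles are registered, positioned text is passed through, and each region's text can then be read back by name or by index.

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual model only. This is NOT the PDFTextStream API;
// every class and method name here is hypothetical.
class RegionCollector {

    /** A piece of text plus the page coordinates of its anchor point. */
    static final class TextUnit {
        final String text;
        final float x, y;

        TextUnit(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** A named rectangle and the text collected inside it so far. */
    private static final class Region {
        final String name;
        final float x, y, width, height;
        final StringBuilder text = new StringBuilder();

        Region(String name, float x, float y, float width, float height) {
            this.name = name;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }

        boolean contains(float px, float py) {
            return px >= x && px <= x + width && py >= y && py <= y + height;
        }
    }

    private final List<Region> regions = new ArrayList<>();

    /** Register a named rectangular area of interest. */
    void addRegion(String name, float x, float y, float width, float height) {
        regions.add(new Region(name, x, y, width, height));
    }

    /** Route each text unit to every region whose rectangle contains it. */
    void accept(List<TextUnit> page) {
        for (TextUnit unit : page) {
            for (Region r : regions) {
                if (r.contains(unit.x, unit.y)) {
                    if (r.text.length() > 0) {
                        r.text.append(' ');
                    }
                    r.text.append(unit.text);
                }
            }
        }
    }

    /** Text collected for the region at the given registration index. */
    String textAt(int index) {
        return regions.get(index).text.toString();
    }

    /** Text collected for the named region, or null if no such region exists. */
    String textFor(String name) {
        for (Region r : regions) {
            if (r.name.equals(name)) {
                return r.text.toString();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        RegionCollector collector = new RegionCollector();
        collector.addRegion("address", 50f, 700f, 200f, 40f);
        collector.addRegion("ID No.", 400f, 700f, 100f, 20f);

        List<TextUnit> page = new ArrayList<>();
        page.add(new TextUnit("12", 60f, 710f));
        page.add(new TextUnit("Main", 80f, 710f));
        page.add(new TextUnit("St.", 110f, 710f));
        page.add(new TextUnit("A-553", 410f, 705f));
        page.add(new TextUnit("Page 1", 300f, 20f)); // outside every region, so ignored

        collector.accept(page);
        System.out.println(collector.textFor("address")); // prints "12 Main St."
        System.out.println(collector.textAt(1));          // prints "A-553"
    }
}
```

Keeping the region definitions separate from the page content is what makes the approach suit batches of same-layout forms: the same set of named rectangles can be applied to the relevant page of each document.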

Some example code is available on RegionOutputTarget's API reference page.

PDFxStream v3.5.0 Technical Documentation
