Extracting and updating PDF form data
Forms are used in virtually every industry and environment to efficiently collect data from individuals, but paper forms have frequently represented the worst of modern institutions – bureaucracy, unresponsiveness, and inflexibility.
The interactive form features offered by PDF document technology are helping to ease the handling of forms and form data by eliminating the need for paper forms, enabling user-friendly entry of form data, and providing for the efficient extraction of that data after a form is submitted. PDFTextStream supports both the extraction of form data from PDF documents and the generation of PDF documents with updated form field data.
Take, for example, a form that is ubiquitous and known to all within the United States, the dreaded IRS Form 1040:

The image above is taken from the PDF version of the 1040, which faithfully reproduces the appearance of the 1040 as the U.S. Government prints it each year. It could be printed, filled in by hand, and submitted by mail. However, if it is opened in a forms-capable PDF viewer (like Adobe Acrobat), then each field within the form becomes available for user input, like so:

So, once the form is filled out within the PDF viewer, it could be submitted electronically to the IRS, which could then extract all of the data from the form programmatically instead of employing thousands to tediously enter such data by hand from paper copies of the form. PDFTextStream could be used for this extraction task; here’s what it would "see" when presented with the 1040 form:

Notice that each form field has a name; for example, the city/town/state/zip field has the name f1-10. Each field’s name is unique within the form, making it very easy to access only particular form field elements and values.
Extracting PDF form data
Here’s a code sample using PDFTextStream where the main name and address information is extracted from the 1040 form and associated with application-specific names:
    public static Map get1040Data (PDFTextStream pdfts_1040) throws IOException {
        com.snowtide.pdf.forms.Form form = pdfts_1040.getFormData();
        HashMap data = new HashMap();

        com.snowtide.pdf.forms.FormField field = form.getField("f1-4");
        data.put("first_name", field.getValue());

        field = form.getField("f1-5");
        data.put("last_name", field.getValue());

        field = form.getField("f1-8");
        data.put("address", field.getValue());

        field = form.getField("f1-10");
        data.put("city_state_zip", field.getValue());

        return data;
    }
A com.snowtide.pdf.forms.Form object contains references to all of the form field elements included in the PDF form, mapped to each form field’s full, unique name. Specifically, that Form object is a com.snowtide.pdf.forms.AcroForm instance; the AcroForm subinterface guarantees that all fields it contains implement the com.snowtide.pdf.forms.AcroFormField interface.
All Form objects provide methods for iterating over all of the available form fields (iterator()), getting an Enumeration of all the names of a form’s FormFields (getFieldNames()), and getting a particular FormField instance using its unique name (getField(String)).
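The access paths above lend themselves to a simple "dump everything" helper. The following is a minimal sketch written against JDK types only, so that it compiles without PDFTextStream on the classpath. FormDump, toSortedMap, and the valueLookup parameter are hypothetical names introduced here for illustration; the lookup function stands in for a call such as form.getField(name).getValue().

```java
import java.util.Enumeration;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

/** Hypothetical helper for illustration; not part of PDFTextStream. */
final class FormDump {
    private FormDump() {}

    /**
     * Walks an enumeration of field names and pairs each name with whatever
     * the lookup function returns for it. The result is sorted by field name.
     */
    static Map<String, Object> toSortedMap(Enumeration<?> fieldNames,
                                           Function<String, Object> valueLookup) {
        Map<String, Object> result = new TreeMap<>();
        while (fieldNames.hasMoreElements()) {
            // Assumption: the enumeration yields field names as strings.
            String name = String.valueOf(fieldNames.nextElement());
            result.put(name, valueLookup.apply(name));
        }
        return result;
    }
}
```

With an open form, a call might look like FormDump.toSortedMap(form.getFieldNames(), name -> form.getField(name).getValue()), assuming those methods declare no checked exceptions.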
The forms extraction API presents a fundamentally simple name/value mapping, and is therefore conceptually very similar to the document metadata extraction API. This is especially true of text-based form fields, represented by com.snowtide.pdf.forms.AcroTextField objects, whose getValue() method will always return a String of the retained contents of the form field.
Export and display values
Non-text form fields such as button fields (represented by com.snowtide.pdf.forms.AcroButtonField objects) and choice fields (represented by com.snowtide.pdf.forms.AcroChoiceField objects) are slightly more complex.
AcroButtonFields have a variety of subtypes, principally checkboxes (com.snowtide.pdf.forms.AcroCheckboxField) and radio button groups (com.snowtide.pdf.forms.AcroRadioButtonGroupField). These kinds of widgets are quite familiar to users of web browsers, which have analogous form entry elements. However, since these form fields are primarily visual in nature, their retained values are visually oriented as well: the getValue() method of all AcroButtonFields will return a String code indicating how a PDF viewer should display the field’s widget.
In most cases, this code will have no meaning to an extracting application, so many PDF forms specify export values that correspond to each potential display code and that typically describe the field’s selected widget. All export values known for a particular field are available via the getExportValues() method; the single export value associated with a field’s current value (display code) is available via the getExportValue() method.
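Since not every form defines export values, an extracting application may want to prefer the export value and fall back to the raw display code. The sketch below shows one way to do that; ButtonValues and meaningfulValue are hypothetical names, and the code assumes that a missing export value shows up as null or an empty string, which the text above does not specify.

```java
/** Hypothetical helper for illustration; not part of PDFTextStream. */
final class ButtonValues {
    private ButtonValues() {}

    /**
     * Returns the export value when one is present, otherwise the display
     * code. Assumption: "no export value" is reported as null or "".
     */
    static String meaningfulValue(String displayCode, String exportValue) {
        if (exportValue != null && !exportValue.isEmpty()) {
            return exportValue;
        }
        return displayCode;
    }
}
```

The two arguments would typically come from a button field’s getValue() and getExportValue() methods.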
AcroChoiceFields have a different design, which is similar to how dropdown choice widgets and their values are described in HTML documents. Each choice available in an AcroChoiceField is a pairing of values: one is an export value, which is typically used in programmatic extraction and/or submission of form data, and the other is an associated display value that is shown to the user when inputting or viewing form data.
When an AcroChoiceField allows only one selection (as indicated by the allowsMultipleChoices() method), the getValue() function provided by AcroChoiceFields will return the field’s export value. The corresponding display value is available via the getDisplayValue(String) function. When multiple selections are allowed in an AcroChoiceField, the getValue() function can return an Object[] containing String export values.
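Because a choice field’s value can be either a single export value or an array of them, calling code usually benefits from normalizing both shapes into one list. A minimal sketch follows; ChoiceValues and asExportValues are hypothetical names, and treating null as "nothing selected" is an assumption rather than documented behavior.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper for illustration; not part of PDFTextStream. */
final class ChoiceValues {
    private ChoiceValues() {}

    /**
     * Normalizes a choice field value into a list of export values:
     * a single value becomes a one-element list, an Object[] becomes a
     * list of its non-null elements, and null becomes an empty list.
     */
    static List<String> asExportValues(Object value) {
        if (value == null) {
            return Collections.emptyList();
        }
        if (value instanceof Object[]) {
            List<String> selected = new ArrayList<>();
            for (Object item : (Object[]) value) {
                if (item != null) {
                    selected.add(item.toString());
                }
            }
            return selected;
        }
        return Collections.singletonList(value.toString());
    }
}
```

Passing the result of a choice field’s getValue() to asExportValues lets the rest of the application ignore whether the field allows multiple selections.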
Finally, in some cases, an AcroChoiceField’s value may be arbitrarily set by the user. If this is possible, the field’s isEditable() function will return true, and the String returned by the getValue() function may not yield any associated display value via the getDisplayValue(String) function.
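For editable choice fields, a display lookup that comes back empty is therefore not an error. One way to handle it is to fall back to the raw value, as sketched below. ChoiceDisplay and displayOrRaw are hypothetical names, the lookup function stands in for a call such as field.getDisplayValue(value), and the sketch assumes that an unmatched value is reported as null.

```java
import java.util.function.Function;

/** Hypothetical helper for illustration; not part of PDFTextStream. */
final class ChoiceDisplay {
    private ChoiceDisplay() {}

    /**
     * Returns the display value for an export value, or the export value
     * itself when the lookup has no match (for example, free-form user
     * input in an editable choice field).
     */
    static String displayOrRaw(String exportValue,
                               Function<String, String> displayLookup) {
        if (exportValue == null) {
            return null;
        }
        String display = displayLookup.apply(exportValue);
        return display != null ? display : exportValue;
    }
}
```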
Updating form field values
PDFTextStream also supports the generation of PDF documents containing updated interactive form field values. This is supported for text, checkbox, radio button group, and choice form fields. This feature may be used to support a user-centric forms update process, as well as to drive an automated forms generation system, where (for example) template PDF form documents are customized with customers’ specific information prior to being delivered or archived.
The actual update process is very simple:
- Retrieve the form fields to be updated
- Set new values on each form field (typically using AcroFormField.setValue(String), although some form fields, such as AcroCheckboxField, have specialized value setters)
- Finally, call AcroForm.writeUpdatedDocument(File) (or AcroForm.writeUpdatedDocument(OutputStream) if you want to redirect the PDF document data somewhere other than a file) to write out a copy of the open PDF document that contains the updated form field data.
An instance of this procedure is shown below, continuing with our use of the IRS Form 1040 as an example:
    public static void update1040Data (PDFTextStream pdfts_1040, String firstName, String lastName,
            String address, String city_state_zip, File updatePath) throws IOException {
        AcroForm form = (AcroForm)pdfts_1040.getFormData();

        AcroTextField field = (AcroTextField)form.getField("f1-4");
        field.setValue(firstName);

        field = (AcroTextField)form.getField("f1-5");
        field.setValue(lastName);

        field = (AcroTextField)form.getField("f1-8");
        field.setValue(address);

        field = (AcroTextField)form.getField("f1-10");
        field.setValue(city_state_zip);

        form.writeUpdatedDocument(updatePath);
    }
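For automated generation, where many fields are filled from customer records, the same procedure can be driven by a map of field names to new values instead of one statement per field. The following is a minimal sketch using JDK types only; FieldUpdates, applyAll, and the setter parameter are hypothetical names, and the setter stands in for a call such as ((AcroTextField) form.getField(name)).setValue(value).

```java
import java.util.Map;
import java.util.function.BiConsumer;

/** Hypothetical helper for illustration; not part of PDFTextStream. */
final class FieldUpdates {
    private FieldUpdates() {}

    /**
     * Applies each (field name, new value) pair through the given setter
     * and returns how many fields were updated. Entries with a null value
     * are skipped so that missing data does not overwrite existing values.
     */
    static int applyAll(Map<String, String> newValues,
                        BiConsumer<String, String> setter) {
        int applied = 0;
        for (Map.Entry<String, String> entry : newValues.entrySet()) {
            if (entry.getValue() == null) {
                continue;
            }
            setter.accept(entry.getKey(), entry.getValue());
            applied++;
        }
        return applied;
    }
}
```

After applyAll returns, the updated document would still be written with writeUpdatedDocument as described above. If the field setter declares a checked exception, the lambda passed as setter would need to wrap it.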
Accessing XFA PDF forms
In addition to the (now "legacy") interactive PDF forms, the PDF specification now includes support for XFA PDF forms. XFA is a way to represent forms data using XML, which makes it very easy to support form data interchange.
PDFTextStream allows you to access the XML documents that comprise a PDF document’s XFA forms, which you can then query or process to meet your specific application requirements. Doing this is very simple, and builds upon PDFTextStream’s existing interactive form data API. In the example below, we’ll retrieve the XML document (as a byte array) that contains the XFA form’s current values:
    public static byte[] getXFADatasets (PDFTextStream stream) throws IOException {
        AcroForm form = (AcroForm)stream.getFormData();
        return form.getXFAPacketContents("datasets");
    }
Further, we can access the full set of XFA form data in a PDF document using the getXFAContents() method on AcroForm. These values can be fed into any existing XML libraries or tools to support XFA form data extraction, mapping of the form data to databases, or whatever else your application requires.
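As an illustration of feeding the XML into standard tools, the sketch below parses a datasets packet (a byte array such as the one returned by getXFAPacketContents("datasets")) with the JDK’s built-in DOM parser and collects the text of every leaf element. XfaDatasets and leafValues are hypothetical names, and the flat name-to-value result is a simplifying assumption: real XFA data can nest and repeat element names, in which case later values overwrite earlier ones here.

```java
import java.io.ByteArrayInputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical helper for illustration; not part of PDFTextStream. */
final class XfaDatasets {
    private XfaDatasets() {}

    /** Maps each leaf element's local name to its trimmed text content. */
    static Map<String, String> leafValues(byte[] datasetsXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Refuse DOCTYPE declarations, since the XML comes from an untrusted PDF.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(datasetsXml));
            Map<String, String> values = new LinkedHashMap<>();
            collect(doc.getDocumentElement(), values);
            return values;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XFA datasets XML", e);
        }
    }

    /** Recurses through child elements; elements without element children are leaves. */
    private static void collect(Element element, Map<String, String> values) {
        boolean hasChildElement = false;
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                hasChildElement = true;
                collect((Element) child, values);
            }
        }
        if (!hasChildElement) {
            values.put(element.getLocalName(), element.getTextContent().trim());
        }
    }
}
```

An application that needs the full structure would work with the DOM Document directly, or use XPath, rather than flattening it this way.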