Applies to:
PDFFormStream
Extracting and updating PDF form data
Forms are used in virtually every industry and environment to efficiently collect data from individuals, but paper forms have frequently represented the worst of modern institutions – bureaucracy, unresponsiveness, and inflexibility.
The interactive form features offered by PDF document technology are helping to ease the handling of forms and form data by eliminating the need for paper forms, enabling user-friendly entry of form data, and providing for the efficient extraction of that data after a form is submitted. PDFxStream supports both the extraction of form data from PDF documents and the generation of PDF documents with updated form field data.
Take, for example, a form that is ubiquitous and known to all within the United States, the dreaded IRS Form 1040:

The image above is taken from the PDF version of the 1040, which faithfully reproduces the appearance of the 1040 as the U.S. Government prints it each year. It could be printed, filled in by hand, and submitted by mail. However, if opened in a forms-capable PDF viewer (like Adobe Acrobat), each field within the form becomes available for user input, like so:

So, once the form is filled out within the PDF viewer, it could be submitted electronically to the IRS, which could then extract all of the data from the form programmatically instead of employing thousands to tediously enter such data by hand from paper copies of the form. PDFxStream could be used for this extraction task; here’s what it would "see" when presented with the 1040 form:

Notice that each form field has a name; for example, the city/town/state/zip field has the name f1-10. Each field’s name is unique within the form, making it very easy to access only particular form field elements and values.
Extracting PDF form data
Here’s a code sample using PDFxStream where the main name and address information is extracted from the 1040 form and associated with application-specific names:
public static Map<String,String> get1040Data (Document pdfts_1040) throws IOException {
    com.snowtide.pdf.forms.Form form = pdfts_1040.getFormData();
    HashMap<String,String> data = new HashMap<String,String>();

    // Map the form's own field names to application-specific names
    data.put("first_name", form.getField("f1-4").getValue());
    data.put("last_name", form.getField("f1-5").getValue());
    data.put("address", form.getField("f1-8").getValue());
    data.put("city_state_zip", form.getField("f1-10").getValue());
    return data;
}
A com.snowtide.pdf.forms.Form contains references to all of the fields included in the PDF form, each mapped to its full, unique name. Specifically, that Form object is a com.snowtide.pdf.forms.AcroForm instance; the AcroForm subinterface guarantees that all fields it contains implement the com.snowtide.pdf.forms.AcroFormField interface. All Form objects provide methods for iterating over all of the available form fields, getting a collection of all the names of a form’s fields (com.snowtide.pdf.forms.Form.getFieldNames()), and getting a particular com.snowtide.pdf.forms.FormField instance using its unique name.
The forms extraction API presents a fundamentally simple name/value mapping, and is therefore conceptually very similar to the document metadata API. This is especially true with regard to text-based form fields, represented by com.snowtide.pdf.forms.AcroTextField objects, whose com.snowtide.pdf.forms.AcroTextField#getValue() method will always return a String of the retained contents of the form field.
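Because extraction reduces to a name/value lookup, the mapping from PDF field names to application-specific names can be held in a table instead of being written out call by call. The following is a minimal sketch using only the JDK. FieldRemapper and remap are hypothetical names, not part of PDFxStream, and the lookup function stands in for whatever retrieves a field's value (for instance, a call to getField(name).getValue() on a Form).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper for illustration; not part of the PDFxStream API.
final class FieldRemapper {
    private FieldRemapper() {}

    /**
     * Builds a map keyed by application-specific names.
     *
     * @param nameMap PDF field name -> application-specific name
     * @param lookup  returns the value for a PDF field name, or null if none
     */
    public static Map<String, String> remap(Map<String, String> nameMap,
                                            Function<String, String> lookup) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : nameMap.entrySet()) {
            String value = lookup.apply(entry.getKey());
            if (value != null) {
                // Fields without a value are simply left out of the result
                out.put(entry.getValue(), value);
            }
        }
        return out;
    }
}
```

Keeping the name table separate from the extraction call means a revised form with renamed fields only requires a change to the table.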
Export and display values
Non-text form fields such as button fields (represented by com.snowtide.pdf.forms.AcroButtonField objects) and choice fields (represented by com.snowtide.pdf.forms.AcroChoiceField objects) are slightly more complex.
AcroButtonFields have a variety of subtypes, principally checkboxes (com.snowtide.pdf.forms.AcroCheckboxField) and radio button groups (com.snowtide.pdf.forms.AcroRadioButtonGroupField). These kinds of widgets are quite familiar to users of web browsers, which have analogous form entry elements. However, since these form fields are primarily visual in nature, their retained values are visually oriented as well: the getValue() method of all AcroButtonFields will return a String code indicating how a PDF viewer should display the field’s widget.
In most cases, this code will have no meaning to an extracting application, so many PDF forms specify export values that correspond to each potential display code and that typically describe the field’s selected widget. All export values known for a particular field are available via the com.snowtide.pdf.forms.AcroButtonField.getExportValues() method; the single export value associated with a field’s current value (display code) is available via the com.snowtide.pdf.forms.AcroButtonField.getExportValue() method.
AcroChoiceFields have a different design, which is similar to how dropdown choice widgets and their values are described in HTML documents. Each choice available in an AcroChoiceField is a pairing of values: one is an export value, which is typically used in programmatic extraction and/or submission of form data, and the other is an associated display value that is shown to the user when inputting or viewing form data.
When an AcroChoiceField allows only one selection (that is, com.snowtide.pdf.forms.AcroChoiceField.allowsMultipleChoices() returns false), its com.snowtide.pdf.forms.AcroChoiceField.getValue() method will return the export value of the selected choice. The corresponding display value is available via com.snowtide.pdf.forms.AcroChoiceField.getDisplayValue(String).
When multiple selections are allowed in an AcroChoiceField, AcroChoiceField.getValue() can return an Object[] containing String export values.
Finally, in some cases, an AcroChoiceField’s value may be arbitrarily set by the user. If this is possible, the field’s com.snowtide.pdf.forms.AcroChoiceField.isEditable() method will return true, and the String returned by com.snowtide.pdf.forms.AcroChoiceField#getValue() may not be associated with any display value provided by com.snowtide.pdf.forms.AcroChoiceField#getDisplayValue(java.lang.String).
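Since a choice field's value may be either a single String or an Object[] of Strings, calling code often benefits from normalizing both shapes into one. The sketch below shows one way to do that with only the JDK. ChoiceValues and asList are hypothetical names, not part of PDFxStream, and the argument is assumed to be whatever getValue() returned for the field.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration; not part of the PDFxStream API.
final class ChoiceValues {
    private ChoiceValues() {}

    /**
     * Normalizes a choice field value into a list of export values.
     * Accepts null (no selection), a single value, or an Object[] of values.
     */
    public static List<String> asList(Object value) {
        if (value == null) {
            return Collections.emptyList();
        }
        if (value instanceof Object[]) {
            List<String> out = new ArrayList<>();
            for (Object item : (Object[]) value) {
                if (item != null) {
                    out.add(item.toString());
                }
            }
            return out;
        }
        // Single-selection case: wrap the lone export value
        return Collections.singletonList(value.toString());
    }
}
```

Each export value in the resulting list can then be passed to getDisplayValue(String) to obtain its display text, keeping in mind that an editable field may hold a value with no associated display value.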
Updating form field values
PDFxStream also supports the generation of PDF documents containing updated interactive form field values. This is supported for text, checkbox, radio button group, and choice form fields. This feature may be used to support a user-centric forms update process, as well as to drive an automated forms generation system, where (for example) template PDF form documents are customized with customers’ specific information prior to being delivered or archived.
The actual update process is very simple:
- Retrieve the form fields to be updated
- Set new values on each form field (typically using com.snowtide.pdf.forms.AcroFormField.setValue(String), although some form fields have specialized value setters, such as com.snowtide.pdf.forms.AcroCheckboxField)
- Finally, call com.snowtide.pdf.forms.AcroForm#writeUpdatedDocument(java.lang.String) (or AcroForm.writeUpdatedDocument(OutputStream) if you want to redirect the PDF document data somewhere other than a file) to write out a copy of the open PDF document that contains the updated form field data.
An instance of this procedure is shown below, continuing with our use of the IRS Form 1040 as an example:
public static void update1040Data (Document pdf_1040, String firstName, String lastName,
        String address, String city_state_zip, String updatePath) throws IOException {
    AcroForm form = (AcroForm)pdf_1040.getFormData();

    // Retrieve each text field by its unique name and set its new value
    AcroTextField field = (AcroTextField)form.getField("f1-4");
    field.setValue(firstName);
    field = (AcroTextField)form.getField("f1-5");
    field.setValue(lastName);
    field = (AcroTextField)form.getField("f1-8");
    field.setValue(address);
    field = (AcroTextField)form.getField("f1-10");
    field.setValue(city_state_zip);

    // Write a copy of the open document containing the updated values
    form.writeUpdatedDocument(updatePath);
}
Accessing XFA PDF forms
In addition to (now "legacy") interactive PDF forms, the PDF specification includes support for XFA PDF forms. XFA is a way to represent forms data using XML, which makes it very easy to support form data interchange.
PDFxStream allows you to access the XML documents that comprise a PDF document’s XFA forms, which you can then query or process to meet your specific application requirements. Doing this is very simple, and builds upon PDFxStream’s existing interactive form data API. In the example below, we’ll retrieve the XML document (as a byte array) that contains the XFA form’s current values:
public static byte[] getXFADatasets (Document pdf) throws IOException {
    AcroForm form = (AcroForm)pdf.getFormData();
    // "datasets" is the XFA packet that holds the form's current values
    return form.getXFAPacketContents("datasets");
}
Further, we can access the full set of XFA form data in a PDF document using com.snowtide.pdf.forms.AcroForm.getXFAContents(). These values can be fed into any existing XML libraries or tools to support XFA form data extraction, mapping of the form data to databases, or whatever else your application requires.
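As one illustration of feeding the XML into standard tooling, the sketch below parses a byte array with the JDK's built-in DOM parser and collects the text of every leaf element into a map. XfaLeafValues and leafValues are hypothetical names, not part of PDFxStream, and the flat name-to-text result is a simplifying assumption: real XFA data can be nested and can repeat element names, in which case later values overwrite earlier ones here.

```java
import java.io.ByteArrayInputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of the PDFxStream API.
final class XfaLeafValues {
    private XfaLeafValues() {}

    /** Returns leaf element local name -> trimmed text content. */
    public static Map<String, String> leafValues(byte[] xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Reject DOCTYPE declarations, since the PDF may be untrusted input
            factory.setFeature(
                "http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new ByteArrayInputStream(xml));
            Map<String, String> out = new LinkedHashMap<>();
            collect(doc.getDocumentElement(), out);
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Depth-first walk; records only elements that have no child elements
    private static void collect(Element element, Map<String, String> out) {
        boolean hasChildElement = false;
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child instanceof Element) {
                hasChildElement = true;
                collect((Element) child, out);
            }
        }
        if (!hasChildElement) {
            out.put(element.getLocalName(), element.getTextContent().trim());
        }
    }
}
```

The byte array returned by getXFAPacketContents("datasets") could be passed straight to leafValues; an application that needs the full hierarchy would work with the DOM tree or XPath queries instead of flattening it.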