Extracting and updating PDF form data
Forms are used in virtually every industry and environment to efficiently collect data from individuals, but paper forms have frequently represented the worst of modern institutions -- bureaucracy, unresponsiveness, and inflexibility.
The interactive form features offered by PDF document technology are helping to ease the handling of forms and form data by eliminating the need for paper forms, enabling user-friendly entry of form data and information, and providing for the efficient extraction of that data and information after a form is submitted. PDFxStream supports both the extraction of form data from PDF documents as well as the generation of PDF documents with updated form field data.
Take, for example, a form that is ubiquitous and known to all within the United States, the dreaded IRS Form 1040:
The image above is taken from the PDF version of the 1040, which faithfully reproduces the appearance of the 1040 as the U.S. Government prints it each year. It could be printed, filled in by hand, and submitted by mail. However, if opened by a PDF viewer (like Adobe Acrobat) that is forms- capable, then each field within the form becomes available to user input, like so:
So, once the form is filled out within the PDF viewer, it could be submitted electronically to the IRS, which could then extract all of the data from the form programmatically instead of employing thousands to tediously enter such data by hand from paper copies of the form. PDFxStream could be used for this extraction task; here's what it would "see" when presented with the 1040 form:
Notice that each form field has a name -- for example, the
city/town/state/zip field has the name f1
- f10
. Each field's name
is unique within the form, making it very easy to access only particular
form field elements and values.
Extracting PDF form data
Here's a code sample using PDFxStream where the main name and address information is extracted from the 1040 form and associated with application-specific names:
public static Map<String,String> get1040Data (Document pdfts_1040) throws IOException {
com.snowtide.pdf.forms.Form form = pdfts_1040.getFormData();
HashMap<String,String> data = new HashMap<String,String>();
data.put("first_name", form.getField("f1-4").getValue());
data.put("last_name", form.getField("f1-5").getValue());
data.put("address", form.getField("f1-8").getValue());
data.put("city_state_zip", form.getField("f1-10").getValue());
return data;
}
A com.snowtide.pdf.forms.Form
contains references to all of
the fields included in the PDF form, mapped to each form fields' full,
unique name. Specifically, that Form
object is a com.snowtide.pdf.forms.AcroForm
instance; the
AcroForm
subinterface guarantees that
all fields it contains implement the
com.snowtide.pdf.forms.AcroFormField
interface. All
Form
objects provide methods for
iterating over all of the available form fields, getting a collection of
all the names of a form's fields
(com.snowtide.pdf.forms.Form.getFieldNames()
), and getting a
particular com.snowtide.pdf.forms.FormField
instance using its
unique name.
The forms extraction API presents a fundamentally simple name/value
mapping, and is therefore conceptually very similar to the document
metadata API. This is especially true
with regard to text-based form fields, represented by
com.snowtide.pdf.forms.AcroTextField
s, whose
com.snowtide.pdf.forms.AcroTextField.getValue()
method will always
return a String
of the retained contents of the form field.
Export and display values
Nontext form fields such as button fields (represented by
com.snowtide.pdf.forms.AcroButtonField
objects) and choice
fields (represented by com.snowtide.pdf.forms.AcroChoiceField
objects) have slightly more complex aspects.
AcroButtonField
s have a variety of
subtypes -- principally, checkboxes
(com.snowtide.pdf.forms.AcroCheckboxField
) and radio button
groups (com.snowtide.pdf.forms.AcroRadioButtonGroupField
).
These kinds of widgets are quite familiar to users of web browsers,
which have analogous form entry elements. However, since these form
fields are primarily visual in nature, their retained values are
visually-oriented as well -- the getValue()
method of all
AcroButtonField
s will return a String
code indicating how a PDF viewer should display the field's widget.
In most cases, this code will have no meaning to an extracting
application, so many PDF document forms will specify export values that
correspond to each potential display code, and likely describe the
field's selected widget. All export values known for a particular field
are available via the
com.snowtide.pdf.forms.AcroButtonField.getExportValues()
method; the single export value associated with a field's current value
(display code) is available via the
com.snowtide.pdf.forms.AcroButtonField.getExportValue()
method.
AcroChoiceField
s have a different
design, which is similar to how dropdown choice widgets and their values
are described in HTML documents. Each choice available in an
AcroChoiceField
is a pairing of values:
one is an export value, which is typically used in programmatic
extraction and/or submission of form data, and the other is an
associated display value that is shown to the user when inputting or
viewing form data.
When an AcroChoiceField
allows only one
selection (as indicated by
com.snowtide.pdf.forms.AcroChoiceField.allowsMultipleChoices()
),
its com.snowtide.pdf.forms.AcroChoiceField.getValue()
method
will return its export value. The corresponding display value is
available via
com.snowtide.pdf.forms.AcroChoiceField.getDisplayValue(String)
.
When multiple selections are allowed in an
AcroChoiceField
AcroChoiceField.getValue()
can return
an Object[]
containing String
export values.
Finally, in some cases, an
AcroChoiceField
's value may be
arbitrarily set by the user. If this is possible, the field's
com.snowtide.pdf.forms.AcroChoiceField.isEditable()
method
will return true, and the String
returned by
AcroChoiceField.getValue()
may not be
associated with any display value provided by
AcroChoiceField.getDisplayValue(String)
.
Updating form field values
PDFxStream also supports the generation of PDF documents containing updated interactive form field values. This is supported for text, checkbox, radio button group, and choice form fields. This feature may be used to support a user-centric forms update process, as well as to drive an automated forms generation system, where (for example) template PDF form documents are customized with customers' specific information prior to being delivered or archived.
The actual update process is very simple:
- Retrieve the form fields to be updated
- Set new values on each form field (typically using
com.snowtide.pdf.forms.AcroFormField.setValue(String)
-- although some form fields have specialized value setters, such ascom.snowtide.pdf.forms.AcroCheckboField
) - Finally, call
com.snowtide.pdf.forms.AcroForm.writeUpdatedDocument(String)
(orAcroForm.writeUpdatedDocument(OutputStream)
if you want to redirect the PDF document data somewhere other than a file) to write out a copy of the open PDF document that contains the updated form field data.
An instance of this procedure is shown below, continuing with our use of the IRS Form 1040 as an example:
public static void update1040Data (Document pdf_1040, String firstName,
String lastName, String address,
String city_state_zip, String updatePath) throws IOException {
AcroForm form = (AcroForm)pdf_1040.getFormData();
AcroTextField field = (AcroTextField)form.getField("f1-4");
field.setValue(firstName);
field = (AcroTextField)form.getField("f1-5");
field.setValue(lastName);
field = (AcroTextField)form.getField("f1-8");
field.setValue(address);
field = (AcroTextField)form.getField("f1-10");
field.setValue(city_state_zip);
form.writeUpdatedDocument(updatePath);
}
Accessing XFA PDF forms
In addition to (now "legacy") interactive PDF forms, the PDF specification includes support for XFA PDF forms. XFA is a way to represent forms data using XML, which makes it very easy to support form data interchange.
PDFxStream allows you to access the XML documents that comprise a PDF document's XFA forms, which you can then query or process to meet your specific application requirements. Doing this is very simple, and builds upon PDFxStream's existing interactive form data API. In the example below, we'll retrieve the XML document (as a byte array) that contains the XFA form's current values:
public static byte[] getXFADatasets (Document pdf) throws IOException {
AcroForm form = (AcroForm)pdf.getFormData();
return form.getXFAPacketContents("datasets");
}
Further, we can access the full set of XFA form data in a PDF document
using com.snowtide.pdf.forms.AcroForm.getXFAContents()
. These
values can be fed into any existing XML libraries or tools to support
XFA form data extraction, mapping of the form data to databases, or
whatever else your application requires.