Extracting and updating PDF form data

Forms are used in virtually every industry and environment to efficiently collect data from individuals, but paper forms have frequently represented the worst of modern institutions – bureaucracy, unresponsiveness, and inflexibility.

The interactive form features of PDF ease the handling of forms and form data: they eliminate the need for paper forms, enable user-friendly data entry, and allow that data to be extracted efficiently after a form is submitted. PDFTextStream supports both the extraction of form data from PDF documents and the generation of PDF documents with updated form field data.

Take, for example, a form that is ubiquitous and known to all within the United States, the dreaded IRS Form 1040:

The image above is taken from the PDF version of the 1040, which faithfully reproduces the appearance of the 1040 as the U.S. Government prints it each year. It could be printed, filled in by hand, and submitted by mail. However, if opened by a PDF viewer (like Adobe Acrobat) that is forms-capable, then each field within the form becomes available to user input, like so:

So, once the form is filled out within the PDF viewer, it could be submitted electronically to the IRS, which could then extract all of the data from the form programmatically instead of employing thousands to tediously enter such data by hand from paper copies of the form. PDFTextStream could be used for this extraction task; here’s what it would "see" when presented with the 1040 form:

Notice that each form field has a name – for example, the city/town/state/zip field has the name f1-10. Each field’s name is unique within the form, making it very easy to access only particular form field elements and values.

Extracting PDF form data

Here’s a code sample using PDFTextStream where the main name and address information is extracted from the 1040 form and associated with application-specific names:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import com.snowtide.pdf.PDFTextStream;

public static Map<String, Object> get1040Data (PDFTextStream pdfts_1040) throws IOException {
    com.snowtide.pdf.forms.Form form = pdfts_1040.getFormData();
    Map<String, Object> data = new HashMap<>();
    // Look up each field by its unique name and store its value
    // under an application-specific key.
    data.put("first_name", form.getField("f1-4").getValue());
    data.put("last_name", form.getField("f1-5").getValue());
    data.put("address", form.getField("f1-8").getValue());
    data.put("city_state_zip", form.getField("f1-10").getValue());
    return data;
}

A com.snowtide.pdf.forms.Form object contains references to all of the form field elements included in the PDF form, mapped to each form field’s full, unique name. Specifically, that Form object is a com.snowtide.pdf.forms.AcroForm instance; the AcroForm subinterface guarantees that all fields it contains implement the com.snowtide.pdf.forms.AcroFormField interface. All Form objects provide methods for iterating over all of the available form fields (iterator()), getting an Enumeration of all the names of a form’s FormFields (getFieldNames()), and getting a particular FormField instance using its unique name (getField(String)).
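
When every field in a form is of interest, the name enumeration and the by-name lookup can be combined into a single name/value snapshot. The following is a minimal sketch that deliberately does not depend on PDFTextStream types: FormSnapshot and its snapshot method are hypothetical helper names, and the caller supplies the enumeration and the lookup function.

```java
import java.util.Enumeration;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical helper (not part of PDFTextStream). */
final class FormSnapshot {
    private FormSnapshot() {}

    /**
     * Builds a name -> value map, preserving enumeration order.
     *
     * @param fieldNames  the field names to visit (e.g. the Enumeration
     *                    obtained from a form's getFieldNames() method)
     * @param valueLookup returns the current value for a given field name
     */
    static Map<String, Object> snapshot(Enumeration<?> fieldNames,
                                        Function<String, Object> valueLookup) {
        Map<String, Object> values = new LinkedHashMap<>();
        while (fieldNames.hasMoreElements()) {
            // Names are converted with String.valueOf so that an untyped
            // Enumeration can be passed without casts.
            String name = String.valueOf(fieldNames.nextElement());
            values.put(name, valueLookup.apply(name));
        }
        return values;
    }
}
```

Assuming a Form instance named form, a call might look like FormSnapshot.snapshot(form.getFieldNames(), name -> form.getField(name).getValue()). Keeping the lookup as a function lets the same helper be exercised without a PDF document at hand.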

The forms extraction API presents a fundamentally simple name/value mapping, and is therefore conceptually very similar to the document metadata extraction API. This is especially true with regard to text-based form fields, represented by com.snowtide.pdf.forms.AcroTextFields, whose getValue() method will always return a String of the retained contents of the form field.

Export and display values

Nontext form fields such as button fields (represented by com.snowtide.pdf.forms.AcroButtonField objects) and choice fields (represented by com.snowtide.pdf.forms.AcroChoiceField objects) have slightly more complex aspects.

AcroButtonFields have a variety of subtypes – principally, checkboxes (com.snowtide.pdf.forms.AcroCheckboxField) and radio button groups (com.snowtide.pdf.forms.AcroRadioButtonGroupField). These kinds of widgets are quite familiar to users of web browsers, which have analogous form entry elements. However, since these form fields are primarily visual in nature, their retained values are visually-oriented as well – the getValue() method of all AcroButtonFields will return a String code indicating how a PDF viewer should display the field’s widget.

In most cases, this code will have no meaning to an extracting application, so many PDF forms specify export values, one for each potential display code, that typically describe the selected widget in terms an application can use. All export values known for a particular field are available via the getExportValues() method; the single export value associated with a field’s current value (display code) is available via the getExportValue() method.

AcroChoiceFields have a different design, which is similar to how dropdown choice widgets and their values are described in HTML documents. Each choice available in an AcroChoiceField is a pairing of values: one is an export value, which is typically used in programmatic extraction and/or submission of form data, and the other is an associated display value that is shown to the user when inputting or viewing form data.

When an AcroChoiceField allows only one selection (as indicated by the allowsMultipleChoices() method), the getValue() function provided by AcroChoiceFields will return a field’s export value. The corresponding display value is available via the getDisplayValue(String) function. When multiple selections are allowed in an AcroChoiceField, the getValue() function can return an Object[] containing String export values.
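
Because getValue() may yield either a single String or an Object[] of Strings depending on the field, calling code often benefits from normalizing both shapes into one. The sketch below shows one way to do that; ChoiceValues and asList are hypothetical names, and treating a null value as "no selection" is an assumption rather than documented PDFTextStream behavior.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper (not part of PDFTextStream). */
final class ChoiceValues {
    private ChoiceValues() {}

    /**
     * Normalizes a choice field value into a list of export values.
     * Accepts a single value, an Object[] of values, or null.
     */
    static List<String> asList(Object value) {
        if (value == null) {
            // Assumption: null means that nothing is selected.
            return Collections.emptyList();
        }
        if (value instanceof Object[]) {
            // Multiple-selection case: one entry per selected export value.
            List<String> selected = new ArrayList<>();
            for (Object item : (Object[]) value) {
                if (item != null) {
                    selected.add(item.toString());
                }
            }
            return selected;
        }
        // Single-selection case: wrap the lone export value.
        return Collections.singletonList(value.toString());
    }
}
```

With a choice field in hand, ChoiceValues.asList(field.getValue()) can then be looped over uniformly, and each entry passed to getDisplayValue(String) if the user-visible text is needed.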

Finally, in some cases, an AcroChoiceField’s value may be arbitrarily set by the user. If this is possible, the field’s isEditable() function will return true, and the String returned by the getValue() function may not yield any associated display value via the getDisplayValue(String) function.

Updating form field values

PDFTextStream also supports the generation of PDF documents containing updated interactive form field values. This is supported for text, checkbox, radio button group, and choice form fields. This feature may be used to support a user-centric forms update process, as well as to drive an automated forms generation system, where (for example) template PDF form documents are customized with customers’ specific information prior to being delivered or archived.

The actual update process is very simple:

  1. Retrieve the form fields to be updated
  2. Set new values on each form field (typically using AcroFormField.setValue(String) – although some form fields have specialized value setters, such as AcroCheckboxField)
  3. Finally, call AcroForm.writeUpdatedDocument(File) (or AcroForm.writeUpdatedDocument(OutputStream) if you want to redirect the PDF document data somewhere other than a file) to write out a copy of the open PDF document that contains the updated form field data.

An instance of this procedure is shown below, continuing with our use of the IRS Form 1040 as an example:

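import java.io.File;
import java.io.IOException;
import com.snowtide.pdf.PDFTextStream;
import com.snowtide.pdf.forms.AcroForm;
import com.snowtide.pdf.forms.AcroTextField;

public static void update1040Data (PDFTextStream pdfts_1040, String firstName,
                                   String lastName, String address,
                                   String city_state_zip, File updatePath) throws IOException {
    // 1. Retrieve the form and the fields to be updated.
    AcroForm form = (AcroForm)pdfts_1040.getFormData();
    AcroTextField field = (AcroTextField)form.getField("f1-4");
    // 2. Set the new value on each field.
    field.setValue(firstName);
    field = (AcroTextField)form.getField("f1-5");
    field.setValue(lastName);
    field = (AcroTextField)form.getField("f1-8");
    field.setValue(address);
    field = (AcroTextField)form.getField("f1-10");
    field.setValue(city_state_zip);
    // 3. Write a copy of the open document containing the updated values.
    form.writeUpdatedDocument(updatePath);
}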

Accessing XFA PDF forms

In addition to the (now "legacy") interactive PDF forms, the PDF specification also includes support for XFA PDF forms. XFA represents form data using XML, which makes it very easy to support form data interchange.

PDFTextStream allows you to access the XML documents that comprise a PDF document’s XFA forms, which you can then query or process to meet your specific application requirements. Doing this is very simple, and builds upon PDFTextStream’s existing interactive form data API. In the example below, we’ll retrieve the XML document (as a byte array) that contains the XFA form’s current values:

public static byte[] getXFADatasets (PDFTextStream stream) throws IOException {
    AcroForm form = (AcroForm)stream.getFormData();
    return form.getXFAPacketContents("datasets");
}

Further, we can access the full set of XFA form data in a PDF document using the getXFAContents() method on AcroForm. These values can be fed into any existing XML libraries or tools to support XFA form data extraction, mapping of the form data to databases, or whatever else your application requires.
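
As an illustration of feeding such XML into a standard library, the sketch below parses a datasets packet with the JDK's built-in DOM parser and collects the text of every leaf element. XfaDatasets and leafValues are hypothetical names. Keying values by local element name is a simplifying assumption that suits flat forms; repeated element names overwrite earlier ones.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/** Hypothetical helper (not part of PDFTextStream). */
final class XfaDatasets {
    private XfaDatasets() {}

    /** Returns local element name -> trimmed text for every leaf element. */
    static Map<String, String> leafValues(byte[] xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Refuse DOCTYPE declarations, since form data may be untrusted.
            factory.setFeature(
                "http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                                  .parse(new ByteArrayInputStream(xml));
            Map<String, String> values = new LinkedHashMap<>();
            collect(doc.getDocumentElement(), values);
            return values;
        } catch (SAXException e) {
            throw new IllegalArgumentException("XML could not be parsed", e);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("XML parser unavailable", e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Depth-first walk; an element with no child elements is a leaf.
    private static void collect(Element element, Map<String, String> values) {
        boolean hasChildElement = false;
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                hasChildElement = true;
                collect((Element) child, values);
            }
        }
        if (!hasChildElement) {
            // Later elements with the same local name overwrite earlier ones.
            values.put(element.getLocalName(), element.getTextContent().trim());
        }
    }
}
```

Given an AcroForm named form, XfaDatasets.leafValues(form.getXFAPacketContents("datasets")) would produce the map. Forms with repeating or nested sections would likely need path-based keys or XPath queries instead of local names.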