Extracting and updating PDF form data

Forms are used in virtually every industry and environment to efficiently collect data from individuals, but paper forms have frequently represented the worst of modern institutions – bureaucracy, unresponsiveness, and inflexibility.

The interactive form features offered by PDF document technology are helping to ease the handling of forms and form data by eliminating the need for paper forms, enabling user-friendly entry of form data and information, and providing for the efficient extraction of that data and information after a form is submitted. PDFxStream supports both the extraction of form data from PDF documents as well as the generation of PDF documents with updated form field data.

Take, for example, a form that is ubiquitous and known to all within the United States, the dreaded IRS Form 1040:

The image above is taken from the PDF version of the 1040, which faithfully reproduces the appearance of the 1040 as the U.S. Government prints it each year. It could be printed, filled in by hand, and submitted by mail. However, if opened by a PDF viewer (like Adobe Acrobat) that is forms- capable, then each field within the form becomes available to user input, like so:

So, once the form is filled out within the PDF viewer, it could be submitted electronically to the IRS, which could then extract all of the data from the form programmatically instead of employing thousands to tediously enter such data by hand from paper copies of the form. PDFxStream could be used for this extraction task; here’s what it would "see" when presented with the 1040 form:

Notice that each form field has a name – for example, the city/town/state/zip field has the name f1 - f10. Each field’s name is unique within the form, making it very easy to access only particular form field elements and values.

Extracting PDF form data

Here’s a code sample using PDFxStream where the main name and address information is extracted from the 1040 form and associated with application-specific names:

public static Map<String,String> get1040Data (Document pdfts_1040) throws IOException {
    com.snowtide.pdf.forms.Form form = pdfts_1040.getFormData();
    HashMap<String,String> data = new HashMap<String,String>();
    data.put("first_name", form.getField("f1-4").getValue());
    data.put("last_name", form.getField("f1-5").getValue());
    data.put("address", form.getField("f1-8").getValue());
    data.put("city_state_zip", form.getField("f1-10").getValue());
    return data;
}

A com.snowtide.pdf.forms.Form contains references to all of the fields included in the PDF form, mapped to each form fields’ full, unique name. Specifically, that Form object is a com.snowtide.pdf.forms.AcroForm instance; the AcroForm subinterface guarantees that all fields it contains implement the com.snowtide.pdf.forms.AcroFormField interface. All Form objects provide methods for iterating over all of the available form fields, getting a collection of all the names of a form’s fields (com.snowtide.pdf.forms.Form.getFieldNames()), and getting a particular com.snowtide.pdf.forms.FormField instance using its unique name.

The forms extraction API presents a fundamentally simple name/value mapping, and is therefore conceptually very similar to the document metadata API. This is especially true with regard to text-based form fields, represented by com.snowtide.pdf.forms.AcroTextFields, whose com.snowtide.pdf.forms.AcroTextField#getValue() method will always return a String of the retained contents of the form field.

Export and display values

Nontext form fields such as button fields (represented by com.snowtide.pdf.forms.AcroButtonField objects) and choice fields (represented by com.snowtide.pdf.forms.AcroChoiceField objects) have slightly more complex aspects.

AcroButtonFields have a variety of subtypes – principally, checkboxes (com.snowtide.pdf.forms.AcroCheckboxField) and radio button groups (com.snowtide.pdf.forms.AcroRadioButtonGroupField). These kinds of widgets are quite familiar to users of web browsers, which have analogous form entry elements. However, since these form fields are primarily visual in nature, their retained values are visually-oriented as well – the getValue() method of all AcroButtonFields will return a String code indicating how a PDF viewer should display the field’s widget.

In most cases, this code will have no meaning to an extracting application, so many PDF document forms will specify export values that correspond to each potential display code, and likely describe the field’s selected widget. All export values known for a particular field are available via the com.snowtide.pdf.forms.AcroButtonField.getExportValues() method; the single export value associated with a field’s current value (display code) is available via the com.snowtide.pdf.forms.AcroButtonField.getExportValue() method.

AcroChoiceFields have a different design, which is similar to how dropdown choice widgets and their values are described in HTML documents. Each choice available in an AcroChoiceField is a pairing of values: one is an export value, which is typically used in programmatic extraction and/or submission of form data, and the other is an associated display value that is shown to the user when inputting or viewing form data.

When an AcroChoiceField allows only one selection (as indicated by com.snowtide.pdf.forms.AcroChoiceField.allowsMultipleChoices()), its com.snowtide.pdf.forms.AcroChoiceField.getValue() method will return its export value. The corresponding display value is available via com.snowtide.pdf.forms.AcroChoiceField.getDisplayValue(String). When multiple selections are allowed in an AcroChoiceField AcroChoiceField.getValue() can return an Object[] containing String export values.

Finally, in some cases, an AcroChoiceField’s value may be arbitrarily set by the user. If this is possible, the field’s com.snowtide.pdf.forms.AcroChoiceField.isEditable() method will return true, and the String returned by com.snowtide.pdf.forms.AcroChoiceField#getValue() may not be associated with any display value provided by com.snowtide.pdf.forms.AcroChoiceField#getDisplayValue(java.lang.String).

Updating form field values

PDFxStream also supports the generation of PDF documents containing updated interactive form field values. This is supported for text, checkbox, radio button group, and choice form fields. This feature may be used to support a user-centric forms update process, as well as to drive an automated forms generation system, where (for example) template PDF form documents are customized with customers’ specific information prior to being delivered or archived.

The actual update process is very simple:

Retrieve the form fields to be updated
Set new values on each form field (typically using com.snowtide.pdf.forms.AcroFormField.setValue(String) – although some form fields have specialized value setters, such as com.snowtide.pdf.forms.AcroCheckboField)
Finally, call com.snowtide.pdf.forms.AcroForm#writeUpdatedDocument(java.lang.String) (or AcroForm.writeUpdatedDocument(OutputStream) if you want to redirect the PDF document data somewhere other than a file) to write out a copy of the open PDF document that contains the updated form field data.

An instance of this procedure is shown below, continuing with our use of the IRS Form 1040 as an example:

public static void update1040Data (Document pdf_1040, String firstName,
                                   String lastName, String address,
                                   String city_state_zip, String updatePath) throws IOException {
    AcroForm form = (AcroForm)pdf_1040.getFormData();
    AcroTextField field = (AcroTextField)form.getField("f1-4");
    field.setValue(firstName);
    field = (AcroTextField)form.getField("f1-5");
    field.setValue(lastName);
    field = (AcroTextField)form.getField("f1-8");
    field.setValue(address);
    field = (AcroTextField)form.getField("f1-10");
    field.setValue(city_state_zip);
    form.writeUpdatedDocument(updatePath);
}

Accessing XFA PDF forms

In addition to (now "legacy") interactive PDF forms, the PDF specification includes support for XFA PDF forms. XFA is a way to represent forms data using XML, which makes it very easy to support form data interchange.

PDFxStream allows you to access the XML documents that comprise a PDF document’s XFA forms, which you can then query or process to meet your specific application requirements. Doing this is very simple, and builds upon PDFxStream’s existing interactive form data API. In the example below, we’ll retrieve the XML document (as a byte array) that contains the XFA form’s current values:

public static byte[] getXFADatasets (Document pdf) throws IOException {
    AcroForm form = (AcroForm)pdf.getFormData();
    return form.getXFAPacketContents("datasets");
}

Further, we can access the full set of XFA form data in a PDF document using com.snowtide.pdf.forms.AcroForm.getXFAContents(). These values can be fed into any existing XML libraries or tools to support XFA form data extraction, mapping of the form data to databases, or whatever else your application requires.

PDFxStream v3.5.0 Technical Documentation

Next >> Error handling

<< Previous Extracting Images from PDF Documents