PDFs made easy, all in one box

PDFxStream Features & Capabilities

Everything in one box to make accessing content and data from PDFs easy.

Most developers don't need to read this page. PDFxStream supports nearly everything in the PDF specification (and hundreds of commonly found constructs that aren't in the standard spec!), and it is built so that using it requires zero knowledge of those details.

(If you care about a particular PDF data type or characteristic that is not listed below, then please feel free to contact us to confirm that PDFxStream supports what you need.)

That said, for those who do care about PDF internals and want to know about PDFxStream's level of support for them, read on and enjoy!

Making PDF data access simple and easy

Working with PDF documents is often a frustrating and difficult exercise. Few developers are familiar with the PDF file specification (and all of its incorporated sub-specifications), and even those who are usually don't want to think about those details when completing what should be an easy task, like "store each uploaded PDF document's text in the database".

For this reason, one of PDFxStream's primary features is that it allows you to complete those sorts of tasks using an API that doesn't require knowledge of the particulars of the PDF file format. See for yourself just how little is necessary to access all of the essential pools of content and data within PDF documents using PDFxStream:

Common PDF data extraction tasks, each shown as a complete program:

import com.snowtide.PDF;
import com.snowtide.pdf.Document;
import com.snowtide.pdf.OutputTarget;

public class ExtractTextAllPages {
    public static void main (String[] args) throws java.io.IOException {
        String pdfFilePath = args[0];

        Document pdf = PDF.open(pdfFilePath);
        StringBuilder text = new StringBuilder(1024);
        pdf.pipe(new OutputTarget(text));
        pdf.close();
        System.out.println(text);
    }
}

import com.snowtide.PDF;
import com.snowtide.pdf.Document;
import com.snowtide.pdf.Page;
import com.snowtide.pdf.layout.Image;

import java.io.File;
import java.io.FileOutputStream;

public class ExtractImages {
    public static void main (String[] args) throws java.io.IOException {
        String pdfFilePath = args[0];
        File outputDir = new File(args[1]);
        if (!outputDir.exists()) outputDir.mkdirs();

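        Document pdf = PDF.open(pdfFilePath);
        for (Page p : pdf.getPages()) {
            int i = 0;
            for (Image img : p.getImages()) {
                // name each file <page number>-<image index>.<format>
                File imageFile = new File(outputDir, String.format("%s-%s.%s",
                        p.getPageNumber(), i, img.dataFormat().name().toLowerCase()));
                // try-with-resources closes the stream even if the write fails
                try (FileOutputStream out = new FileOutputStream(imageFile)) {
                    out.write(img.data());
                }
                i++;
            }
            System.out.printf("Found %s images on page %s", p.getImages().size(), p.getPageNumber());
            System.out.println();
        }
        // close the document once all pages are processed, as the other examples do
        pdf.close();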
    }
}

import java.io.IOException;

import com.snowtide.PDF;
import com.snowtide.pdf.Document;
import com.snowtide.pdf.OutputTarget;

public class ExtractTextOnePage {
    public static void main (String[] args) throws IOException {
        String pdfFilePath = args[0];
        Document pdfts = PDF.open(pdfFilePath);
        StringBuilder text = new StringBuilder(1024);
        pdfts.getPage(0).pipe(new OutputTarget(text));
        pdfts.close();
        System.out.println(text);
    }
}

import com.snowtide.PDF;
import com.snowtide.pdf.Document;

public class ExtractMetadata {
    public static void main (String[] args) throws java.io.IOException {
        String pdfFilePath = args[0];

        System.out.println("All document metadata from " + pdfFilePath + ":");
        Document doc = PDF.open(pdfFilePath);
        for (String key : doc.getAttributeKeys()) {
            System.out.printf("%s: %s", key, doc.getAttribute(key));
            System.out.println();
        }
        doc.close();
    }
}

import com.snowtide.PDF;
import com.snowtide.pdf.Document;
import com.snowtide.pdf.forms.*;

public class ExtractFormData {
    public static void main (String[] args) throws java.io.IOException {
        String pdfFilePath = args[0];
        Document pdfts = PDF.open(pdfFilePath);
        AcroForm form = (AcroForm)pdfts.getFormData();
        
        // access specific fields directly
        AcroTextField projectName = (AcroTextField)form.getField("color.1");
        AcroCheckboxField isPrivateNonProfit =
                (AcroCheckboxField)form.getField("color.10-privatenonprofit");

        System.out.printf("Project %s %s run by a nonprofit organization",
                projectName.getValue(),
                isPrivateNonProfit.isChecked() ? "is" : "is not");
        System.out.println();
        
        // access all fields (just a sampling of available data/functionality)
        System.out.println(String.format("All form data from %s:", pdfFilePath));
        for (AcroFormField field : form) {
            Object ftype = field.getType();
            if (ftype == AcroFormField.FIELD_TYPE_TEXT) {
                System.out.printf("Field %s is a text box; value: %s",
                        field.getFullName(), field.getValue());
            } else if (ftype == AcroFormField.FIELD_TYPE_BUTTON) {
                switch (((AcroButtonField)field).getButtonType()) {
                    case AcroButtonField.BUTTON_TYPE_PUSHBUTTON:
                        System.out.printf("Field %s is a pushbutton; value: %s",
                                field.getFullName(), field.getValue());
                        break;
                    case AcroButtonField.BUTTON_TYPE_CHECKBOX:
                        System.out.printf("Field %s is a checkbox; value: %s; is checked? %s",
                                field.getFullName(), field.getValue(),
                                ((AcroCheckboxField)field).isChecked());                        
                        break;
                    case AcroButtonField.BUTTON_TYPE_RADIO_GROUP:
                        System.out.printf("Field %s is a radio button group; value: %s; possible values: %s",
                                field.getFullName(), field.getValue(),
                                ((AcroRadioButtonGroupField)field).getPossibleValues());
                        break;
                }
            } else if (ftype == AcroFormField.FIELD_TYPE_CHOICE) {
                System.out.printf("Field %s is 'select' dropdown; value: %s; display label: %s",
                        field.getFullName(), field.getValue(),
                        ((AcroChoiceField)field).getDisplayValue((String)field.getValue()));
            } else if (ftype == AcroFormField.FIELD_TYPE_SIGNATURE) {
                System.out.printf("Field %s is a signature; value: %s",
                        field.getFullName(), field.getValue());
            } else {
                System.out.printf("Field %s is of unknown type; value: %s",
                        field.getFullName(), field.getValue());
            }
            System.out.println();
        }

        pdfts.close();
    }
}

import com.snowtide.PDF;
import com.snowtide.pdf.Bookmark;
import com.snowtide.pdf.Document;

public class AccessBookmarks {
    public static void main (String[] args) throws java.io.IOException {
        String pdfFilePath = args[0];
        
        Document doc = PDF.open(pdfFilePath);
        Bookmark root = doc.getBookmarks();
        if (root == null) {
            System.out.println(pdfFilePath + " does not contain any bookmarks.");
        } else {
            for (Bookmark b : root.getAllDescendants()) {
                System.out.printf("Bookmark '%s' points at page %s, bounds %s, %s, %s, %s",
                        b.getTitle(), b.getPageNumber(),
                        b.getLeftBound(), b.getBottomBound(),
                        b.getRightBound(), b.getTopBound());
                System.out.println();
            }
        }

        doc.close();
    }
}

import java.io.*;

import com.snowtide.PDF;
import com.snowtide.pdf.Document;

public class ExtractXMPMetadata {
    public static void main (String[] args) throws IOException {
        String pdfFilePath = args[0];
        
        Document doc = PDF.open(pdfFilePath);
        String outPath = args[0] + ".xmp.xml";
        FileOutputStream s = new FileOutputStream(outPath);
        s.write(doc.getXmlMetadata());
        s.close();
        doc.close();
        
        System.out.println("Wrote Adobe XMP metadata to " + outPath);
    }
}
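
The XMP metadata written out by the ExtractXMPMetadata example above is XML, so it can be read with the XML tooling that ships with the JDK. The following is a minimal sketch under stated assumptions: the sample packet, the class name XmpTitleSketch, and the readTitle helper are hypothetical and not part of PDFxStream, and it assumes the metadata uses the Dublin Core `dc:title` property, which a given document may or may not contain.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class XmpTitleSketch {
    // Dublin Core namespace, commonly used for titles in XMP packets
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Hypothetical sample packet standing in for the bytes of a real document's XMP metadata
    static final String SAMPLE =
            "<x:xmpmeta xmlns:x='adobe:ns:meta/'>"
            + "<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>"
            + "<rdf:Description rdf:about='' xmlns:dc='http://purl.org/dc/elements/1.1/'>"
            + "<dc:title><rdf:Alt><rdf:li xml:lang='x-default'>Quarterly Report</rdf:li></rdf:Alt></dc:title>"
            + "</rdf:Description>"
            + "</rdf:RDF>"
            + "</x:xmpmeta>";

    // Returns the trimmed text of the first dc:title element, or null if there is none.
    static String readTitle(byte[] xmp) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // PDFs often come from untrusted sources, so refuse DOCTYPE declarations
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document xml = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xmp));
        NodeList titles = xml.getElementsByTagNameNS(DC_NS, "title");
        return titles.getLength() == 0 ? null : titles.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Title: " + readTitle(SAMPLE.getBytes(StandardCharsets.UTF_8)));
    }
}
```

In a real program the byte array would come from the document's XML metadata rather than from the hard-coded sample.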

import com.snowtide.PDF;
import com.snowtide.pdf.Document;
import com.snowtide.pdf.OutputTarget;

public class DecryptWithPassword {
    public static void main (String[] args) throws java.io.IOException {
        String pdfFilePath = args[0];
        Document pdfts = PDF.open(pdfFilePath, args[1].getBytes());
        StringBuilder text = new StringBuilder(1024);
        pdfts.pipe(new OutputTarget(text));
        pdfts.close();
        System.out.println(text);
    }
}
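
One detail worth knowing about the DecryptWithPassword example above: `args[1].getBytes()` encodes the password with the JVM's default charset, so a password containing non-ASCII characters can turn into different bytes on different machines. The sketch below only illustrates that difference using the standard library. The class and method names are hypothetical, and which encoding a particular encrypted PDF expects is not stated on this page, so the two charsets shown are assumptions chosen for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class PasswordBytesSketch {
    // Encode a password with an explicitly chosen charset instead of the platform default.
    static byte[] encode(String password, Charset charset) {
        return password.getBytes(charset);
    }

    public static void main(String[] args) {
        // "caf\u00e9" ends in a non-ASCII character (e with an acute accent)
        String password = "caf\u00e9";
        byte[] latin1 = encode(password, StandardCharsets.ISO_8859_1);
        byte[] utf8 = encode(password, StandardCharsets.UTF_8);

        System.out.println("ISO-8859-1: " + Arrays.toString(latin1));
        System.out.println("UTF-8:      " + Arrays.toString(utf8));
        System.out.println("Same bytes? " + Arrays.equals(latin1, utf8));
    }
}
```

Passing an explicit charset to `getBytes` makes the password bytes reproducible across environments; pure-ASCII passwords encode identically under both charsets shown here.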

Give PDFxStream a try for your project's PDF data access needs!

Get Started

Baseline PDF format compatibility and basic data extraction capabilities

The official PDF file format specification is large and complex. PDF files can be rich, dynamic documents, and getting to all of the interesting and useful parts of them (their content, text, metadata, and so on) is a daunting task.

Further, Adobe's specification only provides normative descriptions of how PDF documents should be constructed. In practice, applications must often process PDF documents from flawed sources whose output bends, and often breaks, the "official" PDF specification, much as web browsers are forced to support broken and malformed HTML documents as best they can.

This is just one of the many reasons why continually supporting and maintaining PDFxStream is a never-ending task. Doing anything else would prevent us from guaranteeing maximum compatibility with all PDF document formats and variants "in the field", regardless of their source or to what degree they violate certain rules of good PDF file format etiquette.

PDF Format Support Details

The range of PDF file format features (and quirks!) that PDFxStream supports is broad and deep. Below is a partial list of the major facets of the PDF specification that PDFxStream supports.

Compatibility with all versions of the PDF document specification, from v1.0 (corresponding to Acrobat 1) to v1.7 (corresponding to Acrobat 8 and higher)
Support for decryption of PDF documents encrypted with or without a password using 40-bit, 128-bit, 256-bit, and variable bitlength ciphers (including RC4 and AES)
Automatic "repair" of PDF documents to account for common malformations and irregularities
Extraction of PDF annotations (links, text notes, etc)
Extraction of embedded files and attachments
Extraction of PDF bookmarks (a.k.a. outline, table of contents)
Extraction of document metadata, as either key/value pairs or XML
Extraction of raw character data
Extraction of image metadata, including image dimensions, locations, and types
PDF file merging

Text extraction features

Unicode text extraction, including support for Chinese, Japanese, and Korean (CJK) in both horizontal and vertical writing modes
OutputHandler API for efficiently customizing the formatting of extracted PDF text
Regional text extraction, ideal for extracting data from fixed-format forms
Complete support for embedded and standard fonts and character encodings:
- Type 0, 1, and 1C
- TrueType
- Identity-H and Identity-V encodings
- CMap encodings (including hundreds of Chinese, Japanese, and Korean character sets, both horizontal and vertical writing modes)
Automated layout processing, providing a traversable PDF document model including inferred block, line, column, and table structure
Support for extracting text from "searchable image" PDFs
Support for all varieties of rotated text
Comprehensive support for extracting PDF tables, including via CSV for export to Excel
Support for indexing PDF documents with Apache Lucene via lucene-pdf
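
As an illustration of the CSV table export mentioned in the list above, here is a minimal sketch of what CSV output for extracted table cells has to handle before Excel can open it cleanly: cells containing commas, quotes, or line breaks must be quoted, and embedded quotes doubled. This is plain Java with hypothetical names and made-up rows. It does not use or represent PDFxStream's own table or CSV API, which takes care of this for you.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class TableCsvSketch {
    // Quote a cell if it contains a comma, quote, or line break; double any embedded quotes.
    static String escape(String cell) {
        if (cell == null) {
            return "";
        }
        boolean needsQuotes = cell.contains(",") || cell.contains("\"")
                || cell.contains("\n") || cell.contains("\r");
        String escaped = cell.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    // Join cells with commas and rows with CRLF line endings.
    static String toCsv(List<List<String>> rows) {
        return rows.stream()
                .map(row -> row.stream()
                        .map(TableCsvSketch::escape)
                        .collect(Collectors.joining(",")))
                .collect(Collectors.joining("\r\n"));
    }

    public static void main(String[] args) {
        // Hypothetical rows standing in for the cells of a table found in a PDF
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("Item", "Description", "Price"),
                Arrays.asList("A-100", "Widget, large", "1,200.00"),
                Arrays.asList("B-200", "12\" bracket", "35.50"));
        System.out.println(toCsv(rows));
    }
}
```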

Image extraction features

Decompression and decoding of dozens of PDF image types
Rendering of images to on-screen graphics contexts (java.awt.image.BufferedImage on Java, or System.Drawing.Bitmap on .NET) and saving to disk in familiar formats:
- JPEG
- TIFF
- GIF
- PNG
- BMP
Automatic stitching of image tiles and strips
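
Once an image is available as a java.awt.image.BufferedImage, saving it in one of the formats listed above takes only the JDK's javax.imageio.ImageIO. The sketch below uses a generated placeholder image because how a BufferedImage is obtained from PDFxStream is not shown on this page. The class name, helper methods, and file names are hypothetical. TIFF is left out of the loop because the JDK only bundles a TIFF writer from Java 9 onward.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import javax.imageio.ImageIO;

class SaveRenderedImageSketch {
    // Write the image as <baseName>.<format> inside dir, failing loudly if no writer exists.
    static File save(BufferedImage image, File dir, String baseName, String format)
            throws IOException {
        File out = new File(dir, baseName + "." + format);
        // ImageIO.write returns false when no registered writer supports the format
        if (!ImageIO.write(image, format, out)) {
            throw new IOException("No ImageIO writer for format: " + format);
        }
        return out;
    }

    // Placeholder image standing in for one rendered from a PDF.
    // TYPE_INT_RGB has no alpha channel, which keeps it compatible with JPEG and BMP.
    static BufferedImage placeholder(int width, int height) {
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                image.setRGB(x, y, 0x3366CC);
            }
        }
        return image;
    }

    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("pdf-images").toFile();
        BufferedImage image = placeholder(64, 32);
        for (String format : new String[] {"png", "jpg", "gif", "bmp"}) {
            File written = save(image, dir, "page1-0", format);
            System.out.println("Wrote " + written.getAbsolutePath());
        }
    }
}
```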

Form data extraction features

Support for extracting "AcroForm" (interactive) form data from all types of fields:
- Text
- Dropdowns ("Choice" fields)
- Radio buttons
- Checkboxes
- Pushbuttons
- Signatures
Support for extracting XFA form data
Support for filling "AcroForm" fields, writing updated PDF documents

Learn more

To get the most out of PDFxStream's capabilities and PDF file format support, please check out the developer's guide and API reference.
