Introducing PDFxStream

PDFxStream is a Java and .NET library that enables applications to access text, image, tabular, metadata, and form content in PDF documents quickly, easily, and accurately. While there are many excellent tools and libraries available for generating PDF documents, PDFxStream was the first and remains the only library to focus exclusively on the extraction of data from PDF files.

The various capabilities provided by PDFxStream are separated into a small number of distinct components; while they are delivered as a single, unified API, each can be licensed and enabled separately so you only pay for what you need and use:

  • PDFxStream Base, providing:
    • Support for versions 1.0 – 1.7 (ExtensionLevel 5) of the PDF document specification
    • Document decryption, including 40-bit, 128-bit, 256-bit, and variable-bit-length RC4 and AES ciphers
    • Support for extracting bookmarks, annotations, and document attachments
    • Access to all document metadata contained in a PDF file, including Adobe XMP metadata streams
  • PDFTextStream, providing:
    • Pure Unicode text extraction, including Chinese, Japanese, and Korean (CJK) support
    • Automatic detection of tabular data and inference of table structure
  • PDFImageStream, providing extraction of images embedded in PDF documents for immediate display or for storage in PNG, JPEG, TIFF, GIF, or BMP format
  • PDFFormStream, providing extraction and filling of interactive PDF form data

(The PDFxStream Complete license option includes all of the above components and will include any PDFxStream components introduced in the future.)

Each of the other components requires PDFxStream Base, which provides the fundamental PDF file access primitives.

Given PDFxStream's capabilities and its focus on performance, it is well suited for use in a number of different development environments, including:

  • High-volume enterprise environments that need to extract data from large numbers of PDF files
  • Content management systems (CMSs) that need access to the text of PDF files for categorization or summarization purposes
  • Full-text indexing and search systems that wish to add comprehensive support for searching PDF documents
  • Data conversion processes, especially those that aim to selectively extract and convert unstructured PDF content into structured data formats and databases
  • Alternative content delivery systems that need to provide access to PDF document content to devices that cannot readily open and view PDF content (e.g., mobile phones and PDAs)

The following sections will provide you with the reference and tutorial information you need to successfully integrate PDFxStream into your application.