Introducing PDFTextStream

PDFTextStream is a Java and .NET library that enables applications to access text, metadata, and forms content in PDF documents quickly, easily, and accurately. While there are many excellent tools and libraries available for generating PDF documents, PDFTextStream is the first and only library to focus on the extraction of textual data from PDF files. As such, it is the fastest, most feature-complete PDF text extraction library available on the market today.

Here is a summary of the main features of PDFTextStream:

  • Support for versions 1.0 – 1.7 (ExtensionLevel 5) of the PDF document specification (current through Acrobat 9.x / X)
  • Pure Unicode output, including Chinese, Japanese, and Korean (CJK) support
  • Document decryption, including 40-bit, 128-bit, 256-bit, and variable bit-length RC4 and AES ciphers
  • Support for extracting bookmarks, annotations, and interactive form data
  • Access to all document metadata contained in a PDF file, including Adobe XMP metadata streams
  • Subclasses java.io.Reader, which ensures a simple, familiar interface, and easy integration opportunities with existing components expecting a java.io.Reader instance
  • Easy integration with Apache Lucene, the most popular pure-Java indexing and search engine
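
Because PDFTextStream subclasses java.io.Reader, any code written against Reader should be able to consume it unchanged. The sketch below illustrates that idea using only standard JDK classes. The class name ReaderTextDemo and the helper readAll are hypothetical names chosen for this example, and a StringReader stands in for a PDFTextStream instance, since constructing one is not covered in this section.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical demo class; not part of the PDFTextStream API.
class ReaderTextDemo {

    // Drains any java.io.Reader into a String. The method knows nothing
    // about PDF; it relies only on the Reader contract.
    static String readAll(Reader reader) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4096];
        int count;
        // read() returns -1 once the end of the stream is reached
        while ((count = reader.read(buffer)) != -1) {
            text.append(buffer, 0, count);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a StringReader is used here as a stand-in. In an
        // application, a PDFTextStream instance would be passed instead.
        try (Reader reader = new StringReader("Text extracted from a PDF would appear here.")) {
            System.out.println(readAll(reader));
        }
    }
}
```

The same pattern applies to existing components that already accept a Reader: they need no PDF-specific changes to receive extracted text.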

A complete enumeration of PDFTextStream's features and capabilities can be found here.

Given PDFTextStream's capabilities and its focus on performance, it is well suited for use in a number of different development environments, including:

  • High-volume enterprise environments that need to extract text from large numbers of PDF files
  • Content management systems (CMSs) that need access to the text of PDF files for categorization or summarization purposes
  • Full-text indexing and search systems that wish to add comprehensive support for searching PDF documents
  • Data conversion processes, especially those that aim to selectively extract and convert unstructured PDF content into structured data elements
  • Alternative content delivery systems that need to provide access to PDF document content to devices that cannot readily open and view PDF content (e.g., mobile phones and PDAs)

The following sections will provide you with the reference and tutorial information you need to successfully integrate PDFTextStream into your application.