PDF text extraction done right.
Get the text, metadata, and form data you need from those PDFs and get back to your real job.
Yes, getting data out of PDF documents really can be this easy
PDFTextStream is used by companies and governments around the world to process billions of documents yearly.
PDFTextStream for Java is written in 100% pure Java, with no native components or dependencies. Its only requirement is a compliant Java 1.5 (or higher) JVM.
PDFTextStream is suitable for use in demanding desktop and server applications, including those with significant concurrency requirements. It has been designed to be amenable to parallelization, so that you can fully utilize your hardware and infrastructure investments when processing PDF documents without worrying about locking or race conditions.
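As a rough illustration of the parallel processing described above, the following sketch fans a batch of documents out over a fixed-size thread pool. It is not taken from the PDFTextStream documentation: the `extractor` function is a hypothetical stand-in for whatever call actually opens a PDF and returns its text, and the class and method names are invented for this example. It also assumes a Java 8 or later JDK, which is newer than the minimum the library itself requires.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelExtraction {

    /**
     * Applies the extractor to every path on a fixed-size thread pool and
     * returns the results in the same order as the input list.
     * The extractor is a placeholder (assumption) for a real PDF text call.
     */
    public static List<String> extractAll(List<String> paths,
                                          Function<String, String> extractor,
                                          int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one independent task per document.
            List<Future<String>> pending = new ArrayList<>();
            for (String path : paths) {
                Callable<String> task = () -> extractor.apply(path);
                pending.add(pool.submit(task));
            }
            // Collect results in submission order.
            List<String> results = new ArrayList<>();
            for (Future<String> future : pending) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while extracting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Extraction failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Fake extractor used only so the sketch runs without any PDF library.
        Function<String, String> fake = path -> "text of " + path;
        List<String> texts = extractAll(Arrays.asList("a.pdf", "b.pdf"), fake, 2);
        texts.forEach(System.out::println);
    }
}
```

Each task here works on its own document and shares no mutable state, which is the kind of usage the concurrency claims above are most naturally read as covering.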
Of course, being a Java library, PDFTextStream may be used by any JVM language that supports interoperability with Java APIs, including Clojure, Scala, Groovy, JRuby, Jython, and so on.
PDFTextStream for .NET is produced by translating the standard PDFTextStream for Java binary into a pure managed .NET 2.0 assembly. This translation process is complete, and does not entail any side effects that impair the functionality, robustness, APIs, or performance of PDFTextStream for .NET.
All of the concurrency and parallelism guarantees provided by PDFTextStream for Java apply to its .NET cousin.
As with PDFTextStream for Java, PDFTextStream for .NET may be used by any .NET language, including C#, VB.NET, F#, managed C++, and so on.
PDFTextStream is the fastest component available for extracting text and metadata from PDF documents, period.
PDFTextStream has two main goals when it extracts the text content of a PDF document: do it accurately, and do it fast.
Which of those two attributes is more important to your application is something only you can decide. However, in many environments, text extraction performance isn't just a nice-to-have: it's critical to your project's success. That's why we're glad to be able to make such a bold claim without reservation, and we have the numbers to back it up.
PDFTextStream was built from the ground up specifically to meet the most stringent PDF text and metadata content extraction requirements. Its API is comprehensive, and includes the following features:
- Extensive support for the PDF file format specification and all known variants
- Full Unicode-capable text extraction facilities, including support for extracting Chinese, Japanese, and Korean (CJK) text, in both horizontal and vertical writing modes
- Comprehensive PDF document metadata access
- Page-level object model via com.snowtide.pdf.Page, providing page-specific text extraction and page metrics (height, width, rotation angle, etc.)
- Acroform (interactive form) data extraction including text, checkbox, radio button, and choice fields, as well as form update facilities
- PDF bookmark (document outline) access
- PDF annotation access (including Link (web URL) annotations)
- Seamless Lucene integration
- EncryptionInfo API: provides access to PDF document encryption parameters
- Text-piping API for super-fast text extraction, with hooks for customizing how PDF text extracts are formatted (such as when the visual layout of each page needs to be maintained)
- Selective regional text extraction built-in, ideal for extracting data from fixed-format forms
- Optional in-memory operation
- PDF to HTML exporter
- PDFTextStream subclasses java.io.Reader, which ensures a simple, familiar interface, and straightforward integration opportunities with existing components that expect a java.io.Reader
- Flexible logging toolkit hooks
- Built-in support for logging to standard out, Log4J, and
- Ability to plug in custom logging implementations api doc
There's much, much more to the PDFTextStream API than we can reasonably list here. Check out its API reference and the PDFTextStream Developer's Guide to learn about all that PDFTextStream has to offer.
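Because the feature list notes that PDFTextStream subclasses java.io.Reader, code written against the plain `Reader` type should be able to consume it unchanged. The sketch below shows that idea with a small word counter. The `ReaderConsumer` class and `countWords` method are hypothetical names made up for this example, and a `StringReader` stands in for a PDF-backed reader so the code runs without the library.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ReaderConsumer {

    /**
     * Counts whitespace-separated words from any Reader and closes it when done.
     * Nothing here depends on where the characters come from.
     */
    public static int countWords(Reader source) {
        try (BufferedReader in = new BufferedReader(source)) {
            int words = 0;
            String line;
            while ((line = in.readLine()) != null) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    words += trimmed.split("\\s+").length;
                }
            }
            return words;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumption: a PDF-backed Reader could be passed here instead.
        Reader standIn = new StringReader("Hello PDF world\nsecond line");
        System.out.println(countWords(standIn)); // prints 5
    }
}
```

Any existing component with a `Reader` parameter (indexers, tokenizers, line-oriented parsers) follows the same pattern.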
The official PDF file format specification (published by Adobe) is large and complex. PDF files can be rich, dynamic documents, and getting to all of the interesting and useful parts of them (i.e. their content, text, metadata, etc) is a daunting task.
Further, Adobe's specification only provides normative descriptions of how PDF documents should be constructed. Experience shows that applications must often process PDF documents from multiple sources, each of which may (and does) generate PDF files that sometimes bend and often break the "official" PDF specification, much as web browsers are forced to support broken and malformed HTML documents as best they can.
This is just one of the many reasons why aggressively supporting and maintaining PDFTextStream is a never-ending task. Doing anything else would prevent us from guaranteeing maximum compatibility with all PDF document formats and variants "in the field", regardless of their source or to what degree they violate certain rules of good PDF file format etiquette.
Our aim is to ensure that PDFTextStream can extract the full text of any PDF document that you can copy-and-paste text from using Adobe Acrobat — and do it accurately in an automated, high-performance environment.
PDF Format Support Details
The range of PDF file format features (and quirks!) that PDFTextStream supports is broad and deep. Below is a partial list of the major facets of the PDF specification that PDFTextStream supports. If you are aware of a particular detail that is not listed, then please feel free to contact us to confirm that PDFTextStream supports what you need.
From PDFTextStream, you can expect:
- Compatibility with all versions of the PDF document format:
- v1.0 (Acrobat 1)
- v1.1 (Acrobat 2)
- v1.2 (Acrobat 3)
- v1.3 (Acrobat 4)
- v1.4 (Acrobat 5)
- v1.5 (Acrobat 6)
- v1.6 (Acrobat 7)
- v1.7 (Acrobat 8, 9, & X)
- Support for decryption of encrypted PDF documents, using 40-bit, 128-bit, 256-bit, and variable bitlength ciphers (including RC4 and AES)
- Excellent embedded and standard font and character encoding support (critical for enabling proper layout and spacing of text):
- Type 0
- Type 1
- Type 1C
- Identity-H and Identity-V encodings
- CMap encodings (including Chinese, Japanese, and Korean character sets, both horizontal and vertical writing modes)
- Support for extracting and updating Acroform (interactive form) data
- Support for extracting text from "searchable image" PDF documents (common in files sourced from an OCR process)
- Support for all varieties of rotated text (page-level as well as text-level rotations)
- Support for extracting all PDF bookmark (document outline) data
- Support for extracting document annotations (including web links)
- Support for both types of document-wide metadata (classic key/value attributes as well as Adobe XMP XML-format metadata)
- And much more...
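To make the version list above concrete: a PDF file declares its version in a header comment of the form `%PDF-1.4` near the start of the file. The sketch below reads that header from a file's leading bytes and maps it to the Acrobat release named in the list. It is an independent illustration, not PDFTextStream code. The class and method names are hypothetical, and searching the first 1024 bytes rather than only byte 0 is an assumption modelled on how tolerant readers commonly treat files with leading junk.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PdfHeaderVersion {

    // Matches the "%PDF-M.N" header comment, e.g. "%PDF-1.4".
    private static final Pattern HEADER = Pattern.compile("%PDF-(\\d)\\.(\\d)");

    /**
     * Returns the declared version (such as "1.4"), or null when no header
     * is found in the first 1024 bytes (the tolerance window is an assumption).
     */
    public static String versionOf(byte[] leadingBytes) {
        int length = Math.min(leadingBytes.length, 1024);
        // ISO-8859-1 maps every byte to one char, so binary data cannot fail to decode.
        String head = new String(leadingBytes, 0, length, StandardCharsets.ISO_8859_1);
        Matcher m = HEADER.matcher(head);
        return m.find() ? m.group(1) + "." + m.group(2) : null;
    }

    /** Maps a version string to the Acrobat release given in the list above. */
    public static String acrobatFor(String version) {
        if (version == null) {
            return "unknown";
        }
        switch (version) {
            case "1.0": return "Acrobat 1";
            case "1.1": return "Acrobat 2";
            case "1.2": return "Acrobat 3";
            case "1.3": return "Acrobat 4";
            case "1.4": return "Acrobat 5";
            case "1.5": return "Acrobat 6";
            case "1.6": return "Acrobat 7";
            case "1.7": return "Acrobat 8, 9, & X";
            default:    return "unknown";
        }
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4\n".getBytes(StandardCharsets.ISO_8859_1);
        String version = versionOf(sample);
        System.out.println(version + " -> " + acrobatFor(version));
    }
}
```

The header is only a declaration. A file can claim one version and use features from another, which is part of why real-world compatibility takes more than checking this value.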