A performance comparison of PDF text extraction libraries
PDFTextStream — a PDFxStream component — has two primary goals when it extracts the text content of a PDF document: do it accurately, and do it fast.
Which of those two attributes is more important to your application is something only you can decide. However, in many environments, text extraction performance is critical. That's why we're glad to be able to make such a bold statement without reservation:
PDFTextStream is the fastest component available for extracting text from PDF documents.
Thankfully, we have the numbers to back this claim up. We assembled 1000 PDF files, randomly selected from those uploaded by users of some of our online services, that represent all known variations of the PDF specification and dozens of languages and character sets. Using these files, we ran a series of benchmark tests that compared the performance of PDFTextStream with four of the most widely-used PDF libraries capable of extracting text content from PDF documents.
The results of this benchmarking indicate a clear performance winner:
Figure 1. Relative performance of PDF text extraction libraries across 1000 randomly-selected PDF documents. Cumulative processing times are normalized to PDFTextStream’s processing time, which was given a score of 100. Larger scores (and longer bars) are better.
Metric | PDFTextStream | pdftotext | PDFBox |
Number of errors | 0 | 0 | 22 |
Total processing time in minutes | 1.969 | 2.240 | 4.395 |
Relative performance scores | 100.00 | 87.91 | 44.81 |
Figure 2. Summary benchmark results table, showing for each component benchmarked: number of errors over the set of 1000 test PDF documents, total processing time, and relative performance (normalized to PDFTextStream's processing time, which was assigned a value of 100).
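The normalization described in the caption can be reproduced with a few lines of Java. This is a minimal sketch, assuming the score is simply 100 times the baseline time divided by each component's time; the class and method names are hypothetical. Because it starts from the rounded minute values in the table, it yields roughly 87.90 and 44.80 rather than the published 87.91 and 44.81, a difference that presumably comes from the published scores being computed on unrounded timings.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: normalize total processing times to a baseline score of 100. */
class RelativeScores {

    /** score = 100 * baselineMinutes / minutes, so faster components score higher. */
    static double score(double baselineMinutes, double minutes) {
        return 100.0 * baselineMinutes / minutes;
    }

    public static void main(String[] args) {
        // Total processing times in minutes, as rounded in the summary table.
        Map<String, Double> minutes = new LinkedHashMap<>();
        minutes.put("PDFTextStream", 1.969);
        minutes.put("pdftotext", 2.240);
        minutes.put("PDFBox", 4.395);

        double baseline = minutes.get("PDFTextStream");
        for (Map.Entry<String, Double> entry : minutes.entrySet()) {
            System.out.printf("%-14s %7.2f%n",
                    entry.getKey(), score(baseline, entry.getValue()));
        }
    }
}
```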
Discussion of Results
Here we discuss only the bottom-line results presented above. Fully detailed benchmark results, which include timings for each of the 1000 PDF files in the benchmark, are also available.
We believe the results speak for themselves: PDFTextStream (part of PDFxStream v3.x) is the fastest PDF text extraction component. As shown in Figure 1, PDFTextStream is around 13% faster than the next fastest known PDF text extraction component (xpdf's pdftotext utility, which is actually written in native C/C++), and more than twice as fast as PDFBox, the next-fastest Java PDF text extraction library (1.969 versus 4.395 minutes in Figure 2, a ratio of about 2.2).
Further, as Figure 2 shows, PDFTextStream is more reliable, robust, and predictable as well. Across the 1000 PDF files in the benchmark collection, PDFTextStream experienced zero errors.
xpdf's pdftotext utility also finished the benchmark error-free. However, when processing PDF documents adhering to v1.6 of the PDF file format specification (corresponding to Adobe Acrobat 7), it did warn a number of times that it only supports v1.5 of the PDF document spec. (PDFTextStream supports all versions of the PDF document specification.)
Benchmark Methodology
Our aim here was to determine (as objectively as possible) which PDF text extraction library provides the best overall performance.
These results are what we obtained given the randomly-selected collection of PDF documents we used. Relative performance between any set of libraries may be significantly different given significantly different test data. We encourage anyone who needs to make a technology decision regarding PDF extraction libraries to do their own due diligence and determine which library suits their needs best, with respect to performance as well as features, robustness, and support options.
To accomplish this, we developed a benchmarking testbed, consisting of a set of test Java classes and accompanying scripts. A single main class was developed (com.snowtide.pdf.test.TestPerformance) that contained the timing infrastructure. This main test class, by default, tested the performance of the PDFTextStream library. A number of subclasses were then developed that extended this main class to test the performance of each of the competing PDF libraries. This approach had the advantage of ensuring that the critical timing infrastructure remained unchanged regardless of the library being tested. (See below for information on the approaches used in connection with the test classes developed for each library.)
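The base-class-plus-subclasses arrangement can be illustrated with a small Java sketch. This is not the TestPerformance source: every class and method name below is hypothetical, the extraction bodies are placeholders, and System.nanoTime is used for brevity even though it is not available on the 1.4.2 JVM used in the benchmark. The point is only that the timing method lives in one place and subclasses override nothing but the extraction hook.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical base class: owns the timing code; subclasses swap in a library. */
class TimingHarnessSketch {

    /** Hook that each library-specific subclass overrides. Placeholder body. */
    protected String extractText(byte[] pdfBytes) {
        return ""; // a real base class would call its default library here
    }

    /** Shared timing code, declared final so that no subclass can alter it. */
    public final long timeExtractionNanos(byte[] pdfBytes) {
        long start = System.nanoTime();
        extractText(pdfBytes);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        byte[] fakePdf = "%PDF-1.4".getBytes(StandardCharsets.ISO_8859_1);
        TimingHarnessSketch harness = new OtherLibraryHarnessSketch();
        System.out.println("elapsed ns: " + harness.timeExtractionNanos(fakePdf));
    }
}

/** Hypothetical subclass standing in for a competing library's adapter. */
class OtherLibraryHarnessSketch extends TimingHarnessSketch {
    int calls = 0; // counts how often the hook ran, for demonstration only

    @Override
    protected String extractText(byte[] pdfBytes) {
        calls++;
        // Stand-in for handing the bytes to another library's extraction API.
        return new String(pdfBytes, StandardCharsets.ISO_8859_1);
    }
}
```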
This timing infrastructure had two very important attributes that ensured a fair test for all libraries involved:
- Each PDF file tested was processed once by each library's test class before any real timing began. This allowed the JVM's classloader to initialize all of the classes that would be required to extract the text out of each PDF file; because this initialization is complete when real testing begins, the benchmarks are not affected by the unpredictable slowdowns that can occur in connection with the classloading process.
- Each PDF file was then processed by each library's test class four times. The best performance of these four test runs was taken as the reported processing time for each library; the reported times are therefore not averages, but the best result each library could manage for each test file. This approach accounts for any transient factors that might impact performance negatively within the scope of a single test run (e.g. swap file access, transient network activity, etc.).
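The two attributes above (an untimed warm-up pass, then the best of four timed passes) can be sketched in Java as follows. The class and method names are hypothetical and the Runnable stands in for one library extracting one PDF file; this is an illustration of the technique, not the benchmark's actual code.

```java
/** Sketch: one untimed warm-up pass, then report the fastest of N timed passes. */
class BestOfRuns {

    static long bestOfNanos(Runnable task, int warmUpRuns, int timedRuns) {
        if (timedRuns < 1) {
            throw new IllegalArgumentException("timedRuns must be at least 1");
        }
        // Warm-up: lets class loading and initialization finish before timing.
        for (int i = 0; i < warmUpRuns; i++) {
            task.run();
        }
        // Timed passes: keep the minimum, not the average.
        long best = Long.MAX_VALUE;
        for (int i = 0; i < timedRuns; i++) {
            long start = System.nanoTime();
            task.run();
            best = Math.min(best, System.nanoTime() - start);
        }
        return best;
    }

    public static void main(String[] args) {
        // Placeholder workload standing in for "extract the text of one PDF".
        Runnable fakeExtraction = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 10_000; i++) {
                sb.append(i);
            }
        };
        long best = bestOfNanos(fakeExtraction, 1, 4);
        System.out.println("best of 4 runs: " + best + " ns");
    }
}
```

Taking the minimum rather than the mean discards runs that were slowed by transient interference, at the cost of hiding any variance in a library's own behavior.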
The PDF documents used as test cases in the benchmark tests were randomly selected from PDF documents uploaded by users of our various online services. This selection represents a wide variety of document types (e.g. presentations, academic papers, corporate reports, white papers, technical documentation, etc.) and producers (e.g. Adobe Acrobat, Adobe PageMaker, PDFWriter, InDesign, QuarkXPress, Oracle Reports, etc.), as well as languages and character sets. Such diversity in the kinds of PDF documents presented to each library tested gives some confidence that the results of the benchmarks will correspond to real-world performance.
All benchmarking was run on a Sun Fire X2100 server with a 2.0 GHz AMD Opteron 146 processor and 3 GB of memory, running Red Hat Enterprise Linux 4.0. Running java -version on the Java VM used reported the following:
java version "1.4.2_10"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_10-b03)
Java HotSpot(TM) Server VM (build 1.4.2_10-b03, mixed mode)
Note that the benchmarks used the "server" configuration of the JVM, because this most closely matches the configuration likely to be used in a typical enterprise software deployment environment.
No other applications or non-system processes were running at the same time as the test, and all non-essential services and scheduled jobs were halted. A Python script was used to automate the testing and to collect the results that were output by the main test class and its subclasses. Those results are presented above.
Verification
Anyone is welcome to inspect and confirm our findings. The test classes we developed for these benchmarks are available for download:
Here are links to each of the benchmarked components:
We also are glad to provide anyone with the full set of 1000 PDF documents we used in the benchmark. However, these are available by request only -- the full archive of these PDF documents, even when compressed, is 261MB. It would therefore be unwise for us to post such a large file for public download. So, if you would like to download the archive of the 1000 PDF documents used in the benchmark, simply send us an email, and we'll provide you with a download link.
We welcome any comments or suggestions you might have for how we can make this benchmark more accurate, fair, comprehensive, etc.; please feel free to contact us if you have any ideas.
What about .NET?
PDFxStream for .NET is derived from the same codebase as PDFxStream for Java, and therefore shares the latter's architecture, algorithms, and methods. It is reasonable to expect performance from PDFxStream on .NET comparable to what it delivers on Java, and our customers' experience bears that out.
In addition, xpdf's pdftotext utility is written in native C/C++, and had long been considered the fastest PDF text extraction solution. Given that PDFTextStream is faster than pdftotext, we are quite confident that PDFTextStream for .NET will not disappoint.
Why did we choose these libraries, and not others?
There are scads of PDF libraries on the market, both commercial and open source. PDFBox and pdftotext happen to be the most popular open source text extraction options; however, we would be glad to add other libraries to this benchmark, as long as they:
- provide a reasonably straightforward mechanism for extracting text from PDF documents -- there are some libraries which provide PDF parsers or access to low-level PDF objects, but which do not provide a simple way to usefully extract the text of a PDF document
- are somehow available such that they can be benchmarked without being in a hobbled state. For example, some commercial libraries' evaluation modes restrict the size or number of pages they will process without purchasing a license. Such restrictions make it very difficult to keep a benchmark like this properly updated and accurate. (PDFTextStream's evaluation mode does not place any restrictions on its core text extraction functionality.)
If there is a PDF library you would like to see added to this benchmark, do let us know.
Library-Specific Notes
In order to squeeze every possible drop of performance from each library, we developed adapters (some of which were heavily based upon code examples included with each library) to streamline and optimize the libraries' methods for reading PDF text. In some cases, this meant setting certain flags or calling particular methods in each library to provide hints that only the text and metadata content of each source PDF file were of interest. In others, we eliminated file-based output (as included in some sample code for some libraries) and replaced it with in-memory output.
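The switch from file-based to in-memory output can be pictured with a short Java sketch. It assumes the library's extraction call accepts any java.io.Writer; the writeExtractedText method below is a hypothetical stand-in for such a call, not a real API of any of the benchmarked libraries.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

/** Sketch: send extracted text to an in-memory buffer instead of a file. */
class InMemoryOutputSketch {

    /** Hypothetical stand-in for a library call that writes text to a Writer. */
    static void writeExtractedText(String[] pageTexts, Writer out) throws IOException {
        for (String page : pageTexts) {
            out.write(page);
            out.write('\n');
        }
    }

    /** Passing a StringWriter (rather than a FileWriter) avoids all disk I/O. */
    static String extractToMemory(String[] pageTexts) {
        StringWriter buffer = new StringWriter();
        try {
            writeExtractedText(pageTexts, buffer);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        String text = extractToMemory(new String[] {"page one", "page two"});
        System.out.print(text);
    }
}
```

Keeping the output in memory means the measured time reflects parsing and text extraction rather than disk throughput.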
Below are some additional noteworthy items specific to each library tested.
pdftotext
pdftotext is available as a command-line executable, so it could not be plugged into the base benchmarking code that we built for the other libraries. However, because it is a command-line utility, it was trivial to write a script that executed pdftotext for each of the PDF documents in the benchmark collection and measured how long each spawned pdftotext process ran.
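Timing a spawned process of this kind might look like the sketch below. It is written in Java only to stay consistent with the rest of the testbed and is not the script that was actually used; the class and method names are hypothetical. It assumes pdftotext is on the PATH and relies on the pdftotext convention that a text-file argument of "-" sends the extracted text to standard output.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;

/** Sketch: measure the wall-clock time of one spawned command-line process. */
class ProcessTimerSketch {

    static long timeProcessMillis(List<String> command) {
        try {
            ProcessBuilder builder = new ProcessBuilder(command);
            builder.redirectErrorStream(true); // merge stderr into stdout

            long start = System.nanoTime();
            Process process = builder.start();
            // Drain and discard the output so the child never blocks on a full pipe.
            try (InputStream out = process.getInputStream()) {
                byte[] buffer = new byte[8192];
                while (out.read(buffer) != -1) {
                    // discard
                }
            }
            int exitCode = process.waitFor();
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000;

            if (exitCode != 0) {
                throw new IllegalStateException("command exited with status " + exitCode);
            }
            return elapsedMillis;
        } catch (IOException e) {
            throw new IllegalStateException("could not run " + command, e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for " + command, e);
        }
    }

    public static void main(String[] args) {
        // Each argument is the path of a PDF file to time.
        for (String pdfPath : args) {
            long millis = timeProcessMillis(Arrays.asList("pdftotext", pdfPath, "-"));
            System.out.println(pdfPath + "\t" + millis + " ms");
        }
    }
}
```

Note that a measurement taken this way includes process start-up and shutdown, which the in-JVM tests of the Java libraries do not incur for every file.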
PDFBox
PDFBox was benchmarked using an optimized version of one of its included classes, org.pdfbox.ExtractText. We modified the code to operate entirely in-memory (it had been spooling text content out to disk and/or standard out), and to always extract the metadata from each PDF it was tested with (PDFTextStream always provides access to a PDF document's metadata).