# Is PDFxStream thread-safe?

I want to process a large pool of PDF documents using all the hardware at my disposal. How can I parallelize PDFxStream's operation to do this?

# Using PDFxStream in Multiple-CPU, Multithreaded Environments

## Web Application Environments

Most web application deployment environments provide mechanisms for
throttling how many concurrent processes are active, and many of those
mechanisms are completely automated. This generally keeps your
deployment environment's CPU resources well utilized. Because of this,
in most circumstances, you can use PDFxStream freely, without regard for
how many CPUs are present or what their load is.

For example, consider a Java servlet that accepts an upload of a PDF
document, and persists the document and its text content to a database
for indexing and summarization later. Thanks to the automatic resource
(CPU) management provided by most Java application servers, that servlet
can be written quite simply:

```java
// proper exception handling and package imports not included for brevity
public void doPost (HttpServletRequest req, HttpServletResponse resp) {
    // [ ... obtain upload data from request ... ]
    byte[] pdfData = getUploadData(req);

    // [ ... use PDFxStream to extract PDF text ... ]
    Document pdf = PDF.open(pdfData);
    StringWriter pdfText = new StringWriter(1024);
    // write all of the document's text into pdfText, then release the document
    pdf.pipe(new OutputTarget(pdfText));
    pdf.close();

    // [ ... store PDF document and its text in database ... ]
    persistToDatabase(pdfData, pdfText);

    // [ ... redirect to view ... ]
}
```

It should be noted that it is almost always a bad idea to manually create new threads within the context of a managed deployment environment like a Java servlet container. Such environments provide significant resources dedicated to managing threads and processes; by creating your own threads in an attempt to fully utilize all of the CPUs in a deployment environment, you may end up inadvertently compromising the overall performance of your web application.

## Standalone Applications

When building a standalone application, you typically need to manage concurrent processes without the benefit of the frameworks provided by web application servers. To do this, you should become familiar with the details of threading in your preferred environment (Java/JVM, or the .NET CLR) to ensure consistent application performance and maximal CPU utilization.

Good starting points for learning how to manage threads efficiently include:

Regardless of which thread-management library or approach your team uses, there is one overarching guideline that should be adhered to in order to ensure that all of your deployment environment's CPU resources are being fully utilized: for each available CPU, ensure that at least one thread is running that is using PDFxStream. Of course, this guideline is no replacement for proper application tuning, but it will get you off the ground and avoid wasting available resources.
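
As a minimal sketch of that guideline on the JVM, the following uses the standard `java.util.concurrent` executor framework to run one worker thread per available CPU. The class name `ParallelExtraction`, the method `extractAll`, and the `extractor` function are hypothetical names introduced for illustration. The sketch also assumes that each task opens, reads, and closes its own document rather than sharing one across threads.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper, not part of PDFxStream.
final class ParallelExtraction {

    /**
     * Applies `extractor` to every path using a fixed pool with one
     * worker thread per available CPU, and returns the results keyed
     * by path, in input order.
     */
    static Map<String, String> extractAll(List<String> pdfPaths,
                                          Function<String, String> extractor) {
        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            // Submit one task per document; the pool caps concurrency.
            List<Future<String>> pending = new ArrayList<>();
            for (String path : pdfPaths) {
                Callable<String> task = () -> extractor.apply(path);
                pending.add(pool.submit(task));
            }
            // Collect results in the same order the paths were given.
            Map<String, String> results = new LinkedHashMap<>();
            for (int i = 0; i < pdfPaths.size(); i++) {
                results.put(pdfPaths.get(i), pending.get(i).get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while extracting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("extraction task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

In practice, `extractor` would wrap the same open, pipe, and close steps shown in the servlet example above. One thread per CPU is only a starting point; if your tasks spend significant time waiting on disk or network I/O, a different pool size may work better.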


# Can PDFxStream "repair" PDF documents that are damaged, incomplete, or which contain out-of-specification structures?

# What causes PDFxStream to emit empty text extracts?

There are some (usually rare) situations where extracting text from PDF
documents is not possible:

- PDF documents whose pages are simply a series of images. This is
most common with PDF documents that have been scanned from physical
documents, but which have not had a text "layer" added by an
optical character recognition (OCR) process.
- In some very rare cases (less than 0.1% of all PDF documents in our
testing), a PDF document may contain text that uses a font that
"draws" glyphs using images, rather than referring to an actual
character. In these cases, PDFxStream will yield either empty
text extracts, or extracts that are "junk" -- a series of
nonsensical characters.

In principle, these issues could be solved by embedding an OCR process
within PDFxStream. This may be done for some future release.
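
Until then, one option is for the calling application to flag empty or suspicious extracts for separate handling, such as an external OCR step. The sketch below is a rough heuristic and not part of PDFxStream: the `ExtractTriage` class, its `Verdict` values, and the 50% threshold are all assumptions chosen for illustration. It will not catch "junk" extracts that happen to consist of ordinary letters.

```java
// Hypothetical helper, not part of PDFxStream.
final class ExtractTriage {

    enum Verdict { EMPTY, SUSPECT, OK }

    /**
     * Classifies an extract as EMPTY (null or only whitespace), SUSPECT
     * (fewer than half of its non-whitespace characters are letters or
     * digits), or OK. The one-half threshold is an arbitrary assumption.
     */
    static Verdict classify(String extract) {
        if (extract == null || extract.trim().isEmpty()) {
            return Verdict.EMPTY;
        }
        int visible = 0;
        int wordLike = 0;
        for (int i = 0; i < extract.length(); ) {
            int cp = extract.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue;
            }
            visible++;
            if (Character.isLetterOrDigit(cp)) {
                wordLike++;
            }
        }
        return (wordLike * 2 < visible) ? Verdict.SUSPECT : Verdict.OK;
    }
}
```

A document classified as `EMPTY` is a likely candidate for the scanned-image case described above, while `SUSPECT` results are worth inspecting by hand before drawing any conclusion.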

### PDFxStream Bug

Finally, it is possible that you have stumbled across a bug in
PDFxStream. The easiest way to test for this possibility is to
attempt to copy-and-paste text from the PDF file in question using Adobe
Acrobat. If you can successfully do this (and the pasted text looks
correct), but PDFxStream is not delivering text, or the text it is
delivering is incorrect in some way, then you have likely discovered a
bug in PDFxStream. In this case, please [let us
know](http://snowtide.com/contact),
and we'll work to resolve the problem straight away.

# What was the inspiration for PDFTextStream?

In Snowtide's earliest days (circa 2001-2002), we were focused on
building search and data mining tools for professional researchers. As
such, we were very interested in finding a high-quality library that
would enable our software to extract content from PDF documents so that
it could provide search functionality for PDF content.

However, we consistently found all of the available PDF content
extraction libraries to be unacceptable in various ways. Some had
significant accuracy problems, many had APIs that unfortunately
presented a literal representation of a PDF document's data structures
(which are extremely complicated and ill-suited for high-performance,
accurate content extraction), many had serious PDF file format
incompatibilities, and nearly all of them were quite slow.

So, we embarked on an effort to build our own PDF content extraction
functionality. We quickly discovered why all of the other libraries that
we had evaluated had various flaws that we found unacceptable -- the
problem of PDF text extraction is a very difficult and complex one. We
further found it likely that if we could build a better mousetrap in
this context, we would find a very open and appreciative market for that
mousetrap.

Two years later, in the summer of 2004, we released PDFTextStream. It set
(and continues to maintain) a new gold standard for PDF content
extraction accuracy, performance, PDF file format compatibility, and
"developer friendliness" (thanks to its significantly simpler API).