Using PDFxStream in Multiple-CPU, Multithreaded Environments

Web Application Environments

Most web application deployment environments provide mechanisms for throttling how many concurrent requests are processed, and many of those mechanisms are fully automated. This generally keeps your deployment environment's CPU resources well utilized. Because of this, in most circumstances, you can use PDFxStream freely, without regard for how many CPUs are present or what their load is.

For example, consider a Java servlet that accepts an upload of a PDF document, and persists the document and its text content to a database for indexing and summarization later. Thanks to the automatic resource (CPU) management provided by most Java application servers, that servlet can be written quite simply:

// proper exception handling and package imports not included for brevity
public void doPost (HttpServletRequest req, HttpServletResponse resp) {
    // [ ... obtain upload data from request ... ]
    byte[] pdfData = getUploadData(req);
 
    // [ ... use PDFxStream to extract PDF text ... ]
    Document pdf = PDF.open(pdfData);
    StringWriter pdfText = new StringWriter(1024);
    pdf.pipe(new OutputTarget(pdfText));
    pdf.close();
     
    // [ ... store PDF document and its text in database ... ]
    persistToDatabase(pdfData, pdfText);
 
    // [ ... redirect to view ... ]
}

It should be noted that it is almost always a bad idea to manually create new threads within the context of a managed deployment environment like a Java servlet container. Such environments provide significant resources dedicated to managing threads and processes; by creating your own threads in an attempt to fully utilize all of the CPUs in a deployment environment, you may end up inadvertently compromising the overall performance of your web application.

Standalone Applications

When building a standalone application, you typically need to manage concurrent processes without the benefit of the frameworks provided by web application servers. To do this, you should become familiar with the details of threading in your preferred environment (Java/JVM or the .NET CLR) to ensure consistent application performance and maximal CPU utilization.

Good starting points for learning how to manage threads efficiently include:

Regardless of which thread-management library or approach your team uses, there is one overarching guideline that should be adhered to in order to ensure that all of your deployment environment's CPU resources are being fully utilized: for each available CPU, ensure that at least one thread is running that is using PDFxStream. Of course, this guideline is no replacement for proper application tuning, but it will get you off the ground and avoid wasting available resources.
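
As a rough illustration of the one-thread-per-CPU guideline, the following sketch sizes a standard java.util.concurrent thread pool to the number of available processors and submits one extraction task per document. The class name PerCpuExtraction, the extractAll method, and the extractor function are hypothetical names introduced for this example and are not part of PDFxStream. In a real application, the extractor would wrap your own PDFxStream extraction code; a placeholder that decodes bytes as text is used here so that the example runs on its own.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper class for illustration; not part of PDFxStream.
final class PerCpuExtraction {

    // Applies `extractor` to every document using a pool with one worker
    // thread per available CPU. Results keep the order of the input list.
    static List<String> extractAll(List<byte[]> documents,
                                   Function<byte[], String> extractor) {
        int cpus = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cpus);
        try {
            // Queue one task per document; the pool runs at most `cpus` at once.
            List<Future<String>> pending = new ArrayList<>();
            for (byte[] doc : documents) {
                Callable<String> task = () -> extractor.apply(doc);
                pending.add(pool.submit(task));
            }
            // Collect results in submission order, blocking as needed.
            List<String> results = new ArrayList<>();
            for (Future<String> future : pending) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while extracting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Extraction task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Placeholder extractor (assumption): stands in for code that would
        // open each document with PDFxStream and return its text.
        Function<byte[], String> placeholder =
                data -> new String(data, StandardCharsets.UTF_8);

        List<byte[]> docs = Arrays.asList(
                "first document".getBytes(StandardCharsets.UTF_8),
                "second document".getBytes(StandardCharsets.UTF_8));

        System.out.println(extractAll(docs, placeholder));
    }
}
```

A fixed-size pool is only one reasonable starting point. If your extraction tasks spend significant time waiting on disk or network I/O, a somewhat larger pool may keep the CPUs busier, which is where the application tuning mentioned above comes in.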