Using PDFTextStream in Multiple-CPU, Multithreaded Environments

Web Application Environments

J2EE web application containers provide mechanisms for limiting how many requests are processed concurrently, and many of those mechanisms are fully automated. As a result, the container generally keeps your deployment environment's CPUs well utilized under load. In most circumstances, you can therefore use PDFTextStream freely, without regard for how many CPUs are present or what their load is.

For example, consider a J2EE servlet that accepts an upload of a PDF document, and persists the document and its text content to a database for indexing and summarization later. Thanks to the automatic resource (CPU) management provided by most J2EE application servers, that servlet can be written quite simply:

// proper exception handling is not included for brevity;
// getUploadData and persistToDatabase are application-specific helpers
public void doPost (HttpServletRequest req, HttpServletResponse resp)
        throws ServletException, IOException {
    // [ ... obtain upload data from request ... ]
    byte[] pdfData = getUploadData(req);

    // [ ... use PDFTextStream to convert PDF to text ... ]
    StringBuffer pdfText = new StringBuffer(1024);
    PDFTextStream stream = new PDFTextStream(pdfData);
    try {
        OutputTarget tgt = OutputTarget.forBuffer(pdfText);
        stream.pipe(tgt);
    } finally {
        // always release the PDF, even if extraction fails
        stream.close();
    }

    // [ ... store PDF document and conversion text to database ... ]
    persistToDatabase(pdfData, pdfText);

    // [ ... redirect to view ... ]
}

Manually creating new threads within a J2EE application is almost always a bad idea. Your webapp container dedicates significant resources to managing threads and processes; by creating your own threads in an attempt to fully utilize all of the CPUs in a deployment environment, you may inadvertently compromise the overall performance of your web application.

Standalone Applications

When building a standalone application, you typically need to manage concurrent processes without the benefit of the kinds of frameworks provided by J2EE application servers. To do this, you should become familiar with the details of threading in the Java environment, and with how best to use threads to ensure consistent application performance and maximal CPU utilization.

For a good foundational overview of the principal approaches to managing threads, refer to this article on the IBM developerWorks site: Java theory and practice: Thread pools and work queues.

The toolkits used to implement the thread management approaches presented in that article have progressed significantly since its publication in 2002; the links below are to currently supported APIs:

  • If your team is deploying to a JDK 1.5 environment, then the included java.util.concurrent package is likely the best option for working with thread pools and/or work queues. There is a fantastic tutorial on how to make the most of this toolkit on Oracle's Java developer site: Concurrent Programming with J2SE 5.0.
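
As a rough illustration of the thread pool and work queue approach with java.util.concurrent, the sketch below queues one extraction task per document and collects the results in submission order. The class name ExtractionQueue and the extractText method are hypothetical; extractText is only a stand-in for the PDFTextStream calls shown in the servlet example, so that the sketch compiles without the PDFTextStream jar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ExtractionQueue {
    // Hypothetical placeholder: a real application would run PDFTextStream here.
    static String extractText(byte[] pdfData) {
        return "extracted " + pdfData.length + " bytes";
    }

    // Queues one task per document on a fixed-size pool and waits for all of them.
    static List<String> extractAll(List<byte[]> documents, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<Future<String>>();
            for (final byte[] doc : documents) {
                pending.add(pool.submit(new Callable<String>() {
                    public String call() {
                        return extractText(doc);
                    }
                }));
            }
            // Future.get() blocks until the corresponding task has finished.
            List<String> results = new ArrayList<String>();
            for (Future<String> f : pending) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for extraction", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("extraction task failed", e.getCause());
        } finally {
            // Stop accepting new work; worker threads exit once the queue drains.
            pool.shutdown();
        }
    }
}
```

A caller would pass the raw PDF data for each document, for example ExtractionQueue.extractAll(documents, 4). Creating the pool once and reusing it across batches is usually preferable to creating one per call; the per-call pool here only keeps the sketch short.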

Regardless of which thread-management library or approach your team uses, one overarching guideline helps ensure that your deployment environment's CPU resources are fully utilized: for each available CPU, ensure that at least one thread using PDFTextStream is running. Of course, this guideline is no replacement for proper application tuning, but it will get you off the ground and avoid wasting available resources.
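
One way to apply this guideline, sketched here under the assumption that java.util.concurrent is available, is to size a fixed thread pool from the CPU count the JVM reports. The class and method names (CpuSizedPool, workerCount, newExtractionPool) are illustrative only and are not part of PDFTextStream.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class CpuSizedPool {
    // One worker per CPU reported by the JVM, and never fewer than one.
    static int workerCount() {
        return Math.max(1, Runtime.getRuntime().availableProcessors());
    }

    // Fixed-size pool (core size == maximum size) backed by an unbounded work queue.
    static ThreadPoolExecutor newExtractionPool() {
        int n = workerCount();
        return new ThreadPoolExecutor(n, n, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<Runnable>());
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = newExtractionPool();
        System.out.println("extraction workers: " + pool.getCorePoolSize());
        pool.shutdown();
    }
}
```

Note that the pool starts its threads lazily as tasks arrive, so every CPU has a worker only while enough extraction tasks are queued. Whether one thread per CPU is the best setting depends on how much time each task spends waiting on I/O, which is where application tuning comes in.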