Error handling
PDFTextStream is designed to throw only java.io.IOException
exceptions, whether you are invoking one of its constructors or any of
its other methods. In the simplest cases, this means you only need to
catch IOException instances.
However, in a few special cases, PDFTextStream throws more specific exceptions to indicate that particular kinds of errors have occurred. Each of these exception types subclasses IOException, which helps keep prototyping code simple and clean.
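The subclassing point can be illustrated with a minimal sketch. The two exception classes below are hypothetical stand-ins that only mimic the "extends IOException" relationship described above; they are not PDFTextStream's own classes, and PdfTask is likewise an illustrative interface.

```java
import java.io.IOException;

// Hypothetical stand-ins for illustration only. They are NOT the library's
// classes; they simply extend IOException the way the text describes.
class StandInEncryptedPdfException extends IOException {
    StandInEncryptedPdfException(String message) { super(message); }
}

class StandInFaultyPdfException extends IOException {
    StandInFaultyPdfException(String message) { super(message); }
}

class SingleCatchSketch {
    // Illustrative placeholder for "some work that reads a PDF".
    interface PdfTask {
        void run() throws IOException;
    }

    // Prototype-style handling: one catch clause for IOException also
    // catches every subclass, so no further catch clauses are required.
    static boolean succeeded(PdfTask task) {
        try {
            task.run();
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

Because a catch clause matches subclasses of the type it names, code written this way keeps compiling and working even when a more specific exception is thrown. Handling the specific types separately only requires adding their catch clauses before the IOException clause.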
EncryptedPDFException
We saw in the last section that PDFTextStream’s constructors can throw EncryptedPDFExceptions when an encryption-related error occurs. Please refer to the examples and explanation in the previous section for details on this exception type.
FaultyPDFException
PDFTextStream is also capable of throwing com.snowtide.pdf.FaultyPDFException
from its constructors, as well as from most of its other functions that
access PDF data. This exception type is thrown when PDFTextStream encounters
file data that it doesn’t understand. This indicates one of the following:
- The file in question is not a PDF document
- The file is a PDF document, but is corrupted or otherwise unusable, and PDFTextStream cannot repair it
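When triaging failures, it can help to tell these two cases apart. The sketch below is one possible approach, not part of PDFTextStream: it checks whether the data begins with the "%PDF-" marker that PDF files normally start with. The class and method names are illustrative, and the check is deliberately strict, so treat a negative result as a hint rather than a verdict (some PDF consumers are lenient about bytes preceding the marker).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Illustrative helper, independent of PDFTextStream.
class PdfHeaderCheck {
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // True if the given bytes begin with the "%PDF-" marker.
    static boolean startsWithPdfHeader(byte[] data) {
        if (data.length < MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, MAGIC.length), MAGIC);
    }

    // Reads only the first few bytes of the file and applies the same check.
    static boolean startsWithPdfHeader(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[MAGIC.length];
            int filled = 0;
            while (filled < head.length) {
                int read = in.read(head, filled, head.length - filled);
                if (read < 0) {
                    break; // file is shorter than the marker
                }
                filled += read;
            }
            return startsWithPdfHeader(Arrays.copyOf(head, filled));
        }
    }
}
```

A file that fails this check is probably not a PDF at all; a file that passes it but still triggers a FaultyPDFException is more likely to be corrupted.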
Exception Handling Patterns
In production environments, especially when PDFTextStream is extracting content from PDF documents sourced from untrusted parties (for example, when indexing PDF documents found on the internet), handling these exceptions properly is important for monitoring the results of your PDF content extraction efforts.
The pattern below is typical for such environments. It shows how to handle each of the three types of exceptions most commonly seen when working with PDFTextStream.
    public static String extractPDFText (File pdfFile) {
        try {
            PDFTextStream stream = new PDFTextStream(pdfFile);
            StringBuffer sb = new StringBuffer(1024);
            OutputTarget tgt = new OutputTarget(sb);
            stream.pipe(tgt);
            stream.close();
            return sb.toString();
        } catch (EncryptedPDFException e) {
            System.out.println("PDF document (" + pdfFile.getAbsolutePath() +
                    ") is encrypted...");
        } catch (FaultyPDFException e) {
            System.out.println("PDF document (" + pdfFile.getAbsolutePath() +
                    ") cannot be read because: " + e.getMessage());
        } catch (IOException e) {
            System.out.println("PDF document (" + pdfFile.getAbsolutePath() +
                    ") caused general IO error: " + e.getMessage());
        }
        return null;
    }
Logging these errors to System.out isn’t what one would do in
production, but the pattern is the same: just insert the appropriate
logging or other application-specific routines for handling each type
of exception.
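As one possible shape for such a routine, the sketch below logs through java.util.logging and keeps a per-type failure count that a monitoring job could report. ExtractionMonitor and its Extraction interface are hypothetical names introduced for illustration; nothing here is part of PDFTextStream, and the sketch groups failures by exception class name instead of using separate catch clauses.

```java
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import java.util.logging.Level;
import java.util.logging.Logger;

// Illustrative monitoring helper; not part of PDFTextStream.
class ExtractionMonitor {
    // Placeholder for any extraction call that may throw IOException
    // (or one of its subclasses).
    interface Extraction {
        String run() throws IOException;
    }

    private static final Logger LOG =
            Logger.getLogger(ExtractionMonitor.class.getName());

    // Failure counts keyed by the simple class name of the exception.
    private final Map<String, Integer> failuresByType = new TreeMap<>();

    // Runs the extraction; on failure, logs it, counts it, and returns null.
    String attempt(String label, Extraction extraction) {
        try {
            return extraction.run();
        } catch (IOException e) {
            String type = e.getClass().getSimpleName();
            failuresByType.merge(type, 1, Integer::sum);
            LOG.log(Level.WARNING,
                    "Extraction failed for " + label + " (" + type + ")", e);
            return null;
        }
    }

    Map<String, Integer> failures() {
        return failuresByType;
    }
}
```

Under these assumptions, the body of a method like extractPDFText would be passed in as the Extraction, and the map returned by failures() would show how many documents failed for each exception type over a run.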