Extracting text from PDF documents

PDFTextStream provides three different ways to extract text from PDF documents:

  • Use PDFTextStream as a java.io.Reader
  • Extract text using PDFTextStream's com.snowtide.pdf.OutputHandler interface and default implementations
  • Directly traverse the PDF document model provided by PDFTextStream

We'll look at each of these approaches in turn.

PDFTextStream as a java.io.Reader

While performance and text extraction accuracy are critical concerns of PDFTextStream, it also aims to be extraordinarily easy to use within your applications. It subclasses java.io.Reader, and therefore can be used anywhere code expects to work with a java.io.Reader instance.

Using PDFTextStream in this way is so simple, an example would belabor the point. If you have (or can create) an API that accepts a java.io.Reader instance as an argument, you can pass a PDFTextStream instance to that API.

Things that should be noted at this point:

  • If PDFTextStream’s java.io.Reader interface is used, then the output of PDFTextStream should not be buffered using a java.io.BufferedReader. The PDFTextStream instance already reads the text out of the provided PDF file one page at a time, performing any necessary buffering automatically. This is done because generic buffering strategies are not effective when reading content out of a PDF -- the structure of PDF files is very complex, requiring specialized buffering techniques for optimum performance. Any additional buffering would diminish performance and increase memory consumption.
  • Regardless of which extraction method you use, PDFTextStream instances need to be closed when they are no longer needed. This ensures that various system-level resources are released after all of the text is read out of the PDF file.

Redirecting extracted PDF content using OutputHandlers

While it is easily approachable, treating extracted PDF content as any other source of readable text via the java.io.Reader abstraction is often not convenient. In many cases, getting extracted text into an in-memory buffer or written out to disk is more helpful. For these cases, you can take advantage of OutputTarget, the most commonly-used OutputHandler implementation included in PDFTextStream:

public StringBuffer getPDFText (File pdfFile) throws IOException {
    PDFTextStream stream = new PDFTextStream(pdfFile);
    StringBuffer sb = new StringBuffer(1024);
    // get OutputTarget configured to pipe text to the provided StringBuffer
    OutputTarget tgt = OutputTarget.forBuffer(sb);
    stream.pipe(tgt);
    stream.close();
    return sb;
}

Here, a com.snowtide.pdf.OutputTarget instance is used instead of the java.io.Reader interface. When a PDFTextStream instance is provided with an OutputTarget via its pipe(OutputHandler) method, it sends all of its output to that OutputHandler (we'll see later why and how you might create your own OutputHandler implementations). This eliminates some internal buffering that would otherwise be needed to support the java.io.Reader interface, and prevents you from having to create a temporary buffer to take advantage of PDFTextStream’s java.io.Reader implementation.

This is certainly cleaner than using the java.io.Reader interface if all you need to do is read all of the text out of a PDF document. OutputTarget instances can also be used to send a PDFTextStream’s output directly to a local file:

public void savePDFText (File pdfFile, File textFile) throws IOException {
    PDFTextStream stream = new PDFTextStream(pdfFile);
    BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(textFile)));
    // get OutputTarget configured to pipe text to the provided file path
    OutputTarget tgt = new OutputTarget(writer);
    stream.pipe(tgt);
    writer.flush();
    writer.close();
    stream.close();
}

This is a significant improvement compared to the repetitive code that you would otherwise need to write to gather PDFTextStream’s output into a buffer, and then write that buffer out to a file.

Extracting Text, Page by Page

PDFTextStream also provides access to the text content of individual pages in PDF documents. For example, these functions extract only the fifth page of text from the given file, and the first 3 pages of text, respectively:

public String getFifthPageText (File pdf) throws IOException {
    PDFTextStream stream = new PDFTextStream(pdf);
    StringBuffer sb = new StringBuffer(1024);
    OutputTarget tgt = new OutputTarget(sb);
    Page fifthPage = stream.getPage(4); // page numbers are 0-based
    fifthPage.pipe(tgt);
    stream.close();
    return sb.toString();
}

public String getFirstThreePagesText (File pdf) throws IOException {
    PDFTextStream stream = new PDFTextStream(pdf);
    StringBuffer sb = new StringBuffer(1024);
    OutputTarget tgt = OutputTarget.forBuffer(sb);
    for (int pagenum = 0; pagenum < 3; pagenum++) {
        Page page = stream.getPage(pagenum);
	page.pipe(tgt);
    }
    stream.close();
    return sb.toString();
}

Pages in PDF documents are represented bycom.snowtide.pdf.Page instances. In addition to providing a pipe(OutputTarget) method (which works just like the PDFTextStream.pipe(OutputTarget) method we've already seen, but only within the context of a single page), com.snowtide.pdf.Page instances provide a set of functions for accessing a range of page-related attributes. These attributes include page height, width, rotation, and more (see the API reference for com.snowtide.pdf.Page for details).

The PDFTextStream Document Model

PDF documents specify their text content one character at a time, without any indication of physical structure (such as lines, paragraphs, columns, etc). Therefore, PDFTextStream must employ advanced document understanding processes to derive the structure of each PDF document page it is asked to extract. These processes gather characters into lines, lines into blocks, blocks into columns, and so on. The document structure that these entities represent and the API that PDFTextStream exposes for developers to work with them collectively forms the PDFTextStream document model.

The document model is necessarily hierarchical, and its API mirrors that hierarchy:

Pages (com.snowtide.pdf.Page) contain Blocks (com.snowtide.pdf.layout.Block). Blocks may contain other Blocks or Lines (com.snowtide.pdf.layout.Line) (but not both). Lines contain TextUnits (com.snowtide.pdf.layout.TextUnit), which roughly represent single characters.

This structure is rooted at the page-level by an object that implements the com.snowtide.pdf.layout.BlockParent interface, available via the Page.getTextContent() function. Objects that implement theBlockParent interface contain an ordered set of com.snowtide.pdf.layout.Block instances, each of which represent a block of text presented on a page within a PDF document. In most circumstances, each Block instance corresponds to a paragraph of text.

Blocks, Lines, and TextUnits all implement the com.snowtide.pdf.layout.Region interface, which allows for the retrieval of their positioning on the page (x- and y-coordinates, height, width, etc.). Additionally, Blocks also implement the BlockParent interface, which means that Blocks can contain other Blocks. This is necessitated by document structures such as tables, where distinct blocks of content must be grouped and ordered together. Use the PDFTextStream javadoc to find the particulars of how to traverse the document model – each document model entity presents a very simple list-like API that should be essentially self-explanatory.

TextUnit Details

To anyone who does not know the inner workings of PDF documents and how fonts and encodings work in the PDF document specification, "text unit" might seem to be a strange diversion from the straightforward names given to the other parts of the PDFTextStream document model ("page", "block", "line", and so on). It is worth exploring why these entities are called "TextUnits" and not simply "Characters".

A quick look at the javadoc of com.snowtide.pdf.layout.TextUnit reveals a few interesting functions: TextUnit.getCharacterSequence() and TextUnit.getCharCode(). Text in PDF documents is encoded as a series of character codes (available via TextUnit.getCharCode()), which are then mapped to a concrete sequence of Unicode characters (available via TextUnit.getCharacterSequence()) based on the encoding that is specified by a PDF document. That sounds straightforward enough until one realizes that PDF documents can encode more than one Unicode character for each individual raw character code.

For example, a PDF document might specify that the character code 188 should be mapped to the Unicode (and ASCII) character ‘f’. However, it could specify instead that the character code 189 should be mapped to a sequence of Unicode characters, such as "fi" or "ae". Therefore, each TextUnit instance can represent an indeterminate number of Unicode characters.

Please note that the font in effect when each TextUnit is outputted is available via TextUnit.getFont().

OutputHandlers: Text Extraction using Document Model Events

In many applications, simple text extraction as shown in the above sections is not enough to meet requirements. Sometimes a PDF document is so large that extracting all of its text would strain your application’s available resources. Perhaps your application needs to produce something other than plain text, such as an HTML version of the PDF document. The best way to meet such requirements is to utilize the com.snowtide.pdf.OutputHandler interface.

The OutputHandler interface is directly analogous to the lightweight SAX XML ContentHandler interface. Just like XML, PDFTextStream defines a document model that can be traversed systematically using a random-access interface (which is called DOM in the XML world). But also just like XML, PDFTextStream also provides a way to process document content in a lightweight, evented fashion. The OutputHandler interface represents this second option. It defines a range of functions that can be selectively implemented to, for example, only be notified of character-level data. An OutputHandler subclass that does this by overriding the OutputHandler.textUnit(TextUnit) function will receive one event (in the form of a TextUnit object) for every TextUnit object in a given PDF document page or block. The OutputHandler subclass can then take whatever action is necessary given its purpose – write the TextUnit’s content to disk, send it over a network connection, make note of where the TextUnit is located on the page for display purposes, and so on.

Each PDFTextStream, com.snowtide.pdf.Page, and com.snowtide.pdf.layout.Block instance provides a pipe(OutputHandler) function. Invoking this function on any instance of these classes will cause the appropriate PDF document model events to be sent to the provided OutputHandler object in the natural order that they occur. For example, just before starting the events associated with a block of content, the OutputHandler’s startBlock(Block) function will be called with that Block instance as a parameter; when all of the child entities of that Block have been delivered (its child blocks, its child lines, its child TextUnits, etc), the OutputHandler’s endBlock(Block) function will be called. Events bookend content like this for all of the containers in the PDFTextStream document model: the PDF document itself (startPDF(String, File) and endPDF(String, File)), pages (startPage(Page) and endPage(Page)), blocks (as has already been discussed), and lines (startLine(Line) and endLine(Line)).

Source code for example OutputHandler implementations are included with your PDFTextStream distribution. The pdfts.examples.GoogleHTMLOutputHandler sample will produce an XHTML document that roughly duplicates the spirit of the "view as text" page that Google provides for PDF search results. Another OutputHandler example, pdfts.examples.XMLOutputTarget, writes an XML document directly to a provided StringBuffer that includes structural document information, as well as indications of text formatting (e.g. bolding, underlining, strikethroughs, italics, etc.).

Finally, a .NET-specific OutputHandler implementation example can be found here.