Extracting text from PDF documents
PDFTextStream provides three different ways to extract text from PDF documents:
- Use PDFTextStream as a
java.io.Reader
- Extract text using PDFTextStream's
com.snowtide.pdf.OutputHandler
interface and default implementations - Directly traverse the PDF document model provided by PDFTextStream
We'll look at each of these approaches in turn.
PDFTextStream as a java.io.Reader
While performance and text extraction accuracy are critical concerns of
PDFTextStream, it also aims to be extraordinarily easy to use within your
applications. It subclasses java.io.Reader
, and therefore can
be used anywhere code expects to work with a java.io.Reader
instance.
Using PDFTextStream in this way is so simple, an example would belabor the
point. If you have (or can create) an API that accepts a java.io.Reader
instance as an argument, you can pass a PDFTextStream
instance
to that API.
Things that should be noted at this point:
- If PDFTextStream’s
java.io.Reader
interface is used, then the output of PDFTextStream should not be buffered using ajava.io.BufferedReader
. The PDFTextStream instance already reads the text out of the provided PDF file one page at a time, performing any necessary buffering automatically. This is done because generic buffering strategies are not effective when reading content out of a PDF -- the structure of PDF files is very complex, requiring specialized buffering techniques for optimum performance. Any additional buffering would diminish performance and increase memory consumption. - Regardless of which extraction method you use, PDFTextStream instances need to be closed when they are no longer needed. This ensures that various system-level resources are released after all of the text is read out of the PDF file.
Redirecting extracted PDF content using OutputHandler
s
While it is easily approachable, treating extracted PDF content as any other
source of readable text via the java.io.Reader
abstraction is
often not convenient. In many cases, getting extracted text into an
in-memory buffer or written out to disk is more helpful. For these cases,
you can take advantage of OutputTarget
, the most commonly-used
OutputHandler
implementation included in PDFTextStream:
public StringBuffer getPDFText (File pdfFile) throws IOException { PDFTextStream stream = new PDFTextStream(pdfFile); StringBuffer sb = new StringBuffer(1024); // get OutputTarget configured to pipe text to the provided StringBuffer OutputTarget tgt = OutputTarget.forBuffer(sb); stream.pipe(tgt); stream.close(); return sb; }
Here, a com.snowtide.pdf.OutputTarget
instance is
used instead of the java.io.Reader
interface. When a PDFTextStream
instance is provided with an OutputTarget
via its pipe(OutputHandler)
method, it sends all of its output to that OutputHandler
(we'll
see later why and how you might create your own OutputHandler
implementations). This eliminates some internal buffering that would
otherwise be needed to support the java.io.Reader
interface,
and prevents you from having to create a temporary buffer to take advantage
of PDFTextStream’s java.io.Reader
implementation.
This is certainly cleaner than using the java.io.Reader
interface if all you need to do is read all of the text out of a PDF
document. OutputTarget
instances can also be used to send a
PDFTextStream’s output directly to a local file:
public void savePDFText (File pdfFile, File textFile) throws IOException { PDFTextStream stream = new PDFTextStream(pdfFile); BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(textFile))); // get OutputTarget configured to pipe text to the provided file path OutputTarget tgt = new OutputTarget(writer); stream.pipe(tgt); writer.flush(); writer.close(); stream.close(); }
This is a significant improvement compared to the repetitive code that you would otherwise need to write to gather PDFTextStream’s output into a buffer, and then write that buffer out to a file.
Extracting Text, Page by Page
PDFTextStream also provides access to the text content of individual pages in PDF documents. For example, these functions extract only the fifth page of text from the given file, and the first 3 pages of text, respectively:
public String getFifthPageText (File pdf) throws IOException { PDFTextStream stream = new PDFTextStream(pdf); StringBuffer sb = new StringBuffer(1024); OutputTarget tgt = new OutputTarget(sb); Page fifthPage = stream.getPage(4); // page numbers are 0-based fifthPage.pipe(tgt); stream.close(); return sb.toString(); } public String getFirstThreePagesText (File pdf) throws IOException { PDFTextStream stream = new PDFTextStream(pdf); StringBuffer sb = new StringBuffer(1024); OutputTarget tgt = OutputTarget.forBuffer(sb); for (int pagenum = 0; pagenum < 3; pagenum++) { Page page = stream.getPage(pagenum); page.pipe(tgt); } stream.close(); return sb.toString(); }
Pages in PDF documents are represented bycom.snowtide.pdf.Page
instances. In addition to providing a pipe(OutputTarget)
method
(which works just like the PDFTextStream.pipe(OutputTarget)
method we've already seen, but only within the context of a single page), com.snowtide.pdf.Page
instances provide a set of functions for accessing a range of page-related
attributes. These attributes include page height, width, rotation, and more
(see the API reference for com.snowtide.pdf.Page
for details).
The PDFTextStream Document Model
PDF documents specify their text content one character at a time, without any indication of physical structure (such as lines, paragraphs, columns, etc). Therefore, PDFTextStream must employ advanced document understanding processes to derive the structure of each PDF document page it is asked to extract. These processes gather characters into lines, lines into blocks, blocks into columns, and so on. The document structure that these entities represent and the API that PDFTextStream exposes for developers to work with them collectively forms the PDFTextStream document model.
The document model is necessarily hierarchical, and its API mirrors that hierarchy:
Pages (com.snowtide.pdf.Page
) contain Blocks (com.snowtide.pdf.layout.Block
).
Blocks may contain other Blocks or Lines (com.snowtide.pdf.layout.Line
)
(but not both). Lines contain TextUnits (com.snowtide.pdf.layout.TextUnit
),
which roughly represent single characters.
This structure is rooted at the page-level by an object that implements the
com.snowtide.pdf.layout.BlockParent
interface, available via
the Page.getTextContent()
function. Objects that implement theBlockParent
interface contain an ordered set of com.snowtide.pdf.layout.Block
instances, each of which represent a block of text presented on a page
within a PDF document. In most circumstances, each Block
instance corresponds to a paragraph of text.
Block
s, Line
s, and TextUnit
s all
implement the com.snowtide.pdf.layout.Region
interface, which
allows for the retrieval of their positioning on the page (x- and
y-coordinates, height, width, etc.). Additionally, Block
s also
implement the BlockParent
interface, which means that Block
s
can contain other Block
s. This is necessitated by document
structures such as tables, where distinct blocks of content must be grouped
and ordered together. Use the PDFTextStream javadoc to find the particulars
of how to traverse the document model – each document model entity presents
a very simple list-like API that should be essentially self-explanatory.
TextUnit
Details
To anyone who does not know the inner workings of PDF documents and how fonts and encodings work in the PDF document specification, "text unit" might seem to be a strange diversion from the straightforward names given to the other parts of the PDFTextStream document model ("page", "block", "line", and so on). It is worth exploring why these entities are called "TextUnits" and not simply "Characters".
A quick look at the javadoc of com.snowtide.pdf.layout.TextUnit reveals a few interesting functions: TextUnit.getCharacterSequence() and TextUnit.getCharCode(). Text in PDF documents is encoded as a series of character codes (available via TextUnit.getCharCode()), which are then mapped to a concrete sequence of Unicode characters (available via TextUnit.getCharacterSequence()) based on the encoding that is specified by a PDF document. That sounds straightforward enough until one realizes that PDF documents can encode more than one Unicode character for each individual raw character code.
For example, a PDF document might specify that the character code 188 should be mapped to the Unicode (and ASCII) character ‘f’. However, it could specify instead that the character code 189 should be mapped to a sequence of Unicode characters, such as "fi" or "ae". Therefore, each TextUnit instance can represent an indeterminate number of Unicode characters.
Please note that the font in effect when each TextUnit is outputted is available via TextUnit.getFont().
OutputHandlers: Text Extraction using Document Model Events
In many applications, simple text extraction as shown in the above sections
is not enough to meet requirements. Sometimes a PDF document is so large
that extracting all of its text would strain your application’s available
resources. Perhaps your application needs to produce something other than
plain text, such as an HTML version of the PDF document. The best way to
meet such requirements is to utilize the com.snowtide.pdf.OutputHandler
interface.
The OutputHandler
interface is directly analogous to the
lightweight SAX XML ContentHandler
interface. Just like XML, PDFTextStream
defines a document model that can be traversed systematically using a
random-access interface (which is called DOM in the XML world). But also
just like XML, PDFTextStream also provides a way to process document content
in a lightweight, evented fashion. The OutputHandler
interface
represents this second option. It defines a range of functions that can be
selectively implemented to, for example, only be notified of character-level
data. An OutputHandler
subclass that does this by overriding
the OutputHandler.textUnit(TextUnit)
function will receive one
event (in the form of a TextUnit
object) for every TextUnit
object in a given PDF document page or block. The OutputHandler
subclass can then take whatever action is necessary given its purpose –
write the TextUnit
’s content to disk, send it over a network
connection, make note of where the TextUnit
is located on the
page for display purposes, and so on.
Each PDFTextStream
, com.snowtide.pdf.Page
, and com.snowtide.pdf.layout.Block
instance provides a pipe(OutputHandler)
function. Invoking this
function on any instance of these classes will cause the appropriate PDF
document model events to be sent to the provided OutputHandler
object in the natural order that they occur. For example, just before
starting the events associated with a block of content, the OutputHandler
’s
startBlock(Block)
function will be called with that Block
instance as a parameter; when all of the child entities of that Block have
been delivered (its child blocks, its child lines, its child TextUnit
s,
etc), the OutputHandler
’s endBlock(Block)
function
will be called. Events bookend content like this for all of the containers
in the PDFTextStream document model: the PDF document itself (startPDF(String,
File)
and endPDF(String, File)
), pages (startPage(Page)
and endPage(Page)
), blocks (as has already been discussed), and
lines (startLine(Line)
and endLine(Line)
).
Source code for example OutputHandler
implementations are
included with your PDFTextStream distribution. The pdfts.examples.GoogleHTMLOutputHandler
sample will produce an XHTML document that roughly duplicates the spirit of
the "view as text" page that Google provides for PDF search results. Another
OutputHandler
example, pdfts.examples.XMLOutputTarget
,
writes an XML document directly to a provided StringBuffer
that
includes structural document information, as well as indications of text
formatting (e.g. bolding, underlining, strikethroughs, italics, etc.).
Finally, a .NET-specific OutputHandler
implementation example
can be found here.