Extracting text from PDF documents
PDFxStream provides two ways to extract text from PDF documents:
- The com.snowtide.pdf.OutputHandler interface and its various implementations direct extracted text at the document, page, or block level to files and in-memory buffers, while optionally applying arbitrary formatting logic.
- A document model corresponding to the content found on each page of a PDF that can be traversed to selectively access and extract content based on arbitrary application and business logic.
Most applications will use the former, and PDFxStream ships with a
number of OutputHandler implementations designed to satisfy common text
extraction use cases.
The most commonly used OutputHandler is com.snowtide.pdf.OutputTarget.
It extracts all text from the document, page, or block being piped as
efficiently as possible, aiming to produce content in a natural reading
order suitable for content-oriented processing (such as indexing and
search, summarization, dictation, and translation). Here's an example
showing OutputTarget in use; aside from minor points of configuration,
all OutputHandler usage follows the same pattern:
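public String getPDFText (File pdfFile) throws IOException {
    // try-with-resources closes the document once extraction finishes
    try (Document pdf = PDF.open(pdfFile)) {
        StringBuilder buf = new StringBuilder();
        // Send the whole document's text into the in-memory buffer
        pdf.pipe(new OutputTarget(buf));
        return buf.toString();
    }
}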
When a pipe(OutputHandler) method is invoked, all of the content held
by the object being piped is sent (in document-model order) to the
OutputHandler, which can decide what content should or should not be
included in the resulting extract, and what formatting should be applied
to it. The OutputHandler interface essentially defines a visitor pattern
over the domain of the PDFxStream document model, which makes
implementing custom formatting and extraction strategies
straightforward; see "OutputHandlers: Text Extraction using Document
Model Events" below for details.
pipe(OutputHandler) methods are available at all levels of the
PDFxStream document model:
- Document: com.snowtide.pdf.Document.pipe(OutputHandler)
- Page: com.snowtide.pdf.Page.pipe(OutputHandler)
- Block: com.snowtide.pdf.layout.Block.pipe(OutputHandler)
- Line: com.snowtide.pdf.layout.Line.pipe(OutputHandler)
So in the above example, invoking the Document.pipe(OutputHandler)
method sends the entire document's content to the OutputHandler. Piping
a block to an OutputHandler will yield only that block's content, and so
on.
OutputHandlers can generally write text content to any
java.lang.Appendable, which includes java.nio.CharBuffers and
java.io.Writers. This makes many use cases very easy to implement, such
as sending extracted PDF text directly to a local file, which is done
here for a single page:
public void savePDFText (File pdfFile, int pageNumber, File textFile) throws IOException {
    // try-with-resources closes the writer (flushing it) and then the
    // document, even if extraction throws part-way through
    try (Document pdf = PDF.open(pdfFile);
         BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(textFile)))) {
        Page page = pdf.getPage(pageNumber);
        // Only this page's text is written to the file
        page.pipe(new OutputTarget(writer));
    }
}
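The flexibility described above comes from the Appendable contract rather than from anything PDF-specific. As a minimal sketch using only JDK classes (it makes no PDFxStream calls), the hypothetical writeText helper below stands in for whatever a handler does when it emits text, and shows the same code feeding three different kinds of sink:

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.nio.CharBuffer;

class AppendableSinks {
    // Hypothetical stand-in for a handler emitting text: it relies only on
    // the java.lang.Appendable contract, so any implementation will do.
    static void writeText(Appendable sink, String text) {
        try {
            sink.append(text);
        } catch (IOException e) {
            // Appendable.append declares IOException; rethrow unchecked
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StringBuilder builder = new StringBuilder();  // growable, in memory
        StringWriter writer = new StringWriter();     // any java.io.Writer
        CharBuffer chars = CharBuffer.allocate(64);   // fixed capacity

        writeText(builder, "page text");
        writeText(writer, "page text");
        writeText(chars, "page text");
        chars.flip(); // switch the buffer from writing to reading

        System.out.println(builder);
        System.out.println(writer);
        System.out.println(chars);
    }
}
```

Note that a CharBuffer has a fixed capacity, so it only suits extracts whose maximum size is known in advance; a StringBuilder or a Writer is the safer default.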
The PDFxStream Document Model
PDF documents specify their textual content one character at a time, without any indication of physical structure (such as lines, blocks, columns, etc.) or logical organization (headers, paragraphs, captions, footers, etc.). Therefore, PDFxStream must employ advanced document understanding processes to derive the structure of each PDF document page it is presented with. These processes gather characters into lines, lines into blocks, blocks into columns, and so on. The document structure that these entities represent and the API that PDFxStream exposes for developers to work with them collectively form the PDFxStream document model.
The document model is necessarily hierarchical, and its API mirrors that hierarchy:
- com.snowtide.pdf.Page instances contain com.snowtide.pdf.layout.Block instances.
- Block instances may contain other Block instances or com.snowtide.pdf.layout.Line instances (but not both).
- Line instances contain com.snowtide.pdf.layout.TextUnit instances, which roughly represent single characters.
This structure is rooted at the page level by an object that implements
the com.snowtide.pdf.layout.BlockParent interface, available via
com.snowtide.pdf.Page.getTextContent(). Objects that implement the
BlockParent interface contain an ordered set of Blocks, each of which
represents a block of text presented on a page within a PDF document. In
most circumstances, each Block instance corresponds to a paragraph of
text.
Blocks, Lines, and TextUnits all implement the
com.snowtide.pdf.layout.Bounded interface, meaning that each provides a
com.snowtide.pdf.layout.Bounded.bounds() method that returns a bounding
box rectangle object describing its position on the page (x- and
y-coordinates, height, width, etc.). Additionally, Blocks implement the
BlockParent interface, which means that BlockParents can contain other
BlockParents. This is necessitated by document structures such as
tables, where distinct blocks of content must be grouped and ordered
together. Use the API reference to find the particulars of how to
traverse the document model -- each document model entity presents a
very simple list-like API that should be essentially self-explanatory.
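Because blocks can nest, walking the hierarchy is naturally recursive. The sketch below illustrates that shape only: SketchBlock, SketchLine, and collectLines are hypothetical stand-ins invented for this example, not PDFxStream classes, and the real accessors should be taken from the API reference.

```java
import java.util.ArrayList;
import java.util.List;

class ModelWalk {
    // Hypothetical stand-in for a line of text in the document model.
    static class SketchLine {
        final String text;
        SketchLine(String text) { this.text = text; }
    }

    // Hypothetical stand-in for a block: it holds either child blocks or
    // lines, never both, mirroring the rule described in the text.
    static class SketchBlock {
        final List<SketchBlock> childBlocks = new ArrayList<>();
        final List<SketchLine> lines = new ArrayList<>();
    }

    // Depth-first walk that gathers line text in document-model order.
    static void collectLines(SketchBlock block, List<String> out) {
        for (SketchBlock child : block.childBlocks) {
            collectLines(child, out);
        }
        for (SketchLine line : block.lines) {
            out.add(line.text);
        }
    }

    public static void main(String[] args) {
        // A table-like block grouping two cell blocks, each with one line
        SketchBlock table = new SketchBlock();
        SketchBlock cellA = new SketchBlock();
        cellA.lines.add(new SketchLine("Name"));
        SketchBlock cellB = new SketchBlock();
        cellB.lines.add(new SketchLine("Total"));
        table.childBlocks.add(cellA);
        table.childBlocks.add(cellB);

        List<String> lines = new ArrayList<>();
        collectLines(table, lines);
        System.out.println(lines); // prints [Name, Total]
    }
}
```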
TextUnit Details
To anyone who does not know the inner workings of PDF documents and how fonts and encodings work in the PDF document specification, "text unit" might seem to be a strange diversion from the straightforward names given to the other parts of the PDFxStream document model ("page", "block", "line", and so on). It is worth exploring why these entities are called "TextUnits" and not simply "Characters".
A quick look at the javadoc of TextUnit reveals a few interesting
methods: com.snowtide.pdf.layout.TextUnit.getCharacterSequence() and
com.snowtide.pdf.layout.TextUnit.getCharCode(). Text in PDF documents is
encoded as a series of character codes (available via
TextUnit.getCharCode()), which are then mapped to a concrete sequence of
Unicode characters (available via TextUnit.getCharacterSequence()) based
on the encoding that is specified by a PDF document. That sounds
straightforward enough until one realizes that PDF documents can map a
single raw character code to more than one Unicode character. For
example, a PDF document might specify that the character code 188 should
be mapped to the Unicode (and ASCII) character f. However, it could
instead specify that the character code 189 should be mapped to a
sequence of Unicode characters, such as fi or ae. Therefore, each
TextUnit instance can represent an indeterminate number of Unicode
characters.
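The one-code-to-many-characters mapping can be shown with a plain lookup table. This is an illustrative sketch only: the ENCODING map and decode method are hypothetical, the two entries simply echo the codes used in the example above, and a real PDF font encoding is considerably more involved.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class CharCodeDemo {
    // Hypothetical encoding table: raw character code -> Unicode sequence.
    static final Map<Integer, String> ENCODING = new LinkedHashMap<>();
    static {
        ENCODING.put(188, "f");  // one code, one character
        ENCODING.put(189, "fi"); // one code, two characters
    }

    // Decodes a run of character codes; codes missing from the table are
    // replaced with U+FFFD, the Unicode replacement character.
    static String decode(int... codes) {
        StringBuilder out = new StringBuilder();
        for (int code : codes) {
            out.append(ENCODING.getOrDefault(code, "\uFFFD"));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = decode(189, 188);
        // Two codes went in, but three Java chars came out
        System.out.println(text + " has length " + text.length());
    }
}
```

This is why counting text units and counting characters in the extracted text can give different answers for the same line.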
Please note that the font in effect when each TextUnit is output is
available via com.snowtide.pdf.layout.TextUnit.getFont().
OutputHandlers: Text Extraction using Document Model Events
In many applications, simple text extraction as shown in the above
sections is not enough to meet requirements. Sometimes a PDF document is
so large that extracting all of its text would strain your application's
available resources. Perhaps your application needs to produce something
other than plain text, such as an HTML version of the PDF document. The
best way to meet such requirements is to use the OutputHandler
interface.
The OutputHandler interface is directly analogous to the lightweight SAX
XML ContentHandler interface. Just like XML, PDFxStream defines a
document model that can be traversed systematically using a
random-access interface (which is called DOM in the XML world). But also
just like XML, PDFxStream provides a way to process document content in
a lightweight, evented fashion. The OutputHandler interface represents
this second option. It defines a range of methods that can be
selectively implemented to, for example, only be notified of
character-level data. An OutputHandler subclass that does this by
overriding the com.snowtide.pdf.OutputHandler.textUnit(TextUnit) method
will receive one event (in the form of a TextUnit object) for every
TextUnit object in a given PDF document page or block. The OutputHandler
subclass can then take whatever action is necessary given its purpose --
write the TextUnit's content to disk, send it over a network connection,
make note of where the TextUnit is located on the page for display
purposes, and so on.
Each com.snowtide.pdf.Document, Page, and Block instance provides a
pipe(OutputHandler) method. Invoking this method on any instance of
these classes will cause the appropriate PDF document model events to be
sent to the provided OutputHandler object in the natural order that they
occur. For example, just before starting the events associated with a
block of content, com.snowtide.pdf.OutputHandler.startBlock(Block) will
be called with that Block instance as a parameter; when all of the child
entities of that Block have been delivered (its child blocks, its child
lines, its child TextUnits, etc.),
com.snowtide.pdf.OutputHandler.endBlock(Block) will be called. Events
bookend content like this for all of the containers in the PDFxStream
document model:
- the PDF document itself: com.snowtide.pdf.OutputHandler.startPDF(String, File) and com.snowtide.pdf.OutputHandler.endPDF(String, File)
- pages: com.snowtide.pdf.OutputHandler.startPage(Page) and com.snowtide.pdf.OutputHandler.endPage(Page)
- blocks: startBlock(Block) and endBlock(Block), as already discussed
- lines: com.snowtide.pdf.OutputHandler.startLine(Line) and com.snowtide.pdf.OutputHandler.endLine(Line)
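The way events bookend content can be illustrated without the library. The sketch below is a simplified stand-in rather than PDFxStream's API: SketchHandler, pipe, and trace are hypothetical names, blocks and lines are modeled as plain strings, and each character of a line is treated as one text unit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class EventOrderDemo {
    // Hypothetical, trimmed-down handler with no-op defaults, so that a
    // subclass overrides only the events it cares about.
    static class SketchHandler {
        void startBlock(String block) {}
        void endBlock(String block) {}
        void startLine(String line) {}
        void endLine(String line) {}
        void textUnit(String unit) {}
    }

    // Plays the role of a pipe method: walks one block and emits start and
    // end events around each container, with text units in between.
    static void pipe(String blockName, List<String> lines, SketchHandler handler) {
        handler.startBlock(blockName);
        for (String line : lines) {
            handler.startLine(line);
            for (int i = 0; i < line.length(); i++) {
                handler.textUnit(String.valueOf(line.charAt(i)));
            }
            handler.endLine(line);
        }
        handler.endBlock(blockName);
    }

    // Records every event received so that their order can be inspected.
    static List<String> trace(String blockName, List<String> lines) {
        List<String> events = new ArrayList<>();
        pipe(blockName, lines, new SketchHandler() {
            @Override void startBlock(String b) { events.add("startBlock"); }
            @Override void endBlock(String b) { events.add("endBlock"); }
            @Override void startLine(String l) { events.add("startLine"); }
            @Override void endLine(String l) { events.add("endLine"); }
            @Override void textUnit(String u) { events.add("textUnit:" + u); }
        });
        return events;
    }

    public static void main(String[] args) {
        // prints [startBlock, startLine, textUnit:H, textUnit:i, endLine, endBlock]
        System.out.println(trace("b1", Arrays.asList("Hi")));
    }
}
```

Giving every callback a no-op default is what allows a handler interested only in character-level data to override a single method and ignore the rest.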
Source code for example OutputHandler implementations is included with
your PDFxStream distribution. The pdfts.examples.GoogleHTMLOutputHandler
sample will produce an XHTML document that roughly duplicates the spirit
of the "view as text" page that Google provides for PDF search results.
Another OutputHandler example, pdfts.examples.XMLOutputTarget, writes an
XML document directly to a provided StringBuffer; that XML includes
structural document information, as well as indications of text
formatting (e.g. bolding, underlining, strikethroughs, italics, etc.).