Appendix: The Art of Reading PDF Text

This appendix is provided solely for those interested in some of the technical issues involved in reading text out of PDF files.

The Portable Document Format (PDF), introduced by Adobe in 1993, is a document file format derived from PostScript, a page description language supported by many laser printers. The PDF specification was created primarily so that a single document could be displayed or printed reliably using a shared, consistent set of graphical instructions.

The PDF format has proven to be a successful and useful technology: professional-grade documents can be exchanged easily among computing platforms, and the marketplace accepted the format rapidly.

PDF files are structured to encode display-oriented information, such as where a particular image should be placed or how wide the letter 'a' should be. This makes for a robust and powerful publishing and display technology, but it also makes it very difficult to reliably read text out of PDFs.

For example, consider this sample text that could exist in any PDF document:

Hello there        (set in a smaller font)
Hello  there       (set in a larger font, with a wider gap between the words)

First, notice that in the line of larger text, the space between the words is around double the size of the space between the words in the smaller line of text. If those spaces were encoded explicitly in the PDF file, reading that text out of the document would be very simple. However, many PDF files are written with instructions like these:

[(Hello)-1650(there)]TJ
[(Hello)-3500(there)]TJ

which roughly translates to these operations:

print "Hello"
move .25" right
print "there"

print "Hello"
move .55" right
print "there"

Notice that there is no actual space character in these instructions: there’s just an instruction indicating that after the word ‘Hello’ is printed, the word ‘there’ should appear some distance to the right (.25" in one case, .55" in the other). The question is, how many spaces should be included in a text representation of that content? Maybe .25" is one space, and .55" is two. Perhaps .25" is really two spaces, and .55" is four. Or perhaps the first line’s font size is much smaller than the second, so the differing amounts of space between the words really represent only one space in both lines (this is the case in our example here).
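The single-space reading of both lines can be sketched in Python. This is only an illustration under stated assumptions, not a description of how PDFTextStream works: the function name tj_to_text and the 0.2 threshold are invented for the example, and the parser handles only simple literal strings with no escapes or nested parentheses. The sketch relies on the fact that the numbers in a TJ array are expressed in thousandths of a unit of text space, so dividing by 1000 gives a gap relative to the font size rather than an absolute distance on the page.

```python
import re

# Matches either a simple literal string "(...)" or a number inside a TJ array.
# Assumption: no escaped or nested parentheses appear in the strings.
TOKEN = re.compile(r"\((?P<text>[^()]*)\)|(?P<num>-?\d+(?:\.\d+)?)")


def tj_to_text(tj_array: str, space_threshold: float = 0.2) -> str:
    """Convert the operand of a TJ operator to plain text.

    space_threshold is a hypothetical cutoff, as a fraction of the font
    size, above which a horizontal gap is treated as one word space.
    """
    out = []
    for match in TOKEN.finditer(tj_array):
        if match.group("text") is not None:
            out.append(match.group("text"))
            continue
        # A negative adjustment moves the next string to the right.
        gap = -float(match.group("num")) / 1000.0
        if gap >= space_threshold and out and not out[-1].endswith(" "):
            out.append(" ")  # any large gap becomes exactly one space
    return "".join(out)


print(tj_to_text("[(Hello)-1650(there)]TJ"))  # prints: Hello there
print(tj_to_text("[(Hello)-3500(there)]TJ"))  # prints: Hello there
```

Because the gap is measured relative to the font size, both sample lines come out identically. A small adjustment such as the 20 in [(H)20(ello)]TJ, which only tightens letter spacing, falls below the cutoff and produces no space.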

A related example that is even more difficult to process concerns justified text, as shown below:

We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness.

Here, the problem of determining what is and is not a "real" space is compounded by two issues:

The spacing between words in any one line has no relationship to how wide a space should be in that font, at that font size.
There is no relationship between the width of an actual space in one line and the width of an actual space in another line, even though both lines use the same font, at the same font size.

As a result, deciding where spaces should go, and how many should be output, when converting such content to plain text is very difficult.
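One possible way to cope with justified lines is to choose the word-gap cutoff separately for each line instead of deriving it from the font. The Python sketch below illustrates that idea only. The input shape (text, left x, right x), the names word_gap_threshold and join_line, and the floor and ratio parameters are all assumptions made for this example, and the heuristic is easy to defeat with unusual layouts.

```python
from typing import List, Tuple

# Hypothetical input shape: (text, left x, right x), in points, left to right.
Chunk = Tuple[str, float, float]


def word_gap_threshold(gaps: List[float], floor: float = 1.0,
                       ratio: float = 2.0) -> float:
    """Pick a cutoff for one line that separates letter gaps from word gaps.

    Walks the gaps from smallest to largest and splits at the first place
    where a gap is above `floor` and at least `ratio` times the gap before
    it. If no such jump exists, every gap above `floor` counts as a space.
    """
    ordered = sorted(gaps)
    for lower, upper in zip(ordered, ordered[1:]):
        if upper > floor and upper > ratio * max(lower, 0.0):
            return max((lower + upper) / 2, floor)
    return floor


def join_line(chunks: List[Chunk]) -> str:
    """Join the chunks of one line, emitting one space per word gap."""
    if not chunks:
        return ""
    gaps = [nxt[1] - cur[2] for cur, nxt in zip(chunks, chunks[1:])]
    cutoff = word_gap_threshold(gaps)
    parts = [chunks[0][0]]
    for gap, (text, _, _) in zip(gaps, chunks[1:]):
        parts.append((" " if gap > cutoff else "") + text)
    return "".join(parts)


# Two lines in the same font whose word gaps differ (6 pt versus 11 pt).
tight = [("We", 0, 12), ("hold", 18, 40), ("th", 46, 56), ("ese", 56.2, 70)]
loose = [("tr", 0, 10), ("uths", 10.1, 30), ("to", 41, 50), ("be", 61, 70)]
print(join_line(tight))
print(join_line(loose))
```

Each line gets its own cutoff, so a 6 pt gap in one line and an 11 pt gap in another are both treated as a single space, while the sub-point gaps inside a word are not.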

One final example will illustrate even more difficult obstacles in converting a PDF document into text. Consider this example page layout:

Main text, column 1	Main text, column 2
Footnote section

There are a couple of approaches available for extracting the text out of a PDF document that has this kind of common layout.

The first is to output text in a way that matches the layout exactly. This would result in (for example) the first line of each column being on the same line of plain text, the second lines in each column being on the same second line of plain text, etc. This would be a disaster for any application that processed the resulting text in an automated way (for indexing, searching, or summarization purposes, for example), because such automated processing would have no way to know that content from completely different sections of the document is present on the same lines of text. This could lead to a range of problems, including nonsensical summaries and indexes keying on phrases that never really existed in the original document.

The second approach is to output the columns in order, so that the output text loses all visual correspondence with the original document’s layout. This is an improvement (at least with regard to the viability of the output text in an automated processing environment), in that the text of a column is uninterrupted by text that belongs in other columns. However, there is still an issue to overcome: how would such a method know that the footnote section at the bottom-left of the page shouldn’t interrupt the main body text in columns one and two? After all, the footnote section is aligned with the first column, and without the benefit of human intelligence, a text extraction process might assume that the footnote section is part of the first column.
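The column-order approach, with footnotes held back until the end of the page, might be sketched as follows in Python. Everything here is an assumption made for illustration: the Block type, the reading_order function, the idea that the column boundary is known in advance, and especially the rule that anything set smaller than the body font is a footnote. Real documents break that last rule often, which is exactly why this problem is hard.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """A hypothetical positioned line of text on one page."""
    text: str
    x: float          # left edge, in points from the left of the page
    y: float          # top edge, in points from the top of the page
    font_size: float  # in points


def reading_order(blocks: List[Block], column_split_x: float,
                  body_font_size: float) -> List[str]:
    """Return text column by column, with footnotes after the body text.

    Assumes two columns separated at column_split_x and treats every block
    smaller than body_font_size as a footnote.
    """
    body = [b for b in blocks if b.font_size >= body_font_size]
    notes = [b for b in blocks if b.font_size < body_font_size]

    def column(block: Block) -> int:
        return 0 if block.x < column_split_x else 1

    ordered = sorted(body, key=lambda b: (column(b), b.y))
    ordered += sorted(notes, key=lambda b: (b.y, b.x))
    return [b.text for b in ordered]


page = [
    Block("col 1, line 1", 50, 100, 10),
    Block("col 2, line 1", 320, 100, 10),
    Block("col 1, line 2", 50, 115, 10),
    Block("col 2, line 2", 320, 115, 10),
    Block("footnote", 50, 700, 8),
]
for line in reading_order(page, column_split_x=300, body_font_size=10):
    print(line)
```

With this sample page, both lines of the first column come out before the second column, and the footnote comes last even though it sits directly beneath the first column.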

This is just a small taste of the complexities involved in reading useful, accurate text out of PDF files. This is compounded by innumerable variations between PDF files (for example, some may encode text backwards; a text extraction library that doesn’t handle such things properly may end up outputting ‘there Hello’). PDFTextStream employs hundreds of specialized, intelligent processes to figure out how best to handle these difficulties, and provide a high-quality, accurate text extract of whatever PDF files your application needs to process. By no means are these processes perfect. However, we are confident that they are the best available, and that we will continue to improve and perfect them to provide your applications with the best text output possible.

PDFxStream v3.8.0 Technical Documentation
