Right-to-Left (RTL) and Bidirectional (bidi) Text

PDFxStream safely handles right-to-left (RTL) and bidirectional (bidi) text as described in the Unicode standard. This is essential if any of the PDF documents you aim to extract text from contains a language written right-to-left, or RTL text mixed with left-to-right scripts. Examples of RTL/bidi languages and scripts include:

Arabic
Hebrew
Urdu
Persian
Thaana
Rohingya
Syriac

info

Even if the documents you routinely process with PDFxStream contain predominantly left-to-right (LTR) text in languages like English, bidirectional concerns remain important because RTL text can appear anywhere, even in fundamentally LTR documents (via things like quotes, proper names, individuals' titles or roles, etc.).

The challenge of RTL and bidi text (briefly)

A proper treatment of how RTL and bidi text work is well beyond the scope of this documentation. (Go ahead and skip this overview if it doesn't suit you.) But, in brief:

Natural languages are written mostly in one of three directions: left-to-right (LTR), top-to-bottom (as with Chinese, Japanese, and Korean in those languages' vertical writing modes), and right-to-left. Software that does not take care to treat RTL text properly (e.g. by presuming all text is LTR) will produce misordered, comically incorrect text.

Things get even more complicated once LTR and RTL text are mixed; the result is called bidirectional (bidi) text. For example:

An example of bidirectional Unicode text, with arrows indicating reading order: the first run of English words is read left-to-right, the run of Arabic is read right-to-left, and the final English words are again read left-to-right. Image courtesy of the W3C's very approachable *Unicode Bidirectional Algorithm basics* page.

Software that only deals in LTR text will generally render or process this text incorrectly: it runs through the Arabic left-to-right as well, producing incomprehensible results.
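
To make the idea of directional runs concrete, here is a minimal sketch that uses the JDK's own java.text.Bidi class to split a mixed string into the runs assigned by the Unicode Bidirectional Algorithm. This is plain JDK code, not part of PDFxStream; the class name BidiRuns and the method describeRuns are hypothetical names chosen for illustration, and the sample string is an assumption.

```java
import java.text.Bidi;
import java.util.ArrayList;
import java.util.List;

public class BidiRuns {
    /**
     * Splits text into directional runs, in logical (in-memory) order.
     * Each entry is prefixed with "LTR:" or "RTL:" based on the run's
     * embedding level (even levels are LTR, odd levels are RTL).
     */
    static List<String> describeRuns(String text) {
        // Assume an LTR paragraph unless the text itself says otherwise.
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        List<String> runs = new ArrayList<>();
        for (int i = 0; i < bidi.getRunCount(); i++) {
            String direction = (bidi.getRunLevel(i) % 2 == 0) ? "LTR" : "RTL";
            String run = text.substring(bidi.getRunStart(i), bidi.getRunLimit(i));
            runs.add(direction + ":" + run);
        }
        return runs;
    }

    public static void main(String[] args) {
        // English, then an Arabic word (written with escapes), then English.
        String mixed = "bahrain \u0645\u0635\u0631 kuwait";
        // Yields three runs: LTR, RTL, LTR.
        describeRuns(mixed).forEach(System.out::println);
    }
}
```

Note that the string is stored in logical order throughout; only the display order of the middle run is reversed when the text is rendered.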

RTL-aware text processing also applies to block- and paragraph-level ordering. For example, given an arrangement of paragraphs like this:

Strictly LTR read-ordering would have the paragraphs ordered A, B, C, whereas a proper RTL ordering would be C, A, B. Software that is not RTL-aware will make dire errors in handling inter-paragraph text orderings.

There is much, much more to say about the challenges of RTL and bidi text, including the subtleties of numerics and punctuation, how brackets are (sometimes) mirrored in complicated, context-dependent ways, and how Unicode combining characters interact with everything else.
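
A few of these subtleties can be observed directly through the character properties the JDK exposes. The sketch below is illustrative only and is not PDFxStream code; the class name BidiCharProps and the method directionalityName are hypothetical. It reports the bidi class of a code point and uses Character.isMirrored to check whether a character is subject to mirroring.

```java
public class BidiCharProps {
    /** Returns a short Unicode bidi class label for a code point. */
    static String directionalityName(int codePoint) {
        switch (Character.getDirectionality(codePoint)) {
            case Character.DIRECTIONALITY_LEFT_TO_RIGHT:        return "L";
            case Character.DIRECTIONALITY_RIGHT_TO_LEFT:        return "R";
            case Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC: return "AL";
            case Character.DIRECTIONALITY_EUROPEAN_NUMBER:      return "EN";
            case Character.DIRECTIONALITY_ARABIC_NUMBER:        return "AN";
            case Character.DIRECTIONALITY_NONSPACING_MARK:      return "NSM";
            default:                                            return "other";
        }
    }

    public static void main(String[] args) {
        System.out.println(directionalityName('a'));      // Latin letter
        System.out.println(directionalityName(0x05D0));   // Hebrew letter alef
        System.out.println(directionalityName(0x0645));   // Arabic letter meem
        System.out.println(directionalityName('7'));      // European digit
        System.out.println(directionalityName(0x0667));   // Arabic-Indic digit seven
        System.out.println(directionalityName(0x0301));   // combining acute accent
        System.out.println(Character.isMirrored('('));    // brackets are mirrored
        System.out.println(Character.isMirrored('a'));    // letters are not
    }
}
```

Digits, combining marks, and brackets each fall into a different class from the letters around them, which is why they need special treatment in bidi text.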

Extracting accurate RTL & bidi text with PDFxStream

As with most of PDFxStream's capabilities, the bulk of this complexity is handled for you automatically: if you are using com.snowtide.pdf.OutputHandlers as shown in most of this documentation, then RTL and bidi text will be extracted in the correct order, and blocks in RTL pages and documents will be traversed in the correct order.

warning

One of the few parts of PDFxStream that does not handle RTL and bidi text properly is com.snowtide.pdf.VisualOutputTarget; in some important ways, the mandate of VisualOutputTarget ("extract a page's text while preserving its visual appearance") runs entirely counter to what is required for proper RTL and bidi text handling. We do aim to eventually address this based on customer demand.

With that said, there are some things you need to keep in mind to be successful in using PDFxStream to extract RTL and bidi text from PDF documents:

Unicode in, Unicode out

While PDFxStream will take care to properly handle RTL and bidi text, you must ensure that all the steps in your own handling of that text are also RTL- and bidi-aware. This includes things like:

Always use Unicode-aware text editors and viewers. If you're looking at extracted text in an editor that only works with legacy (non-Unicode) text encodings, or one that doesn't use the Unicode Bidirectional Algorithm to render RTL/bidi text, then nothing will look right.
Account for RTL/bidi concerns when performing text operations (e.g. concatenation, substring replacement, and so on). Ignoring them can break text flow, especially if the text includes numbers, punctuation, or mixed-language phrases.
Make higher-level text handling likewise Unicode- and bidi-aware, or you'll get very unexpected results from indexing, search, summarization, classification, translation, and LLM training processes.
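
As one hedged illustration of the points above, the following sketch shows two habits that tend to help when post-processing extracted text: wrapping fragments that contain RTL characters in Unicode directional isolates (U+2068 FIRST STRONG ISOLATE and U+2069 POP DIRECTIONAL ISOLATE) before concatenating them into other text, and always naming UTF-8 explicitly when writing and reading files. It uses only JDK classes and is not part of PDFxStream; the class name BidiSafeText and its methods are hypothetical, and whether isolates are appropriate depends on what consumes your output.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.text.Bidi;

public class BidiSafeText {
    static final char FSI = '\u2068'; // FIRST STRONG ISOLATE
    static final char PDI = '\u2069'; // POP DIRECTIONAL ISOLATE

    /** True if the text contains RTL characters or explicit bidi controls. */
    static boolean needsBidi(String text) {
        return Bidi.requiresBidi(text.toCharArray(), 0, text.length());
    }

    /**
     * Wraps a fragment in directional isolates when it needs bidi handling,
     * so that it cannot reorder the numbers or punctuation placed around it.
     */
    static String isolate(String fragment) {
        return needsBidi(fragment) ? FSI + fragment + PDI : fragment;
    }

    public static void main(String[] args) throws IOException {
        String name = "\u05E9\u05DC\u05D5\u05DD"; // a Hebrew word, written with escapes
        String line = "Author: " + isolate(name) + " (2 items)";

        // Never rely on the platform default charset for extracted text.
        Path out = Files.createTempFile("extracted", ".txt");
        Files.write(out, line.getBytes(StandardCharsets.UTF_8));
        String readBack = new String(Files.readAllBytes(out), StandardCharsets.UTF_8);

        System.out.println(readBack.equals(line)); // prints true
    }
}
```

The isolate characters are invisible, but they are real code points: strip them again before exact-match comparisons or indexing if your downstream tools do not ignore them.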
