Right-to-Left (RTL) and Bidirectional (bidi) Text
PDFxStream safely handles right-to-left (RTL) and bidirectional (bidi) text as described in the Unicode standard. This is essential if any of the PDF documents you aim to extract text from contains a language written right-to-left, or bidirectional text in which such a language is mixed with left-to-right scripts. Examples of RTL/bidi languages and scripts include:
- Arabic
- Hebrew
- Urdu
- Persian
- Thaana
- Rohingya
- Syriac
Even if the documents you routinely process with PDFxStream contain predominantly left-to-right (LTR) text in languages like English, bidirectional concerns remain important because RTL text can appear anywhere, even in fundamentally LTR documents (via things like quotes, proper names, individuals' titles or roles, etc).
The challenge of RTL and bidi text (briefly)
A proper treatment of how RTL and bidi text work is well beyond the scope of this documentation. (Go ahead and skip this overview if it doesn't suit you.) But, in brief:
Natural languages are written mostly in one of three directions: left-to-right (LTR), top-to-bottom (as with Chinese, Japanese, and Korean in those languages' vertical writing modes), and right-to-left. Software that does not treat RTL text properly (e.g. by presuming that all text is LTR) will produce misordered, comically incorrect text.
Things get even more complicated once LTR and RTL text are mixed; the result is called bidirectional (bidi) text. For example:
Software that only deals in LTR text will generally render or process this text incorrectly, running through the Arabic left-to-right as well and producing incomprehensible results.
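To make the idea of directional runs concrete, here is a minimal sketch that uses the JDK's java.text.Bidi class (not PDFxStream) to split a mixed string into its LTR and RTL runs. The class name BidiRunsSketch and the sample sentence are illustrative assumptions, not part of PDFxStream or of this documentation's original example.

```java
import java.text.Bidi;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class BidiRunsSketch {

    /** Returns each directional run of the text, in logical (storage) order. */
    static List<String> describeRuns(String text) {
        // Fall back to an LTR paragraph direction if the text has no strong characters.
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        List<String> runs = new ArrayList<>();
        for (int i = 0; i < bidi.getRunCount(); i++) {
            String run = text.substring(bidi.getRunStart(i), bidi.getRunLimit(i));
            // Odd embedding levels are RTL; even levels are LTR.
            String direction = (bidi.getRunLevel(i) % 2 == 1) ? "RTL" : "LTR";
            runs.add(direction + ":" + run);
        }
        return runs;
    }

    /** True if the text contains both LTR and RTL runs. */
    static boolean isMixed(String text) {
        return new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT).isMixed();
    }

    public static void main(String[] args) {
        // An English sentence containing the Arabic word "salam".
        String sample = "The word \u0633\u0644\u0627\u0645 means peace.";
        System.out.println("mixed: " + isMixed(sample));
        for (String run : describeRuns(sample)) {
            System.out.println(run);
        }
    }
}
```

The characters are stored in logical order; it is the renderer's job to display the RTL run right-to-left. Software that ignores the run levels effectively treats every run as LTR.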
RTL-aware text processing also applies to block- and paragraph-level ordering. For example, given an arrangement of paragraphs like this:
A
B
C
Strictly LTR read-ordering would have the paragraphs ordered A, B, C, whereas a proper RTL ordering would be C, A, B. Software that is not RTL-aware will make dire errors in handling inter-paragraph text orderings.
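The ordering difference can be sketched with a toy model. This is not how PDFxStream works internally; it is a simplified illustration. It assumes a layout consistent with the orderings above: A sits above B in a left-hand column, and C occupies a right-hand column. The Block class, its coordinates, and the readingOrder method are all hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical toy model, for illustration only.
final class BlockOrderSketch {

    /** A labelled block with the top-left corner of its bounding box (y grows downward). */
    static final class Block {
        final String label;
        final int x;
        final int y;

        Block(String label, int x, int y) {
            this.label = label;
            this.x = x;
            this.y = y;
        }
    }

    /** Orders blocks column by column; blocks in one column are assumed to share the same x. */
    static List<String> readingOrder(List<Block> blocks, boolean rtl) {
        Comparator<Block> byColumn = Comparator.comparingInt(b -> b.x);
        if (rtl) {
            // RTL pages start with the right-most column.
            byColumn = byColumn.reversed();
        }
        // Within a column, read top to bottom in both cases.
        Comparator<Block> order = byColumn.thenComparingInt(b -> b.y);
        return blocks.stream().sorted(order).map(b -> b.label).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Assumed layout: A above B on the left, C on the right.
        List<Block> page = Arrays.asList(
                new Block("A", 0, 0), new Block("B", 0, 100), new Block("C", 300, 0));
        System.out.println("LTR: " + readingOrder(page, false));
        System.out.println("RTL: " + readingOrder(page, true));
    }
}
```

Real page layouts need far more than a coordinate sort (columns rarely align exactly, and blocks can span columns), but the sketch shows why the same geometry yields two different reading orders.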
There is much, much more to say about the challenges of RTL and bidi text, including the subtleties of numerics and punctuation, how brackets are (sometimes) mirrored in complicated, context-dependent ways, and how Unicode combining characters interact with everything else.
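A few of these subtleties are visible directly in the Unicode character properties exposed by the JDK's java.lang.Character class. The following is a small sketch; the class and method names are illustrative assumptions, and only the Character calls are standard JDK API.

```java
// Hypothetical helper, for illustration only.
final class BidiCharProperties {

    /** True for characters such as brackets whose glyph is mirrored in RTL runs. */
    static boolean isMirrored(int codePoint) {
        return Character.isMirrored(codePoint);
    }

    /** True for combining marks, which attach to a preceding base character. */
    static boolean isCombiningMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    /** Names the bidi number class of a digit, since digit classes are ordered differently. */
    static String numberClass(int codePoint) {
        switch (Character.getDirectionality(codePoint)) {
            case Character.DIRECTIONALITY_EUROPEAN_NUMBER:
                return "European number";
            case Character.DIRECTIONALITY_ARABIC_NUMBER:
                return "Arabic number";
            default:
                return "not a number class";
        }
    }

    public static void main(String[] args) {
        System.out.println("'(' mirrored: " + isMirrored('('));
        System.out.println("U+05B8 (Hebrew point qamats) combining: " + isCombiningMark(0x05B8));
        System.out.println("'7': " + numberClass('7'));
        System.out.println("U+0667 (Arabic-Indic digit seven): " + numberClass(0x0667));
    }
}
```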
Extracting accurate RTL & bidi text with PDFxStream
As with most of PDFxStream's capabilities, the bulk of this complexity is handled for you automatically: if you are using com.snowtide.pdf.OutputHandler implementations as shown in most of this documentation, then RTL and bidi text will be extracted in the correct order, and blocks in RTL pages and documents will be traversed in the correct order.
One of the few parts of PDFxStream that does not handle RTL and bidi text properly is com.snowtide.pdf.VisualOutputTarget; in some important ways, the mandate of VisualOutputTarget ("extract a page's text while preserving its visual appearance") runs entirely counter to what is required for proper RTL and bidi text handling. We do aim to eventually address this based on customer demand.
With that said, there are some things you need to keep in mind to be successful in using PDFxStream to extract RTL and bidi text from PDF documents:
Unicode in, Unicode out
While PDFxStream will take care to properly handle RTL and bidi text, you must ensure that all the steps in your own handling of that text are also RTL- and bidi-aware. This includes things like:
- Always using Unicode-aware text editors and viewers. If you're looking at extracted text in an editor that only works with legacy (non-Unicode) text encodings, or one that doesn't use the Unicode Bidirectional Algorithm to render RTL/bidi text, then nothing will look right.
- Performing text operations (e.g. concatenation, substring replacement, and so on) in a way that accounts for RTL/bidi concerns. Doing otherwise can break text flow, especially if the text includes numbers, punctuation, or mixed-language phrases.
- Keeping higher-level text handling Unicode- and bidi-aware as well, or you'll get very unexpected results from indexing, search, summarization, classification, translation, and LLM training processes.
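As one illustration of a bidi-aware text operation, the sketch below wraps a fragment in Unicode directional isolates (FIRST STRONG ISOLATE, U+2068, and POP DIRECTIONAL ISOLATE, U+2069) before concatenating it with other text, so that its direction does not leak into neighbouring numbers and punctuation. This is one possible approach rather than a PDFxStream feature, and the class and method names are hypothetical. It relies on the consumer of the text implementing the isolate characters introduced in Unicode 6.3.

```java
// Hypothetical helper, for illustration only.
final class BidiSafeConcat {
    private static final char FSI = '\u2068'; // FIRST STRONG ISOLATE
    private static final char PDI = '\u2069'; // POP DIRECTIONAL ISOLATE

    /** True if the text contains any strongly right-to-left character. */
    static boolean containsRtl(String text) {
        return text.codePoints().anyMatch(cp -> {
            byte d = Character.getDirectionality(cp);
            return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
        });
    }

    /** Wraps a fragment so its direction is resolved independently of its surroundings. */
    static String isolate(String fragment) {
        return FSI + fragment + PDI;
    }

    /** Builds a label such as "name: 3 pages", isolating the name only when it needs it. */
    static String label(String name, int pages) {
        String safeName = containsRtl(name) ? isolate(name) : name;
        return safeName + ": " + pages + " pages";
    }

    public static void main(String[] args) {
        System.out.println(label("report", 3));
        // A Hebrew name ("shalom"); the isolates keep ": 3 pages" from being reordered with it.
        System.out.println(label("\u05E9\u05DC\u05D5\u05DD", 3));
    }
}
```

The isolate characters are invisible formatting characters, so strip or account for them before operations like exact-match search or length checks.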