Unicode and character sets
Though PDF documents may encode text using any of a number of predefined
encodings (and PDFs can actually define and use their own encodings in
addition), PDFxStream normalizes all extracted text to Unicode. Because Java
strings use UTF-16 internally, text extracted from PDF documents using
PDFxStream will always be accurate with regard to "special" characters. For
example:
- Characters that include accents and diacritical marks, such as é, ö, ç
- Characters used in all of the languages supported by Unicode, including Chinese, Japanese, Korean, Arabic, Hebrew, Urdu, and so on.
- Ligatures, such as ff, ffl, æ, لا ,ﭏ, and نـَ
- Super- and subscripted numerals, such as ₀ and ²
- "Smart quotes"
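Ligatures and similar presentation forms are extracted as the single Unicode characters the PDF encodes. If a downstream system expects ASCII-compatible text, they can be folded back into their component letters with the JDK's compatibility normalization; this is a general Java technique, not a PDFxStream API:

```java
import java.text.Normalizer;

public class LigatureFold {
    public static void main(String[] args) {
        // U+FB00 LATIN SMALL LIGATURE FF, as it may appear in extracted text
        String extracted = "e\uFB00ect";
        // NFKC compatibility normalization decomposes the ligature into
        // its constituent letters
        String folded = Normalizer.normalize(extracted, Normalizer.Form.NFKC);
        System.out.println(folded);  // prints "effect"
    }
}
```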
Note that all of Unicode is supported, including double-byte character sets, such as those used in conjunction with Chinese, Japanese, and Korean (CJK). Both horizontal and vertical writing modes are recognized and translated into appropriate text extracts.
PDFxStream usage is unchanged regardless of the type of text being extracted; all character encoding issues are handled automatically.
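One caveat applies when post-processing extracted text in your own code: because Java strings are UTF-16, characters outside the Basic Multimedia Plane (including some CJK ideographs) occupy a surrogate pair of two char values. Count code points rather than chars, as this JDK-only sketch shows:

```java
public class CodePoints {
    public static void main(String[] args) {
        // U+20BB7, a CJK ideograph outside the BMP, is stored in UTF-16
        // as a surrogate pair of two char values
        String s = "\uD842\uDFB7";
        System.out.println(s.length());                      // prints 2 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length())); // prints 1 (actual character)
    }
}
```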
Controlling CJK Capabilities
PDFxStream actively caches the resources it uses when extracting Chinese,
Japanese, or Korean (CJK) text; this significantly improves performance.
However, that caching does lead to increased memory consumption. If CJK
capabilities are unimportant to your application, you can avoid this overhead
by turning off PDFxStream's CJK text extraction: set the
pdfts.cjk.enable
system property to false before using PDFxStream.
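For example, the property can be set programmatically, as long as this runs before PDFxStream is first used (the "false" value assumes a boolean-valued property; check your PDFxStream version's documentation for the accepted values):

```java
public class DisableCjk {
    public static void main(String[] args) {
        // Must run before any PDFxStream extraction, so the CJK
        // resources are never loaded or cached
        System.setProperty("pdfts.cjk.enable", "false");
        System.out.println(System.getProperty("pdfts.cjk.enable")); // prints "false"
    }
}
```

Equivalently, the property can be supplied on the command line with `-Dpdfts.cjk.enable=false`, which avoids any ordering concerns in application code.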
Right-to-left and Bidirectional Text
PDFxStream supports the extraction of text that uses right-to-left (RTL) or bidirectional (bidi) writing modes and alphabets, such as Arabic and Hebrew. This support is also automatic, and requires no special configuration. Read more about this in the next section.