Unicode and character sets
Though PDF documents may encode text using any of a number of predefined
encodings (and PDFs can actually define and use their own encodings in
addition), PDFxStream normalizes all extracted text to Unicode. Because Java
strings use UTF-16 internally, text extracted from PDF documents using
PDFxStream will always be accurate with regard to "special" characters. For
example:
- Characters that include accents and diacritical marks, such as é, ö, ç
- Characters used in all of the languages supported by Unicode, including Chinese, Japanese, Korean, Arabic, Hebrew, Urdu, and so on.
- Ligatures, such as ff, ffl, æ, لا ,ﭏ, and نـَ
- Super- and subscripted numerals, such as ₀ and ²
- "Smart quotes"
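Ligatures and similar presentation forms are extracted as the single Unicode characters the PDF encodes. If a downstream system expects ASCII-compatible text, they can be folded back into their component letters with the JDK's compatibility normalization; this is a general Java technique, not a PDFxStream API:

```java
import java.text.Normalizer;

public class LigatureFold {
    public static void main(String[] args) {
        // U+FB00 LATIN SMALL LIGATURE FF, as it may appear in extracted text
        String extracted = "e\uFB00ect";
        // NFKC compatibility normalization decomposes the ligature into
        // its constituent letters
        String folded = Normalizer.normalize(extracted, Normalizer.Form.NFKC);
        System.out.println(folded);  // prints "effect"
    }
}
```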
Note that all of Unicode is supported, including double-byte character sets, such as those used in conjunction with Chinese, Japanese, and Korean (CJK). Both horizontal and vertical writing modes are recognized and translated into appropriate text extracts.
PDFxStream usage is unchanged regardless of the type of text being extracted; all character encoding issues are handled automatically.
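One caveat applies when post-processing extracted text in your own code: because Java strings are UTF-16, characters outside the Basic Multimedia Plane (including some CJK ideographs) occupy a surrogate pair of two char values. Count code points rather than chars, as this JDK-only sketch shows:

```java
public class CodePoints {
    public static void main(String[] args) {
        // U+20BB7, a CJK ideograph outside the BMP, is stored in UTF-16
        // as a surrogate pair of two char values
        String s = "\uD842\uDFB7";
        System.out.println(s.length());                      // prints 2 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length())); // prints 1 (actual character)
    }
}
```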
Controlling CJK Capabilities
PDFxStream actively caches the resources it uses when extracting Chinese,
Japanese, or Korean (CJK) text; this significantly improves performance.
However, that caching does lead to increased memory consumption. If CJK
capabilities are unimportant to your application, you can avoid this overhead
by turning off PDFxStream's CJK text extraction: set the
pdfts.cjk.enable
system property to false before using PDFxStream.
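For example, the property can be set programmatically, as long as this runs before PDFxStream is first used (the "false" value assumes a boolean-valued property; check your PDFxStream version's documentation for the accepted values):

```java
public class DisableCjk {
    public static void main(String[] args) {
        // Must run before any PDFxStream extraction, so the CJK
        // resources are never loaded or cached
        System.setProperty("pdfts.cjk.enable", "false");
        System.out.println(System.getProperty("pdfts.cjk.enable")); // prints "false"
    }
}
```

Equivalently, the property can be supplied on the command line with `-Dpdfts.cjk.enable=false`, which avoids any ordering concerns in application code.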
Right-to-left and Bidirectional Text
PDFxStream supports the extraction of text that uses right-to-left (RTL) or bidirectional (bidi) writing modes and alphabets, such as Arabic and Hebrew. This support is also automatic, and requires no special configuration. Read more about this in the next section.