Regarding Unicode text and character sets
Though PDF documents may encode text using any of a number of predefined
encodings (and PDFs can also define and use their own custom encodings),
PDFxStream normalizes all extracted text to Unicode. Because Java strings use
UTF-16 internally, text extracted from PDF documents using PDFxStream will
always be accurate with regard to "special" characters. For example:
- Characters that include accents and diacritical marks, such as é, ö, ç
- Characters used in all of the languages supported by Unicode, including Chinese, Japanese, and Korean
- Ligatures, such as ff, ffl, æ
- Super- and subscripted characters, like ², ₀
- "Smart quotes"
Note that all of Unicode is supported, including the double-byte character sets used with Chinese, Japanese, and Korean (CJK) text. Both horizontal and vertical writing modes are recognized and translated into appropriate text extracts.
PDFxStream usage is unchanged regardless of the type of text being extracted; all character encoding issues are handled automatically.
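The point about UTF-16 Java strings carrying any extracted character can be checked with plain Java, independent of PDFxStream. The snippet below is a standalone sketch (not PDFxStream API) showing accented letters, ligatures, CJK text, and supplementary characters as ordinary String content:

```java
public class UnicodeDemo {
    public static void main(String[] args) {
        String accented = "caf\u00E9";          // "café": é is a single UTF-16 code unit
        String ligature = "e\uFB00icient";      // "eﬀicient": the ﬀ ligature (U+FB00)
        String cjk = "\u65E5\u672C\u8A9E";      // "日本語": three BMP code points

        System.out.println(accented.length());  // 4
        System.out.println(ligature.length());  // 8
        System.out.println(cjk.length());       // 3

        // Characters outside the Basic Multilingual Plane occupy two
        // UTF-16 code units (a surrogate pair) but remain one code point:
        String supplementary = "\uD840\uDC0B";  // 𠀋 (U+2000B)
        System.out.println(supplementary.length());             // 2
        System.out.println(supplementary.codePointCount(0, 2)); // 1
    }
}
```

Because extracted text arrives as normal Java strings, it can be passed directly to any Unicode-aware API (regexes, collators, encoders) without further conversion.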
Controlling CJK Capabilities
PDFxStream actively caches the resources it uses when extracting Chinese,
Japanese, or Korean (CJK) text; this significantly improves performance, but
the caching also increases memory consumption. If CJK capabilities are
unimportant to your application, you can turn off PDFxStream's CJK text
extraction by setting the pdfts.cjk.enable system property before using
PDFxStream.
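A minimal sketch of disabling CJK support via the system property named above. The value "false" is an assumption about how the pdfts.cjk.enable flag is interpreted; the property name itself comes from the documentation. The property can equally be supplied on the command line with -Dpdfts.cjk.enable=false.

```java
public class DisableCjk {
    public static void main(String[] args) {
        // Set the property before any PDFxStream class is loaded or used,
        // so the library sees it during initialization.
        // "false" as the disabling value is an assumption; pdfts.cjk.enable
        // is the property name given in the documentation above.
        System.setProperty("pdfts.cjk.enable", "false");

        System.out.println(System.getProperty("pdfts.cjk.enable")); // false
    }
}
```

Setting it programmatically (rather than on the command line) keeps the configuration alongside the code that depends on it, but both approaches are equivalent as long as the property is set before PDFxStream initializes.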