Regarding Unicode text and character sets
Though PDF documents may encode text using any of a number of predefined
encodings (and PDFs can also define and use their own custom encodings),
PDFxStream normalizes all extracted text to Unicode. Because Java strings use
UTF-16 internally, text extracted from PDF documents using PDFxStream will
always be accurate with regard to "special" characters. For example:
- Characters that include accents and diacritical marks, such as é, ö, ç
- Characters used in all of the languages supported by Unicode, including Chinese, Japanese, and Korean
- Ligatures, such as ff, ffl, æ
- Super- and subscripted characters, like ², ₀
- "Smart quotes"
Note that all of Unicode is supported, including the double-byte character sets used with Chinese, Japanese, and Korean (CJK) text. Both horizontal and vertical writing modes are recognized and translated into appropriate text extracts.
PDFxStream usage is unchanged regardless of the type of text being extracted; all character encoding issues are handled automatically.
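The point about UTF-16 Java strings carrying any extracted character can be checked with plain Java, independent of PDFxStream. The snippet below is a standalone sketch (not PDFxStream API) showing accented letters, ligatures, CJK text, and supplementary characters as ordinary String content:

```java
public class UnicodeDemo {
    public static void main(String[] args) {
        String accented = "caf\u00E9";          // "café": é is a single UTF-16 code unit
        String ligature = "e\uFB00icient";      // "eﬀicient": the ﬀ ligature (U+FB00)
        String cjk = "\u65E5\u672C\u8A9E";      // "日本語": three BMP code points

        System.out.println(accented.length());  // 4
        System.out.println(ligature.length());  // 8
        System.out.println(cjk.length());       // 3

        // Characters outside the Basic Multilingual Plane occupy two
        // UTF-16 code units (a surrogate pair) but remain one code point:
        String supplementary = "\uD840\uDC0B";  // 𠀋 (U+2000B)
        System.out.println(supplementary.length());             // 2
        System.out.println(supplementary.codePointCount(0, 2)); // 1
    }
}
```

Because extracted text arrives as normal Java strings, it can be passed directly to any Unicode-aware API (regexes, collators, encoders) without further conversion.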
Controlling CJK Capabilities
PDFxStream actively caches the resources it uses when extracting Chinese,
Japanese, or Korean (CJK) text; this significantly improves performance, but
the caching also increases memory consumption. If CJK capabilities are
unimportant to your application, you can turn off PDFxStream's CJK text
extraction by setting the pdfts.cjk.enable system property before using
PDFxStream.
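A minimal sketch of disabling CJK support via the system property named above. The value "false" is an assumption about how the pdfts.cjk.enable flag is interpreted; the property name itself comes from the documentation. The property can equally be supplied on the command line with -Dpdfts.cjk.enable=false.

```java
public class DisableCjk {
    public static void main(String[] args) {
        // Set the property before any PDFxStream class is loaded or used,
        // so the library sees it during initialization.
        // "false" as the disabling value is an assumption; pdfts.cjk.enable
        // is the property name given in the documentation above.
        System.setProperty("pdfts.cjk.enable", "false");

        System.out.println(System.getProperty("pdfts.cjk.enable")); // false
    }
}
```

Setting it programmatically (rather than on the command line) keeps the configuration alongside the code that depends on it, but both approaches are equivalent as long as the property is set before PDFxStream initializes.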