Applies to:
PDFTextStream
Regarding Unicode text and character sets
Though PDF documents may encode text using any of a number of predefined encodings (and PDFs may actually define and use their own custom encodings), PDFTextStream normalizes all extracted text to Unicode. Combined with the fact that both Java and .NET strings use the UTF-16 Unicode encoding internally, this means that text extracted from PDF documents using PDFTextStream will always be accurate with regard to "special" characters.
For example:
- Characters that include accents and diacritical marks, such as é, ö, ç
- Ligatures, such as ﬀ, ﬄ, æ
- Superscript and subscript digits, such as ₀ and ²
- “Smart quotes”
PDFTextStream always aims to preserve all "special" characters. If your application requires normalizing extracted PDF text to a character set more constrained than the Unicode Basic Multilingual Plane (e.g. ASCII or Latin-1), a simple regular-expression search-and-replace can be applied to the extracted text; for example, replacing "æ" with "ae".
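For instance, a minimal normalization pass in Java might look like the following sketch (the class and method names here are hypothetical, and the replacement table is deliberately tiny rather than an exhaustive mapping):

    import java.text.Normalizer;

    public class AsciiFold {
        // Expands a few ligatures explicitly, then strips accents and other
        // diacritics by decomposing to NFD and removing combining marks.
        public static String toAscii(String extracted) {
            String expanded = extracted
                    .replace("\uFB00", "ff")   // 'ff' ligature
                    .replace("\uFB04", "ffl")  // 'ffl' ligature
                    .replace("\u00E6", "ae");  // 'ae' ligature
            String decomposed = Normalizer.normalize(expanded, Normalizer.Form.NFD);
            return decomposed.replaceAll("\\p{M}", "");
        }

        public static void main(String[] args) {
            System.out.println(toAscii("r\u00E9sum\u00E9")); // prints "resume"
        }
    }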
Note that all of Unicode is supported, including double-byte character sets, such as those used for Chinese, Japanese, and Korean (CJK) text. Both horizontal and vertical writing modes are recognized and translated into appropriate text extracts.
PDFTextStream usage is unchanged regardless of the type of text being extracted – all character encoding issues are handled automatically.
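As an illustration, the sketch below extracts text the same way whether the document contains Latin, accented, or CJK content. It follows the entry points shown in PDFTextStream's published examples (com.snowtide.PDF.open and OutputTarget); the file name is a placeholder:

    import com.snowtide.PDF;
    import com.snowtide.pdf.Document;
    import com.snowtide.pdf.OutputTarget;

    public class ExtractAnyText {
        public static void main(String[] args) throws java.io.IOException {
            // No encoding-specific configuration is needed, regardless of
            // whether the document contains Latin, accented, or CJK text.
            Document pdf = PDF.open("sample.pdf"); // placeholder path
            StringBuilder text = new StringBuilder(1024);
            pdf.pipe(new OutputTarget(text));
            pdf.close();
            System.out.println(text);
        }
    }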
Controlling CJK Capabilities
PDFTextStream actively caches the resources it uses when extracting CJK text – this significantly improves performance. However, caching does lead to increased memory consumption. To avoid that overhead, you can turn off PDFTextStream’s CJK text extraction capabilities by setting the pdfts.cjk.enable system property before using PDFTextStream.
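For example, assuming the property takes a boolean-style value ("false" to disable; consult the PDFTextStream documentation for the authoritative values), disabling CJK support might look like this sketch:

    public class DisableCjk {
        public static void main(String[] args) {
            // Assumption: "false" disables CJK support. The property must be
            // set before any PDFTextStream classes are loaded and used.
            System.setProperty("pdfts.cjk.enable", "false");
            // ... open documents and extract text as usual ...
        }
    }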
Bidirectional Text
PDFTextStream does not yet support extracting text written in right-to-left or bidirectional scripts, such as Arabic and Hebrew. Support for extracting such text is planned for a future PDFTextStream release.