Applies to:
PDFTextStream

Regarding Unicode text and character sets

Though PDF documents may encode text using any of a number of predefined encodings (and may even define and use their own), PDFTextStream normalizes all extracted text to Unicode. Since both Java and .NET strings use the UTF-16 Unicode encoding internally, this means that text extracted from PDF documents using PDFTextStream is always accurate with regard to "special" characters.

PDFTextStream always aims to preserve all "special" characters. If your application requires extracted PDF text to be normalized to a character set more constrained than Unicode (e.g. ASCII or Latin-1), a simple search-and-replace, literal or regular-expression-based, can be applied after extraction: for example, replacing "æ" with "ae".
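A minimal sketch of such post-extraction normalization in Java (the replacement map and sample input are purely illustrative; plain String.replace is used here, though a regex via replaceAll works equally well):

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class LigatureNormalizer {
        // Illustrative replacements for characters outside ASCII/Latin-1.
        private static final Map<String, String> REPLACEMENTS = new LinkedHashMap<>();
        static {
            REPLACEMENTS.put("æ", "ae");
            REPLACEMENTS.put("Æ", "AE");
            REPLACEMENTS.put("œ", "oe");
            REPLACEMENTS.put("\ufb01", "fi"); // "fi" ligature
            REPLACEMENTS.put("\ufb02", "fl"); // "fl" ligature
        }

        public static String normalize(String extracted) {
            for (Map.Entry<String, String> e : REPLACEMENTS.entrySet()) {
                extracted = extracted.replace(e.getKey(), e.getValue());
            }
            return extracted;
        }

        public static void main(String[] args) {
            // In practice, the input would be text extracted by PDFTextStream.
            System.out.println(normalize("Encyclopædia")); // prints "Encyclopaedia"
        }
    }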

Note that all of Unicode is supported, including the double-byte character sets used for Chinese, Japanese, and Korean (CJK) text. Both horizontal and vertical writing modes are recognized and translated into appropriate text extracts.

PDFTextStream usage is unchanged regardless of the type of text being extracted – all character encoding issues are handled automatically.
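For reference, a minimal extraction sketch following PDFTextStream's documented pipe/OutputTarget pattern (the file name is hypothetical, and exact entry points may vary between PDFTextStream versions); the same code handles Latin, CJK, or mixed-script documents:

    import com.snowtide.PDF;
    import com.snowtide.pdf.Document;
    import com.snowtide.pdf.OutputTarget;

    public class ExtractText {
        public static void main(String[] args) throws java.io.IOException {
            // Open the PDF and pipe all of its text into a StringBuilder.
            // No encoding-related configuration is required, regardless of
            // the scripts the document contains.
            Document pdf = PDF.open("sample.pdf"); // hypothetical path
            StringBuilder text = new StringBuilder(1024);
            try {
                pdf.pipe(new OutputTarget(text));
            } finally {
                pdf.close();
            }
            System.out.println(text);
        }
    }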

Controlling CJK Capabilities

PDFTextStream actively caches the resources it uses when extracting CJK text – this significantly improves performance. However, caching does lead to increased memory consumption. To avoid that overhead, you can turn off PDFTextStream’s CJK text extraction capabilities by setting the pdfts.cjk.enable system property before using PDFTextStream.
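For example, in Java (a minimal sketch: "false" as the disabling value is an assumption based on the property's name, and the property must be set before any PDFTextStream class is loaded):

    public class DisableCjk {
        public static void main(String[] args) {
            // Assumed disabling value; must run before PDFTextStream is used.
            System.setProperty("pdfts.cjk.enable", "false");
            // ... open and extract PDFs as usual ...
        }
    }

Equivalently, the property can be supplied on the JVM command line (-Dpdfts.cjk.enable=false), which guarantees it is set before any class is loaded.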

Bidirectional Text

PDFTextStream does not yet support extracting text written in right-to-left or bidirectional scripts, such as Arabic and Hebrew. Support for the extraction of such text is planned for a future PDFTextStream release.