Regarding Unicode text and character sets

PDFTextStream supports the extraction of Unicode text from all PDF files, with very few restrictions or caveats with regard to language or character set.

Single-byte character sets, such as those used in conjunction with Roman text (e.g. English, French, Spanish, Italian, German, Dutch), are fully supported.

Double-byte character sets, such as those used in conjunction with Chinese, Japanese, and Korean (CJK) text, are also fully supported. Both horizontal and vertical writing modes are recognized and translated into appropriate text extracts.

PDFTextStream usage is unchanged regardless of the type of text being extracted – all character encoding issues are handled automatically.

Controlling CJK capabilities

PDFTextStream actively caches the resources it uses when extracting CJK text, which significantly improves performance. However, caching does lead to increased memory consumption. If that memory cost is a concern and you do not need CJK text, you can turn off PDFTextStream’s CJK text extraction capabilities by setting the pdfts.cjk.enable system property before using PDFTextStream.
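As a minimal sketch, the property can be set programmatically with the standard System.setProperty call, provided this happens before any PDFTextStream class is used. The property name pdfts.cjk.enable comes from the text above, but this section does not state which value disables CJK support. The "false" default below is therefore only an assumption, and the CjkToggle class and setCjkProperty method are hypothetical names used for illustration.

```java
public final class CjkToggle {
    // Property name taken from the documentation above.
    static final String CJK_ENABLE_PROPERTY = "pdfts.cjk.enable";

    // Sets the property and returns its previous value (null if it was unset).
    // The value that actually disables CJK support is not given in this section;
    // callers must supply it.
    static String setCjkProperty(String value) {
        return System.setProperty(CJK_ENABLE_PROPERTY, value);
    }

    public static void main(String[] args) {
        // ASSUMPTION: "false" is a placeholder; confirm the accepted value
        // in the PDFTextStream reference before relying on it.
        String value = args.length > 0 ? args[0] : "false";
        setCjkProperty(value);
        System.out.println(CJK_ENABLE_PROPERTY + "="
                + System.getProperty(CJK_ENABLE_PROPERTY));
        // Only after this point should any PDFTextStream code run.
    }
}
```

The same property can be supplied on the JVM command line with the standard -D option (for example, java -Dpdfts.cjk.enable=<value> ...), which guarantees it is set before any application code runs.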

Future plans

PDFTextStream does not yet support the extraction of text that uses right-to-left or bidirectional writing modes or alphabets, such as Arabic and Hebrew. Support for the extraction of such text is planned for a future PDFTextStream release.