PDFTextStream configuration options

PDFTextStream's configuration can be controlled in three different ways:

  • Globally, by changing the state of the default instance of com.snowtide.pdf.PDFTextStreamConfig, available via com.snowtide.pdf.PDFTextStreamConfig.getDefaultConfig().
  • Globally, by setting particular system properties that com.snowtide.pdf.PDFTextStreamConfig uses to initialize its default instance.
  • Locally, on a per-document (that is, per-PDFTextStream-instance) basis, by passing a separate instance of com.snowtide.pdf.PDFTextStreamConfig, modified as desired, to each com.snowtide.pdf.PDFTextStream constructor.

Each of the options available in com.snowtide.pdf.PDFTextStreamConfig is detailed in its API documentation. The rest of this document explains how to set system properties so that com.snowtide.pdf.PDFTextStreamConfig picks them up, and then lists the available system properties.

Each of the following system properties must be set before referencing PDFTextStream in any way, as the properties are checked and their values (if any) are acted upon when PDFTextStream is statically initialized. Therefore, the safest way to use these configuration-related system properties is to set them when starting your application:

java -cp [classpath] -Dpdfts.config.property=value your.main.classname

You can also set system properties in your code as long as you do so before your first usage of PDFTextStream. Using Java on the JVM:

System.setProperty("pdfts.config.property", "config_value");
PDFTextStream stream = new PDFTextStream(new File("c:\\some\\path.pdf"));
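
The ordering requirement follows from how the JVM initializes classes: a static initializer runs once, the first time a class is used, and later changes to a system property are not seen by values captured at that point. The following is a minimal sketch of that behaviour using only the JDK. The class names StaticInitDemo and LibraryConfig and the property name demo.config.property are hypothetical stand-ins, not part of PDFTextStream.

```java
public class StaticInitDemo {

    // Hypothetical stand-in for a library class that reads its
    // configuration once, when the class is statically initialized.
    static class LibraryConfig {
        static final String VALUE =
                System.getProperty("demo.config.property", "default");
    }

    // Sets the property, then reads the value the "library" captured.
    public static String readAfterSetting(String value) {
        System.setProperty("demo.config.property", value);
        return LibraryConfig.VALUE;
    }

    public static void main(String[] args) {
        // First use: the property is set before LibraryConfig is
        // initialized, so the value is picked up.
        System.out.println(readAfterSetting("first"));   // prints "first"

        // Second use: LibraryConfig is already initialized, so the new
        // property value is ignored.
        System.out.println(readAfterSetting("second"));  // prints "first"
    }
}
```

The second call shows why setting a property after the first reference to the library has no effect, and why the command-line form is the safest choice.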

Using C# on .NET:

using com.snowtide.pdf;
java.lang.System.setProperty("pdfts.config.property", "config_value");
PDFTextStream stream = new PDFTextStream(new java.io.File(@"c:\some\path.pdf"));

PDFTextStream.NET users can also set these properties in the app.config file, which is equivalent to the Java convention of specifying system properties on the command line using the -D option (note the ikvm: prefix, which exposes the property to the Java namespaces):

<?xml version="1.0"?>
<configuration>
  <appSettings>
    <add key="ikvm:pdfts.config.property" value="config_value" />
  </appSettings>
</configuration>

Available system properties

line.separator

Set this system property to the string you want PDFTextStream to use to separate lines in text extracts. This defaults to your platform's default line separator ("\n" on Linux/Unix/Mac OS X, and "\r\n" on Windows platforms).
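
Separators are invisible when printed, so it can help to check which value is actually in effect. The sketch below uses only the JDK; the class name LineSeparatorDemo and the helper visible are hypothetical. Note that line.separator is also the JVM's standard property, so an override is visible to any other code in the same JVM that reads it.

```java
public class LineSeparatorDemo {

    /** Makes CR and LF visible so a separator can be printed or logged. */
    public static String visible(String separator) {
        return separator.replace("\r", "\\r").replace("\n", "\\n");
    }

    public static void main(String[] args) {
        // Platform default, for example \n on Linux or \r\n on Windows.
        System.out.println("default:  "
                + visible(System.getProperty("line.separator")));

        // Override in code. For PDFTextStream to pick this up, it must
        // happen before PDFTextStream is first referenced.
        System.setProperty("line.separator", "\n");
        System.out.println("override: "
                + visible(System.getProperty("line.separator")));
    }
}
```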

pdfts.cjk.enable

Setting this system property to "N" will disable PDFTextStream’s ability to extract Chinese, Japanese, or Korean (CJK) text. This may be desirable if memory utilization is a concern – CJK character maps are very large, and can consume significant amounts of memory. As always, application profiling is recommended to determine the actual source(s) of memory consumption.

pdfts.logfactory

PDFTextStream defaults to using java.util.logging or Log4J for logging informational and error messages. However, many environments demand customized logging frameworks. Therefore, PDFTextStream provides a pluggable logging architecture that enables you to hook your custom logging framework into PDFTextStream. To do so, simply implement the com.snowtide.util.logging.LogFactory interface, and set the pdfts.logfactory system property to the full classname of your implementation.

pdfts.loggingtype

PDFTextStream normally defaults to using the java.util.logging logging framework. To force PDFTextStream to default to using Log4J, set the pdfts.loggingtype system property to "log4j".

pdfts.layout.detectTables

By default, PDFTextStream will attempt to detect tabular data on each extracted page, and infer the structure of each table. This structure is then materialized as rows of com.snowtide.pdf.layout.Block objects within higher-level com.snowtide.pdf.layout.Table blocks.

This detection and inference can be disabled globally by setting the pdfts.layout.detectTables system property to "N".

pdfts.mmap.enable (deprecated)

By default, PDFTextStream does not memory-map opened PDF files. This feature can be enabled by setting the pdfts.mmap.enable system property to "Y".

This option is deprecated, and will be removed in future releases of PDFTextStream.

Due to a bug in Java’s implementation of memory-mapped files on Windows, a PDF file opened and processed by PDFTextStream may remain locked even after the PDFTextStream instance’s close() method has been called and PDFTextStream has released all of the filesystem handles it allocated. This locking behaviour (which is known to occur only on Windows) prevents the PDF file from being deleted or moved until Java’s garbage collector reclaims certain JDK-internal objects that track and manage the previously memory-mapped PDF file.
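
The lingering lock is likely related to a documented property of java.nio: a mapping created with FileChannel.map does not depend on the channel that created it, and stays valid until the buffer itself is garbage-collected. The sketch below illustrates this with standard JDK classes only. It does not use PDFTextStream, and the class name MappedFileDemo and method mapAndClose are hypothetical.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedFileDemo {

    /** Maps the file read-only, closes the channel, and returns the mapping. */
    public static MappedByteBuffer mapAndClose(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping is created while the channel is open; the channel
            // is closed by try-with-resources before this method returns.
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("mmap-demo", ".bin");
        file.toFile().deleteOnExit();
        Files.write(file, "%PDF-".getBytes(StandardCharsets.US_ASCII));

        // The channel is already closed here, yet the mapped contents
        // remain readable: the mapping outlives the channel.
        MappedByteBuffer buffer = mapAndClose(file);
        byte[] head = new byte[buffer.remaining()];
        buffer.get(head);
        System.out.println(new String(head, StandardCharsets.US_ASCII)); // prints "%PDF-"
    }
}
```

Because the mapping is released only when the buffer is collected, closing every handle does not by itself free the file, which matches the Windows behaviour described above.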