PDFxStream for .NET

PDFxStream.NET is produced by translating the PDFxStream for Java binary into a managed .NET assembly. This translation process is complete, preserving PDFxStream’s API, architecture, functionality, and performance characteristics.

This kind of translation is possible because the Java Virtual Machine (JVM) and the .NET Common Language Runtime (CLR) are very similar architecturally, and the Java and .NET object models are conceptually analogous. The actual translation is performed by IKVM's static compilation process. IKVM is an open-source toolkit that makes it possible to run Java applications and libraries within the .NET environment.

IKVM and the included OpenJDK library both use a liberal open-source license that makes it possible to redistribute them with commercial products without constraining such products' own licenses.

Requirements

PDFxStream.NET requires v2.0 SP2 or higher of the .NET or Mono runtime. All DLLs for a given PDFxStream release are found in the lib directory of the PDFxStream.NET distribution. This includes a number of IKVM.*.dll files (e.g. IKVM.Runtime.dll), as well as two PDFxStream DLLs, only one of which you will use, depending on the .NET language you are using:

  • PDFxStreamVB.dll, for use only in VB.NET projects
  • PDFxStream.dll, for use with any language other than VB.NET

As indicated above, you should choose only one of the PDFxStream DLLs, based on which .NET language you are using: VB.NET projects should use PDFxStreamVB.dll, while all other languages should use PDFxStream.dll.

The IKVM DLL files are PDFxStream.NET's only dependencies. They provide the implementation of Java's standard library in .NET, as well as some runtime components that are required by any Java JAR that has been translated into a .NET assembly. No configuration or special initialization of these DLL files is necessary.

Why are there different PDFxStream DLLs for different .NET languages?

Symbols in VB are case-insensitive, which causes a collision between the com.snowtide.pdf namespace and our primary entry point, the com.snowtide.PDF class. In the PDFxStreamVB.dll library for use with VB.NET, the com.snowtide.PDF class is renamed to com.snowtide.PDFxStream, eliminating any ambiguity. No other part of the API documented here or in our API reference is affected, so you can continue to use these resources while programming PDFxStream via VB.NET.

All other .NET languages (including C#, F#, and others) treat namespace and class symbols as case-sensitive, so they can use the standard PDFxStream API as-is.

Installation

Using PDFxStream.NET within your .NET project is as simple as adding references to each of the DLL files indicated in the previous section: all of the IKVM.*.dlls, and one of either PDFxStream.dll or PDFxStreamVB.dll, depending on the .NET language your project uses.

Typical Usage

Using PDFxStream.NET is straightforward and mirrors typical PDFxStream for Java usage. Here's a sample text extraction program in C#:

using com.snowtide;
using com.snowtide.pdf;
using java.io;

class ExtractTextAllPages
{
    public static void Main(string[] args)
    {
        string pdfFilePath = args[0];
        StringWriter text = new StringWriter(1024);
        using (Document doc = PDF.open(pdfFilePath))
        {
            doc.pipe(new OutputTarget(text));
        }
        System.Console.WriteLine("The text extracted from {0} is:",
            pdfFilePath);
        System.Console.WriteLine(text.toString());
    }
}

Without exception, all of the PDFxStream API is available in .NET. Because of this, the PDFxStream javadoc is the authoritative API reference for PDFxStream, whether it is used in Java or .NET.

Notes and Limitations

The sole minor difference between the documented PDFxStream API and its usage in .NET is how one obtains bitmap objects from extracted PDF image data. See this note for details.

Aside from this minor irregularity, PDFxStream.NET carries no limitations; it is a pure .NET assembly, through and through, and it acts like it.

For example, you can freely write com.snowtide.pdf.OutputHandler implementations in .NET. Here is a contrived example for illustration that will count the number of characters extracted from a PDF:

namespace SubclassingExample
{
    class CharCountingTarget : com.snowtide.pdf.OutputTarget
    {
        private int cnt = 0;
        
        public CharCountingTarget (java.lang.Appendable sb) : base(sb)
        {
        }
        
        public override void textUnit (com.snowtide.pdf.layout.TextUnit tu)
        {
            base.textUnit(tu);
            cnt++;
        }
        
        // Returns the count accumulated so far and resets it to zero.
        public int getCount ()
        {
            int _cnt = cnt;
            cnt = 0;
            return _cnt;
        }
    }
}

An OutputHandler (or com.snowtide.pdf.OutputTarget, in this case) subclass like this can be used in conjunction with any pipe(OutputHandler) method, found on instances of com.snowtide.pdf.Document, com.snowtide.pdf.Page, and com.snowtide.pdf.layout.Block.
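As a minimal sketch of how such a subclass might be put to work, the program below pipes a whole document through the CharCountingTarget class defined above and prints the resulting count. It uses only calls that appear elsewhere on this page (PDF.open, pipe, and a java.io.StringWriter as the Appendable); the CountCharacters class name is hypothetical, and the example assumes it is compiled into the same assembly as CharCountingTarget.

```csharp
using com.snowtide;
using com.snowtide.pdf;
using java.io;

namespace SubclassingExample
{
    // Hypothetical driver class, for illustration only.
    class CountCharacters
    {
        public static void Main(string[] args)
        {
            string pdfFilePath = args[0];

            // StringWriter is a java.io.Writer, so it can be passed
            // wherever a java.lang.Appendable is expected.
            StringWriter text = new StringWriter();
            CharCountingTarget target = new CharCountingTarget(text);

            using (Document doc = PDF.open(pdfFilePath))
            {
                // Each text unit sent to the target increments its counter.
                doc.pipe(target);
            }

            // getCount() is read once here, since it resets the counter.
            System.Console.WriteLine("Count for {0}: {1}",
                pdfFilePath, target.getCount());
        }
    }
}
```

Because CharCountingTarget calls base.textUnit(tu), the extracted text is still written to the StringWriter, so text.toString() remains available alongside the count.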

Snowtide Collection Method Extensions

The com.snowtide namespace provides a couple of extension methods to make it easier to use parts of the PDFxStream API in .NET.

Consuming collections as IEnumerable

Java collections all implement the java.lang.Iterable interface, which is analogous to .NET's IEnumerable interface. Unfortunately, the IKVM compilation process does not expose Java collections as IEnumerables; without an appropriate extension method, collections returned by PDFxStream could not be traversed with e.g. foreach or passed to any method that requires an IEnumerable.

Using the com.snowtide namespace will bring an extension method into scope that makes it easy to treat any collection returned by PDFxStream as an IEnumerable, e.g. here used to easily iterate through the keys of the document metadata in a PDF document:

using com.snowtide;
using com.snowtide.pdf;

class ExtractMetadata
{
    public static void Main(string[] args)
    {   
        string pdfFilePath = args[0];
        System.Console.WriteLine("All document metadata from {0}:", pdfFilePath);
        using (Document doc = PDF.open(pdfFilePath))
        {
            foreach (string attrKey in doc.getAttributeKeys().toIEnumerable<string>())
            {
                System.Console.WriteLine("{0}: {1}", attrKey, doc.getAttribute(attrKey));
            }
        }
    }
}

Using StringBuffer and StringBuilder as Appendables

Many implementations of OutputHandler provided by PDFxStream accept java.lang.Appendable objects as their principal constructor argument. This interface is implemented by a number of useful sinks for textual output, including java.lang.StringBuffer, java.lang.StringBuilder, java.nio.CharBuffer, any subclass of java.io.Writer, etc.

The one wrinkle to this is that StringBuffer and StringBuilder implement Appendable via a shared package-private superclass, the methods and implemented interfaces of which are not visible to code using StringBuffer or StringBuilder in .NET. This means that this C# code will not compile:

using com.snowtide;
using com.snowtide.pdf;
// ...
java.lang.StringBuilder sb = new java.lang.StringBuilder();
// compile error: StringBuilder is not visible as an Appendable from .NET
OutputTarget tgt = new OutputTarget(sb);

The simplest solution is to avoid java.lang.StringBuilder and java.lang.StringBuffer in .NET code. Any usage of them in conjunction with PDFxStream can be replaced with e.g. java.io.StringWriter; all PDFxStream code samples demonstrate and recommend using StringWriter with OutputHandler implementations.

The other option is to use the .toAppendable() extension method provided by the com.snowtide namespace:

using com.snowtide;
using com.snowtide.pdf;
// ...
java.lang.StringBuilder sb = new java.lang.StringBuilder();
OutputTarget tgt = new OutputTarget(sb.toAppendable());
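
For context, here is a sketch of the same technique in a complete program, built only from calls already shown on this page. The ExtractToStringBuilder class name is hypothetical, and the example assumes that the object returned by toAppendable() writes through to the underlying StringBuilder, so the extracted text can be read back from sb afterwards.

```csharp
using com.snowtide;       // brings the toAppendable() extension method into scope
using com.snowtide.pdf;

// Hypothetical example class, for illustration only.
class ExtractToStringBuilder
{
    public static void Main(string[] args)
    {
        string pdfFilePath = args[0];
        java.lang.StringBuilder sb = new java.lang.StringBuilder();

        using (Document doc = PDF.open(pdfFilePath))
        {
            // Wrap the StringBuilder so it can be passed as an Appendable.
            doc.pipe(new OutputTarget(sb.toAppendable()));
        }

        // Assumption: text appended via the wrapper ends up in sb.
        System.Console.WriteLine(sb.toString());
    }
}
```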