PDFxStream for .NET

PDFxStream.NET is produced by translating the PDFxStream for Java binary into a managed .NET assembly. This translation process is complete, preserving PDFxStream’s API, architecture, functionality, and performance characteristics.

This kind of translation is possible because the Java Virtual Machine (JVM) and the .NET Common Language Runtime (CLR) are very similar architecturally, and the Java and .NET object models are conceptually analogous. The actual translation is performed by IKVM's static compilation process. IKVM is an open source toolkit that makes it possible to run Java applications and libraries within the .NET environment.

IKVM and the included OpenJDK library both use a liberal open-source license that makes it possible to redistribute them with commercial products without constraining such products' own licenses.

Requirements

PDFxStream.NET requires v2.0 SP2 or higher of the .NET or Mono runtime. All DLLs for a given PDFxStream release are found in the lib directory of the PDFxStream.NET distribution:

  • PDFxStream.dll
  • IKVM.Runtime.dll
  • IKVM.AWT.WinForms.dll
  • IKVM.OpenJDK.Beans.dll
  • IKVM.OpenJDK.Charsets.dll
  • IKVM.OpenJDK.Core.dll
  • IKVM.OpenJDK.Media.dll
  • IKVM.OpenJDK.Security.dll
  • IKVM.OpenJDK.SwingAWT.dll
  • IKVM.OpenJDK.Text.dll
  • IKVM.OpenJDK.Util.dll
  • IKVM.OpenJDK.XML.API.dll

The IKVM DLL files are PDFxStream.NET's only dependencies. They provide the implementation of Java's standard library in .NET, as well as some runtime components that are required by any Java JAR that has been translated into a .NET assembly. No configuration or special initialization of these DLL files is necessary.

Installation

Using PDFxStream.NET within your .NET project is as simple as adding references to each of the DLL files indicated in the previous section.

Typical Usage

Using PDFxStream.NET is straightforward and mirrors typical PDFxStream for Java usage. Here's a sample text extraction program in C#:

using com.snowtide;
using com.snowtide.pdf;
using java.io;

class ExtractTextAllPages
{
    public static void Main(string[] args)
    {
        string pdfFilePath = args[0];
        StringWriter text = new StringWriter(1024);
        using (Document doc = PDF.open(pdfFilePath))
        {
            doc.pipe(new OutputTarget(text));
        }
        System.Console.WriteLine("The text extracted from {0} is:",
            pdfFilePath);
        System.Console.WriteLine(text.toString());
    }
}

Without exception, all of the PDFxStream API is available in .NET. Because of this, the PDFxStream javadoc is the authoritative API reference for PDFxStream, whether it is used in Java or .NET.

Notes and Limitations

The sole minor difference between the documented PDFxStream API and its usage in .NET is how one obtains bitmap objects from extracted PDF image data. See this note for details.

Aside from this minor irregularity, PDFxStream.NET carries no limitations; it is a pure .NET assembly, through and through, and it acts like it.

For example, you can freely write com.snowtide.pdf.OutputHandler implementations in .NET. Here is a contrived example for illustration that will count the number of characters extracted from a PDF:

namespace SubclassingExample
{
    class CharCountingTarget : com.snowtide.pdf.OutputTarget
    {
        private int cnt = 0;
        
        public CharCountingTarget (java.lang.Appendable sb) : base(sb)
        {
        }
        
        public override void textUnit (com.snowtide.pdf.layout.TextUnit tu)
        {
            base.textUnit(tu);
            cnt++;
        }
        
        public int getCount ()
        {
            // Return the count accumulated so far, then reset it for the next run.
            int _cnt = cnt;
            cnt = 0;
            return _cnt;
        }
    }
}

An OutputHandler (or com.snowtide.pdf.OutputTarget, in this case) subclass like this can be used in conjunction with any pipe(OutputHandler) method, found on instances of com.snowtide.pdf.Document, com.snowtide.pdf.Page, and com.snowtide.pdf.layout.Block.
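
As a minimal sketch of that usage, the program below pipes a whole document through the CharCountingTarget class defined above and prints the tally. It assumes getCount() returns the accumulated count; the CountChars class name is made up for illustration, and every PDFxStream call is one already shown in this document.

```csharp
using com.snowtide;
using com.snowtide.pdf;
using java.io;

namespace SubclassingExample
{
    // Hypothetical driver class for the CharCountingTarget example above.
    class CountChars
    {
        public static void Main(string[] args)
        {
            string pdfFilePath = args[0];
            // The StringWriter receives the extracted text; the target
            // increments its counter for each text unit it is handed.
            CharCountingTarget target = new CharCountingTarget(new StringWriter());
            using (Document doc = PDF.open(pdfFilePath))
            {
                doc.pipe(target);
            }
            System.Console.WriteLine("{0} text units extracted from {1}",
                target.getCount(), pdfFilePath);
        }
    }
}
```

The same target could be passed to the pipe method of a single Page or Block to count only that part of the document.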

Snowtide Collection Method Extensions

The com.snowtide namespace provides a couple of extension methods to make it easier to use parts of the PDFxStream API in .NET.

Consuming collections as IEnumerable

Java collections all implement the java.lang.Iterable interface, which is analogous to .NET's IEnumerable interface. Unfortunately, the IKVM compilation process does not expose Java collections as IEnumerables; without an appropriate extension method, collections returned by PDFxStream could not be traversed with e.g. foreach or passed to any method that requires an IEnumerable.

Importing the com.snowtide namespace brings the toIEnumerable<T>() extension method into scope, which makes it easy to treat any collection returned by PDFxStream as an IEnumerable. Here it is used to iterate through the keys of a PDF document's metadata:

using com.snowtide;
using com.snowtide.pdf;

class ExtractMetadata
{
    public static void Main(string[] args)
    {   
        string pdfFilePath = args[0];
        System.Console.WriteLine("All document metadata from {0}:", pdfFilePath);
        using (Document doc = PDF.open(pdfFilePath))
        {
            foreach (string attrKey in doc.getAttributeKeys().toIEnumerable<string>())
            {
                System.Console.WriteLine("{0}: {1}", attrKey, doc.getAttribute(attrKey));
            }
        }
    }
}
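
To make the idea concrete, here is a rough sketch of how an adapter of this kind could be written by hand. It is not PDFxStream's actual implementation: the IterableSketch class and AsEnumerableSketch method are hypothetical names, and the code assumes only the standard iterator(), hasNext(), and next() methods of java.lang.Iterable and java.util.Iterator as exposed by IKVM.

```csharp
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of the PDFxStream API.
public static class IterableSketch
{
    // Adapts any Java Iterable to a lazily evaluated IEnumerable<T>.
    public static IEnumerable<T> AsEnumerableSketch<T>(this java.lang.Iterable source)
    {
        java.util.Iterator it = source.iterator();
        while (it.hasNext())
        {
            // Java generics are erased, so next() returns object;
            // the cast throws if an element is not actually a T.
            yield return (T) it.next();
        }
    }
}
```

In practice the extension method supplied by the com.snowtide namespace covers this need, so a hand-written adapter like this is only worth considering for Java collections that come from elsewhere.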

Using StringBuffer and StringBuilder as Appendables

Many implementations of OutputHandler provided by PDFxStream accept java.lang.Appendable objects as their principal constructor argument. This interface is implemented by a number of useful sinks for textual output, including java.lang.StringBuffer, java.lang.StringBuilder, java.nio.CharBuffer, any subclass of java.io.Writer, etc.

The one wrinkle to this is that StringBuffer and StringBuilder implement Appendable via a shared package-private superclass, the methods and implemented interfaces of which are not visible to code using StringBuffer or StringBuilder in .NET. This means that this C# code will not compile:

using com.snowtide;
using com.snowtide.pdf;
// ...
java.lang.StringBuilder sb = new java.lang.StringBuilder();
OutputTarget tgt = new OutputTarget(sb);

The simplest solution is to not use java.lang.StringBuilder or java.lang.StringBuffer from .NET at all. Any usage of them in conjunction with PDFxStream can be replaced with e.g. java.io.StringWriter; all PDFxStream code samples demonstrate and recommend using StringWriter with OutputHandler implementations.

The other option is to use the .toAppendable() extension method provided by the com.snowtide namespace:

using com.snowtide;
using com.snowtide.pdf;
// ...
java.lang.StringBuilder sb = new java.lang.StringBuilder();
OutputTarget tgt = new OutputTarget(sb.toAppendable());