PDFTextStream for .NET

PDFTextStream.NET is produced by translating the standard PDFTextStream for Java binary into a pure managed .NET assembly. The translation is complete: it preserves PDFTextStream's Java APIs, architecture, functionality, and performance, and delivers them on the .NET platform.

This kind of translation is possible because the Java Virtual Machine (JVM) and the .NET Common Language Runtime (CLR) are very similar architecturally, and the Java and .NET object models are virtually analogous. The actual translation is performed by IKVM's static compilation process. IKVM is an open source toolkit that makes it possible to run Java applications and libraries within the .NET environment.

IKVM and the Java class library bundled with it are both licensed in a way that permits redistributing them with commercial products. Licensing PDFTextStream.NET for inclusion in your own product on an OEM basis is therefore straightforward and poses no threat to your product's license.

Requirements

PDFTextStream.NET requires v2.0 or higher of the .NET or Mono runtime. All DLLs for a given PDFTextStream release are found in the lib directory of the PDFTextStream.NET distribution:

  • PDFTextStream.dll
  • IKVM.Runtime.dll
  • IKVM.OpenJDK.Beans.dll
  • IKVM.OpenJDK.Core.dll
  • IKVM.OpenJDK.Media.dll
  • IKVM.OpenJDK.Security.dll
  • IKVM.OpenJDK.SwingAWT.dll
  • IKVM.OpenJDK.Text.dll
  • IKVM.OpenJDK.Util.dll
  • IKVM.OpenJDK.XML.API.dll

The IKVM DLL files are PDFTextStream.NET's only dependencies. They provide the implementation of Java's standard library in .NET, as well as some runtime components that are required by any Java JAR that has been translated into a .NET assembly. No configuration or special initialization of these DLL files is necessary.

Installation

Using PDFTextStream.NET within your .NET project is as simple as adding references to each of the DLL files listed in the previous section.

Typical Usage

Using PDFTextStream.NET is very straightforward, and mirrors typical PDFTextStream for Java usage. Here's a sample text extraction function in C#:

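using com.snowtide.pdf;

namespace DotNetExampleFunction
{
    class ExampleFunction
    {
        // Extracts all text from the PDF at pdfFilePath and returns it as a string.
        static string extractPDFText (string pdfFilePath)
        {
            // OutputTarget appends the extracted text to the buffer it is given.
            java.lang.StringBuffer sb = new java.lang.StringBuffer(1024);
            OutputTarget tgt = new OutputTarget(sb);
            PDFTextStream stream = new PDFTextStream(new java.io.File(pdfFilePath));
            try
            {
                stream.pipe(tgt);
            }
            finally
            {
                // Release the underlying file even if extraction fails.
                stream.close();
            }
            return sb.toString();
        }
    }
}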

All of the PDFTextStream API is available in .NET, including support for extracting bookmarks, annotations and hyperlinks, interactive form data, and document properties. Because of this, the PDFTextStream javadoc is the authoritative API reference for PDFTextStream, whether it is used in Java or .NET.

Notes and Limitations

There are no limitations. PDFTextStream.NET is a pure .NET assembly, through and through, and it acts like it.

For example, you can freely write OutputHandler implementations in .NET. Here is a contrived example that counts the number of characters extracted from a PDF:

namespace SubclassingExample
{
    class CharCountingTarget : com.snowtide.pdf.OutputTarget
    {
        private int cnt = 0;
        
        public CharCountingTarget (java.lang.StringBuffer sb) : base(sb)
        {
        }
        
        public override void textUnit (com.snowtide.pdf.layout.TextUnit tu)
        {
            base.textUnit(tu);
            cnt++;
        }
        
        // Returns the number of text units counted so far and resets the counter.
        public int getCount ()
        {
            int _cnt = cnt;
            cnt = 0;
            return _cnt;
        }
    }
}

An OutputHandler (or OutputTarget, in this case) subclass like this can be used in conjunction with any pipe(OutputHandler) method, found on instances of the com.snowtide.pdf.PDFTextStream, com.snowtide.pdf.Page, and com.snowtide.pdf.layout.Block classes.
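As a minimal sketch of that usage, the following passes a CharCountingTarget to the pipe method of a PDFTextStream instance. It uses only the calls already shown on this page. The SubclassingExampleUsage namespace, the CountChars class, and the use of the first command-line argument as the PDF path are assumptions made for illustration, and the code assumes CharCountingTarget is compiled into the same assembly.

```csharp
using com.snowtide.pdf;
using SubclassingExample;

namespace SubclassingExampleUsage
{
    // Hypothetical driver class, not part of PDFTextStream.
    class CountChars
    {
        static void Main (string[] args)
        {
            // Assumption: args[0] holds the path to a PDF file.
            java.lang.StringBuffer sb = new java.lang.StringBuffer(1024);
            CharCountingTarget tgt = new CharCountingTarget(sb);
            PDFTextStream stream = new PDFTextStream(new java.io.File(args[0]));
            try
            {
                // Each extracted text unit is passed to CharCountingTarget.textUnit.
                stream.pipe(tgt);
            }
            finally
            {
                stream.close();
            }
            System.Console.WriteLine("Text units counted: " + tgt.getCount());
        }
    }
}
```

Because CharCountingTarget calls base.textUnit, the extracted text still accumulates in the StringBuffer, so the same pass yields both the text and the count.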