Applies to:
PDFImageStream

Extracting Images from PDF Documents

PDFImageStream provides a comprehensive set of PDF image extraction capabilities, exposed within the broader PDFxStream document model API. These include support for dozens of PDF image encoding schemes, rendering of images to on-screen graphics contexts, serializing images to familiar formats (e.g. JPEG, PNG), and automatic stitching of image tiles and strips.

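PDF images are accessible via com.snowtide.pdf.Page.getImages(); each com.snowtide.pdf.layout.Image included in the returned collection offers ways to obtain the image's encoded data (data()), the format of that data (dataFormat()), and a bitmap suitable for rendering (bitmap()).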

Though this API is extremely simple, it serves as a façade for a great deal of functionality, all applied automatically so you can ignore the stark complexities of how images are embedded and encoded within PDF documents.

Here is a simple program that extracts and saves all images from a specified PDF document to disk:

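import com.snowtide.PDF;
import com.snowtide.pdf.Document;
import com.snowtide.pdf.Page;
import com.snowtide.pdf.layout.Image;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Locale;

public class ExtractImages {
    public static void main (String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("Usage: ExtractImages <pdf-file> <output-dir>");
            System.exit(1);
        }
        String pdfFilePath = args[0];
        File outputDir = new File(args[1]);
        if (!outputDir.isDirectory() && !outputDir.mkdirs()) {
            throw new IOException("Could not create output directory " + outputDir);
        }

        Document pdf = PDF.open(pdfFilePath);
        try {
            for (Page p : pdf.getPages()) {
                int i = 0;
                for (Image img : p.getImages()) {
                    // Name each file <page number>-<image index>.<format>
                    File outFile = new File(outputDir, String.format("%s-%s.%s",
                            p.getPageNumber(), i, img.dataFormat().name().toLowerCase(Locale.ROOT)));
                    // try-with-resources closes the stream even if the write fails
                    try (FileOutputStream out = new FileOutputStream(outFile)) {
                        out.write(img.data());
                    }
                    i++;
                }
                System.out.printf("Found %s images on page %s%n", i, p.getPageNumber());
            }
        } finally {
            // Release the document's resources once all pages are processed
            pdf.close();
        }
    }
}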
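
The program above writes each image in whatever format dataFormat() reports. If you would rather normalize everything to a single format, one option is to re-encode the java.awt.image.BufferedImage returned by Image.bitmap() using the JDK's javax.imageio.ImageIO. The following is a minimal sketch under that assumption; PngEncoder and toPng are hypothetical names used for illustration and are not part of PDFxStream.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper for illustration; not part of the PDFxStream API.
final class PngEncoder {

    /** Encodes a bitmap (for example, the result of Image.bitmap()) as PNG bytes. */
    static byte[] toPng(BufferedImage bitmap) throws IOException {
        if (bitmap == null) {
            throw new IllegalArgumentException("bitmap must not be null");
        }
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // ImageIO.write returns false when no suitable writer is found
        if (!ImageIO.write(bitmap, "png", buffer)) {
            throw new IOException("No PNG writer available for this image type");
        }
        return buffer.toByteArray();
    }
}
```

Inside the extraction loop, this could replace img.data() with PngEncoder.toPng(img.bitmap()) together with a fixed "png" file extension. Check the bitmap for null first, as the .NET sample later on this page does.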

Automatic stitching of image tiles / strips

Many programs that generate PDFs split embedded images into tiles or strips, so that what appears to be a single image in a PDF viewer is actually stored in the PDF document as many smaller images arranged seamlessly. This is an irrelevant implementation detail as long as the documents in question are only being viewed, but it is a significant problem when the objective is to extract the original image as conceived by the author of the PDF (who is surely unaware of just how their documents' images are encoded).

PDFxStream addresses this by detecting image tiles and strips, and automatically joining them appropriately, without any intervention or configuration on your part.
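
To make the idea concrete, here is a minimal sketch of what joining horizontal strips amounts to, using only the JDK. It is not PDFxStream's implementation, which also has to detect which images on a page belong together and handle tiles in two dimensions; StripStitcher and stackVertically are hypothetical names, and the sketch assumes the strips are already ordered top to bottom.

```java
import java.awt.image.BufferedImage;
import java.util.List;

// Illustrative only; PDFxStream performs stitching internally and automatically.
final class StripStitcher {

    /** Stacks strips top to bottom into one image; narrower strips are left-aligned. */
    static BufferedImage stackVertically(List<BufferedImage> strips) {
        if (strips.isEmpty()) {
            throw new IllegalArgumentException("at least one strip is required");
        }
        int width = 0;
        int height = 0;
        for (BufferedImage strip : strips) {
            width = Math.max(width, strip.getWidth());
            height += strip.getHeight();
        }
        // Pixels not covered by any strip remain fully transparent
        BufferedImage joined = new BufferedImage(width, height, BufferedImage.TYPE_INT_ARGB);
        int y = 0;
        for (BufferedImage strip : strips) {
            int w = strip.getWidth();
            int h = strip.getHeight();
            // Copy the strip's ARGB pixels into the joined image at the current offset
            int[] pixels = strip.getRGB(0, 0, w, h, null, 0, w);
            joined.setRGB(0, y, w, h, pixels, 0, w);
            y += h;
        }
        return joined;
    }
}
```

With PDFxStream none of this is needed in your own code: the images returned by Page.getImages() are already joined.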

Obtaining .NET bitmaps from Images

One of the few areas in which the documented PDFxStream API differs when used in Java vs. .NET is in obtaining bitmap objects suitable for drawing to the appropriate platform-native graphics context.

Image.bitmap() is defined to return a java.awt.image.BufferedImage, one of the primary representations of raster image data in Java. This is not particularly useful on .NET, where a System.Drawing.Bitmap is required for drawing to a runtime graphics context (i.e. System.Drawing.Graphics). As a workaround, java.awt.image.BufferedImage provides a special getBitmap() method on .NET, which efficiently returns a System.Drawing.Bitmap:

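 Page page = pdf.getPage(0);
 // bitmap() returns a java.awt.image.BufferedImage, not a com.snowtide.pdf.layout.Image
 java.awt.image.BufferedImage image = page.getImages().get(0).bitmap();
 if (image != null)
 {
     System.Drawing.Bitmap bmp = image.getBitmap();
     // ...use .NET bitmap...
 }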