Extracting Images from PDF Documents

PDFImageStream provides a comprehensive set of PDF image extraction capabilities that are exposed within the broader PDFxStream document model API. These include support for dozens of PDF image encoding schemes, rendering of images to on-screen graphics contexts, serialization of images to familiar formats (e.g. JPEG and PNG), and automatic stitching of image tiles and strips.

PDF images are accessible via com.snowtide.pdf.Page.getImages(); each com.snowtide.pdf.layout.Image included in the returned collection offers ways to obtain:

the image's intrinsic dimensions, and its location and dimensions as rendered on the page: com.snowtide.pdf.layout.Image.bitmapBounds() and com.snowtide.pdf.layout.Image.bounds(), respectively
a platform-specific bitmap object (java.awt.image.BufferedImage on Java, System.Drawing.Bitmap on .NET) that can be painted directly to a runtime graphics context: com.snowtide.pdf.layout.Image.bitmap()
image data (com.snowtide.pdf.layout.Image.data()) encoded in one of a few common formats (com.snowtide.pdf.layout.Image.dataFormat()), suitable for saving to disk or database

Though this API is extremely simple, it serves as a façade for a great deal of functionality, all applied automatically so you can ignore the stark complexities of how images are embedded and encoded within PDF documents.
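The difference between an image's intrinsic dimensions and its rendered dimensions determines the resolution at which it appears on the page. The following is a minimal sketch of that arithmetic, not part of the PDFxStream API: the class and method names are hypothetical, and it assumes that rendered dimensions are expressed in PDF points (1/72 inch), which is the default for PDF user space but should be confirmed for your documents.

```java
/** Hypothetical helper: estimates the resolution at which an image is rendered. */
class ImageResolution {
    /** PDF points per inch (assumes default, unscaled user space). */
    static final double POINTS_PER_INCH = 72.0;

    /**
     * @param intrinsicPixels pixel count along one axis (the image's intrinsic size)
     * @param renderedPoints  rendered extent along the same axis, assumed to be in points
     * @return the effective resolution along that axis, in dots per inch
     */
    static double effectiveDpi(double intrinsicPixels, double renderedPoints) {
        if (renderedPoints <= 0) {
            throw new IllegalArgumentException("rendered extent must be positive");
        }
        return intrinsicPixels / (renderedPoints / POINTS_PER_INCH);
    }

    public static void main(String[] args) {
        // A 600-pixel-wide bitmap drawn 144 points (2 inches) wide
        System.out.println(effectiveDpi(600, 144)); // prints 300.0
    }
}
```

A figure like this can help when deciding whether an extracted image is worth keeping at full size, or whether it was downsampled when the PDF was generated.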

Here is a simple program that extracts and saves all images from a specified PDF document to disk:

import com.snowtide.PDF;
import com.snowtide.pdf.Document;
import com.snowtide.pdf.Page;
import com.snowtide.pdf.layout.Image;

import java.io.File;
import java.io.FileOutputStream;

public class ExtractImages {
    public static void main (String[] args) throws java.io.IOException {
        String pdfFilePath = args[0];
        File outputDir = new File(args[1]);
        if (!outputDir.exists()) outputDir.mkdirs();

        Document pdf = PDF.open(pdfFilePath);
        try {
            for (Page p : pdf.getPages()) {
                int i = 0;
                for (Image img : p.getImages()) {
                    // Name each file <page>-<index>.<format>
                    File target = new File(outputDir, String.format("%s-%s.%s",
                            p.getPageNumber(), i, img.dataFormat().name().toLowerCase()));
                    // try-with-resources closes the stream even if the write fails
                    try (FileOutputStream out = new FileOutputStream(target)) {
                        out.write(img.data());
                    }
                    i++;
                }
                System.out.printf("Found %s images on page %s%n", i, p.getPageNumber());
            }
        } finally {
            // Release the resources held by the open document
            pdf.close();
        }
    }
}
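The program above saves the already-encoded bytes from Image.data(). If you would rather work from the java.awt.image.BufferedImage returned by Image.bitmap() and choose the output format yourself, the JDK's javax.imageio.ImageIO can do the encoding. This is a rough sketch under that assumption; the BitmapToPng class and its methods are hypothetical names, and a small in-memory bitmap stands in for one obtained from a PDF.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

/** Hypothetical helper: round-trips a bitmap through PNG using the JDK's ImageIO. */
class BitmapToPng {
    /** Encodes a bitmap as PNG bytes. */
    static byte[] toPng(BufferedImage bitmap) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            // ImageIO.write returns false when no writer handles the requested format
            if (!ImageIO.write(bitmap, "png", buffer)) {
                throw new IllegalStateException("No PNG writer available");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Decodes PNG bytes back into a bitmap. */
    static BufferedImage fromPng(byte[] png) {
        try {
            return ImageIO.read(new ByteArrayInputStream(png));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the BufferedImage that Image.bitmap() would return
        BufferedImage bitmap = new BufferedImage(4, 3, BufferedImage.TYPE_INT_RGB);
        bitmap.setRGB(1, 1, 0xFF0000);

        BufferedImage decoded = fromPng(toPng(bitmap));
        System.out.println(decoded.getWidth() + "x" + decoded.getHeight()); // prints 4x3
    }
}
```

PNG is lossless, so it is a reasonable default when the source encoding is unknown; re-encoding to JPEG would save space at the cost of fidelity.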

Automatic stitching of image tiles / strips

Many programs that generate PDFs split each embedded image into tiles or strips, so that what appears to be a single image in a PDF viewer is actually stored in the document as many smaller images arranged seamlessly. This is an irrelevant implementation detail as long as the documents in question are only being viewed, but it is a significant problem when the objective is to extract the original image as conceived by the author of the PDF (who is surely unaware of just how their documents' images are encoded).

PDFxStream addresses this by detecting image tiles and strips, and automatically joining them appropriately, without any intervention or configuration on your part.
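To illustrate what joining strips involves, here is a simplified sketch that draws equal-width strips, ordered top to bottom, onto a single canvas. It is not PDFxStream's implementation, which also has to detect which images belong together and handle tiles, differing encodings, and page placement; the StripStitcher class and its methods are hypothetical.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration: joins horizontal strips into one bitmap. */
class StripStitcher {
    /** Joins equal-width strips, given in top-to-bottom order. */
    static BufferedImage stitchVertically(List<BufferedImage> strips) {
        if (strips.isEmpty()) {
            throw new IllegalArgumentException("no strips to join");
        }
        int width = strips.get(0).getWidth();
        int height = 0;
        for (BufferedImage strip : strips) {
            if (strip.getWidth() != width) {
                throw new IllegalArgumentException("strip widths differ");
            }
            height += strip.getHeight();
        }
        BufferedImage joined = new BufferedImage(width, height, BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = joined.createGraphics();
        try {
            int y = 0;
            for (BufferedImage strip : strips) {
                g.drawImage(strip, 0, y, null); // place each strip below the previous one
                y += strip.getHeight();
            }
        } finally {
            g.dispose();
        }
        return joined;
    }

    /** Builds a single-color strip, used here only as test data. */
    static BufferedImage solid(int width, int height, int rgb) {
        BufferedImage img = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                img.setRGB(x, y, rgb);
            }
        }
        return img;
    }

    public static void main(String[] args) {
        BufferedImage joined = stitchVertically(
                Arrays.asList(solid(5, 2, 0xFF0000), solid(5, 3, 0x0000FF)));
        System.out.println(joined.getWidth() + "x" + joined.getHeight()); // prints 5x5
    }
}
```

With PDFxStream none of this is needed in your own code: the images returned by Page.getImages() are already joined.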

Obtaining .NET bitmaps from `Image`s

One of the few areas in which the documented PDFxStream API differs when used in Java vs. .NET is in obtaining bitmap objects suitable for drawing to the appropriate platform-native graphics context.

Image.bitmap() is defined to return a java.awt.image.BufferedImage, one of the primary representations of raster image data on Java. This is not particularly useful on .NET, where a System.Drawing.Bitmap is required for drawing to a runtime graphics context (i.e. System.Drawing.Graphics). As a workaround, java.awt.image.BufferedImage provides a special getBitmap() method on .NET, which efficiently returns a System.Drawing.Bitmap:

 Page page = pdf.getPage(0);
 java.awt.image.BufferedImage image = page.getImages().get(0).bitmap();
 if (image != null)
 {
     System.Drawing.Bitmap bmp = image.getBitmap();
     // ...use .NET bitmap...
 }

PDFxStream v3.3.7 Technical Documentation
