Extracting Images from PDF Documents

PDFxStream provides a comprehensive set of PDF image extraction capabilities, exposed within the broader PDFxStream document model API. These include support for dozens of PDF image encoding schemes, rendering of images to on-screen graphics contexts, serializing images to familiar formats (e.g. JPEG and PNG), and automatic stitching of image tiles and strips.

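PDF images are accessible via com.snowtide.pdf.Page.getImages(); each com.snowtide.pdf.layout.Image in the returned collection offers ways to obtain the image's data and the format of that data (for example, via data() and dataFormat(), as used in the example program below).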

Though this API is extremely simple, it serves as a façade for a great deal of functionality, all applied automatically so you can ignore the devious complexities of how images are embedded and encoded within PDF documents.

Here is a simple program that extracts and saves all images from a specified PDF document to disk:

import com.snowtide.PDF;
import com.snowtide.pdf.Document;
import com.snowtide.pdf.Page;
import com.snowtide.pdf.layout.Image;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class ExtractImages {
    public static void main (String[] args) throws IOException {
        String pdfFilePath = args[0];
        File outputDir = new File(args[1]);
        if (!outputDir.exists()) outputDir.mkdirs();

        Document pdf = PDF.open(pdfFilePath);
        try {
            for (Page p : pdf.getPages()) {
                int i = 0;
                for (Image img : p.getImages()) {
                    // Name each file <page number>-<image index>.<format name>
                    File outFile = new File(outputDir, String.format("%s-%s.%s",
                            p.getPageNumber(), i, img.dataFormat().name().toLowerCase()));
                    // try-with-resources closes the stream even if the write fails
                    try (FileOutputStream out = new FileOutputStream(outFile)) {
                        out.write(img.data());
                    }
                    i++;
                }
                System.out.printf("Found %s images on page %s%n",
                        p.getImages().size(), p.getPageNumber());
            }
        } finally {
            // Release the resources held by the open document
            pdf.close();
        }
    }
}
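
Once the files are on disk, it can be useful to confirm that they decode. The following is a minimal sketch that uses only the JDK's javax.imageio.ImageIO and does not touch PDFxStream at all; the class name CheckExtractedImages and the method countReadable are hypothetical names chosen for this illustration. Note that ImageIO.read returns null for any format it has no registered reader for, so a null result does not necessarily mean the extracted file is damaged.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper, not part of PDFxStream: reports which files in a
// directory can be decoded by the image readers that ship with the JDK.
final class CheckExtractedImages {

    /** Returns the number of regular files in dir that ImageIO could decode. */
    static int countReadable(File dir) throws IOException {
        File[] files = dir.listFiles();
        if (files == null) {
            throw new IOException("Not a readable directory: " + dir);
        }
        int readable = 0;
        for (File f : files) {
            if (!f.isFile()) continue;
            // ImageIO.read returns null when no registered reader understands the file
            BufferedImage img = ImageIO.read(f);
            if (img == null) {
                System.out.printf("%s: no ImageIO reader for this format%n", f.getName());
            } else {
                System.out.printf("%s: %d x %d pixels%n",
                        f.getName(), img.getWidth(), img.getHeight());
                readable++;
            }
        }
        return readable;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the output directory passed to the extraction program
        System.out.println(countReadable(new File(args[0])) + " file(s) decoded");
    }
}
```

Running it against the output directory of the extraction program prints one line per file, followed by the count of files that decoded.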

Automatic stitching of image tiles / strips

Many programs that generate PDFs split each embedded image into tiles or strips, so that what appears to be a single image in a PDF viewer is actually stored in the document as many smaller images arranged seamlessly. This is an irrelevant implementation detail as long as the documents are only being viewed, but it is a significant problem when the objective is to extract the original image as conceived by the author of the PDF (who is almost certainly unaware of how their documents' images are encoded).

PDFxStream addresses this by detecting image tiles and strips, and automatically joining them appropriately, without any intervention or configuration on your part.
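
To make the idea of joining strips concrete, here is a purely illustrative sketch using the JDK's java.awt.image.BufferedImage. It is not PDFxStream's implementation, which also has to detect which images belong together and where they sit on the page; this sketch assumes the strips are already known, are all the same width, and are listed top to bottom. The names StripStitcher, stitchVertically, and solid are hypothetical.

```java
import java.awt.image.BufferedImage;
import java.util.Arrays;
import java.util.List;

// Illustrative only: joins equally wide strips top-to-bottom into one image.
final class StripStitcher {

    static BufferedImage stitchVertically(List<BufferedImage> strips) {
        if (strips.isEmpty()) {
            throw new IllegalArgumentException("at least one strip is required");
        }
        int width = strips.get(0).getWidth();
        int height = 0;
        for (BufferedImage s : strips) {
            if (s.getWidth() != width) {
                throw new IllegalArgumentException("all strips must have the same width");
            }
            height += s.getHeight();
        }
        BufferedImage out = new BufferedImage(width, height, BufferedImage.TYPE_INT_ARGB);
        int y = 0; // vertical offset at which the next strip is placed
        for (BufferedImage s : strips) {
            // Copy the strip's pixels as packed ARGB values into the output
            int[] argb = s.getRGB(0, 0, width, s.getHeight(), null, 0, width);
            out.setRGB(0, y, width, s.getHeight(), argb, 0, width);
            y += s.getHeight();
        }
        return out;
    }

    /** Builds a single-colour test strip; rgb is a 0xRRGGBB value. */
    static BufferedImage solid(int width, int height, int rgb) {
        BufferedImage img = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                img.setRGB(x, y, rgb);
            }
        }
        return img;
    }

    public static void main(String[] args) {
        BufferedImage joined = stitchVertically(
                Arrays.asList(solid(4, 2, 0xFF0000), solid(4, 3, 0x0000FF)));
        System.out.println(joined.getWidth() + " x " + joined.getHeight());
    }
}
```

With PDFxStream none of this is necessary, since the images returned by Page.getImages() are already joined; the sketch only shows why extracting the raw embedded pieces would otherwise leave that work to the caller.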