Extracting Images from PDF Documents
PDFxStream provides a comprehensive set of PDF image extraction capabilities that are exposed within the broader PDFxStream document model API. This includes support for dozens of PDF image encoding schemes, rendering of images to on-screen graphics contexts, serializing images to familiar formats (e.g. JPEG, PNG, etc.), and automatic stitching of image tiles and strips.
PDF images are accessible via com.snowtide.pdf.Page.getImages(); each com.snowtide.pdf.layout.Image included in the returned collection offers ways to obtain:
- the image's location, its intrinsic dimensions, and its dimensions as rendered on the page: com.snowtide.pdf.layout.Image.bitmapBounds() and com.snowtide.pdf.layout.Image.bounds()
- a platform-specific bitmap object (java.awt.image.BufferedImage on Java, System.Drawing.Bitmap on .NET) that can be painted directly to a runtime graphics context: com.snowtide.pdf.layout.Image.bitmap()
- image data (com.snowtide.pdf.layout.Image.data()) encoded in one of a few common formats (com.snowtide.pdf.layout.Image.dataFormat()), suitable for saving to disk or to a database
Though this API is extremely simple, it serves as a façade for a great deal of functionality, all applied automatically so you can ignore the devious complexities of how images are embedded and encoded within PDF documents.
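As a rough illustration of what "painting a bitmap to a graphics context" can look like on Java, the sketch below draws a java.awt.image.BufferedImage onto an offscreen canvas at a chosen size. It uses only the standard library: the class name PaintBitmapSketch, the method paintScaled, and the blank stand-in bitmap are hypothetical, and in a real program the bitmap would presumably come from com.snowtide.pdf.layout.Image.bitmap() rather than being created by hand.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

public class PaintBitmapSketch {

    /**
     * Paints the given bitmap onto a new offscreen canvas, scaled to
     * width x height (for example, the size at which it appears on the page).
     */
    public static BufferedImage paintScaled(BufferedImage bitmap, int width, int height) {
        BufferedImage canvas = new BufferedImage(width, height, BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = canvas.createGraphics();
        try {
            // Graphics2D scales the source bitmap to fit the target rectangle
            g.drawImage(bitmap, 0, 0, width, height, null);
        } finally {
            g.dispose();
        }
        return canvas;
    }

    public static void main(String[] args) {
        // Stand-in for an extracted bitmap (assumption: a 200x100 image)
        BufferedImage bitmap = new BufferedImage(200, 100, BufferedImage.TYPE_INT_RGB);
        BufferedImage rendered = paintScaled(bitmap, 100, 50);
        System.out.println(rendered.getWidth() + "x" + rendered.getHeight()); // prints "100x50"
    }
}
```

The same Graphics2D call works against any graphics context, such as the one passed to a Swing component's paint method, which is why a BufferedImage is a convenient form for on-screen rendering.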
Here is a simple program that extracts and saves all images from a specified PDF document to disk:
import com.snowtide.PDF;
import com.snowtide.pdf.Document;
import com.snowtide.pdf.Page;
import com.snowtide.pdf.layout.Image;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class ExtractImages {
    public static void main (String[] args) throws IOException {
        String pdfFilePath = args[0];
        File outputDir = new File(args[1]);
        if (!outputDir.exists()) outputDir.mkdirs();

        Document pdf = PDF.open(pdfFilePath);
        try {
            for (Page p : pdf.getPages()) {
                int i = 0;
                for (Image img : p.getImages()) {
                    // Name each file <page number>-<image index>.<format>
                    File outFile = new File(outputDir, String.format("%s-%s.%s",
                            p.getPageNumber(), i, img.dataFormat().name().toLowerCase()));
                    // try-with-resources closes the stream even if the write fails
                    try (FileOutputStream out = new FileOutputStream(outFile)) {
                        out.write(img.data());
                    }
                    i++;
                }
                System.out.printf("Found %s images on page %s%n", i, p.getPageNumber());
            }
        } finally {
            // Release the PDF document once all pages have been processed
            pdf.close();
        }
    }
}
Automatic stitching of image tiles / strips
Many programs that generate PDFs split embedded images into tiles or strips, so that what appears to be a single image when displayed in a PDF viewer is actually stored in the PDF document as many smaller images arranged seamlessly. This is an irrelevant implementation detail as long as the documents in question are only being viewed, but it is a significant problem when one's objective is to extract the original image as conceived by the author of the PDF (who is surely unaware of just how their documents' images are encoded).
PDFxStream addresses this by detecting image tiles and strips, and automatically joining them appropriately, without any intervention or configuration on your part.
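To make the idea of joining strips concrete, here is a minimal standard-library sketch that stacks equal-width strips top to bottom into one image. It is not PDFxStream's actual algorithm, which must also detect which images on a page belong together and handle tiles arranged in two dimensions; the class name StripStitcher, the method stitchVertical, and the solid-color test strips are all hypothetical.

```java
import java.awt.image.BufferedImage;
import java.util.Arrays;
import java.util.List;

public class StripStitcher {

    /** Stacks strips of identical width, in list order, from top to bottom. */
    public static BufferedImage stitchVertical(List<BufferedImage> strips) {
        if (strips.isEmpty()) {
            throw new IllegalArgumentException("at least one strip is required");
        }
        int width = strips.get(0).getWidth();
        int totalHeight = 0;
        for (BufferedImage strip : strips) {
            if (strip.getWidth() != width) {
                throw new IllegalArgumentException("all strips must have the same width");
            }
            totalHeight += strip.getHeight();
        }

        BufferedImage joined = new BufferedImage(width, totalHeight, BufferedImage.TYPE_INT_ARGB);
        int y = 0;
        for (BufferedImage strip : strips) {
            // Copy the strip's pixels into the joined image at the current vertical offset
            int[] pixels = strip.getRGB(0, 0, width, strip.getHeight(), null, 0, width);
            joined.setRGB(0, y, width, strip.getHeight(), pixels, 0, width);
            y += strip.getHeight();
        }
        return joined;
    }

    /** Builds a solid-color image to stand in for an embedded strip. */
    public static BufferedImage solid(int width, int height, int argb) {
        BufferedImage img = new BufferedImage(width, height, BufferedImage.TYPE_INT_ARGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                img.setRGB(x, y, argb);
            }
        }
        return img;
    }

    public static void main(String[] args) {
        BufferedImage top = solid(4, 2, 0xFFFF0000);    // red strip, 4x2
        BufferedImage bottom = solid(4, 3, 0xFF0000FF); // blue strip, 4x3
        BufferedImage joined = stitchVertical(Arrays.asList(top, bottom));
        System.out.println(joined.getWidth() + "x" + joined.getHeight()); // prints "4x5"
    }
}
```

With PDFxStream none of this is needed in application code; the sketch only shows why a naive extractor that skips this step yields many partial image files instead of one.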