Applies to:
PDFImageStream
Extracting Images from PDF Documents
PDFImageStream provides a comprehensive set of PDF image extraction capabilities that are exposed within the broader PDFxStream document model API. This includes support for dozens of PDF image encoding schemes, rendering of images to on-screen graphics contexts, serializing images to familiar formats (e.g. JPEG and PNG), and automatic stitching of image tiles and strips.
PDF images are accessible via com.snowtide.pdf.Page.getImages(); each com.snowtide.pdf.layout.Image included in the returned collection offers ways to obtain:
- the image's location, its intrinsic dimensions, and its dimensions as rendered on the page: com.snowtide.pdf.layout.Image.bitmapBounds() and com.snowtide.pdf.layout.Image.bounds()
- a platform-specific bitmap object (java.awt.image.BufferedImage on Java, System.Drawing.Bitmap on .NET) that can be painted directly to a runtime graphics context: com.snowtide.pdf.layout.Image.bitmap()
- image data (com.snowtide.pdf.layout.Image.data()) encoded in one of a few common formats (com.snowtide.pdf.layout.Image.dataFormat()), suitable for saving to disk or database
Though this API is extremely simple, it serves as a façade for a great deal of functionality, all applied automatically so you can ignore the stark complexities of how images are embedded and encoded within PDF documents.
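To illustrate what "painted directly to a runtime graphics context" can look like on Java, here is a minimal sketch that uses only the JDK. The class name PaintBitmapSketch and the method paintScaled are hypothetical names chosen for this illustration, and the solid-color BufferedImage is a stand-in for whatever Image.bitmap() would return; in a real program the target rectangle would presumably be derived from Image.bounds().

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

public class PaintBitmapSketch {

    /** Creates a solid-color bitmap; a stand-in for a bitmap obtained from a PDF image. */
    public static BufferedImage solid(int width, int height, Color color) {
        BufferedImage img = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = img.createGraphics();
        try {
            g.setColor(color);
            g.fillRect(0, 0, width, height);
        } finally {
            g.dispose();
        }
        return img;
    }

    /** Draws the bitmap onto the canvas, scaled to fill the given rectangle (canvas pixels). */
    public static void paintScaled(BufferedImage bitmap, BufferedImage canvas,
                                   int x, int y, int width, int height) {
        Graphics2D g = canvas.createGraphics();
        try {
            g.drawImage(bitmap, x, y, width, height, null);
        } finally {
            g.dispose();
        }
    }

    public static void main(String[] args) {
        // Hypothetical values: a 4x4 bitmap drawn into a 40x30 area of a 100x100 canvas.
        BufferedImage bitmap = solid(4, 4, Color.RED);
        BufferedImage canvas = new BufferedImage(100, 100, BufferedImage.TYPE_INT_RGB);
        paintScaled(bitmap, canvas, 10, 20, 40, 30);
        System.out.printf("Pixel inside the painted area: %08x%n", canvas.getRGB(30, 35));
    }
}
```

The same drawImage call works against any Graphics2D, such as the one handed to a Swing component's paint method, so the bitmap does not need to be re-encoded before display.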
Here is a simple program that extracts and saves all images from a specified PDF document to disk:
import com.snowtide.PDF;
import com.snowtide.pdf.Document;
import com.snowtide.pdf.Page;
import com.snowtide.pdf.layout.Image;

import java.io.File;
import java.io.FileOutputStream;

public class ExtractImages {
    public static void main (String[] args) throws java.io.IOException {
        String pdfFilePath = args[0];
        File outputDir = new File(args[1]);
        if (!outputDir.exists()) outputDir.mkdirs();

        Document pdf = PDF.open(pdfFilePath);
        for (Page p : pdf.getPages()) {
            int i = 0;
            for (Image img : p.getImages()) {
                // Name each file <page number>-<image index>.<format>, e.g. "3-0.png"
                File target = new File(outputDir, String.format("%s-%s.%s",
                        p.getPageNumber(), i, img.dataFormat().name().toLowerCase()));
                // try-with-resources closes the stream even if the write fails
                try (FileOutputStream out = new FileOutputStream(target)) {
                    out.write(img.data());
                }
                i++;
            }
            System.out.printf("Found %s images on page %s%n",
                    p.getImages().size(), p.getPageNumber());
        }
    }
}
Automatic stitching of image tiles / strips
Many programs that generate PDFs split embedded images into tiles or strips, so that what appears to be a single image when displayed in a PDF viewer is actually stored in the PDF document as many smaller images arranged seamlessly. This is an irrelevant implementation detail as long as the documents in question are only being viewed, but it is a significant problem when one's objective is to extract the original image as conceived by the author of the PDF (who is surely unaware of just how their documents' images are encoded).
PDFxStream addresses this by detecting image tiles and strips, and automatically joining them appropriately, without any intervention or configuration on your part.
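To give a sense of what joining strips involves, the following is a rough sketch using only the JDK. It is not PDFxStream's algorithm, which also has to detect which images belong together and handle tiles in two dimensions; it assumes the simplest case of equally wide strips that are already known to stack top to bottom. The class name StripStitchSketch and its methods are hypothetical names used only for this illustration.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.util.Arrays;
import java.util.List;

public class StripStitchSketch {

    /** Creates a solid-color strip; a stand-in for one embedded image strip. */
    public static BufferedImage solid(int width, int height, int argb) {
        BufferedImage img = new BufferedImage(width, height, BufferedImage.TYPE_INT_ARGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                img.setRGB(x, y, argb);
            }
        }
        return img;
    }

    /** Stacks equally wide strips top to bottom into one image. */
    public static BufferedImage stitchVertically(List<BufferedImage> strips) {
        if (strips.isEmpty()) {
            throw new IllegalArgumentException("at least one strip is required");
        }
        int width = strips.get(0).getWidth();
        int height = 0;
        for (BufferedImage strip : strips) {
            if (strip.getWidth() != width) {
                throw new IllegalArgumentException("strips must share the same width");
            }
            height += strip.getHeight();
        }
        BufferedImage joined = new BufferedImage(width, height, BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = joined.createGraphics();
        try {
            int y = 0;
            for (BufferedImage strip : strips) {
                g.drawImage(strip, 0, y, null); // place each strip directly below the previous one
                y += strip.getHeight();
            }
        } finally {
            g.dispose();
        }
        return joined;
    }

    public static void main(String[] args) {
        // Hypothetical input: a red 8x2 strip above a blue 8x3 strip.
        BufferedImage joined = stitchVertically(Arrays.asList(
                solid(8, 2, 0xFFFF0000), solid(8, 3, 0xFF0000FF)));
        System.out.println(joined.getWidth() + "x" + joined.getHeight());
    }
}
```

Even this simplified case shows why doing the work by hand is tedious: the caller has to know the strip order and geometry up front, which is exactly the information the library works out automatically.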
Obtaining .NET bitmaps from Images
One of the few areas in which the documented PDFxStream API differs when used in Java vs. .NET is in obtaining bitmap objects suitable for drawing to the appropriate platform-native graphics context.
Image.bitmap() is defined to return a java.awt.image.BufferedImage, one of the primary representations of raster image data on Java. This is not particularly useful on .NET, where System.Drawing.Bitmap is required for drawing to a runtime graphics context (i.e. System.Drawing.Graphics). As a workaround, java.awt.image.BufferedImage provides a special getBitmap() method on .NET, which efficiently returns a System.Drawing.Image:
Page page = pdf.getPage(0);
// bitmap() returns a java.awt.image.BufferedImage, not a PDFxStream Image
java.awt.image.BufferedImage image = page.getImages().get(0).bitmap();
if (image != null) {
    System.Drawing.Bitmap bmp = image.getBitmap();
    // ...use .NET bitmap...
}