Accessing PDF document metadata

PDFxStream allows your applications to access both varieties of document-level metadata that might be available in a PDF file: "DocumentInfo" name/value mappings, and Adobe XMP data.

"DocumentInfo" Name / Value Metadata

Sometimes referred to as "classic" metadata, "DocumentInfo" name/value pairs typically include the creation and modification dates, the name of the document's author, sometimes a document title, and other potentially useful attributes. Retrieving this metadata is easy:

Document doc = PDF.open(pdfFilePath);
// access all available metadata keys
for (String key : doc.getAttributeKeys()) {
    System.out.printf("%s: %s\n", key, doc.getAttribute(key));
}

// Use predefined keys (Document.ATTR_*) to access well-known metadata
String authorName = (String) doc.getAttribute(Document.ATTR_AUTHOR);
System.out.println("Author: " + authorName);

A few notes about this code:

Several methods are available for retrieving document attributes: com.snowtide.pdf.Document.getAttributeMap() returns all of a PDF file's attributes, com.snowtide.pdf.Document.getAttributeKeys() returns just the attribute keys, and so on.
The names of many standard document attributes are held as static final Strings in the com.snowtide.pdf.Document interface. By convention, the names of these constants all begin with ATTR_.
In most circumstances, com.snowtide.pdf.Document.getAttribute(String) will return a String, which is what the PDF specification expects. However, PDF file generators are allowed to use non-String attributes: numbers (either integers or floats), booleans, and object arrays that contain any of the other possible attribute value types are technically permitted.
Date attributes are stored in PDF files as specially-encoded strings. Such attributes can be converted into java.util.Date objects by calling the com.snowtide.pdf.PDFDateParser.parseDateString(String) method with a String date attribute as the only parameter. The only standard date attributes contained in PDF files are the creation and modification dates, mapped to the com.snowtide.pdf.Document.ATTR_CREATION_DATE and com.snowtide.pdf.Document.ATTR_MOD_DATE keys, respectively.
Document.getAttribute(String) returns null if the given attribute key is not defined in the PDF file that was read.
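
Because attribute values are not guaranteed to be Strings, casting the result of getAttribute(String) directly to String can fail on unusual files. The following is a minimal sketch of a more defensive approach, not part of PDFxStream itself: the AttributeValues class and its describe method are hypothetical helpers, and the sketch assumes that "object arrays" arrive as Java Object[] values. The sample values in main are stand-ins rather than data read from a real PDF.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper (not a PDFxStream class) for displaying attribute values
final class AttributeValues {

    /** Renders an attribute value of any permitted type as display text. */
    static String describe(Object value) {
        if (value == null) {
            // getAttribute(String) returns null for keys the PDF does not define
            return "(not set)";
        }
        if (value instanceof Object[]) {
            // Arrays may hold any other permitted type, including nested arrays
            StringJoiner joined = new StringJoiner(", ", "[", "]");
            for (Object element : (Object[]) value) {
                joined.add(describe(element));
            }
            return joined.toString();
        }
        // Strings, numbers, and booleans all have usable string forms
        return String.valueOf(value);
    }

    public static void main(String[] args) {
        // Stand-in values; with PDFxStream these would come from doc.getAttribute(key)
        Map<String, Object> attrs = new LinkedHashMap<>();
        attrs.put("Author", "Jane Doe");
        attrs.put("Revision", 12);
        attrs.put("Tags", new Object[] {"draft", 3, true});
        for (Map.Entry<String, Object> entry : attrs.entrySet()) {
            System.out.println(entry.getKey() + ": " + describe(entry.getValue()));
        }
    }
}
```

In application code, a call such as AttributeValues.describe(doc.getAttribute(Document.ATTR_AUTHOR)) would then avoid a ClassCastException when a generator has stored something other than a String.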

Adobe XMP metadata

A PDF document may also contain metadata in the form of an Adobe XMP (Extensible Metadata Platform) stream. XMP streams are XML documents that adhere to the XMP metadata schema as defined by Adobe. XMP streams typically contain the same set of metadata attributes that are available through the "classic" metadata attribute accessors described above. However, some specialized PDF generators and workflows add metadata constructs to a document's XMP stream that do not fit within the simple name/value pair structure of "classic" metadata.

PDFxStream allows your application to access XMP streams very easily, as shown in this example:

import java.io.*;

import com.snowtide.PDF;
import com.snowtide.pdf.Document;

public class ExtractXMPMetadata {
    public static void main (String[] args) throws IOException {
        String pdfFilePath = args[0];
        
        Document doc = PDF.open(pdfFilePath);
        String outPath = pdfFilePath + ".xmp.xml";
        // try-with-resources closes the output file even if the write fails
        try (FileOutputStream s = new FileOutputStream(outPath)) {
            s.write(doc.getXmlMetadata());
        } finally {
            doc.close();
        }
        
        System.out.println("Wrote Adobe XMP metadata to " + outPath);
    }
}

The byte array returned by com.snowtide.pdf.Document.getXmlMetadata() is XML data pulled directly from the PDF file being read. Its format is defined by the Adobe XMP specification.
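
Since the returned bytes are plain XML, they can also be inspected in memory with the JDK's built-in XML parser instead of being written to a file. The following is a rough sketch that lists the authors recorded in the Dublin Core dc:creator property, which XMP stores as an rdf:Seq of rdf:li items. The XmpCreators class and its creators method are hypothetical names, not part of PDFxStream, and the sample packet in main is hand-written for illustration. The disallow-doctype-decl feature is assumed to be supported by the XML parser in use, as it is by the one bundled with the JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper (not a PDFxStream class) that reads dc:creator from XMP bytes
final class XmpCreators {
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** Returns the dc:creator entries found in an XMP packet, or an empty list. */
    static List<String> creators(byte[] xmp) {
        List<String> names = new ArrayList<>();
        if (xmp == null || xmp.length == 0) {
            return names; // treat missing XMP data as "no creators"
        }
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // The XML comes from an untrusted file, so refuse DTDs outright
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            // Fully qualified to avoid clashing with com.snowtide.pdf.Document
            org.w3c.dom.Document xml =
                    factory.newDocumentBuilder().parse(new ByteArrayInputStream(xmp));
            NodeList creatorNodes = xml.getElementsByTagNameNS(DC_NS, "creator");
            for (int i = 0; i < creatorNodes.getLength(); i++) {
                NodeList items = ((Element) creatorNodes.item(i))
                        .getElementsByTagNameNS(RDF_NS, "li");
                for (int j = 0; j < items.getLength(); j++) {
                    names.add(items.item(j).getTextContent().trim());
                }
            }
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("XMP data could not be parsed as XML", e);
        }
        return names;
    }

    public static void main(String[] args) {
        // Hand-written sample packet; with PDFxStream, pass doc.getXmlMetadata() instead
        String sample =
            "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">"
          + "<rdf:RDF xmlns:rdf=\"" + RDF_NS + "\">"
          + "<rdf:Description rdf:about=\"\" xmlns:dc=\"" + DC_NS + "\">"
          + "<dc:creator><rdf:Seq><rdf:li>Jane Doe</rdf:li></rdf:Seq></dc:creator>"
          + "</rdf:Description></rdf:RDF></x:xmpmeta>";
        System.out.println(creators(sample.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The same pattern extends to other XMP properties by changing the namespace and local name passed to getElementsByTagNameNS.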