Accessing PDF document metadata
PDFTextStream allows your applications to access both varieties of document-level metadata that might be available in a PDF file: "DocumentInfo" name/value mappings, and Adobe XMP data.
"DocumentInfo" Name / Value Metadata
Sometimes referred to as "classic" metadata, "DocumentInfo" name/value pairs typically include creation and modification dates, the name of the document's author, and other potentially interesting metadata attributes. Retrieving the document metadata attributes contained in a PDF file is a no-brainer, as shown in this code segment:
    PDFTextStream stream = new PDFTextStream(pdfFile);

    // get collection of all document attribute names
    Set attributeKeys = stream.getAttributeKeys();

    // print the values of all document attributes to System.out
    Iterator iter = attributeKeys.iterator();
    String attrKey;
    while (iter.hasNext()) {
        attrKey = (String)iter.next();
        System.out.println(attrKey + "=" + stream.getAttribute(attrKey));
    }

    // print the value of the Author attribute to System.out
    String authorName = (String)stream.getAttribute(PDFTextStream.ATTR_AUTHOR);
    System.out.println("Author: " + authorName);
A few notes about this code:
- If a PDFTextStream constructor returns successfully, then it can be assumed that all of the document attributes in the PDF file have been read and are available for retrieval. The attributes can be accessed from a PDFTextStream instance even if the instance has been closed.
- A number of methods are available for retrieving all of a PDF file's document attributes, just the keys of the attributes, and so on. Check the PDFTextStream javadocs for details.
- The names of many standard document attributes are held as static final Strings by PDFTextStream; the variable names of these Strings all begin with "ATTR" to identify them as such.
- PDFTextStream.getAttribute(String) can technically return any Object. However, in most circumstances, the values of document attributes are Strings; this is true for all standard PDF attributes, but PDF file generators are allowed to add non-String attributes. The allowable data types in PDF attribute values include Strings, Numbers (either Integer or Float instances depending on the type of number), Booleans, and Object[] arrays that contain any of the other possible attribute value types.
- Date attributes are stored in PDF files as specially-encoded Strings. Such attributes can be converted into java.util.Date objects by calling the PDFDateParser.parseDateString(String) method with the String date attribute as the only parameter. The only standard date attributes contained in PDF files are the creation and modification dates, associated with the PDFTextStream.ATTR_CREATION_DATE and PDFTextStream.ATTR_MOD_DATE keys, respectively.
- PDFTextStream.getAttribute(String) will return null if an attribute key is provided that is not defined in the PDF file that was read.
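Because an attribute value can be a String, a Number, a Boolean, an Object[] array, or null, code that casts every value to String may fail on files written by unusual PDF generators. The following is a minimal sketch of one way to handle each of those cases. The AttributeValues class and its describe method are hypothetical helpers written for this illustration; they are not part of PDFTextStream and use only the standard Java library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the PDFTextStream API.
final class AttributeValues {
    private AttributeValues() {}

    /** Renders an attribute value of any of the types listed above as display text. */
    static String describe(Object value) {
        if (value == null) {
            // getAttribute(String) returns null for keys the PDF file does not define
            return "(not defined)";
        }
        if (value instanceof String) {
            return (String) value;
        }
        if (value instanceof Number || value instanceof Boolean) {
            return value.toString();
        }
        if (value instanceof Object[]) {
            // Arrays may contain any of the other value types, including nested arrays
            List<String> parts = new ArrayList<>();
            for (Object element : (Object[]) value) {
                parts.add(describe(element));
            }
            return "[" + String.join(", ", parts) + "]";
        }
        // Unexpected type: fall back to its default text form rather than fail
        return String.valueOf(value);
    }
}
```

In the attribute-printing loop shown earlier, the result of stream.getAttribute(attrKey) could be passed to AttributeValues.describe instead of being concatenated or cast directly.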
Adobe XMP metadata
A PDF document may also contain metadata in the form of an Adobe XMP (Extensible Metadata Platform) stream. XMP streams are XML documents that adhere to the XMP metadata schema as defined by Adobe. XMP streams typically contain the same set of metadata attributes that are available through the "classic" metadata attribute accessors, described above. However, some specialized PDF generators and workflows add metadata constructs to a document's XMP stream that do not fit within the simple name / value pair structure of "classic" metadata.
PDFTextStream allows your application to access XMP streams very easily, as shown in this example:
    PDFTextStream stream = new PDFTextStream(pdfFile);

    // get XMP data stream
    byte[] xmlMetadata = stream.getXmlMetadata();

    // close PDFTextStream instance **AFTER** retrieving XMP metadata stream
    stream.close();

    // handle metadata from XMP stream in application-specific manner
    processXMPMetadata(pdfFile, xmlMetadata);
Things to consider when retrieving XMP data from a PDFTextStream instance:
- The PDFTextStream.getXmlMetadata() method may not be called after the PDFTextStream instance is closed.
- The byte array returned by PDFTextStream.getXmlMetadata() is XML data pulled directly from the PDF file being read; PDFTextStream does not process this data at all. Its format is defined by the Adobe XMP specification.
- PDFTextStream.getXmlMetadata() will return null if the PDF file being read does not contain any document-level XMP data.
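Since the returned byte array is plain XML, an application can process it with any XML parser. The sketch below shows one possible approach using the JDK's built-in DOM parser to read the Dublin Core title (dc:title) that XMP packets commonly carry. The XmpTitleReader class and readTitle method are hypothetical names chosen for this illustration, and the sketch assumes the title is stored in the usual rdf:Alt / rdf:li form; it is not the only way to handle XMP data, and a dedicated XMP library may suit complex packets better.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of the PDFTextStream API.
final class XmpTitleReader {
    private static final String DC_NS = "http://purl.org/dc/elements/1.1/";
    private static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    private XmpTitleReader() {}

    /**
     * Returns the first dc:title entry in the XMP packet, or null if the
     * byte array is null (no XMP data in the PDF) or holds no title.
     */
    static String readTitle(byte[] xmpBytes) {
        if (xmpBytes == null) {
            return null;
        }
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // XMP relies on namespaces, so the parser must be namespace-aware
            factory.setNamespaceAware(true);
            // XMP packets need no DOCTYPE; rejecting one guards against XXE attacks
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new ByteArrayInputStream(xmpBytes));

            NodeList titles = doc.getElementsByTagNameNS(DC_NS, "title");
            if (titles.getLength() == 0) {
                return null;
            }
            // Language alternatives for the title are stored as rdf:li elements
            NodeList items = ((Element) titles.item(0)).getElementsByTagNameNS(RDF_NS, "li");
            if (items.getLength() == 0) {
                return null;
            }
            return items.item(0).getTextContent().trim();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("XMP data is not well-formed XML", e);
        }
    }
}
```

A method like this could serve as part of the application-specific processXMPMetadata step in the example above, called with the byte array obtained before the PDFTextStream instance is closed.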