Accessing PDF bookmarks
Some PDF documents contain bookmarks (sometimes referred to collectively as a document outline) that refer to significant document sections. If a document contains bookmarks, they appear in the ‘Bookmarks’ panel in Adobe Reader.
PDFTextStream allows you to access the bookmarks contained in PDF documents and all of the attributes associated with those bookmarks.
Bookmark structure and attributes
PDF bookmarks are organized into a tree structure with a single root node.
If a PDF document contains bookmarks, that root node is returned by the PDFTextStream.getBookmarks() method as a com.snowtide.pdf.Bookmark instance.
Each bookmark may contain child bookmarks, accessible using the Bookmark.getChildCnt() and Bookmark.getChild(int) methods; entire branches of the bookmark tree can also be easily retrieved using the Bookmark.getAllDescendants() and Bookmark.getAllDescendants(java.util.List) methods.
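The child-access methods above lend themselves to a simple recursive walk. The following is a minimal sketch rather than library code: OutlineNode and DemoNode are hypothetical stand-ins that mirror only the method names mentioned above (getTitle(), getPageNumber(), getChildCnt(), getChild(int)) so that the example compiles without the PDFTextStream jar, and the sample titles and page numbers are made up. With the real library, the same loop would presumably be written against com.snowtide.pdf.Bookmark.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in exposing only the Bookmark methods named in the text. */
interface OutlineNode {
    String getTitle();
    int getPageNumber();
    int getChildCnt();
    OutlineNode getChild(int index);
}

/** Minimal in-memory node, used only to exercise the traversal below. */
final class DemoNode implements OutlineNode {
    private final String title;
    private final int pageNumber;
    private final List<OutlineNode> children = new ArrayList<>();

    DemoNode(String title, int pageNumber) {
        this.title = title;
        this.pageNumber = pageNumber;
    }

    /** Appends a child and returns this node so calls can be chained. */
    DemoNode add(DemoNode child) {
        children.add(child);
        return this;
    }

    public String getTitle() { return title; }
    public int getPageNumber() { return pageNumber; }
    public int getChildCnt() { return children.size(); }
    public OutlineNode getChild(int index) { return children.get(index); }
}

final class OutlineWalker {
    /** Collects the titles of every bookmark below root, depth-first,
     *  indenting each title by two spaces per nesting level. */
    static List<String> indentedTitles(OutlineNode root) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < root.getChildCnt(); i++) {
            collect(root.getChild(i), 0, out);
        }
        return out;
    }

    private static void collect(OutlineNode node, int depth, List<String> out) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            line.append("  ");
        }
        line.append(node.getTitle());
        out.add(line.toString());
        for (int i = 0; i < node.getChildCnt(); i++) {
            collect(node.getChild(i), depth + 1, out);
        }
    }

    /** Builds a small sample tree; the root is a placeholder with no title or page. */
    static OutlineNode sampleTree() {
        return new DemoNode(null, -1)
            .add(new DemoNode("Chapter 1", 1)
                .add(new DemoNode("Section 1.1", 2))
                .add(new DemoNode("Section 1.2", 5)))
            .add(new DemoNode("Chapter 2", 9));
    }
}
```

Calling OutlineWalker.indentedTitles(OutlineWalker.sampleTree()) yields four lines, with the two sections indented beneath "Chapter 1". The walk starts at the root's children rather than the root itself, which is a design choice of this sketch.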
Bookmarks have two main attributes: a title (the text that describes the section to which the bookmark refers) and a page number. These attributes are accessible using the Bookmark.getTitle() and Bookmark.getPageNumber() methods, respectively. All leaf nodes in the bookmark tree should have a page number defined, and many branch nodes specify a page number as well. It is common for the root node of the bookmark tree to define neither a page number nor a title. In that case, Bookmark.getTitle() returns null and Bookmark.getPageNumber() returns -1.
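Because a title may be null and a page number may be -1, code that displays bookmarks usually needs to guard against both. The helper below is a hypothetical sketch (BookmarkLabels and describe are illustrative names, not part of PDFTextStream); it takes the two values as they would be returned by getTitle() and getPageNumber() and produces a printable label. The page number is shown exactly as given, with no assumption about whether numbering starts at 0 or 1.

```java
/** Hypothetical helper; not part of the PDFTextStream API. */
final class BookmarkLabels {
    /** Formats a title/page pair, substituting placeholders for the
     *  null title and -1 page number that an untitled node may report. */
    static String describe(String title, int pageNumber) {
        String shownTitle = (title == null) ? "(untitled)" : title;
        String shownPage = (pageNumber < 0) ? "no page" : "page " + pageNumber;
        return shownTitle + " [" + shownPage + "]";
    }
}
```

For instance, describe("Introduction", 3) gives "Introduction [page 3]", while describe(null, -1), the typical root-node case, gives "(untitled) [no page]".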
Precise bookmark positioning
In addition to the page number, some bookmarks provide specific spatial coordinates that define where on the target page a PDF viewer should position its viewing window when a user activates the bookmark. The Bookmark.getTopBound(), Bookmark.getLeftBound(), Bookmark.getRightBound(), and Bookmark.getBottomBound() methods return these coordinates. Many bookmarks specify only some coordinates, in which case a PDF viewer orients its viewing window along the defined coordinates and simply shows all of the remaining portions of the target page.
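For example, a bookmark referring to page 12 might specify a top bound of 400, a left bound of 25, and undefined right and bottom bounds (values of -1). A PDF viewer would therefore align the top-left corner of its viewing window with that point on page 12 and show whatever portion of the page extends to the right of and below it.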
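One way to work with partially specified bounds is to resolve each undefined value to the corresponding page edge. The sketch below does this under stated assumptions: coordinates are taken to be in the default PDF coordinate system (origin at the bottom-left corner of the page, y increasing upward), any negative value is treated as undefined, and the 612 x 792 page size used in the usage note is only an illustration. ViewRegion and BookmarkBounds are hypothetical names, not part of PDFTextStream.

```java
/** Hypothetical rectangle type in page coordinates; not part of PDFTextStream. */
final class ViewRegion {
    final float left;
    final float bottom;
    final float right;
    final float top;

    ViewRegion(float left, float bottom, float right, float top) {
        this.left = left;
        this.bottom = bottom;
        this.right = right;
        this.top = top;
    }

    @Override
    public String toString() {
        return "ViewRegion[left=" + left + ", bottom=" + bottom
            + ", right=" + right + ", top=" + top + "]";
    }
}

final class BookmarkBounds {
    /**
     * Replaces undefined (negative) bounds with the matching page edge.
     * Assumes a bottom-left origin: an undefined left or bottom bound
     * becomes 0, an undefined right bound becomes the page width, and
     * an undefined top bound becomes the page height.
     */
    static ViewRegion resolve(float top, float left, float right, float bottom,
                              float pageWidth, float pageHeight) {
        return new ViewRegion(
            left < 0 ? 0f : left,
            bottom < 0 ? 0f : bottom,
            right < 0 ? pageWidth : right,
            top < 0 ? pageHeight : top);
    }
}
```

With the example values above, BookmarkBounds.resolve(400f, 25f, -1f, -1f, 612f, 792f) produces a region spanning x from 25 to 612 and y from 0 to 400, that is, everything to the right of and below the bookmarked point.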
This level of precision is especially useful when requirements call for extracting text from only particular sections of a document. ‘Selective Text Extraction Based on Bookmark Coordinates’ includes a sample code listing that extracts only a particular section of text from a document based on the precise coordinate bounds specified by its bookmarks.