Accessing PDF bookmarks
Some PDF documents contain bookmarks (referred to collectively as a "document outline" in the PDF spec) that refer to significant document sections. If a document contains bookmarks, they appear in the 'Bookmarks' panel in Adobe Reader, forming an interactive table of contents for the document.
PDFxStream allows you to access the bookmarks contained in PDF documents and all of the attributes associated with those bookmarks. Because bookmarks explicitly identify the starting points of different regions in a document, this data can inform many different data extraction tasks.
Bookmark structure and attributes
PDF bookmarks are organized into a tree structure with a single root node. If a PDF document contains bookmarks, that root node is returned by the com.snowtide.pdf.Document.getBookmarks() method as a com.snowtide.pdf.Bookmark instance. Each bookmark may contain child bookmarks, accessible using the com.snowtide.pdf.Bookmark.getChildCnt() and com.snowtide.pdf.Bookmark.getChild(int) methods; entire branches of the bookmark tree can also be easily retrieved using the com.snowtide.pdf.Bookmark.getAllDescendants() and com.snowtide.pdf.Bookmark.getAllDescendants(List) methods.
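The child-access pattern described above can be sketched as a recursive walk. To keep the example runnable without a PDF file or the PDFxStream jar, it uses a hypothetical Node interface that mirrors only the getChildCnt(), getChild(int), and getTitle() methods named in this section; Node, SimpleNode, and descendantTitles are illustrative names, not part of PDFxStream. The pre-order traversal is a choice made for this sketch and says nothing about the order in which getAllDescendants() returns its results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BookmarkTreeSketch {

    /** Hypothetical stand-in mirroring the Bookmark methods named in this section. */
    public interface Node {
        int getChildCnt();
        Node getChild(int index);
        String getTitle();
    }

    /** Simple in-memory node so the sketch runs without a PDF document. */
    public static final class SimpleNode implements Node {
        private final String title;
        private final List<Node> children;

        public SimpleNode(String title, Node... children) {
            this.title = title;
            this.children = Arrays.asList(children);
        }

        public int getChildCnt() { return children.size(); }
        public Node getChild(int index) { return children.get(index); }
        public String getTitle() { return title; }
    }

    /** Collects the titles of every descendant of start, depth-first, parents before children. */
    public static List<String> descendantTitles(Node start) {
        List<String> out = new ArrayList<>();
        collect(start, out);
        return out;
    }

    private static void collect(Node node, List<String> out) {
        // Visit each direct child by index, then recurse into that child's own children.
        for (int i = 0; i < node.getChildCnt(); i++) {
            Node child = node.getChild(i);
            out.add(child.getTitle());
            collect(child, out);
        }
    }

    public static void main(String[] args) {
        // The root carries no title of its own in this sample tree.
        Node root = new SimpleNode(null,
                new SimpleNode("Introduction"),
                new SimpleNode("Methods",
                        new SimpleNode("Sampling"),
                        new SimpleNode("Analysis")),
                new SimpleNode("Results"));
        System.out.println(descendantTitles(root));
    }
}
```

With a real document, the same loop would presumably start from the value returned by Document.getBookmarks() and call the corresponding methods on Bookmark directly.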
Bookmarks have two main attributes: a title (the text that describes the section to which the bookmark refers) and a page number. These attributes are accessible using the com.snowtide.pdf.Bookmark.getTitle() and com.snowtide.pdf.Bookmark.getPageNumber() methods. All leaf nodes in the bookmark tree should have a page number defined, and many branch nodes may specify a page number as well. It is common for the root node of the bookmark tree to define neither a page number nor a title. In that case, the Bookmark.getTitle() method will return null, and the Bookmark.getPageNumber() method will return -1.
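Because a bookmark may report a null title and a page number of -1, code that displays or logs bookmarks usually needs to guard against both. The following is a minimal sketch of such a guard; describe, NO_PAGE, and the placeholder strings are hypothetical names chosen for illustration, and the method takes the values that getTitle() and getPageNumber() would return rather than a Bookmark instance, so that it runs without the PDFxStream jar.

```java
public class BookmarkLabelSketch {

    /** Page number reported when a bookmark defines none, as described above. */
    public static final int NO_PAGE = -1;

    /**
     * Builds a display label from the values a bookmark's getTitle() and
     * getPageNumber() would return, substituting placeholders for missing data.
     */
    public static String describe(String title, int pageNumber) {
        String shownTitle = (title == null) ? "(untitled)" : title;
        String shownPage = (pageNumber == NO_PAGE) ? "no page" : "page " + pageNumber;
        return shownTitle + " [" + shownPage + "]";
    }

    public static void main(String[] args) {
        // A typical root node: neither title nor page number.
        System.out.println(describe(null, NO_PAGE));
        // A bookmark with both attributes defined.
        System.out.println(describe("Chapter 1", 12));
    }
}
```

The page number is printed exactly as reported; the sketch makes no assumption about how page numbers are indexed.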
Precise bookmark positioning
In addition to the page number, some bookmarks will provide specific spatial coordinates, defining where on the target page a PDF viewer should position its viewing window when a user activates a bookmark. The com.snowtide.pdf.Bookmark.getTopBound(), com.snowtide.pdf.Bookmark.getLeftBound(), com.snowtide.pdf.Bookmark.getRightBound(), and com.snowtide.pdf.Bookmark.getBottomBound() methods return those coordinates. Many bookmarks will specify only some coordinates, in which case a PDF viewer would orient its viewing window along the defined coordinates, and simply show all of the remaining portions of the target page.
For example, a bookmark referring to page 12 might specify a top bound of 400, a left bound of 25, and undefined right and bottom bounds (values of -1). A PDF viewer would therefore align its viewing window with the top and left bounds on page 12, and show the remainder of the page to the right of and below that position.
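The distinction between defined and undefined bounds can be made explicit in code. The sketch below takes the four values that getTopBound(), getLeftBound(), getRightBound(), and getBottomBound() would return and reports only the ones that are defined, using the example values from the paragraph above. The names definedBounds and UNDEFINED are hypothetical, and the use of float for coordinates is an assumption made for this example rather than something stated in this section.

```java
import java.util.ArrayList;
import java.util.List;

public class BookmarkBoundsSketch {

    /** Value reported for a bound the bookmark does not define, as described above. */
    public static final float UNDEFINED = -1f;

    /**
     * Lists the bounds that are actually defined, in top, left, right, bottom order.
     * The float parameter type is an assumption made for this sketch.
     */
    public static List<String> definedBounds(float top, float left, float right, float bottom) {
        List<String> out = new ArrayList<>();
        if (top != UNDEFINED) out.add("top=" + top);
        if (left != UNDEFINED) out.add("left=" + left);
        if (right != UNDEFINED) out.add("right=" + right);
        if (bottom != UNDEFINED) out.add("bottom=" + bottom);
        return out;
    }

    public static void main(String[] args) {
        // Example from the text: top 400, left 25, right and bottom undefined.
        System.out.println(definedBounds(400f, 25f, UNDEFINED, UNDEFINED));
    }
}
```

Extraction code could use such a list to decide which edges of a region to constrain, leaving the undefined edges open to the page boundary.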
Having this level of precision available can be very useful, especially when requirements specify the extraction of text from only particular sections of a document. This tutorial demonstrates how to extract only a particular section of text from a document based on the precise coordinate bounds specified by a PDF document's bookmarks.