
Selective Text Extraction Guided by Bookmark Coordinates

This code sample uses PDFxStream's bookmark capabilities to selectively extract text from PDF documents using specific spatial coordinates provided by the documents' bookmarks.

Scenario: consider a collection of thousands of PDF documents, all following a particular format. Suppose you only want to extract the text of a particular section -- perhaps the summary, which would make a good set of inputs for indexing. The problem is that the summary does not start on the same page in every document, and its length varies from document to document.

However, all of the documents have bookmarks, and through experimentation on a few of them, you have found that their bookmarks specify accurate top bound coordinates (the vertical coordinate where the bookmark is positioned, indicating the start of the corresponding section). For example, in one of the documents, the bookmark for the summary section refers to the third page, with a top bound (accessible using the com.snowtide.pdf.Bookmark.getTopBound() method) of 560.

That neatly gives us the location of the start of the summary section, but doesn't tell us where the section ends. For that, we simply look at the bookmark that follows the summary bookmark; in our example document, the next bookmark refers to page 6, with a top bound of 220.

So, now we know that the summary section for this particular PDF document runs from page 3, y-coordinate 560, to page 6, y-coordinate 220. Extracting only the text from those pages and between those coordinates is pretty easy:

  • Open the document using e.g. com.snowtide.PDF.open(File).
  • Extract all of the bookmarks in the document.
  • Order those bookmarks according to the positions they refer to in the document.
  • Find the bookmark corresponding to the start of the section of interest. (The sample code uses the bookmark title to determine this. The method that your application uses might be different, such as the position of a bookmark in the bookmark tree hierarchy if a particular document type is very consistent in its structure from edition to edition.)
  • Find the next bookmark, which will indicate the end of the section of interest.
  • Access each of the pages between the page numbers indicated by the bookmarks being used (inclusive).
  • For the first and last pages in the range indicated by the bookmarks, crop their contents based on the top bounds of the start and end bookmarks, respectively. This eliminates the text before the start and after the end of the section of interest.
  • Pipe the contents of each page in the bookmarked range (after cropping as appropriate) to an output target, yielding the text of the section.
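
The bookmark-selection steps above (ordering, finding the start bookmark, and taking the next one as the end) can be tried out without a PDF at hand. The following is a minimal sketch, not PDFxStream code: Mark and SectionRangeSketch are hypothetical stand-ins, and the sort assumes that a larger top bound sits higher on the page, which matches the ordering used by the sample's comparator. The page numbers and coordinates are the ones from the example document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for a bookmark: a title, a target page, and a top bound. */
final class Mark {
    final String title;
    final int page;
    final float top;

    Mark(String title, int page, float top) {
        this.title = title;
        this.page = page;
        this.top = top;
    }
}

final class SectionRangeSketch {
    /**
     * Returns {start, end} for the section with the given title, where end is
     * null if no bookmark follows it. Returns null if the title is not found.
     */
    static Mark[] findRange(List<Mark> marks, String title) {
        // Sort a copy so the caller's list is left untouched.
        List<Mark> sorted = new ArrayList<Mark>(marks);
        Collections.sort(sorted, new Comparator<Mark>() {
            public int compare(Mark a, Mark b) {
                if (a.page != b.page) {
                    return Integer.compare(a.page, b.page);
                }
                // Assumption: y grows upward, so a larger top bound comes first on a page.
                return Float.compare(b.top, a.top);
            }
        });
        for (int i = 0; i < sorted.size(); i++) {
            if (sorted.get(i).title.equals(title)) {
                Mark end = (i + 1 < sorted.size()) ? sorted.get(i + 1) : null;
                return new Mark[] { sorted.get(i), end };
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Deliberately out of document order, as entries in a bookmark tree may be.
        List<Mark> marks = Arrays.asList(
            new Mark("Details", 6, 220f),
            new Mark("Introduction", 1, 700f),
            new Mark("Summary", 3, 560f));
        Mark[] range = findRange(marks, "Summary");
        System.out.println("start: page " + range[0].page + ", y " + range[0].top);
        System.out.println("end:   page " + range[1].page + ", y " + range[1].top);
    }
}
```

Returning a null end bookmark, rather than failing, leaves the caller free to fall back to the last page of the document when the section of interest is the final one.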

The code sample below implements this approach, with some minor embellishments to handle boundary cases, such as when no bookmark follows the section of interest.

import com.snowtide.pdf.Bookmark;
import com.snowtide.pdf.OutputTarget;
import com.snowtide.pdf.PDFTextStream;
import com.snowtide.pdf.Page;
import com.snowtide.pdf.layout.Block;
import com.snowtide.pdf.layout.BlockParent;
import java.io.*;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class ExtractBookmarkedSection {
    /**
     * Extracts from the given PDF file only the text from the section that is
     * delimited by a PDF Bookmark with the given section title. Returns null
     * if no bookmark has that title.
     */
    public static String extractSectionText(File pdffile, String sectionTitle)
            throws IOException {
        PDFTextStream stream = new PDFTextStream(pdffile);
        try {
            Bookmark root = stream.getBookmarks();
            List allbookmarks = root.getAllDescendants();
            // put the bookmarks in the order of the locations they refer to
            Collections.sort(allbookmarks, new DocumentOrderBookmarkComparator());
            Bookmark bm;
            int startpage, endpage;
            float starttop, endtop;
            // -1 marks a page number or bound that has not been found
            starttop = endtop = startpage = endpage = -1;
            for (int i = 0, len = allbookmarks.size(); i < len; i++) {
                bm = (Bookmark)allbookmarks.get(i);
                if (bm.getTitle().equals(sectionTitle)) {
                    startpage = bm.getPageNumber();
                    starttop = bm.getTopBound();
                    // the next bookmark, if any, marks the end of the section
                    if (i + 1 < len) {
                        bm = (Bookmark)allbookmarks.get(i + 1);
                        endpage = bm.getPageNumber();
                        endtop = bm.getTopBound();
                    }
                    break;
                }
            }
            // couldn't find section start from title
            if (startpage == -1)
                return null;
            // handle when we're extracting the last bookmarked section
            if (endpage == -1)
                endpage = stream.getPageCnt() - 1;
            Page page;
            StringBuffer sb = new StringBuffer(1024);
            OutputTarget tgt = OutputTarget.forBuffer(sb);
            for (int i = startpage; i <= endpage; i++) {
                page = stream.getPage(i);
                if (i == startpage && starttop != -1) {
                    // remove all blocks above bookmark,
                    // if bookmark bound is defined
                    removeBlocksAbove(page.getTextContent(), starttop);
                }
                // a separate check (not an else), so that a section that
                // starts and ends on the same page is cropped at both ends
                if (i == endpage && endtop != -1) {
                    // remove all blocks below end bookmark,
                    // if bookmark bound is defined
                    removeBlocksBelow(page.getTextContent(), endtop);
                }
                page.pipe(tgt);
            }
            return sb.toString();
        } finally {
            // release the document on every path, including the early return
            stream.close();
        }
    }

    /**
     * Removes all of the child blocks within the given BlockParent instance
     * that are positioned above the given y-coordinate position.
     */
    private static void removeBlocksAbove(BlockParent blocks, float pos) {
        Block b;
        // iterate backwards so that removals don't shift unvisited indexes
        for (int i = blocks.getChildCnt() - 1; i > -1; i--) {
            b = blocks.getChild(i);
            if (b.ypos() >= pos) {
                blocks.removeChild(i);
            } else {
                removeBlocksAbove(b, pos);
            }
        }
    }

    /**
     * Removes all of the child blocks within the given BlockParent instance
     * that are positioned below the given y-coordinate position.
     */
    private static void removeBlocksBelow(BlockParent blocks, float pos) {
        Block b;
        // iterate backwards so that removals don't shift unvisited indexes
        for (int i = blocks.getChildCnt() - 1; i > -1; i--) {
            b = blocks.getChild(i);
            if (b.endypos() <= pos) {
                blocks.removeChild(i);
            } else {
                // recurse with the same "below" test on nested blocks
                removeBlocksBelow(b, pos);
            }
        }
    }

    /**
     * Orders the Bookmarks within a List according to where they refer within
     * the document (technically, bookmarks can refer to any page, any location,
     * and not necessarily be in a typical reading order within the tree).
     */
    private static class DocumentOrderBookmarkComparator implements Comparator {
        public int compare(Object o1, Object o2) {
            Bookmark b1 = (Bookmark)o1;
            Bookmark b2 = (Bookmark)o2;
            if (b1.getPageNumber() < b2.getPageNumber()) {
                return -1;
            } else if (b1.getPageNumber() > b2.getPageNumber()) {
                return 1;
            } else {
                // same page: the bookmark with the larger top bound sorts first
                if (b1.getTopBound() > b2.getTopBound()) {
                    return -1;
                } else if (b1.getTopBound() == b2.getTopBound()) {
                    return 0;
                } else {
                    return 1;
                }
            }
        }
    }
}
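
The cropping rules that the sample applies to the first and last pages can also be checked in isolation. The sketch below is an illustration only and does not use the PDFxStream API: TextBlock and CropSketch are hypothetical names, each block is reduced to a page number plus its bottom and top edges, and y is assumed to grow upward as in the comparator above. It covers two boundary cases: a section that starts and ends on the same page, and a section with no end bookmark (signalled here by a negative end page, a convention invented for this sketch).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a block of text: its page, bottom edge, top edge, and text. */
final class TextBlock {
    final int page;
    final float bottom;
    final float top;
    final String text;

    TextBlock(int page, float bottom, float top, String text) {
        this.page = page;
        this.bottom = bottom;
        this.top = top;
        this.text = text;
    }
}

final class CropSketch {
    /**
     * Returns the text of the blocks that lie within the section. A negative
     * endPage means "no end bookmark": keep everything to the end of the input.
     */
    static List<String> crop(List<TextBlock> blocks, int startPage, float startTop,
                             int endPage, float endTop) {
        List<String> kept = new ArrayList<String>();
        for (TextBlock b : blocks) {
            // Skip pages outside the bookmarked range.
            if (b.page < startPage) continue;
            if (endPage >= 0 && b.page > endPage) continue;
            // First page: drop blocks whose bottom edge is at or above the start bound.
            if (b.page == startPage && b.bottom >= startTop) continue;
            // Last page: drop blocks whose top edge is at or below the end bound.
            // This is checked independently, so a one-page section is cropped twice.
            if (endPage >= 0 && b.page == endPage && b.top <= endTop) continue;
            kept.add(b.text);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<TextBlock> blocks = Arrays.asList(
            new TextBlock(3, 600f, 650f, "end of previous section"),
            new TextBlock(3, 540f, 560f, "Summary heading"),
            new TextBlock(4, 100f, 700f, "summary body"),
            new TextBlock(6, 230f, 300f, "end of summary"),
            new TextBlock(6, 200f, 220f, "next section heading"));
        // The example range from the text: page 3 at y 560 through page 6 at y 220.
        System.out.println(crop(blocks, 3, 560f, 6, 220f));
    }
}
```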