Everything in one box to make accessing content and data from PDFs easy.
Most developers don't need to read this page. PDFxStream supports nearly everything in the PDF specification (and hundreds of commonly-found constructs that aren't in the standard spec!), and is built so that your use of it requires zero knowledge of those details.
(If you care about a particular PDF data type or characteristic that is not listed below, then please feel free to contact us to confirm that PDFxStream supports what you need.)
That said, for those who do care about PDF internals and want to know how well PDFxStream supports them, read on and enjoy!
Making PDF data access simple and easy
Working with PDF documents is often a frustrating and difficult exercise. Few developers are familiar with the PDF file specification (and all of its incorporated sub-specifications), and even those who are usually don't want to think about those details when completing what should be an easy task like "store each uploaded PDF document's text in the database".
For this reason, one of PDFxStream's primary features is that it allows you to complete those sorts of tasks using an API that doesn't require knowledge of the particulars of the PDF file format. See for yourself just how little is necessary to access all of the essential pools of content and data within PDF documents using PDFxStream:
// Extract the text of every page in a PDF document.
import com.snowtide.PDF;
import com.snowtide.pdf.Document;
import com.snowtide.pdf.OutputTarget;

public class ExtractTextAllPages {
    public static void main (String[] args) throws java.io.IOException {
        String pdfFilePath = args[0];
        Document pdf = PDF.open(pdfFilePath);
        StringBuilder text = new StringBuilder(1024);
        // pipe the whole document's text into the StringBuilder
        pdf.pipe(new OutputTarget(text));
        pdf.close();
        System.out.println(text);
    }
}
// Extract every image on every page and save each one to a directory.
import com.snowtide.PDF;
import com.snowtide.pdf.Document;
import com.snowtide.pdf.Page;
import com.snowtide.pdf.layout.Image;
import java.io.File;
import java.io.FileOutputStream;

public class ExtractImages {
    public static void main (String[] args) throws java.io.IOException {
        String pdfFilePath = args[0];
        File outputDir = new File(args[1]);
        if (!outputDir.exists()) outputDir.mkdirs();
        Document pdf = PDF.open(pdfFilePath);
        for (Page p : pdf.getPages()) {
            int i = 0;
            for (Image img : p.getImages()) {
                // name each file <page number>-<image index>.<format>
                File imageFile = new File(outputDir, String.format("%s-%s.%s",
                        p.getPageNumber(), i, img.dataFormat().name().toLowerCase()));
                // try-with-resources closes the stream even if the write fails
                try (FileOutputStream out = new FileOutputStream(imageFile)) {
                    out.write(img.data());
                }
                i++;
            }
            System.out.printf("Found %s images on page %s", p.getImages().size(), p.getPageNumber());
            System.out.println();
        }
        // release the document, as the other examples do
        pdf.close();
    }
}
// Extract the text of a single page.
import java.io.IOException;
import com.snowtide.PDF;
import com.snowtide.pdf.Document;
import com.snowtide.pdf.OutputTarget;

public class ExtractTextOnePage {
    public static void main (String[] args) throws IOException {
        String pdfFilePath = args[0];
        Document pdfts = PDF.open(pdfFilePath);
        StringBuilder text = new StringBuilder(1024);
        // pipe only the page at index 0 into the StringBuilder
        pdfts.getPage(0).pipe(new OutputTarget(text));
        pdfts.close();
        System.out.println(text);
    }
}
// Print all of a document's metadata as key/value pairs.
import com.snowtide.PDF;
import com.snowtide.pdf.Document;

public class ExtractMetadata {
    public static void main (String[] args) throws java.io.IOException {
        String pdfFilePath = args[0];
        System.out.println("All document metadata from " + pdfFilePath + ":");
        Document doc = PDF.open(pdfFilePath);
        for (String key : doc.getAttributeKeys()) {
            System.out.printf("%s: %s", key, doc.getAttribute(key));
            System.out.println();
        }
        doc.close();
    }
}
// Read interactive (AcroForm) form data, first by field name and then field by field.
import com.snowtide.PDF;
import com.snowtide.pdf.Document;
import com.snowtide.pdf.forms.*;

public class ExtractFormData {
    public static void main (String[] args) throws java.io.IOException {
        String pdfFilePath = args[0];
        Document pdfts = PDF.open(pdfFilePath);
        AcroForm form = (AcroForm)pdfts.getFormData();
        // guard in case the document carries no form data
        if (form == null) {
            System.out.println(pdfFilePath + " does not contain any form data.");
            pdfts.close();
            return;
        }

        // access specific fields directly
        // (these field names belong to the sample form this example was written for)
        AcroTextField projectName = (AcroTextField)form.getField("color.1");
        AcroCheckboxField isPrivateNonProfit =
            (AcroCheckboxField)form.getField("color.10-privatenonprofit");
        System.out.printf("Project %s %s run by a nonprofit organization",
            projectName.getValue(),
            isPrivateNonProfit.isChecked() ? "is" : "is not");
        System.out.println();

        // access all fields (just a sampling of available data/functionality)
        System.out.println(String.format("All form data from %s:", pdfFilePath));
        for (AcroFormField field : form) {
            Object ftype = field.getType();
            if (ftype == AcroFormField.FIELD_TYPE_TEXT) {
                System.out.printf("Field %s is a text box; value: %s",
                    field.getFullName(), field.getValue());
            } else if (ftype == AcroFormField.FIELD_TYPE_BUTTON) {
                switch (((AcroButtonField)field).getButtonType()) {
                    case AcroButtonField.BUTTON_TYPE_PUSHBUTTON:
                        System.out.printf("Field %s is a pushbutton; value: %s",
                            field.getFullName(), field.getValue());
                        break;
                    case AcroButtonField.BUTTON_TYPE_CHECKBOX:
                        System.out.printf("Field %s is a checkbox; value: %s; is checked? %s",
                            field.getFullName(), field.getValue(),
                            ((AcroCheckboxField)field).isChecked());
                        break;
                    case AcroButtonField.BUTTON_TYPE_RADIO_GROUP:
                        System.out.printf("Field %s is a radio button group; value: %s; possible values: %s",
                            field.getFullName(), field.getValue(),
                            ((AcroRadioButtonGroupField)field).getPossibleValues());
                        break;
                }
            } else if (ftype == AcroFormField.FIELD_TYPE_CHOICE) {
                System.out.printf("Field %s is 'select' dropdown; value: %s; display label: %s",
                    field.getFullName(), field.getValue(),
                    ((AcroChoiceField)field).getDisplayValue((String)field.getValue()));
            } else if (ftype == AcroFormField.FIELD_TYPE_SIGNATURE) {
                System.out.printf("Field %s is a signature; value: %s",
                    field.getFullName(), field.getValue());
            } else {
                System.out.printf("Field %s is of unknown type; value: %s",
                    field.getFullName(), field.getValue());
            }
            System.out.println();
        }
        pdfts.close();
    }
}
// List every bookmark (outline entry) with its target page and bounds.
import com.snowtide.PDF;
import com.snowtide.pdf.Bookmark;
import com.snowtide.pdf.Document;

public class AccessBookmarks {
    public static void main (String[] args) throws java.io.IOException {
        String pdfFilePath = args[0];
        Document doc = PDF.open(pdfFilePath);
        Bookmark root = doc.getBookmarks();
        if (root == null) {
            System.out.println(pdfFilePath + " does not contain any bookmarks.");
        } else {
            // walk the whole bookmark tree below the root
            for (Bookmark b : root.getAllDescendants()) {
                System.out.printf("Bookmark '%s' points at page %s, bounds %s, %s, %s, %s",
                    b.getTitle(), b.getPageNumber(),
                    b.getLeftBound(), b.getBottomBound(),
                    b.getRightBound(), b.getTopBound());
                System.out.println();
            }
        }
        doc.close();
    }
}
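// Write a document's Adobe XMP metadata out to an XML file.
import java.io.*;
import com.snowtide.PDF;
import com.snowtide.pdf.Document;

public class ExtractXMPMetadata {
    public static void main (String[] args) throws IOException {
        String pdfFilePath = args[0];
        Document doc = PDF.open(pdfFilePath);
        byte[] xmp = doc.getXmlMetadata();
        // guard in case the document carries no XMP metadata
        if (xmp == null) {
            System.out.println(pdfFilePath + " does not contain XMP metadata.");
        } else {
            String outPath = pdfFilePath + ".xmp.xml";
            // try-with-resources closes the stream even if the write fails
            try (FileOutputStream s = new FileOutputStream(outPath)) {
                s.write(xmp);
            }
            System.out.println("Wrote Adobe XMP metadata to " + outPath);
        }
        doc.close();
    }
}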
// Open a password-protected PDF document and extract its text.
import com.snowtide.PDF;
import com.snowtide.pdf.Document;
import com.snowtide.pdf.OutputTarget;

public class DecryptWithPassword {
    public static void main (String[] args) throws java.io.IOException {
        String pdfFilePath = args[0];
        // the second argument is the password, passed to PDF.open as bytes
        Document pdfts = PDF.open(pdfFilePath, args[1].getBytes());
        StringBuilder text = new StringBuilder(1024);
        pdfts.pipe(new OutputTarget(text));
        pdfts.close();
        System.out.println(text);
    }
}
Give PDFxStream a try for your project's PDF data access needs!
Get Started
Baseline PDF format compatibility and basic data extraction capabilities
The official PDF file format specification is large and complex. PDF files can be rich, dynamic documents, and getting to all of the interesting and useful parts of them (their content, text, metadata, and so on) is a daunting task.
Further, Adobe's specification only provides normative descriptions of how PDF documents should be constructed. Experience shows that applications must often process PDF documents from flawed sources whose output bends, and often breaks, the "official" PDF specification, much as web browsers are forced to support broken and malformed HTML documents as best they can.
This is just one of the many reasons why supporting and maintaining PDFxStream is a never-ending task. Anything less would prevent us from guaranteeing maximum compatibility with the PDF document formats and variants found "in the field", regardless of their source or how badly they violate the rules of good PDF file format etiquette.
PDF Format Support Details
The range of PDF file format features (and quirks!) that PDFxStream supports is broad and deep. Below is a partial list of the major facets of the PDF specification that PDFxStream supports.
- Compatibility with all versions of the PDF document specification, from v1.0 (corresponding to Acrobat 1) to v1.7 (corresponding to Acrobat 8 and higher).
- Support for decryption of PDF documents encrypted with or without a password using 40-bit, 128-bit, 256-bit, and variable bitlength ciphers (including RC4 and AES)
- Automatic "repair" of PDF documents to account for common malformations and irregularities
- Extraction of PDF annotations (links, text notes, etc)
- Extraction of embedded files and attachments
- Extraction of PDF bookmarks (a.k.a. outline, table of contents)
- Extraction of document metadata, as either key/value pairs or XML
- Extraction of raw character data
- Extraction of image metadata, including image dimensions, locations, and types
- PDF file merging
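Version compatibility and tolerance of malformed files both come up as soon as a file's header is read. The sketch below is not PDFxStream code, and `PdfHeaderSniffer` is a hypothetical name; it uses only the JDK. It assumes the commonly documented leniency that Acrobat accepts a `%PDF-M.m` header anywhere within the first 1024 bytes of a file, so leading junk (for example, bytes prepended by a mail gateway) does not prevent the declared version from being found.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;

/** Illustrative only: finds a PDF header leniently and reads its declared version. */
final class PdfHeaderSniffer {
    private static final byte[] MARKER = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    // Assumption: only the first 1024 bytes are searched, mirroring Acrobat's leniency.
    private static final int SEARCH_LIMIT = 1024;

    /** Returns the declared version such as "1.7", or empty if no header is found. */
    static Optional<String> declaredVersion(byte[] fileStart) {
        int limit = Math.min(fileStart.length, SEARCH_LIMIT);
        // Leave room for the marker plus the three version characters "M.m".
        for (int i = 0; i + MARKER.length + 3 <= limit; i++) {
            if (!matchesAt(fileStart, i)) continue;
            int v = i + MARKER.length;
            if (isDigit(fileStart[v]) && fileStart[v + 1] == '.' && isDigit(fileStart[v + 2])) {
                return Optional.of(new String(fileStart, v, 3, StandardCharsets.US_ASCII));
            }
        }
        return Optional.empty();
    }

    private static boolean matchesAt(byte[] data, int offset) {
        for (int j = 0; j < MARKER.length; j++) {
            if (data[offset + j] != MARKER[j]) return false;
        }
        return true;
    }

    private static boolean isDigit(byte b) {
        return b >= '0' && b <= '9';
    }

    public static void main(String[] args) {
        byte[] clean = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        byte[] withJunk = "\r\n<!-- gateway banner -->%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(declaredVersion(clean));
        System.out.println(declaredVersion(withJunk));
    }
}
```

A strict reader would reject the second input because the header is not at byte 0; a tolerant one recovers the version and carries on. Real-world repair goes much further than this, which is why you shouldn't have to write it yourself.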
Text extraction features
- Unicode text extraction, including support for Chinese, Japanese, and Korean (CJK) in both horizontal and vertical writing modes
- OutputHandler API for efficiently customizing PDF text extract formatting
- Regional text extraction, ideal for extracting data from fixed-format forms
- Complete support for embedded and standard fonts and character encodings:
  - Type 0, 1, and 1C
  - TrueType
  - Identity-H and Identity-V encodings
  - CMap encodings (including hundreds of Chinese, Japanese, and Korean character sets, both horizontal and vertical writing modes)
- Automated layout processing, providing a traversable PDF document model including inferred block, line, column, and table structure
- Support for extracting text from "searchable image" PDFs
- Support for all varieties of rotated text
- Comprehensive support for extracting PDF tables, including via CSV for export to Excel
- Support for indexing PDF documents with Apache Lucene via lucene-pdf
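To make the CSV item above concrete, here is a minimal sketch of what "via CSV for export to Excel" involves once table cells are available as strings. It is not PDFxStream's implementation: `TableCsv` and its methods are hypothetical names, the rows are assumed to arrive as a `List<List<String>>`, and the quoting follows the usual RFC 4180 conventions that Excel reads.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Illustrative only: turns rows of cell text into CSV that spreadsheet tools can open. */
final class TableCsv {

    /** Quotes a cell only when it contains a comma, a double quote, or a line break. */
    static String escapeCell(String cell) {
        if (cell == null) return "";
        boolean needsQuotes = cell.contains(",") || cell.contains("\"")
                || cell.contains("\n") || cell.contains("\r");
        // Embedded double quotes are escaped by doubling them.
        return needsQuotes ? "\"" + cell.replace("\"", "\"\"") + "\"" : cell;
    }

    /** Joins cells with commas and ends every row with CRLF. */
    static String toCsv(List<List<String>> rows) {
        if (rows.isEmpty()) return "";
        return rows.stream()
                .map(row -> row.stream().map(TableCsv::escapeCell).collect(Collectors.joining(",")))
                .collect(Collectors.joining("\r\n", "", "\r\n"));
    }

    public static void main(String[] args) {
        List<List<String>> rows = List.of(
                List.of("Item", "Price"),
                List.of("Widget, large", "1,200"));
        System.out.print(toCsv(rows));
    }
}
```

Cells such as `Widget, large` are the reason for the quoting step: without it, a comma inside extracted text would silently shift every following column.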
Image extraction features
- Decompression and decoding of dozens of PDF image types
- Rendering of images to on-screen graphics contexts (java.awt.image.BufferedImage on Java, or System.Drawing.Bitmap on .NET) and saving to disk in familiar formats:
  - JPEG
  - TIFF
  - GIF
  - PNG
  - BMP
- Automatic stitching of image tiles and strips
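Once an image is available as a `java.awt.image.BufferedImage`, saving it in one of the formats listed above takes only the JDK's `javax.imageio.ImageIO`. The sketch below assumes you already hold such an image; how it is obtained from PDFxStream is not shown here, and `SaveRenderedImage` is a hypothetical helper, not part of the library. Which formats can be written depends on the ImageIO writers installed in your JDK.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import javax.imageio.ImageIO;

/** Illustrative only: writes a BufferedImage to disk in a named format. */
final class SaveRenderedImage {

    /** Saves the image as dir/baseName.format and returns the file written. */
    static File save(BufferedImage image, File dir, String baseName, String format)
            throws IOException {
        File out = new File(dir, baseName + "." + format);
        // ImageIO.write returns false when no suitable writer is available.
        if (!ImageIO.write(image, format, out)) {
            throw new IOException("No ImageIO writer available for format: " + format);
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // A blank placeholder stands in for an image rendered from a PDF page.
        BufferedImage placeholder = new BufferedImage(4, 3, BufferedImage.TYPE_INT_RGB);
        File dir = Files.createTempDirectory("pdf-images").toFile();
        File png = save(placeholder, dir, "page1-0", "png");
        System.out.println("Wrote " + png);
    }
}
```

Checking the boolean result matters: `ImageIO.write` does not throw when a format is unsupported, so an unchecked call can appear to succeed without producing a usable file.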
Form data extraction features
- Support for extracting "Acroform" (interactive) form data from all types of fields:
  - Text
  - Dropdowns ("Choice" fields)
  - Radio buttons
  - Checkboxes
  - Pushbuttons
  - Signatures
- Support for extracting XFA form data
- Support for filling "Acroform" fields, writing updated PDF documents
Learn more
To get the most out of PDFxStream's capabilities and PDF file format support, please check out the developer's guide and API reference.