Posted by Mark Stephens on Wed, Aug 11, 2010 @ 08:31 AM
One of the really useful features of the PDF file format is the ability to have interactive elements. These began as simple widgets such as checkboxes, buttons, comboboxes and text fields, and the list has expanded to include the ability to embed Sounds, Movies and even other files or URLs. This makes the PDF file format a very interactive medium.
Here is one of my favorite examples

All of these interactive features can be defined in 2 ways. Firstly, they can exist as PDF objects defined within the PDF file, inheriting values from their parent objects. This is the original AcroForm version (FDF, strictly speaking, is the companion format for exchanging form data). It uses the standard PDF Cos format and would look something like this:
26 0 obj
<<
/F 4
/I[1]
/Type/Annot
/Rect[196 594 314 613]
/BS<</W 1/S/U>>
/FT/Ch
/Subtype/Widget
/P 24 0 R
/T(Item)
/V(Soft Taco)
/AP<</N 142 0 R>>
/Ff 393216
/MK<</BC[0 0 0]>>
/Opt[(Burrito)(Soft Taco)(Mexico City)(Quesadilla)(Taquitaco)]
/DA(/TiRo 0 Tf 0 0 1 rg)
>>endobj
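The /Ff entry in the object above is a bit field, so it is worth decoding rather than reading as a plain number. Here is a minimal sketch in Python, assuming the flag bit positions given in the PDF Reference for choice fields; the function and table names are my own and not part of any library.

```python
# Bit positions (1-based) for field flags that apply to choice fields.
# The first three are common to all field types.
CHOICE_FIELD_FLAGS = {
    1: "ReadOnly",
    2: "Required",
    3: "NoExport",
    18: "Combo",
    19: "Edit",
    20: "Sort",
    22: "MultiSelect",
    23: "DoNotSpellCheck",
    27: "CommitOnSelChange",
}


def decode_field_flags(ff):
    """Return the names of the flags set in a /Ff value, lowest bit first."""
    return [
        name
        for bit, name in sorted(CHOICE_FIELD_FLAGS.items())
        if ff & (1 << (bit - 1))
    ]


print(decode_field_flags(393216))  # the /Ff value from the example object
```

For the example object this reports Combo and Edit, which fits a /FT/Ch field: an editable combobox rather than a list box.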
Or they can appear in one of several XML structures inside the file. Here is an example - the actual XML is buried inside streams in the referenced objects.
<<
/XFA[(preamble)40 0 R
(config)41 0 R
(template)42 0 R
(datasets)43 0 R
(localeSet)44 0 R
(postamble)45 0 R]
>>
You can also define a form as both a Cos object and data in the XFA - the spec is nothing if not flexible!
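The /XFA array alternates a packet name with a reference to the object holding that packet's stream. As a rough sketch (plain Python with a regular expression, not a real PDF parser, and with names of my own choosing), the pairs can be pulled out like this:

```python
import re

# Matches "(name)NN G R": a packet name followed by an indirect reference.
XFA_ENTRY = re.compile(r"\((\w+)\)\s*(\d+)\s+(\d+)\s+R")

RAW_XFA = """/XFA[(preamble)40 0 R
(config)41 0 R
(template)42 0 R
(datasets)43 0 R
(localeSet)44 0 R
(postamble)45 0 R]"""


def parse_xfa_array(raw):
    """Map each XFA packet name to its (object number, generation) pair."""
    return {
        name: (int(obj), int(gen))
        for name, obj, gen in XFA_ENTRY.findall(raw)
    }


for name, ref in parse_xfa_array(RAW_XFA).items():
    print(name, ref)
```

Reading the actual XML still means fetching and decoding the stream in each referenced object, which this sketch does not attempt.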
Forms can also be linked to events and to JavaScript code inside the PDF and can have tooltips, change their visibility and interact with other components. They can also reference widgets on other pages. So there is not much you cannot achieve with them...
The thing that I found most confusing when starting was that interactive elements can actually be referenced in 2 separate ways. A PDF document can have a single AcroForm or XFA object (which lists all the widgets in the document), but each page can also have an Annots object which lists the widgets on that page. So you potentially need to look at both lists and then work out which are used on any page.
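To make the two-list problem concrete, here is a small sketch of the cross-check. It assumes both lists have already been read and flattened into (object number, generation) references, which is a simplification because real field trees can nest widgets under parent fields. Apart from 26 0, the object from the earlier example, the reference numbers are made up.

```python
def split_page_annots(acroform_refs, page_annot_refs):
    """Split a page's Annots into form widgets and everything else.

    Both arguments are iterables of (object_number, generation) tuples.
    Returns (widgets_on_page, other_annots), keeping the page order.
    """
    form_refs = set(acroform_refs)
    widgets = [ref for ref in page_annot_refs if ref in form_refs]
    others = [ref for ref in page_annot_refs if ref not in form_refs]
    return widgets, others


# Hypothetical document: three form fields, two of them on this page,
# plus one annotation on the page that is not a form field.
acroform_refs = [(26, 0), (27, 0), (31, 0)]
page_annot_refs = [(26, 0), (50, 0), (27, 0)]

widgets, others = split_page_annots(acroform_refs, page_annot_refs)
print(widgets)
print(others)
```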
So if you think PDF files are about just static WYSIWYG documents, you have been missing a whole dimension. Give them a try.
Posted by Mark Stephens on Thu, Sep 03, 2009 @ 09:56 AM
Because PDF is very much an output and display format, it does not contain much formatting information such as paragraph breaks and spaces unless these tags are explicitly added (Adobe calls it MarkedContent). In this case, it is possible to extract an almost perfect copy of the text data in a PDF. Otherwise, the software needs to guess such details. This is why it is very hard to extract complex irregular or multi-column text from a PDF file - the correct definition of what a column is varies with every file. Even spaces and returns have to be guessed.
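To give a feel for what "guessing" a space involves, here is a deliberately simple sketch. It assumes each glyph on a line comes with its start and end x coordinate, and it inserts a space whenever the gap to the previous glyph exceeds a fraction of the font size. The threshold and the data layout are assumptions for illustration only; this is not how JPedal or any particular library does it.

```python
def join_glyphs(glyphs, font_size, gap_ratio=0.25):
    """Rebuild one line of text from positioned glyphs.

    glyphs: list of (char, x_start, x_end) tuples, in any order.
    A space is guessed wherever the horizontal gap between two
    neighbouring glyphs is wider than gap_ratio * font_size.
    """
    pieces = []
    previous_end = None
    for char, x_start, x_end in sorted(glyphs, key=lambda g: g[1]):
        if previous_end is not None and x_start - previous_end > gap_ratio * font_size:
            pieces.append(" ")
        pieces.append(char)
        previous_end = x_end
    return "".join(pieces)


line = [("H", 0, 7), ("i", 7, 9), ("y", 14, 19), ("o", 19, 24)]
print(join_glyphs(line, font_size=10))
```

Any fixed threshold like this breaks down on justified text, tight kerning or tables, which is exactly why irregular layouts are so hard to extract.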
What is available, however, is a lot of information on the text 'style' including Font used, size and even the colour. This information can be very useful for identifying structures on the page (such as page titles or headers and footers) or making sense of some values in Symbolic fonts.
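As a sketch of how that style information might be used, the snippet below reads a few lines of tagged output of the kind shown later in this post and lists each run of text with its font and size. It uses only the Python standard library; the helper name and the wrapping `<page>` element are my own additions, and it assumes the output is well-formed XML.

```python
import re
import xml.etree.ElementTree as ET

SAMPLE = """\
<p><font face="Helvetica" style="font-size:8pt">JUNE 2003</font></p>
<p><font face="GillSans" style="font-size:20pt">12 <SpaceCount space="2" /></font><font face="GillSans" style="font-size:11pt;font-style:Bold">DATA FEATURE</font></p>
<p><font face="GillSans" style="font-size:11pt">Trends in U.S. Per Capita Consumption of</font></p>
"""


def font_runs(xml_text):
    """Return (face, size_in_pt, text) for every <font> run, in order."""
    # Wrap the fragments in a single root so they parse as one document.
    root = ET.fromstring("<page>" + xml_text + "</page>")
    runs = []
    for font in root.iter("font"):
        match = re.search(r"font-size:(\d+)pt", font.get("style", ""))
        size = int(match.group(1)) if match else None
        text = "".join(font.itertext()).strip()
        runs.append((font.get("face"), size, text))
    return runs


for face, size, text in font_runs(SAMPLE):
    print(face, size, text)
```

Filtering these runs by size is then enough to separate, say, the 20pt page numbers from the 11pt body text and the 8pt running header.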
Most of the time customers want just the text (and building XML trees is a relatively slow process in Java). So our JPedal library extracts just the text by default because it is fast and what most people want. All that XML metadata is extracted, though, as part of the process (you need to know the font to make sense of the text encoding). So we offer an option to include this information.
The best way to see the XML tags available is to run the JPedal text extraction example (the demo or full version) and see how it works or use the text extraction menu option in our example Viewer. You will see there is a surprising amount of useful data in XML mode.
<p><font face="Helvetica" style="font-size:8pt">JUNE 2003</font>
</p>
<p><font face="Helvetica-Bold" style="font-size:8pt">3</font></p>
<p><font face="GillSans" style="font-size:11pt">Policymakers strive to make efficient use of taxpayer dollars <SpaceCount space="12" />The transition from welfare to work is proving more difficult</font></p>
<p><font face="GillSans" style="font-size:11pt">low and serving program goals. <SpaceCount space="287" /></font><font face="Helvetica" style="font-size:8pt">AMBER WAVES</font><font face="GillSans" style="font-size:11pt">in rural than in urban areas, especially in remote, sparselypopulated areas where job opportunities are few.</font></p>
<p><font face="GillSans" style="font-size:11pt">in the design and administration of USDA's food assistance</font></p>
<p><font face="GillSans" style="font-size:11pt">programs. Balance must be struck between keeping costs</font>
</p>
<p><font face="GillSans" style="font-size:20pt">12 <SpaceCount space="2" /></font><font face="GillSans" style="font-size:11pt;font-style:Bold">DATA FEATURE</font></p>
<p><font face="GillSans" style="font-size:11pt">Trends in U.S. Per Capita Consumption of</font></p>
<p><font face="GillSans" style="font-size:11pt">Dairy Products, 1909 to 2001</font></p>
<p><font face="GillSans" style="font-size:20pt">46 <SpaceCount space="2" /></font><font face="GillSans" style="font-size:11pt;font-style:Bold">INDICATORS</font></p>
<p><font face="GillSans" style="font-size:11pt">Selected statistics on agriculture and trade, diet and health,</font></p>
<p><font face="GillSans" style="font-size:11pt">natural resources, and rural America</font></p>
<p><font face="GillSans" style="font-size:20pt">50 <SpaceCount space="2" /></font><font face="GillSans" style="font-size:11pt;font-style:Bold">GLEANINGS</font></p>
<p><font face="GillSans" style="font-size:11pt">Snapshots of recent events at ERS, highlights of new</font></p>
<p><font face="GillSans" style="font-size:11pt">publications, and previews of research in the works</font></p>
<p><font face="GillSans" style="font-size:20pt">52 <SpaceCount space="2" /></font><font face="GillSans" style="font-size:11pt;font-style:Bold">PROFILES</font></p>
<p><font face="GillSans" style="font-size:11pt">Recent accolades for ERS researchers</font>