PDF format and style information
Posted by Mark Stephens on Thu, Sep 03, 2009 @ 09:56 AM
Because PDF is very much an output and display format, it does not contain much formatting information, such as paragraph breaks and spaces, unless these tags are explicitly added (Adobe calls it MarkedContent). When the tags are present, it is possible to extract an almost perfect copy of the text data in a PDF. Otherwise, the software needs to guess such details. This is why it is very hard to extract complex, irregular or multi-column text from a PDF file: the correct definition of what a column is varies with every file. Even spaces and returns have to be guessed.
What is available, however, is a lot of information on the text 'style', including the font used, its size and even the colour. This information can be very useful for identifying structures on the page (such as page titles or headers and footers) or making sense of some values in Symbolic fonts.
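As a rough illustration of how style can reveal structure, here is a minimal sketch that picks out the text set in the largest font size on a page, which is often a title or section marker. It assumes the text has already been extracted into runs that carry a font name and size. The TextRun class, the largestRuns method and the sample values are hypothetical and are not part of the JPedal API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StyleHeuristics {

    // Hypothetical holder for one run of text plus its style; not a JPedal class.
    static final class TextRun {
        final String text;
        final String font;
        final double sizePt;

        TextRun(String text, String font, double sizePt) {
            this.text = text;
            this.font = font;
            this.sizePt = sizePt;
        }
    }

    // Returns the text of every run set in the largest font size found in the list.
    static List<String> largestRuns(List<TextRun> runs) {
        double max = 0;
        for (TextRun run : runs) {
            max = Math.max(max, run.sizePt);
        }
        List<String> result = new ArrayList<>();
        for (TextRun run : runs) {
            if (run.sizePt == max) {
                result.add(run.text);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative values only.
        List<TextRun> page = Arrays.asList(
                new TextRun("JUNE 2003", "Helvetica", 8),
                new TextRun("12", "GillSans", 20),
                new TextRun("DATA FEATURE", "GillSans", 11),
                new TextRun("46", "GillSans", 20));
        System.out.println(largestRuns(page));
    }
}
```

A real page would probably need a looser rule (for example, anything noticeably larger than the most common body size), but the idea is the same: the style data does the work that the missing structure tags cannot.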
Most of the time customers want just the text (and building XML trees is a relatively slow process in Java), so our JPedal library extracts just the text by default: it is fast and it is what most people want. All that XML metadata is extracted anyway as part of the process (you need to know the font to make sense of the text encoding), so we offer an option to include this information.
The best way to see the XML tags available is to run the JPedal text extraction example (the demo or full version) and see how it works, or use the text extraction menu option in our example Viewer. You will see there is a surprising amount of useful data in XML mode.
<p><font face="Helvetica" style="font-size:8pt">JUNE 2003</font>
</p>
<p><font face="Helvetica-Bold" style="font-size:8pt">3</font></p>
<p><font face="GillSans" style="font-size:11pt">Policymakers strive to make efficient use of taxpayer dollars <SpaceCount space="12" />The transition from welfare to work is proving more difficult</font></p>
<p><font face="GillSans" style="font-size:11pt">low and serving program goals. <SpaceCount space="287" /></font><font face="Helvetica" style="font-size:8pt">AMBER WAVES</font><font face="GillSans" style="font-size:11pt">in rural than in urban areas, especially in remote, sparselypopulated areas where job opportunities are few.</font></p>
<p><font face="GillSans" style="font-size:11pt">in the design and administration of USDA's food assistance</font></p>
<p><font face="GillSans" style="font-size:11pt">programs. Balance must be struck between keeping costs</font>
</p>
<p><font face="GillSans" style="font-size:20pt">12 <SpaceCount space="2" /></font><font face="GillSans" style="font-size:11pt;font-style:Bold">DATA FEATURE</font></p>
<p><font face="GillSans" style="font-size:11pt">Trends in U.S. Per Capita Consumption of</font></p>
<p><font face="GillSans" style="font-size:11pt">Dairy Products, 1909 to 2001</font></p>
<p><font face="GillSans" style="font-size:20pt">46 <SpaceCount space="2" /></font><font face="GillSans" style="font-size:11pt;font-style:Bold">INDICATORS</font></p>
<p><font face="GillSans" style="font-size:11pt">Selected statistics on agriculture and trade, diet and health,</font></p>
<p><font face="GillSans" style="font-size:11pt">natural resources, and rural America</font></p>
<p><font face="GillSans" style="font-size:20pt">50 <SpaceCount space="2" /></font><font face="GillSans" style="font-size:11pt;font-style:Bold">GLEANINGS</font></p>
<p><font face="GillSans" style="font-size:11pt">Snapshots of recent events at ERS, highlights of new</font></p>
<p><font face="GillSans" style="font-size:11pt">publications, and previews of research in the works</font></p>
<p><font face="GillSans" style="font-size:20pt">52 <SpaceCount space="2" /></font><font face="GillSans" style="font-size:11pt;font-style:Bold">PROFILES</font></p>
<p><font face="GillSans" style="font-size:11pt">Recent accolades for ERS researchers</font>
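Output in this style is regular enough to post-process with a few lines of code. The following is a minimal sketch, not JPedal code, that pulls the font face, size and text out of each font run using only the standard java.util.regex classes. The class and method names are made up for illustration. It assumes the tags look exactly like the sample shown here, and it uses a regular expression rather than an XML parser because a fragment like this has no single root element.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FontTagScan {

    // Matches one <font face="..." style="font-size:NNpt...">text</font> run.
    private static final Pattern FONT_RUN = Pattern.compile(
            "<font face=\"([^\"]+)\" style=\"font-size:(\\d+)pt[^\"]*\">(.*?)</font>",
            Pattern.DOTALL);

    // Matches nested tags such as <SpaceCount space="2" /> inside the run text.
    private static final Pattern INNER_TAG = Pattern.compile("<[^>]+>");

    // Returns "face|size|text" for each font run found in the tagged output.
    static List<String> scan(String tagged) {
        List<String> runs = new ArrayList<>();
        Matcher m = FONT_RUN.matcher(tagged);
        while (m.find()) {
            String text = INNER_TAG.matcher(m.group(3)).replaceAll("").trim();
            runs.add(m.group(1) + "|" + m.group(2) + "|" + text);
        }
        return runs;
    }

    public static void main(String[] args) {
        String line = "<p><font face=\"GillSans\" style=\"font-size:20pt\">12 "
                + "<SpaceCount space=\"2\" /></font><font face=\"GillSans\" "
                + "style=\"font-size:11pt;font-style:Bold\">DATA FEATURE</font></p>";
        for (String run : scan(line)) {
            System.out.println(run);
        }
    }
}
```

Run on the DATA FEATURE line from the sample, this prints one line per font run: GillSans|20|12 followed by GillSans|11|DATA FEATURE. The sketch discards the SpaceCount values and any extra style properties such as font-style, so it would need extending if those matter.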