Posted by Mark Stephens on Tue, Jul 13, 2010 @ 02:44 AM
I see a lot of complaints about the PDF file format on various forums. They tend to focus mainly on two issues:
1. The PDF file format is complicated.
2. Extraction, especially of text, is not always straightforward.
Both of these, I think, are essentially unfair. PDF arose out of PostScript and is more akin to a program, with the final display as its output. It offers a very powerful and elegant structure to do this, but getting into PDF is a bit like learning a programming language. As with any programming language, you need a decent set of tools and a good working knowledge to achieve anything.
Many so-called 'PDF killers' have appeared over the years and yet PDF still remains because it is an excellent technical solution for many problems. PDF was never envisaged as something you could hack in a text editor.
The issue with text extraction arises because PDF was designed as a final display format, so it does not contain the details of text structure and layout which you might find in other formats. Adobe did remedy this by adding a feature to embed structured content tags into the PDF, and if this is used, text can be extracted very accurately. The problem is that very few people use this when creating PDFs. So again, don't blame the format - if used correctly it works very well.
The PDF format's biggest issue really is that it has been so successful that people are trying to push it into areas which are not its strength, or beyond what it was designed to do.
Posted by Mark Stephens on Wed, May 26, 2010 @ 02:50 AM
I have been looking at an issue for a potential client recently which required the generation of different views of the page. This is interesting because it allows me to show you the internal workings of the PDF file format rather elegantly. It seems to be an increasingly common activity from our clients these days as they build web applications to display PDFs and need to separate out text and images.
What is in a PDF
A PDF can contain bitmapped images, vector graphics and text (which can be vector or bitmapped depending on the font used). Sometimes, you may be surprised at what you find. While a PDF may look like it contains text, the lettering may actually be part of an image (as in a scan) or drawn as shapes (where the text was converted to paths). Here is a rather nice PDF page showing what is going on...
Here is the complete page
which consists of images

text and vector graphics

and just the text
(the white text is invisible on a default white background)
The white text in particular illustrates how dependent on each other the layers are - we could generate it as a transparent image and add a coloured background if we wanted to highlight the text layer on its own.
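To give a feel for how the different kinds of content sit side by side inside a page, here is a rough sketch in Java that tallies the painting operators in a content stream by type. It is not how JPedal does it; the class and method names are hypothetical. The operator names (`Do`, `Tj`, `TJ`, `f`, `S` and so on) are standard PDF operators, but the sketch assumes an already decompressed stream with no comments or inline images, and it counts every `Do` as an image even though `Do` can also draw a form XObject.

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: tallies painting operators in a decoded content stream. */
final class LayerCounter {
    enum Layer { IMAGE, TEXT, VECTOR }

    private static final Map<String, Layer> OPERATORS = new HashMap<>();
    static {
        // Simplification: Do may also paint a form XObject, not just an image.
        OPERATORS.put("Do", Layer.IMAGE);
        // Text-showing operators.
        for (String op : new String[] {"Tj", "TJ", "'", "\""}) {
            OPERATORS.put(op, Layer.TEXT);
        }
        // Path-painting operators (stroke and fill variants).
        for (String op : new String[] {"S", "s", "f", "F", "f*", "B", "B*", "b", "b*"}) {
            OPERATORS.put(op, Layer.VECTOR);
        }
    }

    static Map<Layer, Integer> countPaintOps(String stream) {
        Map<Layer, Integer> counts = new EnumMap<>(Layer.class);
        int i = 0;
        int n = stream.length();
        while (i < n) {
            char c = stream.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (c == '(') {
                // Literal strings may contain anything, so skip them whole.
                i = skipLiteralString(stream, i);
            } else {
                int start = i;
                while (i < n && !Character.isWhitespace(stream.charAt(i))
                        && stream.charAt(i) != '(') {
                    i++;
                }
                Layer layer = OPERATORS.get(stream.substring(start, i));
                if (layer != null) {
                    counts.merge(layer, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** Returns the index just past the ')' that closes the string opened at 'open'. */
    private static int skipLiteralString(String s, int open) {
        int depth = 0;
        int i = open;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\\') {
                i += 2; // skip the escaped character
                continue;
            }
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
                if (depth == 0) {
                    return i + 1;
                }
            }
            i++;
        }
        return s.length();
    }

    public static void main(String[] args) {
        String stream = "q 100 0 0 50 0 0 cm /Im1 Do Q "
                + "BT /F1 12 Tf (Hello) Tj ET "
                + "0 0 10 10 re f";
        System.out.println(countPaintOps(stream));
    }
}
```

A count like this only tells you what kinds of content are present. Producing the separate views shown above means actually rendering each group of operators on its own.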
Creating your own separations
If you would like to create your own separations, there is a new support page explaining how to use the feature in our JPedal PDF library - you will need version 4.20 or later.
Posted by chris wade on Mon, Apr 19, 2010 @ 11:00 AM
I came across an interesting issue with PDF text fields while debugging a file this week. We were sent a 2 page document created with iText, containing some text fields. We were displaying both pages with text fields containing identical values, yet they appear different in Acrobat. Obviously Acrobat is always right (even when it disagrees with the PDF specification) so we dug deeper to see what was going on...
With PDF forms, all form objects can share common Parent objects and they can then inherit values from them. So if a text field does not have a text value, it can inherit its Parent's value. This is really useful because you can avoid having to repeat common values.
In this PDF, the Text fields on both pages shared the same Parent and because they had no text values, we were inheriting the value from the Parent. So our viewer displayed the same text value on both pages. However, form objects can also have an Appearance Stream which defines the display of the form object. This is what accounts for the different appearance.
So I found out that it is "allowed" to have two form fields with different Appearance Streams and a single Parent that defines the text value for both. They both had the same text value, but the appearance was different.
So either the appearance overrides the text value in read-only text fields, or the child value is more important in defining the display of the form. Either way, in this example the appearance streams are more important than the text value of the form object.
It's not an ideal way to work, because any software reading the text value for the form will not get the value which the user sees. For reading text values, the file is essentially broken. But our viewer now displays it as Adobe would (which is all most users care about at the end of the day).
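The behaviour described above can be sketched roughly as follows. This is a simplified model in Java, not JPedal's code: the `FormField` class and its members are hypothetical, and a real Appearance Stream holds drawing instructions rather than a plain string. The `/Parent`, `/V` and `/AP` entries mentioned in the comments are the standard PDF dictionary keys.

```java
import java.util.Optional;

/** Hypothetical model of a form field; not the API of JPedal or any other library. */
final class FormField {
    private final FormField parent;      // /Parent entry, or null for a root field
    private final String value;          // /V entry, or null when the field has none
    private final String appearanceText; // stand-in for what the /AP stream draws, or null

    FormField(FormField parent, String value, String appearanceText) {
        this.parent = parent;
        this.value = value;
        this.appearanceText = appearanceText;
    }

    /** Walks up the Parent chain until a field with its own value is found. */
    Optional<String> inheritedValue() {
        for (FormField f = this; f != null; f = f.parent) {
            if (f.value != null) {
                return Optional.of(f.value);
            }
        }
        return Optional.empty();
    }

    /** What is shown on screen if the appearance takes precedence over the value. */
    Optional<String> displayedText() {
        if (appearanceText != null) {
            return Optional.of(appearanceText);
        }
        return inheritedValue();
    }

    public static void main(String[] args) {
        FormField shared = new FormField(null, "shared value", null);
        FormField onPageOne = new FormField(shared, null, "Page one text");
        FormField onPageTwo = new FormField(shared, null, "Page two text");

        // Both children report the same value when read as data...
        System.out.println(onPageOne.inheritedValue().get());
        System.out.println(onPageTwo.inheritedValue().get());
        // ...but display differently.
        System.out.println(onPageOne.displayedText().get());
        System.out.println(onPageTwo.displayedText().get());
    }
}
```

The gap between `inheritedValue()` and `displayedText()` is exactly the problem with this file: the data and the display disagree.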
We are working on a way to generate a readable string from the appearance stream, so we can make this file more useful in text extraction, so keep watching this space.
So that is another mystery solved for me, and yet another way to interpret the spec. Have you come across any interesting and mysterious PDF files where things are not as they should be?
Posted by Mark Stephens on Thu, Sep 03, 2009 @ 09:56 AM
Because PDF is very much an output and display format, it does not contain much formatting information such as paragraph breaks and spaces unless tags are explicitly added (Adobe calls it MarkedContent). When the tags are present, it is possible to extract an almost perfect copy of the text data in a PDF. Otherwise, the software needs to guess such details. This is why it is very hard to extract complex, irregular or multi-column text from a PDF file - the correct definition of what a column is varies with every file. Even spaces and returns have to be guessed.
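As an illustration of the kind of guessing involved, here is a minimal sketch in Java of one common heuristic: insert a space when the horizontal gap between two runs of text on the same line is larger than some fraction of the font size. The `Fragment` class, the `joinLine` method and the threshold are all assumptions made for this example, not JPedal's actual algorithm.

```java
import java.util.List;

/** Hypothetical gap-based space guesser for fragments on a single line. */
final class SpaceGuesser {

    /** A run of text and its horizontal extent on the page, in points. */
    static final class Fragment {
        final String text;
        final double startX;
        final double endX;
        final double fontSize;

        Fragment(String text, double startX, double endX, double fontSize) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
            this.fontSize = fontSize;
        }
    }

    /**
     * Joins fragments that are already sorted left to right, adding a space
     * wherever the gap exceeds gapRatio times the previous fragment's font size.
     */
    static String joinLine(List<Fragment> fragments, double gapRatio) {
        StringBuilder line = new StringBuilder();
        Fragment previous = null;
        for (Fragment current : fragments) {
            if (previous != null) {
                double gap = current.startX - previous.endX;
                if (gap > gapRatio * previous.fontSize) {
                    line.append(' ');
                }
            }
            line.append(current.text);
            previous = current;
        }
        return line.toString();
    }

    public static void main(String[] args) {
        List<Fragment> fragments = List.of(
                new Fragment("Hello", 0, 25, 10),
                new Fragment("World", 28, 55, 10));
        System.out.println(joinLine(fragments, 0.2));
    }
}
```

Any fixed threshold will be wrong for some files. Tightly set type produces missing spaces and loosely set type produces extra ones, which is why the result is only ever a guess.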
What is available, however, is a lot of information on the text 'style' including Font used, size and even the colour. This information can be very useful for identifying structures on the page (such as page titles or headers and footers) or making sense of some values in Symbolic fonts.
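As a small example of putting the style information to work, the sketch below (in Java) picks out the text of every run at or above a given font size. It assumes the tags look like `<font face="GillSans" style="font-size:20pt">...</font>`, as in JPedal's XML output, and uses a regular expression rather than a real XML parser. The `StyleScanner` class and `largeRuns` method are hypothetical names for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that filters extracted text runs by font size. */
final class StyleScanner {
    // Matches one <font face="..." style="font-size:NNpt...">text</font> run.
    private static final Pattern FONT_RUN = Pattern.compile(
            "<font face=\"([^\"]*)\" style=\"font-size:(\\d+)pt[^\"]*\">(.*?)</font>");

    /** Returns the text of each run whose size is at least minSizePt, with nested tags removed. */
    static List<String> largeRuns(String xml, int minSizePt) {
        List<String> result = new ArrayList<>();
        Matcher m = FONT_RUN.matcher(xml);
        while (m.find()) {
            int size = Integer.parseInt(m.group(2));
            if (size >= minSizePt) {
                // Drop embedded tags such as <SpaceCount ... /> and surrounding spaces.
                result.add(m.group(3).replaceAll("<[^>]*>", "").trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<p><font face=\"GillSans\" style=\"font-size:20pt\">12 "
                + "<SpaceCount space=\"2\" /></font>"
                + "<font face=\"GillSans\" style=\"font-size:11pt;font-style:Bold\">"
                + "DATA FEATURE</font></p>";
        System.out.println(largeRuns(xml, 20));
    }
}
```

The same idea extends to the font name or a bold style. Which combination marks a title, header or footer depends entirely on the document.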
Most of the time customers want just the text (and building XML trees is a relatively slow process in Java). So our JPedal library extracts just the text by default because it is fast and what most people want. All that XML metadata is extracted anyway as part of the process (you need to know the font to make sense of the text encoding), so we offer an option to include this information.
The best way to see the XML tags available is to run the JPedal text extraction example (the demo or full version) and see how it works, or use the text extraction menu option in our example Viewer. You will see there is a surprising amount of useful data in XML mode.
<p><font face="Helvetica" style="font-size:8pt">JUNE 2003</font>
</p>
<p><font face="Helvetica-Bold" style="font-size:8pt">3</font></p>
<p><font face="GillSans" style="font-size:11pt">Policymakers strive to make efficient use of taxpayer dollars <SpaceCount space="12" />The transition from welfare to work is proving more difficult</font></p>
<p><font face="GillSans" style="font-size:11pt">low and serving program goals. <SpaceCount space="287" /></font><font face="Helvetica" style="font-size:8pt">AMBER WAVES</font><font face="GillSans" style="font-size:11pt">in rural than in urban areas, especially in remote, sparselypopulated areas where job opportunities are few.</font></p>
<p><font face="GillSans" style="font-size:11pt">in the design and administration of USDA's food assistance</font></p>
<p><font face="GillSans" style="font-size:11pt">programs. Balance must be struck between keeping costs</font>
</p>
<p><font face="GillSans" style="font-size:20pt">12 <SpaceCount space="2" /></font><font face="GillSans" style="font-size:11pt;font-style:Bold">DATA FEATURE</font></p>
<p><font face="GillSans" style="font-size:11pt">Trends in U.S. Per Capita Consumption of</font></p>
<p><font face="GillSans" style="font-size:11pt">Dairy Products, 1909 to 2001</font></p>
<p><font face="GillSans" style="font-size:20pt">46 <SpaceCount space="2" /></font><font face="GillSans" style="font-size:11pt;font-style:Bold">INDICATORS</font></p>
<p><font face="GillSans" style="font-size:11pt">Selected statistics on agriculture and trade, diet and health,</font></p>
<p><font face="GillSans" style="font-size:11pt">natural resources, and rural America</font></p>
<p><font face="GillSans" style="font-size:20pt">50 <SpaceCount space="2" /></font><font face="GillSans" style="font-size:11pt;font-style:Bold">GLEANINGS</font></p>
<p><font face="GillSans" style="font-size:11pt">Snapshots of recent events at ERS, highlights of new</font></p>
<p><font face="GillSans" style="font-size:11pt">publications, and previews of research in the works</font></p>
<p><font face="GillSans" style="font-size:20pt">52 <SpaceCount space="2" /></font><font face="GillSans" style="font-size:11pt;font-style:Bold">PROFILES</font></p>
<p><font face="GillSans" style="font-size:11pt">Recent accolades for ERS researchers</font>