Java PDF Blog

Don't blame the PDF file format

Posted by Mark Stephens on Tue, Jul 13, 2010 @ 02:44 AM

I see a lot of complaints about the PDF file format on various forums. They tend to focus on two issues:

1. The PDF file format is complicated.

2. Extraction, especially of text, is not always straightforward.

Both of these, I think, are essentially unfair. PDF arose out of PostScript and is more akin to a program, with the final display as its output. It offers a very powerful and elegant structure for doing this, but getting into PDF is a bit like learning a programming language. As with any programming language, you need a decent set of tools and a good working knowledge to achieve anything.

Many so-called 'PDF killers' have appeared over the years, and yet PDF remains because it is an excellent technical solution for many problems. PDF was never envisaged as something you could hack in a text editor.

The issue with text extraction arises because PDF was designed as a final-form display format, so it does not contain the details of text structure and layout that you might find in other formats. Adobe remedied this by adding a feature for embedding structured content tags into the PDF (Tagged PDF) and, if this is used, text can be extracted very accurately. The problem is that very few people use it when creating PDFs. So again, don't blame the format: used correctly, it works very well.
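To make the extraction problem concrete, here is a minimal Java sketch of what a tool is left to do when a PDF has no structure tags. It assumes the text has already been decoded into fragments with page coordinates, and it guesses the reading order from those positions alone. The names (ReadingOrderSketch, Fragment, rebuildLines) and the tolerance value are hypothetical and chosen only for illustration. This is a naive heuristic, not how JPedal or any other library actually works.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ReadingOrderSketch {

    /** Hypothetical holder for a run of text and the position it is drawn at. */
    static final class Fragment {
        final double x;
        final double y;
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Guesses lines of text from positions only. The PDF y axis points up,
     * so a larger y is higher on the page. Fragments whose y is within
     * lineTolerance of the first fragment of a line are put on that line.
     */
    static List<String> rebuildLines(List<Fragment> fragments, double lineTolerance) {
        // Order from the top of the page to the bottom.
        List<Fragment> byHeight = new ArrayList<>(fragments);
        byHeight.sort((a, b) -> Double.compare(b.y, a.y));

        // Group fragments with roughly the same baseline into lines.
        List<List<Fragment>> lines = new ArrayList<>();
        for (Fragment f : byHeight) {
            List<Fragment> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last != null && Math.abs(last.get(0).y - f.y) <= lineTolerance) {
                last.add(f);
            } else {
                List<Fragment> line = new ArrayList<>();
                line.add(f);
                lines.add(line);
            }
        }

        // Within each line, read left to right and join with single spaces.
        List<String> result = new ArrayList<>();
        for (List<Fragment> line : lines) {
            line.sort(Comparator.comparingDouble(f -> f.x));
            StringBuilder sb = new StringBuilder();
            for (Fragment f : line) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(f.text);
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Fragments in the order a file might happen to draw them,
        // which need not be the order a person would read them.
        List<Fragment> drawn = new ArrayList<>();
        drawn.add(new Fragment(150, 700.0, "world"));
        drawn.add(new Fragment(72, 680.0, "Second line"));
        drawn.add(new Fragment(72, 700.5, "Hello"));

        for (String line : rebuildLines(drawn, 2.0)) {
            System.out.println(line);
        }
    }
}
```

Running main prints "Hello world" and then "Second line". A guess like this breaks down quickly on multi-column pages, tables or rotated text, which is exactly the information that structure tags would otherwise supply.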

The PDF format's biggest issue is really that it has been so successful that people are trying to push it into areas which are not its strength, or beyond what it was designed to do.
