Java PDF Blog


Java and PDF development - our personal experiences and discoveries


There is more than one PDF file specification

Posted by Mark Stephens on Mon, Apr 12, 2010 @ 11:49 AM

One of the really annoying things about working with PDF is that there is actually more than one PDF file specification.

First of all, there is the PDF Reference produced by Adobe. This freely available document is very long, detailed and often rather cryptic. Sometimes you need to hunt through all sorts of minor appendices to find a specific case, or you find that the information is simply missing.

Today I was debugging some code to read the Hint stream in a Linearized Object - this allows a PDF viewer to display pages before the document has finished loading. This object consists of a set of variable bit-length values split into sub-tables. It seems that while the values are packed together into a bitstream, the sub-tables themselves must be byte aligned. I could not find any mention of this in the spec - it was an intuitive guess on my part to try it.
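To make the byte-alignment point concrete, here is a minimal sketch of a bit reader with an explicit alignment step. The class name HintBitReader and its methods are made up for illustration and are not part of JPedal, and the sketch assumes values are stored high-order bit first.

```java
/** Hypothetical helper (not JPedal code): reads variable-width bit fields from a byte array. */
final class HintBitReader {
    private final byte[] data;
    private long bitPos; // absolute read position, counted in bits

    HintBitReader(byte[] data) {
        this.data = data;
    }

    /** Reads the next width bits (0 to 32) as an unsigned value, assuming high-order bit first. */
    long readBits(int width) {
        if (width < 0 || width > 32) {
            throw new IllegalArgumentException("width must be between 0 and 32");
        }
        if (bitPos + width > (long) data.length * 8) {
            throw new IllegalStateException("attempt to read past the end of the data");
        }
        long value = 0;
        for (int i = 0; i < width; i++) {
            int currentByte = data[(int) (bitPos >>> 3)] & 0xFF;
            int bit = (currentByte >>> (7 - (int) (bitPos & 7))) & 1;
            value = (value << 1) | bit;
            bitPos++;
        }
        return value;
    }

    /** Skips any unread bits in the current byte so the next read starts on a byte boundary. */
    void alignToByte() {
        bitPos = (bitPos + 7) & ~7L;
    }

    /** Current position in bits, useful when debugging table boundaries. */
    long bitPosition() {
        return bitPos;
    }
}
```

With a reader like this, the values inside one sub-table are read back to back with readBits, and alignToByte is called once before moving on to the next sub-table.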

The second PDF File Specification is what works in Acrobat. We had an interesting case last week of a file which did not work in our JPedal reader but opened in Acrobat. When we hunted it down, the reason turned out to be that the Specification says all PDF files end with the characters %%EOF - this file ended with %%EO but still worked. So it seems that, in this case, the real specification is a guideline and not a rule.

So, if you are working with PDF files, always keep a copy of the specification handy - the printed copy also doubles up as an excellent doorstop or monitor stand. But also make sure you have a copy of Acrobat handy, and be prepared to find that rules can be interpreted rather loosely.
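One way to cope with files like this is to check the end-of-file marker leniently. The sketch below is only an illustration: EofMarkerCheck is a made-up name, the 1024-byte search window is an assumption, and the set of truncated markers it accepts is a guess rather than a description of what Acrobat actually does.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not JPedal code): classifies the end-of-file marker of a PDF. */
final class EofMarkerCheck {
    enum Result { STRICT, TRUNCATED, MISSING }

    static Result check(byte[] pdfBytes) {
        // Assumption: only the last 1024 bytes are worth searching.
        int from = Math.max(0, pdfBytes.length - 1024);
        String tail = new String(pdfBytes, from, pdfBytes.length - from,
                StandardCharsets.ISO_8859_1);

        if (tail.contains("%%EOF")) {
            return Result.STRICT; // marker is present in full
        }
        // Lenient case: the marker was cut short, like the %%EO file described above.
        String trimmed = tail.trim();
        if (trimmed.endsWith("%%EO") || trimmed.endsWith("%%E")) {
            return Result.TRUNCATED;
        }
        return Result.MISSING;
    }
}
```

A reader could then log a warning for TRUNCATED and carry on, and reject the file only when the result is MISSING.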


Linearized PDF files

Posted by Mark Stephens on Fri, Feb 26, 2010 @ 11:36 AM

Linearized PDF is a special way to organize a PDF file.

In general, PDF is a very elegant and well-designed format. A PDF consists of lots of PDF objects which are used to create the pages. The location of each object in the file is stored in a lookup structure, the cross-reference table. So only this table needs to be loaded when the file is opened, and it can then be used to load the objects required to display a page. The whole file itself does not need to be read. The location of the table is always stored at the end of the file, so it is easy to find, and it is also simple to modify the file just by appending new information and a new table.
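As a rough sketch of that first step: the last few lines of a PDF hold the keyword startxref followed by the byte offset of the object lookup data, so a reader can start from the tail of the file. StartXrefFinder is a hypothetical name, and the 1024-byte search window is an assumption made for this example.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not JPedal code): finds the offset stored after the last startxref. */
final class StartXrefFinder {
    private static final String KEYWORD = "startxref";

    /** Returns the byte offset that follows the last startxref keyword, or -1 if none is found. */
    static long findStartXref(byte[] pdfBytes) {
        // Assumption: the keyword sits within the last 1024 bytes of the file.
        int from = Math.max(0, pdfBytes.length - 1024);
        String tail = new String(pdfBytes, from, pdfBytes.length - from,
                StandardCharsets.ISO_8859_1);

        int idx = tail.lastIndexOf(KEYWORD);
        if (idx < 0) {
            return -1;
        }
        int i = idx + KEYWORD.length();
        // Skip the line break (or other whitespace) after the keyword.
        while (i < tail.length() && Character.isWhitespace(tail.charAt(i))) {
            i++;
        }
        int start = i;
        while (i < tail.length() && tail.charAt(i) >= '0' && tail.charAt(i) <= '9') {
            i++;
        }
        if (i == start || i - start > 18) {
            return -1; // no digits, or too many to be a sensible offset
        }
        return Long.parseLong(tail.substring(start, i));
    }
}
```

The returned offset is where a reader would seek to next in order to load the object locations, without touching the rest of the file.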

However, if the file is read via the web, it is accessed as a stream of bytes. This means the lookup information (which is at the end of the file) cannot be read until the whole PDF file has been transferred. This can take some time with large files.

So Adobe created a new way to lay out the PDF called Linearized PDF. The file format is still the same, but there is a special object (the Linearized dictionary) at the start of the file, and all the objects needed to create the first page (and a small lookup table describing them) are stored at the START of the file. As soon as this data has been read, the first page can be displayed while the rest of the file is downloaded. This makes the whole thing seem much faster and gives the user something to look at almost immediately, even on huge files.

In Adobe Acrobat and Adobe Reader, the best way to see if a PDF is Linearized is to look at the Document properties. If the file is a linearized PDF, the item Fast Web View will display Yes.



In JPedal 4.0.1, we have added a similar option to the Document properties to show if the file is Linearized. If it is Linearized, the word linearized appears in the general section after the PDF version.


In JPedal you can also check programmatically to see if a file is Linearized by seeing if the Linearized object exists - if it does, it is a Linearized PDF. Here is the code.

//decode_pdf is the instance of PdfDecoder representing the opened PDF file
if (this.decode_pdf.getJPedalObject(PdfDictionary.Linearized) != null) {
    System.out.println("This file is linearized");
}
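
If you do not have JPedal to hand, a rough check is to look for the /Linearized key near the start of the raw file. This is only a heuristic sketch: LinearizedSniffer is a made-up name, the code assumes the key appears within the first 1024 bytes, and a file that has been modified by appending data may still contain the key without really being Linearized any more.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not JPedal code): heuristic test for a Linearized PDF. */
final class LinearizedSniffer {
    static boolean looksLinearized(byte[] pdfBytes) {
        // Assumption: the Linearized dictionary sits within the first 1024 bytes.
        int len = Math.min(pdfBytes.length, 1024);
        String head = new String(pdfBytes, 0, len, StandardCharsets.ISO_8859_1);
        return head.startsWith("%PDF-") && head.contains("/Linearized");
    }
}
```

Treat a true result as a hint only; a proper check needs a real PDF parser.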


So in a nutshell, Linearized PDF is a way of organizing a PDF file so that if it is going to be accessed over the Internet it will appear to load much faster. And it does this very well!
