Java PDF Blog

PDF solutions for big and small customers

Java and PDF development - our personal experiences and discoveries


Why we need to see your PDF files...

Posted by Mark Stephens on Fri, Jul 02, 2010 @ 08:17 AM

What makes writing a PDF parser especially interesting (ie complex) is that the specification is often ambiguous and that PDF is a very complex structure. Displaying a PDF file requires the parser to correctly scan the PDF object data structure, to correctly decode and assemble all the data, and then to parse the stream of Postscript-like commands. There could be issues at any level.

Occasionally we have to tweak our parser to allow for bugs in our code, things we had not considered, areas where the PDF does something which is permissible but not clear from the spec or even cases where the PDF does not actually follow the specification. Most PDF creation tool writers create a PDF according to their interpretation of the PDF specification and if it opens in Acrobat, they leave it at that. If it does not open in our parser, it is obviously our fault, not theirs.

Over time we have become very adept at tweaking our code to allow for all the little idiosyncrasies of various PDF tools - we have lots of interesting internal flags in our source code, and the excellent tracing in IntelliJ IDEA (my preferred Java IDE) allows us to follow the flow through code we know very well. It is normally a quick fix and regression test.

Sometimes, people send screenshots or say the file does not open. Unfortunately, it is very hard to help in this case. Send us the file and we can quickly find the issue. Screenshots are generally like giving a car mechanic a picture of your car and asking what is wrong - let him open up the bonnet and hear the engine and you'll get a quick answer. 


Why can't I just open and edit a PDF file

Posted by Mark Stephens on Tue, Jun 15, 2010 @ 04:51 AM

People sometimes try to edit a PDF file by opening the file in a text editor. This very rarely works for 3 reasons.

Firstly, a PDF file is effectively a dump of PDF objects. The file contains a reference table giving the exact byte offset of each object from the start of the file, and it also records the offset of the reference table itself. If you add or delete a character, or even resave the file from an editor which converts line endings from one platform format to another, all these numbers will be incorrect. You would need to update them all. To prove it, just try opening a PDF, type in a space, save it and then see what happens if you try to open it...
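
To make the byte offset problem concrete, here is a minimal Java sketch. It is not how JPedal (or any real parser) does it - the class and method names are made up for illustration, and the "PDF" is a toy fragment rather than a complete valid file. It reads the number after the startxref keyword and checks whether the bytes at that offset really are the xref keyword.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of any PDF library.
class XrefOffsetDemo {

    /** Returns the number written after the last "startxref" keyword, or -1 if absent. */
    static long readStartXref(byte[] data) {
        String text = new String(data, StandardCharsets.ISO_8859_1);
        int keyword = text.lastIndexOf("startxref");
        if (keyword < 0) {
            return -1;
        }
        int i = keyword + "startxref".length();
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
            i++;
        }
        int start = i;
        while (i < text.length() && Character.isDigit(text.charAt(i))) {
            i++;
        }
        return i > start ? Long.parseLong(text.substring(start, i)) : -1;
    }

    /** True if the bytes at the recorded offset really are the "xref" keyword. */
    static boolean xrefOffsetIsValid(byte[] data) {
        long offset = readStartXref(data);
        if (offset < 0 || offset + 4 > data.length) {
            return false;
        }
        String found = new String(data, (int) offset, 4, StandardCharsets.ISO_8859_1);
        return found.equals("xref");
    }

    public static void main(String[] args) {
        // Toy fragment: the recorded offset is simply the length of everything before "xref".
        String body = "%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n";
        String pdf = body + "xref\n0 1\n0000000000 65535 f \ntrailer\n<< /Size 1 >>\nstartxref\n"
                + body.length() + "\n%%EOF\n";
        byte[] original = pdf.getBytes(StandardCharsets.ISO_8859_1);
        // Simulate typing one space at the start of the file in a text editor.
        byte[] edited = (" " + pdf).getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(xrefOffsetIsValid(original)); // prints true
        System.out.println(xrefOffsetIsValid(edited));   // prints false
    }
}
```

One extra byte is enough to make the recorded offset point at the wrong place, and the same applies to every object offset in the table.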

Secondly, if you open a PDF file, much of the data is stored inside binary streams, in which data has been encrypted or compressed. If you view a PDF in a text editor you will see some text but lots of incomprehensible 'garbage'. This is the binary data. You cannot edit it, but you can easily break it just by adding a character.

 

Finally, much of the PDF data needs to be looked at in connection with other data in the file. Text only makes sense by looking at the encoding on the font object, images have their data partly in XObjects and partly in ColorSpace objects, and so forth...

Some file formats such as HTML, JavaScript and most source code can be easily manipulated in a text editor. The PDF file format is not one of these and is best accessed using a library which takes away all this complexity. Fortunately there are lots of both free and commercial tools available for all the most popular languages. If you have a favourite, why not post a recommendation here?


Understanding the PDF file format - Text, shapes and images

Posted by Mark Stephens on Wed, May 26, 2010 @ 02:50 AM

I have been looking at an issue for a potential client recently which required the generation of different views of the page. This is interesting because it allows me to show you the internal workings of the PDF file format rather elegantly. It seems to be an increasingly common request from our clients these days as they build web applications to display PDFs and need to separate out text and images.

What is in a PDF

A PDF can contain bitmapped images, vector graphics and text (which can be vector or bitmapped depending on the font used). Sometimes, you may be surprised at what you find. While a PDF may look like it contains text, the lettering may actually be part of an image (as in a scan) or shapes (where the text was converted to paths). Here is a rather nice PDF page showing what is going on...

[Images from the original post: the complete page; the same page showing only its images; only its text and vector graphics; and just the text (the white text is invisible on a default white background).]


The white text in particular illustrates how dependent on each other the layers are - we could generate it as a transparent image and add a coloured background if we wanted to highlight the text layer on its own. 

Creating your own separations

If you would like to create your own separations, there is a new support page explaining how to use the feature in our JPedal PDF library - you will need version 4.20 or later. 

 


What new PDF developers need to know

Posted by Mark Stephens on Fri, May 21, 2010 @ 11:03 AM

We had a discussion last week about what tips would help new developers get to grips when starting to work with PDF files. Here are some of the ideas which came out of that. It is very much a personal suggestion list so please feel free to add your own suggestions.

Do not think of a PDF file as a 'file'

When you start to learn HTML, you can open a file, hack it in a text editor and see what happens. You can't do this with a PDF file. It is essentially a binary data structure - lots of the information cannot be seen if you open the raw file and editing one byte could potentially break the whole file. There are lots of really good tools out there on multiple platforms for examining the contents of a PDF file so you should not need to try and open the file directly.

PDF is all about objects

What the PDF file essentially contains is a whole lot of PDF objects. They all have a unique ID of the format number generation R (so you might see 3 0 R, 144 0 R). Most of the time generation is zero but not always.

There are lots of types of objects - a Page object describes a particular page, a Font object contains all the information about a specific font, a Form object contains information about a form. Objects can reference other objects, so Page object 5 0 R might reference Resources object 10 0 R which contains a list of Font objects used for the page, including Font objects 16 0 R, 17 0 R, 18 0 R.
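
As a rough illustration of the number generation R format, here is a small Java sketch which pulls the references out of a piece of dictionary text with a regular expression. The class name is hypothetical and a real parser tokenises the file properly rather than using a regex, so treat this purely as a way of seeing the pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; real parsers tokenise the data instead.
class PdfReferenceScanner {

    // object number, generation number, then the letter R
    private static final Pattern REF = Pattern.compile("\\b(\\d+)\\s+(\\d+)\\s+R\\b");

    /** Returns one {objectNumber, generation} pair for each reference found in the text. */
    static List<int[]> findReferences(String dictionaryText) {
        List<int[]> refs = new ArrayList<>();
        Matcher m = REF.matcher(dictionaryText);
        while (m.find()) {
            refs.add(new int[] {Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2))});
        }
        return refs;
    }

    public static void main(String[] args) {
        String fonts = "<< /F1 16 0 R /F2 17 0 R /F3 18 0 R >>";
        for (int[] ref : findReferences(fonts)) {
            System.out.println("object " + ref[0] + ", generation " + ref[1]);
        }
    }
}
```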

The objects can also be thought of as a Tree. This is what allows any page to be opened quickly. The PDF root object points to the list of pages which point to the resources they use and their contents. 

Two identical looking PDFs can be very different inside

The PDF specification is very broad and flexible so there are lots of different ways to achieve the same result. The specification does not enforce any approach so all the PDF creation tools do things in different ways. If you have a strange PDF, it is always worth seeing what the Producer or Creator settings are.

Images are 'ripped' up inside a PDF

When a PDF is created, images are broken up into their pixel and colour data so that they can be compressed as efficiently as possible. JPEG data may well be stored in a JPEG compression format (DCTDecode or JPXDecode) but it may still need to have colour information applied.

Essential reference material - The PDF Reference Guide

Adobe produces a detailed specification, the PDF Reference guide, which is free to download. It is very big and there is an awful lot to it. Ideally, a beginner should start with the outline of the file format and just the areas they need to understand.

The PDF Reference goes into considerable detail on the file format. But it may not be written from the precise viewpoint you need, and Adobe allows considerable latitude in what is acceptable. While there are lots of examples, it is possible for tools to do things in other ways.

What makes a PDF 

A PDF file should ideally have a .pdf file type, an xref pointer in the last 1024 bytes of its data, and the first line of the file should be the version. But there is quite a lot of variation in what is actually allowed in a PDF and how useful a PDF is. A PDF file can contain fonts and editable text or just be a wrapper around an image.

At the end of the day, if it opens in Acrobat it is accepted as a PDF and you need to handle it...
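
If you want to try these checks yourself, here is a minimal Java sketch of the two tests described above - a version on the first line and a startxref keyword in the last 1024 bytes. The class and method names are my own invention for illustration, and as noted, real-world files can fail these checks and still open in Acrobat.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; real files are often less tidy than this assumes.
class PdfSniffer {

    /** Returns the version from a "%PDF-x.y" first line (for example "1.4"), or null. */
    static String headerVersion(byte[] data) {
        String head = new String(data, 0, Math.min(data.length, 16), StandardCharsets.ISO_8859_1);
        if (!head.startsWith("%PDF-")) {
            return null;
        }
        int end = 5;
        while (end < head.length()
                && (Character.isDigit(head.charAt(end)) || head.charAt(end) == '.')) {
            end++;
        }
        return end > 5 ? head.substring(5, end) : null;
    }

    /** True if the "startxref" keyword appears in the last 1024 bytes. */
    static boolean hasXrefPointerInTail(byte[] data) {
        int from = Math.max(0, data.length - 1024);
        String tail = new String(data, from, data.length - from, StandardCharsets.ISO_8859_1);
        return tail.contains("startxref");
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4\n...objects...\nstartxref\n1234\n%%EOF\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("version: " + headerVersion(sample));
        System.out.println("xref pointer in tail: " + hasXrefPointerInTail(sample));
    }
}
```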

PDF is a collection of other technologies

There are lots of other technologies used inside the PDF file format including compression algorithms, encryption, font technologies, JavaScript and so on. This makes it harder to understand because you need to have a grasp of these as well to understand what is going on.

Use the tools

There are lots of tools (both free and commercial) on all platforms and in different languages (C, Java, Perl, PHP, etc). They make it much easier to work with PDF files, and experimenting with them (especially if you can access the source code) is a good way to understand how PDF works.

There are people to ask

I remember meeting Tom Phelps, the developer of Multivalent, at a conference in 2002. We were so pleased to find someone else we could actually have a conversation with, we spent the whole night discussing PDF issues at the pub afterwards. Everyone else in the bar complained it was the most boring night of their lives, but we both had a good time...

Thanks to the Internet, you can discuss PDF issues without totally destroying your street credibility! Many of the people or companies producing PDF tools run mailing lists or discussion forums (my first job every morning is to check the JPedal Support forums) and there are more general forums. I personally find Stack Overflow a good place to ask questions.

 

Becoming an expert in PDF is not an overnight process

I started working with PDF files over 10 years ago and I still learn new things every day. PDF is a big, complex file format including a lot of technologies so it will need time to become proficient with it.

So that is my advice. What would you say to a new PDF developer? Or do you have any tips or advice?


Understanding the PDF file format - CMYK does not always mean CMYK

Posted by Mark Stephens on Tue, May 04, 2010 @ 09:13 AM

PDF is a very flexible file format in which colour can be represented in lots of different ways. This allows great flexibility and also reflects the fact that PDF is used in many different environments.

One of the most useful colour formats is CMYK which matches how professional printers work. Colours are made by mixing together 4 inks - Cyan, Magenta, Yellow and Black (it is actually called Key and is the K in CMYK). PDFs can be created which professional printers can use and users can be sure that the printed output is correct.

However, it turns out that some images in PDFs are not actually CMYK - they use a different form of encoding, called YCCK. Most of the time, this is hidden from the user, but if you are working with PDFs or doing any development, you may need to understand what is going on.

YCCK does not have its own type - it is always treated as CMYK and detected internally. If you save out DCTDecoded data which is flagged as CMYK it may well be YCCK - there is a flag in the header to show if it is. 
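
As a sketch of where such a flag can be found, the Java below walks the marker segments at the start of a JPEG stream looking for the Adobe APP14 segment and returns its transform byte, where a value of 2 conventionally means YCCK. I am assuming that this is the flag in question, and the class name is made up for illustration. It only handles simple, well-formed headers (no padding bytes between segments).

```java
// Hypothetical helper for illustration only; assumes a simple, well-formed JPEG header.
class AdobeTransformFlag {

    /** Returns the Adobe APP14 transform byte, or -1 if no such segment is found. */
    static int readTransform(byte[] jpeg) {
        if (jpeg.length < 4 || (jpeg[0] & 0xFF) != 0xFF || (jpeg[1] & 0xFF) != 0xD8) {
            return -1; // no SOI marker, so not a JPEG stream
        }
        int pos = 2;
        while (pos + 4 <= jpeg.length && (jpeg[pos] & 0xFF) == 0xFF) {
            int marker = jpeg[pos + 1] & 0xFF;
            if (marker == 0xDA) {
                break; // start of scan: the header segments are over
            }
            // segment length is big-endian and includes its own 2 bytes
            int length = ((jpeg[pos + 2] & 0xFF) << 8) | (jpeg[pos + 3] & 0xFF);
            if (length < 2) {
                return -1; // corrupt segment
            }
            int payload = pos + 4;
            boolean isAdobe = marker == 0xEE && length >= 14 && payload + 12 <= jpeg.length
                    && jpeg[payload] == 'A' && jpeg[payload + 1] == 'd'
                    && jpeg[payload + 2] == 'o' && jpeg[payload + 3] == 'b'
                    && jpeg[payload + 4] == 'e';
            if (isAdobe) {
                // "Adobe" (5 bytes), version (2), flags0 (2), flags1 (2), then transform
                return jpeg[payload + 11] & 0xFF;
            }
            pos += 2 + length;
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] header = {
            (byte) 0xFF, (byte) 0xD8,                               // SOI
            (byte) 0xFF, (byte) 0xEE, 0x00, 0x0E,                   // APP14, length 14
            'A', 'd', 'o', 'b', 'e', 0x00, 0x64, 0, 0, 0, 0, 0x02,  // transform = 2
            (byte) 0xFF, (byte) 0xDA                                // SOS
        };
        int transform = readTransform(header);
        System.out.println(transform == 2 ? "YCCK" : "transform flag = " + transform);
    }
}
```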

YCCK also consists of 4 components. As with CMYK, there is a black element (the K value) but instead of Cyan, Magenta and Yellow, there is a luma value (Y) and 2 chroma values (Cb and Cr). The maths on this is quite fiddly so the best way to think of this is that the information is encoded not in terms of ink colours but in terms of how your eye sees the colour. Your eye is more sensitive to the luminance value than to the chrominance values. By separating off the chroma values they can be compressed more (reducing the filesize) without the eye noticing - you just can't do this with CMYK.

So that explains why it might be used, but what about actually using the data? What you can do is convert the 3 YCC colour values into 3 CMY colour values. Add back the K component and you have CMYK.

As with lots of colour operations, there are 2 ways to do this:-

1. With a mathematical formula. This provides a fast approximation but is not always correct, especially on very dark or light colours.

2. Use colour profiles - these files are essentially very precise lookup tables allowing accurate mapping from one colourspace to another. They are more precise but slower, at least in Java.
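
The first (formula) approach can be sketched in Java as below. This is not the JPedal code - it is a minimal illustration which assumes the usual JFIF-style YCbCr equations and treats CMY as the complement of the resulting RGB values. Whether the values need complementing (or arrive already inverted) depends on how the encoder stored the data, so check against real files before trusting it.

```java
// Hypothetical helper for illustration only; the equations and the complement step are assumptions.
class YcckToCmyk {

    /** Rounds and limits a value to the 0-255 range. */
    static int clamp(double value) {
        return (int) Math.max(0, Math.min(255, Math.round(value)));
    }

    /** Converts one YCCK sample (each component 0-255) into {C, M, Y, K}. */
    static int[] convert(int y, int cb, int cr, int k) {
        // JFIF-style YCbCr to RGB
        double r = y + 1.402 * (cr - 128);
        double g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128);
        double b = y + 1.772 * (cb - 128);
        // CMY as the complement of RGB; K passes through unchanged
        return new int[] {255 - clamp(r), 255 - clamp(g), 255 - clamp(b), k};
    }

    public static void main(String[] args) {
        int[] cmyk = convert(128, 128, 128, 40);
        System.out.println(cmyk[0] + " " + cmyk[1] + " " + cmyk[2] + " " + cmyk[3]);
    }
}
```

The clamping is where the approximation shows most clearly - very dark or very light values can fall outside 0-255 and get clipped.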

So if you are doing some serious work with the CMYK colorspace or saving out CMYK data, do be aware that not all CMYK is CMYK and it may need conversion into 'proper' CMYK.


Understanding the PDF file format - text streams

Posted by Mark Stephens on Tue, Apr 06, 2010 @ 10:53 AM

Inside a PDF is a Postscript-like stream of commands which describe the page - they draw the text, images or shapes. You can extract this stream and look at it directly. It looks like this - I have added comments in brackets after each command to explain.

BT (begin a block of text)

/F13 12 Tf (Choose Font F13 and set size to 12)

288 720 Td (move the text location relative to where it now is)

(ABC) Tj (Draw the Text ABC)

ET (End the text block)

So far so good, but this code is actually rather deceptive. Most people assume from looking at it that Tj takes a String (ABC), but it does not. It actually contains a set of binary index values. These are then decoded using the Font's inbuilt encoding - it can be one of the Standard Encodings (WIN, MAC, EXPERT, etc) which are defined in Appendix D of the PDF Reference. For subsetted fonts (where only the characters used in the PDF are included) they could be any arbitrary set of values - they will have no meaning until you look them up with the Font's custom encoding table (the Differences object).

The reason they look like text in the example above and in those in the PDF Reference guide is that the values for WIN encoding happen to be the same as the ASCII characters. So the binary value for A shows up as A if it is WIN encoded.

However, they are not actually text values and should not be treated as such unless you can guarantee that the only PDFs you look at will be WIN encoded. Otherwise you will get a very nasty surprise on some PDFs...
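
Here is a minimal Java sketch of the idea. The class and the two lookup tables are invented for illustration (a real font's table comes from its encoding and Differences data), but it shows why the same bytes can mean completely different things depending on the font.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; the tables below are made-up examples.
class TjDecoder {

    /** Maps each byte of the Tj operand to text using the font's code-to-text table. */
    static String decode(byte[] codes, Map<Integer, String> encoding) {
        StringBuilder out = new StringBuilder();
        for (byte code : codes) {
            String text = encoding.get(code & 0xFF);
            out.append(text != null ? text : "\uFFFD"); // unknown code
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A WIN-like table: the codes happen to match ASCII.
        Map<Integer, String> winLike = new HashMap<>();
        winLike.put(65, "A");
        winLike.put(66, "B");
        winLike.put(67, "C");

        // A subsetted font: arbitrary codes which mean nothing without the table.
        Map<Integer, String> subset = new HashMap<>();
        subset.put(1, "A");
        subset.put(2, "B");
        subset.put(3, "C");

        System.out.println(decode(new byte[] {65, 66, 67}, winLike)); // prints ABC
        System.out.println(decode(new byte[] {1, 2, 3}, subset));     // prints ABC
    }
}
```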


Working out PDF page size in inches or centimetres

Posted by Mark Stephens on Thu, Mar 18, 2010 @ 10:07 AM

The size of a PDF page is generally defined by the CropBox or MediaBox setting for each page. Each is a set of 4 numbers giving the x,y co-ordinates of two opposite corners of the page, measured in units of 1/72 of an inch, so when the first corner is 0 0 the last two numbers are the width and height. A common value is 0 0 595 842 for an A4 page.

However, most people are interested in actual units and that is what Adobe Reader or Acrobat displays. Here is the output from a file with CropBox[0 0 585 832] and MediaBox[0 0 585 832]

So where does this number of centimetres come from? 

The default unit in a PDF is 1/72 of an inch (in other words 72 dpi), so we can convert the CropBox and MediaBox to inches by dividing these numbers by 72. This gives us 8.125 inches by 11.556 inches.

There are 2.54 centimetres in an inch so multiplying by this we get 20.64 cm by 29.35 cm.

So that is how Adobe creates the size from the raw PDF Crop or Media box sizes.
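
The arithmetic above is easy to put into code. This is a plain Java sketch with made-up class and method names (it does not use the JPedal API), assuming the default of 72 units per inch.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; assumes the default 72 units per inch.
class PageSize {

    static double pointsToInches(double points) {
        return points / 72.0;
    }

    static double pointsToCentimetres(double points) {
        return pointsToInches(points) * 2.54;
    }

    public static void main(String[] args) {
        double width = 585;  // from CropBox[0 0 585 832]
        double height = 832;
        System.out.println(String.format(Locale.ROOT, "%.3f in x %.3f in",
                pointsToInches(width), pointsToInches(height)));
        System.out.println(String.format(Locale.ROOT, "%.2f cm x %.2f cm",
                pointsToCentimetres(width), pointsToCentimetres(height)));
    }
}
```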

If you are interested in using JPedal to generate the size values of each page, here is a simple example.

 


PDF page size in bytes

Posted by Mark Stephens on Sun, Oct 18, 2009 @ 04:49 AM

An interesting question on our forums made me look at PDF files in a new light. We know how big a PDF file is in bytes, but how big is each page?

To answer this, you need to understand a little about the contents of a PDF file and how a file is constructed. A PDF file is a dump of PDF objects. It consists of the objects themselves and a trailer - metadata so that each object can be found. The objects usually consist of a set of key values and often a compressed binary stream of data. The binary data is usually image data or colour data or the set of instructions used to draw the page. The data is decompressed in memory when the object is read.

A page itself does not have a size - you cannot say it starts at a certain point and ends at another. What you could say however is that a page consists of a set of objects:-

1. The Page objects which describe the page and contain the binary stream of page instructions used to construct its contents.

2. The local Resources objects which contain colors, fonts and images used on the page.

3. Global Resource objects (which may be used on any page) and also consist of colors, fonts and images.

4. A proportion of the PDF file metadata.

 The last item is probably small enough that it can be reasonably ignored and we can also reasonably ignore the non-binary content of the objects.

So a good guess for a pagesize is the sum of the binary streams which might be used on it. The compressed size probably provides a good guess as to the PDF page size in the PDF file and the uncompressed size might well be an equally good guess at how much an unrendered page (ie not drawn) would use in memory if you needed that.
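
As a sketch of that estimate, here is some Java which adds up stream lengths for the objects a page might use. The class name, the object numbers and the lengths are all invented for illustration - in practice you would get them from whichever PDF library you use to walk the page and its resources.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; object numbers and lengths are invented.
class PageByteEstimate {

    /** Sums the stream lengths of the given objects, counting each object once. */
    static long estimate(List<Integer> objectIds, Map<Integer, Long> streamLengths) {
        long total = 0;
        for (int id : new LinkedHashSet<>(objectIds)) {
            Long length = streamLengths.get(id);
            if (length != null) { // objects without a stream add nothing
                total += length;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Map<Integer, Long> compressedLengths = new HashMap<>();
        compressedLengths.put(6, 1200L);   // page content stream
        compressedLengths.put(16, 30000L); // embedded font
        compressedLengths.put(20, 85000L); // image

        // Page 5 uses its contents (6), a Resources object (10), a font (16) and an image (20).
        List<Integer> usedByPage = Arrays.asList(5, 6, 10, 16, 20);
        System.out.println(estimate(usedByPage, compressedLengths) + " bytes");
    }
}
```

Passing in the uncompressed lengths instead would give the in-memory guess. Note that global resources shared between pages are counted in full for every page that uses them, so the per-page figures will add up to more than the file size.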


Learning about PDF

Posted by Mark Stephens on Tue, Aug 11, 2009 @ 06:34 AM

The PDF file format is very useful and well-documented, but it is also quite complicated and it does not work how most people imagine. It is structured very differently from a Word or Excel document.

Most of the time, this is not an issue - you can just use PDF files without knowing anything about them and just enjoy the benefits. There comes a time though, when you may need to start to dabble. So this article is designed to give you some starting points.

It is worth getting to grips first with the basic idea that a PDF file is essentially a set of linked objects (so each page has a page object, which may include font objects defining the fonts, XObjects storing image data and so on). Then you can look at all the different types of objects. The PDF file contains all these objects and their locations (the references) so that they can be read as needed.

The definitive guide to the PDF file is the Adobe PDF reference guide. It is a very complete and comprehensive (and equally dull) volume which explains most of the internal working of the PDF file format. It is not designed to tell you about how to create or modify the PDF file - just to provide all the details. It is not an easy read, but the first 2 chapters do provide an excellent introduction to the PDF file format.

A slightly less technical introduction to the internals of a PDF file can be found at Wikipedia. This also gives you a detailed insight into the structure of the file.

Once you have started to explore the internal guts of the PDF file format you can open up a few PDF files. It is not recommended that you directly edit a file (even adding a space can break it), but you can open it in a text editor and view it. Much of the data is encrypted or compressed so a more useful tool is Acrobat 9. I explained how you can use this to examine the internals of a PDF file in my first posting.

To really do much with the PDF file you will need a third party library to manipulate the PDFs. We always recommend iText as a good starting point as it's free and well-documented, with lots of examples.

So if you have reached the point where you want to start to explore the PDF file format, I hope this has provided some useful starting points and do feel free to post your experiences.

