Java PDF Blog

PDF solutions for big and small customers

Java and PDF development - our personal experiences and discoveries


Why we need to see your PDF files...

Posted by Mark Stephens on Fri, Jul 02, 2010 @ 08:17 AM

What makes writing a PDF parser especially interesting (i.e. complex) is that the specification is often ambiguous and that PDF is a very complex structure. Displaying a PDF file requires the parser to correctly scan the PDF object data structure, to correctly decode and assemble all the data, and then to parse the stream of PostScript-like drawing commands. There could be issues at any level.

Occasionally we have to tweak our parser to allow for bugs in our code, things we had not considered, areas where the PDF does something which is permissible but not clear from the spec or even cases where the PDF does not actually follow the specification. Most PDF creation tool writers create a PDF according to their interpretation of the PDF specification and if it opens in Acrobat, they leave it at that. If it does not open in our parser, it is obviously our fault, not theirs.

Over time we have become very adept at tweaking our code to allow for all the little idiosyncrasies of various PDF tools - we have lots of interesting internal flags in our source code, and the excellent tracing in IntelliJ IDEA (my preferred Java IDE) allows us to follow the flow through code we know very well. It is normally a quick fix and regression test.

Sometimes, people send screenshots or say the file does not open. Unfortunately, it is very hard to help in this case. Send us the file and we can quickly find the issue. Screenshots are generally like giving a car mechanic a picture of your car and asking what is wrong - let him open up the bonnet and hear the engine and you'll get a quick answer. 


The real reasons why you should be going to BoS 2010?

Posted by Mark Stephens on Fri, Jun 25, 2010 @ 08:54 AM
The Business of Software conference is happening in October 2010 in Boston. You will see lots of postings telling you good reasons to go. Yes, the Speakers list looks pretty impressive, and I have even heard of some of them! It would be cool to meet these people in person but let's face it, the videos will be posted on YouTube eventually... Most conferences are a total waste of time - you just sit there tweeting and surfing the Internet. Why bother???

Well, for starters, it is in Boston. It is one of the most beautiful cities in the USA, especially in the Autumn and it has something for everyone - food, history, shops, entertainment, sport...

Walk over to the boating on the Lake in Central Park and see if you can remember which films it was in (my kids tell me it was definitely in the Parent Trap with Hayley Mills).
 
 

Give my regards to the Penguins at the Aquarium. They look even better in real life.
 
 

Go down to MIT or Harvard and dream (or hire some staff maybe).

The Boston Red Sox are playing the New York Yankees at home.

Pop over to the Kennedy Library or jump on the train down to Salem to hunt down some witches.

Maybe just get a photograph of you outside the Cheers Bar.

And the conference helpfully starts on Sunday night so you obviously need to fly out on the Friday night to 'prepare' (i.e. spend the weekend chilling out in Boston).

So, that's the weekend taken care of - what about the actual conference? What I think makes the conference different is that it is not organised by professional conference organizers. It is very competently done but it can't make much of a profit - all those free drinks, there are usually lots of books and other freebies. Remember the empty suitcase to bring them all home in! Maybe the organizers feel guilty about making so much money the rest of the year that they feel they need to give some back, but who am I to complain?

It is also very different meeting people in the flesh and hearing the speakers live - the videos posted are good but only give a pale feel of what it is like to attend. There has been a real buzz in the years I have attended and you will leave exhausted but full of ideas.

The most important reason to go there, however, is that you will find 300 of the brightest people in the Software industry are also going to be there. It is your chance to meet up with lots of other people who might be potential clients or customers, but who will definitely have lots of valuable experience to share with you, even if it is only to reassure you that times are tough.

See you in Boston!


Using the PDF file format for 'plumbing' at Random House

Posted by Mark Stephens on Thu, Jun 24, 2010 @ 10:23 AM

I was visiting a client (Random House) yesterday and it struck me that PDF is actually very commonly used in-house to provide other products and services. Although the end user never sees the PDF files, they are critical to the whole setup.

Random House is one of the world's most successful book publishers, and their material comes from all sorts of sources - multiple versions of InDesign and Quark, other publishing tools, PDF files, OCR TIFF - the list is potentially endless... Then there is archive material of older books, much of which needs to be scanned in. All of these can be turned into PDF files, which gives one common internal format.

A single file standard which can cope with all these very different sources of content is essential. This allows all content to flow into a single system and be processed by one workflow. And because it is PDF you get access to the text, its co-ordinates and good quality images of the page. Ideal for building some clever widgets.

They have a nifty little site which not only allows users to purchase books but also to preview and search some of the book contents. I have a Kindle, an iPod and a Mac but still like to read printed material, so I think books will remain, just as radio has not been killed off by TV but remained a distinct format with its own advantages.

So I can appreciate being able to search and dip into books online which I can also then obtain to read. Here is an example where you can click on the book, browse pages and search the text. There is not a PDF file to be seen as a user, but the PDF format is behind the scenes allowing it all to work.

Do you have any examples of 'hidden uses' of PDF files? 


Why can't I just open and edit a PDF file?

Posted by Mark Stephens on Tue, Jun 15, 2010 @ 04:51 AM

People sometimes try to edit a PDF file by opening the file in a text editor. This very rarely works, for three reasons.

Firstly, a PDF file is effectively a dump of PDF objects. The file contains a cross-reference table giving the exact byte offset of each object from the start of the file, and a pointer giving the byte offset of the reference table itself. If you add or delete a character, or even resave the file from an editor which converts line endings from one platform format to another, all these numbers will be incorrect. You would need to update them all. To prove it, just try opening a PDF, type in a space, save it and then see what happens if you try to open it...
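To make the offset problem concrete, here is a minimal sketch in Java. It does not read a real PDF: the class name, the two tiny objects and the helper methods are all made up for illustration, and the reference table itself is left out. It simply records where each object starts, inserts a single space the way a text editor might, and checks whether the recorded offsets still point at the start of an object.

```java
import java.nio.charset.StandardCharsets;

public class XrefOffsetDemo {
    // Hypothetical, heavily simplified file contents (not a complete PDF).
    static final String HEADER = "%PDF-1.4\n";
    static final String OBJ1 = "1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n";
    static final String OBJ2 = "2 0 obj\n<< /Type /Pages /Count 0 >>\nendobj\n";

    /** The header and objects as bytes. */
    static byte[] body() {
        return (HEADER + OBJ1 + OBJ2).getBytes(StandardCharsets.US_ASCII);
    }

    /** Byte offsets of objects 1 and 2 from the start of the file, as a reference table would record them. */
    static int[] offsets() {
        return new int[] { HEADER.length(), HEADER.length() + OBJ1.length() };
    }

    /** True if the bytes at the recorded offset really start with "<objectNumber> 0 obj". */
    static boolean offsetIsValid(byte[] file, int objectNumber, int offset) {
        byte[] expected = (objectNumber + " 0 obj").getBytes(StandardCharsets.US_ASCII);
        if (offset < 0 || offset + expected.length > file.length) {
            return false;
        }
        for (int i = 0; i < expected.length; i++) {
            if (file[offset + i] != expected[i]) {
                return false;
            }
        }
        return true;
    }

    /** Simulates a text-editor edit by inserting one space at the given position. */
    static byte[] insertSpace(byte[] file, int position) {
        byte[] edited = new byte[file.length + 1];
        System.arraycopy(file, 0, edited, 0, position);
        edited[position] = ' ';
        System.arraycopy(file, position, edited, position + 1, file.length - position);
        return edited;
    }

    public static void main(String[] args) {
        byte[] original = body();
        int[] offsets = offsets();
        // Insert a single space just after the header line.
        byte[] edited = insertSpace(original, HEADER.length());
        for (int i = 0; i < offsets.length; i++) {
            int objectNumber = i + 1;
            System.out.println("object " + objectNumber
                    + ": offset valid before edit = " + offsetIsValid(original, objectNumber, offsets[i])
                    + ", after edit = " + offsetIsValid(edited, objectNumber, offsets[i]));
        }
    }
}
```

One inserted byte shifts every object that follows it, so every recorded offset after the edit is off by one.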

Secondly, if you open a PDF file, much of the data is stored inside binary streams, in which data has been encrypted or compressed. If you view a PDF in a text editor you will see some text but lots of incomprehensible 'garbage'. This is the binary data. You cannot edit it, but you can easily break it just by adding a character.
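As a rough illustration of why the stream data looks like garbage, the sketch below uses the standard java.util.zip classes to compress some made-up page commands in the same zlib/deflate format that PDF uses for Flate-compressed streams. The class name and sample content are invented for the example, and it works on a byte array rather than a real PDF. It then inserts one byte into the compressed data and reports what happens when we try to decompress it.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class FlateStreamDemo {

    /** Compresses data in zlib/deflate format. */
    static byte[] deflate(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Decompresses zlib/deflate data, failing if the stream is damaged or incomplete. */
    static byte[] inflate(byte[] data) throws DataFormatException {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(data);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[1024];
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && !inflater.finished()) {
                    throw new DataFormatException("stream is truncated or needs a dictionary");
                }
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) throws DataFormatException {
        // Made-up, repetitive page commands standing in for a content stream.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20; i++) {
            sb.append("BT /F1 12 Tf (Hello World) Tj ET\n");
        }
        byte[] content = sb.toString().getBytes(StandardCharsets.US_ASCII);

        byte[] stored = deflate(content);
        System.out.println("readable bytes: " + content.length + ", stored bytes: " + stored.length);
        System.out.println("round trip intact: " + Arrays.equals(content, inflate(stored)));

        // Simulate a stray character typed into the middle of the binary data.
        int middle = stored.length / 2;
        byte[] damaged = new byte[stored.length + 1];
        System.arraycopy(stored, 0, damaged, 0, middle);
        damaged[middle] = ' ';
        System.arraycopy(stored, middle, damaged, middle + 1, stored.length - middle);
        try {
            byte[] result = inflate(damaged);
            System.out.println("damaged stream decoded; matches original: " + Arrays.equals(content, result));
        } catch (DataFormatException e) {
            System.out.println("damaged stream could not be decoded: " + e.getMessage());
        }
    }
}
```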

 

Finally, much of the PDF data needs to be looked at in connection with other data in the file. Text only makes sense by looking at the encoding on the font object, images have their data partly in XObjects and partly in ColorSpace objects, and so forth...

Some file formats such as HTML, JavaScript and most source code can be easily manipulated in a text editor. The PDF file format is not one of these and is best accessed using a library which takes away all this complexity. Fortunately there are lots of both free and commercial tools available for all the most popular languages. If you have a favourite, why not post a recommendation here?


Java Performance tuning

Posted by Mark Stephens on Thu, Jun 10, 2010 @ 11:30 AM

One of my favourite coding activities is profiling - taking a Java application and making it run faster. Every so often we set aside some time to just focus on making our code run faster.

Don't optimise code where speed does not matter. Not only does it give no benefit, it probably makes that code harder to support and may introduce bugs.  

Before you can speed anything up, you need to find the bottlenecks - which bits of the code are used most. These are the sections which are worth improving. You can find these using a Profiler.

In the Java world we are lucky to have lots of profiling tools. The two we use are the one built into NetBeans and JProfiler. Here is what JProfiler shows...

 

Methods that are frequently called or used often are worth looking at. Other methods are not worth bothering with. If you take a method which takes 2% of the time and make it twice as fast, the code will be a paltry 1% faster. If you take a routine which takes 30% of the time and make it 10% faster (a much easier task), you will get 3 times the benefit.
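The arithmetic behind those two figures can be written down in a couple of lines. This is only a sketch: the class and method names are made up, and it assumes that 'twice as fast' means the routine's own time is halved and that '10% faster' means the routine's own time drops by 10%.

```java
public class ProfilingGain {

    /**
     * Percentage of the total run time saved when a routine that accounts for
     * sharePercent of the run time has its own time cut by reductionPercent.
     */
    static double percentSaved(double sharePercent, double reductionPercent) {
        return sharePercent * reductionPercent / 100.0;
    }

    public static void main(String[] args) {
        // A method taking 2% of the time, made twice as fast (its time cut by 50%).
        System.out.println("2% method, halved: saves " + percentSaved(2, 50) + "% overall");
        // A routine taking 30% of the time, with its time cut by 10%.
        System.out.println("30% routine, 10% less: saves " + percentSaved(30, 10) + "% overall");
    }
}
```

Under those assumptions the first change saves 1% of the total run time and the second saves 3%, which is where the 'three times the benefit' comes from.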

Profiling also tells you where it might be worth caching values. If you can reduce calls to some routines, the application will run faster without you having to change the routines themselves.

You also find some surprises with Java. In one routine, we generated some data and wrote it to a ByteArrayOutputStream. It actually turned out to be 5 times faster if we called it twice instead, first to see how many bytes are created and then again with a byte[] array to store the data in... 
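For anyone who has not seen the two-pass approach, here is a minimal sketch of the idea. It is not our actual routine: the generator, the two helper stream classes and all the names are hypothetical. The first pass writes to a stream that only counts bytes, and the second pass writes into a byte[] of exactly the right size.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Arrays;

public class TwoPassWrite {

    /** Hypothetical data generator: writes count bytes of sample data to any stream. */
    static void generate(OutputStream sink, int count) throws IOException {
        for (int i = 0; i < count; i++) {
            sink.write(i & 0xFF);
        }
    }

    /** A stream that throws the data away and only counts how many bytes were written. */
    static final class CountingStream extends OutputStream {
        int count;

        @Override
        public void write(int b) {
            count++;
        }
    }

    /** A stream that stores bytes in an array whose size is known in advance. */
    static final class FixedArrayStream extends OutputStream {
        final byte[] data;
        int position;

        FixedArrayStream(int size) {
            data = new byte[size];
        }

        @Override
        public void write(int b) {
            data[position++] = (byte) b;
        }
    }

    /** The obvious version: let ByteArrayOutputStream grow as needed, then copy the result out. */
    static byte[] singlePass(int count) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        generate(out, count);
        return out.toByteArray();
    }

    /** Two passes: first count the bytes, then generate again into an exactly sized array. */
    static byte[] twoPass(int count) throws IOException {
        CountingStream counter = new CountingStream();
        generate(counter, count);
        FixedArrayStream exact = new FixedArrayStream(counter.count);
        generate(exact, count);
        return exact.data;
    }

    public static void main(String[] args) throws IOException {
        byte[] a = singlePass(100_000);
        byte[] b = twoPass(100_000);
        System.out.println("same output: " + Arrays.equals(a, b));
    }
}
```

Both versions produce the same bytes. ByteArrayOutputStream has synchronized write methods, copies its buffer whenever it grows, and copies again in toByteArray(), which may be part of the explanation, but whether the two-pass version is faster in your code is something only a profiler can tell you.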

So if you have not tried profiling your code, try NetBeans or the JProfiler demo on some code and be amazed at what is going on.

And if you want some more speed in your PDF Viewer, try our new 4.21 release here


Understanding the PDF file format - OCR PDF files

Posted by Mark Stephens on Wed, Jun 09, 2010 @ 06:25 AM

Some PDF files are generated from scanning in pages as images, and these have their own unique quirks. I hope to explain some of these, and the impact they might have, in this article...

You can usually tell an OCR PDF file from its appearance - the text on the pages has a 'jagged' bitmapped appearance to it rather than the smooth look you get with text rendered as Vector graphics. If in doubt, you can have a look at the PDF Properties for the Producer or Creator (ABBYY FineReader is a common tool for converting scanned pages into PDF files).

When pages are scanned in, the text is calculated using Optical Character Recognition software. This is not always 100% perfect. This might be because the page scan is poor quality, the text is at an angle, the font has very similar letters, and so on. To hide this fact, the text is often placed behind the image by the PDF creator. That way it still looks perfect and it is only if you start to search that you will see any errors.

Generally, each page is scanned in as a single high resolution image which is usually embedded as a large black and white or grayscale image.

This has two big implications for you as users of PDF files.

First of all, the files are bigger because they contain both the text (or an OCR tool's best guess) and a high resolution image. Sometimes this page image will have real images (such as page logos) on it.

Secondly, just because it looks like a perfect representation of the page, it does not mean that the text is actually correct and can be searched. 

Sometimes, the original book copy is the only copy available so this is the only way to get hold of the content. Google currently has a big project to scan in lots of old books - many created before computers even existed. 

So PDF files created with OCR are okay (and often the only thing available), but not as useful as a 'proper' PDF file version if you can get it.


Understanding the PDF file format - Layers

Posted by Mark Stephens on Fri, Jun 04, 2010 @ 07:27 AM

One of the really cool features about PDF files is that they are not a boring, static file format - they can interact with the user and the display can change. One of my favourite interactive features is Layers.

You can think of Layers as a separate overlaid page on which text or images can be added. The Layer can have a name and the user (or the PDF itself) can alter whether this Layer is visible or not.

This has lots of practical and fun uses. There is a PDF version of the Hangman game which uses JavaScript to think of a word and progressively update the display as the user plays the game.

All the parts of the Hangman drawing are on Layers and updated by JavaScript.

A more practical example is to allow additional information to be included on plans, diagrams or maps.

And we can see just the detail we want with a couple of clicks... 

 

Layers are also useful as we can display different things in a printout compared with onscreen display.

So Layers offer a very powerful and user-friendly way to enhance your PDF files for useful purposes or just for fun. If you have seen any really interesting uses of Layers, why not drop me an email or post the link here?

If you would like to play with Layers in JPedal, the functionality is built into the viewer and there is a tutorial here.


Understanding the PDF file format - Text, shapes and images

Posted by Mark Stephens on Wed, May 26, 2010 @ 02:50 AM

I have been looking at an issue for a potential client recently which required the generation of different views of the page. This is interesting because it allows me to show you the internal workings of the PDF file format rather elegantly. It seems to be an increasingly common activity from our clients these days as they build web applications to display PDFs and need to separate out text and images.

What is in a PDF

A PDF can contain bitmapped images, Vector graphics and text (which can be Vector or bitmapped depending on the font used). Sometimes, you may be surprised at what you find. While a PDF may look like it contains text, the lettering may actually be part of the image (as in a scan) or shapes (where the text was converted to paths). Here is a rather nice PDF page showing what is going on...

Here is the complete page

 

which consists of images

text and vector graphics

and just the text

(the white text is invisible on a default white background)

  

The white text in particular illustrates how dependent on each other the layers are - we could generate it as a transparent image and add a coloured background if we wanted to highlight the text layer on its own. 

Creating your own separations

If you would like to create your own separations, there is a new support page explaining how to use the feature in our JPedal PDF library - you will need version 4.20 or later. 

 


Software Development: Are we listening carefully?

Posted by Sam Howard on Tue, May 25, 2010 @ 01:53 AM

The quality of documentation and tools in an area of software development has an incredible impact on us developers. They both influence how fast we can work, the quality of the software we write, and even how we feel about our work.

I've recently been working on adding some support for hinting into our TrueType font renderer. I was delighted to find a number of excellent tools for working with TrueType fonts, such as Microsoft's font tools and FontForge, but unfortunately found the TrueType documentation rather lacking.

Instructed font hinting like that used in TrueType fonts is a complicated subject. Each glyph contains a small (ish!) program written in TrueType byte code, which is run in order to manipulate the points. This manipulation was initially designed primarily for grid fitting, a process which improved the appearance of text on screen before anti-aliasing became a feasible option, but is now also used extensively in foreign language fonts to move and change the shape of components of a compound glyph.

 

Japanese characters before and after their glyph programs have been run
Chinese characters before and after their glyph programs are run.


As can be expected with such a complex field, mistakes and ambiguities have crept into the documentation, and while generally very good, even the tools have some flaws.

TrueType was initially developed by Apple in response to Adobe's rather restrictively licensed Type 1 font technology, but was later licensed by Microsoft, eventually making it the de facto standard on desktop computers. As a result, there are two primary sources of documentation – Apple and Microsoft. While they supposedly define exactly the same system, there are occasional direct contradictions in what they say! In fact, in Apple's guide the definition and example given for one of the most important byte codes is completely wrong.

This wouldn't be surprising in a new document – as I said, it's a complex topic – but these guides were written in the mid-nineties! I can't be the first to have found these mistakes, but since in both cases there is no obvious way of contacting the authors, they've stayed incorrect for almost 15 years.

This, to me, highlights the need for a clear and direct line of communication between those writing specifications and those who use them – something we've been trying hard to achieve. Anything unclear? Let us know! We're here to help.


What new PDF developers need to know

Posted by Mark Stephens on Fri, May 21, 2010 @ 11:03 AM

We had a discussion last week about what tips would help new developers get to grips when starting to work with PDF files. Here are some of the ideas which came out of that. It is very much a personal suggestion list so please feel free to add your own suggestions.

Do not think of a PDF file as a 'file'

When you start to learn HTML, you can open a file, hack it in a text editor and see what happens. You can't do this with a PDF file. It is essentially a binary data structure - lots of the information cannot be seen if you open the raw file and editing one byte could potentially break the whole file. There are lots of really good tools out there on multiple platforms for examining the contents of a PDF file so you should not need to try and open the file directly.

PDF is all about objects

What the PDF file essentially contains is a whole lot of PDF objects. They all have a unique ID, and a reference to an object has the format 'number generation R' (so you might see 3 0 R, 144 0 R). Most of the time generation is zero but not always.

There are lots of types of objects - a Page Object describes a particular page, a Font object contains all the information about a specific font, a Form object contains information about a form. Objects can reference other objects, so Page Object 5 0 R might reference Resources object 10 0 R which contains a list of Font objects used for the page, including Font objects 16 0 R, 17 0 R, 18 0 R.

The objects can also be thought of as a Tree. This is what allows any page to be opened quickly. The PDF root object points to the list of pages which point to the resources they use and their contents. 
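Here is a toy sketch of that idea in Java, using the same object numbers as the example above. It is not a PDF parser: the objects are just strings held in a map, the dictionary contents are made up, and the class and method names are invented for illustration. It finds every 'number generation R' reference in an object and follows the references to see which objects a page depends on.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ObjectReferences {
    // Matches references such as "10 0 R" and captures the number and generation.
    private static final Pattern REFERENCE = Pattern.compile("\\b(\\d+) (\\d+) R\\b");

    /** Returns the "number generation" ID of every object referenced in the given body. */
    static List<String> references(String body) {
        List<String> ids = new ArrayList<>();
        Matcher matcher = REFERENCE.matcher(body);
        while (matcher.find()) {
            ids.add(matcher.group(1) + " " + matcher.group(2));
        }
        return ids;
    }

    /** Collects the IDs of all objects reachable from the start object by following references. */
    static Set<String> reachable(Map<String, String> objects, String start) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> todo = new ArrayDeque<>();
        todo.push(start);
        while (!todo.isEmpty()) {
            String id = todo.pop();
            String body = objects.get(id);
            // Skip missing objects and anything already visited.
            if (body == null || !seen.add(id)) {
                continue;
            }
            for (String ref : references(body)) {
                todo.push(ref);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // Made-up object bodies keyed by "number generation".
        Map<String, String> objects = new HashMap<>();
        objects.put("5 0", "<< /Type /Page /Resources 10 0 R >>");
        objects.put("10 0", "<< /Font << /F1 16 0 R /F2 17 0 R /F3 18 0 R >> >>");
        objects.put("16 0", "<< /Type /Font /BaseFont /Helvetica >>");
        objects.put("17 0", "<< /Type /Font /BaseFont /Courier >>");
        objects.put("18 0", "<< /Type /Font /BaseFont /Symbol >>");

        System.out.println("Page 5 0 references: " + references(objects.get("5 0")));
        System.out.println("Objects needed to show page 5 0: " + reachable(objects, "5 0"));
    }
}
```

A real file also needs the reference table to find where each object lives, and real objects hold much more than this, but the 'follow the references' pattern is the same.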

Two identical looking PDFs can be very different inside

The PDF specification is very broad and flexible so there are lots of different ways to achieve the same result. The specification does not enforce any approach so all the PDF creation tools do things in different ways. If you have a strange PDF, it is always worth seeing what the Producer or Creator settings are.

Images are 'ripped' up inside a PDF

When a PDF is created, images are broken up into their pixel and colour data so that they can be compressed as efficiently as possible. JPEG data may well be stored in a JPEG compression format (DCTDecode or JPXDecode) but it may still need to have colour information applied.

Essential reference material - The PDF Reference Guide

Adobe produces a detailed specification, the PDF Reference guide, which is free to download. It is very big and there is an awful lot to it. Ideally, a beginner should start with the outline of the file format and just the areas they need to understand.

The PDF Reference goes into considerable detail on the file format. But it may not be written from the precise viewpoint you need, and Adobe also allows considerable latitude in what is acceptable. While there are lots of examples, it is possible for tools to do things in other ways.

What makes a PDF 

A PDF file should ideally have a .pdf file extension, an xref pointer in the last 1024 bytes of its data, and the first line of a PDF should be the version. But there is quite a lot of variation in what is actually allowed in a PDF and how useful a PDF is. A PDF file can contain fonts and editable text or just be a wrapper around an image.

At the end of the day, if it opens in Acrobat it is accepted as a PDF and you need to handle it...
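The two structural checks mentioned above (a version line at the start and an xref pointer near the end) are easy to sketch in Java. This is a rough illustration only: the class and method names are made up, it works on a byte array you have already loaded, and, as noted above, real-world files that open in Acrobat may not pass such strict checks.

```java
import java.nio.charset.StandardCharsets;

public class PdfSanityCheck {

    /** True if the data starts with a "%PDF-" version line. */
    static boolean hasVersionHeader(byte[] data) {
        int length = Math.min(data.length, 5);
        return new String(data, 0, length, StandardCharsets.ISO_8859_1).equals("%PDF-");
    }

    /**
     * Looks for "startxref" in the last 1024 bytes and returns the byte offset
     * that follows it, or -1 if no usable pointer is found.
     */
    static long xrefPointer(byte[] data) {
        int tailStart = Math.max(0, data.length - 1024);
        String tail = new String(data, tailStart, data.length - tailStart, StandardCharsets.ISO_8859_1);
        int marker = tail.lastIndexOf("startxref");
        if (marker < 0) {
            return -1;
        }
        // Skip the whitespace after the keyword, then read the run of digits.
        String rest = tail.substring(marker + "startxref".length()).trim();
        int end = 0;
        while (end < rest.length() && Character.isDigit(rest.charAt(end))) {
            end++;
        }
        if (end == 0) {
            return -1;
        }
        try {
            return Long.parseLong(rest.substring(0, end));
        } catch (NumberFormatException e) {
            return -1; // too many digits to be a sensible offset
        }
    }

    public static void main(String[] args) {
        // A made-up skeleton; the offset 123 is a placeholder, not a real table position.
        String sample = "%PDF-1.4\n1 0 obj\n<< >>\nendobj\nstartxref\n123\n%%EOF\n";
        byte[] data = sample.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("version header found: " + hasVersionHeader(data));
        System.out.println("xref pointer: " + xrefPointer(data));
    }
}
```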

PDF is a collection of other technologies

There are lots of other technologies used inside the PDF file format including compression algorithms, encryption, font technologies, JavaScript and so on. This makes it harder to understand because you need to have a grasp of these as well to understand what is going on.

Use the tools

There are lots of tools (both free and commercial) on all platforms and in different languages (C, Java, Perl, PHP, etc). They make it much easier to work with PDF files, and experimenting with them (especially if you can access the source code) is a good way to understand how PDF works.

There are people to ask

I remember meeting Tom Phelps, the developer of Multivalent, at a conference in 2002. We were so pleased to find someone else we could actually have a conversation with, we spent the whole night discussing PDF issues at the pub afterwards. Everyone else in the bar complained it was the most boring night of their lives, but we both had a good time... 

Thanks to the Internet, you can discuss PDF issues without totally destroying your street credibility! Many of the people or companies producing PDF tools run mailing lists or discussion forums (my first job every morning is to check the JPedal Support forums) and there are more general forums. I personally find Stack Overflow a good place to ask questions.

 

Becoming an expert in PDF is not an overnight process

I started working with PDF files over 10 years ago and I still learn new things every day. PDF is a big, complex file format including a lot of technologies so it will need time to become proficient with it.

So that is my advice. What would you say to a new PDF developer? Or do you have any tips or advice?
