Java PDF Blog

PDF solutions for big and small customers

Java and PDF development - our personal experiences and discoveries


The Java PDF blog is moving (action required)

Posted by Mark Stephens on Fri, Aug 27, 2010 @ 10:07 AM

We have been writing the Java PDF blog for over a year now and have decided to move it to WordPress. This is a much better tool and will allow us to do far more.

So, if you would like to continue to follow the blog (which I hope you will!), you will now find it at http://www.jpedal.org/PDFblog/

You can sign up to the RSS at http://www.jpedal.org/PDFblog/?feed=rss2

We will be leaving the old site as an archive but also updating and republishing the technical articles on the new site, along with lots of new material. I hope you will join us there.


Adobe's thoughts on Open Source and Open Standards

Posted by Mark Stephens on Tue, Aug 24, 2010 @ 10:33 AM

Adobe is a very important player not just in the PDF arena but in a whole host of critical areas. As the producers of Flash, Flex and the PDF specification, they will have a major input into future developments in the Internet, smartphones, publishing and desktops. Indeed, their actions may partly determine whether these solutions thrive or wither and are replaced by alternatives.

So what Adobe says is important to everyone even if you do not use their products.

Adobe has always been reasonably open about their plans and their ideas, and they have been running an interesting series of interviews on their blog. If you want to hear their take on Open standards and Open source as well as Open Government, click here.  What do you think?


Eclipse PDF viewer plugin and the marketplace

Posted by Mark Stephens on Fri, Aug 20, 2010 @ 10:24 AM

We have always offered a free PDF viewer plugin for Eclipse. It is a nifty little plugin which lets you not only view and search PDF files but also bookmark them, so that you can easily access them from inside Eclipse.

Every year there is a big release of the next version of the Eclipse platform and we always like to do an update shortly after release.

The Eclipse release always has a number as well as a name so the latest version is Helios (Eclipse 3.6). As well as the usual collection of improvements and bug fixes, Helios has an additional rather cool feature - the Eclipse marketplace. It provides a plugins store and allows Eclipse users to search and browse a database full of software.

Here is what came up when I searched for PDF.

[Screenshot: Eclipse marketplace search results for PDF]

And the best bit is that you can then install the software just by clicking on the install button. The days of having to understand update sites are long gone!

So I hope you will give Helios a try and I also hope you will download and try our free plugin - let us know what you think...


Punctuation?

Posted by kieran france on Mon, Aug 16, 2010 @ 03:08 AM

So what is punctuation?

This may seem like a simple question, yet I find myself asking it more and more often whilst working on our PDF search and text extraction. So once again today I found myself asking this same question. What is punctuation?

According to dictionary.com, punctuation is:

"The practice or system of using certain conventional marks or characters in writing or printing in order to separate elements and make the meaning clear, as in ending a sentence or separating clauses."

Unfortunately this is not all that useful, as the English language has many different forms of punctuation and often uses the same symbols in a multitude of ways. We can even see punctuation used in ways other than for sentence structure, for example as emoticons.

For instance the character '.' could be a full stop, it could be a decimal point or it could even be part of '...'
In a PDF the character '.' could also be used in a multitude of other ways to help format a page and improve the flow of the text.

This is just one trivial example of many. I keep finding cases where the results of a whole-words-only search, or of extracting text as a word list, are thrown off by the use of punctuation.

When searching or extracting text, what of the '-' character?
Is the term "multi-tasking" one word or two?

If it's one word, should we allow it to contain the '-'?

How do we check if this is a valid use within a word?

What of one word split across two lines with '-' at the end of the first line?

Is this one word or two?

What of the '-'?

I'm not writing this to provide a concrete solution, nor am I looking to be given one, as I do not believe one exists given the way punctuation can be used in text documents.

These questions arise often as everyone producing PDFs produces them in different styles. They are just a few of the things that make my job interesting.
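
To make the '-' questions concrete, here is a minimal sketch of one possible heuristic. It is not how JPedal does it; the class and method names are hypothetical. It assumes that a '-' at the very end of a line is a layout hyphen to be dropped, and that a '-' inside a line belongs to the word.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only, not part of any real library.
class HyphenWords {

    // Builds a word list from extracted lines of text.
    // Assumption: '-' at the end of a line marks a word split by layout,
    // so the two halves are joined and the '-' is dropped.
    // Assumption: '-' inside a line (as in "multi-tasking") is kept.
    static List<String> words(List<String> lines) {
        StringBuilder text = new StringBuilder();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.endsWith("-")) {
                // Drop the hyphen and add no space, so the next line continues the word.
                text.append(line, 0, line.length() - 1);
            } else {
                text.append(line).append(' ');
            }
        }
        List<String> result = new ArrayList<>();
        for (String token : text.toString().trim().split("\\s+")) {
            // Strip punctuation from both ends of the token but leave the middle alone.
            String word = token.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            if (!word.isEmpty()) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words(List.of(
                "Is multi-tasking one word, or is it hyphen-",
                "ated text?")));
    }
}
```

The heuristic fails in exactly the way the questions above suggest: if "multi-tasking" itself is broken after the '-' at a line end, it comes out as "multitasking", and nothing in the text alone says which reading is right.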


Getting to the Business of Software conference for free

Posted by Mark Stephens on Fri, Aug 13, 2010 @ 08:02 AM

Readers of this blog may have guessed 2 things about me:-

1. In general I hate conferences - generally a poor use of time and money.

2. I think Business of Software is a great conference and I bore people constantly about it.

I have attended the last 2 conferences and signed up for the next one as soon as it was announced in the Spring. 

This year I have been given a small walk-on part (probably to give everyone a chance to catch up on lost sleep), but that aside it is going to be an awesome conference. The speakers list includes the likes of Joel Spolsky, Dan Bricklin, Eric Sink and Seth Godin - just take a look at the full list...

For 3 days you get the chance to hear these people, and to chat socially to them and around 300 other attendees from across the globe. It is an exhausting and mind-blowing experience - you will need that snooze on Tuesday afternoon. If you do not come back with something worthwhile, software is probably not the business you should be in...

Now Neil is offering a chance to win a free ticket to attend by posting the best suggestion for how to allocate the free tickets.

In fact it is two chances - presumably if you get a chance to suggest the rules of a game, you will suggest some which make it easier for you to be able to win.

At the very least, it must be worth posting a comment.


Understanding the PDF file format - interactive elements

Posted by Mark Stephens on Wed, Aug 11, 2010 @ 08:31 AM

One of the really useful features of the PDF file format is the ability to have interactive elements. These began as simple widgets such as checkboxes, buttons, comboboxes and text fields, and the list has expanded to include the ability to embed sounds, movies and even other files or URLs. This makes the PDF file format a very interactive medium.

Here is one of my favorite examples:

[Image: PDF form]

All of these interactive features can be defined in 2 ways. Firstly, they can exist as PDF objects defined within the PDF file, inheriting values from their parent objects. This is the original AcroForm version. It uses the standard PDF Cos format and would look something like this:

26 0 obj
<<
/F 4
/I[1]
/Type/Annot
/Rect[196 594 314 613]
/BS<</W 1/S/U>>
/FT/Ch
/Subtype/Widget
/P 24 0 R
/T(Item)
/V(Soft Taco)
/AP<</N 142 0 R>>
/Ff 393216
/MK<</BC[0 0 0]>>
/Opt[(Burrito)(Soft Taco)(Mexico City)(Quesadilla)(Taquitaco)]
/DA(/TiRo 0 Tf 0 0 1 rg)
>>
endobj
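
The /Ff entry in the object above is a set of bit flags. As a rough sketch of how it can be read, the code below decodes the value using the flag names and bit positions given for fields in the PDF Reference (ReadOnly, Required and NoExport for all fields, the rest for choice fields). The class itself is made up for this example, and the positions are worth checking against the spec before relying on them.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative decoder for the /Ff value of a choice (/FT/Ch) field.
class ChoiceFieldFlags {

    // Bit positions are 1-based, matching how the PDF Reference numbers them.
    private static final int[] BITS = {1, 2, 3, 18, 19, 20, 22, 23, 27};
    private static final String[] NAMES = {
        "ReadOnly", "Required", "NoExport", "Combo", "Edit",
        "Sort", "MultiSelect", "DoNotSpellCheck", "CommitOnSelChange"
    };

    // Returns the names of all flags that are set in ff.
    static List<String> decode(int ff) {
        List<String> set = new ArrayList<>();
        for (int i = 0; i < BITS.length; i++) {
            if ((ff & (1 << (BITS[i] - 1))) != 0) {
                set.add(NAMES[i]);
            }
        }
        return set;
    }

    public static void main(String[] args) {
        System.out.println(decode(393216)); // prints [Combo, Edit]
    }
}
```

So 393216 (bits 18 and 19) describes an editable combo box, which fits the /Opt list of menu items in the same object.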

Or they can appear in one of several XML structures inside the file. Here is an example - the actual XML is buried inside streams in the referenced objects. 

<<
/XFA[(preamble)40 0 R
(config)41 0 R
(template)42 0 R
(datasets)43 0 R
(localeSet)44 0 R
(postamble)45 0 R]

You can also define a form both ways at once, as a Cos object with its data in the XFA - the spec is nothing if not flexible!

Forms can also be linked to events and to JavaScript code inside the PDF and can have tooltips, change their visibility and interact with other components. They can also reference widgets on other pages. So there is not much you cannot achieve with them...

The thing that I found most confusing when starting was that interactive elements can actually be referenced in 2 separate ways. A PDF document can have a single AcroForm or XFA object (which lists all the widgets in the document), but each page can also have an Annots object which lists the widgets on that page. So you potentially need to look at both lists and then work out which are used on any page.
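
As a sketch of what "looking at both lists" might involve, the code below merges the two sources for one page. Everything here is hypothetical: it assumes the PDF has already been parsed, that the document-level fields are available as a map from each field's reference to the page named in its /P entry, and that the page's Annots list has already been filtered down to widgets.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: the data structures are stand-ins for a real parser's output.
class PageWidgets {

    // pageRef:        reference of the page, e.g. "24 0 R"
    // acroFormFields: field reference -> page reference taken from the field's /P entry
    // pageAnnots:     widget references listed in the page's /Annots array
    static Set<String> widgetsOnPage(String pageRef,
                                     Map<String, String> acroFormFields,
                                     List<String> pageAnnots) {
        // Start with what the page itself lists, keeping order and dropping duplicates.
        Set<String> result = new LinkedHashSet<>(pageAnnots);
        // Add document-level fields that point back at this page.
        for (Map.Entry<String, String> field : acroFormFields.entrySet()) {
            if (pageRef.equals(field.getValue())) {
                result.add(field.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> fields = Map.of("26 0 R", "24 0 R", "30 0 R", "25 0 R");
        List<String> annots = List.of("26 0 R", "27 0 R");
        System.out.println(widgetsOnPage("24 0 R", fields, annots)); // prints [26 0 R, 27 0 R]
    }
}
```

A set is used so that a widget which appears in both lists, like 26 0 R here, is only counted once.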

So if you think PDF files are about just static WYSIWYG documents, you have been missing a whole dimension. Give them a try.


Why the TrueType hinting patent expiration matters

Posted by Sam Howard on Thu, Aug 05, 2010 @ 12:12 PM

A while back we put a lot of effort into implementing some of the hinting technology in the TrueType specification. This system effectively uses small programs in a stack-based environment to manipulate the set of points which define the contours of a glyph.
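
To give a feel for what "small programs in a stack-based environment" means, here is a toy sketch. The three opcodes are invented for illustration and are not real TrueType instructions; the real instruction set is far larger and works with a graphics state, reference points and fixed-point units.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy stack machine that moves outline points. NOT the TrueType instruction set.
class ToyHinter {
    static final int PUSH = 0;    // push the next program value onto the stack
    static final int SHIFT_Y = 1; // pop delta, then pop point index; add delta to that point's y
    static final int SNAP_Y = 2;  // pop point index; round that point's y to a whole pixel

    // points[i] holds {x, y} for outline point i and is modified in place.
    static void run(int[] program, double[][] points) {
        Deque<Integer> stack = new ArrayDeque<>();
        for (int pc = 0; pc < program.length; pc++) {
            switch (program[pc]) {
                case PUSH:
                    stack.push(program[++pc]);
                    break;
                case SHIFT_Y: {
                    int delta = stack.pop();
                    int index = stack.pop();
                    points[index][1] += delta;
                    break;
                }
                case SNAP_Y: {
                    int index = stack.pop();
                    points[index][1] = Math.round(points[index][1]);
                    break;
                }
                default:
                    throw new IllegalArgumentException("Unknown opcode " + program[pc]);
            }
        }
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0.25}, {10, 6.7}};
        // Snap point 1 to the pixel grid, then shift point 0 up by 2.
        int[] program = {PUSH, 1, SNAP_Y, PUSH, 0, PUSH, 2, SHIFT_Y};
        run(program, points);
        System.out.println(points[0][1] + " " + points[1][1]); // prints 2.25 7.0
    }
}
```

The point of the sketch is only the shape of the mechanism: instructions take their arguments from a stack and nudge outline points, which is why skipping some of the point-moving instructions leaves a glyph worse off than not hinting at all.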

Up until recently, some of the instructions available in the language were covered by a set of patents belonging to Apple, meaning anyone wishing to actually execute those particular instructions would need a licence from Apple. Unfortunately the patented instructions were the most frequently used instructions for moving points, meaning that simply executing the rest of the instructions does more harm than good. Now that those patents have expired, this is no longer an issue.

So what does this actually mean? These patents have stood for 20 years and the world hasn't fallen apart.

Well, that's true. New font technologies have been developed with different hinting mechanisms, tools for rendering TrueType have created their own automatic hinting algorithms and antialiasing technologies have vastly improved. All of this has, to some extent, reduced the need for the original hinting instructions, but the fact remains that a font hand-hinted by an expert will always look clearer than any automatically hinted rendering.

However, what really clinches it for me is the fact that numerous Chinese fonts actually construct their glyphs by defining a range of glyphs as simple strokes, then using them in composite glyphs which are heavily manipulated by hinting instructions in order to form the final characters. There is no way to work around this, as performing the relevant shifts and alignments uses the methods described by the patents. Now that the patents have expired there will be no need to buy a licence from Apple, meaning many products can now add (or enable) the functionality used to display these fonts, including open source packages like FreeType, which is used for font rendering in many Linux distributions.

[Figure: Individual stroke component glyphs]

[Figure: Unhinted composite glyph outline]

[Figure: Final hinted glyph]

There's a lot of discussion on this topic over at Slashdot, and some more details about what was patented can be found at the home of FreeType.


Annoying Java Bugs - who broke right aligned text fields

Posted by Mark Stephens on Tue, Jul 27, 2010 @ 02:12 AM

Bugs are an unfortunate part of a coder's life. PDF has quite an 'elastic' specification, and the real world often does not match what is supposedly allowed (Sam yelled at me last week that another PDF file we were looking at was missing supposedly mandatory values), so there is lots of scope for coding and logic errors. We have three machines in the office which constantly run regression tests on Windows and Linux, so hopefully we only have to fix a bug once and we can see if a fix breaks other things. So our bugs are a controllable annoyance where we can constantly try to raise our game and improve.

No, the most annoying bugs are the ones we did not write. We can put our hands up to our own bugs and fix them easily...

Last week we had an issue with right-aligned text values not appearing correctly. It took some time to hunt this one down as we simply could not reproduce it - until we made sure that we were using the same version of the JVM as our client. It turns out that alignment of right-aligned text values was broken in JDK 1.6 update 10 and does not appear to have been fixed yet.

The fix is a hack, which adds a space to the right of every text value and pads the values with spaces so that they are all the same length. Then they align nicely. So tomorrow's release of JPedal has a boolean flag in it to enable this 'hack' to ensure that fields work in all JVMs. Hopefully Sun (or rather Oracle now) will fix it, as it is a big deal to anyone writing financial software in Java. It also means that you may need to tie down customers to a certain version of Java to avoid a whole nightmare of issues and workarounds.
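
A padding hack of this kind might look roughly like the sketch below. This is not the JPedal code, and the names are made up. It assumes the padding goes on the left of each value to bring them all up to the longest length, with one extra space added on the right; the post does not spell out those details.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of padding text values to a common length.
class RightAlignHack {

    // Left-pads every value to the length of the longest one, then appends one space.
    // Assumption: which side receives the padding is a guess for this example.
    static List<String> pad(List<String> values) {
        int width = 0;
        for (String value : values) {
            width = Math.max(width, value.length());
        }
        List<String> padded = new ArrayList<>();
        for (String value : values) {
            StringBuilder sb = new StringBuilder();
            for (int i = value.length(); i < width; i++) {
                sb.append(' ');
            }
            padded.add(sb.append(value).append(' ').toString());
        }
        return padded;
    }

    public static void main(String[] args) {
        for (String s : pad(List.of("1.50", "120.00"))) {
            System.out.println("[" + s + "]");
        }
    }
}
```

Padding with spaces only lines values up exactly in a monospaced font, which is one more reason to treat it as a workaround rather than a fix.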

Do you have a particularly annoying Java bug?


Oracle and Java

Posted by no reply on Tue, Jul 20, 2010 @ 04:52 AM

We always felt that things needed to change in the Java world so have been keenly waiting to see what Oracle would do with Java...


Well, we got our first clues last week with a letter from Oracle. Sent by express FedEx from California, the good news is that Oracle clearly has money to spend. The letter was a legal notice terminating a minor agreement we had with Sun relating to its marketing scheme. I can't even remember what it was. Oracle clearly has some plan, because it looks like it is clearing the decks if it is cancelling all previous agreements. So something is going on. It is also very good news for FedEx and its shareholders!


What makes me really sad though is the wasted opportunity. By all means, make sure you tie up any loose ends in a legal and water-tight way. But what makes Java superior to DotNet (in my opinion) is the thriving ecosystem that Java has. From the Eclipse Organisation to the smallest developer, there are lots of really creative things happening in Java. Oracle could have used this as a chance to also communicate with all of these developers (and potential clients of Oracle) or at least included a note to say "Sorry guys, but we need the legal stuff as we have to cover all the legal issues". A legal-sounding letter from Oracle is a scary thing for any developer...


There already seems to be a bit of a lull in the Java world as everyone waits to see what Oracle does. And Oracle has paid a lot of money for Java, so it really cannot be in its interests to see it decline. So I really hope we see something with more vision and less legalese coming out of Oracle.


Don't blame the PDF file format

Posted by Mark Stephens on Tue, Jul 13, 2010 @ 02:44 AM

I see a lot of complaints about the PDF file format on various forums. They tend to focus mainly on 2 issues:-

1. The PDF file format is complicated.

2. Extraction, especially of text, is not always straight-forward.

Both of these, I think, are essentially unfair. PDF arose out of PostScript and is more akin to a program, with the final display as its output. It offers a very powerful and elegant structure to do this, but getting into PDF is a bit like learning a programming language. As with any programming language, you need to have a decent set of tools and a good working knowledge to achieve anything.

Many so-called 'PDF killers' have appeared over the years and yet PDF still remains because it is an excellent technical solution for many problems. PDF was never envisaged as something you could hack in a text editor.

The issue with text extraction arises because PDF was designed as an end-file display format so it does not contain lots of details on text structure and layout which you might find in other formats. Adobe did remedy this by adding a feature to embed Structured content tags into the PDF and if this is used, very accurate text can be extracted. The problem is that very few people use this when creating PDFs. So again, don't blame the format - if used correctly it works very well.

The PDF format's biggest issue really is that it has been so successful that people are trying to push it into areas which are not its strength, or beyond what it was designed to do.
