Java PDF Blog

Java and PDF development - our personal experiences and discoveries


PdfHelp - a free PDF based alternative to JavaHelp

Posted by Mark Stephens on Fri, Mar 05, 2010 @ 08:39 AM

PdfHelp is a free PDF based alternative to JavaHelp for adding searchable application help to your Java programs.

When Java first came out, Sun produced JavaHelp as a tool to create HTML help for Java. It works well, but it can be fiddly to set up and cumbersome to maintain. So we came up with PdfHelp.


PdfHelp offers a fully searchable way to add interactive help to any Java program (or to use as a standalone tool to provide documentation). It uses the Open Source version of the JPedal PDF Viewer to display and search PDF files - all the developer has to do is to provide a list of PDF files - often the same files provided as documentation for the user.

So creating and updating help is quick and simple. You just run the jar and click to add the files.

It will then give you the required code for your application.

And because it uses PDF files, the display quality is excellent and you get full interactive search.

So we hope you will give PdfHelp a try at its new home http://www.pdfhelp.org 


Linearized PDF files

Posted by Mark Stephens on Fri, Feb 26, 2010 @ 11:36 AM

Linearized PDF is a special way to organize a PDF file.

In general, PDF is a very elegant and well-designed format. A PDF consists of lots of PDF objects which are used to create the pages. These objects are indexed by a cross-reference table, which stores the location of each object in the file. So only the table needs to be loaded when the file is opened, and it can then be used to load the objects required to display a page. The whole file itself does not need to be read, only the table. The location of the table is always stored at the end of the file, so it is easy to find, and it is also simple to modify the file just by appending new information and a new table.

However, if the file is read via the web, it is accessed as a stream of bytes. This means the table (which is at the end of the file) cannot be read until the whole PDF file has been transferred. This can take some time with large files.

So Adobe created a new way to lay out the PDF called Linearized PDF. The file format is still the same, but there is a special tag at the start of the file, and all the objects needed to create the first page (and a small cross-reference table describing them) are stored at the START of the file. As soon as this data has been read, the first page can be displayed while the rest of the file is downloaded. This makes the whole thing seem much faster and gives the user something to look at almost immediately, even on huge files.

In Adobe Acrobat and Adobe Reader, the best way to see if a PDF is Linearized is to look at the Document properties. If the file is a linearized PDF, the item Fast Web View will display Yes.



In JPedal 4.0.1, we have added a similar option to the Document properties to show if the file is Linearized. If it is, the word linearized appears in the general section after the PDF version.


In JPedal you can also check programmatically to see if a file is Linearized by seeing if the Linearized object exists - if it does, it is a Linearized PDF. Here is the code.

//decode_pdf is the instance of PdfDecoder representing the opened PDF file
if(this.decode_pdf.getJPedalObject(PdfDictionary.Linearized)!=null)
System.out.println("This file is linearized");


So in a nutshell, Linearized PDF is a way of organizing a PDF file so that if it is going to be accessed over the Internet it will appear to load much faster. And it does this very well!
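If you are not using JPedal, a rough check can be sketched in plain Java. This is only an illustration, not how JPedal does it: it assumes that the special tag is the /Linearized key and that it appears within the first 1024 bytes of the file, and the class and method names (LinearizedCheck, looksLinearized) are made up for the example. It does a plain text search rather than parsing the object properly.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

class LinearizedCheck {

    // Assumption: the linearization tag sits in the first 1024 bytes, so only the head is read
    private static final int HEAD_SIZE = 1024;

    /** Returns true if the start of the PDF data contains a /Linearized key. */
    static boolean looksLinearized(byte[] head) {
        int length = Math.min(head.length, HEAD_SIZE);
        // ISO-8859-1 maps every byte to exactly one char, so binary data cannot break the decode
        String text = new String(head, 0, length, StandardCharsets.ISO_8859_1);
        return text.startsWith("%PDF-") && text.contains("/Linearized");
    }

    /** Reads only the head of the file and applies the same test. */
    static boolean looksLinearized(Path pdf) throws IOException {
        byte[] head = new byte[HEAD_SIZE];
        int total = 0;
        try (InputStream in = Files.newInputStream(pdf)) {
            int read;
            while (total < HEAD_SIZE
                    && (read = in.read(head, total, HEAD_SIZE - total)) != -1) {
                total += read;
            }
        }
        return looksLinearized(Arrays.copyOf(head, total));
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 1) {
            System.out.println("Linearized: " + looksLinearized(Paths.get(args[0])));
        } else {
            System.out.println("Usage: java LinearizedCheck file.pdf");
        }
    }
}
```

Because only the start of the file is read, the same test could be applied to the first block of bytes arriving over a network connection.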


Search in continuous mode and future plans

Posted by kieran france on Thu, Feb 18, 2010 @ 04:59 AM
For some time the JPedal library has been able to search only in single page mode. For our release of JPedal 4.0 we have begun to extend this functionality to the other view modes. As a start we have added search to the continuous single page view mode, with plans to expand this to the other view modes.

To allow for this new functionality we have needed to alter a few of our existing public methods so that highlights can be assigned to or retrieved from a particular page.


On top of this, the highlights are no longer stored in a Rectangle array. The highlights are now stored in a HashMap using the page number as the key and a Vector_Rectangle (org.jpedal.utils.repositories.Vector_Rectangle) as the associated value.
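The idea of keying highlights by page number can be sketched in standalone Java. This is not the JPedal implementation: the class PageHighlights and its methods are hypothetical, and a plain java.util.List stands in for Vector_Rectangle so that the example compiles without the library.

```java
import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PageHighlights {

    // page number -> highlight areas on that page (List used here in place of Vector_Rectangle)
    private final Map<Integer, List<Rectangle>> highlightsByPage = new HashMap<>();

    /** Adds the given areas to the highlights already stored for the page. */
    void addHighlights(Rectangle[] areas, int page) {
        highlightsByPage
                .computeIfAbsent(page, p -> new ArrayList<>())
                .addAll(Arrays.asList(areas));
    }

    /** Returns the highlights for one page, or an empty array if the page has none. */
    Rectangle[] getHighlightAreas(int page) {
        List<Rectangle> areas = highlightsByPage.getOrDefault(page, Collections.emptyList());
        return areas.toArray(new Rectangle[0]);
    }

    /** Removes every highlight stored for the page. */
    void clearHighlights(int page) {
        highlightsByPage.remove(page);
    }

    public static void main(String[] args) {
        PageHighlights highlights = new PageHighlights();
        highlights.addHighlights(new Rectangle[] {new Rectangle(10, 10, 50, 12)}, 1);
        highlights.addHighlights(new Rectangle[] {new Rectangle(20, 40, 80, 12)}, 3);
        System.out.println("Page 1 has " + highlights.getHighlightAreas(1).length + " highlight(s)");
        System.out.println("Page 2 has " + highlights.getHighlightAreas(2).length + " highlight(s)");
    }
}
```

Storing the areas per page is what lets a continuous view draw the highlights for whichever pages happen to be visible.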


We have also moved the page text areas and text orientation into hash maps. To store this information, it must be retrieved from PdfStreamDecoder after decodePageContent(PdfObject pdfObject, int minX, int minY, GraphicsState newGS, byte[] pageStream) is called, as each call to this method overwrites the locally stored data for the previous page.

The following methods have changed in version 4.0 to allow highlights for multiple pages to be stored.

Commands.ExecuteCommands(Commands.HIGHLIGHT, new Rectangle[]{})
has become
Commands.ExecuteCommands(Commands.HIGHLIGHT, new Object[]{Rectangle[] areas, int page})

GethighlightAreas()
has become
GethighlightAreas(int page)

setFoundParagraph(int x, int y)
has become
setFoundParagraph(int x, int y, int page)

addHighlights(Rectangle[], boolean)
has become
addHighlights(Rectangle[], boolean, int page)

RemoveFoundTextArea(Rectangle)
has become
RemoveFoundTextArea(Rectangle, int page)

RemoveFoundTextAreas(Rectangle[])
has become
RemoveFoundTextAreas(Rectangle[], int page)

 

As you will notice, each of the above methods has gained a new integer parameter called page. This value is the page number to which the method should be applied.


As well as the above methods the following method has also changed.

Display.initRenderer(Rectangle[] areas, Graphics2D g2,Border myBorder,int indent)
has become
Display.initRenderer(Map areas, Graphics2D g2,Border myBorder,int indent)

The above method originally received the Rectangle array we used for highlighting. We have updated the method to accept a Map, as this is how the highlights are now stored.

 

Earlier in this article we mentioned that PdfStreamDecoder holds a local copy of the text areas and orientation when a page's content is decoded. To retrieve this data we have added the following two methods.

Vector_Rectangle getTextAreas()

Vector_Int getTextDirections()

In the releases to follow we will be moving more functionality, such as highlighting with the mouse, extraction and the right click menu, into the continuous single page view mode and then into the other view modes.


Java printing of custom Swing Components

Posted by Mark Stephens on Wed, Feb 10, 2010 @ 09:50 AM

If you are implementing printing in Java, there is a really good tutorial showing how to add printing support to any Swing component. Printing generally works very well and we use it in our PDF library to provide printing of PDF files.

The example uses a PrinterJob to print via Java - essentially, your Swing component just needs to implement Pageable or Printable and have a print() method. This is often virtually identical to the draw method, but just renders onto a Print Graphics2D object rather than the Screen Graphics2D object.

However, PrinterJob is rather basic and does not have all the features of DocPrintJob. In particular, DocPrintJob offers a couple of very useful additional features:-

1. It can print different sized pages (with PrinterJob it seems hard-coded to the size of the first page).

2. It adds listener functionality to allow monitoring of print activity. The method addPrintJobListener delivers events which report what is happening to the print job.

Converting from PrinterJob to DocPrintJob

It is very straightforward to switch from using PrinterJob to DocPrintJob. In this example I have a custom Swing component decode_pdf which implements Pageable to print itself. Here are the steps:-

1. Alter your PrinterJob object to a DocPrintJob. In my code I just had to change

PrinterJob pj = PrinterJob.getPrinterJob();

to 

PrintService[] service = PrinterJob.lookupPrintServices(); //list of printers

DocPrintJob printJob = service[i].createPrintJob(); //i is whichever printer you use

2. Wrap your Pageable or Printable object inside a Doc, specifying which interface to use in the call

//wrap in a Doc as we can then add listeners
Doc doc = new SimpleDoc(decode_pdf, DocFlavor.SERVICE_FORMATTED.PAGEABLE, null);

 

3. Alter the print call from directly calling your Swing component to use the doc so 

decode_pdf.print();

becomes

printJob.print(doc,null);

 

Finally, you can now add a listener to track what happens.

printJob.addPrintJobListener(new PDFPrintJobListener());
 

Here is a simple example of the Listener class

private class PDFPrintJobListener implements PrintJobListener {
        public void printDataTransferCompleted(PrintJobEvent printJobEvent) {
            System.out.println("printDataTransferCompleted="+printJobEvent.toString());
        }

        public void printJobCompleted(PrintJobEvent printJobEvent) {
            System.out.println("printJobCompleted="+printJobEvent.toString());
        }

        public void printJobFailed(PrintJobEvent printJobEvent) {
            System.out.println("printJobFailed="+printJobEvent.toString());
        }

        public void printJobCanceled(PrintJobEvent printJobEvent) {
            System.out.println("printJobCanceled="+printJobEvent.toString());
        }

        public void printJobNoMoreEvents(PrintJobEvent printJobEvent) {
            System.out.println("printJobNoMoreEvents="+printJobEvent.toString());
        }

        public void printJobRequiresAttention(PrintJobEvent printJobEvent) {
            System.out.println("printJobRequiresAttention="+printJobEvent.toString());
        }
    }

So switching to DocPrintJob is simple and easy and improves printing support. And if you use our PDF library for printing, you should see the benefits in the next release.
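Putting the steps together, a self-contained version might look like the sketch below. It is an illustration rather than our library code: HelloPage is a hypothetical stand-in for your own Printable component, the default printer is used instead of picking one from the list, and PrintJobAdapter is used so that only the events of interest need to be overridden.

```java
import java.awt.Graphics;
import java.awt.print.PageFormat;
import java.awt.print.Printable;
import javax.print.Doc;
import javax.print.DocFlavor;
import javax.print.DocPrintJob;
import javax.print.PrintException;
import javax.print.PrintService;
import javax.print.PrintServiceLookup;
import javax.print.SimpleDoc;
import javax.print.event.PrintJobAdapter;
import javax.print.event.PrintJobEvent;

class DocPrintJobSketch {

    /** Hypothetical stand-in for a custom component: draws one line of text on page 0. */
    static class HelloPage implements Printable {
        public int print(Graphics g, PageFormat format, int pageIndex) {
            if (pageIndex > 0) {
                return NO_SUCH_PAGE; // only one page in this example
            }
            int x = (int) format.getImageableX() + 20;
            int y = (int) format.getImageableY() + 40;
            g.drawString("Hello from DocPrintJob", x, y);
            return PAGE_EXISTS;
        }
    }

    /** Step 2: wrap the Printable in a Doc, naming the interface via the DocFlavor. */
    static Doc wrap(Printable page) {
        return new SimpleDoc(page, DocFlavor.SERVICE_FORMATTED.PRINTABLE, null);
    }

    public static void main(String[] args) throws PrintException {
        // Step 1: get a DocPrintJob from a PrintService (here the default printer)
        PrintService service = PrintServiceLookup.lookupDefaultPrintService();
        if (service == null) {
            System.out.println("No default printer found");
            return;
        }
        DocPrintJob printJob = service.createPrintJob();

        // Add a listener before printing so that no events are missed
        printJob.addPrintJobListener(new PrintJobAdapter() {
            @Override
            public void printJobCompleted(PrintJobEvent event) {
                System.out.println("printJobCompleted=" + event);
            }

            @Override
            public void printJobFailed(PrintJobEvent event) {
                System.out.println("printJobFailed=" + event);
            }
        });

        // Step 3: print via the Doc rather than calling the component directly
        printJob.print(wrap(new HelloPage()), null);
    }
}
```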

 

 


Java File handling - when is a file actually saved

Posted by Mark Stephens on Fri, Feb 05, 2010 @ 04:20 AM

Consider the following code....

File ff=File.createTempFile("page",".bin", new File(ObjectStore.temp_dir));

BufferedOutputStream to = new BufferedOutputStream(new FileOutputStream(ff));

to.write(currentDisplay.serializeToByteArray(null));         

to.flush();           

to.close();

pagesOnDisk.put(key,ff.getAbsolutePath());

It stores a serialised Java Object (currentDisplay) on disk and then stores the file location so we can reuse it. So in theory,  if the value is in the Map pagesOnDisk, we should be able to retrieve the data and reuse it...

Unfortunately, that is not always the case. While Java may think the file has been written out, an attempt to reuse it immediately results in all sorts of errors arising from trying to read a File which has not been fully written out to disk. At the operating system level, the file has been buffered and is still being written out.

So be aware of this 'gotcha' in Java and either ensure that there is a sufficient time delay to allow the data to be written out, or include some check to make sure the data is valid.
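One possible form of such a check is sketched below. It goes slightly beyond what we describe above and is an assumption rather than our production code: it asks the operating system to flush its buffers with FileDescriptor.sync() and then compares the size on disk with the number of bytes written before the path is handed out. The class and method names (SafeTempWrite, writeAndVerify) are invented for the example.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

class SafeTempWrite {

    /** Writes data to a new temp file, forces it to disk and checks the size before returning. */
    static Path writeAndVerify(byte[] data) {
        try {
            Path file = Files.createTempFile("page", ".bin");
            try (FileOutputStream out = new FileOutputStream(file.toFile())) {
                out.write(data);
                out.flush();
                // Ask the operating system to push its own buffers to the storage device
                out.getFD().sync();
            }
            // Simple validity check: the size on disk must match what was written
            if (Files.size(file) != data.length) {
                throw new IOException("Incomplete write to " + file);
            }
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = writeAndVerify(new byte[] {1, 2, 3, 4});
        // Only now would the path be stored in a map such as pagesOnDisk
        System.out.println("Safe to register " + file.toAbsolutePath());
    }
}
```

A size check only catches truncated files; if the contents matter, a checksum stored alongside the path would be a stronger test.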


Printing PDF files from Java

Posted by Mark Stephens on Sat, Jan 30, 2010 @ 04:10 AM

Printing PDF files from Java is something that raises a lot of general questions, so this short article is a general guide to the options available. Java itself contains a built-in print system (JPS). JPS itself does not internally support the PDF file format.

There are 3 ways to print PDF files in Java:-

1. Use a printer which directly supports PDF files and use JPS to send the data directly to it.

All the work is done by the printer, often in hardware so this is a brilliant solution if you can precisely define the printers used but does not provide a generic solution. If you want to try this, here is some generic code

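PrintService printService = PrintServiceLookup.lookupDefaultPrintService(); //must be a printer which accepts PDF data

FileInputStream fis = new FileInputStream("C:/mypdf.pdf");

Doc pdfDoc = new SimpleDoc(fis, DocFlavor.INPUT_STREAM.PDF, null); //a DocFlavor is required, null is rejected

DocPrintJob printJob = printService.createPrintJob();

printJob.print(pdfDoc, new HashPrintRequestAttributeSet());

fis.close();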

2. Print from Java using a non-Java application

Java allows you to call non-Java code, so you can use Acrobat, Ghostscript, CUPS or any other solution. You can do this with the Java command

Runtime.getRuntime().exec("commands");

Again, this works if you have control of the exact platforms and software available but does not provide a generic solution.

3. Print using JPS

JPS does not include PDF support, but it does have hooks to allow any Java program to print content to any printer. A number of Java PDF libraries offer printing - they essentially convert the PDF into a rendered page which JPS then prints. This provides a generic solution but the files tend to be larger and it relies on the capabilities of the PDF library which vary.

All three methods have their pros and cons, so try them to find out which one offers the best fit for your requirements. If you would like to see a more detailed article, please let us know or post your comments here.
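As an illustration of option 2, here is a minimal sketch which hands the PDF to an external print command. It assumes a system where the CUPS lp command is installed and on the path; the printer name and file path are placeholders, and ProcessBuilder is used in place of Runtime.exec because it takes the arguments as a list, which avoids problems with spaces in file names.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

class ExternalPrint {

    /** Builds the command line. "lp -d printer file" is CUPS syntax, an assumption about the platform. */
    static List<String> buildCommand(String printerName, String pdfPath) {
        return Arrays.asList("lp", "-d", printerName, pdfPath);
    }

    /** Runs the external command and returns its exit code (0 normally means success). */
    static int print(String printerName, String pdfPath) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(printerName, pdfPath))
                .inheritIO() // show the tool's own messages on our console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 2) {
            System.out.println("Exit code " + print(args[0], args[1]));
        } else {
            // Without arguments, just show the command which would be run
            System.out.println(buildCommand("MyPrinter", "/tmp/mypdf.pdf"));
        }
    }
}
```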


Two issues with Java File access for temporary files

Posted by Mark Stephens on Tue, Jan 19, 2010 @ 05:36 AM

In theory, Java has a very flexible and powerful set of commands for manipulating Files. The File object allows you to create a File or Directory, access all sorts of information about it and also delete it. The problem we find, however, is that it is not always possible to delete a file in Java by using file.delete(), even if you created it. On Windows especially, Java retains a lock on the file and will not delete it in the same session. This can be a real issue if you create lots of temporary files which need to be cleaned up. You can mark a File with a deleteOnExit() call, but if the program is expected to run continually, this will leak memory (the list of files to delete keeps growing) and the files will never be deleted.

So we added a workaround, checking in the next session for old files and deleting them to keep our temp directory clean. This exposes a second issue - the lastModified() method seems to be incredibly inefficient. In our tests, on systems with lots of files, this function alone was accounting for 30% of the CPU usage in a process which opened PDFs, decoded them, rasterized them and then deleted them.

So if you are using File commands for temporary files, be careful to ensure they are deleted, and try to have small numbers of files if you need to access their timestamp.
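A start-up clean-up along these lines might look like the following sketch. It is not our actual code: the class name, the one-day age limit and the demo method are assumptions made for the example. Note that it calls lastModified() once per file, so the cost still grows with the number of files in the directory.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;

class TempFileCleaner {

    static final long DAY = 24L * 60 * 60 * 1000;

    /** Deletes plain files in dir last modified more than maxAgeMillis ago; returns how many went. */
    static int deleteOlderThan(File dir, long maxAgeMillis) {
        long cutoff = System.currentTimeMillis() - maxAgeMillis;
        File[] files = dir.listFiles();
        if (files == null) {
            return 0; // not a directory, or it could not be read
        }
        int deleted = 0;
        for (File file : files) {
            // lastModified() is read once per file and only for plain files
            if (file.isFile() && file.lastModified() < cutoff && file.delete()) {
                deleted++;
            }
        }
        return deleted;
    }

    /** Small demonstration: one stale file and one fresh file, only the stale one is removed. */
    static int demo() {
        try {
            File dir = Files.createTempDirectory("cleaner").toFile();
            File stale = new File(dir, "stale.bin");
            File fresh = new File(dir, "fresh.bin");
            stale.createNewFile();
            fresh.createNewFile();
            stale.setLastModified(System.currentTimeMillis() - 2 * DAY);
            int deleted = deleteOlderThan(dir, DAY);
            fresh.delete();
            dir.delete();
            return deleted;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Deleted " + demo() + " old file(s)");
    }
}
```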


Embedded PDF Truetype fonts are always MAC encoded unless they are not

Posted by Mark Stephens on Wed, Jan 13, 2010 @ 06:32 AM

As the PDF file specification has evolved, it has developed some 'quirks' - areas where it does not always work as documented. One of the most annoying of these is Truetype font encoding. It is one of those features which is broken but which it is now too late to fix.

Inside a PDF file, all text data is stored as binary numbers, and each value is decoded into the actual glyph (for example, the value 65 is converted into the text value 'A'). Because the PDF file format is 'multiplatform', there are several possible Standard Encoding Formats to use for this conversion (for example, WinAnsi for Windows and MacRoman for standard MAC values). This is because Windows and MAC originally evolved with different character sets and values. Most of the time the values are identical (A is value 65 in both MAC and WIN encoding) but certain accented characters have different values. So value 132 is Ntilde (letter N with a wavy line above) in MAC encoding but quotedblbase (double quotes at the bottom of the line) on Windows. So long as we know which translation table to use, this is not a problem of course....
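The effect of choosing the wrong table can be shown with a tiny sketch. The two tables below contain only the two values mentioned above, not the full WinAnsi and MacRoman encodings, and the class and method names are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

class EncodingTables {

    // Tiny excerpts of the two tables: character code -> glyph name
    static final Map<Integer, String> MAC = new HashMap<>();
    static final Map<Integer, String> WIN = new HashMap<>();

    static {
        MAC.put(65, "A");
        WIN.put(65, "A");
        MAC.put(132, "Ntilde");
        WIN.put(132, "quotedblbase");
    }

    /** Looks up the glyph name for a code in the chosen table (".notdef" if not in the excerpt). */
    static String glyphName(int code, boolean macEncoded) {
        Map<Integer, String> table = macEncoded ? MAC : WIN;
        return table.getOrDefault(code, ".notdef");
    }

    public static void main(String[] args) {
        for (int code : new int[] {65, 132}) {
            System.out.println(code + " -> MAC: " + glyphName(code, true)
                    + ", WIN: " + glyphName(code, false));
        }
    }
}
```

Code 65 gives the same glyph whichever table is used, which is why a wrong choice often goes unnoticed until an accented or punctuation character appears.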

The issue comes with embedded Truetype fonts, because PDF files will always list them as MAC encoded (which is what the specification says they should be) even when they are actually WIN encoded. Using the wrong look-up table does not matter for most values (as the results are identical) but it does break certain letters.

So what you need to do is to figure out yourself whether the font is actually WIN or MAC encoded and ignore the setting in the PDF file. There is (of course) no documented way to do this, and several values can appear as different values in either...

What we did was to develop some heuristics to work it out, which we continually test against known files and tweak as needed. They look at the actual font values present, see whether WIN or MAC encoding gives a 'better fit' and check certain key values. They also need to factor in the fact that the font may be subsetted, so only a selection of values will be present.

So if you get some odd characters working with PDF files containing Truetype fonts, this may well be the reason. And if you come across a file displayed in our PDF viewer which has some odd characters,  please do send us the file so we can continue to improve our code.

 


PDF with odd Type3 fonts in Ghostscript 8.50

Posted by Mark Stephens on Wed, Jan 06, 2010 @ 10:28 AM

As the developers of a PDF renderer, we spend a lot of time analysing PDF files and working out exactly what is going on. Recently we came across an intriguing font issue with a PDF file created in Ghostscript. It is worth relating as it is interesting in terms of Font technologies and for seeing what might be happening in Ghostscript...

Over the years, Adobe has added several font technologies to the PDF specification, including Type3, Type1, Truetype and OpenType. Type3 was one of the first and is not often used now because the quality of the other font technologies is superior. However, one tool which does make extensive use of Type3 fonts is Ghostscript - the Open Source PDF 'printing' library.

The way Type 3 works is that it allows you to define each character using the same commands as are used to define the PDF itself - so each character can be like a mini-PDF drawn in the space of the character. Most examples we have seen up until now have been reasonably simple, with either images or shapes used to draw the letters.

We were sent a PDF file last week to look at - it still amazes me after 10 years of developing a PDF library how many different ways people have managed to interpret the Specification and produce files which still work in Acrobat. We drilled down and fixed the issue, but what made it interesting was what was happening in the file.

In this file, each Type 3 font character had been created by adding instructions to draw a Type1 font character - so the Type3 font definition for the letter A just drew a letter A, and so forth. The Type 3 font is essentially acting as a wrapper for a Type1 font. This works but it is a fairly hacky and inefficient way to do things - it would be far better to just use a Type1 font which would also offer lots of other advantages. But PDFs are often created by tools which are just patched and patched - most users do not see the internal guts of a PDF so they do not appreciate what is going on. If it looks okay, they are happy with that.

This suggests to me that maybe Ghostscript is being patched and patched - sometimes code reaches a point where it is time to take a block and rewrite it properly...


PDF questions - where should you ask them

Posted by Mark Stephens on Tue, Dec 29, 2009 @ 08:38 AM

The PDF file format is a very common format, and also quite a complex one. So you may well want to ask some questions. What you need is somewhere where there are knowledgeable experts prepared to share knowledge and help out. As a late Christmas present, here are some recommendations on places we think are really good.

PlanetPdf has been going a long time. Its forum has some very talented developers (Aandi Inston is a regular) who patiently answer all sorts of questions.

4 x PDF Help is a relatively new site from part of the crowd who helped set up PlanetPdf. I already recognise some of the 'rockstars' of the PDF world answering questions on it. It uses the interface from stackoverflow so people can vote on issues and the most popular float to the top.

Stackoverflow has really taken the developer world by storm in 2009, making it much easier to get technical answers. Although a general site, it has a regular selection of PDF related questions and lots of knowledgeable answers. 

I have seen quite a few questions about IText on general forums, and the best place to get answers to these is the IText mailing lists, where the actual IText developers are very active in answering questions.

And lastly, if you are using JPedal, remember we offer a support forum for questions.

Wherever you choose to post your question, always remember:-

1. Be specific - the more precise the question the more likely you are to get an answer.

2. Be courteous and polite.

3. Don't expect other people to do your work for you. People on these forums are happy to help you learn but are not there to write code for you.
