Java PDF Blog

PDF solutions for big and small customers

Java and PDF development - our personal experiences and discoveries


Understanding the PDF file format - how are images stored

Posted by Mark Stephens on Sun, Apr 25, 2010 @ 09:22 AM

When I was learning the PDF file format, I found that images could be quite a complex topic, so I wrote this article to hopefully explain them clearly. Please do let me know if you have any suggestions to improve it or if it raises any questions for you.

A PDF file usually stores an image as a separate object (an XObject) which contains the raw binary data for the image. These are all listed in the Resources object for the page or the file and each has a name (e.g. Im1). It is wrong to think of images embedded inside a PDF as Tif, Gif, Bmp, Jpeg or Png. They are not.

It is important to appreciate that this is not usually an image in the sense of a Tif, Jpg or Png file - it is the binary data for the pixels, the colorspace used for the image and other information about the image (such as its dimensions). The image is ripped apart when the PDF is created, and different PDF creation tools may store the same image in very different ways.

Here is an example shown in the PDF object viewer in Acrobat 9

 

Sometimes the raw image data is adjusted to the size needed for the page and sometimes it is not, in which case it is scaled up or down when it is drawn. Different PDF creation tools create PDF files in very different ways.

The actual pixel data can be compressed, and one of the compression filters (DCTDecode) is the same compression used in JPEG files (JPXDecode likewise corresponds to JPEG 2000). If you save DCTDecode data, it can be opened as a JPEG file, but it may need altering to include the colorspace data.

This image is then drawn in the PDF contents stream by a Do command and the image name (e.g. Im1). The image can be used multiple times and scaled, rotated or clipped - it takes whatever values (such as the current transformation matrix and clip) are set when the Do command is executed. Some things which appear as an image to the eye may also be made up of multiple images or not even images at all!
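As a rough illustration of the point about DCTDecode, here is a minimal sketch using only the standard javax.imageio classes. The class and method names (DctStreamSketch, looksLikeJpeg, decode, sampleDctData) are made up for this example, and the "stream" is a small JPEG generated in memory as a stand-in for bytes pulled out of a real PDF. It is not how JPedal does it.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

final class DctStreamSketch {

    static {
        // keep everything in memory rather than using temp files
        ImageIO.setUseCache(false);
    }

    /** True if the bytes start with the JPEG start-of-image marker (FF D8). */
    static boolean looksLikeJpeg(byte[] data) {
        return data.length >= 2
                && (data[0] & 0xFF) == 0xFF
                && (data[1] & 0xFF) == 0xD8;
    }

    /** Decodes DCT-compressed bytes with the standard ImageIO JPEG reader. */
    static BufferedImage decode(byte[] dctData) {
        try {
            BufferedImage img = ImageIO.read(new ByteArrayInputStream(dctData));
            if (img == null) {
                throw new IllegalArgumentException("data could not be decoded as a JPEG");
            }
            return img;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Stand-in for a DCTDecode stream: a 16x8 JPEG created in memory. */
    static byte[] sampleDctData() {
        BufferedImage src = new BufferedImage(16, 8, BufferedImage.TYPE_INT_RGB);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            ImageIO.write(src, "jpg", out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = sampleDctData();
        System.out.println("JPEG marker present: " + looksLikeJpeg(data));
        BufferedImage img = decode(data);
        System.out.println("Decoded size: " + img.getWidth() + "x" + img.getHeight());
    }
}
```

With real PDF data this only covers the simple cases. In my experience the standard reader can reject CMYK data, which is one reason the saved bytes may need altering first.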
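To make the "whatever values are set" part concrete: in PDF an image is painted into a 1x1 unit square, and the current transformation matrix in force at the DO (Do) command decides where that square lands on the page. The sketch below works out the drawn size from a matrix [a b c d e f] using java.awt.geom.AffineTransform. The class name, method names and the sample matrix are assumptions for illustration only.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

final class ImagePlacementSketch {

    /**
     * Returns {width, height} in points of an image drawn with the PDF
     * matrix [a b c d e f]: the lengths of the transformed unit-square edges.
     */
    static double[] drawnSize(double a, double b, double c, double d, double e, double f) {
        // AffineTransform's six-argument constructor takes the values in the same order as PDF
        AffineTransform ctm = new AffineTransform(a, b, c, d, e, f);
        Point2D w = ctm.deltaTransform(new Point2D.Double(1, 0), null);
        Point2D h = ctm.deltaTransform(new Point2D.Double(0, 1), null);
        return new double[] {
            Math.hypot(w.getX(), w.getY()),
            Math.hypot(h.getX(), h.getY())
        };
    }

    /** Effective resolution when 'pixels' raw samples cover 'points' on the page (72 points per inch). */
    static double effectiveDpi(int pixels, double points) {
        return pixels * 72.0 / points;
    }

    public static void main(String[] args) {
        // e.g. a contents stream containing: q 200 0 0 100 50 600 cm /Im1 Do Q
        double[] size = drawnSize(200, 0, 0, 100, 50, 600);
        System.out.println("Drawn at " + size[0] + " x " + size[1] + " points");
        // a raw image 800 pixels wide drawn 200 points wide
        System.out.println("Effective dpi: " + effectiveDpi(800, size[0]));
    }
}
```

The same raw image drawn twice with different matrices will give two different sizes (and effective resolutions), which is why the raw and drawn versions of an image can differ so much.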

All this means that if you want to extract images from a PDF, you need to assemble the image from all the raw data - it is not stored as a complete image file you can just rip out.
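Here is a minimal sketch of what "assembling" can mean in the very simplest case. It assumes the data has already been decompressed, is 8 bits per component in an RGB colorspace, is stored top row first, and has no mask or decode array; real PDFs have many more variations than this. The class and method names are invented for the example.

```java
import java.awt.image.BufferedImage;

final class RawImageSketch {

    /** Builds a BufferedImage from uncompressed 8-bit RGB samples (3 bytes per pixel, top row first). */
    static BufferedImage fromRawRgb(byte[] samples, int width, int height) {
        if (samples.length != width * height * 3) {
            throw new IllegalArgumentException("expected " + (width * height * 3)
                    + " bytes but got " + samples.length);
        }
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        int i = 0;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int r = samples[i++] & 0xFF;
                int g = samples[i++] & 0xFF;
                int b = samples[i++] & 0xFF;
                image.setRGB(x, y, (r << 16) | (g << 8) | b);
            }
        }
        return image;
    }

    public static void main(String[] args) {
        // a 2x1 image: one red pixel followed by one blue pixel
        byte[] samples = { (byte) 255, 0, 0, 0, 0, (byte) 255 };
        BufferedImage image = fromRawRgb(samples, 2, 1);
        System.out.println(Integer.toHexString(image.getRGB(0, 0)));
        System.out.println(Integer.toHexString(image.getRGB(1, 0)));
    }
}
```

The width, height and colorspace come from the image object in the PDF, not from the pixel data itself, which is why the data cannot just be saved out as a finished image file.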

There are also two versions of each image you can extract: the 'raw' version as stored in the file (sometimes much higher quality than what is drawn and sometimes exactly the same size) and the clipped/scaled version as it appears on the page. You can also scale the clip up onto the raw image to produce a higher quality image - see this example.

As with everything PDF, there is a lot of flexibility and lots of alternatives and options... 


Converting Java BufferedImage between Colorspaces

Posted by Mark Stephens on Thu, Oct 22, 2009 @ 02:15 AM

The Java BufferedImage class provides a very powerful 'abstraction' of images in Java. It lets you create a huge array of image types which can all be seamlessly accessed. One of its key features is that you can decide what type of image you have - black and white, grayscale or full ARGB.

When we create an image from a PDF in JPedal, we use an ARGB BufferedImage because it is the only mode which supports the color range and transparency which can be found in many PDFs.

Sometimes you want to convert a BufferedImage from one type to another, and Java makes this very easy. We have a method, convertColorspace(image, type), in our ColorSpaceConvertor class which does the conversion, but the code is very simple and is reproduced below. The type is a constant from BufferedImage, so making a page GRAY would need

image=ColorSpaceConvertor.convertColorspace(image, BufferedImage.TYPE_BYTE_GRAY);

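import java.awt.image.BufferedImage;
import java.awt.image.ColorConvertOp;

    /**
     * convert a BufferedImage to the given BufferedImage type
     * (for example BufferedImage.TYPE_BYTE_GRAY)
     */
    public static final BufferedImage convertColorspace(
        BufferedImage image,
        int newType) {

        try {
            BufferedImage raw_image = image;
            // create an empty image of the requested type, the same size as the original
            image =
                new BufferedImage(
                    raw_image.getWidth(),
                    raw_image.getHeight(),
                    newType);
            // passing null for the RenderingHints uses the default conversion settings
            ColorConvertOp xformOp = new ColorConvertOp(null);
            xformOp.filter(raw_image, image);
        } catch (Exception e) {
            // LogWriter is JPedal's own logging class
            LogWriter.writeLog("Exception " + e + " converting image");

        }

        return image;
    }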

 

One issue that can arise is that detail can be lost, so an alternative method is to create an image in the format you need and then draw the original image onto it. Here is how you can do this:

BufferedImage image_to_save2 = new BufferedImage(image_to_save.getWidth(), image_to_save.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
Graphics2D g2 = image_to_save2.createGraphics();
g2.drawImage(image_to_save, 0, 0, null);
g2.dispose(); // release the graphics context once drawing is finished
image_to_save = image_to_save2;

 


PDF to image quality

Posted by Mark Stephens on Thu, Jul 23, 2009 @ 04:48 AM

The PDF file format was designed to use vector graphics as much as possible. The problem with bitmaps is that you can only set whole pixels. While you can use some clever tricks such as anti-aliasing and hinting to smooth lines, you can't draw fractions of a pixel. If a line is half a pixel wide, you need to make it either zero or one pixel wide...

In principle, this means that as soon as you convert a PDF to an image, you can lose detail. This article describes one possible answer.

The solution we have adopted in our JPedal software is to provide an example which lets the user draw the PDF onto a much bigger image (providing more pixels), which can then be scaled down or adjusted to take account of different dpi values. You can get the details of the example here and the example source code here.

There are a number of possible scenarios so we added some flags to allow the user to choose how to improve the image quality.

Scenario One - Bigger initial image

We added a flag EXTRACT_AT_PAGE_SIZE which takes a set of values

//alternatively specify a page size (aspect ratio is preserved, so JPedal will do a best fit to this)
mapValues.put(JPedalSettings.EXTRACT_AT_PAGE_SIZE, new String[]{"2000","1500"});

JPedal will attempt to scale up the page to fit into this box, so a PDF page of dimensions 1000x750 would be scaled up by a factor of 2 to 2000x1500, while a page which is 1000x1000 would be scaled up by a factor of 1.5 to 1500x1500.
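The arithmetic behind a best fit like this can be sketched in a few lines. This is my own illustration of the calculation, assuming the scale is simply the smaller of the two width and height ratios; it is not JPedal's actual code, and the class and method names are invented.

```java
final class BestFitSketch {

    /** Largest uniform scale at which the page still fits inside the box. */
    static double bestFitScale(double pageW, double pageH, double boxW, double boxH) {
        return Math.min(boxW / pageW, boxH / pageH);
    }

    /** Page size in pixels after applying the best-fit scale, as {width, height}. */
    static int[] scaledSize(double pageW, double pageH, double boxW, double boxH) {
        double scale = bestFitScale(pageW, pageH, boxW, boxH);
        return new int[] {
            (int) Math.round(pageW * scale),
            (int) Math.round(pageH * scale)
        };
    }

    public static void main(String[] args) {
        // the two pages from the text, fitted into a 2000x1500 box
        System.out.println(bestFitScale(1000, 750, 2000, 1500));
        System.out.println(bestFitScale(1000, 1000, 2000, 1500));
    }
}
```

Taking the smaller ratio is what keeps the aspect ratio intact: the square page is limited by the 1500 height of the box even though the width would allow more.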

If you wanted to optimise a page for 96 dpi rather than 72 dpi (i.e. 1.333 times more detail), you could get the page size (it's in the PdfPageData class in JPedal), multiply the dimensions by 1.333 and then use this method to produce an image. If saving as a jpeg, you can then alter the dpi settings and it will display as intended at 96 dpi.

If you wanted the highest possible quality image, you could create a larger version (providing the maximum number of physical pixels) and then bicubically rescale it to the desired size.
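As a quick worked example of the dpi calculation, the sketch below converts a page size in points (72 to the inch) into pixel dimensions for a target dpi. The page size used (612x792, US Letter) is just an assumed input and the names are made up for illustration.

```java
final class DpiScalingSketch {

    /** PDF page sizes are measured in points, 72 to the inch. */
    static final double POINTS_PER_INCH = 72.0;

    /** Pixel dimensions {width, height} needed to show the page at the given dpi. */
    static int[] pixelSizeAtDpi(double pageWidthPts, double pageHeightPts, double dpi) {
        double scale = dpi / POINTS_PER_INCH; // 96 dpi gives 1.333...
        return new int[] {
            (int) Math.round(pageWidthPts * scale),
            (int) Math.round(pageHeightPts * scale)
        };
    }

    public static void main(String[] args) {
        int[] px = pixelSizeAtDpi(612, 792, 96);
        System.out.println(px[0] + " x " + px[1]);
    }
}
```

The two numbers, converted to Strings, are the sort of values you would then pass in as the page size.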
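For the rescaling step, one way to do a bicubic downscale with plain Java2D is shown below. This is a sketch using the standard RenderingHints rather than anything JPedal-specific, and the class and method names are my own.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

final class BicubicDownscaleSketch {

    /** Draws the large image into a smaller ARGB image using bicubic interpolation. */
    static BufferedImage downscale(BufferedImage large, int targetW, int targetH) {
        BufferedImage out = new BufferedImage(targetW, targetH, BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = out.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BICUBIC);
            g.setRenderingHint(RenderingHints.KEY_RENDERING,
                    RenderingHints.VALUE_RENDER_QUALITY);
            g.drawImage(large, 0, 0, targetW, targetH, null);
        } finally {
            g.dispose();
        }
        return out;
    }

    public static void main(String[] args) {
        // stand-in for a page rendered at twice the size wanted
        BufferedImage large = new BufferedImage(400, 300, BufferedImage.TYPE_INT_ARGB);
        BufferedImage small = downscale(large, 200, 150);
        System.out.println(small.getWidth() + " x " + small.getHeight());
    }
}
```

For very large reductions, a single bicubic step may not look as good as reducing in several smaller steps, so it is worth experimenting.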

All these processes are slower and use more memory but give a higher quality result - like so much in life, there is always a trade-off.

Scenario Two - Use the embedded images

PDFs contain raw images, which are often scaled down when drawn onto the PDF and detail is lost. So we thought, rather than scale down the image to fit the PDF, why not scale up the PDF to fit the image!

This worked very well until someone found a PDF with an image which was actually scaled down 47 times. When we used this file, the amount of memory required to create the page was way beyond the capabilities of any current machine...

So we added a limit as to how much the page could be scaled up.

//do not scale above this figure
mapValues.put(JPedalSettings.EXTRACT_AT_BEST_QUALITY_MAXSCALING, Integer.valueOf(2));

So here we allow the page to scale up to twice its size. If the raw image was bigger than that, it will be scaled down, but the results should still be better. 
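To give a feel for why a limit is needed, here is a small sketch of the clamping and of the memory an ARGB page image would need at a given scale. The page size (612x792 points, US Letter) is my assumption since the original file's page size is not given, and the names are invented; this is not JPedal's code.

```java
final class ScaleLimitSketch {

    /** Never scale the page up by more than maxScaling. */
    static double clampScale(double wantedScale, double maxScaling) {
        return Math.min(wantedScale, maxScaling);
    }

    /** Bytes needed for an ARGB image (4 bytes per pixel) of the page at this scale. */
    static long argbBytes(double pageW, double pageH, double scale) {
        long w = Math.round(pageW * scale);
        long h = Math.round(pageH * scale);
        return w * h * 4L;
    }

    public static void main(String[] args) {
        double wanted = 47; // the image in the problem file was scaled down 47 times
        double used = clampScale(wanted, 2);
        System.out.println("Unlimited: " + argbBytes(612, 792, wanted) + " bytes");
        System.out.println("Limited:   " + argbBytes(612, 792, used) + " bytes");
    }
}
```

On those assumptions the unlimited version needs over 4 GB for a single page, against under 8 MB with the limit of 2.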

Lastly we need a way to choose which of these 2 rules takes priority, so we added a flag

//which takes priority (default is false)
mapValues.put(JPedalSettings.PAGE_SIZE_OVERRIDES_IMAGE, Boolean.TRUE);

So we now have a flexible way to generate higher quality images and the user can choose the best tradeoffs for them in terms of speed, memory, quality, etc.

 
