Java PDF Blog

PDF solutions for big and small customers

Java and PDF development - our personal experiences and discoveries


Punctuation?

Posted by kieran france on Mon, Aug 16, 2010 @ 03:08 AM

So what is punctuation?

This may seem like a simple question, yet I find myself asking it more and more often whilst working on our PDF search and text extraction. Today I found myself asking it once again. What is punctuation?

According to dictionary.com, punctuation is:

"The practice or system of using certain conventional marks or characters in writing or printing in order to separate elements and make the meaning clear, as in ending a sentence or separating clauses."

Unfortunately this is not all that useful, as the English language has many different forms of punctuation and often uses the same symbols in a multitude of ways. We can even see punctuation used in ways other than for sentence structure, for example as emoticons.

For instance the character '.' could be a full stop, it could be a decimal point or it could even be a part of '...'
In a PDF the character '.' could also be used in a multitude of other ways to help format a page and improve the flow of the text.

This is just one trivial example of many. I keep finding cases, when searching for whole words only or when extracting text as a word list, where the results are thrown off by the use of punctuation.

When searching or extracting text, what of the '-' character?
Is the term "multi-tasking" one word or two?

If it's one word, should we allow it to contain the '-'?

How do we check if this is a valid use within a word?

 

What of one word split across two lines with '-' at the end of the first line?

Is this one word or two?

What of the '-'?
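
To make the line-break question concrete, here is a minimal sketch of one possible heuristic. It is not what JPedal does, and the class and method names are made up for illustration. It assumes the text has already been extracted as a list of lines, and it joins two lines only when the first ends in '-' and the next starts with a lowercase letter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
class HyphenJoiner {

    /**
     * Joins a word split across two lines by a trailing '-'.
     * Heuristic: the '-' is dropped and the lines are merged only when
     * the following line starts with a lowercase letter.
     */
    static List<String> joinLineBreakHyphens(List<String> lines) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < lines.size()) {
            String current = lines.get(i);
            // Keep absorbing following lines while the text still ends in '-'
            while (current.endsWith("-")
                    && i + 1 < lines.size()
                    && !lines.get(i + 1).isEmpty()
                    && Character.isLowerCase(lines.get(i + 1).charAt(0))) {
                current = current.substring(0, current.length() - 1) + lines.get(i + 1);
                i++;
            }
            out.add(current);
            i++;
        }
        return out;
    }
}
```

The weakness is easy to see: "multi-" followed by "tasking" becomes "multitasking", which loses a hyphen that may well have belonged to the word. That is exactly the kind of ambiguity described above.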

I'm not writing this to provide a concrete solution, nor am I looking to be provided with one, as I believe there isn't one given the way punctuation can be used in text documents.

These questions arise often, as everyone producing PDFs produces them in different styles. They are just a few of the things that make my job interesting.


Understanding the PDF file format - Color

Posted by Mark Stephens on Fri, May 14, 2010 @ 08:39 AM
Color is a complex topic in PDF. This article helps to explain how it works.

Color in PDF files

Color can be defined in different ways in a PDF. This is because PDF is a very flexible format with lots of uses, and different tasks have come up with different ways to talk about colors. Each way of defining colors is called a colorspace.

Televisions and computers use 3 'base' colours generated by a Red, a Green and a Blue cathode. The output of these would be mixed together in different amounts to give all the colors you see on the television screen (the RGB colorspace).

A printer would usually print using a combination of 4 inks (Cyan, Magenta, Yellow and Key, which is really black) to produce color prints. Or they might use a selection of known inks and print them one at a time (Separation colorspaces).

Because PDFs are used in digital, print and lots of other environments, the PDF specification allows you to choose the most appropriate and natural way to think about color for that process. When a PDF is displayed the software has to work out how to convert the color into an appropriate form (for example a print PDF using CMYK needs to be displayed on an RGB computer screen).

Color conversion

Converting between colors is not always a straightforward task. For some conversions there is a simple mathematical formula, while for others there are complex translation tables called profiles. Even with a formula, there are different versions available which give different results. There are also fast, approximate methods and slower, more accurate ones. All PDF tools have to choose the methods which offer the best compromise for their requirements. Xpdf, for example, usually uses a formula to handle CMYK, which is why some shades of black or white can look different compared with Adobe Acrobat, which uses a profile.
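
As an illustration of the "simple formula" approach, here is a minimal sketch of one widely quoted approximate CMYK to RGB conversion. It is only one of the several formula variants mentioned above, and I am not claiming it is the exact formula used by Xpdf, Acrobat or JPedal. The class name is made up for the example.

```java
// Hypothetical example class: a naive, profile-free CMYK to RGB conversion.
class NaiveCmyk {

    /**
     * Converts CMYK components (each in the range 0.0 to 1.0) to RGB
     * values in the range 0 to 255 using the common approximation
     * channel = 255 * (1 - ink) * (1 - k).
     */
    static int[] toRgb(double c, double m, double y, double k) {
        int r = (int) Math.round(255 * (1 - c) * (1 - k));
        int g = (int) Math.round(255 * (1 - m) * (1 - k));
        int b = (int) Math.round(255 * (1 - y) * (1 - k));
        return new int[] {r, g, b};
    }
}
```

A formula like this is fast, but it knows nothing about real inks or screens, which is why a profile-based conversion can give visibly different shades.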

Profiles

The most accurate way to convert between colors is to use a profile. When I wrote the color handling code for our Java PDF viewer, I needed to convert all the colors in a PDF file into sRGB so that I could use them in Java. Wherever possible, I used profiles to give the closest match to what Adobe Acrobat does.
 
More help on color conversions
 
If you need to understand color and color conversions in more detail, I have found the best source of information is Wikipedia. Good luck and let us know if you come across any interesting tips...


Understanding the PDF file format - PDF password protection

Posted by Mark Stephens on Thu, Apr 22, 2010 @ 07:47 AM

One of the main reasons people do not put their content online is security - they worry that the material can be repeatedly copied and that they will have no control over it. So Acrobat files can be protected with 2 passwords - an Owner password, which also allows the user to alter the file, and a more limited User password. If you do not have a password, you can still copy the file but not open it.

PDF encryption relies on the fact that some mathematical functions are much harder to reverse than to do. For example, the question 'what is 6 times 7?' is easy - it's 42 (indeed the answer to Life, the Universe and Everything according to the writer Douglas Adams). But it is much harder to do in reverse - given 42 as a value, there are an infinite number of sums that produce it. Encryption works in a similar way - given a password, it's easy to encode and decode some data. But it's much harder to have some data and work out the password from it. The only way to work it out is to try random combinations - and there are a lot of them...

Encryption is actually done using a key value (usually 40 or 128 bits in size). All the bytes are changed using the key - unless they are altered back using the same key, they are rubbish and the file cannot be opened. A key with 128 bits has far more possible values than one with 40 bits, so it is much more secure.

The key is calculated for each PDF object in the file using local data for that object and the password - that makes it much harder to crack because the key keeps changing throughout the data. If you want to decipher something, the longer the example you have, the easier it is.
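
To show what "a key per object" can look like, here is a minimal sketch based on my reading of the standard (RC4) security handler in the PDF Reference: the file-level key is extended with the low-order bytes of the object number and generation number, hashed with MD5, and the result is truncated. The class and method names are hypothetical, deriving the file-level key from the password is a separate step not shown here, and this is not a copy of JPedal's code.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical sketch of per-object key derivation (RC4 variant).
class ObjectKeySketch {

    /**
     * Derives the key for one object from the file-level key plus
     * the object's own number and generation number.
     */
    static byte[] objectKey(byte[] fileKey, int objectNumber, int generation) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            md5.update(fileKey);
            // Low-order 3 bytes of the object number and low-order 2 bytes
            // of the generation number, least significant byte first
            md5.update(new byte[] {
                (byte) objectNumber,
                (byte) (objectNumber >> 8),
                (byte) (objectNumber >> 16),
                (byte) generation,
                (byte) (generation >> 8)
            });
            byte[] digest = md5.digest();
            // The object key is the file key length plus 5 bytes, capped at 16
            int length = Math.min(fileKey.length + 5, 16);
            return Arrays.copyOf(digest, length);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

Because the object number feeds into the hash, two objects in the same file end up encrypted with different keys even though the password is the same.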

So this makes PDF a secure medium - the file cannot be opened and the data is securely hidden, so it is very hard to get it out of the raw file. The main issue is the non-technical one of how to keep the passwords secure.

Password encryption is now automatically supported in our free Ebook reader. If you upload an encrypted PDF file, the additional Java files needed for encryption will be added and the user will need a password to open it. 


Working out PDF page size in inches or centimetres

Posted by Mark Stephens on Thu, Mar 18, 2010 @ 10:07 AM

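The size of a PDF page is generally defined by the CropBox or MediaBox setting for each page. Each box is a set of 4 numbers giving the x,y of the lower-left corner followed by the x,y of the upper-right corner, so when the box starts at 0 0 the last two numbers are the width and height. The values are in PDF units, which by default are 1/72 of an inch (so you can think of them as pixels at 72 dpi). A common value is 0 0 595 842 for an A4 page.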

However, most people are interested in actual units and that is what Adobe Reader or Acrobat displays. Here is the output from a file with CropBox[0 0 585 832] and MediaBox[0 0 585 832]

So where does this number of centimetres come from? 

Standard dpi is 72 dots per inch so we can convert the CropBox and MediaBox to inches by dividing these numbers by 72. This gives us 8.125 inches by 11.556 inches.

There are 2.54 centimetres in an inch, so multiplying by this we get 20.64 cm by 29.35 cm.

So that is how Adobe creates the size from the raw PDF Crop or Media box sizes.
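
The arithmetic above is easy to put into code. This is a minimal stand-alone sketch, not JPedal's API: the class and method names are made up, and it assumes the box starts at 0 0 so that the last two numbers are the width and height.

```java
// Hypothetical helper: converts PDF box units to inches and centimetres.
class PageSizeUnits {

    static final double UNITS_PER_INCH = 72.0; // default PDF units per inch
    static final double CM_PER_INCH = 2.54;

    /** Converts a length in PDF units to inches. */
    static double toInches(double units) {
        return units / UNITS_PER_INCH;
    }

    /** Converts a length in PDF units to centimetres. */
    static double toCentimetres(double units) {
        return toInches(units) * CM_PER_INCH;
    }
}
```

For the CropBox in the example, you would call toInches(585) and toInches(832) for the size in inches, and toCentimetres with the same values for the size in centimetres, rounding the results for display.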

If you are interested in using JPedal to generate the size values of each page, here is a simple example.

 


Search in continuous mode and future plans

Posted by kieran france on Thu, Feb 18, 2010 @ 04:59 AM
For some time the JPedal library has only been able to search in single page mode. For our release of JPedal 4.0 we have begun to expand this functionality to the other view modes. As a start we have added search to the continuous single page view mode, with plans to expand it into the other view modes.

To allow for this new functionality we have needed to make alterations to a few of our existing public methods in order to allow for highlights to be assigned to or retrieved from a particular page.


On top of this the highlights are no longer stored in a Rectangle array. The highlights are stored in a HashMap using the page number as the key and a Vector_Rectangle (org.jpedal.utils.repositories.Vector_Rectangle) as the associated value.
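
The idea of keying highlights by page number can be sketched in plain Java. This is an illustration only and not JPedal's implementation: the class and its methods are hypothetical, and a java.util.List stands in for Vector_Rectangle so the example is self-contained.

```java
import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a per-page highlight store.
class PageHighlights {

    // Page number -> highlight areas on that page
    private final Map<Integer, List<Rectangle>> byPage = new HashMap<>();

    /** Adds highlight areas to the given page. */
    void addHighlights(Rectangle[] areas, int page) {
        List<Rectangle> list = byPage.computeIfAbsent(page, p -> new ArrayList<>());
        Collections.addAll(list, areas);
    }

    /** Returns the highlight areas for a page (empty if there are none). */
    Rectangle[] getHighlightAreas(int page) {
        List<Rectangle> list = byPage.get(page);
        return list == null ? new Rectangle[0] : list.toArray(new Rectangle[0]);
    }

    /** Removes one highlight area from the given page, if present. */
    void removeHighlight(Rectangle area, int page) {
        List<Rectangle> list = byPage.get(page);
        if (list != null) {
            list.remove(area);
        }
    }
}
```

With a structure like this, every call needs to say which page it is aimed at, which is why the public methods gain a page parameter.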


We have also moved the page text areas and text orientation into hash maps. To store this information, it must be retrieved from PdfStreamDecoder straight after decodePageContent (PdfObject pdfObject, int minX, int minY, GraphicsState newGS, byte[] pageStream) is called, as each call to this method overwrites the locally stored data for the previous page.

The following methods have changed in version 4.0 to allow highlights to be stored for multiple pages.

Commands.ExecuteCommands(Commands.HIGHLIGHT, new Rectangle[]{})
has become
Commands.ExecuteCommands(Commands.HIGHLIGHT, new Object[]{Rectangle[] areas, int page})

GethighlightAreas()
has become
GethighlightAreas(int page)

setFoundParagraph(int x, int y)
has become
setFoundParagraph(int x, int y, int page)

addHighlights(Rectangle[], boolean)
has become
addHighlights(Rectangle[], boolean, int page)

RemoveFoundTextArea(Rectangle)
has become
RemoveFoundTextArea(Rectangle, int page)

RemoveFoundTextAreas(Rectangle[])
has become
RemoveFoundTextAreas(Rectangle[], int page)

 

As you will notice the above methods have had a new integer added as an input called page. This value is the page number to which you wish to direct the method.


As well as the above methods the following method has also changed.

Display.initRenderer(Rectangle[] areas, Graphics2D g2,Border myBorder,int indent)
has become
Display.initRenderer(Map areas, Graphics2D g2,Border myBorder,int indent)

The above method originally received the Rectangle array we used to use for highlighting. We have updated the method to accept a Map, as this is how the highlights are now stored.

 

Earlier in this article we mentioned that PdfStreamDecoder holds a local copy of the text areas and orientation when a page's content is decoded. In order to retrieve this data we have added the following two methods.

Vector_Rectangle getTextAreas()

Vector_Int getTextDirections()

In the releases to follow we will be moving more functionality, such as highlighting with the mouse, extraction and the right click menu, into the continuous single page view mode and then into the other view modes.


PDF questions - where should you ask them

Posted by Mark Stephens on Tue, Dec 29, 2009 @ 08:38 AM

The PDF file format is a very common format, and also quite a complex one. So you may well want to ask some questions. What you need is somewhere where there are knowledgeable experts prepared to share knowledge and help out. As a late Christmas present, here are some recommendations on places we think are really good.

PlanetPdf has been going a long time. Its forum has some very talented developers (Aandi Inston is a regular) who patiently answer all sorts of questions.

4 x PDF Help is a relatively new site from part of the crowd who helped set up PlanetPdf. I already recognise some of the 'rockstars' of the PDF world answering questions on it. It uses the interface from stackoverflow so people can vote on issues and the most popular float to the top.

Stackoverflow has really taken the developer world by storm in 2009, making it much easier to get technical answers. Although a general site, it has a regular selection of PDF related questions and lots of knowledgeable answers. 

I have seen quite a few questions about iText on general forums, and the best place to get answers to these is the iText mailing lists, where the actual iText developers are very active in answering questions.

And lastly, if you are using JPedal, remember we offer a support forum for questions.

Wherever you choose to post your question, always remember:-

1. Be specific - the more precise the question the more likely you are to get an answer.

2. Be courteous and polite.

3. Don't expect other people to do your work for you. People on these forums are happy to help you learn but not there to write code for you.


New year resolutions

Posted by Mark Stephens on Thu, Dec 24, 2009 @ 05:55 AM

The end of the year is a great time to step back, have a good think and take perspective, and come up with some new plans. This applies to work just as much as personal matters. It is also a chance to get away from the keyboard, and I find that is when all the best ideas pop into your head.

The important thing with any resolution (personal or work) is to make it clear and achievable - vague and woolly is bound to fail. So I have set myself one clear task to start the year, which is to tidy up the API and get rid of clutter (we have lots of deprecated methods which should now go, although when we remove a method we do need to clearly document the change).

I will doubtless come up with lots of ideas for JPedal (and we already have lots of plans for 2010), but that is one clear, hopefully realistic task set and publicly committed to.

So what are you going to do for your New Year's resolution?

 


PDF page size in bytes

Posted by Mark Stephens on Sun, Oct 18, 2009 @ 04:49 AM

An interesting question on our forums made me look at PDF files in a new light. We know how big a PDF file is in bytes, but how big is each page?

To answer this, you need to understand a little about the contents of a PDF file and how a file is constructed. A PDF file is a dump of PDF objects. It consists of the objects themselves and a trailer - metadata so that each object can be found. The objects usually consist of a set of key values and often a compressed binary stream of data. The binary data is usually image data or colour data or the set of instructions used to draw the page. The data is decompressed in memory when the object is read.

A page itself does not have a size - you cannot say it  starts at a certain point and ends at another. What you could say however is that a page consists of a set of objects:-

1. The Page objects which describe the page and contain the binary stream of page instructions used to construct its contents.

2. The local Resources objects which contain colors, fonts and images used on the page.

3. Global Resource objects (which may be used on any page) and also consist of colors, fonts and images.

4. A proportion of the PDF file metadata.

The last item is probably small enough that it can be reasonably ignored, and we can also reasonably ignore the non-binary content of the objects.

So a good guess for a page size is the sum of the binary streams which might be used on it. The compressed size probably provides a good guess as to the PDF page size in the PDF file, and the uncompressed size might well be an equally good guess at how much an unrendered page (i.e. not drawn) would use in memory if you needed that.
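
The estimate described above can be sketched as a simple sum. The code assumes you have already collected the compressed and uncompressed sizes of every stream a page might use (page contents plus local and global resources). How you collect them depends on your PDF library, so StreamInfo and the method names here are hypothetical.

```java
import java.util.List;

// Hypothetical sketch: estimating a page's size from its streams.
class PageSizeEstimate {

    /** Sizes of one binary stream, as stored in the file and once decompressed. */
    static final class StreamInfo {
        final long compressedBytes;
        final long uncompressedBytes;

        StreamInfo(long compressedBytes, long uncompressedBytes) {
            this.compressedBytes = compressedBytes;
            this.uncompressedBytes = uncompressedBytes;
        }
    }

    /** Guess at the page's share of the file on disk. */
    static long sizeInFile(List<StreamInfo> streams) {
        long total = 0;
        for (StreamInfo s : streams) {
            total += s.compressedBytes;
        }
        return total;
    }

    /** Guess at the memory an unrendered page would need. */
    static long sizeInMemory(List<StreamInfo> streams) {
        long total = 0;
        for (StreamInfo s : streams) {
            total += s.uncompressedBytes;
        }
        return total;
    }
}
```

Note that a global resource shared by several pages is counted in full for each page that uses it, so the per-page figures can add up to more than the size of the file.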


PDF to image

Posted by Mark Stephens on Thu, Jun 18, 2009 @ 06:15 AM

PDF to image conversion is something we have spent a lot of time developing for our Java PDF library and it does it really well as a server task.

But sometimes, it would be nice to have a quick and easy way to convert a single PDF file with a simple web form. So we have built a free online PDF to image service available on the internet at http://www.jpedal.org/pdf_to_image.php

This is the first release and we will be adding lots of new enhancements to it. 


PDF font technologies

Posted by Mark Stephens on Tue, May 26, 2009 @ 06:22 AM

The thing which makes PDF fonts so confusing for many people is the number of different font technologies which can be used in a PDF file. The PDF file specification has been around for 16 years and in that time a number of different font technologies have appeared (as much for 'business' as technological reasons). So this article will briefly explain some of the main font technologies you can use with PDF files.

Adobe was one of the pioneers of producing high quality fonts for electronic publishing. Until they invented Postscript, most fonts were bitmapped images which had to be specially created for different font sizes. With Postscript, fonts could be defined as graceful lines and curves with detailed instructions about how to behave at certain sizes (so if you drew the letters at a tiny size, thin lines which made up critical parts of the letter would not disappear).

Adobe had two original font types - Type 3 and Type 1 fonts. Type 3 does not have all the clever features of Type 1 and generally produces less than perfect results, but Type 1 works very well and was very successful. Type 1 fonts work with Adobe Type Manager and usually have .pfb or .afm endings. There is also a variation of Type1 called CFF (Compact Font Format).

When Microsoft decided to add proper font support to Windows, they did not want to adopt this solution (possibly to avoid paying royalties to Adobe), so they developed another font technology with Apple called TrueType. This used the same idea of defining fonts as a set of shapes, but is totally incompatible with Type1. These fonts are what you generally find in the fonts directory if you use Windows (they have the ending .ttf).

TrueType and Type1 are comparable - they do essentially the same thing in different ways and both have advantages and disadvantages - Type1 arguably uses a superior method for defining curves while TrueType offers better CMAP capabilities - but either of them works fine for most users. Indeed the latest file format (OpenType) takes features from both TrueType and Type1 and is a result of improved relations between Adobe and Microsoft.

Because the PDF file format is backward compatible it supports all of these types of fonts. So font advice should be to avoid Type3 and stick to Type1, TrueType or OpenType, depending on what fonts you have access to. You just need to understand that they are all different, incompatible implementations of the same idea of defining characters as a set of shapes with rules to ensure good quality at all sizes.

If you embed the fonts, you can generally ignore the font types and leave the PDF viewer to handle them.
