Java PDF Blog

PDF solutions for big and small customers

Java and PDF development - our personal experiences and discoveries


Punctuation?

Posted by kieran france on Mon, Aug 16, 2010 @ 03:08 AM

So what is punctuation?

This may seem like a simple question, yet I find myself asking it more and more often whilst working on our PDF search and text extraction. Today I found myself asking it once again: what is punctuation?

According to dictionary.com, punctuation is:

"The practice or system of using certain conventional marks or characters in writing or printing in order to separate elements and make the meaning clear, as in ending a sentence or separating clauses."

Unfortunately this is not all that useful, as the English language has many different forms of punctuation and often uses the same symbols in a multitude of ways. We even see punctuation used for purposes other than sentence structure, for example in emoticons.

For instance, the character '.' could be a full stop, a decimal point, or part of '...'.
In a PDF the character '.' could also be used in a multitude of other ways to help format a page and improve the flow of the text.
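
To make the ambiguity concrete, here is a minimal sketch of how one might guess the role of a '.' from its neighbouring characters. The class name, the enum and the rules themselves are assumptions made for illustration; this is not how JPedal handles it, and abbreviations such as "e.g." would still be misread as full stops.

```java
// Illustrative sketch only: a neighbour-based heuristic for the character '.'.
class DotClassifier {

    enum DotKind { DECIMAL_POINT, ELLIPSIS, FULL_STOP, OTHER }

    /** Guesses the role of the '.' at the given index by looking at its neighbours. */
    static DotKind classify(String text, int index) {
        if (index < 0 || index >= text.length() || text.charAt(index) != '.') {
            throw new IllegalArgumentException("No '.' at index " + index);
        }
        // Treat the start and end of the text as whitespace.
        char prev = index > 0 ? text.charAt(index - 1) : ' ';
        char next = index + 1 < text.length() ? text.charAt(index + 1) : ' ';

        // A neighbouring '.' means a run of dots: an ellipsis or a dot leader.
        if (prev == '.' || next == '.') {
            return DotKind.ELLIPSIS;
        }
        // Digits on both sides, as in "3.14".
        if (Character.isDigit(prev) && Character.isDigit(next)) {
            return DotKind.DECIMAL_POINT;
        }
        // A letter or digit before, whitespace (or end of text) after.
        if (Character.isLetterOrDigit(prev) && Character.isWhitespace(next)) {
            return DotKind.FULL_STOP;
        }
        return DotKind.OTHER;
    }

    public static void main(String[] args) {
        String sample = "Pi is 3.14. Well... maybe.";
        for (int i = 0; i < sample.length(); i++) {
            if (sample.charAt(i) == '.') {
                System.out.println(i + ": " + classify(sample, i));
            }
        }
    }
}
```

Even this small example needs an OTHER bucket, which is rather the point: the rules run out long before the real-world uses do.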

This is just one trivial example of many. I keep finding cases where a search for whole words only, or an extraction of text as a word list, returns results that are thrown off by the use of punctuation.

When searching or extracting text, what of the '-' character?
Is the term "multi-tasking" one word or two?

If it's one word, should we allow it to contain the '-'?

How do we check if this is a valid use within a word?
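
One possible answer, sketched below, is to accept a '-' as part of a word only when it has a letter or digit directly on both sides, and to let the caller decide whether a hyphenated term counts as one word or several. The class, the method and the `keepHyphenated` flag are hypothetical names for this example, not part of our library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: a word splitter that can keep in-word hyphens.
class HyphenAwareTokenizer {

    // Runs of letters/digits, optionally joined to further runs by single hyphens.
    private static final Pattern WORD =
            Pattern.compile("[\\p{L}\\p{N}]+(?:-[\\p{L}\\p{N}]+)*");

    /**
     * Returns the words in the text. With keepHyphenated set, "multi-tasking"
     * is one word; otherwise it is split into "multi" and "tasking".
     */
    static List<String> words(String text, boolean keepHyphenated) {
        List<String> result = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            String token = matcher.group();
            if (keepHyphenated) {
                result.add(token);
            } else {
                for (String part : token.split("-")) {
                    result.add(part);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "multi-tasking is hard - really";
        System.out.println(words(sample, true));
        System.out.println(words(sample, false));
    }
}
```

A '-' with a space on either side never matches the pattern, so it is dropped as a separator in both modes. Whether that is the right call for a given document is exactly the open question.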

 

What of one word split across two lines with '-' at the end of the first line?

Is this one word or two?

What of the '-'?
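
A common heuristic for the line-break case is to join the two halves when the first line ends in a letter followed by '-' and the next line starts with a lowercase letter. The sketch below shows that rule under those assumptions; the class and method names are made up for this example.

```java
// Illustrative sketch only: rejoining a word hyphenated across a line break.
class LineJoiner {

    /** Joins two consecutive lines, removing a line-break hyphen where one is likely. */
    static String joinLines(String first, String second) {
        String left = first.replaceAll("\\s+$", "");   // drop trailing whitespace
        String right = second.replaceAll("^\\s+", ""); // drop leading whitespace
        if (right.isEmpty()) {
            return left;
        }
        boolean looksLikeBrokenWord = left.length() >= 2
                && left.endsWith("-")
                && Character.isLetter(left.charAt(left.length() - 2))
                && Character.isLowerCase(right.charAt(0));
        if (looksLikeBrokenWord) {
            // Remove the hyphen and join the two halves directly.
            return left.substring(0, left.length() - 1) + right;
        }
        return left + " " + right;
    }

    public static void main(String[] args) {
        System.out.println(joinLines("our pdf text extrac-", "tion works"));
        System.out.println(joinLines("multi-", "tasking"));
    }
}
```

The second call in `main` shows the weakness: a genuinely hyphenated term that happens to break at its hyphen loses the '-' as well, and telling the two cases apart would need a dictionary or some other outside knowledge.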

I'm not writing this to provide a concrete solution, nor am I looking to be given one; I don't believe one exists, because of the many ways punctuation can be used in text documents.

These questions arise often, as everyone producing PDFs produces them in a different style. They are just a few of the things that make my job interesting.


Search in continuous mode and future plans

Posted by kieran france on Thu, Feb 18, 2010 @ 04:59 AM
For some time the JPedal library has been able to search only in single page mode. For our release of JPedal 4.0 we have begun to extend this functionality to the other view modes. As a start we have added search to the continuous single page view mode, with plans to extend it to the remaining view modes.

To support this we have needed to alter a few of our existing public methods so that highlights can be assigned to, or retrieved from, a particular page.


On top of this, the highlights are no longer stored in a Rectangle array. They are now stored in a hash map, using the page number as the key and a Vector_Rectangle (org.jpedal.utils.repositories.Vector_Rectangle) as the associated value.
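
The general shape of a page-keyed highlight store can be sketched as follows. This is not the JPedal source: it is a simplified stand-in that uses java.util.List in place of Vector_Rectangle, and the class and method names are assumptions chosen for the example.

```java
import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: highlights keyed by page number.
// List<Rectangle> stands in for JPedal's Vector_Rectangle.
class PageHighlights {

    private final Map<Integer, List<Rectangle>> highlightsByPage = new HashMap<>();

    /** Adds highlight areas to the given page. */
    void addHighlights(Rectangle[] areas, int page) {
        List<Rectangle> list =
                highlightsByPage.computeIfAbsent(page, key -> new ArrayList<>());
        for (Rectangle area : areas) {
            list.add(new Rectangle(area)); // copy so later changes by the caller do not leak in
        }
    }

    /** Returns the highlights for one page, or an empty array if it has none. */
    Rectangle[] getHighlightAreas(int page) {
        List<Rectangle> list =
                highlightsByPage.getOrDefault(page, Collections.emptyList());
        return list.toArray(new Rectangle[0]);
    }

    /** Removes one highlight from the given page, if present. */
    void removeHighlight(Rectangle area, int page) {
        List<Rectangle> list = highlightsByPage.get(page);
        if (list != null) {
            list.remove(area);
            if (list.isEmpty()) {
                highlightsByPage.remove(page);
            }
        }
    }

    public static void main(String[] args) {
        PageHighlights store = new PageHighlights();
        store.addHighlights(new Rectangle[] { new Rectangle(10, 20, 100, 12) }, 3);
        System.out.println("Page 3: " + store.getHighlightAreas(3).length);
        System.out.println("Page 4: " + store.getHighlightAreas(4).length);
    }
}
```

Keying by page means that clearing or redrawing one page's highlights never touches another page's, which is what a continuous view needs when several pages are on screen at once.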


We have also moved the page text areas and text orientation into hash maps. To store this information it must be retrieved from PdfStreamDecoder after decodePageContent(PdfObject pdfObject, int minX, int minY, GraphicsState newGS, byte[] pageStream) is called, as each call to this method overwrites the locally stored data for the previous page.

The following methods have changed in version 4.0 to allow highlights to be stored for multiple pages.

Commands.ExecuteCommands(Commands.HIGHLIGHT, new Rectangle[]{})
has become
Commands.ExecuteCommands(Commands.HIGHLIGHT, new Object[]{Rectangle[] areas, int page})

GethighlightAreas()
has become
GethighlightAreas(int page)

setFoundParagraph(int x, int y)
has become
setFoundParagraph(int x, int y, int page)

addHighlights(Rectangle[], boolean)
has become
addHighlights(Rectangle[], boolean, int page)

RemoveFoundTextArea(Rectangle)
has become
RemoveFoundTextArea(Rectangle, int page)

RemoveFoundTextAreas(Rectangle[])
has become
RemoveFoundTextAreas(Rectangle[], int page)

 

As you will notice, each of the methods above now takes an additional integer parameter called page. This is the number of the page the call should apply to.


As well as the above methods the following method has also changed.

Display.initRenderer(Rectangle[] areas, Graphics2D g2,Border myBorder,int indent)
has become
Display.initRenderer(Map areas, Graphics2D g2,Border myBorder,int indent)

The above method originally received the Rectangle array we used for highlighting. We have updated it to accept a Map, as this is how the highlights are now stored.

 

As mentioned earlier in this article, PdfStreamDecoder holds a local copy of the text areas and orientation when a page's content is decoded. To retrieve this data we have added the following two methods.

Vector_Rectangle getTextAreas()

Vector_Int getTextDirections()
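
Because each decode overwrites the decoder's local copy, the data has to be copied out per page straight after decoding. The sketch below shows that pattern against a hypothetical `PageDecoder` interface, which is only a stand-in for PdfStreamDecoder: its methods, and the use of List<Rectangle> and int[] in place of Vector_Rectangle and Vector_Int, are assumptions made for this example.

```java
import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: caching per-page text data from a decoder
// that remembers just the most recently decoded page.
class PageTextCache {

    /** Hypothetical stand-in for PdfStreamDecoder; not the real JPedal API. */
    interface PageDecoder {
        void decodePage(int page);
        List<Rectangle> getTextAreas();
        int[] getTextDirections();
    }

    private final Map<Integer, List<Rectangle>> textAreas = new HashMap<>();
    private final Map<Integer, int[]> textDirections = new HashMap<>();

    /** Decodes a page and immediately copies out its text areas and directions. */
    void decodeAndStore(PageDecoder decoder, int page) {
        decoder.decodePage(page);
        // Copy now: the next decode call replaces the decoder's local data.
        textAreas.put(page, new ArrayList<>(decoder.getTextAreas()));
        textDirections.put(page, decoder.getTextDirections().clone());
    }

    /** Returns the stored text areas for a page, or null if it was never decoded. */
    List<Rectangle> areasFor(int page) {
        return textAreas.get(page);
    }

    /** Returns the stored text directions for a page, or null if it was never decoded. */
    int[] directionsFor(int page) {
        return textDirections.get(page);
    }
}
```

The copies matter: storing the decoder's own collection in the map would leave every page pointing at whatever was decoded last.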

In the releases to follow we will be moving more functionality, such as highlighting with the mouse, extraction and the right click menu, into the continuous single page view mode and then into the other view modes.
