Sunday, March 27, 2011

Performing Optical Character Recognition on PDFs from ColdFusion using a Java or .NET library?

I am looking to take a PDF and extract any text from it. I then want to index that text with ColdFusion's built-in Verity search so the contents can be searched.

Are there any libraries that already do this well? I am including Java and .NET libraries (Java preferred) in scope, since both can be called from CF.

Any insights or experiences would be greatly appreciated... thanks!

Edit: As far as I know, CF can index PDF files when the text is embedded in the PDF. The PDFs I have to deal with contain the text only as scanned images.

From stackoverflow
  • Verity should be able to index PDF files by default:

    http://livedocs.adobe.com/coldfusion/6/Developing_ColdFusion_MX_Applications_with_CFML/indexSearch2.htm#1142322

    Jas Panesar : Indexing PDF files works when the text is embedded in the PDF. The PDFs I'm having to deal with have the text scanned as an image. I'll check out this link some more, though.
  • Ray Camden has an eight-part series on working with PDFs in ColdFusion 8.

    Part 7 of the series covers using DDX to get text out of a PDF.

    I'm not sure this will cover your OCR needs, but it may still be worth a look.

  • If you are able to run your own software (e.g. on a dedicated server or VPS), you could investigate using Tesseract OCR with cfexecute to convert the PDFs to text.

    Jas Panesar : I have my own servers, so this looks like it has potential. I had come across this years ago and couldn't remember the name. Thanks! It seems to only process TIFFs though, so I'd have to convert each page of the PDF.
  • On a semi-related note, I found a very neat post about encoding and reading 2D matrix barcodes in ColdFusion.

    http://www.stillnetstudios.com/2007/12/15/2d-barcodes-coldfusion/

    This might solve some of my issues in needing to extract encoded information, but I am still after the body of the text.

    Regarding Tesseract, I also found a .NET version, tessnet2: http://www.pixel-technology.com/freeware/tessnet2/ Now, if only I could feed in PDFs natively instead of TIFFs... :)
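
The TIFF limitation mentioned above suggests a two-step pipeline: render each PDF page to a TIFF, then run Tesseract on every page image. Below is a minimal Java sketch of that idea, callable from CF like any other Java class. It is not from the thread: the use of Ghostscript (`gs`) as the rasterizer, the `tiffgray` device, the 300 dpi resolution, the class name `PdfOcrSketch`, and the file names are all assumptions chosen for illustration. The only Tesseract usage relied on is the basic `tesseract <image> <outputbase>` form, which writes `<outputbase>.txt`.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: builds the external commands for a PDF -> TIFF -> text pipeline.
class PdfOcrSketch {

    // Ghostscript command that renders every page of the PDF to its own TIFF.
    // Assumes `gs` is on the PATH; Ghostscript expands %03d to the page number.
    static List<String> rasterizeCommand(Path pdf, Path outDir) {
        return List.of(
            "gs", "-dNOPAUSE", "-dBATCH",
            "-sDEVICE=tiffgray", "-r300",
            "-sOutputFile=" + outDir.resolve("page-%03d.tif"),
            pdf.toString());
    }

    // Tesseract command for one page image; output goes to <outputbase>.txt.
    static List<String> tesseractCommand(Path tiff) {
        String name = tiff.toString();
        int dot = name.lastIndexOf('.');
        String outputBase = dot < 0 ? name : name.substring(0, dot);
        return List.of("tesseract", name, outputBase);
    }

    // Runs a command and returns its exit code (0 normally means success).
    static int run(List<String> command) throws IOException, InterruptedException {
        return new ProcessBuilder(command).inheritIO().start().waitFor();
    }

    public static void main(String[] args) {
        // Only prints the commands; call run(...) to actually execute them.
        System.out.println(rasterizeCommand(Path.of("scan.pdf"), Path.of("out")));
        System.out.println(tesseractCommand(Path.of("out", "page-001.tif")));
    }
}
```

The resulting .txt files could then be indexed by Verity like any other text document. From CF, the same two commands could also be issued directly with cfexecute instead of going through Java.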
