The Problem with OCR

Rob Hyndman at the blog left this comment on my post about my favorite technology tool, Copernic. Rob notes:

      "Copernic has changed everything for me, but I've found it to be especially powerful when used in conjunction with OCR'd documents. I don't bother cleaning them up - I just drop pdfs into a watch folder and then file the Word files that are produced. The OCR is so accurate in the latest gen of software that that is all it takes. Puts Copernic on steroids ..."

Rob makes a great point. OCR documents are completely searchable. My problem has been that OCR document sizes are so large that I have difficulty transmitting them to clients, experts and others. I use LeapFile for large file transfers, but the upload and download time is still significant. Perhaps I am saving my OCR files at the wrong dpi setting. I wonder if anyone else has this problem?


Rob Hyndman

I go with 1-bit, 300 dpi, saved as a PDF of the raw scan output, and then the OCR produces a file in native MS Word format. File sizes are smaller if you tell the OCR software not to save images when it does the scan.

Sample work product:

75 pages of paper produces a 4.2 MB PDF and then a Word doc of under 100 KB.
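
For anyone wondering why the bit depth and dpi settings matter so much for file size, here is a rough back-of-the-envelope sketch of the arithmetic behind those numbers. It assumes US Letter pages (8.5 x 11 inches) and reads "4.2M" as 4.2 million bytes; neither is stated in the comment, and the function name is just for illustration.

```python
# Rough scan-size arithmetic. Assumed, not from the thread: US Letter pages
# (8.5 x 11 in) and "4.2M" read as 4.2 million bytes.

def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of one scanned page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8

PAGES = 75
PDF_BYTES = 4.2e6  # reported PDF size, assumed to mean 4.2 million bytes

page_1bit = raw_scan_bytes(8.5, 11, 300, 1)    # black and white
page_gray = raw_scan_bytes(8.5, 11, 300, 8)    # 8-bit grayscale
page_color = raw_scan_bytes(8.5, 11, 300, 24)  # 24-bit color
page_600 = raw_scan_bytes(8.5, 11, 600, 1)     # black and white at 600 dpi

raw_total = PAGES * page_1bit  # all 75 pages, before any compression

print(f"1-bit page at 300 dpi:  {page_1bit / 1e6:.2f} MB uncompressed")
print(f"Gray page at 300 dpi:   {page_gray / 1e6:.2f} MB uncompressed")
print(f"Color page at 300 dpi:  {page_color / 1e6:.2f} MB uncompressed")
print(f"1-bit page at 600 dpi:  {page_600 / 1e6:.2f} MB uncompressed")
print(f"{PAGES} pages, 1-bit, 300 dpi: {raw_total / 1e6:.1f} MB uncompressed")
print(f"Implied compression in the PDF: about {raw_total / PDF_BYTES:.0f}x")
```

The takeaway is that a color scan carries 24 times the raw data of a 1-bit scan, and doubling the dpi quadruples it, so the scanner settings likely matter more to the final size than the OCR step does.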


Thanks Rob. That makes sense. I'll give that a try.

John Feeney

90% of the time 300 dpi is going to be plenty. OCR software has come a long way in 9 months. The key is "searchable" documents. Manufacturers are listening to their market sectors. In truth, as more people like yourselves keep up with current trends, manufacturers will trip over each other to grab market share.

