[APG Public List] OCR and names in scanned documents -- was: Indexing of McMullin/Mc Mullen as M'Mullin

Meredith Hoffman / GenerationsWeb mhoffman at generationsweb.com
Mon May 17 09:41:07 MDT 2010


Jacqueline, I think this would be a good discussion to have. I have  
some techie-geek experience with OCR because I wrote user guides and  
technical white papers for a start-up OCR company a while back, and I  
also spent some time evaluating OCR strategies for another client who  
was trying to decide how best to recover the text from some large  
technical documents that they no longer had digital files for (what  
were the cost/accuracy trade-offs between re-keyboarding the text and  
redoing the drawings and having them edited vs. scanning and OCRing  
and post-processing and editing, etc etc...). I've continued to be  
interested in the issue of OCR for the reasons that you mention and  
that started this discussion: (a) having problems finding names in  
online scanned/OCR'd documents and (b) finding the best way to scan/ 
OCR my own documents.

Your question sent me to Google to see the latest news about scanning  
accuracy, and things haven't changed much since I last immersed myself  
in this technology. When you search for "accuracy of OCR" you can get  
some idea of the mass of information out there about this issue!

I don't have time _right now_ to summarize the information as it  
specifically relates to genealogy, but I think you've given me a great  
idea for an article! I should be able to spend some time researching  
and writing something up within the next couple of weeks.

Here are a couple of highlights that relate to the issue:

The best OCR engines, working with *new* *good quality* *clean*  
*clearly printed* documents in Latin script (let's stick to English  
for now), can have accuracy rates as high as 99%. As far as I know,  
there's no system out there today that can do better than that; 100%  
accuracy can only be achieved with (literate) human post-processing.

The GPO (Government Printing Office) has a minimum standard of 99%  
accuracy for digital conversion of documents.

Think about the fact that even at 99% accuracy, on a 300 word page,  
for instance, assuming an average word-length of 6 letters, you have  
1800 characters; a 1% error rate means about 18 mis-identified  
characters, which can spoil up to 18 different words -- and even if  
there is a dictionary to do automatic post-processing for "real  
words," the names are still going to be the major problem.
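
To make that arithmetic concrete, here is a rough sketch in Python. It assumes the 99% figure is a per-character accuracy and that misreads are independent of one another; both are simplifying assumptions of mine, the function name is made up for illustration, and spaces and punctuation are ignored.

```python
def page_error_estimate(words=300, letters_per_word=6, char_accuracy=0.99):
    """Rough estimate of OCR errors on one page.

    Assumes accuracy is measured per character and that each
    character is misread independently of the others.
    """
    chars = words * letters_per_word
    # Expected number of misread characters on the page.
    expected_char_errors = chars * (1 - char_accuracy)
    # A word comes out right only if every one of its letters does.
    word_ok = char_accuracy ** letters_per_word
    expected_bad_words = words * (1 - word_ok)
    return chars, expected_char_errors, expected_bad_words


chars, char_errors, bad_words = page_error_estimate()
print(f"{chars} characters, about {char_errors:.0f} misread characters")
print(f"about {bad_words:.1f} of 300 words contain at least one error")
```

Under those assumptions the expected number of damaged words comes out just under 18 per page, because a few words will contain more than one misread letter.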

For older documents, lower quality documents, etc. etc., the accuracy  
rates go down, sometimes to around 75% or less. (And, as I explained  
in my prior email, the problem with _names_ can lower this rate even  
more....)

Nobody is expecting to get to 100% accuracy with _raw_ scanning; and  
the cost of any kind of post-processing, for the retail sites that are  
converting old documents, is prohibitive.

So in terms of using online commercial databases, we're always going  
to have to live with the knowledge that we're not going to be picking  
up all/many/some/... of the names we're looking for.

However, for the individual genealogist scanning and OCR'ing for  
personal or client use, the fact that you can do your own post- 
processing -- and with some OCR software, you can tweak the input  
variables to make it easier to find the errors and to change the kinds  
of errors it's liable to make -- means that you can compensate for and  
correct the errors. But, just like with the commercial sites, you have  
to find the cost-benefit tradeoff....

If there's any interest in my doing a bit more research and writing up  
my conclusions in terms of specific OCR software packages, I'd be  
happy to do this.

Meanwhile, I hope that this information helps a bit more.

--Meredith

Meredith Hoffman / GenerationsWeb
Plymouth, MA
http://tinyurl.com/genweb-apg
http://consultant.generationsweb.com/

On 2010May15, at 5:54 PM, Jacqueline Wilson wrote:

> Meredith,  Thank you for this enlightening explanation of the  
> problems of using OCR.  As I plan to scan a lot of my papers - both  
> gen and non-gen - I have been wondering what to use.  So far I have  
> been converting to either Jpeg or PDF depending on what the document  
> is.  I have been debating about upgrading the OCR software that came  
> with my scanner.  I would like to see a discussion on the best OCR  
> software if others are interested.
>
> I also agree that if OCR is used to scan documents for the web, then  
> it definitely is not an exhaustive search - but it is a great  
> starting point.
>
>
> Jacqueline Wilson
> Evanston, IL
> jawgen at comcast.net
>
> Deputy Sheriff for Publications of the Chicago Corral of the  
> Westerners
> Professional Indexer, Historian, and Genealogist
> "Wilssearch - your service of choice for the indexing challenged  
> genealogist."
>
>
>
>
> On May 14, 2010, at 4:10 PM, Meredith Hoffman / GenerationsWeb wrote:
>
>
> OCR algorithms check their potential output against built-in  
> dictionaries, and then do a best-guess template match against the  
> known words in the dictionary to eliminate non-words, but this fails  
> completely when the OCR engine encounters a name, because there's  
> nothing in the dictionary to disambiguate those misreadings. For  
> example, if the input word isn't clear or is run together, the OCR  
> engine may "read" the word "inner" as [imer] or [imen] or [innen] or  
> even [mner], but it knows those are not English words, and it also  
> has some probabilistic algorithms and heuristics that mean that 99%  
> or so of the time it'll correct its input reading and output the  
> word correctly as "inner." But when it encounters the place name  
> Innerton, for example, for starters it may not even recognize that  
> it starts with a capital letter, and it doesn't "know" that it's a  
> name, and it might "read" it as [Imerton] or [imerton] or ... or  
> even [bnertor] or [Imen ton], and, depending on how the algorithms  
> are set for dealing with unknown "words" it might try to output it  
> as "insertion" or give up and just output it as a string of letters.  
> Most OCR engines flag those strings that were "uncertain" so that  
> they can be handled with some post-processing.
>
>
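
The dictionary behavior described in the quoted message can be imitated with a toy sketch in Python. This is not how any particular OCR engine works: the standard-library difflib matcher stands in for the engine's template matching, and the four-word dictionary, the 0.6 cutoff, and the function name are all invented for illustration.

```python
import difflib

# Tiny stand-in for an OCR engine's built-in dictionary (illustrative only).
DICTIONARY = ["inner", "insertion", "outer", "letter"]


def dictionary_correct(token, dictionary=DICTIONARY, cutoff=0.6):
    """Return (output, uncertain) for one raw OCR reading.

    A reading that is close enough to a dictionary word is snapped
    to that word; anything else is passed through unchanged and
    flagged as uncertain for later human post-processing.
    """
    lowered = token.lower()
    if lowered in dictionary:
        return lowered, False
    matches = difflib.get_close_matches(lowered, dictionary, n=1, cutoff=cutoff)
    if matches:
        return matches[0], False
    return token, True


for reading in ["imer", "Imerton", "Zbrqk"]:
    print(reading, "->", dictionary_correct(reading))
```

With this toy dictionary the misreading "imer" is pulled back to "inner", while "Imerton" (a misreading of the name Innerton) is confidently "corrected" to "insertion", which is exactly the failure with names. A string with no near match, such as "Zbrqk", is left alone and flagged as uncertain.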


