[APG Public List] OCR and names in scanned documents -- was: Indexing of McMullin/Mc Mullen as M'Mullin

Jacqueline Wilson jawgen at comcast.net
Sat May 15 15:54:54 MDT 2010


Meredith,  Thank you for this enlightening explanation of the problems
of using OCR.  As I plan to scan a lot of my papers, both gen and
non-gen, I have been wondering what to use.  So far I have been
converting to either JPEG or PDF, depending on what the document is.  I
have been debating whether to upgrade the OCR software that came with my
scanner.  I would like to see a discussion of the best OCR software if
others are interested.

I also agree that if OCR is used to scan documents for the web, then
searching that text definitely is not exhaustive, but it is a great
starting point.


Jacqueline Wilson
Evanston, IL
jawgen at comcast.net

Deputy Sheriff for Publications of the Chicago Corral of the Westerners
Professional Indexer, Historian, and Genealogist
"Wilssearch - your service of choice for the indexing challenged  
genealogist."




On May 14, 2010, at 4:10 PM, Meredith Hoffman / GenerationsWeb wrote:


OCR algorithms check their potential output against built-in  
dictionaries, and then do a best-guess template match against the  
known words in the dictionary to eliminate non-words, but this fails  
completely when the OCR engine encounters a name, because there's  
nothing in the dictionary to disambiguate those misreadings. For  
example, if the input word isn't clear or is run together, the OCR engine
may "read" the word "inner" as [imer] or [imen] or [innen] or even
[mner], but it knows those are not English words, and it also has some
probabilistic algorithms and heuristics that mean that 99% or so of
the time it'll correct its input reading and output the word correctly  
as "inner." But when it encounters the place name Innerton, for  
example, for starters it may not even recognize that it starts with a  
capital letter, and it doesn't "know" that it's a name, and it might  
"read" it as [Imerton] or [imerton] or ... or even [bnertor] or [Imen  
ton], and, depending on how the algorithms are set for dealing with  
unknown "words" it might try to output it as "insertion" or give up  
and just output it as a string of letters. Most OCR engines flag those  
strings that were "uncertain" so that they can be handled with some  
post-processing.
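
The dictionary check described above can be sketched in a few lines of
Python. This is only a toy illustration, not how any particular OCR
engine works: the six-word DICTIONARY, the correct() helper, and the
cutoff values are all assumptions made up for the example, and difflib's
similarity ratio stands in for whatever template matching a real engine
uses.

```python
import difflib

# Hypothetical stand-in dictionary; a real engine ships a far larger one.
DICTIONARY = ["inner", "insertion", "dinner", "winner", "letter", "name"]


def correct(raw, cutoff=0.6):
    """Return (output_word, uncertain) for one raw OCR reading.

    A reading found in the dictionary is accepted as-is. Otherwise the
    closest dictionary word is substituted if its similarity reaches
    `cutoff`. If nothing is close enough, the raw string is passed
    through unchanged and flagged as uncertain.
    """
    word = raw.lower()
    if word in DICTIONARY:
        return word, False
    matches = difflib.get_close_matches(word, DICTIONARY, n=1, cutoff=cutoff)
    if matches:
        # Best guess wins, even when the input was really a name.
        return matches[0], False
    return raw, True


if __name__ == "__main__":
    for reading in ["inner", "imer", "Imerton"]:
        print(reading, "->", correct(reading))
    # A stricter setting gives up on the name instead of "fixing" it.
    print("Imerton", "->", correct("Imerton", cutoff=0.8))
```

With this toy dictionary, correct("imer") recovers "inner", while
correct("Imerton") is confidently turned into "insertion" at the default
cutoff. Raising the cutoff to 0.8 leaves the name unchanged and flags it
as uncertain, which is the kind of string that post-processing would
then have to handle.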



