[APG Public List] OCR and names in scanned documents -- was: Indexing of McMullin/Mc Mullen as M'Mullin

Meredith Hoffman / GenerationsWeb mhoffman at generationsweb.com
Fri May 14 15:10:19 MDT 2010


The fact that they do OCR on the text also means that a percentage of  
the scanned text is just plain misread, especially when the source  
documents contain relatively "low-resolution" typography such as is  
found in early newspapers. The OCR process is particularly susceptible  
to misreading when what is being scanned and interpreted is _names_ as  
opposed to _dictionary words_; that creates a particular problem for  
us genealogists, since mostly what we're looking for in documents is  
names.

OCR algorithms check their potential output against built-in  
dictionaries, and then do a best-guess template match against the  
known words in the dictionary to eliminate non-words, but this fails  
completely when the OCR engine encounters a name, because there's  
nothing in the dictionary to disambiguate those misreadings. For  
example, if the input word isn't clear or its letters run together, the
OCR engine may "read" the word "inner" as [imer] or [imen] or [innen] or
even [mner], but it knows those are not English words, and it also has
some probabilistic algorithms and heuristics that mean that 99% or so of
the time it'll correct its input reading and output the word correctly
as "inner." But when it encounters the place name Innerton, for  
example, for starters it may not even recognize that it starts with a  
capital letter, and it doesn't "know" that it's a name, and it might  
"read" it as [Imerton] or [imerton] or ... or even [bnertor] or [Imen  
ton], and, depending on how the algorithms are set for dealing with  
unknown "words" it might try to output it as "insertion" or give up  
and just output it as a string of letters. Most OCR engines flag those  
strings that were "uncertain" so that they can be handled with some  
post-processing.
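
To make the dictionary step concrete, here is a minimal sketch in Python.
It is not how any particular OCR engine is implemented: the tiny
DICTIONARY, the resolve() function, and the 0.6 cutoff are all invented
for illustration, and the standard-library difflib fuzzy matcher stands in
for whatever template matching a real engine uses.

```python
import difflib

# Hypothetical four-word "built-in dictionary"; a real engine's is far larger.
DICTIONARY = ["inner", "insertion", "town", "street"]

def resolve(reading, cutoff=0.6):
    """Return (output, uncertain) for one raw OCR reading.

    A reading found in the dictionary is accepted as-is. Otherwise the
    closest dictionary word at or above `cutoff` similarity replaces it.
    If nothing is close enough, the raw string is passed through and
    flagged as uncertain for later post-processing.
    """
    word = reading.lower()
    if word in DICTIONARY:
        return word, False
    matches = difflib.get_close_matches(word, DICTIONARY, n=1, cutoff=cutoff)
    if matches:
        return matches[0], False
    return reading, True

# A misread dictionary word is pulled back to the right word.
print(resolve("imer"))                 # ('inner', False)

# A misread place name is "corrected" into a wrong dictionary word ...
print(resolve("Imerton"))              # ('insertion', False)

# ... or, with a stricter cutoff, given up on and flagged.
print(resolve("Imerton", cutoff=0.8))  # ('Imerton', True)
```

Either way the name Innerton never reaches the index: it is silently
replaced by a dictionary word or left as a garbled string.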

In order to get really good OCR output, you have to have some
post-processing, where an automated grammar analyzer, a human analyst,
or both look at the words that have been marked internally as
"uncertain" and fix them up. I have no idea whether any of the
directory and newspaper sites do any post-processing at all, and I'm  
pretty sure that none of them do anything very extensive. I do know  
that some seem to do a better job than others (although I haven't  
systematically looked at which are which).
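
A post-processing pass of that kind might look roughly like the sketch
below. It assumes, hypothetically, that the engine reports each word with
a confidence score; the Token type, the scores, and the 0.85 threshold are
made up for the example and are not taken from any real OCR product or
from any of the sites mentioned.

```python
from dataclasses import dataclass

@dataclass
class Token:
    text: str          # the string the OCR engine output
    confidence: float  # hypothetical engine score, 0.0 to 1.0
    line: int          # line number on the scanned page

def review_queue(tokens, threshold=0.85):
    """Return the tokens scoring below `threshold`, least confident first."""
    flagged = [t for t in tokens if t.confidence < threshold]
    return sorted(flagged, key=lambda t: t.confidence)

# Invented sample page: two clean words and two garbled names.
page = [
    Token("Matrimonial", 0.97, 1),
    Token("Imerton", 0.41, 2),
    Token("bnertor", 0.22, 3),
    Token("inner", 0.93, 3),
]

# A human analyst (or a second automated pass) would work through this list.
for token in review_queue(page):
    print(f"line {token.line}: {token.text!r} (confidence {token.confidence:.2f})")
```

The design choice that matters is the threshold: set it too low and
garbled names slip into the index unreviewed; set it too high and the
review queue becomes most of the document.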

Because of this, it's a good bet that at least some of the names in  
just about any scanned and OCR'ed document are going to be slightly to  
significantly garbled, often even beyond being recognized as names. So  
I routinely assume that a name I'm searching for may well be in a  
document set, but I won't find it -- or all of the instances of it --  
in the index.

Trying to figure out the possible renderings works to some extent, but
can't fully compensate for this phenomenon, since a badly garbled name
won't match any rendering I'd think to try.
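
For the renderings that are predictable, a short script can at least
generate the list of search terms. The sketch below is only an
illustration: the prefix list and the CONFUSIONS table are my own guesses
at common patterns, not a catalogue of what any OCR engine actually does,
and the function names are invented.

```python
# Ways a Mc/Mac surname prefix may be printed or indexed (assumed list).
PREFIXES = ["Mc", "Mc ", "Mac", "M'", "M\u2019"]

# Letter groups that might be confused with one another (illustrative only).
CONFUSIONS = [("nn", "m"), ("rn", "m"), ("m", "rn"), ("in", "m"),
              ("r", "n"), ("l", "1")]

def prefix_variants(surname):
    """Return the surname with each alternative Mc-style prefix."""
    # Check longer prefixes first so "Mc " is not mistaken for "Mc".
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        stem = surname[len(prefix):]
        # Require a capitalized stem so names like "Macy" are left alone.
        if surname.startswith(prefix) and stem[:1].isupper():
            return {p + stem for p in PREFIXES}
    return {surname}

def ocr_variants(name):
    """Return the name plus every single-substitution misreading."""
    variants = {name}
    for good, bad in CONFUSIONS:
        start = name.find(good)
        while start != -1:
            variants.add(name[:start] + bad + name[start + len(good):])
            start = name.find(good, start + 1)
    return variants

def search_terms(name):
    """Combine prefix variants with one-step OCR misreadings, sorted."""
    terms = set()
    for variant in prefix_variants(name):
        terms |= ocr_variants(variant)
    return sorted(terms)

for term in search_terms("McMullin"):
    print(term)
```

This only reaches misreadings one substitution away from a spelling I
already thought of; something like [bnertor] for Innerton is out of
reach, which is why the list is a help and not a cure.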

The bottom line is that my reasonably exhaustive search may be the  
best I can do short of going to the local library, or getting the  
microfilm/fiche, and reading through the documents myself, but I  
shouldn't believe that I've actually exhausted the search.

And when I write a report, I would cite those scanned and OCR'ed
documents as my source, not the brick-and-mortar or filmed originals,
and be especially careful about making claims about negative searches.

--Meredith

Meredith Hoffman / GenerationsWeb
Plymouth, MA
http://tinyurl.com/genweb-apg
http://consultant.generationsweb.com/

On 2010May14, at 12:43 PM, Michael John Neill wrote:

> Keep in mind that sites like genealogybank do not "index" by reading  
> the actual text, they do OCR which means that searches should always  
> be conducted for all possible renderings. Even more of a challenge  
> when sites do not support Soundex or other flexible searching options.
>
> Michael
>
>
> On Fri, May 14, 2010 at 10:59 AM, Maria Hopper  
> <reetree at optonline.net> wrote:
> Matrimonial Notices: Philadelphia Repository, Philadelphia PA   
> (newspaper issue of 5 Feb 1803 p.47) on line at Genealogy Bank  
> "_____ on the 27th ult. by Rev. Mr. Milledoller of Philadelphia John  
> M’Mullin and Miss Maria Ord, both of Southwalk"
>
> Only found this by searching the bride's name; not indexed under
> McMullin, Mullin or M'Mullin. Also, almost all of the old city
> directories for Philadelphia have Mc entered as an apostrophe; the only
> way to find them that I know of is to browse (Browse Philadelphia City
> Directories in Footnote)
>
> Maria (Ree)  Hopper, CG*
> *CG or Certified Genealogist is a service mark of the Board for
> Certification of Genealogists, used under license by board-certified
> genealogists after periodic evaluation, and the board name is
> registered in the U.S. Patent Office.
> Visit  http://www.bcgcertification.org
>
> The Hopper Family Genealogy; The first six Generations
> http://www.reetree.com
> reetree at optonline.net
>
> -- 
> ------------------------------------
> Michael John Neill
> Weekly How-to Column Casefile Clues
> http://www.casefileclues.com
