[APG Public List] OCR and names in scanned documents -- was: Indexing of McMullin/Mc Mullen as M'Mullin
Meredith Hoffman / GenerationsWeb
mhoffman at generationsweb.com
Mon May 17 09:41:07 MDT 2010
Jacqueline, I think this would be a good discussion to have. I have
some techie-geek experience with OCR because I wrote user guides and
technical white papers for a start-up OCR company a while back, and I
also spent some time evaluating OCR strategies for another client who
was trying to decide how best to recover the text from some large
technical documents that they no longer had digital files for (what
were the cost/accuracy trade-offs between re-keyboarding the text and
redoing the drawings and having them edited vs. scanning and OCRing
and post-processing and editing, etc etc...). I've continued to be
interested in the issue of OCR for the reasons that you mention and
that started this discussion: (a) having problems finding names in
online scanned/OCR'd documents and (b) finding the best way to scan/
OCR my own documents.
Your question sent me to google to see the latest news about scanning
accuracy, and things haven't changed much since I last immersed myself
in this technology. When you google "accuracy of OCR" you can get some
idea of the mass of information out there about this issue!
I don't have time _right now_ to summarize the information as it
specifically relates to genealogy, but I think you've given me a great
idea for an article! I should be able to spend some time researching
and writing something up within the next couple of weeks.
Here are a couple of highlights that relate to the issue:
The best OCR engines, working with *new* *good quality* *clean*
*clearly printed* documents in Latin script (let's stick to English
for now), can have accuracy rates as high as 99%. As far as I know,
there's no system out there today that can do better than that; 100%
accuracy can only be achieved with (literate) human post-processing.
The GPO (Government Printing Office) has a minimum standard of 99%
accuracy for digital conversion of documents.
Think about the fact that even at 99% accuracy, on a 300-word page,
for instance, assuming an average word-length of 6 letters, you have
1800 characters; a 1% character error rate means up to 18 misread
characters -- potentially 18 mis-identified words -- and even if there
is a dictionary to do automatic post-processing for "real words," the
names are still going to be the major source of errors.
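To make that arithmetic concrete, here's a quick back-of-the-envelope calculation; the page size, word length, and accuracy figures are just the ones from the example above, not measurements:

```python
# Rough estimate of how many words on a page may be affected by
# character-level OCR errors, using the example figures:
# a 300-word page with an average word length of 6 letters.

def misread_words(words_per_page=300, avg_word_len=6, char_accuracy=0.99):
    """Worst case, assuming each misread character lands in a different word."""
    total_chars = words_per_page * avg_word_len        # 1800 characters
    char_errors = total_chars * (1 - char_accuracy)    # ~18 misread characters
    return min(round(char_errors), words_per_page)

print(misread_words())                     # 18: up to 18 suspect words at 99%
print(misread_words(char_accuracy=0.75))   # 300: every word on the page is suspect
```

At 75% character accuracy the worst case saturates -- there are more misread characters than words, so no word on the page can be trusted without checking.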
For older documents, lower quality documents, etc. etc., the accuracy
rates go down, sometimes to around 75% or less. (And, as I explained
in my prior email, the problem with _names_ can lower this rate even
further.)
Nobody is expecting to get to 100% accuracy with _raw_ scanning; and
the cost of any kind of post-processing, for the retail sites that are
converting old documents, is prohibitive.
So in terms of using online commercial databases, we're always going
to have to live with the knowledge that we're not going to be picking
up all/many/some/... of the names we're looking for.
However, for the individual genealogist scanning and OCR'ing for
personal or client use, the fact that you can do your own post-
processing -- and with some OCR software, you can tweak the input
variables to make it easier to find the errors and to change the kinds
of errors it's liable to make -- means that you can compensate for and
correct the errors. But, just like with the commercial sites, you have
to find the cost-benefit tradeoff....
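As one illustration of the kind of do-it-yourself post-processing I mean, here's a minimal sketch of searching your own OCR'd text with approximate (fuzzy) matching rather than exact matching, so that mangled spellings of a surname still turn up. The sample sentence, the surname, and the 0.75 threshold are made up for the example:

```python
# A minimal sketch: fuzzy search over OCR'd text so that variant or
# misread spellings of a name still match. Threshold and sample text
# are illustrative only.
from difflib import SequenceMatcher

def fuzzy_find(term, text, threshold=0.75):
    """Return (word, score) pairs from `text` that approximately match `term`."""
    hits = []
    for word in text.split():
        cleaned = word.strip('.,;:"\'')  # drop surrounding punctuation
        score = SequenceMatcher(None, term.lower(), cleaned.lower()).ratio()
        if score >= threshold:
            hits.append((cleaned, round(score, 2)))
    return hits

ocr_text = ("The estate of John M'Mullin, late of Innerton, "
            "was settled by Jas. McMullen.")
print(fuzzy_find("McMullin", ocr_text))  # finds both M'Mullin and McMullen
```

An exact search for "McMullin" in that sample would find nothing at all, which is exactly the problem that started this thread.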
If there's any interest in my doing a bit more research and writing up
my conclusions in terms of specific OCR software packages, I'd be
interested in doing this.
Meanwhile, I hope that this information helps a bit more.
Meredith Hoffman / GenerationsWeb
On May 15, 2010, at 5:54 PM, Jacqueline Wilson wrote:
> Meredith, Thank you for this enlightening explanation of the
> problems of using OCR. As I plan to scan a lot of my papers - both
> gen and non-gen - I have been wondering what to use. So far I have
> been converting to either Jpeg or PDF depending on what the document
> is. I have been debating about upgrading the OCR software that came
> with my scanner. I would like to see a discussion on the best OCR
> software if others are interested.
> I also agree that if OCR is used to scan documents for the web, then
> it definitely is not an exhaustive search - but it is a great
> starting point.
> Jacqueline Wilson
> Evanston, IL
> jawgen at comcast.net
> Deputy Sheriff for Publications of the Chicago Corral of the
> Professional Indexer, Historian, and Genealogist
> "Wilssearch - your service of choice for the indexing challenged"
> On May 14, 2010, at 4:10 PM, Meredith Hoffman / GenerationsWeb wrote:
> OCR algorithms check their potential output against built-in
> dictionaries, and then do a best-guess template match against the
> known words in the dictionary to eliminate non-words, but this fails
> completely when the OCR engine encounters a name, because there's
> nothing in the dictionary to disambiguate those misreadings. For
> example, if the input word isn't clear or run together, the OCR
> engine may "read" the word "inner" as [imer] or [imen] or [innen] or
> even [mner], but it knows those are not English words, and it also
> has some probabilistic algorithms and heuristics that mean that 99%
> or so of the time it'll correct its input reading and output the
> word correctly as "inner." But when it encounters the place name
> Innerton, for example, for starters it may not even recognize that
> it starts with a capital letter, and it doesn't "know" that it's a
> name, and it might "read" it as [Imerton] or [imerton] or ... or
> even [bnertor] or [Imen ton], and, depending on how the algorithms
> are set for dealing with unknown "words" it might try to output it
> as "insertion" or give up and just output it as a string of letters.
> Most OCR engines flag those strings that were "uncertain" so that
> they can be handled with some post-processing.
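The dictionary-matching behavior described in that earlier message can be caricatured in a few lines. The four-word dictionary and the 0.6 cutoff are toy choices -- real engines use far richer word lists and probabilistic models -- but the sketch shows why the approach self-corrects on ordinary words and breaks on names:

```python
# Toy illustration of dictionary-based post-correction: a garbled raw
# reading is replaced by the closest known word, or flagged as
# uncertain when nothing comes close. Tiny stand-in word list.
from difflib import get_close_matches

DICTIONARY = ["inner", "winner", "dinner", "insertion"]

def post_correct(raw_reading):
    """Snap a raw OCR reading to the nearest dictionary word, else flag it."""
    match = get_close_matches(raw_reading.lower(), DICTIONARY, n=1, cutoff=0.6)
    return match[0] if match else f"[?{raw_reading}?]"

print(post_correct("imer"))      # misread of "inner" -> corrected to "inner"
print(post_correct("Imerton"))   # place name -> wrongly "corrected" to "insertion"
print(post_correct("M'Mullin"))  # nothing close -> flagged "[?M'Mullin?]"
```

Note that the second case reproduces exactly the failure described above: the unknown place name gets silently rewritten as a real word, while the surname is at least flagged for human review.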