[APG Public List] OCR and names in scanned documents -- was: Indexing of McMullin/Mc Mullen as M'Mullin
stephen at stephendanko.com
Mon May 17 15:21:16 MDT 2010
All this discussion of OCR reminded me of my one foray into OCR several years ago.
Back in December 2005, my good friend Sister Carol Anne O'Marie was just finishing up the final edits on her last novel "Murder at the Monks' Table", a story of how two nuns from a convent in San Francisco took a trip to Ireland, found a dead body in the loo of a local Irish pub, and set out to find the murderer. The final manuscript was due to her editor by the end of the year and Carol Anne was well on her way to meeting that deadline.
As Carol Anne went to save her final edits, she hit a button on the keyboard and the screen on her computer went blank. She looked for the electronic copy of her manuscript but was unable to find it. Carol Anne called her friends for help, some of whom were computer professionals. No one could locate the electronic copy of her manuscript.
Fortunately, Carol Anne had printed out a hard copy of the manuscript before losing the electronic copy, but since Carol Anne suffered from Parkinson's disease, the thought of retyping the entire manuscript was heartbreaking to her. At that point she called me and asked if I could help her by retyping the manuscript into Microsoft Word. I had previously offered to help her with the typing since I knew she had some difficulty with that task.
Instead of typing the entire manuscript (several hundred pages), I decided to scan the document using a photocopier/scanner with an automatic document feeder. I then used the Microsoft Office Document Imaging application that comes with Microsoft Office to convert the scanned images to text via OCR. Although I no longer have the original images, I recall that the accuracy of converting the images to editable text was much better than 99%, and that includes proper names such as Oonaugh Cox and Liam O'Dea, as well as words found in the dictionary.
I had to reformat the entire document, however, since all formatting was lost in the image to text conversion. Fortunately, this gave me the opportunity to read Sister Carol Anne's latest novel before it was published! As I read the manuscript, reformatting where necessary, I also proofread the text for spelling errors introduced by the OCR process.
As I mentioned, the accuracy of converting the images to text was much better than 99%. One significant exception, however, was one particular name - the name of a priest - Father Keane. The OCR software was able to correctly render the surname "Keane" but erred on the word "Father". The OCR software consistently rendered the word "Father" as "Fathead", so that one sentence read "Fathead Keane," Eileen said, "I'd like you to meet my friend, Sister Mary Helen. ... ".
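A consistent misreading like this one is the easiest kind of OCR error to clean up, because the same wrong word can be replaced everywhere in one pass. What follows is only a minimal sketch in Python of that idea, not the method used on the manuscript, which was proofread by reading it through. The names `CORRECTIONS` and `fix_systematic_errors` are made up for illustration, and the correction table would have to be built by hand from the errors actually found.

```python
import re

# Hypothetical correction table: OCR misreading -> intended word.
CORRECTIONS = {"Fathead": "Father"}

def fix_systematic_errors(text, corrections=CORRECTIONS):
    """Replace whole-word OCR misreadings and count the fixes per word."""
    counts = {}
    for wrong, right in corrections.items():
        # \b keeps the match to whole words, so "Fatheads" is left alone.
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b")
        text, n = pattern.subn(right, text)
        counts[wrong] = n
    return text, counts

sample = '"Fathead Keane," Eileen said, "I\'d like you to meet my friend."'
fixed, counts = fix_systematic_errors(sample)
print(fixed)
print(counts)
```

The per-word counts are worth a glance before accepting the result: a count far higher than expected suggests the "wrong" word also appears legitimately somewhere in the text.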
Sister Carol Anne was delighted to receive a new electronic copy of her manuscript just three days after she sent me the paper copy of the manuscript and, when she heard the story of how the OCR software turned "Father Keane" into "Fathead Keane", she and the other members of her convent had a good laugh!
"Murder at the Monks' Table" appeared in print on June 26, 2007, but it was to be the last novel Sister Carol Anne O'Marie published. She returned to her Maker one year ago on May 27, 2009.
Stephen J. Danko, PhD, PLCGS
From: Meredith Hoffman / GenerationsWeb <mhoffman at generationsweb.com>
To: APG APG Public <apgpubliclist at apgen.org>
Sent: Mon, May 17, 2010 8:41:07 AM
Subject: Re: [APG Public List] OCR and names in scanned documents -- was: Indexing of McMullin/Mc Mullen as M'Mullin
Jacqueline, I think this would be a good discussion to have. I have some techie-geek experience with OCR because I wrote user guides and technical white papers for a start-up OCR company a while back. I also spent some time evaluating OCR strategies for another client who was trying to decide how best to recover the text from some large technical documents that they no longer had digital files for (what were the cost/accuracy trade-offs between re-keyboarding the text, redoing the drawings, and having them edited vs. scanning, OCRing, post-processing, and editing, etc.). I've continued to be interested in the issue of OCR for the reasons that you mention and that started this discussion: (a) having problems finding names in online scanned/OCR'd documents and (b) finding the best way to scan/OCR my own documents.