Friday, October 03, 2008

Google/Ancestry.com followup: Using outsourced Chinese labor to overcome OCR limits

(Update: See my Ancestry.com review) Earlier this week I posted a series of blog posts related to my own genealogical research, as well as a "what if" scenario involving Google and public records. The question I considered: What would it mean if Google or some other company digitized and indexed millions of public records, ranging from census forms to police reports, and made them free to search and view?

I got an answer from Google about this, and also had a nice talk with the CEO of The Generations Network, which owns Ancestry.com. Google declined to discuss its plans, but the Ancestry.com CEO brought up a very interesting point: Unlike printed text, which can be scanned and converted to digital text using OCR software, handwriting is very difficult for computers to read. That difficulty is part of the reason captchas (the distorted, curvy letters that people must retype before submitting an online form) are effective at frustrating spam bots: computers struggle to read them. And if computers struggle with captchas, you can imagine how tough it would be for them to decipher spidery mid-19th-century cursive written by hundreds of thousands of different hands.

So how did Ancestry.com build its database of all census forms from 1790 through 1930? It used humans to read the forms and enter the data. These weren't Ancestry's employees, but rather the staff of companies in China who could (I assume) do the work for less than American labor would cost. Naturally, that raises the question of whether the results are accurate enough, but the CEO said the transcriptions were handled by staff who were trained to read these old writing styles. The following example from an 1860 census form is from Ancestry.com:


It's worth noting that many Chinese readers are likely to be more sensitive to variations in writing styles, thanks to the Chinese character system and the calligraphy tradition, which is still widespread there. Reading Chinese requires discerning slight variations in strokes as well as in writing styles, which can vary greatly from person to person.

    (Update: Since writing this post, I have launched a company dedicated to helping people understand complicated technologies and concepts. Besides creating online posts that address questions such as What Is Dropbox and What Is Google Drive, I have also published a series of guides under the In 30 Minutes brand.)
