Double Rekeying & OCR: online archives and their approaches to digitization

Online archives now give academics everywhere access to thousands of primary sources. By cutting down the cost and time of research trips, they play an important role in modern historical research. The British History Online website currently offers 1,200 volumes for visitors to discover, and that number is growing.[1] With sources ranging from ancient history to the 20th century, and covering topics from religion to the economy, the website offers a free and easy service to inquiring historians. Similarly, the Old Bailey online archive features almost 200,000 trials from 1674 to 1913 for historians to peruse at the click of a button. What methods do these archives use to digitize thousands of sources, and how effective are their approaches?

Both websites have utilised the ‘double rekeying’ method. Two typists each transcribe the source manually, the two transcriptions are compared, and any discrepancies are corrected by hand. Whilst double rekeying is arguably the most effective method of digitization, it comes with its own problems. Any project has to cover the costs of the typists, and when you’re dealing with as many as 200,000 sources (as with the Old Bailey archive), time is certainly an issue. Despite the cost and time involved, double rekeying remains an effective method of digitization, offering a ‘99.995%’ accuracy rate.[2] However, the Old Bailey website outlines the limitations of this method. When dealing with handwritten sources from the seventeenth and eighteenth centuries, human error is almost inevitable. To combat this, the website offers users the original scanned document, which gives academics wanting to utilise the sources a way to check the transcription. Although the double rekeying method is not perfect, it seems to be the best option available to historians wanting to digitize sources.
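The comparison step of double rekeying can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the workflow either archive actually uses: the function name `find_disagreements` and the two sample sentences are made up, and the comparison is done word by word with the standard library's `difflib`.

```python
import difflib


def find_disagreements(first_typist: str, second_typist: str):
    """Return pairs of word runs where two transcriptions differ.

    Hypothetical helper: each pair holds the words typed by the first
    typist and the words typed by the second at the same position.
    """
    a = first_typist.split()
    b = second_typist.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    disagreements = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # The typists disagree here, so a person must check the scan.
            disagreements.append((a[i1:i2], b[j1:j2]))
    return disagreements


# Invented sample transcriptions of the same line.
typist_a = "The prisoner was indicted for stealing a silver watch"
typist_b = "The prisoner was indicted for stealing a silver match"

for words_a, words_b in find_disagreements(typist_a, typist_b):
    print("Check against the scan:", words_a, "vs", words_b)
```

The point of the design is that only the places where the two typists disagree need a human decision. An error slips through only if both typists make the same mistake in the same place.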

Both websites have also utilised OCR in creating these online archives. Optical Character Recognition, or ‘OCR’, is less reliable than double rekeying. It relies on software rather than human transcription: an optical scanner captures each document and the software converts the image into text. OCR could create more problems for a project than double rekeying, as its margin of error is much greater.

The picture below provides an example of errors made by OCR systems.

[Screenshot: examples of errors in OCR output]

The first example shows that the slightest variation in the writing causes errors in these systems. Words like ‘recommendations’ become ‘RecDmENASTIONS’. With just this example, the problems with OCR become clear. The inaccuracy makes searching for keywords difficult, and without manual corrections OCR can be ineffective. Correcting mistakes manually does not need to be a financial burden on a digitization project. Crowdsourcing is always an option, as with the Bentham project at University College London.
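To see why uncorrected OCR makes keyword searching difficult, here is a minimal Python sketch. It is my own illustration rather than anything either archive does: the sample sentence is invented around the ‘RecDmENASTIONS’ error, and `fuzzy_find` is a hypothetical helper built on the standard library's `difflib.get_close_matches`, with a similarity cutoff of 0.75 chosen arbitrarily.

```python
import difflib


def fuzzy_find(keyword: str, ocr_text: str, cutoff: float = 0.75):
    """Return words in the OCR text that closely resemble the keyword.

    Hypothetical helper: the cutoff is an arbitrary similarity threshold.
    """
    words = [w.strip(".,;:!?'\"").lower() for w in ocr_text.split()]
    return difflib.get_close_matches(keyword.lower(), words, n=5, cutoff=cutoff)


# Invented sentence containing the OCR error from the example above.
ocr_text = "The committee made several RecDmENASTIONS to the court."

# An exact keyword search misses the garbled word entirely.
print("recommendations" in ocr_text.lower())

# An approximate search can still recover it.
print(fuzzy_find("recommendations", ocr_text))
```

Approximate matching of this kind can soften the problem, but it is no substitute for correction: a loose threshold also returns false matches, and badly garbled words fall below any sensible cutoff.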

Whilst neither approach gives a digitization project complete accuracy, there are significant advantages to utilising OCR and double rekeying. As a historian based in England wanting to research American history, I rely on online archives as the foundation of my research. Both the British History Online website and the Old Bailey website provide a valuable resource to historians interested in their respective fields. Whilst the approaches themselves may be problematic, using both OCR and double rekeying ensures a greater level of trust in the sources on these websites. If either website used only OCR and did not have a team of highly trained academics ensuring the credibility of the site, historians would be well within their rights to be wary. However, the approaches adopted by these websites offer a quick and easy way of making primary sources available to all online, especially in comparison with manual transcription alone. Although vast improvements could be made to OCR software, and double rekeying is time-consuming and expensive, the pairing of these two methods has made the Old Bailey and British History Online websites a worthwhile source, providing historians with accurate transcriptions of primary sources.

[1] Homepage, British History Online, http://www.british-history.ac.uk/; consulted 28 April 2015

[2] ‘About British History Online’, British History Online, http://www.british-history.ac.uk/about; consulted 28 April 2015
