Optical Character Recognition (OCR) technology is frequently used in full-text digital projects. OCR allows the recognition of print text characters in the digital environment, so that the scanned image of a text doesn’t have to be rekeyed or transcribed to be searchable online. But this technology can be problematic for scholars using digital collections of materials from earlier centuries because of the sometimes poor quality of the print and the variety of fonts used by early printers. In these cases the fuzzy search option, which retrieves near matches to the word or phrase being searched, may be offered to counter the limits of the OCR’d text.
Fuzzy searching retrieves results that approximate the word or phrase entered. Typefaces and spelling were not standardized in earlier centuries, so with fuzzy searching the system looks for words that closely, but not exactly, match the search terms. For example, Early English Books Online (EEBO) provides a “Variant Spellings” check box; when it is enabled, a search for murder also finds murther, murdre, murdir, and mvrder.
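One common way to approximate a word is to count the single-character edits (insertions, deletions, substitutions) needed to turn one spelling into another. The sketch below illustrates that idea only; it is not a description of how EEBO or any other database implements its search. The function names, the word list, and the tolerance of two edits are assumptions chosen for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the insertions, deletions and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def fuzzy_match(query: str, words: list[str], max_distance: int = 2) -> list[str]:
    """Return the words within max_distance edits of the query (hypothetical helper)."""
    return [w for w in words if levenshtein(query, w) <= max_distance]


# Illustrative word list: the four variants named above plus two unrelated words.
index = ["murther", "murdre", "murdir", "mvrder", "mother", "market"]
print(fuzzy_match("murder", index))
```

With a tolerance of two edits, the four variant spellings are returned, while mother and market (three edits away) are not. Raising the tolerance would let those unrelated words in, which is the trade-off any fuzzy search has to manage.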
Be aware, however, that fuzzy searching has some limitations.
In his article “‘The New Machine’: Discovering the Limits of ECCO,”* Patrick Spedding illustrates some of the challenges of searching a digital database such as Eighteenth Century Collections Online (ECCO), which was created from a microfilm collection and which uses OCR and fuzzy searching. Because microforms are a step removed from the original, they may cause problems: in the digital environment, much depends on the quality of that copy. As ECCO does not offer the behind-the-scenes OCR’d text for viewing, Spedding examined a passage from Eliza Haywood’s Female Spectator found in Google Books and Internet Archive, where the OCR’d text can be displayed, in order to determine the drawbacks of OCR’d eighteenth-century texts. Comparing the OCR’d documents with the actual text, he found an average error rate of over 150 typos per 2,000 characters, which made parts of the text unreadable. For example, the long “s” was interpreted as an “f” in both, and the OCR’d text reads “theiefined tall” for “the refined taste” and “t:ut: uifimguilhes” for “taste distinguishes.” Spedding points out that “it should also now be clear why the fuzzy search option on ECCO is of limited usefulness. While it seems able to resolve ‘prefs’ and ‘press’ or ‘molt’ and ‘most,’ it does not stand a chance against ‘fubjr&s,’ ‘t:ut:’ or ‘mofr.’ And even ‘Low’ level fuzzy searching tends to vastly increase false returns.”
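The gap Spedding describes can be made concrete by counting how many single-character edits separate each OCR reading from the intended word. This is a rough sketch using the pairs quoted above; splitting “t:ut: uifimguilhes” into two words, the edit-distance measure, and the helper names are assumptions made for illustration, not a description of ECCO’s actual matching.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: insertions, deletions and substitutions from a to b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


# (OCR reading, intended word) pairs taken from the passage above.
PAIRS = [
    ("prefs", "press"),
    ("molt", "most"),
    ("t:ut:", "taste"),
    ("uifimguilhes", "distinguishes"),
]


def recoverable(pairs: list[tuple[str, str]], max_distance: int) -> list[str]:
    """OCR readings that a search for the intended word would still reach (hypothetical helper)."""
    return [ocr for ocr, intended in pairs if levenshtein(ocr, intended) <= max_distance]


for ocr, intended in PAIRS:
    print(f"{ocr!r} -> {intended!r}: {levenshtein(ocr, intended)} edit(s)")
print(recoverable(PAIRS, max_distance=1))
```

A tolerance of one edit reaches “prefs” and “molt” but not “t:ut:”, which differs from “taste” in three of its five characters. A tolerance loose enough to reach it would also match many unrelated five-letter words, which is consistent with Spedding’s observation about false returns.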
The quality of OCR and fuzzy searching will continue to improve as technology improves, but, for now, when searching digital collections of materials from previous centuries, it is wise to use a variety of techniques to find relevant words within texts. If you are unsuccessful in your searching, contact the Research Center to make an appointment with a reference librarian for a consultation.
*Patrick Spedding, “‘The New Machine’: Discovering the Limits of ECCO,” Eighteenth-Century Studies 44, no. 4 (2011): 437–53.