Tuesday, May 30, 2023

OCR4all is better than Google

OCR4all is better than Google OCR for early modern printed books.

Klaus Graf pointed me to a suggestion from Lajos Adamik on using Google Docs for OCR. It's a great idea - Google is fast and fairly accurate. For some projects I'd definitely use it.

But not for most early printed books. I had earlier tried out different OCR4all models and model combinations on a page from Abbas Joachim magnus propheta (1516), so I already had images (in various stages of preparation) and uncorrected output text to compare.

(NB: The "fraktur-hist" models worked best. Contrary to OCR4all's documentation, using all five "fraktur-hist" models together gave better results than using any one of them alone.)


Here is the uncorrected OCR4all text. This is a fairly difficult text for OCR: Latin, many abbreviations, some barely perceptible abbreviation markings (some symbols may not display correctly in your browser, but I can see all of them here).

¶Jncipit liber de magnis tribulationibus in proximo futuris. Compilatus a docto ⁊ deuoto preſſ ptero ⁊ heremita Theoloſphoro de Cuſentia ꝓuincie Calabrie. Collectus vero ex vaticinijs nouorum prophetarum. ſ. beati Curilli: abbatis Joachim: Danda i: ⁊ Mer lini: ac veterum ſibillarum. Deinde abbreuiatus pervenerabilem fratrem Nuſtitianuʒ: vna cũ tractatu magiſtri Joãnis pa: iſini ordmis pᷣdicatoꝝ: de antichriſto ⁊ ſine mũdi: t fr̃is Tbertini de ſeptẽ ſtatib?eccleſie.

Here is the manually corrected text. Clearly a lot of post-recognition cleanup is necessary, but OCR4all does a decent job of picking up abbreviations. Producing a corrected text of the complete work would take perhaps 100-200 hours, or more.

¶ Incipit liber de magnis tribulationibus in proximo futuris. Compilatus a docto et devoto presbytero et heremita Theolosphoro de Cusentia provincie Calabrie. Collectus vero ex vaticiniis novorum prophetarum. s. beati Cirilli: abbatis Joachim: Dandali: et Merlini: ac veterum sibillarum. Deinde abbreviatus per venerabilem fratrem Rustitianum: una cum tractatu magistri Joannis parisini ordinis predicatorum: de antichristo et fine mundi: et fratris Ubertini de septem statibus ecclesie.

Here are a few examples from Google. The first one uses a deskewed base image. The amount of work required to correct the text will clearly be higher.

CIncipit liber de magnis tribulationibus in proximo futuris. Lompi latus a docto z ceuoro prefbytero z heremita Theolofpbozo de Lufentia puincie Lalabzie. Lollectus vero ex vaticinijs nouozum pzo/ phetarum.f. beati Lirilli: abbatis Joachim: Danda i: Delini:ac veterum fibillarum. Deinde abbreuiatus pervenerabilem fratrem Ruftitianus:vna cu tractatu magiftri Joanis parifini ordinis pdicator de antichrifto z fine müdi:z fris Ubertini de fepté ftatib ecclefie.
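One rough way to put a number on "amount of work required" is the character-level edit distance between raw OCR output and the corrected text, divided by the length of the corrected text. The sketch below is only an illustration, not how I made the comparison above: the function names are my own, and the three strings are short excerpts copied from the samples in this post. Because the corrected text also expands abbreviations and normalizes u/v and long s, the figure counts those editorial changes as "errors" too, so treat it as a proxy for correction effort rather than a true recognition error rate.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    # Keep the shorter string as the row to limit memory use;
    # the distance is symmetric, so swapping is safe.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text: str, reference: str) -> float:
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(ocr_text, reference) / len(reference)


# Short excerpts taken from the samples in this post.
reference = "Compilatus a docto et devoto presbytero"
ocr4all = "Compilatus a docto ⁊ deuoto preſſ ptero"
google = "Lompi latus a docto z ceuoro prefbytero"

for name, text in [("OCR4all", ocr4all), ("Google", google)]:
    print(name, round(character_error_rate(text, reference), 3))
```

A single phrase is far too small a sample to prove anything. A fairer comparison would run the full page through both tools and score each against a transcription that keeps the original abbreviations and spelling.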

Using one of OCR4all's preprocessed images shows what can go wrong with Google Docs:

Incipit liber de magnis omne malum fuper omnes habitatores terre. tribulationibus in proximo futuris. Lompt Itaq; fub tali impatore/tres falt pape creabus tur:vnus grecus:alius italus:z alius germa/ latus a docto z ceuoro presbytero z heremita Theololphoto de Lufentia puincie Lalabae.us/oium peffimus: erunt finguli adinuicez Lollectus vero ex varicinije nouocum pro phetarum,f. beati Lirilli: abbatis Joachim: Danda i: dei lini:ac veterum fibillarum. Deinde abbreuiatus pervenerabilem fratrem Ruffitianuz:vna cũ tractatu magistri Joānis parifini ozdinis pdicator:de antichrifto z fine mūdì:z frio Ubertini de fepté flatib ecclefie.

Pretty bad. While a cleaner image should produce better results, Google's OCR failed to recognize the page's two-column layout and instead smashed the two columns together unpredictably (underlined/grey text): sometimes one line at a time, sometimes two lines, then not at all. The results are unusable. How often would this happen when scanning a 100-page book?


I realize this is all based on one page of one book, but I think this points to several ways that OCR4all is a better fit for all but the most straightforward projects.

  • More accurate recognition of abbreviations and other symbols typically found in early modern books
  • User control over page layout - you can exclude images, marginalia, etc., without editing the image
  • User control over segmentation - you can check where lines/columns have been divided before recognition, and make changes if necessary
  • Better interface for correcting recognized text - Google doesn't reveal low-probability characters, and you're stuck with whatever it offers
  • Access to better recognition models/user control of models - although OCR4all will need to keep improving its models over time
