Thursday, April 6, 2023

OCR4all is good

 For a current project, it would be useful to have an electronic text of pseudo-Vincent Ferrer's De fine mundi (or Anton Sorg's 1486 edition of a German translation, to be precise).

Let me back up a bit. For every project that I've worked on since the beginning of grad school, I've wished for an electronic text of every primary and secondary source. They've rarely been available. When they've been essential, getting them has usually required a lot of manual typing.

For modern works, ABBYY FineReader works extremely well, but it didn't produce usable results for 15th- and 16th-century books, even with considerable fine-tuning. I looked at other options a few years ago, but they looked more like solutions for specialized digitization centers.

Until recently. After seeing a colleague's results, I gave OCR4all another try, and it's finally giving me the electronic texts I want for a reasonable amount of effort.

The installation instructions are mostly clear. There's a learning curve and I'm far from understanding all the options, but it largely produces good enough output as is. Hardware requirements are quite modest - I'm running it on my work computer, an Intel Core i5-4590 from 2016.

There are a few things I've done to speed up the process.

  1. Most of the texts I want to digitize are available as scanned images. Before I turn OCR4all loose, I do some minimal preparation work using IrfanView, a free, lightweight, all-purpose image viewer and editor: I rotate any pages that are noticeably off horizontal (OCR4all will do its own fine-tuning), and I crop out extraneous margins. I set up an AutoHotkey script so I can save the file with one key combination once I have the cropping rectangle selected.
  2. Once I start up OCR4all, I can mostly let it do its preprocessing, noise removal, page segmentation and line segmentation on its own. But it's a good idea to check a few pages to make sure noise removal hasn't gotten out of control, and then to check the line segmentation of each page to make sure OCR4all hasn't skipped over anything and has chosen a rational reading order.
  3. For text recognition, I select five historical Fraktur models. I wish it were clearer why I'm selecting five, and all Fraktur models rather than a mix or something else, but I'm getting good results.
  4. OCR4all has a good visual editor for comparing recognized text to the page image. To speed things up, I'll zoom the text to around 160%, or whatever makes the lines about the same length as those in the page image. Then I check off "Show Prediction" and "Show Confidence" and raise the Word Confidence to 0.95. This highlights the words I need to double-check in orange. I'll make any edits to those words and assume the rest are good enough for now.
  5. After checking each page, I export a text file, then do some searching and replacing in Notepad++. Its pattern-matching abilities based on regular expressions are much better than what's available in Word. I replace tall ſ and ʒ with s and z, for example - the goal here isn't an edition, but an easily readable text. I also remove line breaks.
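
The cleanup in the last step can also be scripted for anyone who would rather not repeat the same replacements by hand. What follows is only a rough sketch in Python, not what I actually run in Notepad++: the function names and the replacement table are made up for illustration, and it assumes that blank lines in the exported file mark paragraph breaks worth keeping.

```python
import re

# Illustrative replacement table (an assumption): extend it with whatever
# other substitutions a given text needs.
REPLACEMENTS = {
    "ſ": "s",  # long s
    "ʒ": "z",  # z-like ezh
}


def replace_characters(text: str) -> str:
    """Swap historical letterforms for their modern equivalents."""
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    return text


def remove_line_breaks(text: str) -> str:
    """Join the lines of each paragraph with single spaces.

    Assumption: one or more blank lines separate paragraphs, and those
    paragraph breaks are kept.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    joined = []
    for paragraph in paragraphs:
        lines = [line.strip() for line in paragraph.splitlines() if line.strip()]
        joined.append(" ".join(lines))
    return "\n\n".join(joined)


def clean_export(text: str) -> str:
    """Apply both cleanup steps to the text of an exported file."""
    return remove_line_breaks(replace_characters(text))


if __name__ == "__main__":
    sample = "daſʒ waſſer\nvnd blůt\n\nAMEN"
    print(clean_export(sample))
```

Words hyphenated across a line break are deliberately left alone here, since how they are marked varies from one print to the next.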

 And that is how I can go from a digital facsimile to a usable electronic text in less than an hour for a pamphlet, and a few hours for a longer work like De fine mundi.


How it started

How it's going

Jtem/ vil wilder thier/ id est (Fürsten) werden widerwertiglich mit einander reden. Jm. 1524. jar. Vnd ein grosser adler/ id est (Keiser) desselben haupt/ vnd werden gehen auff die weitte der er den / id est (Türcken vnd das heilig land) vnd welcher den namen des sterns / id est (Bapst)hat/ derselb wirt geladen/ aber er wirt nit kumen / doch so wirt er senden den/ der den namen des visch hat/ id est (delfin/ das ist der Künig von Franckreich) vnd wirt an sich nemen der adler ein grosse samlung/ vñ der Leopardus/ id est (der Römisch Künig) auß Campo albo / vnd durch die Ritterschafft eingehen vñ machen ein haupt in der Marchia/ id est (in der Mar chia vnd eins ist in Welschen landen) vnd wirt darnach gehen wider das erdtreich/ das da seyn grund hat von verreterey/ id est (wider Welsch land vnd der Römer) vnd welcher den namen des vischs hat / der wirt an sich nemen den weg Leopardus / vñ kumen wider jn vber die lender Campaniam zwischen der grund Parmam schlösser/ id est (stet in Welschen landen) vnd wirt sich halten widerwertig / vnd da wirt dann vil blüts vergiessung durch manschlacht bey dem fluß / welcher genennet ist der arbeyt / id est (wascher fiuß) darnach wirdt derselb fluß des blůts / der von der manschlacht einkumpt/ vnd der da den gantzẽ tag billt / id est (Künig von Vngern) der wirt sterben von eisen/ vnd wirt vber jn herschen der Leopardus / darnach wirt er sterben natürlichs tods/ vñ wirt darnach frid durch dz gantz erdtrich werdẽ/ Das werd war/ AMEN.