Thursday, April 27, 2023

Digital text: Methodius, Revelationes [German], 1497

Who needs a digital text when digital facsimiles are available? I do. When I'm hunting citations or comparing versions, I need to be able to annotate, rearrange, and copy and paste. In addition, it's much faster to read an early modern text in a modern font, especially if I'm scanning a text I've read to find something I remember seeing previously. It's what makes the effort required to use OCR4all worth it.

Here's one result, for example: a complete digital text of Pseudo-Methodius' Revelationes in German, published in 1497 (ISTC im00526000, GW M23065; facsimile from the Bayerische Staatsbibliothek).

I've prepared the text for my own purposes, so:

  1. The numbers count images starting from the title page, not anything useful like leaves.
  2. I've made minimal effort to resolve abbreviations or normalize spelling.
  3. I'm using modern equivalents for s and z.
  4. I read through the text once to straighten out the formatting and catch errors, but there are undoubtedly still errors in the text.

The German text was more useful to me than the Latin text, and it uses fewer abbreviations. That does mean that this text doesn't include Wolfgang Aytinger's commentary on Methodius, so I might have to deal with it separately. I'm currently digitizing a Latin work for the first time, and I don't yet know how well OCR4all will handle abbreviations.

I have several more texts like this that I'll post eventually, and I'm working on more as time permits. One goal is to end up with a workable electronic text of Lichtenberger's Prognosticatio.

Hopefully someone will find these working notes useful. If you need to cite something, cite Albrecht Kunne's edition or a modern critical edition of Methodius.

Thursday, April 6, 2023

OCR4all is good

For a current project, it would be useful to have an electronic text of pseudo-Vincent Ferrer's De fine mundi (or Anton Sorg's 1486 edition of a German translation, to be precise).

Let me back up a bit. For every project that I've worked on since the beginning of grad school, I've wished for an electronic text of every primary and secondary source. They've rarely been available. When they've been essential, it usually required a lot of manual typing.

For modern works, ABBYY FineReader works extremely well, but it didn't produce usable results for 15th- and 16th-century books, even with considerable fine-tuning. I looked at other options a few years ago, but they looked more like solutions for specialized digitization centers.

Until recently. After seeing a colleague's results, I gave OCR4all another try, and it's finally giving me the electronic texts I want for a reasonable amount of effort.

The installation instructions are mostly clear. There's a learning curve and I'm far from understanding all the options, but it largely produces good enough output as is. Hardware requirements are quite modest - I'm running it on my work computer, an Intel Core i5-4590 from 2016.

There are a few things I've done to speed up the process.

  1. Most of the texts I want to digitize are available as scanned images. Before I turn OCR4all loose, I do some minimal preparation work using IrfanView, a free, lightweight, all-purpose image viewer and editor: I rotate any pages that are noticeably off horizontal (OCR4all will do its own fine-tuning), and I crop out extraneous margins. I set up an AutoHotkey script so I can save the file with one key combination once I have the cropping rectangle selected.
  2. Once I start up OCR4all, I can mostly let it do its preprocessing, noise removal, page segmentation and line segmentation on its own. But it's a good idea to check a few pages to make sure noise removal hasn't gotten out of control, and then to check the line segmentation of each page to make sure OCR4all hasn't skipped over anything and has chosen a rational reading order.
  3. For text recognition, I select five historical Fraktur models. I wish it were clearer why I'm selecting five, and why all Fraktur models rather than a mix or something else, but I'm getting good results.
  4. OCR4all has a good visual editor for comparing recognized text to the page image. To speed things up, I'll zoom in the text to around 160%, or whatever makes the lines about the same length as the page image. Then I check off "Show Prediction" and "Show Confidence" and raise the Word Confidence to .95. This highlights the words I need to double-check in orange. I'll make any edits to those words and assume the rest are good enough for now.
  5. After checking each page, I export a text file, then do some searching and replacing in Notepad++. Its pattern-matching abilities based on regular expressions are much better than what's available in Word. I replace tall ſ and ʒ with s and z, for example - the goal here isn't an edition, but an easily readable text. I also remove line breaks.
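For anyone who would rather script the cleanup in step 5 than do it by hand in Notepad++, here is a minimal sketch in Python. It is not what I actually run, just one way to get the same result. The function name `normalize` is made up for illustration, and it assumes that the tall s is U+017F, that the z-like character is U+0292, and that blank lines in the exported text mark paragraph breaks worth keeping.

```python
import re

# Characters to modernize, as described in step 5.
# Assumption: tall s is U+017F and the z-like character is U+0292;
# extend this table if your export contains other historical glyphs.
GLYPHS = str.maketrans({"\u017f": "s", "\u0292": "z"})


def normalize(text: str) -> str:
    """Replace historical glyphs and remove line breaks within paragraphs."""
    text = text.translate(GLYPHS)
    # Assumption: one or more blank lines separate paragraphs.
    paragraphs = re.split(r"\n\s*\n", text)
    # Join each paragraph's lines with single spaces.
    joined = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in joined if p)
```

Calling `normalize` on the contents of an exported text file returns the same text with modern s and z and with each paragraph on a single line. It does not rejoin words hyphenated across line ends, so those still need a separate pass.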

And that is how I can go from a digital facsimile to a usable electronic text in less than an hour for a pamphlet, and a few hours for a longer work like De fine mundi.


How it started

How it's going

Jtem/ vil wilder thier/ id est (Fürsten) werden widerwertiglich mit einander reden. Jm. 1524. jar. Vnd ein grosser adler/ id est (Keiser) desselben haupt/ vnd werden gehen auff die weitte der er den / id est (Türcken vnd das heilig land) vnd welcher den namen des sterns / id est (Bapst)hat/ derselb wirt geladen/ aber er wirt nit kumen / doch so wirt er senden den/ der den namen des visch hat/ id est (delfin/ das ist der Künig von Franckreich) vnd wirt an sich nemen der adler ein grosse samlung/ vñ der Leopardus/ id est (der Römisch Künig) auß Campo albo / vnd durch die Ritterschafft eingehen vñ machen ein haupt in der Marchia/ id est (in der Mar chia vnd eins ist in Welschen landen) vnd wirt darnach gehen wider das erdtreich/ das da seyn grund hat von verreterey/ id est (wider Welsch land vnd der Römer) vnd welcher den namen des vischs hat / der wirt an sich nemen den weg Leopardus / vñ kumen wider jn vber die lender Campaniam zwischen der grund Parmam schlösser/ id est (stet in Welschen landen) vnd wirt sich halten widerwertig / vnd da wirt dann vil blüts vergiessung durch manschlacht bey dem fluß / welcher genennet ist der arbeyt / id est (wascher fiuß) darnach wirdt derselb fluß des blůts / der von der manschlacht einkumpt/ vnd der da den gantzẽ tag billt / id est (Künig von Vngern) der wirt sterben von eisen/ vnd wirt vber jn herschen der Leopardus / darnach wirt er sterben natürlichs tods/ vñ wirt darnach frid durch dz gantz erdtrich werdẽ/ Das werd war/ AMEN.