Saturday, December 9, 2023

Honorable mention for Theuerdank

My translation of Theuerdank was awarded one of two honorable mentions for the 19th Aldo and Jeanne Scaglione Prize for a Translation of a Literary Work from the Modern Language Association.

My co-author Howard Louthan deserves a lot of the credit for this, as he first suggested the idea of translating Theuerdank for use in undergraduate teaching, and the prize citation specifically mentions the "supporting materials (maps, chronology, a key to the characters and narrative, discussion questions, and suggestions for further reading)" – which Howard provided. We were also fortunate to work with a great team of editors and production staff at Routledge, who made sure the complex layout and ~120 illustrations were implemented correctly.

Looking back at my files, it appears Howard and I first discussed the idea in March 2018. I started preparatory work in April and translation work in August 2018, and finished the initial translation by August 2019. Revising the translation and preparing the manuscript lasted into 2020, and the publication process was complete in 2022.

Monday, November 13, 2023

Some final thoughts on eScriptorium

 After three months, I've been able to get back to eScriptorium. At this point, I think it's a reliable solution for creating the kind of digital texts I need in a timely and effective way. Being able to take advantage of the feedback loop from recognition to editing to training to recognition is a huge benefit. Some additional notes:

  1. I upgraded my old machine with a refurbished Nvidia 3060-based GPU, and it makes quite a difference. For a separate project, recognition tasks were sped up 6-10 times. For OCR, it made training with Kraken significantly faster, although I didn't measure it. Page recognition is much quicker. For around $200, definitely worth it - it doesn't just speed up one step, it makes it possible to test and experiment in a reasonable amount of time. I also experienced fewer hanging processes in eScriptorium.
  2.  Docker crashed at one point and took everything with it. Not an eScriptorium problem, but making periodic backups or exporting XML is probably a good idea.
  3. But re-installation was easy. I did have to remember to reinstall eScriptorium from git in a WSL shell rather than in a Windows command prompt (there's a sketch after this list). I don't know why that makes a difference to Docker, but it does.
  4. eScriptorium does have a panning tool: you can right-click and drag an image around. However, there are limits in place when the image fills the workspace. If you shrink or enlarge the image, you can move it around the workspace at will, but if the image is at exactly 100% size, you can't move it at all. I still find this counterintuitive: why can I pan the image at every other size, but not at 100%?
  5. I wish I could hold CTRL or SHIFT to toggle between scrolling up/down the screen, and resizing the image. Doing one when you mean to do the other is irritating. To scroll up/down the screen, you have to move your mouse pointer to a margin or gutter region.
  6. eScriptorium does let you select and delete lines and points in bulk, for example within a woodcut. Just hold down SHIFT, select, and hit DELETE. However, this only works if you don't have "Cut through lines" (scissors icon) selected. Why can't I select and delete lines in bulk when "Cut through lines" is selected? This is counterintuitive, especially since the first thing I would usually do after cutting through lines was to delete points and lines in bulk.
  7. To turn editing features on/off, you click on their icon. They change from blue to green. (Or is it from green to blue?) The mask button additionally cycles from blue to green to gray. Is blue on and green off, or the other way around? And the "Cut through lines" feature changes from yellow to green. Why yellow and not blue? And I still don't remember which one is on or off, even after many hours using the editing screen. This is, I think, not optimal UI design.
  8. I wish the editing icons stayed on screen instead of scrolling off. When I'm editing the lower section of an image, it's an annoyance to need to scroll up to click on something. The web interface makes it difficult to keep buttons in place, but it's annoying nevertheless. [Just hit the "C" key to switch between "Cut through lines" and usual operation. It's fast and easy and documented in the help screen and doesn't clutter up the screen with floating toolbars.]
  9. There's a question mark icon that you can click for a help screen. Maybe my workflow isn't what the designers envisioned, but almost nothing on the help screen was relevant to how I worked. I searched for and read help files as issues came up, but I didn't watch any of the videos.
  10. Line renumbering works! If a line gets skipped for some reason, it's not too hard to add it, enter the text manually, and have it renumber automatically. I'm not sure what would happen in a multi-column environment.
  11. The results of transcription after editing just 5 pages and retraining a model based on them are astounding. I can't emphasize enough how much easier this makes digitizing early modern printed texts and how radically this might change scholarship that draws on them. Things that once seemed impossible, or possible only for institutions with significant specialized personnel and IT resources, are now possible on my desktop in my spare time.
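
For reference, reinstalling was essentially just cloning the repository and bringing the Docker stack back up from inside a WSL shell. A rough sketch from memory, so check the current install documentation (and copy and adjust the environment file as it describes) rather than trusting these exact commands:

git clone https://gitlab.com/scripta/escriptorium.git
cd escriptorium
docker-compose up -d --build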

Anyway, thanks very much to all those who provided suggestions and corrections, and especially to the developers. eScriptorium is finally letting me do some things I've been hoping to do for almost 20 years.


Wednesday, August 16, 2023

eScriptorium: Let's try breaking some rules

I am not a scriptorium. I'm just one guy with a professional interest in late medieval and early modern prophetic texts who needs to turn digital facsimiles into electronic texts as quickly and painlessly as possible so I can read, skim, annotate, compare and search them.

eScriptorium uses kraken as its recognition engine, and in fact it's entirely possible to install kraken as a standalone program (using the Windows Subsystem for Linux) and run it from the command line. Kraken in turn makes it possible to train or refine a recognition model using the command-line program ketos.
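
For anyone who wants to try this, installing kraken standalone is, as far as I can tell, just a pip install inside a WSL shell. A minimal sketch, assuming a Debian/Ubuntu distribution with Python 3 available:

python3 -m venv ~/kraken-env
source ~/kraken-env/bin/activate
pip install kraken
kraken --help

The same environment also provides the ketos command used for training below.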

According to the kraken documentation, "Transcription has to be diplomatic, i.e. contain the exact character sequence in the line image, including original orthography." What I need, though, is readable text. Especially if I'm searching a longer text, I need to be able to find all occurrences of a word without worrying about abbreviations or spelling conventions.

This sets up the following workflow:

  • Pre-edit images as necessary (much reduced with eScriptorium, which handled skewed images well)
  • Import images into the OCR environment
  • Automatically recognize baselines/regions
  • Manually correct baselines/regions
  • Automatically recognize a bunch of pages (a command-line sketch of the recognition step follows this list)
  • Manually correct the text as diplomatic transcription (Larissa Will added some useful pointers in a comment on my last post)
  • Export XML files and JPG images
  • Train a work-specific model with ketos using these files
  • Recognize the rest of the text in eScriptorium using this model
  • Correct the diplomatic transcription by resolving abbreviations and lightly normalizing the text
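
I do all of these steps in the eScriptorium interface, but for what it's worth, segmentation and recognition can also be run directly with kraken from a WSL shell. A rough sketch with hypothetical file names, chaining baseline segmentation and recognition in one call:

kraken -i page.jpg page.txt segment -bl ocr -m luther_best.mlmodel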

It's a lot of work. OCR4all and its models are set up to handle just about any abbreviation you throw at them, but the rest of the world isn't. Diplomatic transcription requires Unicode extensions that Notepad++ and Word don't handle well in every case (or even many cases). And it requires an enormous amount of time to resolve the abbreviations and normalize the text, even with careful use of search and replace. I did it with one longer work, and it's exhausting.
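
Careful search and replace does help with the most common glyphs. A couple of hypothetical sed substitutions of the kind I mean, resolving ꝓ to "pro" and ꝝ to "rum" in an exported text file:

sed -e 's/ꝓ/pro/g' -e 's/ꝝ/rum/g' diplomatic.txt > resolved.txt

The long tail of context-dependent abbreviations still has to be handled by hand.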

But what happens if we break the rules? I don't want to deal with diplomatic transcriptions. You're not the boss of me and you can't make me. What if we don't give kraken a diplomatic transcription? What if we feed it transcribed pages of resolved abbreviations or even normalized text? Will it harm accuracy, and will the results be usable? Can we get kraken to take over the work of normalizing the text for us?

Yes we can.

I started with UB Mannheim's recognition model based on 16th-century German Gothic prints (luther_best.mlmodel), which gave results very similar to, and perhaps marginally better than, its model based on German and Latin prints of various periods (german_print_best.mlmodel). The OCR target is again a 1498 edition of Wolfgang Aytinger's commentary on pseudo-Methodius. I experimented a bit along the way, but I had 24 corrected pages (exported in PAGE XML format) by the time I conducted the final comparison.

So in my eScriptorium directory, I currently have a subdirectory, "train," containing 24 pairs of .jpg and .xml files, named 0058.jpg/0058.xml through 0080.jpg/0080.xml. Between them, there are around 650 corrected lines of text. In my WSL-based Debian distribution, I navigate to the eScriptorium directory and issue the following command:

ketos -v train -i luther_best.mlmodel -f page train/*.xml

To explain each part:

-v    Verbose output, so I can see more of what's going on

train    The command telling ketos what to do

-i    Instead of trying to train a new model from scratch, I want ketos to retrain an existing model, luther_best.mlmodel, found in the directory where I issue the command

-f    The XML files are in PAGE format. 

train/*.xml    The path to the files to use for training 

Retraining a model with just 5 corrected pages yielded much improved recognition results. On my underpowered computer, each training iteration took around 5 minutes and the whole process took a few hours. With more data, training takes longer - with 24 pages, each iteration took 20+ minutes and I had to let the process run overnight.
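
For what it's worth, ketos can apparently also run training on a CUDA-capable GPU, which should cut these times dramatically. A sketch, assuming the device option still works this way and the GPU is visible from WSL:

ketos -v train -d cuda:0 -i luther_best.mlmodel -f page train/*.xml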

Was it worth it?

Here are the first 10 lines of the page from the HAB that I used for testing.


And here are the first 10 lines using the base Luther model:


I mean, kind of? I could use it, but it would take a lot of work to fix. The base German_Print model is about the same:

A little better, a little worse. Still a lot of work.

After training on work-specific pages, accuracy can be expected to be much better. But how will kraken do with abbreviations after training on an undiplomatic transcription? As it turns out, brilliantly.

In this image, correct resolutions of abbreviations are marked in green and incorrect ones or other mistakes in red. eScriptorium got it right 47/57 times (82%) after training on just the abbreviations that occur in 24 pages. And the best part is that I only have to deal with plain old ASCII-compliant characters that will copy and paste anywhere without complaint.

I can almost read this as is. I can work with this.

But what about normalizing i/j and u/v and similar? I re-corrected the 24 pages by normalizing the text, re-exported the XML, and re-trained. Here are the results:

In this image, green and red are used as above to mark correct/incorrect expansions, while blue/purple are added to mark correct and incorrect normalization.

And the results are fantastic. Training on normalized text with resolved abbreviations didn't cause any loss of recognition accuracy. Kraken normalized the text correctly 10/11 times, including getting the internal "v" correct in conversus and numeravit.

Even some of the mistaken abbreviation resolutions are better than nothing. I just need to delete an "n" from "quenm," and the "m" is already there. The less I have to type, the faster the last step will go. This is going to save me a lot of work.

Finally, someone might ask if I couldn't just train a model from scratch using pages from Aytinger and have a completely tailor-made recognition model. Wouldn't that be even more accurate?

No. At this point, the results are laughable - quite literally. Trained on 650 lines of corrected text, the resulting recognition model has a 4% accuracy rate and yielded the following transcription:

tee-
teei
teet
teei
tee-
tee-
tee-
teei
ees
teet

I don't know how much text is required to train a model from scratch, but "around 650 lines" is not the answer.

Wednesday, August 2, 2023

eScriptorium is bad. And brilliant.

OCR4all isn't the only option for turning early printed books into electronic texts. According to Stefan Weil (via Klaus Graf), eScriptorium was able to produce usable text of pseudo-Vincent Ferrer's De fine mundi in German translation with moderate effort based on segmentation and text recognition models from the UB Mannheim. With 15 corrected pages, it was then possible to train a work-specific OCR model for the rest. (His results are here.)

First the good news.  The documentation states that eScriptorium runs under either Linux or MacOS, but it installs and runs without major incident as a Docker app under Windows. The recognition models from the UB Mannheim are very good for early German printed books. Using a best-case scan of Joseph Grünpeck's 1508 Speculum in German from the Österreichische Nationalbibliothek, it produced impressively accurate results. Results were still very good for a less-than-perfect scenario (Wolfgang Aytinger's Latin commentary on pseudo-Methodius from the HAB). The underlying Kraken OCR engine has finally made it possible for me to train and use my own OCR models. I can't imagine using anything but eScriptorium after this. It's that good.

But eScriptorium is also bad.

The installation documentation isn't great, and the instructions for use are confusing. A quickstart guide would be helpful. I've run other OCR programs before, but it still took a long time for eScriptorium to start making sense.

Importing images into a new project went smoothly, and eScriptorium seems to handle PDFs just fine. So there's less need to pre-process images, and conversion to grayscale is unnecessary.

Segmentation, using either the default or the UB Mannheim segmentation model, worked quite well. But one of eScriptorium's major flaws is that the segmentation editing tools are very limited compared to OCR4all, and they behave erratically. There's no easy way to select an area and delete all text lines at once, which is a real problem with woodcut illustrations, which are often recognized as a dozen or more short text lines at odd angles. Instead of selecting all lines or the whole region, I have to select each line and delete it one by one. Attempting to edit region outlines or lines quickly leads to frustration. The order of text lines can be displayed, but there's no way to change it if it's incorrect (which I've seen more than once so far). There's no way to "grab" the page image and move it around, so it can be difficult to move it back to the center of the screen at a reasonable size. The equivalent tools in OCR4all are much more comprehensive and easier to use.

When segmenting or recognizing multiple pages, processes would stop without warning or explanation. The only reason I knew they had stopped was because CPU usage in Docker dropped back into low single digits. The failed processes weren't sequential, but instead affected pages seemingly at random, which meant I had to go back through the images one by one to see where segmentation or recognition had failed.

Some tasks proved impossible to cancel at all, even after stopping Docker or restarting Windows. My first attempts at training a model in eScriptorium were unsuccessful, but even now one of them is still listed as ongoing, and clicking the "Cancel training" button generates the less-than-helpful error "SyntaxError: JSON.parse: unexpected character at line 1 column 1 of the JSON data." (For the curious: the raw data is "status," and nothing else.)

The transcription window is nicely responsive and easy to use. I especially like how the text resizes so the line width matches the image above it, making it easy to edit or transcribe. I prefer it to the OCR4all transcription screen. The eScriptorium window only shows one line at a time, however, and often it's necessary to see the next or preceding lines to choose the best transcription. It also seems to lack OCR4all's ability to show low-probability characters.

With good recognition models, recognition results are fantastic. A pattern I've noticed with perhaps half the pages I've processed so far, however, is that the last line on the page will have abominable recognition quality, as if eScriptorium is somehow reverting to a rudimentary model for the final line of the page.

You can export recognized, transcribed and/or corrected pages in ALTO or PAGE XML formats or plain text, with or without the images - maybe. Once I had downloaded one XML format without images, eScriptorium kept generating the same ZIP file no matter what other options I chose until I restarted the whole stack in Docker. Plain text is exported only as a single block of text rather than as separate text files. If you need to generate ground truth text files, you'll have to copy and paste by hand, without even a page separator to guide you.
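
In the meantime, the line transcriptions are easy enough to pull back out of an exported PAGE XML file with standard command-line tools. A rough sketch with hypothetical file names, assuming each transcription sits on a single line inside a <Unicode> element, which I believe is how PAGE XML normally stores them:

grep -o '<Unicode>[^<]*</Unicode>' 0058.xml | sed 's/<[^>]*>//g' > 0058.gt.txt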

Effective OCR relies on high-quality models, but the central Zenodo repository is nearly empty and doesn't offer anything suitable for the early modern printed books I work with. I knew that better models existed somewhere and even some of the file names, but even with that information it took me two days to find them. (I eventually found them here.)

One of the nice features of eScriptorium is the underlying Kraken engine. You can install it under the Windows Subsystem for Linux (I've got it running under Debian) and work with it directly from the command line, no web interface needed. The bad news, again, is the confusing and sometimes contradictory documentation, with the same command-line switches meaning different things depending on the particular command given to Kraken.
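
To give one example of what I mean, with hypothetical file names: in the base kraken command, -i introduces an input/output pair, while in ketos train (as described above) the same switch names an existing model to refine.

kraken -i page.jpg page.txt segment -bl ocr -m luther_best.mlmodel
ketos train -i luther_best.mlmodel -f page train/*.xml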

My overall impression is: nomen est omen. OCR4all really is trying to bring OCR for early print to a broad audience of individual users, while eScriptorium is targeted at the digital scriptorium, with specialized hardware and trained personnel working on major digitalization efforts.

Still, eScriptorium gives me the sense, in a way that OCR4all didn't, that the promised land is in sight: I'll be able to take a digital facsimile, get a good initial transcription, then train a work-specific model that will produce very good to excellent results for the complete work. A working digital text might mean the investment of a few hours and not dozens.

Friday, July 14, 2023

Modern editions of medieval and early modern prophecies v.01

In my research, I keep asking myself related questions: "Are there any better options than a digital facsimile?" "Is this text edited anywhere?" "Why does everyone cite a 1690 edition of this key work?"

Some prophetic works have been edited. I know because I keep stumbling over the editions. In some cases we have the best the nineteenth century could do, and in others we have exemplary modern editions. I listed several editions in my Oxford bibliography, but I couldn't list even all the editions I knew of then, and the focus wasn't on editions in any case. To make the process of discovery less haphazard, I'm creating this list and updating it as I come across new works.

To the extent possible, I am excluding astrology and focusing on works in the form that circulated widely rather than on "authentic" or original versions. I will include short prophecies, but generally exclude partial editions of longer works. Any comments will be brief.

Next time: Add the various short prophecies edited in the nineteenth century and later, and the Oracula Cyrilli.

Birgitta of Sweden

Montag, Ulrich. Das Werk der heiligen Birgitta von Schweden in oberdeutscher Überlieferung: Texte und Untersuchungen. Münchener Texte und Untersuchungen zur deutschen Literatur des Mittelalters 18. Munich: C. H. Beck, 1968.

  • Latin and German editions of the Onus mundi compiled by Johannes Tortsch.

Hildegard of Bingen

Santos Paz, José Carlos, ed. La obra de Gebenón de Eberbach. La tradizione profetica 2. Tavarnuzze (Firenze): SISMEL, Edizioni del Galluzzo, 2004.

  • While excellent editions of Hildegard's work exist, this is the only complete edition of the widely copied Pentachronon.

 Johannes de Rupescissa

Vademecum

Johannes de Rupescissa. Vade mecum in tribulatione. Edited by Elena Tealdi, Robert E. Lerner, and Gian Luca Potestà. Milan: Vita e Pensiero, 2015.

Kaup, Matthias. John of Rupescissa’s Vade Mecum in Tribulacione (1356): A Late Medieval Apocalypse Manual for the Forthcoming Fifteen Years of Horror and Hardship. Farnham: Ashgate, 2013.

Other works

Rupescissa, Johannes de. Liber secretorum eventuum: Edition critique, traduction et introduction historique. Edited by Robert E. Lerner and Christine Morerod-Fattebert. Fribourg: Editions Universitaires Fribourg Suisse, 1994.

Paracelsus

Paracelsus. Theophrast von Hohenheim gen. Paracelsus: Sämtliche Werke. Edited by Karl Sudhoff. 14 vols. Munich and Berlin: R. Oldenbourg, 1929.

  • Where among these 14 volumes are the prophetic works located? I'll have to take another look at this later. Available by open access. Vol. 1 is here.
Pseudo-Methodius

  • The problem with pseudo-Methodius editions is that most are focused on the early stages of the text in Syriac, Greek and Latin from the seventh to ninth centuries A.D., while I'm interested in what the text was doing in Europe nearly a thousand years later.

Sackur, Ernst. Sibyllinische Texte und Forschungen: Pseudomethodius, Adso und die Tiburtinische Sibylle. Halle (Saale), 1898. 1-96.

Pseudo-Methodius. Apocalypse of Pseudo-Methodius: An Alexandrian World Chronicle. Translated by Benjamin Garstad. Dumbarton Oaks Medieval Library, DOML 14. Cambridge, Mass: Harvard University Press, 2012.

Grifoni, Cinzia, and Clemens Gantner. “The Third Latin Recension of the Revelationes of Pseudo-Methodius – Introduction and Edition.” In Cultures of Eschatology, edited by Veronika Wieser, Vincent Eltschinger, and Johann Heiss, 1:194–253. Cultural History of Apocalyptic Thought. Berlin; Boston: De Gruyter Oldenbourg, 2020.

 Sibylline prophecies

  • This will probably get split into separate categories at some point.

Neske, Ingeborg. Die spätmittelalterliche deutsche Sibyllenweissagung: Untersuchung und Edition. Göppingen: Kümmerle, 1985.

  • Perhaps the first vernacular printed work.

Sackur, Ernst. Sibyllinische Texte und Forschungen: Pseudomethodius, Adso und die Tiburtinische Sibylle. Halle (Saale), 1898. 114-187.

  • Tiburtine sibyl

 Ve mundo

 Kaup, Matthias, and Robert E. Lerner. “Gentile of Foligno Interprets the Prophecy ‘Woe to the World,’ with an Edition and English Translation.” Traditio 56 (2001): 149–211.

Tuesday, May 30, 2023

OCR4all is better than Google

OCR4all is better than Google OCR for early modern printed books.

Klaus Graf pointed me to a suggestion from Lajos Adamik on using Google Docs for OCR. It's a great idea - Google is fast and fairly accurate. For some projects I'd definitely use it.

But not for most early printed books. I had earlier tried out different OCR4all models/model combinations on a page from Abbas Joachim magnus propheta (1516), so I had images (in various stages of preparation) and uncorrected output text to compare.

(NB: I got best results using the "fraktur-hist" models. Contrary to OCR4all's documentation, I got best results from using all five "fraktur-hist" models together compared to using any one of them alone.)


 Here is the uncorrected OCR4all text. This is a fairly difficult text for OCR: Latin, many abbreviations, some barely perceptible abbreviation markings (some symbols may not display correctly in your browser, but I can see all of them here).

¶Jncipit liber de magnis tribulationibus in proximo futuris. Compilatus a docto ⁊ deuoto preſſ ptero ⁊ heremita Theoloſphoro de Cuſentia ꝓuincie Calabrie. Collectus vero ex vaticinijs nouorum prophetarum. ſ. beati Curilli: abbatis Joachim: Danda i: ⁊ Mer lini: ac veterum ſibillarum. Deinde abbreuiatus pervenerabilem fratrem Nuſtitianuʒ: vna cũ tractatu magiſtri Joãnis pa: iſini ordmis pᷣdicatoꝝ: de antichriſto ⁊ ſine mũdi: t fr̃is Tbertini de ſeptẽ ſtatib?eccleſie.

Here is the manually corrected text. Clearly a lot of post-recognition cleanup work is necessary, but OCR4all does a decent job of picking up abbreviations. To get a corrected text of the complete work, perhaps 100-200 hours of work or more would be required.

¶ Incipit liber de magnis tribulationibus in proximo futuris. Compilatus a docto et devoto presbytero et heremita Theolosphoro de Cusentia provincie Calabrie. Collectus vero ex vaticiniis novorum prophetarum. s. beati Cirilli: abbatis Joachim: Dandali: et Merlini: ac veterum sibillarum. Deinde abbreviatus per venerabilem fratrem Rustitianum: una cum tractatu magistri Joannis parisini ordinis predicatorum: de antichristo et fine mundi: et fratris Ubertini de septem statibus ecclesie.

Here are a few examples from Google. The first one uses a deskewed base image. The amount of work required to correct the text will clearly be higher.

CIncipit liber de magnis tribulationibus in proximo futuris. Lompi latus a docto z ceuoro prefbytero z heremita Theolofpbozo de Lufentia puincie Lalabzie. Lollectus vero ex vaticinijs nouozum pzo/ phetarum.f. beati Lirilli: abbatis Joachim: Danda i: Delini:ac veterum fibillarum. Deinde abbreuiatus pervenerabilem fratrem Ruftitianus:vna cu tractatu magiftri Joanis parifini ordinis pdicator de antichrifto z fine müdi:z fris Ubertini de fepté ftatib ecclefie.

Using one of OCR4all's preprocessed images shows what can go wrong with Google Docs:

Incipit liber de magnis omne malum fuper omnes habitatores terre. tribulationibus in proximo futuris. Lompt Itaq; fub tali impatore/tres falt pape creabus tur:vnus grecus:alius italus:z alius germa/ latus a docto z ceuoro presbytero z heremita Theololphoto de Lufentia puincie Lalabae.us/oium peffimus: erunt finguli adinuicez Lollectus vero ex varicinije nouocum pro phetarum,f. beati Lirilli: abbatis Joachim: Danda i: dei lini:ac veterum fibillarum. Deinde abbreuiatus pervenerabilem fratrem Ruffitianuz:vna cũ tractatu magistri Joānis parifini ozdinis pdicator:de antichrifto z fine mūdì:z frio Ubertini de fepté flatib ecclefie.

Pretty bad. While a cleaner image should produce better results, Google's OCR failed to recognize the page's two-column layout and instead smashed the two columns together randomly (underlined/grey text) - sometimes one line at a time, sometimes two lines, then not at all - producing unusable results. How often would this happen when scanning a 100-page book?


I realize this is all based on one page of one book, but I think this points to several ways that OCR4all is a better fit for all but the most straightforward projects.

  • More accurate recognition of abbreviations and other symbols typically found in early modern books
  • User control over page layout - you can exclude images, marginalia, etc., without editing the image
  • User control over segmentation - you can check where lines/columns have been divided before recognition, and make changes if necessary
  • Better interface for correcting recognized text - Google doesn't reveal low-probability characters, and you're stuck with whatever it offers
  • Access to better recognition models/user control of models - although OCR4all will need to keep improving its models over time

Thursday, April 27, 2023

Digital text: Methodius, Revelationes [German], 1497

Who needs a digital text when digital facsimiles are available? I do. When I'm hunting citations or comparing versions, I need to be able to annotate, rearrange, and copy and paste. In addition, it's much faster to read an early modern text in a modern font, especially if I'm scanning a text I've read to find something I remember seeing previously. It's what makes the effort required to use OCR4all worth it.

Here's one result, for example: a complete digital text of Pseudo-Methodius' Revelationes in German, published in 1497 (ISTC im00526000, GW M23065; facsimile from the Bayerische Staatsbibliothek).

I've prepared the text for my own purposes, so:

  1. The numbers count images starting from the title page, not anything useful like leaves.
  2. I've made minimal effort to resolve abbreviations or normalize spelling.
  3. I'm using modern equivalents for s and z.
  4. I read through the text once to straighten out the formatting and catch errors, but there are undoubtedly still errors in the text.

The German text was more useful to me than the Latin text, and it uses fewer abbreviations. That does mean that this text doesn't include Wolfgang Aytinger's commentary on Methodius, so I might have to deal with it separately. I'm currently digitalizing a Latin work for the first time, and I don't know how well OCR4all will handle abbreviations yet.

I have several more texts like this that I'll post eventually, and I'm working on more as time permits. One goal is to end up with a workable electronic text of Lichtenberger's Prognosticatio.

Hopefully someone will find these working notes useful. If you need to cite something, cite to Albrecht Kunne's edition or a modern critical edition of Methodius.

Thursday, April 6, 2023

OCR4all is good

 For a current project, it would be useful to have an electronic text of pseudo-Vincent Ferrer's De fine mundi (or Anton Sorg's 1486 edition of a German translation, to be precise).

Let me back up a bit. For every project that I've worked on since the beginning of grad school, I've wished for an electronic text of every primary and secondary source. They've rarely been available. When they've been essential, it usually required a lot of manual typing.

For modern works, ABBYY FineReader works extremely well, but it didn't produce usable results for 15th/16th-century books, even with considerable fine-tuning. I looked at other options a few years ago, but they looked more like solutions for specialized digitalization centers.

Until recently. After seeing a colleague's results, I gave OCR4all another try, and it's finally giving me the electronic texts I want for a reasonable amount of effort.

The installation instructions are mostly clear. There's a learning curve and I'm far from understanding all the options, but it largely produces good enough output as is. Hardware requirements are quite modest - I'm running it on my work computer, an Intel Core i5-4590 from 2016.

There are a few things I've done to speed up the process.

  1. Most of the texts I want to digitize are available as scanned images. Before I turn OCR4all loose, I do some minimal preparation work using IrfanView, a free, lightweight, all-purpose image viewer and editor: I rotate any pages that are noticeably off horizontal (OCR4all will do its own fine-tuning), and I crop out extraneous margins. I set up an AutoHotKey script so I can save the file with one key combination once I have the cropping rectangle selected.
  2. Once I start up OCR4all, I can mostly let it do its preprocessing, noise removal, page segmentation and line segmentation on its own. But it's a good idea to check a few pages to make sure noise removal hasn't gotten out of control, and then to check the line segmentation of each page to make sure OCR4all hasn't skipped over anything and has chosen a rational reading order.
  3. For text recognition, I select five historical Fraktur models. I wish it was clearer why I'm selecting five, and all Fraktur models rather than a mix or something else, but I'm getting good results.
  4. OCR4all has a good visual editor for comparing recognized text to the page image. To speed things up, I'll zoom in the text to around 160%, or whatever makes the lines about the same length as the page image. Then I check off "Show Prediction" and "Show Confidence" and raise the Word Confidence to .95. This highlights the words I need to double-check in orange. I'll make any edits to those words and assume the rest are good enough for now.
  5. After checking each page, I export a text file, then do some searching and replacing in Notepad++. Its pattern-matching abilities based on regular expressions are much better than what's available in Word. I replace tall ſ and ʒ with s and z, for example - the goal here isn't an edition, but an easily readable text. I also remove line breaks. (A rough command-line equivalent is sketched after this list.)
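
A rough command-line equivalent of the Notepad++ clean-up in step 5, with hypothetical file names - replace the long ſ and ʒ, then join the line breaks into spaces:

sed -e 's/ſ/s/g' -e 's/ʒ/z/g' definemundi.txt | tr '\n' ' ' > definemundi_clean.txt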

 And that is how I can go from a digital facsimile to a usable electronic text in less than an hour for a pamphlet, and a few hours for a longer work like De fine mundi.


How it started

How it's going

Jtem/ vil wilder thier/ id est (Fürsten) werden widerwertiglich mit einander reden. Jm. 1524. jar. Vnd ein grosser adler/ id est (Keiser) desselben haupt/ vnd werden gehen auff die weitte der er den / id est (Türcken vnd das heilig land) vnd welcher den namen des sterns / id est (Bapst)hat/ derselb wirt geladen/ aber er wirt nit kumen / doch so wirt er senden den/ der den namen des visch hat/ id est (delfin/ das ist der Künig von Franckreich) vnd wirt an sich nemen der adler ein grosse samlung/ vñ der Leopardus/ id est (der Römisch Künig) auß Campo albo / vnd durch die Ritterschafft eingehen vñ machen ein haupt in der Marchia/ id est (in der Mar chia vnd eins ist in Welschen landen) vnd wirt darnach gehen wider das erdtreich/ das da seyn grund hat von verreterey/ id est (wider Welsch land vnd der Römer) vnd welcher den namen des vischs hat / der wirt an sich nemen den weg Leopardus / vñ kumen wider jn vber die lender Campaniam zwischen der grund Parmam schlösser/ id est (stet in Welschen landen) vnd wirt sich halten widerwertig / vnd da wirt dann vil blüts vergiessung durch manschlacht bey dem fluß / welcher genennet ist der arbeyt / id est (wascher fiuß) darnach wirdt derselb fluß des blůts / der von der manschlacht einkumpt/ vnd der da den gantzẽ tag billt / id est (Künig von Vngern) der wirt sterben von eisen/ vnd wirt vber jn herschen der Leopardus / darnach wirt er sterben natürlichs tods/ vñ wirt darnach frid durch dz gantz erdtrich werdẽ/ Das werd war/ AMEN.

Friday, March 17, 2023

A Pseudo-Methodius sighting

The fifteenth-century manuscript Wissenschaftliche Stadtbibliothek Mainz, Hs I 109, from the Carthusians of Mainz, consists of four parts. A sixteenth-century addendum at the end of part 3 (fol. 109v) includes a list of around 20 books. The first two items from the list belong to the third part of the manuscript, but the rest do not.

The final line of the list begins: Item Reuelacioes methodij

The library's manuscript catalog doesn't mention Methodius, but does describe the rest of the works as "primarily theological, grammatical and medical writings, including Sebastian Brant's Narrenschiff, Jakob Wimpfeling's Elegantiarum medulla and Isidoneus Germanicus, Johannes Marius Philelphus' Novum epistolarium." (I also see Johannes de Sacrobosco's De sphaera mundi in the book list.)

It's always interesting to get another data point on who owned prophecies, and in what context.