The last time I looked at OCR for early printing, the state of the art was eScriptorium and the kraken ecosystem, which make use of relatively small models for recognizing text and page layout. Since then, AI has taken hold of public consciousness, and numerous companies and academic research projects have released open-weight models that can be run on suitable hardware, including some very large and capable models with tens or even hundreds of billions of parameters. Will any of this immense effort be useful for early printing? Almost certainly.
For transcribing early print, there are two main kinds of knowledge we’re interested in: visual information (what is a page? what is a word? what is an irrelevant paw print from a cat walking across the page?) and linguistic information (what patterns are common in early modern German, and what sequences are impossible?). Some recent AI models – vision language models (VLMs) or multi-modal models – attempt precisely this combination of knowledge about language (and the rest of the world) and about how things look. These models, developed by some of the largest tech and AI companies, are often used to describe the content of an image. But OCR is merely one particular way of describing an image. Are any of these VLMs useful for OCR of early printed books?
Yes. You can freely download and use relatively large models and achieve fair to very good results, even on modest consumer hardware (in my case, an Nvidia RTX 3060 GPU) and after zero to minimal image preparation, producing an electronic text suitable for further editing (which I’m defining as a character error rate [CER] of 5-10%). This would help reduce or eliminate the lengthy and tedious image setup and page layout recognition that eScriptorium and similar programs require. Details and technical notes below.
I’m running Windows, so for practical purposes, the available inference engines are Ollama and llama.cpp (Ollama was originally based on llama.cpp, and many other projects use one or the other as their backend). Of the two, Ollama is simpler and easier to use, but offers fewer options for customization and can be somewhat slower (keeping in mind that speed is primarily determined by hardware limitations and model size). There are additional options for Mac and Linux users.
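To give a sense of how simple the mechanics are, here is a minimal sketch of a single transcription request through the Ollama Python client. The model name, image path, and prompt are illustrative, not the exact ones used in the tests below.

# Minimal sketch: send one page image and a transcription prompt to a local
# VLM via the Ollama Python client. Model name, path, and prompt are
# illustrative; any vision-capable model pulled into Ollama should work.
import ollama

response = ollama.chat(
    model="gemma3:27b",  # assumes this vision model has already been pulled
    messages=[
        {
            "role": "user",
            "content": ("Transcribe the text on this page of an early printed book. "
                        "Output only the transcription."),
            "images": ["page_images/page_001.png"],
        }
    ],
)
print(response["message"]["content"])

llama.cpp offers the same capability from the command line through llama-mtmd-cli; a full example appears further below.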
The models that proved useful were all on the heavier side (billions of parameters in parentheses): Mistral Small 3.1 (24B), Gemma 3 (27B), and Qwen 2.5-VL (32B). Most other VLMs, and the smaller versions of these three, proved unsuitable, either because they have too little language knowledge or because they are focused on modern documents. The one exception was Pixtral (12B), whose results approached those of Gemma 3:27b and which was much faster thanks to its smaller size, allowing it to run entirely in VRAM on my hardware. There are of course much larger models, but they require computing power for inference well beyond what most consumers have access to (though within the budget of a major digitization project).
For testing, I used 11 samples drawn from my typical research images, including books printed in the 15th, 16th and 17th centuries; texts in Latin, German, and mixed Latin/German; and pages with woodcut illustrations. I straightened and cropped most images, but included one pair of original/modified images to compare the results. I also prepared my own transcriptions (not normalized, but with expanded abbreviations) for computing character error rates. The usual prompt was a variation on: “This image is from a book printed in 1532. The text is in German. Transcribe the text. Expand any abbreviations. Output only the transcription.”
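For reference, character error rate here is simply the edit (Levenshtein) distance between the model’s output and the ground-truth transcription, divided by the length of the ground truth. A minimal sketch in plain Python (in practice a library such as jiwer does the same job):

# Character error rate: Levenshtein distance between hypothesis and reference,
# divided by the length of the reference transcription.
def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance (insertions, deletions, substitutions).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    return levenshtein(hypothesis, reference) / len(reference)

# Example: a 400-character page with 37 character errors gives a CER of 9.25%.
# print(f"{cer(model_output, ground_truth):.1%}")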
The results: Mistral-Small 3.1 was the best overall (average CER 9.3%), followed by Qwen 2.5-VL:32b (12.2%; inference only possible under Ollama) and Gemma 3:27b (13.7% under llama.cpp, but with degraded accuracy under Ollama).
All three models were very capable of finding the relevant text. There was no mishandling of illustrations or other distractions, and no skipping of text blocks. No definition of page layout or identification of text lines was necessary.
The models differed most on the Latin texts; Mistral appears to have had more Latin in its training data. The ability to expand abbreviations was universally poor, however, especially for the numerous Latin abbreviations.
An open question is whether there is some way to address this ignorance of abbreviations through prompting or other means; providing extensive guidance in the prompt did not consistently improve results. The best approach to normalizing the text (modern usage of u/v, i/j, and umlaut vowels, for example) is also yet to be determined, although both prompting and using a grammar to constrain inference seem promising.
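One concrete way to constrain inference is llama.cpp’s GBNF grammar format, which restricts what the model is allowed to emit; this is what the --grammar-file option in the sample command line further below does. As a purely illustrative sketch (not the 16thc.gbnf actually used in these tests), a grammar might limit output to a character set expected in normalized early modern German, written out here from Python:

# Purely illustrative: a minimal GBNF grammar (llama.cpp's grammar format)
# limiting output to a plausible character set for normalized early modern
# German. This is NOT the 16thc.gbnf used in the tests; the character set
# would need tuning for real pages.
GRAMMAR = r'''
root ::= line ("\n" line)*
line ::= char*
char ::= [a-zA-ZäöüÄÖÜß0-9 .,:;()/-]
'''.strip() + "\n"

with open("grammars/16thc-example.gbnf", "w", encoding="utf-8") as handle:
    handle.write(GRAMMAR)

The resulting file can then be passed to llama.cpp with --grammar-file, as in the sample command line below.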
For dealing with poor image quality, size matters: the best results came from Qwen 2.5-VL, the largest model tested at 32 billion parameters. With a CER of nearly 20%, however, the results from poor-quality images are not accurate enough to be highly useful.
All three models dealt well with slightly skewed and uncropped images (such as those typically provided by major digitization projects), with only small losses in accuracy. But larger images are slower to process, and the VLM's attention is not as tightly focused on the relevant areas of the image, so some preparatory processing remains helpful for peak accuracy.
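For what that light preparation can look like, here is a minimal sketch using Pillow; the rotation angle and crop box are placeholder values, not taken from the test images, and would come from inspection or an automatic deskew step.

# Light preparation before handing a page to a VLM: straighten and crop.
# The angle and crop box are placeholders; they could come from manual
# inspection or an automatic deskew/segmentation step.
from PIL import Image

page = Image.open("raw_scans/page_042.png")

# Rotate a few degrees to straighten the page (positive = counter-clockwise),
# filling the exposed corners with white rather than black.
page = page.rotate(1.5, expand=True, fillcolor="white")

# Crop to the printed area: (left, upper, right, lower) in pixels.
page = page.crop((120, 80, 1450, 2150))

page.save("prepared/page_042.png")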
It’s necessary to become familiar with each model, including finding the best parameters for OCR. After extensive testing of parameters, Mistral-Small 3.1 gained an additional percentage point of accuracy (the CER dropped from 9.3% to 8.3%). Sample command line:
llama-mtmd-cli.exe -m models\mistral-small3.1\Mistral-Small-3.1-24B-Instruct-2503-Q4_K_M.gguf --mmproj models\mistral-small3.1\mmproj-F32.gguf --image ..\Testing\image5.png -ngl 31 -lv 0 -p "This image is from a book printed in 1532. The text is in German. Transcribe the text. Output only the transcription." -fa --temp .61 --top-k 5 --min-p 0.2 --grammar-file grammars\16thc.gbnf
There are also prompting pitfalls. Gemma 3, for example, often tried to append extra analysis and had to be explicitly told not to. It's not clear that the effort of getting to know a particular model is worth it when new models and versions are released so frequently.
Compared to most files, these AI models are gigantic. In their native form, they range from 48 to 75 GB. My GPU only has 12 GB of VRAM, however. How is it possible to make use of models of this size?
The answer is quantization. Quantized models trade some precision for smaller size and faster inference. The typical quant (Q4_K_M, used in these tests) is a quarter of the original size (14-19 GB) and runs acceptably when split between VRAM and system RAM, with minimal loss of accuracy. Quantization is very much an area of active research, with a recent explosion of quantization types and processes. Testing with larger quants (Q6_K) and more advanced techniques (Unsloth’s UD-Q5_K_XL) produced no noticeable improvement in results, however. With Gemma 3, slightly better results came from the “quantization aware” model Gemma 3:27b-it-qat. Whether any particular approach to quantization is more effective for OCR remains an open question.
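The size arithmetic behind those figures is straightforward: file size is roughly parameter count times bits per weight. A quick sketch (the ~4.8 bits per weight for Q4_K_M is an approximation, since the K-quants mix block types, and the separate vision projector adds a bit more):

# Rough model-size arithmetic: parameters × bits per weight, in decimal GB.
# Q4_K_M averages roughly 4.5-5 bits per weight, so the quantized figures
# are approximations rather than exact file sizes.
def size_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * bits_per_weight / 8  # billions of params × bits / 8 bits per byte

for name, params in [("Mistral Small 3.1", 24), ("Gemma 3", 27), ("Qwen 2.5-VL", 32)]:
    print(f"{name}: {size_gb(params, 16):.0f} GB at 16-bit, "
          f"~{size_gb(params, 4.8):.0f} GB at Q4_K_M")
# Mistral Small 3.1: 48 GB at 16-bit, ~14 GB at Q4_K_M
# Gemma 3: 54 GB at 16-bit, ~16 GB at Q4_K_M
# Qwen 2.5-VL: 64 GB at 16-bit, ~19 GB at Q4_K_M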
In general, the field is changing extremely rapidly. Mistral-Small 3.1 was released only in March, and the initial quants for llama.cpp produced poor OCR results until one provider fixed its quant in May. Inference with Mistral-Small 3.1 in Ollama (which uses its own quants) was accurate but very slow until the program was updated last week.
What this means is that it is already possible to do a first pass over images to generate OCR text for further editing or for use in eScriptorium, with results more accurate than those from kraken-based models. It’s possible to imagine a near- to medium-term future in which an AI model browses through directories of digital images, leaving behind a tidy text file with a transcription and a translation.
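A bare-bones version of that directory-walking workflow is already possible with the pieces described above. A sketch, again using the Ollama Python client, with an illustrative model name and prompt and no translation step:

# Bare-bones batch workflow: walk a directory of page images and leave a .txt
# transcription beside each one. Model name, prompt, and file extension are
# illustrative; there is no error handling or translation step.
from pathlib import Path
import ollama

PROMPT = ("This image is from an early printed book. The text is in Latin or German. "
          "Transcribe the text. Output only the transcription.")

for image in sorted(Path("digital_images").rglob("*.png")):
    out = image.with_suffix(".txt")
    if out.exists():  # skip pages that already have a transcription
        continue
    response = ollama.chat(
        model="mistral-small3.1",
        messages=[{"role": "user", "content": PROMPT, "images": [str(image)]}],
    )
    out.write_text(response["message"]["content"], encoding="utf-8")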
Much of the knowledge in these models is not useful for our purposes, so a smaller model, with less knowledge of cats and atomic structure and more knowledge of Latin abbreviations, might be both faster and significantly more accurate. Such a model would not need to be created from scratch; it could be a fine-tune of an existing VLM, and it might come close to achieving highly accurate unsupervised OCR of early printed books on the first try.
But training these larger and more complicated models is more difficult and more resource-intensive, so the miracle of eScriptorium/kraken, where human corrections lead directly to the retraining of a new model, is harder to replicate.
Finally, there is no economic case for this. Mistral.ai offers some of the least expensive API access of any AI company. Even for a collection of digital images built up over more than 20 years, the cost of my used, two-generations-old GPU would more than cover API access to Mistral’s larger models or to its specialized OCR model. Using a local model would instead have to be justified by network constraints, research focus, or privacy concerns.