The last time I looked at OCR for early printing, the state of the art was eScriptorium and the kraken ecosystem, which make use of relatively small models for recognizing text and page layout. Since then, AI has taken hold of public consciousness, and numerous companies and academic research projects have released open-weight models that can be run on suitable hardware, including some very large and capable models with tens or even hundreds of billions of parameters. Will any of this immense effort be useful for early printing? Almost certainly.
For transcribing early print, there are two main kinds of knowledge we’re interested in: visual information (what is a page? what is a word? what is an irrelevant paw print from a cat walking across the page?) and linguistic information (what patterns are common in early modern German, and what sequences are impossible?). Some recent AI models – vision language models (VLMs) or multi-modal models – attempt precisely this combination of knowledge about language (and the rest of the world) and about how things look. These models, developed by some of the largest tech and AI companies, are often used to describe the content of an image. But OCR is merely one particular way of describing an image. Are any of these VLMs useful for OCR of early printed books?
Yes. You can freely download and use relatively large models and achieve fair to very good results, even on modest consumer hardware (in my case, an Nvidia RTX 3060 GPU) and after zero to minimal image preparation, producing an electronic text suitable for further editing (which I’m defining as a character error rate [CER] of 5-10%). This would help reduce or eliminate the lengthy and tedious image setup and page layout recognition that eScriptorium and similar programs require. Details and technical notes below.
I’m running Windows, so for practical purposes, the available inference engines are Ollama and llama.cpp (Ollama was originally based on llama.cpp, and many other projects use one or the other as their backend). Of the two, Ollama is simpler and easier to use, but offers fewer options for customization and can be somewhat slower (keeping in mind that speed is primarily determined by hardware limitations and model size). There are additional options for Mac and Linux users.
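To give a sense of how simple the mechanics are, here is a minimal sketch of a single transcription request through the Ollama Python client. The model name, image path, and prompt are illustrative, not the exact ones used in the tests below.

# Minimal sketch: send one page image and a transcription prompt to a local
# VLM via the Ollama Python client. Model name, path, and prompt are
# illustrative; any vision-capable model pulled into Ollama should work.
import ollama

response = ollama.chat(
    model="gemma3:27b",  # assumes this vision model has already been pulled
    messages=[
        {
            "role": "user",
            "content": ("Transcribe the text on this page of an early printed book. "
                        "Output only the transcription."),
            "images": ["page_images/page_001.png"],
        }
    ],
)
print(response["message"]["content"])

llama.cpp offers the same capability from the command line through llama-mtmd-cli; a full example appears further below.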
The models that proved useful were all on the heavier side (billions of parameters in parentheses): Mistral Small 3.1 (24B), Gemma 3 (27B), and Qwen 2.5-VL (32B). Most other VLMs, and the smaller versions of these three, proved unsuitable, either because they have too little language knowledge or because they are focused on modern documents. The one exception was Pixtral (12B), whose results approached those of Gemma 3:27b and which was much faster thanks to its smaller size, allowing it to run entirely in VRAM on my hardware. There are of course much larger models, but they require computing power for inference well beyond what most consumers have access to (though within the budget of a major digitization project).
For testing, I used 11 samples drawn from my typical research images, including books printed in the 15th, 16th and 17th centuries; texts in Latin, German, and mixed Latin/German; and pages with woodcut illustrations. I straightened and cropped most images, but included one pair of original/modified images to compare the results. I also prepared my own transcriptions (not normalized, but with expanded abbreviations) for computing character error rates. The usual prompt was a variation on: “This image is from a book printed in 1532. The text is in German. Transcribe the text. Expand any abbreviations. Output only the transcription.”
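For reference, character error rate here is simply the edit (Levenshtein) distance between the model’s output and the ground-truth transcription, divided by the length of the ground truth. A minimal sketch in plain Python (in practice a library such as jiwer does the same job):

# Character error rate: Levenshtein distance between hypothesis and reference,
# divided by the length of the reference transcription.
def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance (insertions, deletions, substitutions).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    return levenshtein(hypothesis, reference) / len(reference)

# Example: a 400-character page with 37 character errors gives a CER of 9.25%.
# print(f"{cer(model_output, ground_truth):.1%}")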
The results: Mistral-Small 3.1 was the best overall (average CER 9.3%), followed by Qwen 2.5-VL:32b (12.2%; inference only possible under Ollama) and Gemma 3:27b (13.7% under llama.cpp, but with degraded accuracy under Ollama).
All three models were very capable of finding the relevant text. There was no mishandling of illustrations or other distractions, and no skipping of text blocks. No definition of page layout or identification of text lines was necessary.
The models differed most on the Latin texts; Mistral appears to have had more Latin in its training data. The ability to expand abbreviations was universally poor, however, especially for the numerous Latin abbreviations.
An open question is whether there is some way to address this ignorance of abbreviations through prompting or other means; providing extensive guidance in the prompt did not consistently improve results. The best approach to normalizing the text (modern usage of u/v, i/j, and umlaut vowels, for example) is also yet to be determined, although both prompting and using a grammar to constrain inference seem promising.
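One concrete way to constrain inference is llama.cpp’s GBNF grammar format, which restricts what the model is allowed to emit; this is what the --grammar-file option in the sample command line further below does. As a purely illustrative sketch (not the 16thc.gbnf actually used in these tests), a grammar might limit output to a character set expected in normalized early modern German, written out here from Python:

# Purely illustrative: a minimal GBNF grammar (llama.cpp's grammar format)
# limiting output to a plausible character set for normalized early modern
# German. This is NOT the 16thc.gbnf used in the tests; the character set
# would need tuning for real pages.
GRAMMAR = r'''
root ::= line ("\n" line)*
line ::= char*
char ::= [a-zA-ZäöüÄÖÜß0-9 .,:;()/-]
'''.strip() + "\n"

with open("grammars/16thc-example.gbnf", "w", encoding="utf-8") as handle:
    handle.write(GRAMMAR)

The resulting file can then be passed to llama.cpp with --grammar-file, as in the sample command line below.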
For dealing with poor image quality, size matters: the best results came from Qwen 2.5-VL, the largest model tested at 32 billion parameters. With a CER of nearly 20%, however, the results from poor-quality images are not accurate enough to be highly useful.
All three models dealt well with slightly skewed and uncropped images (such as those typically provided by major digitization projects), with only small losses in accuracy. But larger images are slower to process, and the VLM's attention is not as tightly focused on the relevant areas of the image, so some preparatory processing remains helpful for peak accuracy.
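For what that light preparation can look like, here is a minimal sketch using Pillow; the rotation angle and crop box are placeholder values, not taken from the test images, and would come from inspection or an automatic deskew step.

# Light preparation before handing a page to a VLM: straighten and crop.
# The angle and crop box are placeholders; they could come from manual
# inspection or an automatic deskew/segmentation step.
from PIL import Image

page = Image.open("raw_scans/page_042.png")

# Rotate a few degrees to straighten the page (positive = counter-clockwise),
# filling the exposed corners with white rather than black.
page = page.rotate(1.5, expand=True, fillcolor="white")

# Crop to the printed area: (left, upper, right, lower) in pixels.
page = page.crop((120, 80, 1450, 2150))

page.save("prepared/page_042.png")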
It’s necessary to become familiar with each model, including finding the best parameters for OCR. After extensive testing of parameters, Mistral-Small 3.1 gained an additional percentage point of accuracy (the CER dropped from 9.3% to 8.3%). Sample command line:
llama-mtmd-cli.exe -m models\mistral-small3.1\Mistral-Small-3.1-24B-Instruct-2503-Q4_K_M.gguf --mmproj models\mistral-small3.1\mmproj-F32.gguf --image ..\Testing\image5.png -ngl 31 -lv 0 -p "This image is from a book printed in 1532. The text is in German. Transcribe the text. Output only the transcription." -fa --temp .61 --top-k 5 --min-p 0.2 --grammar-file grammars\16thc.gbnf
There are also prompting pitfalls. Gemma 3, for example, often tried to append extra analysis and had to be explicitly told not to. It's not clear that the effort of getting to know a particular model is worth it when new models and versions are released so frequently.
Compared to most files, these AI models are gigantic. In their native form, they range from 48 to 75 GB. My GPU only has 12 GB of VRAM, however. How is it possible to make use of models of this size?
The answer is quantization. Quantized models trade some precision for smaller size and faster inference. The typical quant (Q4_K_M, used in these tests) is a quarter of the original size (14-19 GB) and runs acceptably when split between VRAM and system RAM, with minimal loss of accuracy. Quantization is very much an area of active research, with a recent explosion of quantization types and processes. Testing with larger quants (Q6_K) and more advanced techniques (Unsloth’s UD-Q5_K_XL) produced no noticeable improvement in results, however. With Gemma 3, slightly better results came from the “quantization aware” model Gemma 3:27b-it-qat. Whether any particular approach to quantization is more effective for OCR remains an open question.
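The size arithmetic behind those figures is straightforward: file size is roughly parameter count times bits per weight. A quick sketch (the ~4.8 bits per weight for Q4_K_M is an approximation, since the K-quants mix block types, and the separate vision projector adds a bit more):

# Rough model-size arithmetic: parameters × bits per weight, in decimal GB.
# Q4_K_M averages roughly 4.5-5 bits per weight, so the quantized figures
# are approximations rather than exact file sizes.
def size_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * bits_per_weight / 8  # billions of params × bits / 8 bits per byte

for name, params in [("Mistral Small 3.1", 24), ("Gemma 3", 27), ("Qwen 2.5-VL", 32)]:
    print(f"{name}: {size_gb(params, 16):.0f} GB at 16-bit, "
          f"~{size_gb(params, 4.8):.0f} GB at Q4_K_M")
# Mistral Small 3.1: 48 GB at 16-bit, ~14 GB at Q4_K_M
# Gemma 3: 54 GB at 16-bit, ~16 GB at Q4_K_M
# Qwen 2.5-VL: 64 GB at 16-bit, ~19 GB at Q4_K_M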
In general, the field is changing extremely rapidly. Mistral-Small 3.1 was released only in March, and the initial quants for llama.cpp produced poor OCR results until one provider fixed its quant in May. Inference with Mistral-Small 3.1 in Ollama (which uses its own quants) was accurate but very slow until the program was updated last week.
What this means is that it is already possible to do a first pass over images to generate OCR text for further editing or for use in eScriptorium, with results more accurate than those from kraken-based models. It’s possible to imagine a near- to medium-term future in which an AI model browses through directories of digital images, leaving behind a tidy text file with a transcription and a translation.
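A bare-bones version of that directory-walking workflow is already possible with the pieces described above. A sketch, again using the Ollama Python client, with an illustrative model name and prompt and no translation step:

# Bare-bones batch workflow: walk a directory of page images and leave a .txt
# transcription beside each one. Model name, prompt, and file extension are
# illustrative; there is no error handling or translation step.
from pathlib import Path
import ollama

PROMPT = ("This image is from an early printed book. The text is in Latin or German. "
          "Transcribe the text. Output only the transcription.")

for image in sorted(Path("digital_images").rglob("*.png")):
    out = image.with_suffix(".txt")
    if out.exists():  # skip pages that already have a transcription
        continue
    response = ollama.chat(
        model="mistral-small3.1",
        messages=[{"role": "user", "content": PROMPT, "images": [str(image)]}],
    )
    out.write_text(response["message"]["content"], encoding="utf-8")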
Much of the knowledge in these models is not useful for our purposes, so a smaller model, with less knowledge of cats and atomic structure and more knowledge of Latin abbreviations, might be both faster and significantly more accurate. Such a model would not need to be created from scratch; it could be a fine-tune of an existing VLM, and it might come close to achieving highly accurate unsupervised OCR of early printed books on the first try.
But training these larger and more complicated models is more difficult and more resource-intensive, so the miracle of eScriptorium/kraken, where human corrections lead directly to the retraining of a new model, is harder to replicate.
Finally, there is no economic case for this. Mistral.ai offers some of the least expensive API access of any AI company. Even for a collection of digital images built up over more than 20 years, the cost of my used, two-generations-old GPU would more than cover API access to Mistral’s larger models or to its specialized OCR model. Using a local model would instead have to be justified by network constraints, research focus, or privacy concerns.