Wednesday, August 16, 2023

eScriptorium: Let's try breaking some rules

I am not a scriptorium. I'm just one guy with a professional interest in late medieval and early modern prophetic texts who needs to turn digital facsimiles into electronic texts as quickly and painlessly as possible so I can read, skim, annotate, compare and search them.

eScriptorium uses kraken as its recognition engine, and in fact it's entirely possible to install kraken as a standalone program (using the Windows Subsystem for Linux) and run it from the command line. Kraken in turn makes it possible to train or refine a recognition model using the command-line program ketos.
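(If you want to replicate that setup: kraken is an ordinary Python package, so under a WSL distribution it installs with pip into a virtual environment, roughly like this - after which both the kraken and ketos commands are available from the shell.)

python3 -m venv kraken-env
source kraken-env/bin/activate
pip install kraken
kraken --help
ketos --help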

According to the kraken documentation, "Transcription has to be diplomatic, i.e. contain the exact character sequence in the line image, including original orthography." What I need, though, is readable text. Especially if I'm searching a longer text, I need to be able to find all occurrences of a word without worrying about abbreviations or spelling conventions.

This sets up the following workflow:

  • Pre-edit images as necessary (much reduced with eScriptorium, which handled skewed images well)
  • Import images into the OCR environment
  • Automatically recognize baselines/regions
  • Manually correct baselines/regions
  • Automatically recognize a bunch of pages
  • Manually correct the text as diplomatic transcription (Larissa Will added some useful pointers in a comment on my last post)
  • Export XML files and JPG images
  • Train a work-specific model with ketos using these files
  • Recognize the rest of the text in eScriptorium using this model
  • Correct the diplomatic transcription by resolving abbreviations and lightly normalizing the text

It's a lot of work. OCR4all and its models are set up to handle just about any abbreviation you throw at them, but the rest of the world isn't. Diplomatic transcription requires Unicode extensions that Notepad++ and Word don't handle well in every case (or even in many cases). And resolving the abbreviations and normalizing the text afterward takes an enormous amount of time, even with careful use of search and replace. I did it with one longer work, and it's exhausting.

But what happens if we break the rules? I don't want to deal with diplomatic transcriptions. You're not the boss of me and you can't make me. What if we don't give kraken a diplomatic transcription? What if we feed it transcribed pages of resolved abbreviations or even normalized text? Will it harm accuracy, and will the results be usable? Can we get kraken to take over the work of normalizing the text for us?

Yes we can.

I started with UB Mannheim's recognition model based on 16th-century German Gothic prints (luther_best.mlmodel), which gave results very similar to, though perhaps slightly better than, its model based on German and Latin prints of various periods (german_print_best.mlmodel). The OCR target is again a 1498 edition of Wolfgang Aytinger's commentary on pseudo-Methodius. I experimented a bit along the way, but I had 24 corrected pages (exported in PAGE XML format) by the time I conducted the final comparison.

So in my eScriptorium directory, I currently have a subdirectory, "train," containing 24 .jpg files and their matching .xml files, named 0058.jpg/0058.xml through 0080.jpg/0080.xml. Between them, there are around 650 corrected lines of text. In my WSL-based Debian distribution, I navigate to the eScriptorium directory and issue the following command:

ketos -v train -i luther_best.mlmodel -f page train/*.xml

To explain each part:

-v    Verbose output, so I can see more of what's going on

train    The command telling ketos what to do

-i    Instead of trying to train a new model from scratch, I want ketos to retrain an existing model, luther_best.mlmodel, found in the directory where I issue the command

-f    The XML files are in PAGE format. 

train/*.xml    The path to the files to use for training 
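When training finishes, ketos leaves a series of checkpoint models in the directory and reports which one performed best on the validation data. That model can be uploaded into eScriptorium, or tried directly from the command line. Assuming the best checkpoint ended up as model_best.mlmodel (the default prefix) and 0081.jpg is a hypothetical next, uncorrected page, the recognition command looks something like this:

kraken -i 0081.jpg 0081.txt segment -bl ocr -m model_best.mlmodel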

Retraining a model with just 5 corrected pages yielded much improved recognition results. On my underpowered computer, each training iteration took around 5 minutes and the whole process took a few hours. With more data, training takes longer - with 24 pages, each iteration took 20+ minutes and I had to let the process run overnight.

Was it worth it?

Here are the first 10 lines of the page from the HAB that I used for testing.


And here are the first 10 lines using the base Luther model:


I mean, kind of? I could use it, but it would take a lot of work to fix. The base German_Print model is about the same:

A little better, a little worse. Still a lot of work.

After training on work-specific pages, accuracy can be expected to be much better. But how will kraken do with abbreviations after training on an undiplomatic transcription? As it turns out, brilliantly.

In this image, correct resolutions of abbreviations are marked in green and incorrect ones or other mistakes in red. eScriptorium got it right 47/57 times (82%) after training on just the abbreviations that occur in 24 pages. And the best part is that I only have to deal with plain old ASCII-compliant characters that will copy and paste anywhere without complaint.

I can almost read this as is. I can work with this.

But what about normalizing i/j and u/v and similar? I re-corrected the 24 pages by normalizing the text, re-exported the XML, and re-trained. Here are the results:

In this image, green and red are used as above to mark correct/incorrect expansions, while blue/purple are added to mark correct and incorrect normalization.

And the results are fantastic. Training on normalized text with resolved abbreviations didn't cause any loss of recognition accuracy. Kraken normalized the text correctly 10/11 times, including getting the internal "v" correct in conversus and numeravit.

Even some of the mistaken abbreviation resolutions are better than nothing. I just need to delete an "n" from "quenm," and the "m" is already there. The less I have to type, the faster the last step will go. This is going to save me a lot of work.
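Part of that cleanup can even be scripted. Just as a sketch - the replacement table below is a handful of illustrative forms, not anything I'd apply blindly - a pass like this over the recognized text takes care of recurring cases:

import re

# Illustrative only: a few normalizations/corrections of the kind discussed above.
# Build the table from forms that actually occur in your own text.
replacements = {
    "conuersus": "conversus",   # internal u -> v
    "numerauit": "numeravit",
    "vt": "ut",                 # initial v -> u
    "quenm": "quem",            # stray "n" from a misresolved abbreviation
}

def normalize(text):
    for old, new in replacements.items():
        # word boundaries keep replacements from firing inside longer words
        text = re.sub(rf"\b{re.escape(old)}\b", new, text)
    return text

for sample in ["conuersus", "numerauit", "vt", "quenm"]:
    print(sample, "->", normalize(sample))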

Finally, someone might ask if I couldn't just train a model from scratch using pages from Aytinger and have a completely tailor-made recognition model. Wouldn't that be even more accurate?

No. At this point, the results are laughable - quite literally. Trained on 650 lines of corrected text, the resulting recognition model had a 4% accuracy rate and yielded the following transcription:

tee-
teei
teet
teei
tee-
tee-
tee-
teei
ees
teet

I don't know how much text is required to train a model from scratch, but "around 650 lines" is not the answer.
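For the record, training "from scratch" here just means leaving out the -i switch, so that ketos has to build a new model instead of refining an existing one:

ketos -v train -f page train/*.xml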

Wednesday, August 2, 2023

eScriptorium is bad. And brilliant.

OCR4all isn't the only option for turning early printed books into electronic texts. According to Stefan Weil (via Klaus Graf), eScriptorium was able to produce usable text of pseudo-Vincent Ferrer's De fine mundi in German translation with moderate effort based on segmentation and text recognition models from the UB Mannheim. With 15 corrected pages, it was then possible to train a work-specific OCR model for the rest. (His results are here.)

First the good news. The documentation states that eScriptorium runs under either Linux or macOS, but it installs and runs without major incident as a Docker app under Windows. The recognition models from the UB Mannheim are very good for early German printed books. Using a best-case scan of Joseph Grünpeck's 1508 Speculum in German from the Österreichische Nationalbibliothek, it produced impressively accurate results. Results were still very good for a less-than-perfect scenario (Wolfgang Aytinger's Latin commentary on pseudo-Methodius from the HAB). The underlying Kraken OCR engine has finally made it possible for me to train and use my own OCR models. I can't imagine using anything but eScriptorium after this. It's that good.

But eScriptorium is also bad.

The installation documentation isn't great, and the instructions for use are confusing. A quickstart guide would be helpful. I've run other OCR programs before, but it still took a long time for eScriptorium to start making sense.
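For what it's worth, the Docker route itself is short: clone the repository, copy and adjust the example environment file the installation guide points to, and bring the stack up - something like this, with the details and the port varying by version:

git clone https://gitlab.com/scripta/escriptorium.git
cd escriptorium
docker-compose up -d --build

Everything around those steps is where the confusion starts.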

Importing images into a new project went smoothly, and eScriptorium seems to handle PDFs just fine. So there's less need to pre-process images, and conversion to grayscale is unnecessary.

Segmentation, using either the default or the UB Mannheim segmentation model, worked quite well. But one of eScriptorium's major flaws is that the segmentation editing tools are very limited compared to OCR4all's, and they behave erratically. There's no easy way to select an area and delete all its text lines at once, which is a real problem with woodcut illustrations, which are often recognized as a dozen or more short text lines at odd angles. Instead of selecting all lines or the whole region, I have to select each line and delete it one by one. Attempting to edit region outlines or lines quickly leads to frustration. The order of text lines can be displayed, but there's no way to change it if it's incorrect (which I've seen more than once so far). There's no way to "grab" the page image and move it around, so it can be difficult to move it back to the center of the screen at a reasonable size. The equivalent tools in OCR4all are much more comprehensive and easier to use.

When segmenting or recognizing multiple pages, processes would stop without warning or explanation. The only reason I knew they had stopped was because CPU usage in Docker dropped back into low single digits. The failed processes weren't sequential, but instead affected pages seemingly at random, which meant I had to go back through the images one by one to see where segmentation or recognition had failed.

Some tasks proved impossible to cancel at all, even after stopping Docker or restarting Windows. My first attempts at training a model in eScriptorium were unsuccessful, but even now one of them is still listed as ongoing, and clicking the "Cancel training" button generates the less-than-helpful error "SyntaxError: JSON.parse: unexpected character at line 1 column 1 of the JSON data." (For the curious: the raw data is "status," and nothing else.)

The transcription window is nicely responsive and easy to use. I especially like how the text resizes so the line width matches the image above it, making it easy to edit or transcribe. I prefer it to the OCR4all transcription screen. The eScriptorium window only shows one line at a time, however, and often it's necessary to see the next or preceding lines to choose the best transcription. It also seems to lack OCR4all's ability to show low-probability characters.

With good recognition models, recognition results are fantastic. A pattern I've noticed with perhaps half the pages I've processed so far, however, is that the last line on the page will have abominable recognition quality, as if eScriptorium is somehow reverting to a rudimentary model for the final line of the page.

You can export recognized, transcribed and/or corrected pages in ALTO or PAGE XML formats or plain text, with or without the images - maybe. Once I had downloaded one XML format without images, eScriptorium kept generating the same ZIP file no matter what other options I chose until I restarted the whole stack in Docker. Plain text is exported only as a single block of text rather than as separate text files. If you need to generate ground truth text files, you'll have to copy and paste by hand, without even a page separator to guide you.
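In the meantime, the XML export can stand in for the missing text export: a few lines of Python will pull the line transcriptions out of PAGE files and write one text file per page. A rough sketch, assuming the exported XML sits in an export/ directory and the line text lives in TextLine/TextEquiv/Unicode, which is where the exports I've looked at put it (the {*} namespace wildcard needs Python 3.8 or later):

import glob
import xml.etree.ElementTree as ET

def page_lines(xml_path):
    # yield the line-level transcription of each TextLine in a PAGE XML file
    root = ET.parse(xml_path).getroot()
    for elem in root.iter():
        if elem.tag.endswith("}TextLine"):
            uni = elem.find("{*}TextEquiv/{*}Unicode")
            if uni is not None and uni.text:
                yield uni.text

for xml_file in sorted(glob.glob("export/*.xml")):
    with open(xml_file[:-4] + ".txt", "w", encoding="utf-8") as out:
        out.write("\n".join(page_lines(xml_file)) + "\n")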

Effective OCR relies on high-quality models, but the central Zenodo repository is nearly empty and doesn't offer anything suitable for the early modern printed books I work with. I knew that better models existed somewhere, and I even knew some of the file names, but even with that information it took me two days to find them. (I eventually found them here.)
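(Kraken can at least query that central repository from the command line, which makes it quick to confirm how little is in it:

kraken list
kraken show <DOI from the list>
kraken get <DOI from the list>

None of that helps when the models you actually need live somewhere else entirely.)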

One of the nice features of eScriptorium is the underlying Kraken engine. You can install it under the Windows Subsystem for Linux (I've got it running under Debian) and work with it directly from the command line, no web interface needed. The bad news, again, is the confusing and sometimes contradictory documentation, with the same command-line switches meaning different things depending on the particular command given to Kraken.
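The -i switch is a good example: handed to kraken, it introduces an input/output file pair; handed to ketos train, it names an existing model to refine. With placeholder file names:

kraken -i page.jpg page.txt segment -bl ocr -m some_model.mlmodel
ketos train -i some_model.mlmodel -f page ground_truth/*.xml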

My overall impression is: nomen est omen. OCR4all really is trying to bring OCR for early print to a broad audience of individual users, while eScriptorium is targeted at the digital scriptorium, with specialized hardware and trained personnel working on major digitization efforts.

Still, eScriptorium gives me the sense, in a way that OCR4all didn't, that the promised land is in sight: I'll be able to take a digital facsimile, get a good initial transcription, then train a work-specific model that will produce very good to excellent results for the complete work. A working digital text might mean the investment of a few hours and not dozens.