Wednesday, August 16, 2023

eScriptorium: Let's try breaking some rules

I am not a scriptorium. I'm just one guy with a professional interest in late medieval and early modern prophetic texts who needs to turn digital facsimiles into electronic texts as quickly and painlessly as possible so I can read, skim, annotate, compare and search them.

eScriptorium uses kraken as its recognition engine, and in fact it's entirely possible to install kraken as a standalone program (using the Windows Subsystem for Linux) and run it from the command line. Kraken in turn makes it possible to train or refine a recognition model using the command-line program ketos.
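
If you want to try that route, the basic pattern looks something like this. Treat it as a sketch rather than a recipe: it assumes kraken was installed with pip inside the WSL distribution, "page.jpg" and "some_model.mlmodel" are placeholders, and the exact options are worth checking against kraken --help.

pip install kraken
kraken -i page.jpg page.txt segment -bl ocr -m some_model.mlmodel

Here "segment -bl" runs the baseline segmenter and "ocr -m" runs recognition with whatever model you point it at, writing the result to page.txt.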

According to the kraken documentation, "Transcription has to be diplomatic, i.e. contain the exact character sequence in the line image, including original orthography." What I need, though, is readable text. Especially if I'm searching a longer text, I need to be able to find all occurrences of a word without worrying about abbreviations or spelling conventions.

This sets up the following workflow:

  • Pre-edit images as necessary (a step much reduced with eScriptorium, which handles skewed images well)
  • Import images into the OCR environment
  • Automatically recognize baselines/regions
  • Manually correct baselines/regions
  • Automatically recognize a bunch of pages
  • Manually correct the text as diplomatic transcription (Larissa Will added some useful pointers in a comment on my last post)
  • Export XML files and JPG images
  • Train a work-specific model with ketos using these files
  • Recognize the rest of the text in eScriptorium using this model
  • Correct the diplomatic transcription by resolving abbreviations and lightly normalizing the text

It's a lot of work. OCR4all and its models are set up to handle just about any abbreviation you throw at them, but the rest of the world isn't. Diplomatic transcription requires Unicode extensions that Notepad++ and Word don't handle well in every case (or even in many cases). And it takes an enormous amount of time to resolve the abbreviations and normalize the text, even with careful use of search and replace. I did it for one longer work, and it's exhausting.

But what happens if we break the rules? I don't want to deal with diplomatic transcriptions. You're not the boss of me and you can't make me. What if we don't give kraken a diplomatic transcription? What if we feed it transcribed pages of resolved abbreviations or even normalized text? Will it harm accuracy, and will the results be usable? Can we get kraken to take over the work of normalizing the text for us?

Yes we can.

I started with UB Mannheim's recognition model based on 16th-century German Gothic prints (luther_best.mlmodel), which gave results very similar to, and perhaps marginally better than, its model based on German and Latin prints of various periods (german_print_best.mlmodel). The OCR target is again a 1498 edition of Wolfgang Aytinger's commentary on pseudo-Methodius. I experimented a bit along the way, but by the time I ran the final comparison, I had 24 corrected pages exported in PAGE XML format.

So in my eScriptorium directory, I currently have a subdirectory, "train," containing 24 .jpg files and their matching .xml files, named 0058.jpg/0058.xml through 0080.jpg/0080.xml. Between them, there are around 650 corrected lines of text. In my WSL-based Debian distribution, I navigate to the eScriptorium directory and issue the following command:

ketos -v train -i luther_best.mlmodel -f page train/*.xml

To explain each part:

-v    Verbose output, so I can see more of what's going on

train    The command telling ketos what to do

-i    Instead of trying to train a new model from scratch, I want ketos to retrain an existing model, luther_best.mlmodel, found in the directory where I issue the command

-f    The XML files are in PAGE format. 

train/*.xml    The path to the files to use for training 
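
One thing this doesn't show is where the retrained model ends up. If I'm reading the defaults right, ketos writes a checkpoint after each epoch (model_0.mlmodel, model_1.mlmodel, and so on) and copies the best-performing one to model_best.mlmodel; adding -o lets you choose a friendlier prefix. Something like this, where "aytinger" is simply my own choice of name:

ketos -v train -i luther_best.mlmodel -f page -o aytinger train/*.xml

That should leave an aytinger_best.mlmodel ready to import back into eScriptorium.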

Retraining a model with just 5 corrected pages yielded much improved recognition results. On my underpowered computer, each training iteration took around 5 minutes and the whole process took a few hours. With more data, training takes longer - with 24 pages, each iteration took 20+ minutes and I had to let the process run overnight.
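
If you want a number rather than a gut feeling, ketos can also score a model against corrected pages held back from training. A sketch, assuming a test/ directory of held-out PAGE XML exports and the "aytinger" prefix from above (ketos test --help has the details):

ketos test -m aytinger_best.mlmodel -f page test/*.xml

The report includes character accuracy, which makes it easy to compare the base model and the retrained one on the same pages.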

Was it worth it?

Here are the first 10 lines of the page from the HAB that I used for testing.


And here are the first 10 lines using the base Luther model:


I mean, kind of? I could use it, but it would take a lot of work to fix. The base German_Print model is about the same:

A little better, a little worse. Still a lot of work.

After training on work-specific pages, accuracy can be expected to improve considerably. But how does kraken handle abbreviations after being trained on an undiplomatic transcription? As it turns out, brilliantly.

In this image, correct resolutions of abbreviations are marked in green and incorrect ones or other mistakes in red. eScriptorium got it right 47/57 times (82%) after training on just the abbreviations that occur in 24 pages. And the best part is that I only have to deal with plain old ASCII-compliant characters that will copy and paste anywhere without complaint.

I can almost read this as is. I can work with this.

But what about normalizing i/j and u/v and similar? I re-corrected the 24 pages by normalizing the text, re-exported the XML, and re-trained. Here are the results:

In this image, green and red mark correct and incorrect expansions as above, while blue and purple mark correct and incorrect normalizations.

And the results are fantastic. Training on normalized text with resolved abbreviations didn't cause any loss of recognition accuracy. Kraken normalized the text correctly 10/11 times, including getting the internal "v" right in "conversus" and "numeravit."

Even some of the mistaken abbreviation resolutions are better than nothing. I just need to delete an "n" from "quenm," and the "m" is already there. The less I have to type, the faster the last step will go. This is going to save me a lot of work.

Finally, someone might ask if I couldn't just train a model from scratch using pages from Aytinger and have a completely tailor-made recognition model. Wouldn't that be even more accurate?

No. At this point, the results are laughable - quite literally. Trained on 650 lines of corrected text, the resulting recognition model had a 4% accuracy rate and yielded the following transcription:

tee-
teei
teet
teei
tee-
tee-
tee-
teei
ees
teet

I don't know how much text is required to train a model from scratch, but "around 650 lines" is not the answer.
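
For the record, training from scratch is just the same command without the -i option, so ketos has nothing to build on beyond those 650 lines:

ketos -v train -f page train/*.xml

For now, fine-tuning an existing model is clearly the way to go.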

1 comment:

  1. Congrats. According to my experience, where a good generalized model exists, it is almost always preferable to fine-tune it (as you did) instead of starting from scratch, especially if you are adding resolution of common abbreviations and i/j u/v normalization.
