Wednesday, August 2, 2023

eScriptorium is bad. And brilliant.

OCR4all isn't the only option for turning early printed books into electronic texts. According to Stefan Weil (via Klaus Graf), eScriptorium was able to produce usable text of pseudo-Vincent Ferrer's De fine mundi in German translation with moderate effort based on segmentation and text recognition models from the UB Mannheim. With 15 corrected pages, it was then possible to train a work-specific OCR model for the rest. (His results are here.)

First the good news.  The documentation states that eScriptorium runs under either Linux or MacOS, but it installs and runs without major incident as a Docker app under Windows. The recognition models from the UB Mannheim are very good for early German printed books. Using a best-case scan of Joseph Grünpeck's 1508 Speculum in German from the Österreichische Nationalbibliothek, it produced impressively accurate results. Results were still very good for a less-than-perfect scenario (Wolfgang Aytinger's Latin commentary on pseudo-Methodius from the HAB). The underlying Kraken OCR engine has finally made it possible for me to train and use my own OCR models. I can't imagine using anything but eScriptorium after this. It's that good.

But eScriptorium is also bad.

The installation documentation isn't great, and the instructions for use are confusing. A quickstart guide would be helpful. I've run other OCR programs before, but it still took a long time for eScriptorium to start making sense.

Importing images into a new project went smoothly, and eScriptorium seems to handle PDFs just fine. So there's less need to pre-process images, and conversion to grayscale is unnecessary.

Segmentation, using either the default or the UB Mannheim segmentation model, worked quite well. But one of eScriptorium's major flaws is that the segmentation editing tools are very limited compared to OCR4all, and they behave erratically. There's no easy way to select an area and delete all its text lines at once, which is a real problem with woodcut illustrations, which are often recognized as a dozen or more short text lines at odd angles. Instead of selecting all the lines or the whole region, I have to select each line and delete it one by one. Attempting to edit region outlines or lines quickly leads to frustration. The order of text lines can be displayed, but there's no way to change it if it's incorrect (which I've seen more than once so far). There's no way to "grab" the page image and move it around, so it can be difficult to move it back to the center of the screen at a reasonable size. The equivalent tools in OCR4all are much more comprehensive and easier to use.

When segmenting or recognizing multiple pages, processes would stop without warning or explanation. The only reason I knew they had stopped was because CPU usage in Docker dropped back into low single digits. The failed processes weren't sequential, but instead affected pages seemingly at random, which meant I had to go back through the images one by one to see where segmentation or recognition had failed.

Some tasks proved impossible to cancel at all, even after stopping Docker or restarting Windows. My first attempts at training a model in eScriptorium were unsuccessful, but even now one of them is still listed as ongoing, and clicking the "Cancel training" button generates the less-than-helpful error "SyntaxError: JSON.parse: unexpected character at line 1 column 1 of the JSON data." (For the curious: the raw data is "status," and nothing else.)

The transcription window is nicely responsive and easy to use. I especially like how the text resizes so the line width matches the image above it, making it easy to edit or transcribe. I prefer it to the OCR4all transcription screen. The eScriptorium window only shows one line at a time, however, and often it's necessary to see the next or preceding lines to choose the best transcription. It also seems to lack OCR4all's ability to show low-probability characters.

With good recognition models, recognition results are fantastic. A pattern I've noticed with perhaps half the pages I've processed so far, however, is that the last line on the page will have abominable recognition quality, as if eScriptorium is somehow reverting to a rudimentary model for the final line of the page.

You can export recognized, transcribed and/or corrected pages in ALTO or PAGE XML formats or plain text, with or without the images - maybe. Once I had downloaded one XML format without images, eScriptorium kept generating the same ZIP file no matter what other options I chose until I restarted the whole stack in Docker. Plain text is exported only as a single block of text rather than as separate text files. If you need to generate ground truth text files, you'll have to copy and paste by hand, without even a page separator to guide you.
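The ALTO export does preserve the page and line structure, though, so the copy-and-paste step can be scripted away. Here's a minimal sketch, assuming an unzipped ALTO export with one XML file per page and the ALTO version 4 namespace (adjust the namespace URI, and skip any METS manifest, if your export differs):

    # Split an unzipped eScriptorium ALTO export into per-page ground truth
    # text files (one .gt.txt file written beside each page's XML file).
    from pathlib import Path
    import xml.etree.ElementTree as ET

    ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v4#"}  # adjust to your export

    def alto_to_text(xml_path):
        """Join the String CONTENT of every TextLine, one text line per row."""
        root = ET.parse(xml_path).getroot()
        lines = []
        for text_line in root.iterfind(".//alto:TextLine", ALTO_NS):
            words = [s.get("CONTENT", "") for s in text_line.iterfind("alto:String", ALTO_NS)]
            lines.append(" ".join(w for w in words if w))
        return "\n".join(lines)

    for xml_file in sorted(Path("alto_export").glob("*.xml")):
        out_file = xml_file.with_name(xml_file.stem + ".gt.txt")
        out_file.write_text(alto_to_text(xml_file), encoding="utf-8")

That at least yields one plain-text file per page instead of a single undifferentiated block.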

Effective OCR relies on high-quality models, but the central Zenodo repository is nearly empty and doesn't offer anything suitable for the early modern printed books I work with. I knew that better models existed somewhere and even some of the file names, but even with that information it took me two days to find them. (I eventually found them here.)

One of the nice features of eScriptorium is the underlying Kraken engine. You can install it under the Windows Subsystem for Linux (I've got it running under Debian) and work with it directly from the command line, no web interface needed. The bad news, again, is the confusing and sometimes contradictory documentation, with the same command-line switches meaning different things depending on the particular command given to Kraken.
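The engine can also be scripted from Python rather than through the subcommands. The following is only a sketch based on my reading of the Kraken API documentation (kraken 4.x; function names have moved around between versions, and the model filename is a placeholder):

    # Baseline segmentation + recognition with the kraken Python API (kraken 4.x).
    from PIL import Image
    from kraken import blla, rpred
    from kraken.lib import models

    im = Image.open("page.png")

    # Segment the page into baselines with kraken's default segmentation model
    # (a custom model can be loaded via kraken.lib.vgsl and passed as model=...).
    seg = blla.segment(im)

    # Load a recognition model and run it line by line over the segmentation.
    rec_model = models.load_any("german_print.mlmodel")  # placeholder filename
    for record in rpred.rpred(rec_model, im, seg):
        print(record.prediction)

This is also a reminder of where the command-line confusion comes from: segmentation and recognition are separate steps with their own options, chained together in a single kraken invocation.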

My overall impression is: nomen est omen. OCR4all really is trying to bring OCR for early print to a broad audience of individual users, while eScriptorium is targeted at the digital scriptorium, with specialized hardware and trained personnel working on major digitization efforts.

Still, eScriptorium gives me the sense, in a way that OCR4all didn't, that the promised land is in sight: I'll be able to take a digital facsimile, get a good initial transcription, then train a work-specific model that will produce very good to excellent results for the complete work. A working digital text might mean the investment of a few hours and not dozens.

3 comments:

  1. Dear Mr. Green,
    Your article is really interesting and I agree with you on some points, especially regarding the unintuitive and sometimes clumsy user interface. In the OCR-BW project, we have written a relatively easy installation guide (but without Docker) in addition to a user guide. Both are in German, but they can certainly be translated easily with DeepL. I would be interested to know how you find them.

    I have to disagree with you on a couple of points, as it is entirely possible to delete multiple lines at once. You have to hold SHIFT and select all the lines with your mouse; you can then delete them all at once, extend them, etc. It is also possible to change the order of the lines. To do this, you must open the panel with the numbered lines and click on the double arrow; you can then move the lines there as you wish. I can also recommend the manual by our Parisian colleagues, where you will find a section on exactly this point: https://escriptorium-tutorial.readthedocs.io/en/latest/transcribe/#sorting-lines. You also have the option of grabbing the image and moving it around; for this you have to use the right mouse button.

    You are definitely right about the findability of the models; we still have to move them to Zenodo.

    Feel free to contact me for a general discussion about eScriptorium. You can find my contact details and the documentation on the project homepage: https://ocr-bw.bib.uni-mannheim.de/.

    Best regards from Mannheim
    Larissa Will

  2. Thanks for the tips - I'll try those things and see if I can get more efficient with fixing the page layout.

  3. Hi Jonathan
    Tx a lot for trying eScriptorium out and writing this review.
    First, we are actually quite happy about any feature that does work in an installation on an OS for which it was not built.
    Second, thanks for your praise about some of the results.
    Third, a question: did you try any of
    a) the introductory videos (e.g. the openITI ones, starting here: https://www.youtube.com/watch?v=N0hSNC3YvD4, or my own here: https://vimeo.com/501443629),
    b) the written tutorial indicated by Larissa above, or
    c) the help button '?' above the segmentation panel?
    If not, were they not pointed out well enough? (This is a real question.)
    In addition to what Larissa mentioned above about multi-line and line-order modifications, you can grab the page image and move it around (i.e. 'panning') by clicking the right mouse button and dragging on the segmentation panel.
    Fourth, we would be most grateful for any bug reports here: https://gitlab.com/scripta/escriptorium/-/issues indicating your OS and browser.

    A hint for your problems with the last line: we try to limit the number of buttons, but the API gives you a magic wand for a lot of issues, especially if, as seems to be your case, you know how to program. Probably the polygon around the bottom line is too big, because there is no line below it that would help to limit its size. You can address this by creating smaller polygons around the first or last lines, based on the average line distance, before transcription.
    Kindly from Paris
    Daniel
