Friday, October 31, 2014

Word, Zotero, Excel, Perl and back: online facsimiles and the digital research process in the humanities (with source code)

At a recent conference on book history and digital humanities in Wolfenbüttel, I was struck by the very different places the presenters were in with respect to adopting digital tools. Some presenters did little more than write their papers using Word, some made use of bleeding-edge visualization tools, and others were everywhere in between. One is not necessarily better than the others, as long as the appropriate tools are being used for the task; some projects just don't require complex visualization strategies. While watching the presentations, it occurred to me that selecting digital tools for my own research is like peeling back the layers of an onion.

For the simplest research projects, Word is enough for taking notes and then turning those notes into a written text. When I write book reviews, for example, I rarely need anything more than Word.

For more complex projects, I use Zotero to manage notes and bibliography, and for creating footnotes. For my forthcoming article on Lienhard Jost, I took several pages of notes in Word. Then I created a subcollection in Zotero that held a few dozen bibliographical sources, some with child notes. When I started writing, I used Zotero to create all the footnotes.

If I have a significant amount of tabular data, however, or if I need to do any calculations, then Excel becomes an essential tool for analysis and information management. If I'm working with more than five or ten editions, I'll create a spreadsheet to keep track of them. For a joint article in progress on "Dietrich von Zengg," I have notes from Word, a Zotero subcollection to manage bibliography and additional notes, and an Excel spreadsheet of 20+ relevant editions (including among other things the date and place of printing, VD16 number, and whether I have a facsimile or not). For establishing the textual history, I start with Word, but I need Excel to keep track of all the variants. I use Zotero again for footnotes as I write the article in Word.

Once I have more than several dozen primary sources, however, neither Excel nor Zotero is sufficient, especially if I need to look at subsets of the primary sources. For Printing and Prophecy, I made a systematic search of GW/ISTC/VD16 for relevant printed editions before 1550 (extended to 1620 for an article I published in AGB), then entered all the information into an Access database. It took months of daily data entry, and I still update the database whenever I come across something that sounds relevant. For my work now, it's a huge time-saver. If I want to know what prophetic texts were printed in Nuremberg between 1530 and 1540, it takes me about ten seconds to find out, and I can also create much more complex queries. Or if I'm planning to visit Wolfenbüttel and want to find out which printed editions in the Herzog August Bibliothek I've never seen in person or in facsimile, a database search will send me in the right direction. I'll also import tabular data into Access and export tables from Access into Excel based on particular queries, such as the table of all editions attributed to Wilhelm Friess. The Strange and Terrible Visions was an Excel project, but Printing and Prophecy was driven by Access.

For projects that require very specific or complex analysis, or that involve online interaction, Access is not the right tool for me. To create the Access database for "The Shape of Incunable Survival and Statistical Estimation of Lost Editions," I spent several hours writing Perl scripts to query the ISTC and analyze the results. Every so often I'll need to write a new Perl script to extend one of the Access databases that I work with. (One doesn't have to use Perl, of course. Oliver Duntze's presentation involved some very interesting work on typographic history using Python. R and other statistical packages probably belong here, too.)

* * *

Problem: As I'm located a few thousand miles from most of my primary sources, digital facsimiles have become especially important to my work. I check several digitization projects daily for new facsimiles and update my database accordingly. Systematic searching for facsimiles led me to the discovery of Lienhard Jost's lost visions, so I've already seen real professional benefit from keeping track of new digital editions. But what if I miss a day or skip over an important title? I might be missing out on something important.

Solution: Create an Access query to check my database and find all the titles in VD16 for which no facsimile is known. Time: 10 seconds. Result: 1,565 editions.
Export the results to Excel, and save them as a text file. Time: 10 more seconds.
Search VD16 for all 1,565 editions by hand and check for new facsimiles? Time: 13 mind-numbing hours?


Instead, write a short Perl script (see below) to read the text file, query the VD16 database, and spit out the VD16 numbers when it finds a link to a digital edition. Time: 1 hour (30 minutes programming, 30 minutes run time). Result: 280 new digital facsimiles that I had overlooked.

To inspect all of those facsimiles and see if they're hiding anything exciting will take a few weeks, but most of the time I spend on it will involve core scholarly competencies of reading and evaluating primary sources, and it will make my knowledge of the sources more comprehensive and complete. In cases like this, digital tools let me get to the heart of my scholarship more efficiently and spend less time on repetitive tasks.

Does every humanist need to know a programming language, as some conference participants suggested? I don't know. We need to constantly acquire or become conversant in new skills, both within and outside our discipline, but sometimes it makes more sense to rely on the help of experts. I don't think it's implausible, however, that before long it will be as common for scholars in the humanities to use programming languages as it is for us to use Excel today.

* * *

After all, if you can learn Latin, Perl isn't difficult.

### This script reads through a list of VD16 numbers (assumed to be in a file named 'VD16.txt' in the directory where the
### script is run, with one entry on each line in the form 'VD16 S 843'). The script loads the corresponding VD16 entry and
### checks to see if a digital facsimile is available. If it is, it prints the VD16 number.

### This script relies on the following Perl modules: LWP

use strict;
use warnings;
use LWP::Simple;    ### For grabbing web pages

### NOTE: The base URL and the facsimile-link pattern did not survive in this listing. Both must be filled in
### before the script will run; the empty values below are placeholders, not the original settings.
my $baseurl = '';
my $facsimile_pattern = '';

die "Set \$baseurl and \$facsimile_pattern before running\n"
    if $baseurl eq '' or $facsimile_pattern eq '';

open(my $file, '<', 'VD16.txt') or die "Cannot open VD16.txt: $!\n";

while (my $VD16number = <$file>) {
    ### Remove the newline
    chomp($VD16number);
    ### Skip blank lines
    next if $VD16number eq '';
    ### Replace spaces with + signs for the durable URL (seems not to be strictly necessary, will accept spaces OK)
    my $formatVD16number = $VD16number;
    $formatVD16number =~ s/[ ]/+/g;
    ### Set up the URL we want to retrieve
    my $url = $baseurl . $formatVD16number;
    ### Now load that page (get returns undef if the request fails)
    my $VD16page = get($url);
    if (!defined $VD16page) {
        warn "Could not retrieve $url\n";
        next;
    }
    ### Now look for a facsimile link
    if ($VD16page =~ /$facsimile_pattern/) {
        ### If we find one, print the VD16 number
        print "$VD16number\n";
    }
}

close($file);


  1. Practical, sensible.
    A few alternatives come to mind.
    Evernote versus Word: you still have bold and italic, but no sidetracking by all those features of Word. And better management of all those notes and snippets.
    Filemaker versus Access: I know some people doing powerful things with Filemaker. They have their transcriptions, facsimiles, conjectures and commentaries all in it. Very beautiful.
    Python versus Perl: I have programmed in Perl for 20 years and loved it. Yet a year ago I switched to Python, for several reasons: Python is becoming a lingua franca in the data processing world, across disciplines, and its ecosystem of packages reflects that. Interactive Python (IPython Notebook) is ideal for exploratory data analysis. Python is drawing even theologians into its orb. Programming has never been easier, and I agree with the last sentence, that programming may very well become a basic skill for a large proportion of humanists.

  2. Totally agree with what you said. I had a similar experience, and now, with a bit of playing around, I have arrived at a stage where I use MS Access for bibliographies of books of the handpress era (all my sources, basically) and Zotero for secondary readings. Access is great in that it is dynamic, allows you to change multiple entries at the same time, and gives you endless opportunities to filter your data and write queries. Zotero works best for me because it offers extensive possibilities for collecting and saving notes on your reading. Furthermore, it gives you a wide array of output formats when it comes to creating a bibliography, covering almost all major style guides.