Friday, October 31, 2014

Word, Zotero, Excel, Perl and back: online facsimiles and the digital research process in the humanities (with source code)

At a recent conference on book history and digital humanities in Wolfenbüttel, I was struck by how far apart the presenters were in their adoption of digital tools. Some did little more than write their papers in Word, some made use of bleeding-edge visualization tools, and others fell everywhere in between. One approach is not necessarily better than another, as long as the tools fit the task - some projects just don't require complex visualization strategies. While watching the presentations, it occurred to me that selecting digital tools for my own research is like peeling back the layers of an onion.
For the simplest research projects, Word is enough for taking notes and then turning those notes into a written text. When I write book reviews, for example, I rarely need anything more than Word.

For more complex projects, I use Zotero to manage notes and bibliography, and for creating footnotes. For my forthcoming article on Lienhard Jost, I took several pages of notes in Word. Then I created a subcollection in Zotero that held a few dozen bibliographical sources, some with child notes. When I started writing, I used Zotero to create all the footnotes.

If I have a significant amount of tabular data, however, or if I need to do any calculation, then Excel becomes an essential tool for analysis and information management. If I'm working with more than five or ten editions, I'll create a spreadsheet to keep track of them. For a joint article in progress on "Dietrich von Zengg," I have notes from Word, a Zotero subcollection to manage bibliography and additional notes, and an Excel spreadsheet of 20+ relevant editions (including among other things the date and place of printing, VD16 number, and whether I have a facsimile or not). For establishing the textual history, I start with Word, but I need Excel to keep track of all the variants. I use Zotero again for footnotes as I write the article in Word.

Once I have more than several dozen primary sources, however, neither Excel nor Zotero is sufficient, especially if I need to look at subsets of the primary sources. For Printing and Prophecy, I made a systematic search of GW/ISTC/VD16 for relevant printed editions before 1550 (extended to 1620 for an article I published in AGB), then entered all the information into an Access database. It took months of daily data-entry, and I still update the database whenever I come across something that sounds relevant. For my work now, it's a huge time-saver. If I want to know what prophetic texts were printed in Nuremberg between 1530 and 1540, it takes me about ten seconds to find out, and I can also create much more complex queries. Or if I'm planning to visit Wolfenbüttel and want to find out which printed editions in the Herzog August Bibliothek I've never seen in person or in facsimile, a database search will send me in the right direction. I'll also import tabular data into Access and export tables from Access into Excel based on particular queries, such as the table of all editions attributed to Wilhelm Friess. The Strange and Terrible Visions was an Excel project, but Printing and Prophecy was driven by Access.

For projects that require very specific or complex analysis, or that involve online interaction, Access is not the right tool for me. To create the Access database for "The Shape of Incunable Survival and Statistical Estimation of Lost Editions," I spent several hours writing Perl scripts to query the ISTC and analyze the results. Every so often I'll need to write a new Perl script to extend one of the Access databases that I work with. (One doesn't have to use Perl, of course. Oliver Duntze's presentation involved some very interesting work on typographic history using Python. R and other statistical packages probably belong here, too.)

* * *


Problem: As I'm located a few thousand miles from most of my primary sources, digital facsimiles have become especially important to my work. I check several digitization projects daily for new facsimiles and update my database accordingly. Systematic searching for facsimiles led me to the discovery of Lienhard Jost's lost visions, so I've already seen real professional benefit from keeping track of new digital editions. But what if I miss a day or skip over an important title? I might be missing out on something important.

Solution: Create an Access query to check my database and find all the titles in VD16 for which no facsimile is known. Time: 10 seconds. Result: 1,565 editions.
Export the results to Excel, and save them as a text file. Time: 10 more seconds.
Search VD16 for all 1,565 editions by hand and check for new facsimiles? Time: 13 mind-numbing hours?

No.

Instead, write a short Perl script (see below) to read the text file, query the VD16 database, and spit out the VD16 numbers when it finds a link to a digital edition. Time: 1 hour (30 minutes programming, 30 minutes run time). Result: 280 new digital facsimiles that I had overlooked.

To inspect all of those facsimiles and see if they're hiding anything exciting will take a few weeks, but most of the time I spend on it will involve core scholarly competencies of reading and evaluating primary sources, and it will make my knowledge of the sources more comprehensive. In cases like this, digital tools let me get to the heart of my scholarship more efficiently and spend less time on repetitive tasks.

Does every humanist need to know a programming language, as some conference participants suggested? I don't know. We constantly need to acquire new skills, or at least become conversant with them, both within and outside our discipline, but sometimes it makes more sense to rely on the help of experts. I don't think it's implausible, however, that before long it will be as common for scholars in the humanities to use programming languages as it is for us to use Excel today.

* * *

After all, if you can learn Latin, Perl isn't difficult.

### This script reads through a list of VD16 numbers (assumed to be named 'vd16.txt' and found in the same directory where the
### script is run, with one entry on each line in the form 'VD16 S 843'). The script loads the corresponding VD16 entry and checks
### to see if a digital facsimile is available. If it is, it prints the VD16 number.

### This script relies on the following Perl modules: LWP


use strict;
use warnings;
use LWP::Simple;    ### For grabbing web pages

my $baseurl = 'http://gateway-bayern.de/';

### Open the list of VD16 numbers, one per line
open(my $fh, '<', 'vd16.txt') or die "Could not open vd16.txt: $!";

while (my $VD16number = <$fh>) {
    ### Remove the newline and skip blank lines
    chomp($VD16number);
    next unless $VD16number;
    ### Replace spaces with + signs for the durable URL (seems not to be
    ### strictly necessary, as the resolver will accept spaces OK)
    my $formatVD16number = $VD16number;
    $formatVD16number =~ s/ /+/g;
    ### Set up the URL we want to retrieve
    my $url = $baseurl . $formatVD16number;
    ### Now load that page; skip the entry if the request fails
    my $VD16page = get($url);
    next unless defined $VD16page;
    ### Now look for a facsimile link
    if ($VD16page =~ /nbn-resolving\.de/) {
        ### If we find one, print the VD16 number
        print "$VD16number\n";
    }
}

close($fh);

Friday, October 24, 2014

Let's learn R: Confidence intervals, and the "So what?" question for history and literary studies

To catch up with the last two posts (one, two): In 2007, Goran Proot and Leo Egghe published an article in Papers of the Bibliographical Society of America (102: 149-74) in which they suggested a method for estimating the number of missing editions of printed works based on surviving copies. In 2008, Quentin Burrell (in Journal of Informetrics 2:101-5) showed that their approach was a special case of the unseen species problem, which has been widely studied in statistics, and suggested a more robust approach based on the statistical literature. The problem is that applying Proot and Egghe's method is within the abilities of most book historians, while applying Burrell's is not. The last few posts have been dedicated to implementing Burrell's approach in R and bringing it within reach of the non-statistician.

The specific estimate that Burrell's method offers is not substantially different from Proot and Egghe's. Burrell's approach does offer two important advantages, however. The first is a confidence interval: within what range are we 95% confident that the actual number of lost or total editions lies? Burrell offers the following formula, where n is the number of editions (804 in Proot and Egghe's data):
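$$\hat{\lambda} \pm \frac{1.96}{\sqrt{n\,I(\hat{\lambda})}}$$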

This formula in turn relies on the formula for the Fisher information of a truncated Poisson distribution, which Burrell derives for us:
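$$I(\lambda) = \frac{1-(1+\lambda)e^{-\lambda}}{\lambda\left(1-e^{-\lambda}\right)^{2}}$$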

This again looks fearsome, but all we have to do is plug the right numbers into R, with e = Euler's number and λ = .2461 (as we found last time).

Let's define i according to Burrell's equation (5):
> i=(1-(1+.2461)*exp(1)^-.2461)/(.2461*(1-exp(1)^-.2461)^2)
> i
[1] 2.198025


So according to Burrell's equation (4), the confidence interval is .2461 plus or minus the result of this:
> 1.96/(sqrt(804*i))
[1] 0.04662423


Or in other words,
> .2461-1.96/(sqrt(804*i))
[1] 0.1994758
> .2461+1.96/(sqrt(804*i))
[1] 0.2927242


So we expect, with 95% confidence, that the actual fraction of missing editions lies somewhere between .7462 and .8192 (found in the same way as in the last post), which in turn lets us estimate a range for the total number of editions:

> exp(-.1994758)
[1] 0.81916
> exp(-.2927242)
[1] 0.7462279


> 804/(1-exp(-.1994758))
[1] 4445.92
> 804/(1-exp(-.2927242))
[1] 3168.197


The second advantage of using Burrell's method is of fundamental importance: It forces us to think about when we can apply it, and when we can't. We can observe both numerically and graphically that Proot and Egghe's data are a very good fit for a truncated Poisson distribution, and therefore plug numbers into Burrell's equations with a clean conscience. (NB: I have stated in print that a truncated Poisson distribution is a very poor fit for modelling incunable survival, and I think it will usually be a poor model for most situations involving book survival. What to do about that is a problem that still needs to be addressed.)

In addition, Burrell offers a statistical test of how well the truncated Poisson distribution fits the observed data. Burrell's Table 1 compares the observed and the expected number of editions with the given number of copies, to which he applies a chi-square test for goodness of fit. Note, however, that a common rule of thumb for chi-square tests is that every category should have an expected count of at least five, and so Burrell combines the 3-, 4-, and 5-copy editions into one category.

Can R do this? Of course it can. R was made to do this.

> observed <- c(714,82,8)
> expected <- c(709.11,87.27,7.62)

> chisq.test (observed, p=expected, rescale.p=TRUE)

        Chi-squared test for given probabilities

data:  observed
X-squared = 0.3709, df = 2, p-value = 0.8307


So we derive nearly the same chi-squared value as Burrell does, .371 compared to his .370. My quibble with Burrell concerns the degrees of freedom (df) and the p-value: he reports 1 df and a p-value of .89, where R gives 2 df and .831. The chi-squared value can be anything from zero (for a perfect fit) on up (for increasingly poor fits). The degrees of freedom for a goodness-of-fit test is the number of categories (here the 1-copy, 2-copy, and 3/4/5-copy editions) minus one - and minus one more, by the usual rule, for each parameter estimated from the data, which would in fact give Burrell's 1 df rather than R's 2. The p-value ranges from very small (for a terrible fit) up to 1 (for a perfect fit), and I can't see how either choice of df yields .89 for this chi-squared value. There is a great deal written about how to apply and interpret the results of a chi-square or other statistical test, but the values above support the visual impression, as Burrell notes, that Proot and Egghe's data are a very close fit to a truncated Poisson distribution.
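We can check how the p-value depends on the degrees of freedom with R's pchisq() function, which turns a chi-squared value and a number of degrees of freedom into an upper-tail probability:

> pchisq(0.3709, df=2, lower.tail=FALSE)   ### about .83, matching chisq.test above
> pchisq(0.3709, df=1, lower.tail=FALSE)   ### about .54 - still not .89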

* * *

So why should a book historian or scholar of older literature care about any of this? Stated very briefly, the answer is:

For studying older literature, we often implicitly assume that what we can find in libraries today reflects what could have been found in the fifteenth or sixteenth centuries, and that manuscript catalogs and bibliographies of early printing are a reasonably accurate map of the past. But we need to have a clearer idea of what we don't know. We need to understand what kinds of books are most likely to disappear without a trace.

For book history, the question of incunable survival has been debated for over a century, with the consensus opinion holding until recently that only a small fraction of editions has been lost without a trace. It now seems clear that the consensus view is not correct.

For over seventy years, the field of statistics has been working on ways to provide estimates of unobserved cases, the "unseen species problem." The information that the statistical methods require - the number of editions and copies - is information that can be found, with some effort, in bibliographic databases. The ISTC, GW, VD16, and others are coming closer to providing a complete census and a usable copy count for each edition.

Attempts to estimate lost editions from within the field of book history have taken place independently of the statistical literature. This has only recently begun to change. Quentin Burrell's 2008 article made an important contribution to moving the discussion forward, and he challenged those studying lost books to make use of the method he outlined.

Statistical arguments are difficult for scholars in the humanities to follow, however, and the statistical methods Burrell suggested are even harder for us to implement. The algebraic formula offered by Proot and Egghe is much more accessible - but limited. We have a statistical problem on our hands, and making progress requires engaging with the statistical arguments.

We can do it. We have to understand the concepts, but humanists are good at grappling with abstractions. We can use R to handle the calculations. These three posts on R and Burrell provide all we need in order to turn copy counts into an estimate of lost editions along with confidence intervals and a test of goodness of fit. Every part of the more robust approach suggested by Burrell is now implemented in R. We could write a script to automate the whole process. This doesn't solve all our problems - there's the problem of most kinds of book survival not being good fits for a Poisson distribution - but it at least gets us caught up to 2008.
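Here is a minimal sketch of what such a script might look like, stringing together the steps from these three posts. The copy counts are Proot and Egghe's data as entered earlier; the variable names are my own:

### Estimate lost editions from copy counts, following Burrell's approach
copies   <- c(1, 2, 3, 4, 5)      ### copies surviving per edition
editions <- c(714, 82, 4, 3, 1)   ### editions known in that many copies

n    <- sum(editions)               ### 804 known editions
xbar <- sum(copies * editions) / n  ### 907/804 copies per edition

### Burrell's equation (3): solve lambda/(1 - exp(-lambda)) = xbar numerically
g      <- function(lambda) lambda / (1 - exp(-lambda)) - xbar
lambda <- uniroot(g, c(0.0001, 5))$root

### Estimated fraction of 0-copy editions and estimated total editions
lost.fraction <- exp(-lambda)
total         <- n / (1 - lost.fraction)

### 95% confidence interval, using the Fisher information (Burrell's equation 5)
i            <- (1 - (1 + lambda) * exp(-lambda)) / (lambda * (1 - exp(-lambda))^2)
halfwidth    <- 1.96 / sqrt(n * i)
lambda.range <- c(lambda - halfwidth, lambda + halfwidth)
total.range  <- sort(n / (1 - exp(-lambda.range)))

### Goodness of fit: pool the sparse categories (3, 4, and 5 copies), as Burrell does
expected        <- n * (exp(lambda) - 1)^-1 * lambda^copies / factorial(copies)
observed.pooled <- c(editions[1:2], sum(editions[3:5]))
expected.pooled <- c(expected[1:2], sum(expected[3:5]))

print(round(c(lambda = lambda, lost.fraction = lost.fraction, total = total), 4))
print(round(total.range))
print(chisq.test(observed.pooled, p = expected.pooled, rescale.p = TRUE))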

Friday, October 17, 2014

Let's learn R: Estimating what we can't see

How amazing is R? It's so amazing that you can throw some data at it and ask R to fit it to a distribution, and R will do it. So let's take Proot and Egghe's data and throw it at a Poisson distribution (like Burrell suggests). For that, we'll need to use the fitdistr() function from the MASS library:

> require(MASS)

Also, we need to set up a vector filled with Proot and Egghe's data by filling it with 714 ones, 82 twos, and so on:
> x<-rep(1,714)
> x<-c(x,rep(2,82))
> x<-c(x,rep(3,4))
> x<-c(x,rep(4,3))
> x<-c(x,rep(5,1))


Let's make sure it looks right:
> table(x)
x
  1   2   3   4   5
714  82   4   3   1


Now we'll tell it to fit the data to a Poisson distribution:

> fitdistr(x,"poisson")
     lambda
  1.12810945
 (0.03745826)


Wow! That was easy. So what do 804 data points (the number of editions in Proot and Egghe's sample) look like in a Poisson distribution with a lambda value of 1.1281? It looks like this, which is nothing like our data:

> hist(rpois(804,1.1281))


What went wrong?

Picking up where we left off: What Proot and Egghe are observing is a truncated Poisson distribution: There are 0-copy editions, and we don't know how many--but they also affect the distribution's defining parameter. Burrell's equations (1) and (2) give the formulas for a truncated Poisson distribution and its log-likelihood function, which is needed for determining the formula for a maximum likelihood estimation (or in other words: estimating the missing editions), which Burrell also provides in (3):
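$$\frac{\hat{\lambda}}{1-e^{-\hat{\lambda}}} = \bar{x}$$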

This looks impenetrable, but it isn't so bad: Lambda (λ) is the number we're looking for, e is Euler's number (2.718...), and x-bar is just the average number of copies per edition, 907/804 = 1.1281. So all we have to do is solve for λ...which, as Burrell points out, is not possible using algebra. It can only be estimated numerically. How are we going to do that?

With R, of course. Here's Burrell's equation (3) in R.


> g <- function (lamda) (lamda/(1-exp(1)^-lamda))-907/804

Now that R knows what the equation is, let's tell R to solve it for us. We have to give R a range to look for the root in. We'll try a range between "very small" and 5, since only values above 0 make sense (we could try a range from -1 to 5 for the same result).

> uniroot(g,c(.0001,5))
$root
[1] 0.2461318

$f.root
[1] -2.516872e-07

$iter
[1] 6

$init.it
[1] NA

$estim.prec
[1] 6.103516e-05


And there's λ = .2461, just like Burrell said it would be. Now that we know λ, we can create Burrell's equation (1) and generate some expected values:

> h <- function (r) (((exp(1)^.2461)-1)^-1)*((.2461^r)/factorial(r))
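Written out, h is just Burrell's equation (1), the truncated Poisson probability that an edition survives in r copies, with λ = .2461 plugged in: $$P(X=r) = \frac{1}{e^{\lambda}-1}\cdot\frac{\lambda^{r}}{r!}, \quad r \geq 1$$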

For copy counts from 1 to 5, let's see what we would expect to find, given 804 editions:

> h(seq(1:5))*804
[1] 709.12157887  87.25741028   7.15801622   0.44039695   0.02167634

Graphically, it's clear that there's a close match between what we expect to find and what Proot and Egghe actually found - note how close the red and blue dots are:

> range<-seq(1:5)
> y<-h(range)*804
> plot(range,as.numeric(table(x)),col="red",pch=19);points(range,y,col="blue",pch=19)

This is an important step, as it tells us that we're not completely mistaken in treating Proot and Egghe's data as a truncated Poisson distribution. Burrell also provides a statistical test of goodness of fit, but we'll come back to that later. Now that we know λ, we can calculate the fraction of 0-copy editions with the formula that Burrell provides on p. 103:
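$$P(X=0) = e^{-\hat{\lambda}}$$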

Or in R:

> exp(-.2461)
[1] 0.781844

And there we have it: an estimated 78% of all editions have been lost. If there are 804 known editions, then the estimated total is 804/(1-.781844) = 3685 editions (approximately - we'll look at confidence intervals later).

> 804/(1-exp(-.2461))
[1] 3685.437


So of roughly 3685 total editions, about 3685 - 804 = 2881 have disappeared without a trace. What do we expect a truncated Poisson distribution with a lambda of .2461 and 3685 data points to look like once the 0-copy editions are removed?

> x<-rpois(3685,.2461)
> x<-x[x>0]
> table(x)

The exact counts change from run to run, but they come out close to the expected values - roughly 709 one-copy, 87 two-copy, and 7 three-copy editions, or just over 800 surviving editions in all, which is very much like what Proot and Egghe actually observed.


> hist(x,breaks=0:max(x))

Now that's more like it:

Next time: Confidence intervals, a quibble with Burrell, and final thoughts.

Friday, October 10, 2014

Egenolff, Grünpeck, and list prophecies

The most significant contribution to the formation of a canon of fifteenth- and sixteenth-century German prophecies was probably that of the Frankfurt printer Christian Egenolff the Elder. After a brush with the law in Strasbourg in the early 1530s, Egenolff moved to Frankfurt and became one of the leading publishers of popular vernacular literature. Between 1531 and 1537, Egenolff published a collection of sibylline prophecies, Zwölff Sibyllen Weissagungen, along with some additional material (VD16 Z 941-945). A decade later, Egenolff expanded the collection significantly by adding the work of Johannes Lichtenberger, Johann Carion, and Paracelsus (VD16 P 5065-5066, P 5068). Reprints of both the smaller and the larger collections appear through the end of the sixteenth century and into the seventeenth and even eighteenth centuries.

While working on Printing and Prophecy, I wanted to consult the 1537 edition (VD16 Z 945), but by itself it didn't seem enough to justify a trip to Wolfenbüttel when obtaining a facsimile would be cheaper. Then, after I was back in the U.S., I learned that the volume was in too fragile a condition for the Herzog August Bibliothek to digitize it. Now that I've had a chance to look at it, it turns out to contain a few surprises.

The collection is the first of Egenolff's to include a work by Josef Grünpeck, but it isn't the same work that shows up in the expanded 1548-50 editions (that work, Von der Reformation der Christenheyt und der Kirchen, is known only from the latest Egenolff collections). For the 1537 sibylline collection, Egenolff instead included Grünpeck's Prognosticum of 1532 (VD16 G 3634-3640, ZV 7115, ZV 23147), making the 1537 Egenolff collection the latest edition of that work. Since the Prognosticum didn't foresee much left of the world's future after 1540, Egenolff in 1548 reached farther back to find a more timely work by Grünpeck.

The other unusual feature of the 1537 edition is that it claims to include a work by Filippo Cattaneo, identified as being "vom Thurn auß Italia." Another prognostication attributed to Cattaneo appeared in 1535 (VD16 C 1725) - but that is not the work included by Egenolff in 1537. Instead, the work that appears in the 1537 collection is a set of terse prophecies for 1537-1550. The basic structure is a governing planet and year, followed by the abundance of oil, wine, and grain, and then an additional brief prognostication. For example:
Mon 1537.
Ols wenig
Weins abgang
Treydt wenig
Zwitracht der Fürsten.
The interesting thing is what happens when you look only at the year and last line:
Mon 1537. Zwitracht der Fürsten
Mars 1538. Vil pestilent in allen landen
Mercurius 1539. Krieg undern Christen
Jupiter 1540. Sterben der Fürsten
Saturnus 1541. Frid in allen landen
Sonn 1542. Außbruch der verschlossen juden
Mon 1543. Juden wider die Christen
Mars 1544. Juden von Christen überwunden
Jupiter 1545. Falsche Propheten
Venus 1546. Vil laster und übels
Saturnus 1547. Kranckheyt der Pestilentz
Sonn 1548. Ein heylger man
1549. Wütten der ungleubigen
1550. Durch disen heyligen man werden alle unglaubigen zum glauben bekert / Als dann würdt werden ein Schaffstal / ein Hirt / und ein Herr / der die gantze welt under seine Herrschafft bringen und regieren würdt. Unnd als dann würdt das Gülden alter herfür kommen.
This looks to me very much like a predecessor of the list prophecy for 1570-80. Compare "Sterben der Fürsten" with "Pastor non erit," or "Ein heylger man" with "Surget maximus vir," or the culmination in "ein Schaffstal / ein Hirt / und ein Herr" with "unum ovile et unus pastor." All of these are common prophetic tropes, of course, but their combination in a list rather than a narrative suggests there may be some direct influence.

The set of people who are interested in Joseph Grünpeck, Christian Egenolff, and the list prophecy for 1570-80 may be somewhat restricted. But for those people, VD16 Z 945 is an extremely interesting edition.