Friday, October 17, 2014

Let's learn R: Estimating what we can't see

How amazing is R? It's so amazing that you can throw some data at it and ask R to fit it to a distribution, and R will do it. So let's take Proot and Egghe's data and throw it at a Poisson distribution (like Burrell suggests). For that, we'll need to use the fitdistr() function from the MASS library:

> require(MASS)

Also, we need to set up a vector filled with Proot and Egghe's data by filling it with 714 ones, 82 twos, and so on:
> x<-rep(1,714)
> x<-c(x,rep(2,82))
> x<-c(x,rep(3,84))
> x<-c(x,rep(4,3))
> x<-c(x,rep(5,1))

Let's make sure it looks right:
> table(x)
  1   2   3   4   5
714  82  84   3   1

Now we'll tell it to fit the data to a poisson distribution:

> fitdistr(x,"poisson")

Wow! That was easy. So what do 804 data points (the number of editions in Proot and Egghe's sample) look like in a Poisson distribution with a lamda value of 1.2975? It looks like this, which is nothing like our data:

> hist(rpois(804,1.2975))

What went wrong?

Picking up where we left off: What Proot and Egghe are observing is a truncated Poisson distribution: There are 0-copy editions, and we don't know how many--but they also affect the distribution's defining parameter. Burrell's equations (1) and (2) give the formulas for a truncated Poisson distribution and its log-likelihood function, which is needed for determining the formula for a maximum likelihood estimation (or in other words: estimating the missing editions), which Burrell also provides in (3):

This looks impenetrable, but it isn't so bad: Lamda (λ) is the number we're looking for, e is Euler's number (2.718...), and x-bar is just the average number of copies per edition, 907/804 = 1.1281. So all we have to do is solve for λ...which, as Burrell points out, is not possible by using algebra. It can only be estimated numerically. How are we going to do that?

With R, of course. Here's Burrell's equation (3) in R.

> g <- function (lamda) (lamda/(1-exp(1)^-lamda))-907/804

Now that R knows what the equation is, let's tell R to solve it for us. We have to give R a range to look for the root in. We'll try a range between "very small" and 5, since only values above 0 make sense (we could try a range from -1 to 5 for the same result).

> uniroot(g,c(.0001,5))
[1] 0.2461318

[1] -2.516872e-07

[1] 6

[1] NA

[1] 6.103516e-05

And there's λ = .2461, just like Burrell said it would be. Now that we know λ, we can create Burrell's equation (1) and generate some expected values:

> h <- function (r) (((exp(1)^.2461)-1)^-1)*((.2461^r)/factorial(r))

Between the range of  1 and 5, let's see what we would expect to find, given 804 editions:

> h(seq(1:5))*804
[1] 709.12157887  87.25741028   7.15801622   0.44039695   0.02167634

Graphically, it's clear that there's a close match between what we expect to find and what Proot and Egghe's data - note how closely the red and blue dots are:

> range<-seq(1:5)
> y<-h(range)*804
> plot(x,col="red",pch=19);points(range,y,col="blue",pch=19)

This is an important step, as it tells us that we're not completely mistaken in treating Proot and Egghe's data as a truncated Poisson distribution. Burrell also provides a statistical test of goodness of fit, but we'll come back to that later. Now that we know λ, we can calculate the fraction of 0-copy editions with the formula that Burrell provides on p. 103:

Or in R:

> exp(-.2461)
[1] 0.781844

And there we have it. If there are 804 known editions, then there are 804/(1-.781844) = 3685 missing editions (approximately - we'll look at confidence intervals later).

> 804/(1-exp(-.2461))
[1] 3685.437

So that gives us 3685 + 804 = 4489 total editions. What do we expect a truncated Poisson distribution with a lamda of .2461 and 4489 data points to look like?

> x<-rpois(4489,.2461)
> x<-x[x>0]
> table(x)

  1   2   3   4
850 109   8   1  

> hist(x,breaks=c(0,1,2,3,4))

Now that's more like it:

Next time: Confidence intervals, a quibble with Burrell, and final thoughts.

Friday, October 10, 2014

Egenolff, Grünpeck, and list prophecies

The most significant contribution to the formation of a canon of fifteenth- and sixteenth-century German prophecies was probably that of the Frankfurt printer Christian Egenolff the Elder. After a brush with the law in Strasbourg in the early 1530s, Egenolff moved to Frankfurt and became one of the leading publishers of popular vernacular literature. Between 1531 and 1537, Egenolff published a collection of sibylline prophecies, Zwölff Sibyllen Weissagungen, along with some additional material (VD16 Z 941-945). A decade later, Egenolff expanded the collection significantly by adding the work of Johannes Lichtenberger, Johann Carion, and Paracelsus (VD16 P 5065-5066, P 5068). Reprints of both the smaller and the larger collections appear through the end of the sixteenth century and into the seventeenth and even eighteenth centuries.

While working on Printing and Prophecy, I wanted to consult the 1537 edition (VD16 Z 945), but by itself it didn't seem enough to justify a trip to Wolfenbüttel when obtaining a facsimile would be cheaper - but then, after I was back in the U.S., I learned the the condition of the volume was too fragile for the Herzog August Bibliothek to digitize it. Now that I've had a chance to look at it, it turns out to contain a few surprises.

The collection is the first of Egenolff's to include a work by Josef Grünpeck, but it isn't the same work that shows up in the expanded 1548-50 editions (that work, Von der Reformation der Christenheyt und der Kirchen, is known only from the latest Egenolff collections). For the 1537 sibylline collection, Egenolff instead included Grünpeck's Prognosticum of 1532 (VD16 G 3634-3640, ZV 7115, ZV 23147), making the 1537 Egenolff collection the latest edition of that work. Since the Prognosticum didn't foresee much left of the world's future after 1540, Egenolff in 1548 reached farther back to find a more timely work by Grünpeck.

The other unusual feature of the 1537 edition is that it claims to include a work by Filippo Cattaneo, identifed as being "vom Thurn auß Italia." Another prognostication attributed to Cattaneo appeared in 1535 (VD16 C 1725) - but that is not the work included by Egenolff in 1537. Instead, the work that appears in the 1537 collection is a set of terse prophecies for 1537-1550. The basic structure is a governing planet and year, the abundance of oil, wine, and grain, and then an additional brief prognostication. For example:
Mon 1537.
Ols wenig
Weins abgang
Treydt wenig
Zwitracht der Fürsten.
The interesting thing is what happens when you look only at the year and last line:
Mon 1537. Zwitracht der Fürsten
Mars 1538. Vil pestilent in allen landen
Mercurius 1539. Krieg undern Christen
Jupiter 1540. Sterben der Fürsten
Saturnus 1541. Frid in allen landen
Sonn 1542. Außbruch der verschlossen juden
Mon 1543. Juden wider die Christen
Mars 1544. Juden von Christen überwunden
Jupiter 1545. Falsche Propheten
Venus 1546. Vil laster und übels
Saturnus 1547. Kranckheyt der Pestilentz
Sonn 1548. Ein heylger man
1549. Wütten der ungleubigen
1550. Durch disen heyligen man werden alle unglaubigen zum glauben bekert / Als dann würdt werden ein Schaffstal / ein Hirt / und ein Herr / der die gantze welt under seine Herrschafft bringen und regieren würdt. Unnd als dann würdt das Gülden alter herfür kommen.
This looks to me very much like a predecessor of the list prophecy for 1570-80. Compare "Sterben der Fürsten" with "Pastor non erit," or "Ein heylger man" with "Surget maximus vir," or the culmination in "ein Schaffstal / ein Hirt / und ein Herr" with "unum ovile et unus pastor." All of these are common prophetic tropes, of course, but their combination in a list rather than a narrative suggests there may be some direct influence.

The set of people who are interested in Joseph Grünpeck, Christian Egenolff, and the list prophecy for 1570-80 may be somewhat restricted. But for those people, VD16 Z 945 is an extremely interesting edition.

Friday, September 26, 2014

Let's learn R: Burrell on the binomial distribution

With this post, we'll start using R to retrace the steps of Quentin L. Burrell's article, "Some Comments on 'The Estimation of Lost Multi-Copy Documents: A New Type of Informetrics Theory' by Egghe and Proot," and translate it from the language of statistics into the language of the humanities.

Burrell's first important point (102) is:
A binomial distribution Bin(n, θ) with n large and θ small can be approximated by a Poisson distribution with mean λ = nθ. 

What does this mean, and why does it matter?

Burrell is responding to Proot and Egghe's formula for estimating lost editions, and they use a formula for the binomial distribution as their basis for their work. Burrell is restating a basic fact from statistics: For a data set where the number of attempts at success is large and the chance of success is small, the data will tend to fit not only a binomial distribution but also a Poisson distribution, which has some useful features for the kind of analysis he will be doing later. Where a binomial distribution has two parameters, the number of trials and the chance of success (n and θ, respectively), a Poisson distribution has only one parameter, its mean (or λ).

Is n large? In this case, yes, since n is the number of copies of each Jesuit play program that were printed. Although n isn't the same for every play program, it's a safe assumption that it's at least 150 or more, so the difference in print runs ends up not making much difference (as Proot and Egghe show in their article).

Is θ small? In this case, yes, since θ represents the likelihood of a copy to survive. With at most 5 copies of each edition surviving, θ is probably no higher than somewhere in the neighborhood of .02, and much lower for editions for fewer or no surviving copies.

So in these circumstances, can we approximate a binomial distribution with a Poisson distribution? Let's use R to illustrate how it works.

One thing that R makes quite easy is generating random data that follows a particular distribution. Let's start with 3500 observations (or, we might say, editions originally printed), with 250 trials in each (or copies of each edition that might survive) and a very low probability that any particular copy will survive, or .002. In R, we can simulate this as

> x <- rbinom (3500, 250, .002)

In other words, fill up the vector x with 3500 random points that follow a binomial distribution. What does it look like?

> hist (x) 

Now let's compare random points that follow Poisson distribution. We want 3500 data points again, and we also need to provide a value for λ. Since we want to see if λ really is equal to , we'll use 250*.002 or .5 for λ.

> y <- rpois (3500,.5)

Let's see what the result looks like:

> hist (y)

If you look very closely, you can see that the histograms are very similar but not quite identical - which you would expect for 3500 points chosen at random.

Our data sets only have whole-number values, so let's make the histograms reflect that:

> hist (x, breaks=c(0,1,2,3,4,5))

> hist (y, breaks=c(0,1,2,3,4,5))

Let's compare the two sets of values to see how close they are. Whenever we create a histogram, R automatically creates some information about that histogram. Let's access it:

> xinfo <- hist (x, breaks=c(0,1,2,3,4,5))
> yinfo <- hist (y, breaks=c(0,1,2,3,4,5))

Typing xinfo results in a lot of useful information, but we specifically want the counts. We could type xinfo$counts, but we want to view our data in a nice table. So let's try this:

> table <- data.frame (xinfo$counts, yinfo$counts, row.names=c(0,1,2,3,4))

(We gave the table the row names of 0-4, because otherwise the row name for zero observations would be "1," and the row name for one observation would be "2," and so on, which would be confusing.) Typing table results in this:

  xinfo.counts yinfo.counts
0         3204         3208
1          252          245
2           37           41
3            6            5
4            1            1

So there you have it. For the material Proot and Egghe are dealing with, a Poisson distribution will work as well as a binomial distribution. Next time: truncated distributions, or what happens when you don't know how many editions you have to begin with.

Friday, September 12, 2014

Let's learn R: histograms for the humanities

To avoid all misunderstanding, let me make two things completely clear at the beginning: I don't know statistics and I don't know R. I did a math minor as an undergraduate, and I like to experiment with software that looks like it might be useful. That's all. Comments and corrections are welcome.

But it's been clear to me for some time that several research problems in the humanities have statistical implications, or could be productively addressed through statistical methods, and the gradually rising profile of the digital humanities will likely make statistical methods increasingly useful.

Estimating the number of lost incunable editions is one example of a research problem that required statistical expertise. For that, Paul Needham and I partnered with Frank McIntyre, an econometrician now at Rutgers. But I can't bug Frank every time I have a statistics question. He actually has to work in his own field now and again. Even when we're working on a project together, I need to understand his part enough to ask intelligent questions about it, but I haven't been able to retrace his steps, or the steps of any other statistician.

This is where R comes in. R is a free and open source statistical software package with a thriving developer community. I've barely scratched the surface of it, but I can already see that R makes some things very easy that are difficult without it.

Like histograms. Histograms are simply graphs of how many items fit into some numerical category. If you have a long list of books and the years they were printed, how many were printed in the 1460s, 1470s, or 1480s? Histograms represent data in a way that is easy to grasp. If you've ever tried it, however, you know that histograms are a huge pain to make in Excel (and real statisticians complain about the result in any case).

To illustrate how easy it is in R, let's turn again to Eric White's compilation of known print runs of fifteenth-century books. Here's how to make the following histogram in R:

> hist (printrun)

It really is that fast and simple.

Of course, there are a few additional steps that go into it. First, we need some data. We could put our data in a simple text file (named data.txt for this example) and open the text file. On Windows, R looks for files by default in the Documents library directory. If we only include the print runs in the text file, then R will number each row and assign a column header automatically.

To read the file into R, we need to assign it to a vector, which we'll call x:

> x <- read.table ("data.txt")

If we want to use our own column header, then we need to specify it:

> x <- read.table ("data.txt", col.names="printrun")

But my data is already a column in an Excel spreadsheet with the column header printrun, so I can just copy the column and paste it into R with the following command (on Windows) that tells R to read from the clipboard and make the first thing it finds a column header rather than data:

> x <- read.table ("clipboard", header=TRUE)

To see what's now in x, you just need to type

> x

and the whole thing will be printed out on screen. We can now access the print runs by referring to that column as x$printrun, but it might simplify the process by telling R to keep the table and its variables - its column headers, in other words - in mind:

> attach (x)

At this point, we can start playing with histograms by issuing the command hist (printrun) as above. To see all our options, we can get the help page by typing

> ?hist

The first histogram is nice, but I would really like to have solid blue bars outlined in black:

> hist (printrun, col="blue", border="black")

The hist() command uses an algorithm to determine sensible bin widths, or how wide each column is. But 500 seems too wide to me, so let's specify something else.  If I want bin widths of 250, I could specify the number of breaks as twenty:

> hist (printrun, col="blue",border="black", breaks=20)

Last time, I used bin widths of 135 and found a bimodal distribution. To do that again, I'll need to specify a range of 0 to 5130 (which is 38 * 135):

> hist (printrun, col="blue",border="black", breaks=seq (0, 5130, by=135))

To generate all the images for this post, I told R to output PNG files (it also can do TIFFs, JPGs, BMPs, PDFs, and other file types, including print-quality high resolution images, and there are also label and formatting options that I haven't mentioned here).

> png ()
> hist (printrun)
> hist (printrun, col="blue", border="black")
> hist (printrun, col="blue",border="black", breaks=20)
> hist (printrun, col="blue",border="black", breaks=seq (0, 5130, by=135))
> ()

It's as simple as that. Putting high-powered tools for statistical analysis into the hands of humanities faculty. What could possibly go wrong?

* * *

I have a few more posts about using R, but it may be a few weeks before the next one, as I'll be traveling for two conferences in the next several weeks. Next time, I'll start using R to retrace the steps of Quentin L. Burrell's article, "Some Comments on 'The Estimation of Lost Multi-Copy Documents: A New Type of Informetrics Theory' by Egghe and Proot," Journal of Informetrics 2 (January 2008): 101–5. It's an important article, and R can help us translate it from the language of statistics into the language of the humanities.

Friday, September 5, 2014

Abstract: "Bibliographic databases and early printing: Questions we can’t answer and questions we can’t ask"

For the upcoming meeting of the Wolfenbütteler Arbeitskreises für Bibliotheks-, Buch- und Mediengeschichte on the topic of book history and digital humanities, the paper I'm giving will briefly summarize the paper I gave at the "Lost Books" conference in June, then look specifically at the bibliographical databases of early printing that we have today. These databases are invaluable, but their current design makes some kinds of research easy and other kinds quite difficult. Along with charts and graphs, my presentation will look at some specific examples of what can and can't be done at the moment, and offer some suggestions of what might be done in the future.

Abstract: Bibliographic databases and early printing: Questions we can’t answer and questions we can’t ask.
In 1932, Ernst Consentius first proposed addressing the question of how many incunable editions have been lost by graphing the number of editions preserved in a few copies or just one and projecting an estimate based on that graph. The problem Consentius posed is in fact only a variation of a problem that can be found in many academic fields, known in English since 1943 as the “unseen species problem,” although it has not been recognized as such until very recently. Using the well-established statistical methods and tools for approaching the unseen species problem, I and Frank McIntyre have recently updated the estimate that we first published in 2010. Depending on the assumptions used, our new estimate is that between 33% (of all editions) and 49% (of codex editions, plus an indeterminate but large number of broadsides) have been lost.

The problem of estimating lost editions exemplifies how data-drive approaches can support book history, but it also illustrates how databases of early printing impose limits on research in the way they structure their records and in the user interfaces by which they make data available. Of the current database projects relevant for early printing in Germany (GW, ISTC, VD16, VD17, and USTC), each has particular advantages in the kinds of data it offers, but also particular disadvantages in how it permits that data to be searched and accessed. Clarity and consistency of formatting would help in some cases. All of the databases could profit by adding information that only one or none of the databases currently provide, such as leaf counts. User interfaces should reveal more of each database’s fields, rather than making them only implicitly visible through search results. Monolithic imprint lines, particularly those that make use of arcane or archaic terminology, must be replaced by explicit differentiation of printers and publishers.

Of the current databases, the technological advantages of VD16 are often overlooked.  Its links from editions to a separate listing of shelf marks makes it possible to count copies more simply and accurately than any other database, and its links from authors’ names to an external authority file of biographical data provide the basis for characterizing the development of printing in the sixteenth century.  Most importantly, VD16 provides open access to its MARC-formatted records, allowing an unequaled ease and accuracy when analyzing records of sixteenth-century printing. Many VD16 records lack information about such basic information as their language, however.

The missing language fields in VD16 provide an example of the challenges faced in attempting to compare bibliographic data across borders or centuries. One approach to this problem, taken by the USTC as a database of databases, is to offer relatively sparse data to users. I suggest as an alternative to this a different approach: Databases should open their data to contributions from and analysis by scholars all over the world by making their records freely available. Doing so will allow scholars of book history to pursue data-driven approaches to questions in our field.

Friday, August 29, 2014

The Strange and Terrible Visions of Wilhelm Friess, Chapter Seven: The Last Emperor and the Beginning of Prophecy

The last chapter of The Strange and Terrible Visions does three things. First, it wraps up the argument about how the two prophecies of Wilhelm Friess are connected. Rather than the same name being used arbitrarily for two different texts published twenty years apart, "Friess II" is connected to "Friess I" by a chain of textual influence and historical context where the links are separated by just a few years, rather than a few decades: Basel editions beginning in 1577 of a text grounded in the situation of Strasbourg's Reformed community in 1574 based on historical resonance with Reformed civic unrest in Antwerp in 1567 and in reaction to a specifically Lutheran branch of "Friess I" which was circulating in the Netherlands in 1566.

The second item of interest involves taking a look at late appearances of the prophecies of Wilhelm Friess. "Friess II" is interesting because the latest edition, published in 1639 as the Thirty Years' War was giving new relevance to a prophecy about foreign armies laying waste to Germany, reflects an earlier stage of the text than any other edition. "Friess I" reappears even later, in several editions of 1686-91, when Louis XIV's annexation of Strasbourg made him a prime candidate for the tyrannical Antichrist of the west.

Finally, I suggest four characteristic ways that prophetic texts develop:
  1. Selective reception of an earlier prophecy
  2. Expansion into a complete prophecy
  3. A historical context that requires veiling the message
  4. Adaptation to a new context
These steps can come in any order, and one or more of them may be lacking in the history of a particular prophecy, but as you trace the historical development of prophetic texts, you will encounter all of them many times.

Friday, August 15, 2014

Adding an author to a fragmentary incunable prognostication

Identifying the author of a fragmentary annual prognostication is sometimes difficult, requiring a significant amount of research. Other times, it only takes a few minutes.

This week, the Heidelberg university library released a facsimile of ISTC ip01005937/GW M35611, a fairly extensive fragment that is lacking an incipit, so no author had been identified. Leaf 1v did provide a complete list of chapters, however, and the structure looked familiar. After checking my records, I found that the text of the fragment was identical to that of another incucanble edition: Bernardinus de Luntis, Judicium for 1492 (Rome: Stephan Plannck, [around 1492]; ISTC il00392200, GW M19510). The identity of the two texts can be verified by comparing the facsimile provided by the BSB, so Bernardinus de Luntis should be added as the author to ISTC ip01005937/GW M35611. I sent a note on to Berlin to that effect.

Bernardinus de Luntis is otherwise known to ISTC/GW only through one additional practica, for 1493: ISTC il00392300/GW M19511, again printed by Stephan Plannck. It's not unusual for astrologers to have a career in print that only lasted a few years, but the distribution of surviving copies is a bit odd in this case: apart from one copy in the Vatican, the other three are all in German-speaking Europe, in Basel, Heidelberg, and Munich. Apart from a brief mention by Simon de Phares in his Recueil des plus célèbres astrologues, not much appears to be known about Bernardinus de Luntis.

Update: Klaus Graf adds a few more references to Bernardinus de Luntis here.