Friday, September 12, 2014

Let's learn R: histograms for the humanities

To avoid all misunderstanding, let me make two things completely clear at the beginning: I don't know statistics and I don't know R. I did a math minor as an undergraduate, and I like to experiment with software that looks like it might be useful. That's all. Comments and corrections are welcome.

But it's been clear to me for some time that several research problems in the humanities have statistical implications, or could be productively addressed through statistical methods, and the gradually rising profile of the digital humanities will likely make statistical methods increasingly useful.

Estimating the number of lost incunable editions is one example of a research problem that required statistical expertise. For that, Paul Needham and I partnered with Frank McIntyre, an econometrician now at Rutgers. But I can't bug Frank every time I have a statistics question. He actually has to work in his own field now and again. Even when we're working on a project together, I need to understand his part enough to ask intelligent questions about it, but I haven't been able to retrace his steps, or the steps of any other statistician.

This is where R comes in. R is a free and open source statistical software package with a thriving developer community. I've barely scratched the surface of it, but I can already see that R makes some things very easy that are difficult without it.

Like histograms. Histograms are simply graphs of how many items fit into some numerical category. If you have a long list of books and the years they were printed, how many were printed in the 1460s, 1470s, or 1480s? Histograms represent data in a way that is easy to grasp. If you've ever tried it, however, you know that histograms are a huge pain to make in Excel (and real statisticians complain about the result in any case).
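To make the counting concrete, here is a minimal sketch of what a histogram tallies, using a handful of printing years that I invented purely for illustration (they are not real bibliographic data):

```r
# Hypothetical printing years, invented for this example only
years <- c(1465, 1468, 1471, 1472, 1474, 1479, 1481, 1483, 1483, 1488)

# Round each year down to the start of its decade
decade <- floor(years / 10) * 10

# Count how many books fall into each decade
counts <- table(decade)
print(counts)
```

A histogram is essentially this table drawn as bars, one bar per decade.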

To illustrate how easy it is in R, let's turn again to Eric White's compilation of known print runs of fifteenth-century books. Here's how to make the following histogram in R:

> hist (printrun)


It really is that fast and simple.

Of course, there are a few additional steps that go into it. First, we need some data. We could put our data in a simple text file (named data.txt for this example) and open the text file. On Windows, R looks for files by default in the Documents library directory. If we only include the print runs in the text file, then R will number each row and assign a column header automatically.

To read the file into R, we need to assign it to a variable, which we'll call x (R stores the result as a table, which it calls a "data frame"):

> x <- read.table ("data.txt")

If we want to use our own column header, then we need to specify it:

> x <- read.table ("data.txt", col.names="printrun")
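For anyone who wants to try this without preparing a file first, the following sketch writes a few placeholder numbers to a temporary file and reads them back both ways. The values and the temporary file are my own assumptions for the sake of the example, not Eric White's data:

```r
# Write a few invented print runs to a temporary file, one per line
tmp <- tempfile(fileext = ".txt")
writeLines(c("300", "500", "1000", "275"), tmp)

# Without col.names, R supplies its own column header (V1)
x_default <- read.table(tmp)
names(x_default)

# With col.names, the column gets the header we asked for
x <- read.table(tmp, col.names = "printrun")
names(x)

class(x)      # read.table returns a data frame
x$printrun    # the numbers themselves, as a plain vector
```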

But my data is already a column in an Excel spreadsheet with the column header printrun, so I can just copy the column and paste it into R with the following command (on Windows) that tells R to read from the clipboard and make the first thing it finds a column header rather than data:

> x <- read.table ("clipboard", header=TRUE)

To see what's now in x, you just need to type

> x

and the whole thing will be printed out on screen. We can now access the print runs by referring to that column as x$printrun, but we can simplify things by telling R to keep the table and its variables - its column headers, in other words - in mind:

> attach (x)
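As far as I can tell, attach () is a convenience rather than a requirement, and it can cause confusion if another object happens to share a column's name. A small sketch of two alternatives that leave the table unattached, again using invented numbers:

```r
# Invented values standing in for the real print runs
x <- data.frame(printrun = c(300, 500, 1000, 275))

# Option 1: name the column explicitly
mean(x$printrun)

# Option 2: evaluate an expression "inside" the table without attaching it
m <- with(x, mean(printrun))
m
```

Either form could be substituted wherever a bare printrun appears in the commands that follow.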

At this point, we can start playing with histograms by issuing the command hist (printrun) as above. To see all our options, we can get the help page by typing

> ?hist

The first histogram is nice, but I would really like to have solid blue bars outlined in black:

> hist (printrun, col="blue", border="black")

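The hist() command uses an algorithm to determine sensible bin widths, or how wide each column is. But 500 seems too wide to me, so let's specify something else. If I want bin widths of around 250, I could ask for twenty breaks (R treats a single number here as a suggestion and rounds to convenient values, so the resulting widths may not be exactly what was requested):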

> hist (printrun, col="blue",border="black", breaks=20)
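Because a single number passed to breaks is only a suggestion, it can be worth checking which boundaries R actually settled on. The sketch below uses randomly generated placeholder data (an assumption on my part, not the real print runs) and asks hist () to return its calculations without drawing anything:

```r
# Placeholder data: 200 invented print runs between 100 and 5000
set.seed(1)
printrun <- round(runif(200, min = 100, max = 5000))

# plot = FALSE returns the histogram's numbers instead of a picture
h <- hist(printrun, breaks = 20, plot = FALSE)
h$breaks         # the boundaries R chose
diff(h$breaks)   # the resulting bin widths

# To insist on widths of exactly 250, spell out the boundaries
h250 <- hist(printrun, breaks = seq(0, 5000, by = 250), plot = FALSE)
diff(h250$breaks)
```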


Last time, I used bin widths of 135 and found a bimodal distribution. To do that again, I'll need to specify a range of 0 to 5130 (which is 38 * 135):

> hist (printrun, col="blue",border="black", breaks=seq (0, 5130, by=135))
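The arithmetic behind that range can be checked directly. In this sketch the print runs are invented stand-ins; the point is only that the sequence has to reach from 0 to at least the largest value, since hist () stops with an error if any value falls outside the boundaries it is given:

```r
# Boundaries every 135 copies, from 0 up to 38 * 135
bin_edges <- seq(0, 5130, by = 135)
max(bin_edges)            # upper limit of the last bin
length(bin_edges) - 1     # number of bins

# Invented print runs, all within the range covered by bin_edges
printrun <- c(150, 275, 300, 500, 1000, 1025, 2000, 5000)

h <- hist(printrun, breaks = bin_edges, plot = FALSE)
h$counts                  # how many print runs landed in each bin
```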

To generate all the images for this post, I told R to output PNG files (it also can do TIFFs, JPGs, BMPs, PDFs, and other file types, including print-quality high resolution images, and there are also label and formatting options that I haven't mentioned here).

> png ()
> hist (printrun)
> hist (printrun, col="blue", border="black")
> hist (printrun, col="blue",border="black", breaks=20)
> hist (printrun, col="blue",border="black", breaks=seq (0, 5130, by=135))
> dev.off ()
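When png () is called without a file name, it falls back on a default naming pattern (Rplot001.png, Rplot002.png, and so on) in the working directory. Here is a sketch of how one might choose the names and image size instead; the file name pattern, the dimensions, and the data are all assumptions for illustration:

```r
# Invented print runs standing in for the real data
printrun <- c(150, 275, 300, 500, 1000, 1025, 2000, 5000)

# %02d is replaced by the plot number, so each plot gets its own file
out <- file.path(tempdir(), "printrun-hist-%02d.png")
png(filename = out, width = 800, height = 600)

hist(printrun)
hist(printrun, col = "blue", border = "black")

# Close the device so the files are actually written
dev.off()

# List the files that were created
created <- list.files(tempdir(), pattern = "^printrun-hist-")
created
```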

It's as simple as that. Putting high-powered tools for statistical analysis into the hands of humanities faculty. What could possibly go wrong?

* * *

I have a few more posts about using R, but it may be a few weeks before the next one, as I'll be traveling for two conferences in the next several weeks. Next time, I'll start using R to retrace the steps of Quentin L. Burrell's article, "Some Comments on 'The Estimation of Lost Multi-Copy Documents: A New Type of Informetrics Theory' by Egghe and Proot," Journal of Informetrics 2 (January 2008): 101–5. It's an important article, and R can help us translate it from the language of statistics into the language of the humanities.

Friday, September 5, 2014

Abstract: "Bibliographic databases and early printing: Questions we can’t answer and questions we can’t ask"

For the upcoming meeting of the Wolfenbütteler Arbeitskreis für Bibliotheks-, Buch- und Mediengeschichte on the topic of book history and digital humanities, the paper I'm giving will briefly summarize the paper I gave at the "Lost Books" conference in June, then look specifically at the bibliographical databases of early printing that we have today. These databases are invaluable, but their current design makes some kinds of research easy and other kinds quite difficult. Along with charts and graphs, my presentation will look at some specific examples of what can and can't be done at the moment, and offer some suggestions of what might be done in the future.

Abstract: Bibliographic databases and early printing: Questions we can’t answer and questions we can’t ask.
In 1932, Ernst Consentius first proposed addressing the question of how many incunable editions have been lost by graphing the number of editions preserved in a few copies or just one and projecting an estimate based on that graph. The problem Consentius posed is in fact only a variation of a problem that can be found in many academic fields, known in English since 1943 as the “unseen species problem,” although it has not been recognized as such until very recently. Using the well-established statistical methods and tools for approaching the unseen species problem, Frank McIntyre and I have recently updated the estimate that we first published in 2010. Depending on the assumptions used, our new estimate is that between 33% (of all editions) and 49% (of codex editions, plus an indeterminate but large number of broadsides) have been lost.

The problem of estimating lost editions exemplifies how data-driven approaches can support book history, but it also illustrates how databases of early printing impose limits on research in the way they structure their records and in the user interfaces by which they make data available. Of the current database projects relevant for early printing in Germany (GW, ISTC, VD16, VD17, and USTC), each has particular advantages in the kinds of data it offers, but also particular disadvantages in how it permits that data to be searched and accessed. Clarity and consistency of formatting would help in some cases. All of the databases could profit by adding information that only one or none of the databases currently provides, such as leaf counts. User interfaces should reveal more of each database’s fields, rather than making them only implicitly visible through search results. Monolithic imprint lines, particularly those that make use of arcane or archaic terminology, must be replaced by explicit differentiation of printers and publishers.

Of the current databases, the technological advantages of VD16 are often overlooked. Its links from editions to a separate listing of shelf marks make it possible to count copies more simply and accurately than any other database, and its links from authors’ names to an external authority file of biographical data provide the basis for characterizing the development of printing in the sixteenth century. Most importantly, VD16 provides open access to its MARC-formatted records, allowing an unequaled ease and accuracy when analyzing records of sixteenth-century printing. Many VD16 records lack such basic information as their language, however.

The missing language fields in VD16 provide an example of the challenges faced in attempting to compare bibliographic data across borders or centuries. One approach to this problem, taken by the USTC as a database of databases, is to offer relatively sparse data to users. I suggest a different approach: Databases should open their data to contributions from and analysis by scholars all over the world by making their records freely available. Doing so will allow scholars of book history to pursue data-driven approaches to questions in our field.

Friday, August 29, 2014

The Strange and Terrible Visions of Wilhelm Friess, Chapter Seven: The Last Emperor and the Beginning of Prophecy

The last chapter of The Strange and Terrible Visions does three things. First, it wraps up the argument about how the two prophecies of Wilhelm Friess are connected. Rather than the same name being used arbitrarily for two different texts published twenty years apart, "Friess II" is connected to "Friess I" by a chain of textual influence and historical context where the links are separated by just a few years, rather than a few decades: Basel editions beginning in 1577 of a text grounded in the situation of Strasbourg's Reformed community in 1574 based on historical resonance with Reformed civic unrest in Antwerp in 1567 and in reaction to a specifically Lutheran branch of "Friess I" which was circulating in the Netherlands in 1566.

The second item of interest involves taking a look at late appearances of the prophecies of Wilhelm Friess. "Friess II" is interesting because the latest edition, published in 1639 as the Thirty Years' War was giving new relevance to a prophecy about foreign armies laying waste to Germany, reflects an earlier stage of the text than any other edition. "Friess I" reappears even later, in several editions of 1686-91, when Louis XIV's annexation of Strasbourg made him a prime candidate for the tyrannical Antichrist of the west.

Finally, I suggest four characteristic ways that prophetic texts develop:
  1. Selective reception of an earlier prophecy
  2. Expansion into a complete prophecy
  3. A historical context that requires veiling the message
  4. Adaptation to a new context
These steps can come in any order, and one or more of them may be lacking in the history of a particular prophecy, but as you trace the historical development of prophetic texts, you will encounter all of them many times.

Friday, August 15, 2014

Adding an author to a fragmentary incunable prognostication

Identifying the author of a fragmentary annual prognostication is sometimes difficult, requiring a significant amount of research. Other times, it only takes a few minutes.

This week, the Heidelberg university library released a facsimile of ISTC ip01005937/GW M35611, a fairly extensive fragment that is lacking an incipit, so no author had been identified. Leaf 1v did provide a complete list of chapters, however, and the structure looked familiar. After checking my records, I found that the text of the fragment was identical to that of another incunable edition: Bernardinus de Luntis, Judicium for 1492 (Rome: Stephan Plannck, [around 1492]; ISTC il00392200, GW M19510). The identity of the two texts can be verified by comparing the facsimile provided by the BSB, so Bernardinus de Luntis should be added as the author to ISTC ip01005937/GW M35611. I sent a note on to Berlin to that effect.

Bernardinus de Luntis is otherwise known to ISTC/GW only through one additional practica, for 1493: ISTC il00392300/GW M19511, again printed by Stephan Plannck. It's not unusual for astrologers to have a career in print that only lasted a few years, but the distribution of surviving copies is a bit odd in this case: apart from one copy in the Vatican, the other three are all in German-speaking Europe, in Basel, Heidelberg, and Munich. Apart from a brief mention by Simon de Phares in his Recueil des plus célèbres astrologues, not much appears to be known about Bernardinus de Luntis.

Update: Klaus Graf adds a few more references to Bernardinus de Luntis here.

Friday, August 8, 2014

Abstract: "And also upon the servants and upon the handmaids: Illiteracy and the struggle for authority in the lost visions of Lienhard Jost"

This is the abstract I submitted for the paper I'll be giving at the annual conference of the German Studies Association in September in the session "Prophecy and Identity in Medieval and Early Modern Germany." It's based on a paper that was recently accepted by Sixteenth Century Studies, although it may not be in print until next fall or later. This is the first time in several years that I'm dealing not just with a prophetic text but with an actual visionary as a source. I'm excited about this project, as I think it might point in some interesting new directions, not specifically at the Strasbourg prophets (it sounds like Christina Moss, a doctoral student at Waterloo, will do that in her dissertation, which I'm looking forward to seeing). Instead, I think the case of Lienhard Jost might just point towards a general theory of prophecy, from vision to text to excerpt to literary allusion.

In any case, here's the abstract.

Abstract: "And also upon the servants and upon the handmaids: Illiteracy and the struggle for authority in the lost visions of Lienhard Jost"
Lienhard Jost is recognized today as one of the leading members of the Strasbourg Prophets associated with Melchior Hoffman in the 1530s. Because Jost’s works have been lost for centuries, however, basic facts of his biography have been uncertain, and he has been treated as a secondary figure behind Hoffman and behind his visionary wife Ursula Jost. With the rediscovery of Lienhard Jost’s visions in a copy preserved in Vienna, we now have access to a unique account in his own words of the experiences of an early modern German folk prophet. Jost’s visions, published by Melchior Hoffman in 1532, establish that he was an illiterate woodcutter. Popular accounts of learned discourse, including the Reformation and predictions of a second deluge in 1524, inspired Jost’s prophetic experiences. Jost’s account of his visions documents his progressive transformation with respect to literacy. At the beginning of his preaching, Jost regarded the written word as a silencing of and a loss of control over his own speech. Jost’s prophecies included not only oral preaching but also active performance of his message. In Jost’s account, these prophetic episodes transformed the city’s insane asylum, where he was confined, into a school. Although still illiterate when released, Jost saw himself as now authorized to comment on scripture and engage in other textual activities typically reserved for the learned, and he became increasingly confident in having his visions committed to writing and ultimately to print.

Friday, August 1, 2014

The Strange and Terrible Visions of Wilhelm Friess, Chapter Six: Wilhelm Friess in Strasbourg

This is the chapter that's going to get me into trouble. There are some things that prudent scholars of the sixteenth century do not engage in, and that includes attributing anonymous pamphlets to famous writers based on circumstantial evidence. What was I thinking?

The line of thought, from the basic evidence to the crime of wanton attribution, goes like this:

The early textual history of "Friess II" identifies 24 April 1574 as the first alleged date of the vision. Another passage found only in the earliest versions gives special significance to Strasbourg for the survivors of Germany's future devastation. Another passage attacking Lutheran clergy, the prophecy's attitude towards sacramental theology, and its situating of the source of salvation in the south - in Switzerland - point to an ideological home among the embattled Reformed community of Strasbourg in late spring of 1574. I argue that the demonic child-eating general from the northwest in "Friess II" was meant as a reference to Henry of Valois, king of Poland and until recently a leader of the French anti-Huguenot army and, upon the death of his brother, the presumed next king of France (where he reigned as Henry III).

The dating of "Friess II" is supported by the title page illustration of an early edition, whose eclipsed sun and moon, and conjunction and opposition involving Mars, Saturn, and Mercury, again point to 1574. The zodiacal iconography of Mars, Saturn, and Mercury is also featured prominently in the text of the prophecy. In the spring of 1574, eclipses and conjunctions and anthropomorphic planets took a notable form in Strasbourg with the completion of the cathedral's astronomical clock. The first solar and lunar eclipses on the clock's tables of eclipses are the same ones featured on the title woodcut (which is also the title illustration for The Strange and Terrible Visions of Wilhelm Friess).

As it turns out, the laudatory verse that accompanies a broadside illustration of the new clock was written by Johann Fischart, one of the leading German writers and the most accomplished satirist of the later sixteenth century. Fischart was a native of Strasbourg and a Reformed sympathizer who had earlier written some intemperate things about sacramental theology, supported the Huguenots and the Reformed of the Netherlands, had passed through Flanders at the time "Friess I" was circulating there, was known to publish works anonymously, and had knowledge of astrological and prophetic pamphlets.

That's a pretty amazing coincidence, I told myself. Maybe somebody could argue that Fischart was the author of "Friess II." Maybe I could even say what the evidence would be and what the argument would look like, if someone wanted to make that argument.

In the end, I decided the best approach was not to sketch out what a hypothetical argument might look like, but to simply make the claim that Fischart was the author, and defend that claim as forthrightly as possible. I try not to be irresponsible about it: As with other uncertain points, I give the evidence for both sides (including the not inconsequential note that one expert on Fischart finds the idea entirely ridiculous), and I avoid making Fischart's authorship of "Friess II" essential to any of my other arguments, but this time I opted to take the underdog bet.

Wednesday, July 23, 2014

All sheets are not created equal

At the recent Lost Books conference in St Andrews, a topic that came up during discussion was "survival of the fattest": Books with more leaves tend to survive in greater numbers of copies than thinner books. The USTC apparently has plans to include the number of sheets used in the production of each edition.

The number of sheets is useful, but not quite the key information that one would hope it would be. As Frank McIntyre and I were preparing our paper, we originally considered sheet counts as a way to enable comparison between formats. Whether you fold a sheet in half for a folio, or in four for a quarto, or in eight for an octavo, should be of no concern: a sheet is a sheet is a sheet.

Alas, it is not so. When it comes to book survival, how that sheet gets folded matters, as format is still the single most important variable in book survival.

For example, consider books of 8-16 sheets, including folios of 17-32 leaves, quartos of 33-64 leaves, and octavos of 65-128 leaves. Those are thin folios, handy quartos, and thick little octavos, but all composed of the same number of sheets. (Ideally, we would also factor in paper sizes, but that will have to wait for another generation of bibliographic databases.)
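The conversion from sheets to leaves is simple multiplication: a folio sheet yields 2 leaves, a quarto sheet 4, and an octavo sheet 8. A quick sketch in R, where the variable names are my own:

```r
# Leaves produced by folding one sheet in each format
leaves_per_sheet <- c(folio = 2, quarto = 4, octavo = 8)

# The range of sheet counts under consideration
sheets <- c(min = 8, max = 16)

# Multiply every format by every sheet count
leaf_ranges <- outer(leaves_per_sheet, sheets)
leaf_ranges
```

The leaf ranges quoted above start one leaf higher than the minimum column here (17, 33, and 65 rather than 16, 32, and 64), so they correspond to books of more than 8 and up to 16 sheets.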

If we graph the percentage of incunable editions that survive in a given number of copies, this is what we find:


Despite all having 8-16 sheets, 12% of these folios survive in a single copy, while 20% of the quartos and nearly 30% of the octavos are known by just one copy. Book format informed choices about production, use, and survival more than leaf count did, and in a way that can't be reduced to a simple matter of bulk.