Friday, September 26, 2014

Let's learn R: Burrell on the binomial distribution

With this post, we'll start using R to retrace the steps of Quentin L. Burrell's article, "Some Comments on 'The Estimation of Lost Multi-Copy Documents: A New Type of Informetrics Theory' by Egghe and Proot," and translate it from the language of statistics into the language of the humanities.

Burrell's first important point (102) is:
A binomial distribution Bin(n, θ) with n large and θ small can be approximated by a Poisson distribution with mean λ = nθ. 

What does this mean, and why does it matter?

Burrell is responding to Proot and Egghe's formula for estimating lost editions, and they use a formula for the binomial distribution as their basis for their work. Burrell is restating a basic fact from statistics: For a data set where the number of attempts at success is large and the chance of success is small, the data will tend to fit not only a binomial distribution but also a Poisson distribution, which has some useful features for the kind of analysis he will be doing later. Where a binomial distribution has two parameters, the number of trials and the chance of success (n and θ, respectively), a Poisson distribution has only one parameter, its mean (or λ).

Is n large? In this case, yes, since n is the number of copies of each Jesuit play program that were printed. Although n isn't the same for every play program, it's a safe assumption that it's at least 150 or more, so the difference in print runs ends up not making much difference (as Proot and Egghe show in their article).

Is θ small? In this case, yes, since θ represents the likelihood of a copy to survive. With at most 5 copies of each edition surviving, θ is probably no higher than somewhere in the neighborhood of .02, and much lower for editions for fewer or no surviving copies.

So in these circumstances, can we approximate a binomial distribution with a Poisson distribution? Let's use R to illustrate how it works.

One thing that R makes quite easy is generating random data that follows a particular distribution. Let's start with 3500 observations (or, we might say, editions originally printed), with 250 trials in each (or copies of each edition that might survive) and a very low probability that any particular copy will survive, or .002. In R, we can simulate this as

> x <- rbinom (3500, 250, .002)

In other words, fill up the vector x with 3500 random points that follow a binomial distribution. What does it look like?

> hist (x) 

Now let's compare random points that follow Poisson distribution. We want 3500 data points again, and we also need to provide a value for λ. Since we want to see if λ really is equal to , we'll use 250*.002 or .5 for λ.

> y <- rpois (3500,.5)

Let's see what the result looks like:

> hist (y)

If you look very closely, you can see that the histograms are very similar but not quite identical - which you would expect for 3500 points chosen at random.

Our data sets only have whole-number values, so let's make the histograms reflect that:

> hist (x, breaks=c(0,1,2,3,4,5))

> hist (y, breaks=c(0,1,2,3,4,5))

Let's compare the two sets of values to see how close they are. Whenever we create a histogram, R automatically creates some information about that histogram. Let's access it:

> xinfo <- hist (x, breaks=c(0,1,2,3,4,5))
> yinfo <- hist (y, breaks=c(0,1,2,3,4,5))

Typing xinfo results in a lot of useful information, but we specifically want the counts. We could type xinfo$counts, but we want to view our data in a nice table. So let's try this:

> table <- data.frame (xinfo$counts, yinfo$counts, row.names=c(0,1,2,3,4))

(We gave the table the row names of 0-4, because otherwise the row name for zero observations would be "1," and the row name for one observation would be "2," and so on, which would be confusing.) Typing table results in this:

  xinfo.counts yinfo.counts
0         3204         3208
1          252          245
2           37           41
3            6            5
4            1            1

So there you have it. For the material Proot and Egghe are dealing with, a Poisson distribution will work as well as a binomial distribution. Next time: truncated distributions, or what happens when you don't know how many editions you have to begin with.

No comments:

Post a Comment