Friday, October 17, 2014

Let's learn R: Estimating what we can't see

How amazing is R? It's so amazing that you can throw some data at it and ask R to fit it to a distribution, and R will do it. So let's take Proot and Egghe's data and throw it at a Poisson distribution (like Burrell suggests). For that, we'll need to use the fitdistr() function from the MASS library:

> require(MASS)

Also, we need to set up a vector filled with Proot and Egghe's data by filling it with 714 ones, 82 twos, and so on:
> x<-rep(1,714)
> x<-c(x,rep(2,82))
> x<-c(x,rep(3,84))
> x<-c(x,rep(4,3))
> x<-c(x,rep(5,1))

Let's make sure it looks right:
> table(x)
  1   2   3   4   5
714  82  84   3   1

Now we'll tell it to fit the data to a poisson distribution:

> fitdistr(x,"poisson")

Wow! That was easy. So what do 804 data points (the number of editions in Proot and Egghe's sample) look like in a Poisson distribution with a lamda value of 1.2975? It looks like this, which is nothing like our data:

> hist(rpois(804,1.2975))

What went wrong?

Picking up where we left off: What Proot and Egghe are observing is a truncated Poisson distribution: There are 0-copy editions, and we don't know how many--but they also affect the distribution's defining parameter. Burrell's equations (1) and (2) give the formulas for a truncated Poisson distribution and its log-likelihood function, which is needed for determining the formula for a maximum likelihood estimation (or in other words: estimating the missing editions), which Burrell also provides in (3):

This looks impenetrable, but it isn't so bad: Lamda (λ) is the number we're looking for, e is Euler's number (2.718...), and x-bar is just the average number of copies per edition, 907/804 = 1.1281. So all we have to do is solve for λ...which, as Burrell points out, is not possible by using algebra. It can only be estimated numerically. How are we going to do that?

With R, of course. Here's Burrell's equation (3) in R.

> g <- function (lamda) (lamda/(1-exp(1)^-lamda))-907/804

Now that R knows what the equation is, let's tell R to solve it for us. We have to give R a range to look for the root in. We'll try a range between "very small" and 5, since only values above 0 make sense (we could try a range from -1 to 5 for the same result).

> uniroot(g,c(.0001,5))
[1] 0.2461318

[1] -2.516872e-07

[1] 6

[1] NA

[1] 6.103516e-05

And there's λ = .2461, just like Burrell said it would be. Now that we know λ, we can create Burrell's equation (1) and generate some expected values:

> h <- function (r) (((exp(1)^.2461)-1)^-1)*((.2461^r)/factorial(r))

Between the range of  1 and 5, let's see what we would expect to find, given 804 editions:

> h(seq(1:5))*804
[1] 709.12157887  87.25741028   7.15801622   0.44039695   0.02167634

Graphically, it's clear that there's a close match between what we expect to find and what Proot and Egghe's data - note how closely the red and blue dots are:

> range<-seq(1:5)
> y<-h(range)*804
> plot(x,col="red",pch=19);points(range,y,col="blue",pch=19)

This is an important step, as it tells us that we're not completely mistaken in treating Proot and Egghe's data as a truncated Poisson distribution. Burrell also provides a statistical test of goodness of fit, but we'll come back to that later. Now that we know λ, we can calculate the fraction of 0-copy editions with the formula that Burrell provides on p. 103:

Or in R:

> exp(-.2461)
[1] 0.781844

And there we have it. If there are 804 known editions, then there are 804/(1-.781844) = 3685 missing editions (approximately - we'll look at confidence intervals later).

> 804/(1-exp(-.2461))
[1] 3685.437

So that gives us 3685 + 804 = 4489 total editions. What do we expect a truncated Poisson distribution with a lamda of .2461 and 4489 data points to look like?

> x<-rpois(4489,.2461)
> x<-x[x>0]
> table(x)

  1   2   3   4
850 109   8   1  

> hist(x,breaks=c(0,1,2,3,4))

Now that's more like it:

Next time: Confidence intervals, a quibble with Burrell, and final thoughts.

No comments:

Post a Comment