
Shannon entropy is defined by $H(X) = -\sum_{i} P(x_i) \log_b P(x_i)$, where $b$ can be 2, $e$, or 10 (giving units of bits, nats, or dits, respectively).

My interpretation of the formula: $H(X)$ is the negative sum over $i$ of the probability of $x_i$ multiplied by $\log_b$ of the probability of $x_i$.
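
For reference, the formula can be written directly as a small helper function (shannon_entropy is just a name I made up, not a function from any package):

# Shannon entropy of a probability vector p, in the given log base
shannon_entropy <- function(p, base = 2) {
  -sum(p * log(p, base = base))
}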

So far, my implementation of Shannon entropy in R is as follows (here is an example):

mystring <- c(1,2,3,1,3,5,4,2,1,3,2,4,2,2,3,4,4)
# relative frequency of each distinct symbol
myfreqs <- table(mystring)/length(mystring)
# extract the frequencies as a plain numeric vector
myvec <- as.data.frame(myfreqs)[,2]
# Shannon entropy in bits
-sum(myvec * log2(myvec))
[1] 2.183667

So for the string used in my example, $H(X)=2.183667$.

Now take a look at the entropy package. Its function entropy.empirical computes the Shannon entropy:

mystring <- c(1,2,3,1,3,5,4,2,1,3,2,4,2,2,3,4,4)
entropy.empirical(mystring, unit="log2")
[1] 3.944667

If we look at the code, it seems that the formula used is:

freqs <- mystring / sum(mystring)
H <- -sum(freqs * log(freqs)) / log(2)
H
[1] 3.944667

My simple question: who is wrong? Why does R use that code? Is that the Shannon entropy, or a different entropy calculation?

Tommaso

1 Answer


It depends on the problem at hand; both of you are correct. If your source symbols are "$1$," "$2$," "$3$," "$4$," and "$5$," and from this source you got the sequence $1,2,3,1,3,5,4,2,1,3,2,4,2,2,3,4,4$, then to calculate the empirical entropy of the source you count the number of times the symbols $1$ through $5$ appear, which is $3,5,4,4,1$ respectively. The empirical probabilities of the symbols are then $\frac{3}{17},\frac{5}{17},\frac{4}{17},\frac{4}{17},\frac{1}{17}$ respectively, and from these you can calculate the empirical Shannon entropy to be roughly $2.18$, which is the result you get in your first code.
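
In R, this counting step is exactly what table does (a quick check of the numbers above, using mystring from your question):

counts <- table(mystring)     # counts of symbols 1..5: 3 5 4 4 1
p <- counts / sum(counts)     # empirical probabilities
-sum(p * log2(p))             # empirical Shannon entropy in bits
[1] 2.183667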

Now, the function in the second code makes a different "assumption": its input argument is already the number of times each distinct symbol appeared in the sequence. If our problem is as stated in the previous paragraph (which I believe it is, given the name "mystring"), then you should have passed the vector of counts $3,5,4,4,1$ as your argument, i.e. entropy.empirical(c(3, 5, 4, 4, 1), unit="log2"), and it would give you $2.18$ as well.
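
For example (this assumes the entropy package is loaded; a table of the raw data already has the counts in the right form):

library(entropy)
entropy.empirical(c(3, 5, 4, 4, 1), unit="log2")   # counts, not raw symbols
[1] 2.183667
entropy.empirical(table(mystring), unit="log2")    # equivalent, from the data
[1] 2.183667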

What you are calculating in your second code is, however, entirely different. It is as if you have a source that generates $17$ different symbols and you drew sum(mystring) = 46 samples from it: the first source symbol appeared mystring[1] = 1 times, the second symbol appeared mystring[2] = 2 times, and so on.
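
To see this concretely, you can expand that "counts" reading back into an explicit sample and recompute the entropy the way your first code does (a sketch; the reconstructed sample is hypothetical):

# treat mystring as counts: symbol i (of 17) occurs mystring[i] times
expanded <- rep(seq_along(mystring), times = mystring)   # 46 samples
freqs <- table(expanded) / length(expanded)
-sum(freqs * log2(freqs))
[1] 3.944667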

Lord Soth
  • I wrongly interpreted the help and the example of the entropy.empirical function. Thanks for the explanation. – Tommaso Mar 14 '14 at 19:15
  • @Tommaso You are welcome, glad to help. – Lord Soth Mar 14 '14 at 19:16
  • What about the calculation of Shannon (or Tsallis) entropy for a continuous variable? Here is my new question: http://math.stackexchange.com/questions/713316/tsallis-entropy-for-continuous-variable-in-r – Tommaso Mar 16 '14 at 08:46