2

Suppose I have an ordered list of bytes (the hexdump of some object file), and wish to calculate the information entropy of this file. My understanding is I can calculate this as $$ \sum_{n=0}^{n=255} -p_n \log_{256}(p_n) $$

where $p_n = \frac{(\text{number of n-valued bytes})}{(\text{total number of bytes})}$.

My understanding is that the information entropy should be the theoretical lower bound for the compression ratio for a file. But when calculating the entropy of the standard C library, I get an entropy of ~0.8, when it's possible to compress the standard C library to 40% of the original size using gzip.

What am I misunderstanding here? Perhaps my calculation of $p_n$ is too simplistic, as the value of every byte in a byte stream is not independent of the preceding bytes, in the same way that characters in English text are not independent. Is there a better way to calculate the informational entropy of a file?

Davide Giraudo
  • 172,925
Eric
  • 21
  • That is how much you can compress individual bytes. If you want the compression ratio for the whole file, you need to look at the probability distribution of files themselves. – Christopher King Jun 29 '16 at 21:50

1 Answers1

1

Probably way too late for you, but as a future reference.

The $p_n$ you calculate is the fraction and not the probability of a complete random event. Probably there is a lot of correlation in your data.

Suppose you'd have a file of 1MB pure 0, followed by 1MB of pure 1 (or 255 to be more accurate for bytes). Your fraction of bits set to true is 50%, and entropy according to your calculation would be 1 (and thus no compression), but this file could be described/compressed in a few hundred of bytes.

Pieter21
  • 3,475
  • But how is entropy of a finite series is computed in practice? – tst Oct 09 '19 at 00:51
  • @tst depends if you know the behavior of the symbol source a priori. If you don't know, zipping is actually a good estimator. The compression rate should converge to the entropy. Otherwise, you could use the formula in the original post. – Pieter21 Oct 09 '19 at 09:46