Compression: Integer Encodings
Well, let's look at the standard encoding. One bit is needed for 1 (1). Two bits are needed for 2 (10) and 3 (11), and three bits are needed for 4 (100), ..., 7 (111). Hmm. With logarithms to base 2, log(1)=0, log(2)=1, log(4)=2, log(8)=3. So the number of bits needed is 1+floor(log n). Or is it?
Suppose the bits 110 are read from a file. Is this the number 6, or is it a 1 followed by a 2? We can't know unless we know a priori how many bits were used to store the number.
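To make the ambiguity concrete, here is a minimal sketch in current Racket. It represents bit sequences as ordinary strings of 0s and 1s rather than as bits in a file, and the names standard, standard-length and parses are made up for this illustration.

```racket
#lang racket

;; Standard (plain binary) encoding of n >= 1 as a string of 0s and 1s.
(define (standard n) (number->string n 2))

;; Its length is 1 + floor(log2 n).
(define (standard-length n) (string-length (standard n)))

;; Every way of cutting a bit string into standard encodings.
;; A piece may not start with 0, since encodings have no leading zeros.
(define (parses bits)
  (cond [(string=? bits "") '(())]
        [(char=? (string-ref bits 0) #\0) '()]
        [else
         (for*/list ([i (in-range 1 (add1 (string-length bits)))]
                     [tail (in-list (parses (substring bits i)))])
           (cons (string->number (substring bits 0 i) 2) tail))]))

(map standard-length '(1 2 3 4 7 8)) ; '(1 2 2 3 3 4)
(parses "110")                       ; '((1 2) (6))
```

The second result lists two equally valid readings of the same three bits, which is exactly why the standard encoding cannot be used on its own.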
Leaving the standard encoding aside for a moment, consider instead the unary encoding.
The natural number n is represented as n-1 1-bits followed by a single 0-bit.
The unary encoding does not have the same problem as before - it is a prefix-free code. The code for a number is never a prefix of the code for a different number.
The unary encoding is useful in situations where the majority of numbers are very small. For large numbers it is horribly inefficient.
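A small sketch of why the prefix-free property matters: a concatenation of unary codes can be split back into numbers without any extra length information. Bit strings are again plain Racket strings, and decode-unary-stream is a hypothetical helper name, not part of the implementation mentioned in this post.

```racket
#lang racket

;; Unary code for n >= 1: n-1 one bits followed by a single zero bit.
(define (unary n)
  (string-append (make-string (sub1 n) #\1) "0"))

;; Decode a concatenation of unary codes: each 0 ends a code, and the
;; value is one more than the number of 1s seen since the previous 0.
(define (decode-unary-stream bits)
  (let loop ([chars (string->list bits)] [n 1] [acc '()])
    (cond [(null? chars) (reverse acc)]
          [(char=? (car chars) #\0) (loop (cdr chars) 1 (cons n acc))]
          [else (loop (cdr chars) (add1 n) acc)])))

(unary 3)                          ; "110"
(decode-unary-stream "1100101110") ; '(3 1 2 4)
```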
The problem with the standard encoding was that we didn't know where the code ended. Consider now, for the sake of argument, encoding a number n by the unary code for the length of the standard encoding, 1+floor(log n), followed by the standard encoding itself. The total length of that code is 2*(1+floor(log n)). For large numbers this is much better than the unary code.
A small observation saves a bit. Consider the encoding of 4: 110 100. The encoding starts with the unary encoding of 3, the length of the standard encoding. Since all numbers of length 3 have the bit pattern 1xx, it is unnecessary to actually store that leading bit. Thus we can just store 4 as 110 00. This bit-saving encoding is called the gamma encoding.
To sum up, the gamma encoding of a natural number n consists of the unary code for 1+floor(log n) followed by the floor(log n) bits representing n-2^floor(log n) in binary. The number of bits used by the gamma code is 2*floor(log n)+1.
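The summary translates almost directly into code. The following is a sketch on strings of 0s and 1s, not the bit-port implementation referred to later in this post. read-gamma is a hypothetical name and assumes its input starts with a well-formed gamma code.

```racket
#lang racket

(define (unary n) (string-append (make-string (sub1 n) #\1) "0"))

;; Gamma code: unary code for the length of the standard encoding,
;; then the standard encoding without its leading 1 bit.
(define (gamma n)
  (define bits (number->string n 2))
  (string-append (unary (string-length bits)) (substring bits 1)))

;; Decode one gamma code from the front of a bit string.
;; Returns two values: the number and the remaining bits.
(define (read-gamma bits)
  ;; The unary part ends at the first 0; its value is that index plus 1.
  (define len
    (add1 (for/first ([c (in-string bits)] [i (in-naturals)]
                      #:when (char=? c #\0))
            i)))
  (define payload-end (+ len (sub1 len)))
  (values (string->number
           (string-append "1" (substring bits len payload-end)) 2)
          (substring bits payload-end)))

(gamma 4)                 ; "11000"
(map gamma '(1 2 3 4 5))  ; '("0" "100" "101" "11000" "11001")
(read-gamma "110000")     ; 4 and "0", the leftover bits are gamma(1)
```

Because the unary prefix says how many bits follow, the decoder always knows where one code ends and the next begins.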
For numbers smaller than 5 the unary code is never longer than the gamma code (the two tie for 1 and 3), for 5 they use the same number of bits, and for numbers larger than 5 the gamma code uses fewer bits than the unary code.
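The crossover is easy to check by tabulating the two lengths. This sketch uses only the length formulas from the text; unary-length and gamma-length are names chosen for the illustration.

```racket
#lang racket

;; The unary code for n uses n bits.
(define (unary-length n) n)

;; The gamma code uses 2*floor(log2 n)+1 bits, which is
;; 2*(length of the standard encoding) - 1.
(define (gamma-length n)
  (sub1 (* 2 (string-length (number->string n 2)))))

;; Rows of (n unary-bits gamma-bits) for n = 1..8.
(for/list ([n (in-range 1 9)])
  (list n (unary-length n) (gamma-length n)))
;; '((1 1 1) (2 2 3) (3 3 3) (4 4 5) (5 5 5) (6 6 5) (7 7 5) (8 8 7))
```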
A variation of the gamma code is the delta code. The reasoning goes as follows: For numbers with a length larger than 5 in the standard encoding, it would be better to store the length as a gamma code than as a unary code. That is, the delta code for a natural number n consists of the gamma code for the length of the standard encoding of n, followed by the standard encoding.
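A sketch of the delta code exactly as defined here, that is, with the full standard encoding (leading 1 bit included) after the gamma-coded length. As before, bit strings are plain Racket strings and the function names are assumptions of this example.

```racket
#lang racket

(define (unary n) (string-append (make-string (sub1 n) #\1) "0"))

(define (gamma n)
  (define bits (number->string n 2))
  (string-append (unary (string-length bits)) (substring bits 1)))

;; Delta code as described in the text: gamma code for the length of
;; the standard encoding, followed by the full standard encoding.
(define (delta n)
  (define bits (number->string n 2))
  (string-append (gamma (string-length bits)) bits))

(delta 1)   ; "01"           gamma(1) = "0",     then "1"
(delta 4)   ; "101100"       gamma(3) = "101",   then "100"
(delta 64)  ; "110111000000" gamma(7) = "11011", then "1000000"
```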
For numbers below 32 the gamma code is shorter than the delta code. For numbers between 32 and 63 the codes have the same length. From 64 on the delta code is never longer than the gamma code: it is shorter everywhere except for 128 through 255, where the two codes tie once more.
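To see which code wins for a given number, the lengths can be computed from the definitions without building the codes. This is a sketch under the definition of the delta code used in this post. bit-length, gamma-length, delta-length and compare are hypothetical helper names.

```racket
#lang racket

;; Length of the standard encoding: 1 + floor(log2 n).
(define (bit-length n) (string-length (number->string n 2)))

(define (gamma-length n) (sub1 (* 2 (bit-length n))))

;; Delta as described in the text: gamma code for the length,
;; then all bits of the standard encoding.
(define (delta-length n)
  (+ (gamma-length (bit-length n)) (bit-length n)))

;; Which code is shorter for n?
(define (compare n)
  (let ([g (gamma-length n)] [d (delta-length n)])
    (cond [(< g d) 'gamma]
          [(> g d) 'delta]
          [else 'tie])))

(map compare '(1 31 32 63 64 127 128 255 256))
;; '(gamma gamma tie tie delta delta tie tie delta)
```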
A small excerpt from the implementation of these encodings - mail me if you are interested in a copy. [The Blogger software inserts spurious newlines in the atom feed. See Everything Scheme for the original.]
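; write-bits, read-bits and the bit ports come from a bit I/O library whose require line is missing here.
(require (planet "42.ss" ("soegaard" "srfi.plt")))

;;; UNARY CODE
; The unary code for an integer n>=1 is n-1 one bits followed by a zero bit.
; The code for 3 is 110.
(define write-unary
  (case-lambda
    [(n)
     (write-unary n (current-output-bit-port))]
    [(n out-bit-port)
     (unless (and (integer? n) (positive? n))
       (error 'write-unary "a positive integer was expected, got: ~a" n))
     ; n-1 one bits: the value 2^(n-1)-1 written in n-1 bits
     (when (> n 1)
       (write-bits (sub1 n)
                   (sub1 (arithmetic-shift 2 (sub1 (sub1 n))))
                   out-bit-port))
     (write-bits 1 0 out-bit-port)]))

; The reader's header was lost; the names read-unary and current-input-bit-port are assumed by symmetry.
(define read-unary
  (case-lambda
    [() (read-unary (current-input-bit-port))]
    [(in-bit-port)
     (do ([n 1 (+ n 1)])
       [(= (read-bits 1 in-bit-port) 0) n])]))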