Huffman coding compresses data by assigning short codes to symbols that occur often and long codes to symbols that occur rarely. Being lossless, it cannot achieve the dramatic ratios of lossy methods, so it is used for text compression, or in conjunction with another method: for example, the JPEG standard uses it to compress quantised DCT coefficients.
Creating Huffman codes involves finding the two least frequently occurring symbols, prefixing their codes with a zero and a one respectively (the assignment is arbitrary), and merging them into a single cluster. The process then repeats, again locating the two rarest symbols or clusters. The Shannon-Fano algorithm, incidentally, is similar but works top-down: the symbols are first divided into two groups of roughly equal total frequency, and these are successively split; it is therefore somewhat less efficient. If you still want to experiment with it, though, you can substitute the subroutine Fano for Huffman in the program below.
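As a minimal sketch of the merging step in C (the alphabet size, the example counts and all the variable names here are illustrative, not taken from the linked program):

    #include <stdio.h>

    #define NSYM 6

    /* Hypothetical example counts; any frequencies will do. */
    static long freq[NSYM]   = { 45, 13, 12, 16, 9, 5 };
    static int  group[NSYM];   /* which live cluster each symbol is in  */
    static int  length[NSYM];  /* resulting code length of each symbol  */

    int main(void)
    {
        long gfreq[NSYM];      /* total count of each live cluster      */
        int  i, pass;

        for (i = 0; i < NSYM; i++) { group[i] = i; gfreq[i] = freq[i]; }

        for (pass = 0; pass < NSYM - 1; pass++) {
            int a = -1, b = -1;               /* the two rarest clusters */
            for (i = 0; i < NSYM; i++) {
                if (gfreq[i] < 0) continue;   /* cluster already merged  */
                if (a < 0 || gfreq[i] < gfreq[a]) { b = a; a = i; }
                else if (b < 0 || gfreq[i] < gfreq[b]) b = i;
            }
            /* Every symbol in either cluster grows by one code bit; the
               bit values themselves can be assigned canonically later. */
            for (i = 0; i < NSYM; i++)
                if (group[i] == a || group[i] == b) {
                    length[i]++;
                    group[i] = a;             /* merge cluster b into a  */
                }
            gfreq[a] += gfreq[b];
            gfreq[b] = -1;                    /* retire cluster b        */
        }
        for (i = 0; i < NSYM; i++)
            printf("symbol %d: count %ld -> %d bits\n", i, freq[i], length[i]);
        return 0;
    }

For the example counts this prints lengths of 1, 3, 3, 3, 4 and 4 bits, which is indeed an optimal assignment.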
There are several optimal sets of Huffman codes for a given symbol distribution, but efficient packing of the Huffman table is possible for code sets in which the codes of each length are consecutive (so-called canonical codes). The decoder is also made faster, because it can record the minimum and maximum acceptable code for every code length; it then only needs to check that the incoming code is within range, and if it is not, it proceeds to the next code length.
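Here is a sketch of such a decoder; the table values are worked out by hand for the six-symbol example above, and the names (mincode, maxcode, first_sym, get_bit) are assumptions for illustration rather than those of the actual source:

    #include <stdio.h>

    #define MAX_LEN 4
    #define BAD    (-1)

    /* Canonical tables derived by hand from the example code
       lengths {1,3,3,3,4,4}; BAD marks lengths with no codes. */
    static long mincode[MAX_LEN + 1]   = { BAD, 0, BAD, 4, 14 };
    static long maxcode[MAX_LEN + 1]   = { BAD, 0, BAD, 6, 15 };
    static int  first_sym[MAX_LEN + 1] = { 0, 0, 0, 1, 4 };
    static int  symbol[6] = { 0, 1, 2, 3, 4, 5 };  /* sorted by length, then code */

    static const char *bits = "01001111";          /* "0", "100", "1111" */
    static int pos;
    static int get_bit(void) { return bits[pos++] - '0'; }

    static int decode_symbol(void)
    {
        long code = 0;
        int  len;

        for (len = 1; len <= MAX_LEN; len++) {
            code = (code << 1) | get_bit();        /* take one more bit */
            if (maxcode[len] >= 0 && code <= maxcode[len])
                /* In range for this length: the offset from the smallest
                   code of the length locates the symbol in the table. */
                return symbol[first_sym[len] + (int)(code - mincode[len])];
        }
        return BAD;                                /* corrupt stream */
    }

    int main(void)
    {
        int i;
        for (i = 0; i < 3; i++)
            printf("%d ", decode_symbol());        /* prints: 0 1 5 */
        putchar('\n');
        return 0;
    }

Note that no lower-bound test is needed: if the running code were below mincode for this length, it would already have matched at a shorter length.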
Practical registers are of finite length, so the same must be true of the Huffman codes in any implementation. JPEG allows a maximum code length of 16 bits, accepting that, on rare occasions, the resulting encoding will be slightly suboptimal. This first version will decline to compress data if the resulting code length exceeds 24 bits, but such a symbol distribution is, of course, most unlikely. (Long integers on the development system are 32 bits long, but the sign bit is awkward to handle; the buffer is therefore kept between 24 and 31 bits full.)
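One plausible arrangement for such a buffer on the decoding side (a sketch only; the linked source may do it differently): top the buffer up a byte at a time until it holds at least 24 bits, stopping before the sign bit is reached, so that the longest permitted code can always be extracted in one step.

    #include <stdio.h>

    static long bitbuf;    /* right-justified look-ahead bits             */
    static int  nbits;     /* how many are valid: kept between 24 and 31  */

    /* Refill a byte at a time: starting from at most 23 valid bits,
       this leaves between 24 and 31, never touching the sign bit.   */
    static void fill_buffer(FILE *in)
    {
        while (nbits <= 31 - 8) {          /* room for one more byte */
            int c = getc(in);
            if (c == EOF) c = 0;           /* pad past end of stream */
            bitbuf = (bitbuf << 8) | c;
            nbits += 8;
        }
    }

    /* Consume the len leading bits of the buffer (len <= 24). */
    static long take_bits(FILE *in, int len)
    {
        long val;
        fill_buffer(in);
        val = (bitbuf >> (nbits - len)) & ((1L << len) - 1);
        nbits -= len;
        return val;
    }

    int main(void)
    {
        /* Demo: print the first eight 4-bit groups of stdin as hex. */
        int i;
        for (i = 0; i < 8; i++)
            printf("%lX", (unsigned long)take_bits(stdin, 4));
        putchar('\n');
        return 0;
    }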
Here is the source code (which may be recompiled) and the executable file for this project. It is mainly intended for text files, which it will typically halve in size, or for bitmap images; the savings for executable files are smaller.
Huffman coding was proven optimal by its inventor, David Huffman, for the case where an integral number of bits is assigned to every symbol. By assigning a fractional number of bits, arithmetic coding improves upon Huffman coding and gets even closer to the Shannon limit. Dictionary algorithms of the Lempel-Ziv family, such as LZW, can improve even on arithmetic coding by considering combinations of symbols. However, both arithmetic coding and LZW are patented in several countries.
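To see why fractional bits matter, consider a lopsided two-symbol source (the 0.9 probability here is just an example): Huffman coding must spend a whole bit on every symbol, while the entropy says less than half a bit would suffice. A minimal computation:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        /* Example source: two symbols with probabilities 0.9 and 0.1. */
        double p = 0.9;
        double entropy = -(p * log(p) + (1 - p) * log(1 - p)) / log(2.0);

        /* Huffman must assign each symbol at least one whole bit, so it
           spends exactly 1 bit/symbol; an ideal (arithmetic) coder needs
           only the entropy, about 0.469 bits/symbol here.              */
        printf("Shannon limit : %.3f bits/symbol\n", entropy);
        printf("Huffman       : 1.000 bits/symbol\n");
        return 0;
    }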
The 8051 frequency counter section has received some minor modifications, to prevent the design from becoming marginal. Above 960 kHz, the overflow warning needs additional circuitry.