Evan's Page of Programming


Huffman Encoding

Huffman encoding is a form of lossless compression that is optimal when each symbol has a probability that is an integral power of 1/2. It is a simple algorithm, but achieves impressive results. The JPEG image format uses Huffman encoding as a final step.

Huffman encoding is a variable-length encoding, which means that each individual symbol will have a different length bit sequence representing it. The first step to compressing using Huffman encoding is generating a frequency table. For each symbol in the input text, count how many times it occurs, and sort in ascending order.

Example:
   Input Text: aabaaccaadddaaeee
   b - 1
   c - 2
   d - 3
   e - 3
   a - 8

Now, consider each symbol as a tree node and each frequency as the node's weight. Replace the two lowest-weight nodes with a node having both of them as children, and having the sum of their weights as its weight. Repeat this process until there is only one node in the list.

Example:
   Initial nodes:
   b(1), c(2), d(3), e(3), a(8)
   
   Merge b & c into node x:
   x(3), d(3), e(3), a(8)
   
   Merge x & d into node y:
   y(6), e(3), a(8)
   
   Merge y & e into node z:
   z(9), a(8)
   
   Merge z & a into node w:
   w(17)

Final Tree
            w
           / \
          z   a
         / \
        y   e
       / \
      x   d
     / \
    b   c

To compress the input text, for each symbol in the input, start at the root of the tree and output a 0 for each left turn and a 1 for each right turn until we reach that symbol. To decompress, simply start at the root, and take the directions indicated by the bits until you reach a leaf. Then, output the symbol at that leaf and go back to the root.

Example:
   a = Right turn(1)
   b = Left turn(0), Left turn(0), Left turn(0), Left turn(0)
   c = Left turn(0), Left turn(0), Left turn(0), Right turn(1)
   d = Left turn(0), Left turn(0), Right turn(1)
   e = Left turn(0), Right turn(1)
   aabaaccaadddaaeee =
   11000011000100011100100100111010101 (35 bits)

With 5 different symbols, it would have required 3 bits per symbol or 51 bits by a trivial encoding. Even on this very small example, we saved a good percentage of the message size. The tree must also be stored with the compressed file, but with larger files, the tree size will become negligible.

The last note here is that Huffman encoding is not the perfect compression technique. Although it optimal in certain cases, these cases rarely come up in real life. Arithmetic encoding more closely approaches optimality, but it has patent issues and is much more complicated (slow) than Huffman.

Example code

All pages, images, and other files are Copyright (c) Evan Wright 2003, unless otherwise specified.