Huffman encoding is a form of lossless compression that is optimal when
each symbol has a probability that is an integral power of 1/2. It is a simple
algorithm, but achieves impressive results. The JPEG image format uses Huffman
encoding as a final step.
Huffman encoding is a variable-length encoding, which means that each individual
symbol will have a different length bit sequence representing it. The first step
to compressing using Huffman encoding is generating a frequency table. For each
symbol in the input text, count how many times it occurs, and sort in ascending order.
Example:
Input Text: aabaaccaadddaaeee
b - 1
c - 2
d - 3
e - 3
a - 8
Now, consider each symbol as a tree node and each frequency as the node's weight.
Replace the two lowest-weight nodes with a node having both of them as children, and
having the sum of their weights as its weight. Repeat this process until there is
only one node in the list.
Example:
Initial nodes:
b(1), c(2), d(3), e(3), a(8)
Merge b & c into node x:
x(3), d(3), e(3), a(8)
Merge x & d into node y:
y(6), e(3), a(8)
Merge y & e into node z:
z(9), a(8)
Merge z & a into node w:
w(17)
Final Tree
w
/ \
z a
/ \
y e
/ \
x d
/ \
b c
To compress the input text, for each symbol in the input, start at the root of the tree
and output a 0 for each left turn and a 1 for each right turn until we reach that symbol.
To decompress, simply start at the root, and take the directions indicated by the bits
until you reach a leaf. Then, output the symbol at that leaf and go back to the root.
Example:
a = Right turn(1)
b = Left turn(0), Left turn(0), Left turn(0), Left turn(0)
c = Left turn(0), Left turn(0), Left turn(0), Right turn(1)
d = Left turn(0), Left turn(0), Right turn(1)
e = Left turn(0), Right turn(1)
aabaaccaadddaaeee =
11000011000100011100100100111010101 (35 bits)
With 5 different symbols, it would have required 3 bits per symbol or 51 bits by a trivial
encoding. Even on this very small example, we saved a good percentage of the
message size. The tree must also be stored with the compressed file, but with larger
files, the tree size will become negligible.
The last note here is that Huffman encoding is not the perfect compression technique.
Although it optimal in certain cases, these cases rarely come up in real life. Arithmetic
encoding more closely approaches optimality, but it has patent issues and is much
more complicated (slow) than Huffman.
Example code
|