Final Assignment: Huffman Codes

In this final programming assignment, you will write program that builds a Huffman code, given a frequency distribution for the characters to be encoded.

Background: Huffman Codes

An alphabet is a set of symbols. A code is a procedure for mapping the symbols of one alphabet into strings of characters from a second alphabet. A binary code maps symbols from one alphabet into strings of symbols from the binary alphabet (i.e., strings consisting of 1's and 0's).

Many binary codes map characters to fixed-length strings of 1's and 0's. ASCII (American Standard for Information Interchange) is a binary code that maps each symbol to a sting of seven bits (1's and 0's). For example, the symbol '(' is mapped to the string 0101000 and the symbol 'l' is mapped to the string 1101100.

A Huffman code is a variable-length code that uses frequency counts to minimize the expected code length for encoding messages. As an example, suppose that we need to encode messages constructed from the symbols, A, B, C, and D. Further, suppose that 75% of the symbols in our messages are A, 10% are the symbol B, 10% are the symbol C, and the remaining 5% are the symbol D.

Using a fixed-length code, each symbol would be encoded using two bits. So, for example, the message BAAABCAACD would be encoded using 20 bits.

However, if we map the symbol A to the string 1, the symbol B to the string 01, the symbol C to the string 001, and the symbol D to the string 000, the message BAAABCAACD is encoded in 18 bits (011110100111001000).

Saving two bits may not seem too significant; however, if we encode a message of 100 symbols, the savings becomes more apparent. From our frequencies, we can expect that 75 of the symbols will be A's requiring one bit (1), 10 will be B's requiring two bits (01), 10 will be C's requiring three bits (001) and 5 will be D's requiring three bits (000). As such, the total length will be 75 + 20 + 30 + 15 = 140 bits which is quite a bit smaller than the 200 bits that would be required for a fixed-length code.

Building Huffman Codes

Given an alphabet and the expected frequency for each symbol, we use a Huffman tree to build a Huffman code for the alphabet. A Huffman tree is a binary tree in which the leaves hold symbols from the alphabet. The string that a symbol is mapped to is determined by tracing a path from the root of the tree to the symbol. Initially, the string used to encode the symbol is empty. Every time you go to the left you append a 1 to the string and every time you go to the right you append a 0 to the string. When you reach the symbol, you have the string used to encode the symbol.

To construct a Huffman tree for a set of symbols, you start by constructing a trivial tree for each symbol in the alphabet. Along with each tree, you need to keep track of its frequency. The frequency of the initial trees is the frequency of the symbol contained in the tree. On each step of the construction, you select two trees with the minimum frequencies and merge them to form a new tree. The trees are merged by making one of them the left child and the other the right child. The frequency of the new tree is the sum of the frequencies of the two trees used to construct the new tree.

Implementation

Your assignment is to build a Huffman code, given an alphabet and expected frequency for each symbol in the alphabet. In writing your code, you should implement a priority queue class (each element is a Huffman tree and the priority is the frequency of the tree), a Huffman tree class, and a string class (from your previous assignment).

Input and Output

The input will consist of a frequency (a double), followed by a single space, followed by the symbol (a printable character, use cin.get() to read the symbol). You may assume that the input is correct, but should not make any assumptions about the number of symbols in the input alphabet.

The output should consist of a line for each symbol in the input alphabet. Each line should have the symbol, followed by a single space, followed by the string that the symbol is mapped to.


Barney Maccabe
Last modified: Mon Apr 28 21:21:36 MDT 1997