
All of the coding techniques below require some knowledge of information theory and probability/statistics. Further, they all assume that you are using the following variables:

X is a random variable whose possible values u_{1} through u_{n} are the symbols of the source language (the alphabet), where n is the number of possible values. For instance, English has 26 letters, so u_{1} would correspond to 'a' and u_{26} would correspond to 'z'. Stated in more mathematical terms, we have:

X = { u_{1}, u_{2}, u_{3}, u_{4}, ..., u_{n} }, where n is the number of symbols in the source language (for instance, in English, n = 26 and X = { 'a', ..., 'z' }).

P is the probability distribution of X.

P = { p_{1}, p_{2}, p_{3}, p_{4}, ..., p_{n} }, where n is the number of symbols in the source language, p_{i} is the probability of symbol u_{i}, each p_{i} is at most 1, and the p_{i} sum to 1.

S is the set of symbols in the coding alphabet.

S = { s_{1}, s_{2}, ..., s_{r} }, where r is the number of symbols in the code alphabet. Note that r does not have to equal the number of symbols in the source alphabet. It can be far less, and in the case of a computer, with binary code, it is much less (r = 2).

Every variable-length code has a measure of how well it encodes a language. This is called its efficiency, and it is a simple mathematical formula determined by the entropy (average information) of the source and the expected value of the length of the code words. For a fixed-length code this is very straightforward to compute but not useful; for a variable-length code it is genuinely useful for comparing different codes.

A code's efficiency is determined as:

efficiency = Hu / ( <L> * log r )

where Hu is the average information (Shannon's theory of information) of the original symbols, <L> is the expected value of L (the set of the lengths of each code word for the alphabet), and r is the number of symbols in the code alphabet.

recall: Hu is defined as: Hu = - Sum[i] ( Px(i)*log(Px(i)) )

<L> is defined as: <L> = Sum[i] ( Px(i)*l(i) ), where l(i) is the length of the code word for symbol u_{i}.
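As a concrete illustration, these two quantities and the efficiency formula can be sketched in Python (a minimal sketch; the function names are my own):

```python
import math

# Hu: average information (entropy) of the source, in units of
# log base 2 (bits when r = 2).
def entropy(P):
    return -sum(p * math.log2(p) for p in P if p > 0)

# <L>: expected code-word length under the source distribution.
def expected_length(P, L):
    return sum(p * l for p, l in zip(P, L))

# efficiency = Hu / (<L> * log r); 1.0 means the code is as short
# as the entropy allows.
def efficiency(P, L, r=2):
    return entropy(P) / (expected_length(P, L) * math.log2(r))
```

For a fair coin coded with one bit per outcome, `efficiency([0.5, 0.5], [1, 1])` is exactly 1.0.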

For Huffman coding, one creates a binary tree of the source symbols, using the probabilities in P. Assume for now that the coding alphabet is binary, as it is within the computer; a more general case will be shown after. So, what happens is:

We take the two least probable symbols in P and combine them into a binary tree (at this point with only two leaves and a root). The root is imagined to be a new symbol whose probability is the sum of the probabilities of the symbols in its tree. Then the algorithm recurses on the new, smaller set of symbols.

Following the construction of the tree, you traverse the tree to find the new code for a given symbol: whenever one takes a left branch, one appends a zero to the code, and whenever one takes a right branch, one appends a one.

In this manner the code is constructed, and typically it represents common symbols with fewer symbols than rare ones.
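The merge-and-recurse construction just described can be sketched with a min-heap (a minimal sketch of standard binary Huffman construction; `huffman_codes` and its tie-breaking order are my own choices, so the exact bit patterns may differ from a hand-built tree even though the code lengths agree):

```python
import heapq

def huffman_codes(symbols, probs):
    # Heap entries are (probability, tiebreak, tree); a tree is either
    # a bare symbol or a [left, right] pair. The tiebreak integer keeps
    # heapq from ever comparing two trees directly.
    heap = [(p, i, s) for i, (s, p) in enumerate(zip(symbols, probs))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)   # least probable node
        p2, _, t2 = heapq.heappop(heap)   # second least probable node
        # The merged root acts as a symbol whose probability is the sum.
        heapq.heappush(heap, (p1 + p2, next_id, [t1, t2]))
        next_id += 1
    codes = {}
    def walk(tree, code):
        if isinstance(tree, list):
            walk(tree[0], code + "0")     # left branch appends a zero
            walk(tree[1], code + "1")     # right branch appends a one
        else:
            codes[tree] = code
    walk(heap[0][2], "")
    return codes
```

On the example that follows (probabilities 0.13, 0.1, 0.03, 0.15, 0.18, 0.41) this yields code lengths 3, 4, 4, 3, 3, and 1.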

Example:

X = { 'a', 'b', 'c', 'd', 'e', 'f' } (n = 6)

P = { 0.13, 0.1, 0.03, 0.15, 0.18, 0.41 }

S = { 0, 1 } (r = 2)

'a'     'b'     'c'     'd'     'e'     'f'
0.13    0.10    0.03    0.15    0.18    0.41

Here, we can combine 'b' and 'c', because their probabilities are the two smallest of all the letters...

'a'      u      'd'     'e'     'f'
0.13    0.13    0.15    0.18    0.41
        /  \
      'b'  'c'

And we can combine 'a' with the combined 'b' and 'c' (u), because they
both have probabilities less than all the others...

   u      'd'     'e'     'f'
  0.26    0.15    0.18    0.41
  /  \
'a'   \
      /  \
    'b'  'c'

Here, though, because the combination of 'a', 'b', and 'c' has grown larger than everything but 'f', we cannot combine it with anything else yet; instead we pick 'd' and 'e', because their probabilities are less than those of u and 'f'.

   u           v       'f'
  0.26        0.33     0.41
  /  \        /  \
'a'   \     'd'  'e'
      /  \
    'b'  'c'

Here both u and v are lower in probability than 'f', so they can be combined...

        u         'f'
       0.59       0.41
      /    \
   0.26    0.33
   /  \    /  \
 'a'   \ 'd'  'e'
       /  \
     'b'  'c'

And lastly, only 'f' and u are left, so they can be combined.

           u
          1.0
         /    \
      0.59    'f'
      /   \
   0.26   0.33
   /  \    /  \
 'a'   \ 'd'  'e'
       /  \
     'b'  'c'

We must go to the left three times to get to 'a', so the code for 'a' is 000.

...twice to the left, once to the right, and once more to the left to get to 'b', so the code for 'b' is 0010.

...twice to the left, once to the right, and once more to the right for 'c', so the code for 'c' is 0011.

...once to the left, once to the right, and once to the left for 'd', so the code for 'd' is 010.

...once to the left and twice to the right for 'e', so the code for 'e' is 011.

...and only once to the right for 'f', so the code for 'f' is 1.

Now, look at the tree. 'f', with a probability of nearly half, 0.41, is one half of the entire tree. Likewise, at 0.33, 'd' and 'e' together make up about half of the remaining (a little over a) half. As we continue down the tree, we see that more common symbols are easier to get to, and therefore require fewer code symbols, while rare or uncommon symbols sit deeper in the tree, which costs little because they are sought only very infrequently.

Answer = { 'a' = 000, 'b' = 0010, 'c' = 0011, 'd' = 010, 'e' = 011, 'f' = 1 }

Efficiency of this code (log is assumed to be base 2, because the coding alphabet consists of two symbols):

Hu = - ( 0.13*log(0.13) + 0.1*log(0.1) + 0.03*log(0.03) + 0.15*log(0.15) + 0.18*log(0.18) + 0.41*log(0.41) )

= - ( -0.3826 + -0.3322 + -0.1518 + -0.4105 + -0.4453 + -0.5274 )

= 2.2498 (on average, there are about two and a quarter bits of information in a single symbol)

recall:

Answer = { 'a' = 000, 'b' = 0010, 'c' = 0011, 'd' = 010, 'e' = 011, 'f' = 1 }

P = { 0.13, 0.1, 0.03, 0.15, 0.18, 0.41 }

Now...by counting the number of symbols in each code...

L = { 3, 4, 4, 3, 3, 1 }

<L> = 0.13*3 + 0.1*4 + 0.03*4 + 0.15*3 + 0.18*3 + 0.41*1

= 2.31

So, putting these together, we get:

2.2498/(2.31 * log 2), though since log, base 2, of 2 is simply 1...

2.2498/2.31 = 0.9739

So, this code is 97.39% efficient.


Fano coding is a much simpler code than the Huffman code, and is not usually used on its own, because it is generally not as efficient as the Huffman code; however, it is often combined with the Shannon method (to produce Shannon-Fano codes). The main difference, as far as I have found, is that the Shannon method sorts the probabilities, while the Fano method does not.

So, to code using this method, we make two subgroups with almost equiprobable distributions, assign one group a one and the other group a zero, and then subdivide each group, appending ones and zeros to each subgroup's code, continuing to subdivide until only one element is in each group.
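This splitting procedure can be sketched recursively (a minimal sketch; it splits each group at the contiguous cut point whose halves are closest to equiprobable in the given symbol order, which is one of several reasonable ways to form the groups, so it will not necessarily reproduce the hand-picked groups in the example below):

```python
def fano_codes(symbols, probs):
    codes = {}

    def assign(items, prefix):
        # items: list of (symbol, probability) pairs sharing `prefix`
        if len(items) == 1:
            codes[items[0][0]] = prefix
            return
        total = sum(p for _, p in items)
        # Find the contiguous cut whose two halves are most nearly
        # equiprobable.
        best_k, best_diff, running = 1, float("inf"), 0.0
        for k in range(1, len(items)):
            running += items[k - 1][1]
            diff = abs(2 * running - total)   # |first half - second half|
            if diff < best_diff:
                best_k, best_diff = k, diff
        assign(items[:best_k], prefix + "0")  # one group gets a zero
        assign(items[best_k:], prefix + "1")  # the other gets a one

    assign(list(zip(symbols, probs)), "")
    return codes
```

Because every group is eventually split down to single symbols, the resulting code is prefix-free.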

As an example of this:

X = { 'a', 'b', 'c', 'd', 'e', 'f' } (n = 6)

P = { 0.13, 0.1, 0.03, 0.15, 0.18, 0.41 }

S = { 0, 1 } (r = 2)

'a' - 0.13

'b' - 0.1

'c' - 0.03

'd' - 0.15

'e' - 0.18

'f' - 0.41

Now, divide these into two almost equiprobable groups

'a' - 0.13  \
'c' - 0.03   \  0.49
'd' - 0.15   /
'e' - 0.18  /

'b' - 0.1   \  0.51
'f' - 0.41  /

So, the second group is immediately obvious: we assign that group a one, and then assign 'b' a zero and 'f' a one, so we get { 'b' = 10, 'f' = 11 }.

The other group is a bit more complex, and the grouping is less equal
than the previous grouping, but it can be done.

'a' - 0.13  \  0.28
'd' - 0.15  /

'c' - 0.03  \  0.21
'e' - 0.18  /

The first subgroup then becomes obvious: the enclosing group was assigned 0 and this subgroup is assigned 0, and within it 'a' is assigned 0 and 'd' is assigned 1, so we get { 'a' = 000, 'd' = 001 }.

The other subgroup is also now obvious: the enclosing group was assigned 0 and this subgroup is assigned 1, and within it 'c' is assigned 0 and 'e' is assigned 1. So, we get { 'c' = 010, 'e' = 011 }.

Altogether, this produces the answer:

{ 'a' = 000, 'b' = 10, 'c' = 010, 'd' = 001, 'e' = 011, 'f'
= 11 }

So, how efficient is this code?

Efficiency of this code (log is assumed to be base 2, because the coding alphabet consists of two symbols):

Hu = - ( 0.13*log(0.13) + 0.1*log(0.1) + 0.03*log(0.03) + 0.15*log(0.15) + 0.18*log(0.18) + 0.41*log(0.41) )

= - ( -0.3826 + -0.3322 + -0.1518 + -0.4105 + -0.4453 + -0.5274 )

= 2.2498 (on average, there are about two and a quarter bits of information in a single symbol)

recall:

Answer = { 'a' = 000, 'b' = 10, 'c' = 010, 'd' = 001, 'e' = 011, 'f' = 11 }

P = { 0.13, 0.1, 0.03, 0.15, 0.18, 0.41 }

Now...by counting the number of symbols in each code...

L = { 3, 2, 3, 3, 3, 2 }

<L> = 0.13*3 + 0.1*2 + 0.03*3 + 0.15*3 + 0.18*3 + 0.41*2

= 2.49

So, putting these together, we get:

2.2498/(2.49 * log 2), though since log, base 2, of 2 is simply 1...

2.2498/2.49 = 0.9035

So, this code is 90.35% efficient (less than the Huffman code generated for this language above...)
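To double-check the comparison, the arithmetic above can be replayed in a few lines (a minimal sketch, assuming code lengths 3, 4, 4, 3, 3, 1 for the Huffman code and 3, 2, 3, 3, 3, 2 for the Fano code):

```python
import math

P = [0.13, 0.1, 0.03, 0.15, 0.18, 0.41]
Hu = -sum(p * math.log2(p) for p in P)        # entropy, about 2.2498 bits

L_huffman = [3, 4, 4, 3, 3, 1]
L_fano    = [3, 2, 3, 3, 3, 2]

for name, L in (("Huffman", L_huffman), ("Fano", L_fano)):
    avg = sum(p * l for p, l in zip(P, L))    # expected code length <L>
    print(name, round(Hu / avg, 4))           # efficiency
```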