Coding Theory

All of the coding techniques below require some knowledge of Information Theory and Probability/Statistics.  They all use the following variables:

X is a random variable which has as its possible values u1 through un where n is the number of possible values in the source language (the alphabet).  For instance, in English, we have 26 letters, therefore u1 would correspond to 'a', and u26 would correspond to 'z'.  Stated in more mathematical terms, we have:

X = { u1, u2, u3, u4, ... , un }, where n is the number of symbols in the source language.  (For instance, in English, n = 26 and X = { 'a', ..., 'z' }.)

P is a probability distribution of X.
P = { p1, p2, p3, p4, ..., pn }, where pi is the probability of symbol ui.  Each value in the distribution lies between 0 and 1, and the values sum to 1.

S is the set of symbols in the coding alphabet.  Note that r, the number of code symbols, does not have to equal n, the number of symbols in the source alphabet.  It can be far smaller, and in a computer's binary code (r = 2) it is much smaller.
S = { s1, s2, ..., sr }, where r is the number of symbols in the code alphabet.

Every variable-length code has a measure of how well it encodes a language.  This is called the efficiency, and it is computed from the entropy (average information) of the source and the expected length of a code.  For a fixed-length code the calculation is straightforward but tells us little, since every code has the same length; for variable-length codes it is useful for comparing different codes.

A code's efficiency is determined as:

efficiency = Hu / ( <L> * log r )

where Hu is the average information (the entropy, from Shannon's Theory of Information) of the source symbols, <L> is the expected value of L (the set of code lengths, one for each symbol of the alphabet), and r is the number of symbols in the code alphabet.  Both logarithms are taken to the same base.

recall: Hu is defined as:   - Sum[i] ( p(i)*log(p(i)) )
<L> is defined as:  Sum[i] ( p(i)*l(i) )
where p(i) is the probability pi of symbol ui and l(i) is the length of its code.

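The definitions above can be sketched in Python as follows.  This is only an illustration: the function names entropy, expected_length and efficiency are made up for this example, and the entropy is assumed to be measured in bits (log base 2), so the denominator uses log base 2 of r.

```python
import math

def entropy(probs):
    """Hu: average information per source symbol, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def expected_length(probs, lengths):
    """<L>: sum of p(i) * l(i) over all source symbols."""
    return sum(p * l for p, l in zip(probs, lengths))

def efficiency(probs, lengths, r=2):
    """Hu / (<L> * log r), with r the size of the code alphabet."""
    return entropy(probs) / (expected_length(probs, lengths) * math.log2(r))
```

For four equally likely symbols with a fixed two-bit code, efficiency([0.25] * 4, [2] * 4) gives 1.0: the entropy and the expected length are both exactly 2.
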
Huffman Coding

For Huffman coding one creates a binary tree of the source symbols, using the probabilities in P.  This first assumes that the coding alphabet is binary, as it is within the computer; a more general case will be shown after.  The procedure is:
Take the two least probable symbols in P and combine them into a binary tree (right now with only two leaves and a root).  The root is treated as a new symbol whose probability is the sum of the probabilities of the symbols in its tree.  The algorithm then repeats on the new, shorter list until a single tree remains.
Following the construction of the tree, you must traverse the tree from the root in order to find the new code for a given symbol.  Whenever one must take a left branch, one places a zero in the new code, and whenever one must take a right branch one places a one in the new code.
In this manner the code is constructed, and it typically allows common symbols to be represented with fewer code symbols than rare symbols.
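
The procedure above can be sketched in Python as follows.  This is a minimal sketch under stated assumptions: the symbols are plain strings, the probabilities arrive as a dict, and the name huffman_code and the use of heapq as the priority queue are choices made for this example only.

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Build a binary Huffman code from a dict mapping symbol -> probability."""
    tiebreak = count()  # unique counter so tied probabilities never compare trees
    # Heap entries are (probability, tiebreak, tree).  A tree is either a
    # symbol (a leaf) or a (left, right) pair (an internal node).
    heap = [(p, next(tiebreak), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Combine the two least probable entries under a new root whose
        # probability is the sum of the two.
        p1, _, t1 = heapq.heappop(heap)
        p2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tiebreak), (t1, t2)))

    codes = {}

    def walk(tree, prefix):
        if isinstance(tree, tuple):      # internal node
            walk(tree[0], prefix + "0")  # left branch appends a zero
            walk(tree[1], prefix + "1")  # right branch appends a one
        else:                            # leaf
            codes[tree] = prefix or "0"  # a one-symbol alphabet still needs a bit

    walk(heap[0][2], "")
    return codes
```

Which child ends up on the left or the right is arbitrary, so the exact bit patterns can differ from a tree drawn by hand.  As long as no ties occur among the probabilities being merged, the code lengths come out the same.
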

Example:

X = { 'a', 'b', 'c', 'd', 'e', 'f' } (n = 6)
P = { 0.13, 0.1, 0.03, 0.15, 0.18, 0.41 }
S = { 0, 1 } (r = 2)

'a'    'b'    'c'    'd'    'e'    'f'
0.13   0.10   0.03   0.15   0.18   0.41

Here, we can combine 'b' and 'c', because they have probabilities that are less than all the other letters...

'a'      u       'd'    'e'    'f'
0.13    0.13     0.15   0.18   0.41
       /    \
     'b'    'c'

And we can combine 'a' with the combined 'b' and 'c' (u), because they both have probabilities less than all the others...

    u         'd'    'e'    'f'
   0.26       0.15   0.18   0.41
   /  \
 'a'  0.13
      /  \
    'b'  'c'

Here though, because the combination of 'a', 'b' and 'c' has grown larger than all but 'f', we cannot combine it with anything else yet, so we pick on 'd' and 'e', because they have probabilities less than u or 'f'.

    u           v        'f'
   0.26        0.33      0.41
   /  \        /  \
 'a'  0.13   'd'  'e'
      /  \
    'b'  'c'

Here both u and v are lower in probability than 'f', so they can be combined...

          u            'f'
         0.59          0.41
        /    \
    0.26      0.33
    /  \      /  \
 'a'  0.13  'd'  'e'
      /  \
    'b'  'c'

And lastly, there are only 'f' and u left, so they can be combined.

                u
               1.0
              /     \
          0.59       'f'
         /    \
     0.26      0.33
     /  \      /  \
  'a'  0.13  'd'  'e'
       /  \
     'b'  'c'

We must go to the left three times to get to 'a', so the code for 'a' is 000.
...twice to the left, once to the right and finally once to the left to get to 'b', so the code for 'b' is 0010.
...twice to the left and twice to the right for 'c', so the code for 'c' is 0011.
...once to the left, once to the right and once to the left for 'd', so the code for 'd' is 010.
...once to the left and twice to the right for 'e', so the code for 'e' is 011.
...and only once to the right for 'f', so the code for 'f' is 1.
Now, look at the tree.  'f', with a probability of 0.41 (nearly half), takes up one half of the entire tree on its own.  'd' and 'e', with a combined probability of 0.33, take up half of what remains.  As we continue down the tree, we see that the more common a symbol is, the closer it sits to the root and the shorter its code.  Rare symbols sit deeper and get longer codes, which costs little because they are needed only infrequently.

Answer = { 'a' = 000, 'b' = 0010, 'c' = 0011, 'd' = 010, 'e' = 011, 'f' = 1 }

Efficiency of this code (log is assumed to be base 2, because the coding alphabet consists of two symbols):
Hu = - ( 0.13*log(0.13) + 0.1*log(0.1) + 0.03*log(0.03) + 0.15*log(0.15) + 0.18*log(0.18) + 0.41*log(0.41) )
= - ( -0.3826 + -0.3322 + -0.1518 + -0.4105 + -0.4453 + -0.5274 )
= 2.2498        (on average, each source symbol carries a little over two bits of information)

recall:
Answer = { 'a' = 000, 'b' = 0010, 'c' = 0011, 'd' = 010, 'e' = 011, 'f' = 1 }
P      = { 0.13, 0.1, 0.03, 0.15, 0.18, 0.41 }
Now...by counting the number of symbols in each code...
L      = { 3, 4, 4, 3, 3, 1 }
<L> = 0.13*3 + 0.1*4 + 0.03*4 + 0.15*3 + 0.18*3 + 0.41*1
= 2.31

So, putting these together, we get:
2.2498/(2.31 * log 2), though since log, base 2, of 2 is simply 1...
2.2498/2.31 = 0.9739
So, this code is about 97.4% efficient.

Shannon Coding

Under Construction...

Fano Coding

This is a much simpler code than the Huffman code, and is not usually used on its own, because it is generally not as efficient as the Huffman code.  It is, however, generally combined with the Shannon method (to produce Shannon-Fano codes).  The main difference, as far as I have found, is that the Shannon method sorts the probabilities, while the Fano method as presented here does not.
So, to code using this method we make two subgroups with almost equal total probability, assign one group a one and the other group a zero, and then subdivide each group, appending ones and zeros to each subgroup's code, and continue subdividing until only one element is in each group.
As an example of this:

X = { 'a', 'b', 'c', 'd', 'e', 'f' } (n = 6)
P = { 0.13, 0.1, 0.03, 0.15, 0.18, 0.41 }
S = { 0, 1 } (r = 2)

'a'  - 0.13
'b'  - 0.1
'c'  - 0.03
'd'  - 0.15
'e'  - 0.18
'f'   - 0.41

Now, divide these into two almost equiprobable groups:

'a'  - 0.13   \
'c'  - 0.03    \    0.49
'd'  - 0.15    /
'e'  - 0.18   /

'b'  - 0.1    \    0.51
'f'  - 0.41   /

So, the second group is immediately obvious: we assign that group a one, and then within it assign 'b' a zero and 'f' a one, so we get { 'b' = 10, 'f' = 11 }

The other group is a bit more complex, and the split is less even than the previous one, but it can be done.
'a'  - 0.13   \    0.28
'd'  - 0.15   /
'c'  - 0.03   \    0.21
'e'  - 0.18   /

The first subgroup is now straightforward: its parent group was assigned 0 and the subgroup itself is assigned 0.  Within it, 'a' is assigned 0 and 'd' is assigned 1, so we get { 'a' = 000, 'd' = 001 }

The other subgroup is handled the same way: its parent group was assigned 0 and the subgroup itself is assigned 1.  Within it, 'c' is assigned 0 and 'e' is assigned 1, so we get { 'c' = 010, 'e' = 011 }

{ 'a' = 000, 'b' = 10, 'c' = 010, 'd' = 001, 'e' = 011, 'f' = 11 }
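
The splitting procedure can be sketched in Python as follows.  This is a brute-force illustration only, and several details are assumptions made for the example: the name fano_code, the rule that the first symbol of a group always lands in the zero group, and the exhaustive search over subsets (practical only for small alphabets).

```python
from itertools import combinations

def fano_code(probs):
    """Assign binary codes by repeatedly splitting into two near-equal groups."""
    codes = {sym: "" for sym in probs}

    def split(group):
        if len(group) < 2:
            return
        first, rest = group[0], group[1:]
        half = sum(probs[s] for s in group) / 2
        best, best_gap = None, None
        # The zero group always contains the first symbol.  Try every set of
        # companions that still leaves at least one symbol for the one group.
        for k in range(len(rest)):
            for extra in combinations(rest, k):
                total = probs[first] + sum(probs[s] for s in extra)
                gap = abs(total - half)
                if best_gap is None or gap < best_gap:
                    best, best_gap = (first,) + extra, gap
        zeros = [s for s in group if s in best]
        ones = [s for s in group if s not in best]
        for s in zeros:
            codes[s] += "0"
        for s in ones:
            codes[s] += "1"
        split(zeros)
        split(ones)

    split(list(probs))
    return codes
```

Called on the example distribution with the symbols in the order 'a' through 'f', this sketch reproduces the table above: the first split separates 'b' and 'f' (0.51) from the rest (0.49), and the second separates 'a' and 'd' from 'c' and 'e'.
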

So, how efficient is this code?

Efficiency of this code (log is assumed to be base 2, because the coding alphabet consists of two symbols):
Hu = - ( 0.13*log(0.13) + 0.1*log(0.1) + 0.03*log(0.03) + 0.15*log(0.15) + 0.18*log(0.18) + 0.41*log(0.41) )
= - ( -0.3826 + -0.3322 + -0.1518 + -0.4105 + -0.4453 + -0.5274 )
= 2.2498        (on average, each source symbol carries a little over two bits of information)

recall:
Answer = { 'a' = 000, 'b' = 10, 'c' = 010, 'd' = 001, 'e' = 011, 'f' = 11 }
P      = { 0.13, 0.1, 0.03, 0.15, 0.18, 0.41 }
Now...by counting the number of symbols in each code...
L      = { 3, 2, 3, 3, 3, 2 }
<L> = 0.13*3 + 0.1*2 + 0.03*3 + 0.15*3 + 0.18*3 + 0.41*2
= 2.49

So, putting these together, we get:
2.2498/(2.49 * log 2), though since log, base 2, of 2 is simply 1...
2.2498/2.49 = 0.9035
So, this code is about 90.4% efficient (less than the Huffman code generated for this language above...)
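
In both code tables above, no code is the beginning of another code, so an encoded message can be read back one bit at a time without any separators.  The following Python sketch illustrates this with the Fano table; the names encode, decode and FANO are made up for this example.

```python
# Fano code table from the example above.
FANO = {"a": "000", "b": "10", "c": "010", "d": "001", "e": "011", "f": "11"}

def encode(message, codes):
    """Concatenate the code of every symbol in the message."""
    return "".join(codes[sym] for sym in message)

def decode(bits, codes):
    """Read bits left to right, emitting a symbol whenever a full code is seen."""
    lookup = {word: sym for sym, word in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:
            out.append(lookup[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code")
    return "".join(out)
```

For instance, encode("face", FANO) gives "11000010011", and decode("11000010011", FANO) gives back "face".
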