Entropy
The information theoretic notion of Entropy is a generalization of the
physical notion. There are many ways to describe Entropy. It is a
measure of the randomness of a random variable. It is also a measure
of the amount of information a random variable or stochastic process
contains. It is also a lower bound on the amount a message can be
compressed. And finally it is the average number of yes/no questions
that need to be asked about an random entity to determine its value.
Entropy and related Information Theoretic metrics are used extensively
in Artificial Intelligence applications that do stochastic modeling
such as speech recognition, pattern recognition, medical diagnotics
and financial modeling. They are useful not only because they measure
randomness but also because they can be used to determine questions
with the highest information content and show the best way to decrease
Entropy through successive querrys.
Another example of the use of Entropy is in compression. We know the
Entropy of English is around 1 bit per letter which means that when we
communicate, the redundancy factor is around 26. This suggests that
we should be able to compress English to a high degree which in fact
we can do. If we ever reach compression rates of 26, we should stop,
pat ourselves on the back, and move on to other problems because we can
do no better.
Let's look briefly at the formula for Entropy through an example. The
problem we consider is how many times a deck of cards must be shuffled
to ensure complete randomness. First we present the equation for
Entropy: it is the sum over all values of a rv of the probability of
that value times the log of that prob(i.e. p(x)logp(x)). This
equation can be derived from first principles of the properties of
information.
Now since all permuations have equal probability in a random deck of
cards, the entropy of that deck is log52! = 225.6 bits. When we
shuffle a deck of cards, that shuffle has entropy equivalent to log(52
choose 26) = 48.8 bits (we assume the deck is divided in half and a
"rifle" shuffle is used). This means we should use a "rifle" shuffle
225.6/48.8 = 4.6 or 5 times on average to assure complete randomness.
This computation is relatively simple because the probabilities of all
events are assumed to be equal.
Kolmogorov Complexity
Kolmogorov Complexity is a measure of the intrinsic descriptive
complexity of an object invented suprisingly enough by the
mathematician Kolmogorov. The Kolmogorov complexity of an object is
defined to be the length of the shortest binary binary computer
program that descibes that object. It has been proven that this
definition of complexity is essentially computer independent so for
convenience, we assume these programs are run on a Turing Machine.
Entropy and Kolmogorov complexity are related because the shortest
binary description of a random variable is about equal to the Entropy
of that random variable.
Let's look at some examples of Kolmogorov complexity. Kolmogorov
complexity(abbreviated from here on as KC) may be independent of
length. For example, the KC of pi is quite small as is the KC of a
fractal. The KC of the Mona Lisa is its size divided by the amount it
can be compressed by. These objects have small KC since they are
highly ordered. In fact, most numbers have KC which is essentially
equal to their length. The probability that a random object can be
compressed in this way by more that k bits is 2^-k. One particularly
interesting number with high KC is known as Omega.
Omega is equal to the probability that a randomly chosen Turing
Machine will halt. Omega is known as the philosopher's stone because
if we knew Omega, in some sense, we would have all the mathematical
knowledge that the human race can ever hope to possess.
Unfortunately, since the halting problem is incomputable, so is Omega
and hence its KC is infinite. This seems reasonable since it contains
an infinite amount of information.
Let's see how we can use Omega to find the truth or falsity of any
theorem. First consider an arbitrary theorem of size n. We run all
possible Turing machine in parallel(this may seem difficult but it is
possible) until the sum of the probability masses contributed by the
ones that halt exceeds the n bit approximation of Omega. At this
point, we know all TMs of length less than n that have not halted will
never halt. There exists some TM of length less than n which searches
for a proof of our desired theorem. If this TM has halted, the
theorem is true and provable. If the TM searching for a
counter-example has halted, the theorem is false. If neither halts,
the theorem is unprovable either way.
That great computer in the sky
Let's end this section with a few practical, broad reaching
implications of Information Theory. First we need to consider the
quantity known as the universal probability of an object written as
Pu(x). The universal probability is defined to be the probability
that a randomly drawn "program" outputs that object. This probability
is again provably independent of the type of computer so we limit our
discussion to Turing Machines. Pu(x) and KC(x) have a very elegant
relationship in that Pu(x) is approximately equal to 2^-KC(x).
Consider a monkey typing randomly at a typewriter versus a monkey at a
computer. It can be shown that the monkey at a computer is more
likely to produce more ordered or "interesting" results. Consider the
probability that either monkey will produce all the works of
Shakespeare. Assuming the text is 1 million bits long, the monkey at
a typewriter has probability about 2^-1,000,000 of producing the text.
The monkey at the computer has probability 2^-KC(Shakespeare) or
2^-250,000 of producing the text assuming the Shakespeare is
conpressable by four times. Neither of these probabilities is large
but one is still many, many orders of magnitude more likely than the
other. In this sense, we can say that a computer is an amplifier of
order, information or "sense".
Assume the universe is an infinitely fast, massively parallel
computer. This is not an unusual assumption(certainly people have
assumed the universe was stranger things) since the idea of an
algorithm is basically a *generalization* of a law of Physics. In
some sense each object in the universe is a processing unit. While
each object obeys fairly basic laws(or algorithms) in relation to
other objects, the net effect of the interactions among objects can
lead to fairly complicated results.
We know there are certain laws of Physics which govern the universe,
let's just couch those laws as algorithms. For example the algorithm
of gravity states that to find the gravitational pull on an object,
find the gravitational pull exerted by all other objects in the
universe (we know the "processing unit" that is that object is fast
enough to do this so that it seems instantaneous) and process these
individual forces to get the net force. (A more advanced physical
idea-and one with less Kolmogorov complexity- is to compute the
warping effect each object has on the space time continuum). The
processing units have access to random numbers so that they can handle
quantum mechanics.
It is a fact that any massively parallel computer can be simulated on
a regular computer. While parallelism may increase the speed of a
computer, it does not increase the computational power. Hence any
massively parallel computer is Turing complete and thus we can
postulate that the universe is Turing complete. So from this point
on, for simplicities sake, we assume that the universe is a Turing
machine.
Why are certain algorithms of physics chosen and not others? For
example why are energy and mass related by E=MC^2 and not E=MC^3. Are
these algorithms chosen randomly? What sort of computer is this
anyway? Who, if anyone, is programming it?
First of all, we provide a justification for Occams razor by showing
that no matter what kind of computer it is and how it is programmed,
all other things being equal, simpler theories are more likely to
explain phenomena then more complex ones. First assume we are
considering two theories A and B. A is a very simple gravitational
theory that has Kolmogorov Complexity 1,000 whereas B assumes that
gravitational force depends on all kinds of strange interactions and
so has Kolmogorov Complexity 10,000. This means Pu(A) is about
2^-1,000 and Pu(B) is about 2^-10,000. Since the universal
probability is the same up to a multiplicative constant for any type
of computer, we should assume that theory A is more likely.
Does this computer have a master program to create and maintain human
life and are all the algorithms of Physics chosen to fulfil this goal?
Are there further goals such as rewarding the good and punishing the
wicked(or vice versa)? The fact is we can never know this for sure
just by studying the physical world. In other words, even as we learn
more and more laws (or algorithms) of nature, the likelihood ratio
that the universe is run by a computer programmed by a monkey to that
it is run along some other type of plan will never approach zero or
infinity(unlike most other simple hypothesis testing problems). In
this sense, we will never be able to reject the possibility that the
universe is the output of a monkey typing at a computer. On a
brighter note, we will never be able to completely accept this
hypothesis either.
See Elements of Information Theory by Thomas and Cover for a more
rigorous coverage of these topics.