Theoretical Computer Science and You

Entropy

The information theoretic notion of Entropy is a generalization of the physical notion. There are many ways to describe Entropy. It is a measure of the randomness of a random variable. It is also a measure of the amount of information a random variable or stochastic process contains. It is also a lower bound on the average number of bits to which a message can be compressed. And finally, it is the average number of yes/no questions that need to be asked about a random entity to determine its value.

Entropy and related Information Theoretic metrics are used extensively in Artificial Intelligence applications that do stochastic modeling, such as speech recognition, pattern recognition, medical diagnostics and financial modeling. They are useful not only because they measure randomness but also because they can be used to determine which questions carry the highest information content and show the best way to decrease Entropy through successive queries.
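To make the "most informative question" idea concrete, here is a minimal sketch in Python. The diagnoses, prior probabilities, and test behaviors are made-up toy numbers, not real medical data; the point is only that the best yes/no query is the one whose answer is expected to remove the most Entropy from our belief.

    from math import log2

    def entropy(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    # Toy prior belief over four hypothetical diagnoses.
    prior = {"flu": 0.4, "cold": 0.4, "strep": 0.1, "allergy": 0.1}

    def information_gain(test):
        """Expected Entropy reduction from a yes/no test, where
        test[d] = P(answer is 'yes' | diagnosis d)."""
        p_yes = sum(prior[d] * test[d] for d in prior)
        post_yes = [prior[d] * test[d] / p_yes for d in prior]
        post_no = [prior[d] * (1 - test[d]) / (1 - p_yes) for d in prior]
        expected_post = p_yes * entropy(post_yes) + (1 - p_yes) * entropy(post_no)
        return entropy(list(prior.values())) - expected_post

    # Two hypothetical tests; the more discriminative one has the larger gain.
    fever_test = {"flu": 0.9, "cold": 0.2, "strep": 0.8, "allergy": 0.05}
    sneeze_test = {"flu": 0.5, "cold": 0.6, "strep": 0.4, "allergy": 0.6}
    print(information_gain(fever_test), information_gain(sneeze_test))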

Another example of the use of Entropy is in compression. We know the Entropy of English is around 1 bit per letter, while a uniformly random letter from a 26-letter alphabet would carry log2(26), or about 4.7 bits, so the redundancy factor is roughly 4.7 (closer to 8 if each letter is stored as an 8-bit character). This suggests that we should be able to compress English to a high degree, which in fact we can do. If we ever reach compression ratios near that limit, we should stop, pat ourselves on the back, and move on to other problems because we can do no better.
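As a rough, informal check of this claim, one can feed a long plain-ASCII English text to a general-purpose compressor and compare the result against the ~1 bit per letter estimate. The filename below is just a placeholder for any public-domain book you happen to have; typical compressors land well above 1 bit per character, which is consistent with the entropy bound but short of it.

    import zlib

    with open("english_sample.txt", "rb") as f:   # placeholder filename
        data = f.read()

    compressed = zlib.compress(data, 9)
    ratio = len(data) / len(compressed)
    bits_per_char = 8 * len(compressed) / len(data)
    print(f"compression ratio {ratio:.2f}, about {bits_per_char:.2f} bits per character")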

Let's look briefly at the formula for Entropy through an example. The problem we consider is how many times a deck of cards must be shuffled to ensure complete randomness. First we present the equation for Entropy: it is the negative of the sum, over all values of a random variable, of the probability of that value times the log of that probability, i.e. H(X) = -sum_x p(x) log2 p(x). This equation can be derived from first principles from the properties of information.
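Here is a minimal sketch of that formula in Python, checked against the yes/no-questions reading of Entropy: a fair eight-sided die needs exactly three questions, while a heavily biased coin needs far fewer on average. (The example distributions are arbitrary.)

    from math import log2

    def entropy(probabilities):
        return -sum(p * log2(p) for p in probabilities if p > 0)

    print(entropy([1/8] * 8))      # 3.0 bits: three yes/no questions
    print(entropy([0.99, 0.01]))   # about 0.08 bits: almost no uncertainty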

Now since all permutations have equal probability in a random deck of cards, the entropy of that deck is log2(52!) = 225.6 bits. When we shuffle a deck of cards, that shuffle has entropy equivalent to log2(52 choose 26) = 48.8 bits (we assume the deck is divided in half and a "riffle" shuffle is used). This means we must use a "riffle" shuffle at least 225.6/48.8 = 4.6, or about 5, times before complete randomness is even possible. This computation is relatively simple because the probabilities of all events are assumed to be equal.
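The numbers above can be recomputed directly in a few lines of Python (math.comb requires Python 3.8 or later):

    from math import comb, factorial, log2, ceil

    deck_entropy = log2(factorial(52))     # ~225.6 bits for a fully random deck
    shuffle_entropy = log2(comb(52, 26))   # ~48.8 bits contributed by one riffle shuffle
    print(deck_entropy, shuffle_entropy, ceil(deck_entropy / shuffle_entropy))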

Kolmogorov Complexity

Kolmogorov Complexity is a measure of the intrinsic descriptive complexity of an object, invented surprisingly enough by the mathematician Kolmogorov. The Kolmogorov complexity of an object is defined to be the length of the shortest binary computer program that describes that object. It has been proven that this definition of complexity is essentially computer independent, so for convenience we assume these programs are run on a Turing Machine. Entropy and Kolmogorov complexity are related because the expected length of the shortest binary description of a random variable is approximately equal to the Entropy of that random variable.
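Kolmogorov complexity itself is not computable, but any compressor gives an upper bound on it: the compressed form (plus the fixed-size decompressor) is one valid description of the object. Here is a small sketch of that idea using zlib; the inputs are just illustrative examples.

    import os
    import zlib

    def kc_upper_bound_bits(data: bytes) -> int:
        """Upper bound on KC(data) in bits, ignoring the constant-size decompressor."""
        return 8 * len(zlib.compress(data, 9))

    print(kc_upper_bound_bits(b"ab" * 10_000))      # highly ordered: a few hundred bits
    print(kc_upper_bound_bits(os.urandom(20_000)))  # random bytes: roughly the original 160,000 bits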

Let's look at some examples of Kolmogorov complexity. Kolmogorov complexity (abbreviated from here on as KC) may be independent of length. For example, the KC of pi is quite small, as is the KC of a fractal. The KC of the Mona Lisa is roughly its size divided by the factor by which it can be compressed. These objects have small KC since they are highly ordered. In contrast, most numbers have KC essentially equal to their length: the probability that a random object can be compressed in this way by more than k bits is 2^-k. One particularly interesting number with high KC is known as Omega.
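The 2^-k figure comes from a simple counting argument, sketched below with arbitrary example values of n and k: there are 2^n strings of length n but fewer than 2^(n-k) descriptions shorter than n-k bits, so at most a 2^-k fraction of strings can be compressed by more than k bits.

    n, k = 64, 10
    num_strings = 2 ** n
    num_short_descriptions = 2 ** (n - k) - 1    # all descriptions shorter than n-k bits
    print(num_short_descriptions / num_strings)  # just under 2**-10, about 0.00098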

Omega is equal to the probability that a randomly chosen Turing Machine will halt. Omega is known as the philosopher's stone because if we knew Omega, in some sense, we would have all the mathematical knowledge that the human race can ever hope to possess. Unfortunately, since the halting problem is incomputable, so is Omega and hence its KC is infinite. This seems reasonable since it contains an infinite amount of information.

Let's see how we can use Omega to find the truth or falsity of any theorem. First consider an arbitrary theorem of size n. We run all possible Turing Machines in parallel (this may seem difficult but it is possible) until the sum of the probability masses contributed by the ones that halt exceeds the n-bit approximation of Omega. At this point, we know all TMs of length less than n that have not halted will never halt. There exists some TM of length less than n which searches for a proof of our desired theorem. If this TM has halted, the theorem is true and provable. If the TM searching for a counter-example has halted, the theorem is false. If neither halts, the theorem is unprovable either way.
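Here is a toy sketch of the dovetailing step of that argument. Everything in it is made up for illustration: the "programs", their bit lengths, and the threshold stand in for the real (uncomputable) n-bit prefix of Omega, so this only shows the mechanics of running everything in parallel and accumulating halting mass until the threshold is crossed.

    from fractions import Fraction

    # Toy "programs": (length in bits, generator that yields once per step
    # and returns when the program halts; an infinite loop never returns).
    def halts_after(steps):
        def prog():
            for _ in range(steps):
                yield
        return prog

    def loops_forever():
        def prog():
            while True:
                yield
        return prog

    programs = {
        "p1": (2, halts_after(3)),
        "p2": (3, halts_after(10)),
        "p3": (3, loops_forever()),
    }

    def dovetail(programs, omega_approximation):
        """Run all programs one step at a time until the halting mass seen so
        far exceeds the given approximation of Omega; in the real argument,
        any program still running at that point is known never to halt."""
        running = {name: (length, make()) for name, (length, make) in programs.items()}
        mass = Fraction(0)
        while running and mass <= omega_approximation:
            for name in list(running):
                length, gen = running[name]
                try:
                    next(gen)              # one more step of this program
                except StopIteration:      # the program halted
                    mass += Fraction(1, 2 ** length)
                    del running[name]
        return mass, set(running)

    # With a threshold just below the toy halting mass 1/4 + 1/8, the loop
    # stops once both halting programs have halted, leaving p3 flagged.
    print(dovetail(programs, Fraction(1, 4) + Fraction(1, 8) - Fraction(1, 100)))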

That great computer in the sky

Let's end this section with a few practical, broad reaching implications of Information Theory. First we need to consider the quantity known as the universal probability of an object x, written Pu(x). The universal probability is defined to be the probability that a randomly drawn "program" outputs that object. This probability is again provably independent of the type of computer, so we limit our discussion to Turing Machines. Pu(x) and KC(x) have a very elegant relationship in that Pu(x) is approximately equal to 2^-KC(x).

Consider a monkey typing randomly at a typewriter versus a monkey at a computer. It can be shown that the monkey at a computer is more likely to produce more ordered or "interesting" results. Consider the probability that either monkey will produce all the works of Shakespeare. Assuming the text is 1 million bits long, the monkey at a typewriter has probability about 2^-1,000,000 of producing the text. The monkey at the computer has probability 2^-KC(Shakespeare), or about 2^-250,000, of producing the text, assuming the text of Shakespeare can be compressed by a factor of four. Neither of these probabilities is large but one is still many, many orders of magnitude more likely than the other. In this sense, we can say that a computer is an amplifier of order, information or "sense".
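Because both probabilities underflow any floating-point representation, the comparison is easiest in log space; a couple of lines of Python make the gap explicit (the exponents are the rough figures assumed above):

    log2_p_typewriter = -1_000_000   # ~2^-1,000,000 for typing the raw text
    log2_p_computer = -250_000       # ~2^-KC(text), assuming ~4x compressibility
    print(f"the computer monkey is about 2^{log2_p_computer - log2_p_typewriter} times more likely")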

Assume the universe is an infinitely fast, massively parallel computer. This is not an unusual assumption (certainly people have assumed the universe was stranger things) since the idea of an algorithm is basically a *generalization* of a law of Physics. In some sense each object in the universe is a processing unit. While each object obeys fairly basic laws (or algorithms) in relation to other objects, the net effect of the interactions among objects can lead to fairly complicated results.

We know there are certain laws of Physics which govern the universe; let's just couch those laws as algorithms. For example, the algorithm of gravity states that to find the gravitational pull on an object, find the gravitational pull exerted by all other objects in the universe (we know the "processing unit" that is that object is fast enough to do this so that it seems instantaneous) and process these individual forces to get the net force. (A more advanced physical idea, and one with lower Kolmogorov complexity, is to compute the warping effect each object has on the space time continuum.) The processing units have access to random numbers so that they can handle quantum mechanics.
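As a minimal sketch of the "gravity algorithm" just described, here is the pairwise-summation step in Python. The bodies, masses, and unit system (G = 1) are made-up toy values; the point is only the shape of the computation each "processing unit" performs.

    G = 1.0  # toy units

    def net_force(i, bodies):
        """Net Newtonian gravitational force on bodies[i], where each body is (mass, (x, y))."""
        mi, (xi, yi) = bodies[i]
        fx = fy = 0.0
        for j, (mj, (xj, yj)) in enumerate(bodies):
            if j == i:
                continue
            dx, dy = xj - xi, yj - yi
            r2 = dx * dx + dy * dy
            f = G * mi * mj / r2      # magnitude of the pairwise force
            r = r2 ** 0.5
            fx += f * dx / r          # components pointing toward body j
            fy += f * dy / r
        return fx, fy

    bodies = [(1.0, (0.0, 0.0)), (2.0, (1.0, 0.0)), (0.5, (0.0, 2.0))]
    print(net_force(0, bodies))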

It is a fact that any massively parallel computer can be simulated on a regular computer. While parallelism may increase the speed of a computer, it does not increase its computational power. Hence any massively parallel computer is computationally equivalent to a Turing machine, and thus we can postulate that the universe is as well. So from this point on, for simplicity's sake, we assume that the universe is a Turing machine.
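A tiny sketch of that simulation claim: below, N "processing units" (cells of a toy one-dimensional cellular automaton, an arbitrary stand-in for parallel hardware) all update simultaneously in the parallel model, while the serial program simply loops over them, computing the same global step in N small steps. Nothing new becomes computable; it is just slower.

    RULE = 110  # an arbitrary local update rule

    def parallel_step(cells):
        """One synchronous update of every cell, simulated with a serial loop."""
        n = len(cells)
        nxt = []
        for i in range(n):  # the serial loop stands in for n parallel units
            left, mid, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
            pattern = (left << 2) | (mid << 1) | right
            nxt.append((RULE >> pattern) & 1)
        return nxt

    cells = [0] * 15 + [1] + [0] * 15
    for _ in range(5):
        print("".join(".#"[c] for c in cells))
        cells = parallel_step(cells)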

Why are certain algorithms of physics chosen and not others? For example, why are energy and mass related by E=MC^2 and not E=MC^3? Are these algorithms chosen randomly? What sort of computer is this anyway? Who, if anyone, is programming it?

First of all, we provide a justification for Occam's razor by showing that no matter what kind of computer it is and how it is programmed, all other things being equal, simpler theories are more likely to explain phenomena than more complex ones. Suppose we are considering two theories, A and B. A is a very simple gravitational theory that has Kolmogorov Complexity 1,000, whereas B assumes that gravitational force depends on all kinds of strange interactions and so has Kolmogorov Complexity 10,000. This means Pu(A) is about 2^-1,000 and Pu(B) is about 2^-10,000. Since the universal probability is the same up to a multiplicative constant for any type of computer, we should assume that theory A is more likely.

Does this computer have a master program to create and maintain human life, and are all the algorithms of Physics chosen to fulfill this goal? Are there further goals, such as rewarding the good and punishing the wicked (or vice versa)? The fact is we can never know this for sure just by studying the physical world. In other words, even as we learn more and more laws (or algorithms) of nature, the likelihood ratio that the universe is run by a computer programmed by a monkey to that it is run along some other type of plan will never approach zero or infinity (unlike in most other simple hypothesis testing problems). In this sense, we will never be able to reject the possibility that the universe is the output of a monkey typing at a computer. On a brighter note, we will never be able to completely accept this hypothesis either.

See Elements of Information Theory by Cover and Thomas for a more rigorous treatment of these topics.