Winnow Algorithm
----------------

We now turn to an algorithm that is something like a combination of the Weighted Majority algorithm and the Perceptron algorithm. Consider it in the following context: say we are trying to learn an OR function. We have already seen an algorithm ("list all features and cross off the bad ones on negative examples") that makes at most n mistakes. But what if most features are irrelevant? What if the target is an OR of r relevant features, where r is much smaller than n? Can we get a better bound in that case? Winnow will give us a bound of O(r log n) mistakes.

Algorithm (a simple version):

1. Initialize the weights w_1, ..., w_n of the variables to 1.

2. Given an example x = (x_1, ..., x_n), output + if w_1x_1 + w_2x_2 + ... + w_nx_n >= n, and output - otherwise.

3. If the algorithm makes a mistake:

   (a) If the algorithm predicts negative on a positive example, then for each x_i equal to 1, double the value of w_i.

   (b) If the algorithm predicts positive on a negative example, then for each x_i equal to 1, cut the value of w_i in half.

4. Repeat (go to step 2).

(A short code sketch of this simple version appears after the proof.)

THEOREM: The Winnow Algorithm learns the class of disjunctions in the Mistake Bound model, making at most $2 + 3r(1 + \lg n)$ mistakes when the target concept is an OR of $r$ variables.

PROOF: Let us first bound the number of mistakes that will be made on positive examples. Any mistake made on a positive example must double at least one of the weights in the target function (the {\em relevant} weights), and a mistake made on a negative example will {\em not} halve any of these weights, by definition of a disjunction. Furthermore, each of these weights can be doubled at most $1 + \lg n$ times, since only weights that are less than $n$ can ever be doubled. Therefore, Winnow makes at most $r(1 + \lg n)$ mistakes on positive examples.

Now we bound the number of mistakes made on negative examples. The total weight, summed over all the variables, is initially $n$. Each mistake made on a positive example increases the total weight by at most $n$ (since before doubling, we must have had $w_1x_1 + \ldots + w_nx_n < n$). On the other hand, each mistake made on a negative example decreases the total weight by at least $n/2$ (since before halving, we must have had $w_1x_1 + \ldots + w_nx_n \geq n$). The total weight never drops below zero, so if $M_+$ and $M_-$ denote the numbers of mistakes on positive and negative examples respectively, then $n + nM_+ - (n/2)M_- \geq 0$, which gives $M_- \leq 2 + 2M_+$. That is, the number of mistakes made on negative examples is at most twice the number of mistakes made on positive examples, plus 2: at most $2 + 2r(1 + \lg n)$. Adding this to the bound on the number of mistakes on positive examples yields the theorem. \qed
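
To make the update rule concrete, here is a minimal sketch in Python of the simple version above. The names (WinnowLearner, predict, update) and the small demonstration target are illustrative choices, not part of the notes; the sketch simply follows steps 1-4: threshold at n, double the active weights on a missed positive, halve them on a false positive.

    # Minimal sketch of the simple Winnow update described above.
    # Class/function names and the demo target are illustrative, not from the notes.
    import random


    class WinnowLearner:
        """Simple Winnow for learning monotone disjunctions over {0,1}^n."""

        def __init__(self, n):
            self.n = n
            self.w = [1.0] * n  # step 1: initialize all weights to 1

        def predict(self, x):
            # step 2: predict + exactly when the weighted sum reaches the threshold n
            return sum(wi * xi for wi, xi in zip(self.w, x)) >= self.n

        def update(self, x, label):
            """Process one labeled example; return True if a mistake was made."""
            pred = self.predict(x)
            if pred == label:
                return False
            if label:                    # step 3(a): predicted - on a positive example
                for i, xi in enumerate(x):
                    if xi == 1:
                        self.w[i] *= 2.0  # double the weights of active variables
            else:                        # step 3(b): predicted + on a negative example
                for i, xi in enumerate(x):
                    if xi == 1:
                        self.w[i] /= 2.0  # halve the weights of active variables
            return True


    if __name__ == "__main__":
        random.seed(0)
        n = 100
        relevant = [3, 17, 42]           # hypothetical target: x_3 OR x_17 OR x_42, so r = 3
        learner = WinnowLearner(n)
        mistakes = 0
        for _ in range(2000):
            x = [random.randint(0, 1) for _ in range(n)]
            label = any(x[i] == 1 for i in relevant)
            mistakes += learner.update(x, label)
        # By the theorem, mistakes <= 2 + 3r(1 + lg n) no matter how the examples are chosen.
        print("total mistakes:", mistakes)

In the demo, the mistake count stays far below n even though the learner never sees which of the 100 variables are relevant, which is exactly the advantage the theorem promises when r is much smaller than n.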