Winnow Algorithm
----------------

We now turn to an algorithm that is something like a combination of the Weighted Majority algorithm and the Perceptron algorithm. Consider it in the following context: say we are trying to learn an OR function. We have already seen an algorithm ("list all features and cross off the bad ones on negative examples") that makes at most n mistakes. But what if most features are irrelevant? What if the target is an OR of r relevant features, where r is much smaller than n? Can we get a better bound in that case? Winnow will give us a bound of O(r log n) mistakes.

Algorithm (a simple version):

1. Initialize the weights w_1, ..., w_n of the variables to 1.

2. Given an example x = (x_1, ..., x_n), output + if w_1x_1 + w_2x_2 + ... + w_nx_n >= n, and output - otherwise.

3. If the algorithm makes a mistake:

   (a) If the algorithm predicts negative on a positive example, then for each x_i equal to 1, double the value of w_i.

   (b) If the algorithm predicts positive on a negative example, then for each x_i equal to 1, cut the value of w_i in half.

4. Repeat (go to step 2).

(A short code sketch of this simple version appears after the proof.)

THEOREM: The Winnow Algorithm learns the class of disjunctions in the Mistake Bound model, making at most $2 + 3r(1 + \lg n)$ mistakes when the target concept is an OR of $r$ variables.

PROOF: Let us first bound the number of mistakes that will be made on positive examples. Any mistake made on a positive example must double at least one of the weights in the target function (the {\em relevant} weights), and a mistake made on a negative example will {\em not} halve any of these weights, by definition of a disjunction. Furthermore, each of these weights can be doubled at most $1 + \lg n$ times, since only weights that are less than $n$ can ever be doubled. Therefore, Winnow makes at most $r(1 + \lg n)$ mistakes on positive examples.

Now we bound the number of mistakes made on negative examples. The total weight, summed over all the variables, is initially $n$. Each mistake made on a positive example increases the total weight by at most $n$ (since before doubling, we must have had $w_1x_1 + \ldots + w_nx_n < n$). On the other hand, each mistake made on a negative example decreases the total weight by at least $n/2$ (since before halving, we must have had $w_1x_1 + \ldots + w_nx_n \geq n$). The total weight never drops below zero, so if $M_+$ and $M_-$ denote the numbers of mistakes on positive and negative examples respectively, then $n + nM_+ - (n/2)M_- \geq 0$, which gives $M_- \leq 2 + 2M_+$. That is, the number of mistakes made on negative examples is at most twice the number of mistakes made on positive examples, plus 2: at most $2 + 2r(1 + \lg n)$. Adding this to the bound on the number of mistakes on positive examples yields the theorem. \qed
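
To make the update rule concrete, here is a minimal sketch in Python of the simple version above. The names (WinnowLearner, predict, update) and the small demonstration target are illustrative choices, not part of the notes; the sketch simply follows steps 1-4: threshold at n, double the active weights on a missed positive, halve them on a false positive.

    # Minimal sketch of the simple Winnow update described above.
    # Class/function names and the demo target are illustrative, not from the notes.
    import random


    class WinnowLearner:
        """Simple Winnow for learning monotone disjunctions over {0,1}^n."""

        def __init__(self, n):
            self.n = n
            self.w = [1.0] * n  # step 1: initialize all weights to 1

        def predict(self, x):
            # step 2: predict + exactly when the weighted sum reaches the threshold n
            return sum(wi * xi for wi, xi in zip(self.w, x)) >= self.n

        def update(self, x, label):
            """Process one labeled example; return True if a mistake was made."""
            pred = self.predict(x)
            if pred == label:
                return False
            if label:                    # step 3(a): predicted - on a positive example
                for i, xi in enumerate(x):
                    if xi == 1:
                        self.w[i] *= 2.0  # double the weights of active variables
            else:                        # step 3(b): predicted + on a negative example
                for i, xi in enumerate(x):
                    if xi == 1:
                        self.w[i] /= 2.0  # halve the weights of active variables
            return True


    if __name__ == "__main__":
        random.seed(0)
        n = 100
        relevant = [3, 17, 42]           # hypothetical target: x_3 OR x_17 OR x_42, so r = 3
        learner = WinnowLearner(n)
        mistakes = 0
        for _ in range(2000):
            x = [random.randint(0, 1) for _ in range(n)]
            label = any(x[i] == 1 for i in relevant)
            mistakes += learner.update(x, label)
        # By the theorem, mistakes <= 2 + 3r(1 + lg n) no matter how the examples are chosen.
        print("total mistakes:", mistakes)

In the demo, the mistake count stays far below n even though the learner never sees which of the 100 variables are relevant, which is exactly the advantage the theorem promises when r is much smaller than n.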