The spam classification method you will be using is based on a
Bayesian statistical model known as the ``naïve Bayes'' model.
It's based on estimating the probabilities that a given UNLABELED
message is either SPAM or NORMAL. More specifically, for an UNLABELED
message, $m$, you must evaluate the quantities $\Pr(\mathcal{S} \mid m)$ and
$\Pr(\mathcal{N} \mid m)$, where $\mathcal{N}$ denotes the class of NORMAL email messages and
$\mathcal{S}$ is the class of SPAM email messages. If you find that

    \Pr(\mathcal{S} \mid m) > \Pr(\mathcal{N} \mid m)    (1)

you label the message SPAM; otherwise you label it NORMAL. The trick is finding
$\Pr(\mathcal{S} \mid m)$ and $\Pr(\mathcal{N} \mid m)$. Your program
(specifically, BSFTrain) will estimate them by looking at a
great many NORMAL and SPAM emails, called TRAINING DATA. The problem
is that, even given a bunch of example emails, it's not immediately
obvious what this conditional probability might be. Through the magic
of Bayes' rule, however, we can turn this around:

    \Pr(c \mid m) = \frac{\Pr(m \mid c) \, \Pr(c)}{\Pr(m)}    (2)
The quantities $\Pr(\mathcal{S} \mid m)$ and $\Pr(\mathcal{N} \mid m)$ are
called posterior probability estimates--posterior because
they're the probabilities you assign after you see the data
(i.e., after you get to look at $m$). The quantity $\Pr(c)$
is called the prior probability of class $c$,
or simply the prior. This is the probability you would assign to a
particular message being SPAM or NORMAL before you look at the
contents of the message. The quantity $\Pr(m)$ is the
``raw'' data likelihood. Essentially, it's the probability of a
particular message occurring, across both SPAM and NONSPAM. The
quantity $\Pr(m \mid c)$ is called the generative
model for $m$ given class $c$.
It seems like we've taken a step backward. We now have three quantities
to calculate rather than one. Fortunately, in this case, these three are
simpler than the original one. First off, if all we want to do is
classify the data, via Equation 1, then we can
discard the raw data likelihood, $\Pr(m)$ (to see this, plug
Equation 2 into 1 and note that $\Pr(m)$ appears on both sides of the
inequality, so it cancels). Second, the term $\Pr(c)$
is easy--it's just the relative probability of a message
being SPAM or NORMAL, i.e., the frequency of SPAM or NORMAL emails
you've seen:
    \Pr(\mathcal{S}) \approx \frac{\#\ \text{of SPAM messages in the TRAINING data}}{\text{total}\ \#\ \text{of messages in the TRAINING data}}

    \Pr(\mathcal{N}) \approx \frac{\#\ \text{of NORMAL messages in the TRAINING data}}{\text{total}\ \#\ \text{of messages in the TRAINING data}}
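For instance, with made-up numbers purely for illustration: if your TRAINING
data contained 1000 messages and 400 of them were SPAM, you would estimate

    \Pr(\mathcal{S}) \approx \frac{400}{1000} = 0.4, \qquad \Pr(\mathcal{N}) \approx \frac{600}{1000} = 0.6.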
That leaves the generative model. Note that if I hand you a message
and say ``that's spam'', you have a sample of
$\Pr(m \mid \mathcal{S})$. Your job is to assemble a bunch of such
samples to create a comprehensive model of the data for each class.
Essentially, $\Pr(m \mid c)$ tells you what the chance is that
you see a particular configuration of letters and words within the
universe of all messages of class $c$.
Here we make a massive approximation. First, let's break up the
message $m$ into a set of lower level elements, called
features: $m = \langle f_1, f_2, \ldots, f_k \rangle$. In
the case of email, a feature might be a single character, a word, an
HTML token, a MIME attachment, the length of a line, time of day the
mail was sent, etc. For the moment, we won't worry about what a
feature is (it's the tokenizer's job to determine that--see
Section 3.2 for details); all we'll care is that you have
some way to break it down into more fundamental pieces. Now
we'll write:
    \Pr(m \mid c) = \Pr(f_1, f_2, \ldots, f_k \mid c)    (3)

                  \approx \Pr(f_1 \mid c) \, \Pr(f_2 \mid c) \cdots \Pr(f_k \mid c)    (4)

                  = \prod_{i=1}^{k} \Pr(f_i \mid c)    (5)
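To make this concrete (with a made-up three-token message, purely for
illustration): if the tokenizer reduces $m$ to the words ``buy'', ``cheese'',
and ``now'', then Equation 5 says

    \Pr(m \mid \mathcal{S}) \approx \Pr(\text{``buy''} \mid \mathcal{S}) \, \Pr(\text{``cheese''} \mid \mathcal{S}) \, \Pr(\text{``now''} \mid \mathcal{S}),

and similarly for class $\mathcal{N}$.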
This is called the naïve Bayes approximation. It's
naïve because it is a drastic approximation (for example,
it discards any information about the order among words), but it turns
out to work surprisingly well in practice in a number of cases. You
can think about more sophisticated ways to approximate
$\Pr(m \mid c)$ if you like (I welcome your thoughts on the
matter), but for this project it's sufficient to stick with naïve
Bayes.
Ok, so now we've blown out a single term that we didn't know how to
calculate into a long product of terms. Is our life any better? Yes!
Because each of those individual terms, $\Pr(f_i \mid c)$, is simply
an observed frequency within the TRAINING data for the token
$f_i$--you can get it simply by counting:

    \Pr(f_i \mid c) \approx \frac{\#\ \text{of times token}\ f_i\ \text{appears in TRAINING messages of class}\ c}{\text{total}\ \#\ \text{of tokens in TRAINING messages of class}\ c}
For example, suppose that your tokens are individual words. When
you're analyzing a new SPAM message during TRAINING, you find that the
token $f_i$ is the word ``tyromancy''. Your probability estimate
for ``tyromancy'' is just:

    \Pr(\text{``tyromancy''} \mid \mathcal{S}) \approx \frac{\#\ \text{of times ``tyromancy'' appears in SPAM TRAINING messages}}{\text{total}\ \#\ \text{of tokens in SPAM TRAINING messages}}
So when you're TRAINING, every time you see a particular token, you increment the count of that token (and the count of all tokens) for that class. When you're doing CLASSIFICATION, you don't change the counts when you see a token. Instead, you just look up the appropriate counts and call that the probability of the token that you're looking at. So to calculate Equation 5, you simply iterate across the message, taking each token, and multiplying its class-conditional probability into your total probability estimate for the corresponding class. In pseudo-code, the procedure is shown in Figure 2.
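As a rough illustration of those two loops (this is only a sketch, not Figure 2
and not the project's actual code; the class name TokenCounts and its methods
are made up for this example), the counting and classification steps might look
like this in Java:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative sketch only -- the names here are assumptions, not the
    // project's API (e.g., BSFTrain is organized differently).
    class TokenCounts {
        // Per-class token counts and per-class total token counts.
        private final Map<String, Integer> spamCounts = new HashMap<>();
        private final Map<String, Integer> normalCounts = new HashMap<>();
        private int spamTotal = 0;
        private int normalTotal = 0;

        // TRAINING: every time a token appears in a labeled message, increment
        // its count (and the total token count) for that message's class.
        void train(List<String> tokens, boolean isSpam) {
            Map<String, Integer> counts = isSpam ? spamCounts : normalCounts;
            for (String tok : tokens) {
                counts.merge(tok, 1, Integer::sum);
                if (isSpam) { spamTotal++; } else { normalTotal++; }
            }
        }

        // CLASSIFICATION: multiply each token's class-conditional probability
        // into a running estimate for each class (Equation 5), then compare
        // the two estimates (Equation 1).
        boolean classifyAsSpam(List<String> tokens, double priorSpam, double priorNormal) {
            double spamScore = priorSpam;
            double normalScore = priorNormal;
            for (String tok : tokens) {
                spamScore *= prob(tok, spamCounts, spamTotal);
                normalScore *= prob(tok, normalCounts, normalTotal);
            }
            return spamScore > normalScore;
        }

        // Raw frequency estimate of Pr(token | class); see the discussion below
        // for why this needs adjustment (underflow and unseen tokens).
        private double prob(String tok, Map<String, Integer> counts, int total) {
            return counts.getOrDefault(tok, 0) / (double) total;
        }
    }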
And you're done. There are, of course, an immense number of technical
issues in turning this into a real program, but that's the gist of
it. One practical issue, however, is underflow--if the product in
Equation 5 has very many terms, your probability estimates
($\Pr(m \mid \mathcal{S})$ and $\Pr(m \mid \mathcal{N})$) will quickly become 0 and it will be impossible
to tell the difference between the two classes. To overcome this,
instead of working directly with the probabilities of tokens,
we'll work with the log likelihood of the tokens. I.e., we'll
replace Equation 5 with its logarithm:

    \log \Pr(m \mid c) = \sum_{i=1}^{k} \log \Pr(f_i \mid c).

(Question 1: how does this
change the algorithm in Figure 2? Question 2: does
this leave the final classification unchanged? Why or why not?)
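As an illustration of what working in log space can look like (again a sketch,
reusing the made-up TokenCounts class from the earlier example, and only one
way to organize it), the running product becomes a running sum of logs:

    // Log-space version of the CLASSIFICATION loop above: a running sum of
    // log probabilities replaces the running product, avoiding underflow.
    boolean classifyAsSpamLog(List<String> tokens, double priorSpam, double priorNormal) {
        double spamLogScore = Math.log(priorSpam);
        double normalLogScore = Math.log(priorNormal);
        for (String tok : tokens) {
            // Note: Math.log(0.0) is -Infinity, which is one reason the
            // zero-probability problem discussed next matters.
            spamLogScore += Math.log(prob(tok, spamCounts, spamTotal));
            normalLogScore += Math.log(prob(tok, normalCounts, normalTotal));
        }
        return spamLogScore > normalLogScore;
    }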
A second critical issue is what to do if you see a token in an
UNLABELED message that you've never seen before. If all you're doing
is using the raw counting estimate above,
then you have that $\Pr(f_i \mid c) = 0$
if you've never seen token $f_i$
before in your TRAINING data. This is bad. (Question: why is
this bad? Hint: consider what happens to Equation 5 if one
or more terms are 0.) So instead, we'll use a smoothed approximation to
$\Pr(f_i \mid c)$ that avoids this danger; one standard choice is Laplace (``add-one'') smoothing:

    \Pr(f_i \mid c) \approx \frac{(\#\ \text{of times}\ f_i\ \text{appears in TRAINING messages of class}\ c) + 1}{(\text{total}\ \#\ \text{of tokens in TRAINING messages of class}\ c) + V},

where $V$ is the number of distinct tokens seen anywhere in the TRAINING data.
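Under that add-one assumption (again an illustrative sketch, and only one of
several reasonable smoothing schemes), the probability lookup in the earlier
TokenCounts sketch might become:

    // Laplace (add-one) smoothed estimate of Pr(token | class). vocabularySize
    // is the number of distinct tokens seen anywhere in the TRAINING data; the
    // +1 in the numerator gives unseen tokens a small nonzero probability.
    private double smoothedProb(String tok, Map<String, Integer> counts,
                                int total, int vocabularySize) {
        return (counts.getOrDefault(tok, 0) + 1.0) / (total + vocabularySize);
    }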
Now you have enough mathematical background and tricks to implement the SPAM filter. The rest is Java...
Terran Lane 2004-01-26