Appendix A: Bayesian Spam Analysis

The spam classification method you will be using is based on a Bayesian statistical model known as the ``naïve Bayes'' model. The idea is to estimate the probabilities that a given UNLABELED message is either SPAM or NORMAL. More specifically, for an UNLABELED message, $ \mathbf{X}$, you must evaluate the quantities $ \Pr[C_{N}\vert\mathbf{X}]$ and $ \Pr[C_{S}\vert\mathbf{X}]$, where $ C_{N}$ denotes the class of NORMAL email messages and $ C_{S}$ is the class of SPAM email messages. If you find that

$\displaystyle \Pr[C_{N}\vert\mathbf{X}] > \Pr[C_{S}\vert\mathbf{X}]$    (1)

then you can label the message $ \mathbf{X}$ NORMAL; otherwise you label it SPAM.
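For example (with made-up numbers), if a particular message came out at $ \Pr[C_{N}\vert\mathbf{X}]=0.8$ and $ \Pr[C_{S}\vert\mathbf{X}]=0.2$, you would label it NORMAL.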

The trick is finding $ \Pr[C_{i}\vert\mathbf{X}]$. Your program (specifically, BSFTrain) will estimate them by looking at a great many NORMAL and SPAM emails, called TRAINING DATA. The problem is that, even given a bunch of example emails, it's not immediately obvious what this conditional probability might be. Through the magic of Bayes' rule, however, we can turn this around:

$\displaystyle \Pr[C_{i}\vert\mathbf{X}] = \frac{\Pr[\mathbf{X}\vert C_{i}]\Pr[C_{i}]}{\Pr[\mathbf{X}]}$    (2)

The quantities $ \Pr[C_{N}\vert\mathbf{X}]$ and $ \Pr[C_{S}\vert\mathbf{X}]$ are called posterior probability estimates--posterior because they're the probabilities you assign after you see the data (i.e., after you get to look at $ \mathbf{X}$). The quantity $ \Pr[C_{i}]$ is called the prior probability of class $ i$, or simply the prior. This is the probability you would assign to a particular message being SPAM or NORMAL before you look at the contents of the message. The quantity $ \Pr[\mathbf{X}]$ is the ``raw'' data likelihood. Essentially, it's the probability of a particular message occurring, across both SPAM and NORMAL. The quantity $ \Pr[\mathbf{X}\vert C_{i}]$ is called the generative model for $ \mathbf{X}$ given class $ i$.

It seems like we've taken a step backward. We now have three quantities to calculate rather than one. Fortunately, in this case, these three are simpler than the original one. First off, if all we want to do is classify the data, via Equation 1, then we can discard the raw data likelihood, $ \Pr[\mathbf{X}]$ (to see this, plug Equation 2 into 1). Second, the term $ \Pr[C_{i}]$ is easy--it's just the relative probability of a message being SPAM or NORMAL, i.e., the frequency of SPAM or NORMAL emails you've seen:

$\displaystyle \Pr[C_{N}] = \frac{\text{\# NORMAL emails}}{\text{total \# emails}} = \frac{\text{\# NORMAL emails}}{\text{\# SPAM} + \text{\# NORMAL}}$

$\displaystyle \Pr[C_{S}] = \frac{\text{\# SPAM emails}}{\text{total \# emails}} = \frac{\text{\# SPAM emails}}{\text{\# SPAM} + \text{\# NORMAL}}$
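
(Incidentally, to see why we can drop $ \Pr[\mathbf{X}]$: substituting Equation 2 into Equation 1 and cancelling the common denominator $ \Pr[\mathbf{X}]$ leaves the equivalent test

$\displaystyle \Pr[\mathbf{X}\vert C_{N}]\Pr[C_{N}] > \Pr[\mathbf{X}\vert C_{S}]\Pr[C_{S}]$

so only the priors and the generative models are needed.)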

That leaves the generative model. Note that if I hand you a message and say ``that's spam'', you have a sample of $ \Pr[\mathbf{X}\vert C_{S}]$. Your job is to assemble a bunch of such samples to create a comprehensive model of the data for each class. Essentially, $ \Pr[\mathbf{X}\vert C_{i}]$ tells you what the chance is that you see a particular configuration of letters and words within the universe of all messages of class $ C_{i}$.

Here we make a massive approximation. First, let's break up the message $ \mathbf{X}$ into a set of lower level elements, called features: $ \mathbf{X}=\langle x_{1},x_{2},\dots, x_{k}\rangle$. In the case of email, a feature might be a single character, a word, an HTML token, a MIME attachment, the length of a line, the time of day the mail was sent, etc. For the moment, we won't worry about what a feature is (it's the tokenizer's job to determine that--see Section 3.2 for details); all we'll care about is that you have some way to break the message down into more fundamental pieces. Now we'll write:

$\displaystyle \Pr[\mathbf{X}\vert C_{i}] = \Pr[x_{1},x_{2},\dots ,x_{k}\vert C_{i}]$ (3)
  $\displaystyle \approx \Pr[x_{1}\vert C_{i}]\Pr[x_{2}\vert C_{i}] \cdots \Pr[x_{k}\vert C_{i}]$ (4)
  $\displaystyle = \prod_{j=1}^{k}\Pr[x_{j}\vert C_{i}]$ (5)

This is called the naïve Bayes approximation. It's naïve because it is a drastic approximation (for example, it discards any information about the order among words), but it turns out to work surprisingly well in practice in a number of cases. You can think about more sophisticated ways to approximate $ \Pr[\mathbf{X}\vert C_{i}]$ if you like (I welcome your thoughts on the matter), but for this project it's sufficient to stick with naïve Bayes.
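For example (with made-up tokens), if a message consisted of just the three words ``buy'', ``cheap'', and ``pills'', the approximation says that its probability under the SPAM class is $ \Pr[\mathbf{X}\vert C_{S}]\approx\Pr[\text{``buy''}\vert C_{S}]\,\Pr[\text{``cheap''}\vert C_{S}]\,\Pr[\text{``pills''}\vert C_{S}]$, no matter what order the three words appeared in.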

Ok, so now we've blown out a single term that we didn't know how to calculate into a long product of terms. Is our life any better? Yes! Because each of those individual terms, $ \Pr[x_{j}\vert C_{i}]$, is simply an observed frequency within the TRAINING data for the token $ x_{j}$--you can get it simply by counting:

$\displaystyle \Pr[x_{j}\vert C_{i}] = \frac{\text{\# of tokens of type } x_{j} \text{ seen in class } C_{i}}{\text{total \# of tokens seen in class } C_{i}}$

For example, suppose that your tokens are individual words. When you're analyzing a new SPAM message during TRAINING, you find that the $ j$th token is the word ``tyromancy''. Your probability estimate for ``tyromancy'' is just:

$\displaystyle \Pr[\text{``tyromancy''}\vert C_{S}] = \frac{\text{\# ``tyromancy'' instances in all SPAM}}{\text{total \# of tokens in all SPAM}}$

So when you're TRAINING, every time you see a particular token, you increment the count of that token (and the count of all tokens) for that class. When you're doing CLASSIFICATION, you don't change the counts when you see a token. Instead, you just look up the appropriate counts and call that the probability of the token that you're looking at. So to calculate Equation 5, you simply iterate across the message, taking each token, and multiplying its class-conditional probability into your total probability estimate for the corresponding class. In pseudo-code,

Figure 1: TRAINING pseudo-code
[Given: an email message $\mathbf{X}$ and its class $C_{i}$. For each token in $\mathbf{X}$, increment that token's count for class $C_{i}$ and increment the total number of tokens seen in class $C_{i}$; finally, increment the total number of email messages for class $C_{i}$.]
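
For concreteness, here is a minimal Java sketch of that TRAINING bookkeeping. The names (ClassCounts, trainOn, and so on) are purely illustrative--they are not the actual BSFTrain classes or methods--and the sketch assumes the message has already been tokenized into a list of strings:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    /** Per-class counts gathered during TRAINING (illustrative only; not the real BSF code). */
    class ClassCounts {
        Map<String, Long> tokenCounts = new HashMap<>(); // # of times each token type was seen in this class
        long totalTokens = 0;                            // total # of tokens seen in this class
        long messageCount = 0;                           // total # of email messages seen for this class

        /** Update the counts for one labeled TRAINING message (already tokenized). */
        void trainOn(List<String> tokens) {
            messageCount++;                               // one more message for this class
            for (String token : tokens) {
                tokenCounts.merge(token, 1L, Long::sum);  // increment this token's count
                totalTokens++;                            // increment the count of all tokens in this class
            }
        }
    }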

Figure 2: CLASSIFICATION pseudo-code
[Given: an UNLABELED email message $\mathbf{X}$. Initialize $p_{N}$ and $p_{S}$ with the class priors; for each token in $\mathbf{X}$, multiply the corresponding class-conditional token probabilities into $p_{N}$ and $p_{S}$. If $p_{N}>p_{S}$ then return NORMAL, else return SPAM.]
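
And a matching CLASSIFICATION sketch, reusing the (again, illustrative) ClassCounts above. Note that this literal product form still has the underflow and zero-count problems discussed below:

    import java.util.List;

    /** Illustrative classifier built from the TRAINING counts above (not the real BSF code). */
    class NaiveBayesSketch {
        private final ClassCounts normal; // counts from NORMAL TRAINING data
        private final ClassCounts spam;   // counts from SPAM TRAINING data

        NaiveBayesSketch(ClassCounts normal, ClassCounts spam) {
            this.normal = normal;
            this.spam = spam;
        }

        /** Pr[x_j | C_i], estimated by counting (no smoothing yet). */
        private static double tokenProb(ClassCounts c, String token) {
            long count = c.tokenCounts.getOrDefault(token, 0L);
            return (double) count / c.totalTokens;
        }

        /** Returns "NORMAL" or "SPAM" for an UNLABELED, already-tokenized message. */
        String classify(List<String> tokens) {
            double totalMessages = normal.messageCount + spam.messageCount;
            double pN = normal.messageCount / totalMessages; // prior Pr[C_N]
            double pS = spam.messageCount / totalMessages;   // prior Pr[C_S]
            for (String token : tokens) {
                pN *= tokenProb(normal, token); // multiply in Pr[x_j | C_N]
                pS *= tokenProb(spam, token);   // multiply in Pr[x_j | C_S]
            }
            return (pN > pS) ? "NORMAL" : "SPAM";
        }
    }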

And you're done. There are, of course, an immense number of technical issues in turning this into a real program, but that's the gist of it. One practical issue, however, is underflow--if the product in Equation 5 has very many terms, your probability estimates ($ p_{N}$ and $ p_{S}$) will quickly become 0 and it will be impossible to tell the difference between the two classes. To overcome this, instead of working directly with the probabilities of tokens, we'll work with the log likelihood of the tokens. I.e., we'll replace Equation 5 with $ \log($Equation 5$ )$. (Question 1: how does this change the algorithm in Figure 2? Question 2: does this leave the final classification unchanged? Why or why not?)
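(The identity being used is just that the log of a product is the sum of the logs:

$\displaystyle \log\prod_{j=1}^{k}\Pr[x_{j}\vert C_{i}] = \sum_{j=1}^{k}\log\Pr[x_{j}\vert C_{i}]$

and each $ \log\Pr[x_{j}\vert C_{i}]$ is a moderately sized negative number rather than a vanishingly small positive one.)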

A second critical issue is what to do if you see a token in an UNLABELED message that you've never seen before. If all you're doing is using $ \Pr[x_{j}\vert C_{i}]=\frac{\text{\# } x_{j}}{\text{total tokens}}$, then $ \Pr[x_{j}\vert C_{i}]=0$ whenever you've never seen token $ x_{j}$ in your TRAINING data. This is bad. (Question: why is this bad? Hint: consider what happens to Equation 5 if one or more terms are 0.) So instead, we'll use an approximation to $ \Pr[x_{j}\vert C_{i}]$ that avoids this danger:

$\displaystyle \Pr[x_{j}\vert C_{i}] \approx \frac{(\text{\# } x_{j} \text{ tokens in class } C_{i})+1}{(\text{total \# of tokens in class } C_{i})+1}$

This is called a Laplace correction or, equivalently, a Laplace smoothing. (It also happens to be a special case of a Dirichlet prior, but we won't go into that here.)
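
In the classification sketch above, the only change needed is to the token-probability estimate; a drop-in replacement for the illustrative tokenProb method might look like this:

    /** Laplace-corrected Pr[x_j | C_i], matching the formula above: (count + 1) / (total + 1). */
    private static double tokenProb(ClassCounts c, String token) {
        long count = c.tokenCounts.getOrDefault(token, 0L); // 0 if never seen in TRAINING
        return (count + 1.0) / (c.totalTokens + 1.0);       // never exactly 0, even for unseen tokens
    }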

Now you have enough mathematical background and tricks to implement the SPAM filter. The rest is Java...

Terran Lane 2004-01-26