Tokenizers

The job of a tokenizer is to split the ANALYZABLE sections of an email message into small chunks, called tokens, which are the basic units of analysis. Many different types of tokens are possible: individual words, words plus punctuation characters, short strings of characters, HTML entities, MIME attachments, etc. By changing the definition of a token (i.e., by changing the tokenizer module that turns an email message into tokens), a program can obtain different streams of tokens, and thus different statistics, from the same email message.

The SpamBGon software suite MUST provide at least three tokenizers:

NGramTokenizer
This tokenizer splits the ANALYZABLE section of an email message into tokens of $n$ contiguous characters, where $n$ is a parameter to the tokenizer. This tokenizer MUST be able to omit WHITESPACE and PUNCTUATION characters. It MAY also provide user-selectable functionality to include WHITESPACE and/or PUNCTUATION characters.
WhiteSpaceTokenizer
This tokenizer splits the ANALYZABLE section of an email message at WHITESPACE characters. Essentially, the goal of this tokenizer is to split out individual "words". This tokenizer MUST discard WHITESPACE and PUNCTUATION characters, though it MAY also provide a user-selectable option to preserve PUNCTUATION characters.
One other tokenizer
Choice of this tokenizer is a design decision, but it MUST be documented, described, motivated (i.e., a rationale given for why it might be a useful tokenizer), and empirically tested. Possibilities for this tokenizer include (but are not limited to) a recognizer for dates and times (from the Date header), a tokenizer that recognizes HTML tags and their contents, a tokenizer that recognizes MIME messages as single entities, a tokenizer based on n-grams of words (rather than characters), etc.
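
The first two tokenizers could be sketched roughly as follows. This is only an illustration under several assumptions, not the required design: the function names and keyword arguments are hypothetical, WHITESPACE and PUNCTUATION are approximated by Python's string.whitespace and string.punctuation, the n-grams are taken as an overlapping sliding window, and characters are filtered before windowing (so an n-gram may span a word boundary). The specification does not settle any of these points.

```python
import string

# Assumption: the spec's WHITESPACE and PUNCTUATION classes are approximated
# by Python's built-in ASCII definitions; the real definitions may differ.
WHITESPACE = set(string.whitespace)
PUNCTUATION = set(string.punctuation)


def ngram_tokenize(text, n, keep_whitespace=False, keep_punctuation=False):
    """Return every run of n contiguous characters (sliding window).

    Characters are filtered first, so by default whitespace and
    punctuation never appear in a token.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    chars = [
        c for c in text
        if (keep_whitespace or c not in WHITESPACE)
        and (keep_punctuation or c not in PUNCTUATION)
    ]
    # Text shorter than n characters yields no tokens.
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def whitespace_tokenize(text, keep_punctuation=False):
    """Split at whitespace; by default strip punctuation from each word."""
    tokens = []
    for word in text.split():
        if not keep_punctuation:
            word = "".join(c for c in word if c not in PUNCTUATION)
        if word:  # a word made only of punctuation is dropped entirely
            tokens.append(word)
    return tokens
```

For example, ngram_tokenize("Hi, you", 3) yields "Hiy", "iyo", "you", while whitespace_tokenize("Buy now, pay later!") yields "Buy", "now", "pay", "later". A non-overlapping window or per-word n-grams would be equally defensible readings of the requirement.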

These tokenizers MUST be interchangeable; the USER MUST be able to select among the tokenizers at the command line during TRAINING and CLASSIFICATION. When a tokenizer requires additional parameters (e.g., the parameter $n$ for the NGramTokenizer), the program MUST provide a command-line interface to set such parameters.
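
One possible shape for such a command-line interface is sketched below using Python's standard argparse module. The option names (--tokenizer, --ngram-size, --keep-punctuation), the choice labels, and the defaults are all hypothetical; the specification requires only that the tokenizer and its parameters be selectable, not any particular flag syntax.

```python
import argparse


def build_parser():
    """Build a parser for tokenizer selection (all flag names are hypothetical)."""
    parser = argparse.ArgumentParser(
        description="Tokenizer options for training and classification")
    parser.add_argument(
        "--tokenizer",
        choices=["ngram", "whitespace", "custom"],
        default="whitespace",
        help="which tokenizer to apply to the ANALYZABLE sections")
    parser.add_argument(
        "--ngram-size", type=int, default=3,
        help="the parameter n; only meaningful with --tokenizer ngram")
    parser.add_argument(
        "--keep-punctuation", action="store_true",
        help="preserve PUNCTUATION characters instead of discarding them")
    return parser


def parse_tokenizer_options(argv):
    """Parse argv (a list of strings) and validate tokenizer parameters."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.tokenizer == "ngram" and args.ngram_size < 1:
        parser.error("--ngram-size must be at least 1")
    return args
```

Because the same parser can be shared by the training and classification entry points, a user would pass identical tokenizer options to both, which helps keep the token streams comparable between the two phases.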

Terran Lane 2004-01-26