Tokenizers
The job of a tokenizer is to split the ANALYZABLE sections of an email
message into small chunks, called tokens, which are the basic units of
analysis. Many different types of tokens are possible--individual
words, words plus punctuation characters, short strings of characters,
HTML entities, MIME attachments, etc. A program can obtain different
streams of tokens and, thus, different statistics from the same email
message by changing the definition of a token (i.e., by changing the
tokenizer module that turns an email message into tokens).
The SpamBGon software suite MUST provide at least three
tokenizers:
- NGramTokenizer
- This tokenizer splits the ANALYZABLE section of an email message
into tokens of n contiguous characters, where n is a parameter to
the tokenizer. This tokenizer MUST be able to omit WHITESPACE and
PUNCTUATION characters. It MAY also provide user-selectable
functionality to include WHITESPACE and/or PUNCTUATION characters.
- WhiteSpaceTokenizer
- This tokenizer splits the
ANALYZABLE section of an email message at WHITESPACE characters.
Essentially, the goal of this tokenizer is to split out individual
"words". This tokenizer MUST discard WHITESPACE and PUNCTUATION
characters, though it MAY also provide a user-selectable option to
preserve PUNCTUATION characters.
- One other tokenizer
- The choice of this tokenizer is a design decision, but it MUST be
documented,
described, motivated (i.e., a rationale given for why it might be a
useful tokenizer), and empirically tested. Possibilities for this
tokenizer include (but are not limited to) a recognizer for dates
and times (from the Date header), a tokenizer that
recognizes HTML tags and their contents, a tokenizer that recognizes
MIME messages as single entities, a tokenizer based on n-grams of
words (rather than characters), etc.
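As an illustration only, the two required tokenizers might be sketched
in Python as follows. Everything here is an assumption rather than part
of the specification: the class interface (a tokenize method returning
a list), the option names keep_whitespace and keep_punctuation, the use
of Python's string.whitespace and string.punctuation as stand-ins for
the WHITESPACE and PUNCTUATION character classes, and the choice to
produce overlapping n-grams after the omitted characters have been
filtered out (so an n-gram may span a word boundary).

```python
import string

# Assumption: stand-ins for the spec's WHITESPACE and PUNCTUATION classes.
WHITESPACE = set(string.whitespace)
PUNCTUATION = set(string.punctuation)


class NGramTokenizer:
    """Sketch: overlapping tokens of n contiguous characters."""

    def __init__(self, n, keep_whitespace=False, keep_punctuation=False):
        if n < 1:
            raise ValueError("n must be at least 1")
        self.n = n
        self.keep_whitespace = keep_whitespace
        self.keep_punctuation = keep_punctuation

    def tokenize(self, text):
        # Drop omitted characters first, then slide a window of width n.
        chars = [
            c for c in text
            if (self.keep_whitespace or c not in WHITESPACE)
            and (self.keep_punctuation or c not in PUNCTUATION)
        ]
        return [
            "".join(chars[i:i + self.n])
            for i in range(len(chars) - self.n + 1)
        ]


class WhiteSpaceTokenizer:
    """Sketch: split at whitespace, optionally preserving punctuation."""

    def __init__(self, keep_punctuation=False):
        self.keep_punctuation = keep_punctuation

    def tokenize(self, text):
        tokens = []
        # str.split() with no argument splits on runs of whitespace.
        for word in text.split():
            if not self.keep_punctuation:
                word = "".join(c for c in word if c not in PUNCTUATION)
            if word:  # skip words that were punctuation only
                tokens.append(word)
        return tokens
```

Under these assumptions, NGramTokenizer(3).tokenize("Hi, you!") yields
the tokens Hiy, iyo, and you, while
WhiteSpaceTokenizer().tokenize("Buy now!!  Cheap, cheap.") yields Buy,
now, Cheap, and cheap. Because both classes expose the same tokenize
method, either can be passed to the same analysis code.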
These tokenizers MUST be interchangeable; the USER MUST be able to
select among the tokenizers at the command line during TRAINING and
CLASSIFICATION. When a tokenizer requires additional parameters
(e.g., the parameter n for the NGramTokenizer), the program
MUST provide a command-line interface to set such parameters.
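One possible shape for such a command-line interface is sketched below
using Python's standard argparse module. The mode names (train,
classify), the option names (--tokenizer, --ngram-size,
--keep-punctuation), and the defaults are all hypothetical; the
specification does not prescribe them. A third choice would be added to
the --tokenizer list once the design-decision tokenizer is chosen.

```python
import argparse


def build_parser():
    # All mode and option names below are illustrative assumptions.
    parser = argparse.ArgumentParser(
        description="Select a tokenizer for TRAINING or CLASSIFICATION."
    )
    parser.add_argument("mode", choices=["train", "classify"])
    parser.add_argument(
        "--tokenizer",
        choices=["ngram", "whitespace"],
        default="whitespace",
        help="which tokenizer to apply to the ANALYZABLE sections",
    )
    parser.add_argument(
        "--ngram-size",
        type=int,
        default=3,
        help="the parameter n for the n-gram tokenizer",
    )
    parser.add_argument(
        "--keep-punctuation",
        action="store_true",
        help="preserve PUNCTUATION characters in tokens",
    )
    return parser


def parse_options(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.ngram_size < 1:
        # parser.error prints a usage message and exits with status 2.
        parser.error("--ngram-size must be at least 1")
    return args
```

For example, parse_options(["train", "--tokenizer", "ngram",
"--ngram-size", "4"]) selects the n-gram tokenizer with n = 4 for a
training run; the same options would be accepted in classify mode, so
the tokenizer choice stays consistent across both phases.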
Terran Lane
2004-01-26