Definitions

ANALYZABLE
The sections of an email message subject to tokenization and statistical analysis. The ANALYZABLE section consists of the BODY, plus the FIELD-BODY of the HEADERS ``From'', ``To'', and ``Subject''.
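A minimal sketch of how the ANALYZABLE section might be pulled out of one message. The class and method names are hypothetical, and the sketch assumes the input is a single RFC822 message whose HEADERS end at the first blank line, that FIELD-NAMEs match case-insensitively, and that folded header lines begin with WHITESPACE.

```java
import java.util.Locale;
import java.util.Set;

class AnalyzableExtractor {
    // FIELD-NAMEs whose FIELD-BODY belongs to the ANALYZABLE section.
    private static final Set<String> FIELDS = Set.of("from", "to", "subject");

    /** Returns the selected FIELD-BODYs followed by the BODY, one item per line. */
    static String analyzable(String message) {
        StringBuilder out = new StringBuilder();
        String[] lines = message.split("\r?\n");
        int i = 0;
        boolean keep = false; // true while inside a selected HEADER
        // HEADERS run until the first empty line.
        for (; i < lines.length && !lines[i].isEmpty(); i++) {
            String line = lines[i];
            if (Character.isWhitespace(line.charAt(0))) {
                // Continuation of the previous (folded) HEADER.
                if (keep) out.append(line.trim()).append('\n');
                continue;
            }
            int colon = line.indexOf(':');
            keep = colon > 0
                && FIELDS.contains(line.substring(0, colon).trim().toLowerCase(Locale.ROOT));
            if (keep) out.append(line.substring(colon + 1).trim()).append('\n');
        }
        // Skip the blank separator line; everything after it is the BODY.
        for (i++; i < lines.length; i++) out.append(lines[i]).append('\n');
        return out.toString();
    }

    public static void main(String[] args) {
        String msg = "From: alice@example.com\nDate: Mon\nSubject: Hello\n\nBody text\n";
        System.out.print(analyzable(msg));
    }
}
```

Other HEADERS, such as the date, are dropped before tokenization.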
BODY
The main body of an email message; the part of an email message outside the HEADERS. Refer to RFC822 for details.
CLASSIFICATION
The process of using statistics accumulated during TRAINING to label an UNLABELED message as SPAM or NORMAL.
EVIL
See SPAM.
FEATURE
Some measurable characteristic of an email message, such as the presence or absence of a specific word, the length of a line, etc.
FIELD-BODY
The contents of a HEADER, following the FIELD-NAME. Refer to RFC822 for details.
FIELD-NAME
The sequence of characters that identifies a field (i.e., specific HEADER) within the HEADERS section of an email message.
HEADER
An email header or field, specifying information such as routing, date and time, or subject. Refer to RFC822 for details.
MAILBOX
A folder format for storing multiple email messages that is widely used under Unix (e.g., by mail, mailx, pine, and mutt). A MAILBOX consists of zero or more email messages concatenated together, separated by (single) blank lines. Each new email message is recognized by the presence of the token ``From '' at the beginning of a line. No other structure or control information is imposed on the file.
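The splitting rule above can be sketched as follows. The class and method names are hypothetical, and the sketch assumes the whole MAILBOX fits in one string and that no line inside a message BODY starts with ``From ''.

```java
import java.util.ArrayList;
import java.util.List;

class MailboxSplitter {
    /** Splits MAILBOX text into messages; a line starting with "From " opens a new one. */
    static List<String> split(String mailbox) {
        List<String> messages = new ArrayList<>();
        StringBuilder current = null; // null until the first "From " line is seen
        for (String line : mailbox.split("\r?\n")) {
            if (line.startsWith("From ")) {
                if (current != null) messages.add(current.toString());
                current = new StringBuilder();
            }
            // Lines before the first "From " line belong to no message and are skipped.
            if (current != null) current.append(line).append('\n');
        }
        if (current != null) messages.add(current.toString());
        return messages;
    }

    public static void main(String[] args) {
        String box = "From a@x Mon\nSubject: one\n\nfirst\n\nFrom b@y Tue\nSubject: two\n\nsecond\n";
        System.out.println(split(box).size() + " messages");
    }
}
```

In this sketch the blank separator line stays attached to the end of the preceding message, and an empty MAILBOX yields an empty list.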
MAY
A requirement that the product can choose to implement if desired. Can also indicate a choice among acceptable alternatives (e.g., ``The program MAY do x, y, or z.'' indicates that the choice of behavior x, y, or z is up to the designer.)
MUST
A requirement that the product must implement for full credit.
MUST NOT
A behavior or assumption that must not be violated. Violating a MUST NOT restriction will result in a penalty on the assignment.
NORMAL
Email that the USER wishes to receive.
PRIOR
Or prior frequency estimate. The expected frequency of SPAM or NORMAL emails after TRAINING but before looking at a specific email message. Represents the proportion of SPAM and NORMAL email messages in the training data. Equivalent to the terms $\Pr[C_{S}]$ and $\Pr[C_{N}]$.
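As a small illustration, the PRIOR can be estimated directly from message counts gathered during TRAINING. The class name, method name, and counts below are hypothetical; the sketch assumes every training message is labeled either SPAM or NORMAL.

```java
class PriorEstimate {
    /** Fraction of training messages labeled SPAM: an estimate of Pr[C_S]. */
    static double spamPrior(int spamCount, int normalCount) {
        int total = spamCount + normalCount;
        if (total == 0) {
            throw new IllegalArgumentException("no training messages");
        }
        return (double) spamCount / total;
    }

    public static void main(String[] args) {
        double pSpam = spamPrior(30, 70);  // illustrative counts only
        double pNormal = 1.0 - pSpam;      // the two priors sum to 1
        System.out.println("Pr[C_S] = " + pSpam + ", Pr[C_N] = " + pNormal);
    }
}
```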
POSTERIOR
Or posterior frequency estimate. The conditional probability estimate of a specific message being SPAM or NORMAL after analyzing the contents of the message. Equivalent to the terms $\Pr[C_{S} \mid \mathbf{X}]$ and $\Pr[C_{N} \mid \mathbf{X}]$.
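As a hedged illustration of how the POSTERIOR relates to the PRIOR: under Bayes' rule, and assuming class-conditional likelihoods $\Pr[\mathbf{X} \mid C_{S}]$ and $\Pr[\mathbf{X} \mid C_{N}]$ (terms this glossary does not define), the SPAM posterior can be written as

$$
\Pr[C_{S} \mid \mathbf{X}] =
  \frac{\Pr[\mathbf{X} \mid C_{S}] \, \Pr[C_{S}]}
       {\Pr[\mathbf{X} \mid C_{S}] \, \Pr[C_{S}] + \Pr[\mathbf{X} \mid C_{N}] \, \Pr[C_{N}]}
$$

The NORMAL posterior follows by swapping the subscripts in the numerator, so the two posteriors sum to 1.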
PUNCTUATION
Punctuation characters. For the purposes of this project, the punctuation characters are considered to be any characters other than WHITESPACE, letters, or digits.
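This definition can be expressed as a one-line test. The sketch below uses a hypothetical class name and assumes that "letters" and "digits" mean whatever Character.isLetterOrDigit() accepts, which includes non-ASCII letters; an ASCII-only reading would need a different check.

```java
class CharClasses {
    /** PUNCTUATION: any character that is not WHITESPACE, a letter, or a digit. */
    static boolean isPunctuation(char c) {
        return !Character.isWhitespace(c) && !Character.isLetterOrDigit(c);
    }

    public static void main(String[] args) {
        String sample = "Win $100 now!";
        StringBuilder found = new StringBuilder();
        for (int i = 0; i < sample.length(); i++) {
            char c = sample.charAt(i);
            if (isPunctuation(c)) found.append(c);
        }
        System.out.println("PUNCTUATION found: " + found);
    }
}
```

Under this reading, symbols such as ``$'' and ``@'' count as PUNCTUATION, since they are neither WHITESPACE nor letters nor digits.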
RECOVERABLE ERROR
An error condition that the software can ignore, correct, or otherwise recover from. The program MUST produce a warning message and then cleanly continue with no corruption or loss of valid data.
RFC822
Document that specifies the syntax of standard internet email messages. Refer to this document for all specifications related to the format of email messages. Available at http://www.ietf.org/rfc/rfc0822.txt.
SPAM
Email that the USER does not wish to receive.
TRAINING
Mode or stage in which the software compiles statistics from data that is known to be SPAM or NORMAL.
UNLABELED
An email message whose label (SPAM or NORMAL) is unknown.
UNRECOVERABLE ERROR
An error condition from which recovery is impossible. The program MUST produce an error message describing the condition and then cleanly halt.
USER
A single computer user or email recipient (potentially a mailing list).
WHITESPACE
Non-printable characters including (but not limited to) space, horizontal and vertical tabs, newlines, and carriage returns. Cf. the Java JDK API call Character.isWhitespace().
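A brief sketch of the WHITESPACE test, using Character.isWhitespace() as the definition suggests. The class and method names are hypothetical.

```java
class WhitespaceCheck {
    /** Counts the WHITESPACE characters in s, using Character.isWhitespace() as the test. */
    static int countWhitespace(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isWhitespace(s.charAt(i))) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Space, horizontal tab, vertical tab, newline, carriage return, then a letter.
        String sample = " \t\u000B\n\rx";
        System.out.println(countWhitespace(sample) + " of " + sample.length()
            + " characters are WHITESPACE");
    }
}
```

Note that Character.isWhitespace() returns false for the non-breaking space (U+00A0), so under this test that character falls outside WHITESPACE.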

Terran Lane 2004-01-26