Definitions
- ANALYZABLE
- The sections of an email message subject to
tokenization and statistical analysis. The ANALYZABLE section
consists of the BODY, plus the FIELD-BODY of the HEADERS ``From'',
``To'', and ``Subject''.
- BODY
- The main body of an email message; the part of an email
message outside the HEADERS. Refer to RFC822 for details.
- CLASSIFICATION
- The process of using statistics accumulated
during TRAINING to label an UNLABELED message as SPAM or NORMAL.
- EVIL
- See SPAM.
- FEATURE
- Some measurable characteristic of an email message,
such as the presence or absence of a specific word, the length of a
line, etc.
- FIELD-BODY
- The contents of a HEADER, following the
FIELD-NAME. Refer to RFC822 for details.
- FIELD-NAME
- The sequence of characters that identifies a field
(i.e., a specific HEADER) within the HEADERS section of an email
message.
- HEADER
- An email header or field, specifying information such as
routing, date and time, or subject. Refer to RFC822 for details.
- MAILBOX
- A folder format for storing multiple email messages
that is widely used under Unix (e.g., by mail,
mailx, pine, mutt, etc.). A MAILBOX consists
of zero or more email messages concatenated together, separated by
(single) blank lines. Each new email message is recognized by the
presence of the token ``From '' at the beginning of a line. No other
structure or control information is imposed on the file.
- MAY
- A requirement that the product can choose to implement if
desired. Can also indicate a choice among acceptable alternatives
(e.g., ``The program MAY do x, y, or z.'' indicates that the choice of
behavior x, y, or z is up to the designer.)
- MUST
- A requirement that the product must implement for full
credit.
- MUST NOT
- A behavior or assumption that must not be violated.
Violating a MUST NOT restriction will result in a penalty on the
assignment.
- NORMAL
- Email that the USER wishes to receive.
- PRIOR
- Also called the prior frequency estimate. The expected frequency of
SPAM or NORMAL emails after TRAINING but before looking at a
specific email message. Represents the proportion of SPAM and NORMAL
email messages in the training data. Equivalent to the terms
$\Pr[\mathrm{SPAM}]$ and $\Pr[\mathrm{NORMAL}]$.
- POSTERIOR
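- Also called the posterior frequency estimate. The conditional
probability estimate of a specific message being SPAM or NORMAL
after analyzing the contents of the message. Equivalent to the
terms $\Pr[\mathrm{SPAM} \mid \mathrm{message}]$ and
$\Pr[\mathrm{NORMAL} \mid \mathrm{message}]$.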
- PUNCTUATION
- Punctuation characters. For the purposes of this
project, the punctuation characters are considered to be any
characters other than WHITESPACE, letters, or digits.
- RECOVERABLE ERROR
- An error condition that the software can
ignore, correct, or otherwise recover from. The program MUST produce
a warning message and then cleanly continue with no corruption or loss
of valid data.
- RFC822
- Document that specifies the syntax of standard
internet email messages. Refer to this document for all
specifications related to the format of email messages. Available at
http://www.ietf.org/rfc/rfc0822.txt.
- SPAM
- Email that the USER does not wish to receive.
- TRAINING
- Mode or stage in which the software compiles
statistics from data that is known to be SPAM or NORMAL.
- UNLABELED
- An email message whose class (SPAM or NORMAL) is
unknown.
- UNRECOVERABLE ERROR
- An error condition from which recovery is
impossible. The program MUST produce an error message describing the
condition and then cleanly halt.
- USER
- A single computer user or email recipient (potentially a
mailing list).
- WHITESPACE
- Non-printable characters including (but not limited
to) space, horizontal and vertical tabs, newlines, and carriage
returns. Cf. the Java JDK API call Character.isWhitespace().
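
As an illustration of how the MAILBOX, ANALYZABLE, WHITESPACE, and PUNCTUATION definitions fit together, the following is a minimal sketch in Python, not a reference implementation. The function names (`split_mailbox`, `analyzable`, `is_punctuation`) are hypothetical and do not come from the specification. It assumes that Python's `str.isspace()` is an acceptable stand-in for Java's `Character.isWhitespace()` (the two disagree on a few code points, such as the no-break space), that the ``From '' separator line is discarded rather than kept with the message, and that the ANALYZABLE parts are simply joined with newlines.

```python
# Hypothetical helper names; a sketch of the definitions above, not the required design.

ANALYZABLE_FIELDS = ("from", "to", "subject")  # FIELD-NAMEs listed under ANALYZABLE


def is_punctuation(ch):
    """PUNCTUATION: any character other than WHITESPACE, a letter, or a digit."""
    # Assumption: str.isspace() approximates Java's Character.isWhitespace().
    return not (ch.isspace() or ch.isalpha() or ch.isdigit())


def split_mailbox(text):
    """Split MAILBOX text into messages, starting a new one at each line
    that begins with the token 'From ' (note the trailing space)."""
    messages = []
    current = None
    for line in text.split("\n"):
        if line.startswith("From "):
            if current is not None:
                messages.append(current)
            current = []  # assumption: the 'From ' separator line is dropped
        elif current is not None:
            current.append(line)
    if current is not None:
        messages.append(current)
    # Trim the blank lines that separate consecutive messages.
    return ["\n".join(lines).strip("\n") for lines in messages]


def analyzable(message):
    """Return the ANALYZABLE text of one message: the FIELD-BODY of the
    From, To, and Subject HEADERS, followed by the BODY."""
    # The HEADERS section ends at the first blank line.
    head, _, body = message.partition("\n\n")
    fields = []
    for line in head.split("\n"):
        if line[:1] in (" ", "\t") and fields:
            fields[-1] += " " + line.strip()  # folded continuation line
        else:
            fields.append(line)
    parts = []
    for field in fields:
        name, sep, field_body = field.partition(":")
        if sep and name.strip().lower() in ANALYZABLE_FIELDS:
            parts.append(field_body.strip())
    parts.append(body)
    return "\n".join(parts)
```

Note that the header line ``From:'' does not start a new message, because the separator token requires a space after ``From''. A caller would typically run `split_mailbox` over the file contents and pass each resulting message to `analyzable` before tokenizing.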
Terran Lane
2004-01-26