This client is responsible for analyzing known examples of NORMAL and SPAM emails, building the naïve Bayes statistical models for each, and saving a durable copy of those models. This program also provides an interface to display a summary of the contents of the two naïve Bayes models.
This program MUST accept training input in the format specified by RFC 822. It MUST also accept MAILBOX format files as training input, and MUST recognize the individual messages within such a file as separate email entities. It MUST correctly handle zero-length input files.
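The input handling described above could be sketched as follows. This is only an illustration, not a prescribed design: the class name `MboxSplitter` is hypothetical, and the sketch assumes the common mbox convention that a line beginning with `From ` starts a new message (and that body lines beginning with `From ` have already been escaped by whatever wrote the file). Input with no separator line is treated as a single RFC 822 message, and zero-length input yields no messages.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits training input into individual messages. */
class MboxSplitter {

    /**
     * Returns one string per message. Assumes a line starting with "From "
     * marks the beginning of a new message (mbox convention).
     */
    static List<String> split(Reader in) throws IOException {
        List<String> messages = new ArrayList<>();
        StringBuilder current = null;
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.startsWith("From ")) {
                // Separator line: close the previous message, start a new one.
                if (current != null) {
                    messages.add(current.toString());
                }
                current = new StringBuilder();
                continue; // the separator itself is not part of the message
            }
            if (current == null) {
                // No separator seen yet: treat input as a bare RFC 822 message.
                current = new StringBuilder();
            }
            current.append(line).append('\n');
        }
        if (current != null) {
            messages.add(current.toString());
        }
        return messages; // empty list for zero-length input
    }
}
```

Because the method takes a `Reader`, the same code path can serve standard input (`new InputStreamReader(System.in)`) or a named file.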
This program MAY accept its training data files from standard input, and MAY also provide direct file access to them (via an appropriately chosen command-line option). It MUST support some form of durable storage for the statistical models. The developer MAY use the Java serialization mechanism to implement this durable storage. BSFTrain MAY save its statistics in a single, combined file or in two separate files. This program MUST be capable of loading previously produced statistics files, updating them with new data, and saving the updated files. This program MUST NOT lose data from previous TRAINING sessions during this operation; it MUST only add new data to the existing data.
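One possible shape for the load, update, and save cycle is sketched below using Java serialization and a single combined file. The class names `TokenModel` and `ModelStore`, their fields, and the temporary-file naming are all assumptions made for illustration; the specification does not mandate any of them. The sketch creates fresh, empty models when no statistics file exists, and writes to a temporary file before replacing the old one so that a failed save does not destroy earlier TRAINING data.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-class model: raw token counts plus totals. */
class TokenModel implements Serializable {
    private static final long serialVersionUID = 1L;
    final Map<String, Long> counts = new HashMap<>();
    long totalTokens = 0;
    long messages = 0;

    void addToken(String token) {
        counts.merge(token, 1L, Long::sum); // only ever adds to existing counts
        totalTokens++;
    }
}

/** Hypothetical container holding both models in one combined file. */
class ModelStore implements Serializable {
    private static final long serialVersionUID = 1L;
    final TokenModel spam = new TokenModel();
    final TokenModel normal = new TokenModel();

    /** Loads existing statistics, or returns empty default models. */
    static ModelStore loadOrCreate(Path file) throws IOException, ClassNotFoundException {
        if (!Files.exists(file)) {
            return new ModelStore();
        }
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (ModelStore) in.readObject();
        }
    }

    /** Writes to a sibling temporary file first, then replaces the old file. */
    void save(Path file) throws IOException {
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(tmp))) {
            out.writeObject(this);
        }
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

A training run would then call `loadOrCreate`, feed each token of each new message to the appropriate model, and call `save`. A design that stores the two models in separate files would work equally well under the requirements.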
BSFTrain MUST maintain two statistical models: one for SPAM and one for NORMAL email. When this program is initially run (i.e., if no statistics files exist at first), it MUST construct new, default statistics tables. The designer MAY choose to require a separate "initialization" invocation to create and initialize the statistics files before TRAINING, or MAY choose to have that operation happen transparently.
This program MUST be capable of producing a human-readable dump of the current (possibly default) statistical models. The designer MAY choose any reasonable format for the dump, but the dump MUST, at a minimum, provide the relative frequency of every known token under both the SPAM and NORMAL classes, the class PRIORs, and the total number of tokens read.
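As a rough illustration of a dump that meets these minimums, the sketch below prints the class priors, the token totals, and one tab-separated line per known token. Everything about the layout is an assumption, as is the hypothetical class name `ModelDump`. The sketch also assumes that priors are estimated as the fraction of training messages in each class, that an untrained (default) model reports uniform priors of 0.5, and that relative frequencies are unsmoothed counts divided by the class total.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical dump formatter for the two token-count tables. */
class ModelDump {

    static String dump(Map<String, Long> spam, long spamMsgs,
                       Map<String, Long> normal, long normalMsgs) {
        long spamTotal = spam.values().stream().mapToLong(Long::longValue).sum();
        long normalTotal = normal.values().stream().mapToLong(Long::longValue).sum();
        long msgs = spamMsgs + normalMsgs;

        // Assumption: uniform priors when no messages have been seen yet.
        double priorSpam = msgs == 0 ? 0.5 : (double) spamMsgs / msgs;
        double priorNormal = msgs == 0 ? 0.5 : (double) normalMsgs / msgs;

        StringBuilder sb = new StringBuilder();
        sb.append(String.format(Locale.ROOT, "PRIOR SPAM=%.4f NORMAL=%.4f\n",
                priorSpam, priorNormal));
        sb.append(String.format(Locale.ROOT, "TOKENS SPAM=%d NORMAL=%d TOTAL=%d\n",
                spamTotal, normalTotal, spamTotal + normalTotal));

        // Union of both vocabularies, sorted so the dump is stable across runs.
        TreeSet<String> vocab = new TreeSet<>(spam.keySet());
        vocab.addAll(normal.keySet());
        for (String token : vocab) {
            sb.append(String.format(Locale.ROOT, "%s\t%.4f\t%.4f\n", token,
                    ratio(spam.getOrDefault(token, 0L), spamTotal),
                    ratio(normal.getOrDefault(token, 0L), normalTotal)));
        }
        return sb.toString();
    }

    /** Unsmoothed relative frequency; 0.0 when the class has no tokens. */
    private static double ratio(long count, long total) {
        return total == 0 ? 0.0 : (double) count / total;
    }
}
```

Sorting the vocabulary keeps successive dumps easy to compare with a text diff, which can help when checking that a TRAINING session only added data.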
The BSFTrain client MUST support the following command-line options:
This program MUST also support the options listed below under Common Options.
BSFTrain MAY implement additional command-line options of the designer's choice, but such options MUST NOT conflict with the options given above or the options listed below under Common Options.
Terran Lane 2004-01-26