About BSFTrain
==============

The Bayesian spam filter training tool is responsible for analyzing known
examples of normal (non-spam) and spam emails, building a statistical model
for each, and saving a durable copy of the models to disk. A second utility,
BSFTest (see below), then uses these models to classify unknown email samples
as either normal or spam.

BSFTrain can run in two modes: training and dump.

In training mode, BSFTrain breaks each email into tokens using a specialized
tokenizer, which is selected at runtime. BSFTrain accepts email in the format
described by RFC822 as training input. It also accepts Unix mailbox (mbox)
format files, recognizing the individual messages within the file as separate
emails. Training input is read either from standard input or directly from a
file. The resulting statistical models are stored on disk as a single file.

In dump mode, BSFTrain outputs summary statistics for both email types. If a
log file is specified, more detailed statistics are written to that file.

Running BSFTrain
----------------

Usage: java -classpath .:./java-getopt-1.0.9.jar BSFTrain [options] -m file -k name

The full suite of commands and options is:

  -s        Treat input data as spam.
  -n        Treat input data as normal.
  -t        Runs BSFTrain in TRAINING mode. Compiles input data into tokens
            and statistics and updates the statistical models.
  -d        Runs BSFTrain in DUMP mode, providing detailed statistics if a
            log file is specified, or summary statistics only if no log file
            has been specified.
  -f file   The mailbox to read email from (optional). If not specified,
            email is read from standard input.
  -g val    The NGram value, if the tokenizer is NGram-type.
  -k name   The tokenizer.
  -l file   The name of the log file to output dump results to (optional).
  -m file   The name of the statistics model the token tables are output to.
            BSFTrain will add the .stat extension.

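As a rough illustration of what training mode accumulates, the sketch below
splits text on whitespace and keeps a per-token count, from which a total
count, a unique count, and a count/total frequency can be derived. It is not
BSFTrain's source: the class name TokenCountSketch and its methods are
hypothetical, and the real tokenizers (including the NGram type) may behave
differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch only; not taken from BSFTrain.
class TokenCountSketch {

    // Split on runs of whitespace and count how often each token occurs.
    static Map<String, Integer> countTokens(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Total number of tokens seen (the sum of all counts).
    static long totalTokens(Map<String, Integer> counts) {
        long total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        return total;
    }

    // Relative frequency of one token: count / total (0 for an empty table).
    static double probability(Map<String, Integer> counts, String token) {
        long total = totalTokens(counts);
        return total == 0 ? 0.0 : counts.getOrDefault(token, 0) / (double) total;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countTokens("buy now buy cheap");
        System.out.println("Total token count : " + totalTokens(counts));
        System.out.println("Unique token count: " + counts.size());
        System.out.println("P(buy)            : " + probability(counts, "buy"));
    }
}
```

A real model would also need to be merged with previously saved counts and
written back to disk, which this sketch leaves out.
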
NOTE: SpamBGon requires the third-party library Getopt. Before running, the
Getopt library (java-getopt-1.0.9.jar) must be located in the same directory
as BSFTest and BSFTrain. If Getopt is kept in a different location, modify
-classpath accordingly.

Examples
--------

NOTE: For the following examples, the -classpath option is omitted for
clarity.

* Example #1:

    java BSFTrain -n -t -k WhiteSpaceTokenizer -m wsStats < somenonspam

  Output:

    Standby...
    Processing normal email (standard input) using WhiteSpaceTokenizer...
    Processed: 56 email(s).
    Total now: 56 email(s).
    Token cnt: 39709
    Saving model...
    Finished.

  Explanation: Trains the system using the WhiteSpaceTokenizer with known
  good email. The model file is stored in wsStats.stat, and the training
  email is read from standard input.

* Example #2:

    java BSFTrain -s -t -k WhiteSpaceTokenizer -m wsStats -f somespam.txt

  Output:

    Standby...
    Loading model...
    Processing spam email (somespam.txt) using WhiteSpaceTokenizer...
    Processed: 93 email(s).
    Total now: 93 email(s).
    Token cnt: 30410
    Saving model...
    Finished.

  Explanation: Trains the system using the WhiteSpaceTokenizer with known
  spam email. The model file is stored in wsStats.stat, and the training
  email is read from the file somespam.txt.

* Example #3:

    java BSFTrain -d -k WhiteSpaceTokenizer -m wsStats

  Output:

    Standby...
    Loading model...
    NORMAL email token dump (summary only)...
    Tokenizer used    : edu.unm.cs351.p1.WhiteSpaceTokenizer
    Email processed   : 56
    Total token count : 39709
    Unique token count: 3930
    SPAM email token dump (summary only)...
    Tokenizer used    : edu.unm.cs351.p1.WhiteSpaceTokenizer
    Email processed   : 93
    Total token count : 30410
    Unique token count: 10116
    Finished.

  Explanation: Displays summary statistics for the normal and spam emails
  processed in wsStats.stat.

* Example #4:

    java BSFTrain -d -k WhiteSpaceTokenizer -m wsStats -l dump.txt

  Output:

    Standby...
    Loading model...
    Dumping NORMAL stats, standby...
    Dumping SPAM stats, standby...
    Finished.

  Explanation: Logs summary and detailed statistics for the normal and spam
  emails processed in wsStats.stat to dump.txt. A portion of dump.txt
  (normal email) shows:

    NORMAL email token dump:
    Tokenizer used    : WhiteSpaceTokenizer
    Email processed   : 56
    Total token count : 39709
    Unique token count: 3930

    Count  Prob (cnt/tot)  Key
    -----  --------------  ---------
        7  0.0001762825    ARMANDO
        1  0.0000251832    lead20
        1  0.0000251832    highly
        3  0.0000755496    plural
        5  0.0001259160    posting
        3  0.0000755496    footprints
        8  0.0002014657    early
        3  0.0000755496    ministry
       50  0.0012591604    1
       44  0.0011080611    2

  Each unique key in the normal token table is listed along with its count
  and probability. A summary section precedes the detail listing.

About BSFTest
=============

The Bayesian spam filter testing tool is responsible for analyzing an unknown
email, calculating the naive Bayes approximation on its tokens, and then
classifying the email as either normal or spam. The result is written to
standard output in X-Spam-Status header format.

BSFTest accepts a single unknown email in the format described by RFC822 as
input. The email is read either from standard input or, optionally, directly
from a file. If a log file is specified at runtime, a detailed log is created
showing the full analysis used to determine the normal/spam classification
(see below).

Running BSFTest
---------------

Usage: java -classpath .:./java-getopt-1.0.9.jar BSFTest [options] -m file -k name

The full suite of commands and options is:

  -f file   The file containing a single, unknown email (optional).
  -k name   The tokenizer.
  -l file   The name of the log file to output dump results to (optional).
  -m file   The name of the statistics model the token tables are stored in.
            BSFTest will add the .stat extension.

NOTE: SpamBGon requires the third-party library Getopt. Before running, the
Getopt library (java-getopt-1.0.9.jar) must be located in the same directory
as BSFTest and BSFTrain.
Should a different location be desired for Getopt, it will be necessary to
modify -classpath accordingly.

Examples
--------

NOTE: For the following examples, the -classpath option is omitted for
clarity.

* Example #1:

    java BSFTest -m wsStats -k WhiteSpaceTokenizer < sample

  Output:

    X-Spam-Status: SPAM, N: -773.23, S: -621.73, Diff: 151.51

  Explanation: Classifies an email read from standard input using the
  wsStats.stat model and the WhiteSpaceTokenizer. The output shows the
  classification, along with the Bayes approximation for normal and spam,
  and the difference between them.

* Example #2:

    java BSFTest -m wsStats -k WhiteSpaceTokenizer -f sample.txt -l out.txt

  Output:

    X-Spam-Status: SPAM, N: -773.23, S: -621.73, Diff: 151.51

  Explanation: Classifies an email located in sample.txt using the
  wsStats.stat model and the WhiteSpaceTokenizer, and writes a detailed
  analysis to out.txt. A portion of out.txt shows:

    Tokenizer: WhiteSpaceTokenizer
    Norm Table: Email=56, Tokens=39709, Unique=3930, Prior=0.375839
    Spam Table: Email=93, Tokens=30410, Unique=10116, Prior=0.624161
    X-Spam-Status: SPAM, N: -773.23, S: -621.73, Diff: 151.51

    Bayes Norm  Bayes Spam  Diff    Count  Token
    ----------  ----------  ------  -----  -----
    -21.178717  -11.454880  s  9.7      2  html
     -7.015299   -8.164568  n  1.1      2  to
    -10.589358   -7.549971  s  3.0      1  Street
    -10.589358   -8.713122  s  1.9      1  width3D80
    -10.589358   -9.223947  s  1.4      1  brcentertable
    -21.178717  -10.834570  s 10.3      2  font

  The top portion displays a summary of the normal and spam token tables,
  including the priors. The detail section lists, for each token, its Bayes
  approximation under the normal and spam models and the difference between
  the two. The 's' and 'n' designations indicate whether the token leans
  toward spam or normal, respectively.
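
The scoring described above can be sketched as a sum of log-probabilities. In
the sample log, the priors match each class's share of the training emails
(56/149 and 93/149), and a token seen twice contributes twice the value of the
same token seen once. The sketch below follows that shape but is not BSFTest's
source: the class name BayesScoreSketch, the "NORMAL" label, the use of an
absolute difference, and the add-one (Laplace) smoothing for unseen tokens are
all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch only; not taken from BSFTest.
class BayesScoreSketch {

    // Log-score of an email against one class table:
    //   ln(prior) + sum over tokens of count * ln(P(token | class)).
    // Add-one (Laplace) smoothing is an assumption here; it keeps unseen
    // tokens from producing ln(0).
    static double logScore(Map<String, Integer> emailCounts,
                           Map<String, Integer> table,
                           long totalTokens, double prior) {
        double score = Math.log(prior);
        int unique = table.size();
        for (Map.Entry<String, Integer> e : emailCounts.entrySet()) {
            int seen = table.getOrDefault(e.getKey(), 0);
            double p = (seen + 1.0) / (totalTokens + unique);
            score += e.getValue() * Math.log(p);
        }
        return score;
    }

    // Format the result like the X-Spam-Status line in the examples.
    // The "NORMAL" label and the absolute difference are assumptions.
    static String statusHeader(double norm, double spam) {
        String label = spam > norm ? "SPAM" : "NORMAL";
        return String.format(Locale.ROOT,
                "X-Spam-Status: %s, N: %.2f, S: %.2f, Diff: %.2f",
                label, norm, spam, Math.abs(norm - spam));
    }

    public static void main(String[] args) {
        // Tiny made-up tables, not real training data.
        Map<String, Integer> normTable = new HashMap<>();
        normTable.put("meeting", 3);
        normTable.put("lunch", 1);
        Map<String, Integer> spamTable = new HashMap<>();
        spamTable.put("offer", 3);
        spamTable.put("free", 1);

        Map<String, Integer> email = new HashMap<>();
        email.put("free", 2);

        double n = logScore(email, normTable, 4, 0.5);
        double s = logScore(email, spamTable, 4, 0.5);
        System.out.println(statusHeader(n, s));
    }
}
```

Working in log space keeps the product of many small probabilities from
underflowing, which is why the scores in the examples are large negative
numbers rather than values near zero.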