Performance documentation

- Misuse of "comprise" -- cf. "The Elements of Style" or a good dictionary. (Common error.)
- "through", not "thru" (too informal for a document like this).
- Learning curves -- excellent.
- Usually we don't report accuracy rates averaged over all training sizes, but only at asymptote (the averaged number doesn't necessarily reflect the accuracy we would see "in practice" after fully training the system).
- Token counts vs. # of emails -- nice.
- Nice analysis of Krusty vs. WhiteSpace. Hard to tell whether Krusty is really doing "better" than WhiteSpace, though. At the asymptote of Fig. 1 there is still some variability in the accuracies; the two might be the same, or might differ. Remember, as well, that you're only testing on 40 emails, so you only have a resolution of 2.5% in measuring accuracy -- the results for Krusty and WhiteSpace look pretty close at that resolution. (See the first sketch at the end of these notes.)
- NGram -- did you consider using n != 5 (e.g., 6, 7, 8, ...)? Any thoughts on how that might affect performance? (A parameterized-tokenizer sketch appears at the end.)
- What about different types of errors (e.g., spam -> normal vs. normal -> spam)? (Sketch at the end.)
- "good enough" -> "well enough".
- Good observations on "future work", especially on tokenizers and training. More details on the "better algorithm"?
- Good observation on serialization -- it might be possible to serialize these models faster, or you might be better off replacing serialization with a direct I/O system. (Sketch at the end.)

User/API docs

- README.TXT
  - Nice description of compilation. Good to specify the dependency on getopt.
  - You can probably get away with something like "getopt must be on your CLASSPATH. Either add it to the CLASSPATH, or use the -classpath option to javac and java. For further details, refer to..." and then ignore it thereafter. (Example commands at the end.)
- USERDOC.TXT
  - Tool overviews. Good.
  - "BSFTrain accepts its training files either from standard input or from direct access." -- "direct access" == ? From files specified on the command line?
  - Should say that in training mode it generates the models and saves them to disk for use by BSFTest (and that in dump mode it doesn't save any).
  - Should specify which tokenizers are available, what each does, how to identify them on the command line, and possibly which one is recommended (you might refer the reader to the performance docs for additional information on these).
  - If you provide dynamic instantiation of third-party tokenizers (cool, BTW), you should describe that in the user docs, telling the user how to _add_ a third-party tokenizer. You should also provide a reference to API docs that describe how an independent developer might write a new tokenizer. (Reflection sketch at the end.)
  - Nice examples of usage. Interpreted output for the reader -- good. Explained dump -- excellent.
  - Shouldn't the output of BSFTest include the whole email (plus the X-Spam-Status header), not just the header?
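
Illustrative sketches

On the Krusty vs. WhiteSpace comparison: a quick way to judge whether the gap is real is the binomial standard error of an accuracy estimate over N test emails. With N = 40 (your test set) and a hypothetical measured accuracy of 0.90:

    \[
    \mathrm{SE}(\hat{p}) = \sqrt{\frac{\hat{p}\,(1-\hat{p})}{N}},
    \qquad
    \hat{p} = 0.90,\ N = 40 \;\Rightarrow\; \mathrm{SE} \approx 0.047
    \]

So each measured accuracy carries roughly +/- 5 percentage points of noise; two tokenizers whose asymptotic accuracies fall within that band are statistically indistinguishable on this test set.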
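
On the NGram question: a minimal sketch of a character n-gram tokenizer with the window size as a constructor parameter, so n = 6, 7, 8, ... can be tried without code changes. The Tokenizer interface and method name here are assumptions -- adapt to whatever BSFTrain actually defines:

    import java.util.ArrayList;
    import java.util.List;

    /** Assumed interface; substitute the project's real tokenizer API. */
    interface Tokenizer {
        List<String> tokenize(String text);
    }

    /** Character n-gram tokenizer with a configurable window size n. */
    class NGramTokenizer implements Tokenizer {
        private final int n;

        NGramTokenizer(int n) {
            if (n < 1) throw new IllegalArgumentException("n must be >= 1");
            this.n = n;
        }

        @Override
        public List<String> tokenize(String text) {
            List<String> grams = new ArrayList<>();
            // Slide a window of width n across the text, one character at a time.
            for (int i = 0; i + n <= text.length(); i++) {
                grams.add(text.substring(i, i + n));
            }
            return grams;
        }
    }

Intuition for performance: larger n yields more specific features but sparser counts, so you would expect accuracy to rise up to some best n and then fall off as most grams become too rare to estimate reliably.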
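
On error types: overall accuracy hides the asymmetry between the two mistakes. A sketch of the per-direction tally BSFTest could report (class and field names are illustrative):

    /** Counts the two error directions separately. */
    class ErrorCounts {
        int spamAsNormal;  // spam classified as normal: junk slips through
        int normalAsSpam;  // normal classified as spam: legitimate mail lost

        void record(boolean actualSpam, boolean predictedSpam) {
            if (actualSpam && !predictedSpam) spamAsNormal++;
            else if (!actualSpam && predictedSpam) normalAsSpam++;
        }
    }

Losing legitimate mail is usually considered the costlier error, so reporting the two rates separately is more informative than a single accuracy figure.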
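
On serialization: one form of the "direct I/O system" is to write the model's counts with DataOutputStream rather than Java object serialization, which is typically both faster and more compact. This assumes the model boils down to a token -> count map (field and class names are illustrative):

    import java.io.BufferedOutputStream;
    import java.io.DataOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.Map;

    class ModelWriter {
        /** Write token counts as a length-prefixed record stream. */
        static void write(String path, Map<String, Integer> counts) throws IOException {
            try (DataOutputStream out = new DataOutputStream(
                    new BufferedOutputStream(new FileOutputStream(path)))) {
                out.writeInt(counts.size());          // record count, for the reader
                for (Map.Entry<String, Integer> e : counts.entrySet()) {
                    out.writeUTF(e.getKey());         // token
                    out.writeInt(e.getValue());       // its count
                }
            }
        }
    }

A matching reader with DataInputStream reverses the process, and you sidestep serialVersionUID versioning headaches entirely.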
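
On the getopt/CLASSPATH note: the README wording suggested above pairs naturally with a pair of example commands. The jar name below is hypothetical -- use whatever getopt jar the project actually ships or documents (and note that Windows uses ';' rather than ':' as the path separator):

    javac -classpath .:java-getopt.jar BSFTrain.java BSFTest.java
    java  -classpath .:java-getopt.jar BSFTrain ...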
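
On dynamic instantiation of third-party tokenizers: the API doc for independent developers can be short if loading is by fully qualified class name. A reflection sketch, reusing the assumed Tokenizer interface from above:

    /** Load a tokenizer named on the command line, e.g. via a hypothetical --tokenizer flag. */
    static Tokenizer loadTokenizer(String className) throws ReflectiveOperationException {
        Class<?> cls = Class.forName(className);
        // Contract for third parties: implement Tokenizer and provide a no-arg constructor.
        return (Tokenizer) cls.getDeclaredConstructor().newInstance();
    }

The user-doc side is then just "put your class on the CLASSPATH and pass its name"; the API-doc side is the interface contract above.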