Anomaly Detection for Computer Security

A number of critical problems in computer security can be viewed as distinguishing a "normal" circumstance from "anomalous" or "abnormal" circumstances. For example, we can think of computer viruses as (syntactically and behaviorally) abnormal modifications to normal programs. Similarly, network intrusion detection is an attempt to discern unusual or abnormal patterns in network traffic. Superficially, this is a standard binary concept learning problem from supervised learning. In practice, however, it is usually infeasible to treat the problem this way directly. Typically, we lack a thorough sample of abnormal/hostile data, either because the data itself is hard to come by (many sites don't preserve, or won't release, records of their own vulnerabilities) or because novel attacks are constantly being introduced. Furthermore, defenses based on any fixed assumption about the distribution of attacks would be vulnerable to attacks designed specifically to subvert that assumption. (Virus authors, for example, appear to test their new strains against current commercial antivirus programs in order to develop undetectable strains.) Thus, it is often advantageous to conceive of the anomaly detection problem as the task of developing a strong model of normal behavior and detecting abnormalities as deviations from that model. This offers the dual benefits of adapting to individual systems/users/sites and of (in principle) being less vulnerable to novel attacks.

A bit more formally, the anomaly detection problem can be framed as a distribution estimation problem for a single class of data (normal behavior), coupled with a threshold selection procedure to define the negative pattern (anomaly) space. The challenge lies in developing models of normal behavior that are sufficiently descriptive yet still allow discrimination of abnormalities.
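To make the framing concrete, here is a minimal sketch of the two-part recipe above: estimate a density model from normal data only, then pick a score threshold from the training data itself so that a chosen fraction of normal behavior would be (falsely) flagged. The Gaussian model, the 1% false-alarm rate, and all function names are illustrative assumptions, not part of the work described here; any density model and threshold rule could be substituted.

```python
import numpy as np

def fit_normal_model(samples):
    # Assumed model: a single Gaussian fit to the normal-only training data.
    return samples.mean(), samples.std(ddof=1)

def log_likelihood(x, mu, sigma):
    # Log-density of an observation under the fitted Gaussian.
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def choose_threshold(samples, mu, sigma, false_alarm_rate=0.01):
    # Set the threshold so that roughly `false_alarm_rate` of the
    # normal training data would score below it (i.e., be flagged).
    scores = log_likelihood(samples, mu, sigma)
    return np.percentile(scores, 100 * false_alarm_rate)

def is_anomaly(x, mu, sigma, threshold):
    # Anything less likely than the threshold is declared anomalous.
    return log_likelihood(x, mu, sigma) < threshold

# Illustrative data: "normal" behavior drawn from N(10, 2).
rng = np.random.default_rng(0)
normal_data = rng.normal(10.0, 2.0, size=5000)
mu, sigma = fit_normal_model(normal_data)
threshold = choose_threshold(normal_data, mu, sigma)
print(is_anomaly(25.0, mu, sigma, threshold))  # far from normal -> True
print(is_anomaly(10.5, mu, sigma, threshold))  # typical -> False
```

Note that no attack data is used anywhere: the threshold is calibrated entirely on normal behavior, which is exactly what makes this approach (in principle) robust to novel attacks.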

Now Available

Software
The software used for the user-behavioral anomaly detection work, as documented in my thesis, is now available under the GPL. As a warning, this is RESEARCH-GRADE software, which means it's flaky, obscure, poorly documented, and generally not intended for human consumption. The packages are also old and depend on a deprecated parsing package, PCCTS (the Purdue Compiler Construction Tool Set). I've tried to include all the necessary post-processed files, so you should be able to compile out of the box. But if you find that you need to regenerate the parser parts from the original grammar, you may be able to get them to work using PCCTS's very nifty (and much more sophisticated) successor, ANTLR. In any case, a number of people have expressed interest, so I'm making it available. Please let me know if you use either of these packages and find them useful in any way. Also, if you make any interesting modifications/improvements/bug fixes, I'd be delighted to hear about them.
Data
The Purdue Unix user data is available. Note that this data has been sanitized to remove identifying information and is available under conditions of ANONYMITY and FOR NON-COMMERCIAL USE only. More details are provided in the README file in the archive. If you're interested in the Calgary data that I discussed, I suggest that you contact Saul Greenberg at the University of Calgary directly.