Computer Immune Systems

Sequence-based Intrusion Detection

We have collected several data sets of system calls executed by active processes. These include different kinds of programs (e.g., programs that run as daemons and those that do not), programs that vary widely in their size and complexity, and different kinds of intrusions (buffer overflows, symbolic link attacks, and Trojan programs). We include only those programs that run with privilege, because misuse of these programs has the greatest potential for harm to the system.

Some of the normal data are "synthetic" and some are "live". Synthetic traces are collected in production environments by running a prepared script; the program options are chosen solely for the purpose of exercising the program, and not to meet any real user's requests. Live normal data are traces of programs collected during normal usage of a production computer system.

In some cases, we have data for the same program from multiple locations and/or multiple versions of the program. Each of these is a distinct data set; normal traces from one set can be quite different from those of another. Intrusions collected at one location or with a certain version of the program should not be compared to normal data from a different set.

Each trace is the list of system calls issued by a single process from the beginning of its execution to the end. Trace lengths vary widely because of differences in program complexity and because some traces come from daemon processes and others do not.

Links to descriptions of each program's data sets are below. Each trace file (*.int) lists pairs of numbers, one pair per line. The first number in a pair is the PID of the executing process, and the second is a number representing the system call. Note that there may be multiple processes within a single file, and they may be interleaved.
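As a rough illustration of the file layout described above, the following Python sketch separates an interleaved trace file into one system call list per PID. It assumes the two numbers on each line are separated by whitespace, which the description does not state explicitly; the function name split_by_pid and the sample values are invented for this example.

```python
from collections import defaultdict

def split_by_pid(lines):
    """Group system call numbers by PID, keeping each process's call order.

    Assumes each line holds two whitespace-separated integers:
    the PID first, then the system call number.
    """
    traces = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        pid, call = int(fields[0]), int(fields[1])
        traces[pid].append(call)
    return dict(traces)

# Made-up sample: two processes (101 and 102) interleaved in one file.
sample = ["101 4", "102 5", "101 2", "102 5", "101 66"]
per_process = split_by_pid(sample)
```

To read a real file, the same function can be passed an open file object, since iterating over a file yields its lines.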

The mapping between system call numbers and actual system call names is given in a separate file. Since a variety of tracing packages and operating systems were used, the same mapping file is not used for all programs. The individual program pages indicate which mapping is appropriate. Each mapping file is just a list of system call names, where the line number for each name indicates the number used for that system call. Line numbers begin at zero.
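A minimal sketch of how such a mapping file could be used, assuming one system call name per line: because line numbers begin at zero, the name for call number n is simply element n of the list of lines. The helper names and the sample names below are hypothetical and do not come from any of the actual mapping files.

```python
def load_mapping(lines):
    """Return the list of names; index n holds the name for call number n.

    Blank lines are kept (as empty strings) so that later indices
    do not shift.
    """
    return [line.strip() for line in lines]

def call_name(mapping, number):
    """Look up a system call name, tolerating numbers outside the mapping."""
    if 0 <= number < len(mapping):
        return mapping[number]
    return f"unknown({number})"

# Made-up mapping for illustration only; use the file named on the
# relevant program page for real data.
mapping = load_mapping(["setup\n", "exit\n", "fork\n", "read\n"])
name = call_name(mapping, 2)
```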

We used a method called "sequence time-delay embedding", or stide, to model the data. In the training phase, stide builds a database of all unique, contiguous system call sequences of a predetermined fixed length occurring in the traces. During testing, stide compares sequences in the new traces to those in the database and reports an anomaly measure indicating how much the new traces differ from the normal training data. The links below will let you download the stide code and a PostScript version of the user's manual.
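The training and testing phases can be sketched in a few lines of Python. This is a simplified illustration of the idea, not the stide code itself: the function names are invented, each trace is assumed to be the system call list of a single process, and the anomaly measure shown (the fraction of windows not found in the database) is a stand-in chosen for simplicity that may differ from the measures stide actually reports.

```python
def windows(trace, k):
    """All contiguous length-k sequences in a trace, in order."""
    return [tuple(trace[i:i + k]) for i in range(len(trace) - k + 1)]

def build_database(normal_traces, k):
    """Training: collect the unique length-k sequences seen in normal traces."""
    database = set()
    for trace in normal_traces:
        database.update(windows(trace, k))
    return database

def mismatch_rate(trace, database, k):
    """Testing: fraction of the trace's length-k sequences absent from the database."""
    seqs = windows(trace, k)
    if not seqs:
        return 0.0  # trace shorter than k: nothing to compare
    misses = sum(1 for s in seqs if s not in database)
    return misses / len(seqs)

# Made-up system call numbers for illustration.
normal = [[1, 2, 3, 4, 1, 2, 3, 4]]
db = build_database(normal, k=3)
rate = mismatch_rate([1, 2, 3, 4, 5, 2, 3], db, k=3)
```

In this toy example the database holds four unique sequences, and three of the five windows in the test trace contain the unseen call 5, so they do not match. The window length k is the "predetermined fixed length" from the description; its value here is arbitrary.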

Computer Science Department
Farris Engineering Center
University of New Mexico
Albuquerque, NM 87131
Phone: (505) 277-3112 Fax: (505) 277-6927
Email: forrest@cs.unm.edu