Research Projects

Immune System Modeling

Immune systems are adaptive systems in which learning takes place by evolutionary mechanisms similar to biological evolution. Their major function is to provide a defense mechanism for the body that can identify dangerous foreign material and eliminate it. Understanding the immune system is important, both because of its role in complex diseases such as AIDS and because of potential applications to computational problems. The mechanisms of the immune system are remarkably complex and poorly understood, even by immunologists.

Our immune system models are based on a universe in which antigens (foreign material) and lymphocytes (the cells that do the recognition of foreign material) are represented by strings of discrete symbols. The strings represent receptors on B cells and T cells and epitopes on antigens---the regions of the cells in which binding takes place. The complex chemistry of molecular binding is modeled in our system by string matching.

We have used this model to study several different aspects of the immune system, including its ability to detect common patterns (schemas) in noisy environments ( Forrest et al., 1993a), its ability to discover and maintain coverage of diverse pattern classes (Smith et al., 1993), and its ability to learn effectively, even when not all antibodies are expressed and not all antigens are presented (Hightower et al., 1995, Hightower et al., 1996). Current research directions include modeling cross-reactive memory (work by Derek Smith) and location-specific behavior (immune cells that behave differently depending on their physical location in the body). This work is in collaboration with Alan Perelson.

Computer Immune System

A long-term goal of my research is to build an artificial immune system for a computer. This immune system would have much more sophisticated notions of identity and protection than those afforded by current operating systems, and it would provide a general-purpose protection system to augment current computer security systems.

As an initial step in this direction, we developed a computer virus detection system based on immunological principles. Our change-detection algorithm defines a set in terms of its complement. This allows us to create a detection system with the following properties: it is probabilistic and tunable (the probability of detection can be traded off against CPU time and space), it can be distributed (providing high system-wide reliability at low individual cost), and it can detect novel viruses that have not previously been identified. We have written several papers on this work (Forrest et al., 1994a, D'haeseleer et al., 1996a, D'haeseleer, 1996).

We are currently studying a method for anomaly detection in which ``normal'' is defined by short-range correlations in a process' system calls. In preliminary experiments, we found this definition to be stable during normal behavior for several standard UNIX programs. Further, it is able to detect several common intrusions involving sendmail and other processes.

Ecological modeling with Echo

In collaboration with John Holland, Terry Jones, Peter Hraber, Andrew Kosoresow, and several ecologists, we are studying how genetic algorithms can be used in ecological modeling. Echo extends standard genetic algorithms in several interesting ways: (i) there is no explicit fitness function, (ii) individuals have local state, and (iii) the genetic representation is based on a higher-cardinality alphabet than binary strings. In Echo, fitness evaluation takes place implicitly. That is, individuals in the population (called {\em agents}) are allowed to make copies of themselves anytime they acquire enough ``resources'' to replicate their genome. Different resources are modeled by different letters of the alphabet (say, A, B, C, D), and genomes are constructed out of those same letters. However these resources can exist independently of the agent's genome, either free in the environment or stored internally by the agent. Agents acquire resources by interacting with other agents through trading relationships and combat. Echo thus relaxes the constraint that an explicit fitness function must return a numerical evaluation of each agent. This ``endogenous'' fitness function is much closer to the way fitness is assessed in natural settings. In addition to trade and combat, a third form of interaction between agents is ``mating.'' Mating provides opportunities for agents to exchange genetic material through crossover, thus creating hybrids. Mating, together with mutation, provides the mechanism for new types of agents to evolve.

In preliminary simulations, the Echo system has demonstrated surprisingly complex behaviors, including something resembling a biological arms race (in which two competing species develop progressively more complex offensive and defensive strategies), ecological dependencies among different species, and sensitivity (in terms of the number of different phenotypes) to differing levels of renewable resources. We are currently studying the extent to which macro-level behaviors of Echo mimic those of natural ecological systems. In particular, we are quantifying the species abundance patterns observed in Echo and comparing them with those found in natural ecologies.

More information about the Echo project can be found here.

Foundations of Genetic Algorithms

This project is in collaboration with Melanie Mitchell. In this work, we have investigated a number of anomalous results (some reported in the literature and some resulting from our own investigations) in which the genetic algorithm fails to perform as theory predicts. In each case, we have found interesting explanations for these failures, which have led us to think more deeply about how the algorithm really works (Forrest and Mitchell, 1991; Mitchell et al., 1992; Forrest and Mitchell 1993a; Forrest and Mitchell 1993b). We have recently described and analyzed an idealized version of the genetic algorithm (the IGA) which contains the features of genetic algorithms that we believe are most important to its success---independent sampling, combining good partial solutions through crossover, and preserving these combinations through selection. Our mathematical analysis provides a lower bound on the time to discover an optimal solution for one class of problems. We performed a similar analysis for hillclimbing, which is a step towards predicting the conditions under which the genetic algorithm will outperform hillclimbing and vice versa (Mitchell et al., 1993a). An important area of future research is to study what parameter settings and versions of the genetic algorithm allow it to conform most closely to the IGA.

DNA Fragment Assembly

This project is in collaboration with Rebecca Parsons. We are applying the genetic algorithm to a computational biology problem---the assembly of DNA sequence fragments from a parent clone whose sequence is unknown into a consensus sequence corresponding to the parent sequence. This is called the ``fragment assembly problem'' and was chosen as one of two challenge problems for the 1994-95 DIMACS competition. It is a permutation problem, similar to the Traveling Salesman Problem (TSP), and similarly, NP-complete, but with some important differences (circular tours, noise, and special relationships between entities). In this work we have studied two different representations and many different genetic operators. We have compared the two representations and their associated operators on problems ranging from 2 kilobases to 34 kilobases. After experimenting with different combinations, we were able to solve a 10 kilobase sequence, consisting of 177 fragments, with no manual intervention. We recently extended our work to a 752 fragment set and achieved very close agreement with the published consensus sequence.