The AGENT will be trained in a series of TRIALs: contiguous sequences of experience, each beginning at a START STATE and continuing until the AGENT encounters a GOAL STATE. During a single TRIAL, the AGENT performs the following loop:
while (! AGENT at a GOAL STATE) {
    State2d s = get agent's current state
    Action a = select action for state s
    SARSTuple step = execute action a in the WORLD SIMULATOR
    agent learns from experience step
    add step to current TRAJECTORY
}
Once the TRIAL is complete (at the termination of the while loop), the current TRAJECTORY is ended, the AGENT is relocated to a new START STATE, and the process is repeated. The new START STATE is chosen uniformly at random from the set of all possible START STATEs for this environment (possibly only a single STATE, possibly the entire MAP). The whole experiment runs until some maximum number of TRIALs has been completed.
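The overall experiment structure can be sketched in Java as follows. This is only an illustrative skeleton: the toy world (states 0..4 on a line, with state 4 as the sole GOAL STATE), the fixed "move right" action selection, and all method and class names here are assumptions for demonstration, not part of the required interface.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of the outer experiment loop: repeated TRIALs, each producing
// one TRAJECTORY, with a fresh random START STATE per TRIAL.
public class ExperimentSketch {
    static final int GOAL = 4;  // toy world: states 0..4, GOAL STATE = 4

    // Runs maxTrials TRIALs; returns one TRAJECTORY (list of SARS tuples
    // stored as {s, a, r, s'} arrays) per TRIAL.
    public static List<List<int[]>> runExperiment(int maxTrials, long seed) {
        Random rng = new Random(seed);
        List<List<int[]>> trajectories = new ArrayList<>();

        for (int trial = 0; trial < maxTrials; trial++) {
            // New START STATE chosen uniformly at random (here: any non-goal state).
            int s = rng.nextInt(GOAL);
            List<int[]> trajectory = new ArrayList<>();

            while (s != GOAL) {                      // until a GOAL STATE
                int a = 1;                           // select action (always "move right" in this toy)
                int sPrime = s + a;                  // execute action in the world
                int r = (sPrime == GOAL) ? 1 : 0;    // reward from the transition
                // "agent learns from experience step" would go here
                trajectory.add(new int[] {s, a, r, sPrime});  // add step to TRAJECTORY
                s = sPrime;
            }
            trajectories.add(trajectory);            // TRIAL complete: end this TRAJECTORY
        }
        return trajectories;
    }

    public static void main(String[] args) {
        System.out.println(runExperiment(3, 0L).size());
    }
}
```

A real implementation would replace the hard-coded action with the AGENT's action-selection policy and the arithmetic transition with a call into the WORLD SIMULATOR.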
As the AGENT executes, it keeps a record of every TUPLE of experience it receives. This record is known as a TRAJECTORY. The designer may choose any data structure to implement the TRAJECTORY. The concrete implementation of a TRAJECTORY data structure MUST be a durable state object - it MUST be capable of being stored to and retrieved from DISK. The designer MAY use the Java serialization mechanism to achieve this.
Terran Lane 2005-09-27