Agent Performance and the Value of a Trajectory

To determine whether an AGENT is learning, that is, improving its performance over time, we must have some measure of performance. In general, this should be some aggregate of the REWARDs received. For this project, however, it will be sufficient to use TRIAL length as the measure. Over many TRIALs, the AGENT should be able to travel from the START STATE to the GOAL STATE in fewer and fewer steps, on average.
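
As a concrete illustration, here is a minimal sketch of how TRIAL length might be recorded over many TRIALs. The agent and environment interfaces used here (reset, act, step, learn) are hypothetical placeholders, not the project's actual classes; adapt the names to whatever code you have.

\begin{verbatim}
# Sketch: record TRIAL length as the performance measure over many TRIALs.
# The agent/env interfaces (reset, act, step, learn) are hypothetical
# placeholders for whatever classes your project actually defines.

def run_trials(agent, env, num_trials, max_steps=1000):
    """Return a list of TRIAL lengths, one per TRIAL (shorter is better)."""
    lengths = []
    for _ in range(num_trials):
        state = env.reset()                  # place the AGENT at a START STATE
        for step in range(1, max_steps + 1):
            action = agent.act(state)
            next_state, reward, done = env.step(action)
            agent.learn(state, action, reward, next_state)
            state = next_state
            if done:                         # reached the GOAL STATE
                break
        lengths.append(step)                 # caps at max_steps if goal not reached
    return lengths
\end{verbatim}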

Figure: Examples of learning curves. The horizontal axis gives the amount of experience, measured in TRIALs, while the vertical axis gives the performance of the AGENT, measured in TRIAL length (down is good). (a) Idealized learning curves (these do not show the fluctuations that arise in real runs from stochasticity, etc.). (b) Real learning curves from an RL experiment. These curves are more ``jagged'' than the idealized curves because of noise in the AGENT's dynamics and because of randomized START and GOAL STATEs. On average, though, the ``relational'' AGENT (lower curve) is converging to a stable solution, while the ``atomic'' AGENT (upper curve) is diverging. Note that both plots have labeled axes and legends describing each line.
[Figure panels: (a) pics/learn_curve_ideal, (b) pics/rect_grid]

We will define the value of a TRAJECTORY to be the length (number of SARS tuples) of that TRAJECTORY. The performance of the AGENT is the average value it attains over multiple TRAJECTORIES, and the AGENT's overall goal is to minimize this value. A plot of ``number of trials'' versus ``performance'' is called a LEARNING CURVE. Examples of some (idealized) learning curves are given in panel (a) of the figure above. In this figure, Algorithm A represents an AGENT that learns relatively quickly and converges to a fixed performance level. Algorithm B represents an AGENT that learns more slowly than A does, but converges to a better overall solution (asymptotes to a shorter TRIAL length). Algorithm C is an AGENT that doesn't learn anything at all: it starts with moderately strong performance, but never improves. Algorithm D is an AGENT that actually gets worse, or diverges, with experience.
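
These definitions translate directly into code. The sketch below assumes a TRAJECTORY is stored as a list of (state, action, reward, next state) tuples, as defined earlier, and uses matplotlib only for the illustrative plot; none of the function names are prescribed by the project.

\begin{verbatim}
# Sketch: TRAJECTORY value = number of SARS tuples; performance = average
# value over several TRAJECTORIES; plotting trial vs. length gives a
# LEARNING CURVE.  Assumes each trajectory is a list of
# (state, action, reward, next_state) tuples.
import matplotlib.pyplot as plt

def trajectory_value(trajectory):
    """Value of a TRAJECTORY: its length in SARS tuples (lower is better)."""
    return len(trajectory)

def performance(trajectories):
    """Average TRAJECTORY value over a batch of TRAJECTORIES."""
    return sum(trajectory_value(t) for t in trajectories) / len(trajectories)

def plot_learning_curve(lengths_per_trial, label="agent"):
    """Plot trial number vs. TRIAL length -- a LEARNING CURVE."""
    plt.plot(range(1, len(lengths_per_trial) + 1), lengths_per_trial, label=label)
    plt.xlabel("Trial")
    plt.ylabel("Trial length (steps)")
    plt.legend()
    plt.show()
\end{verbatim}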

Panel (b) of the figure above displays two learning curves from a real RL experiment. Note that this panel is much more ``jagged'' than panel (a), reflecting the noise introduced by the AGENT's stochastic dynamics and by randomly selected START and GOAL STATEs. The panel portrays two AGENTs: a ``relational'' AGENT that converges to its best solution in roughly 800 TRIALs (lower, solid blue curve), and an ``atomic'' AGENT that diverges (upper, dashed red curve). Note that both plots are labeled on both axes and that efforts have been made to visually differentiate the outcomes of the different AGENTs.
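
Because real learning curves are this noisy, it is often helpful to smooth them before comparing AGENTs. The sketch below applies a simple moving average; the window size of 50 TRIALs is an arbitrary illustrative choice, not a project requirement.

\begin{verbatim}
# Sketch: smooth a jagged learning curve with a moving average so the
# underlying trend (convergence vs. divergence) is easier to see.
# The window of 50 TRIALs is an arbitrary illustrative choice.
import numpy as np

def smooth(lengths, window=50):
    """Return the running mean of TRIAL lengths over the given window."""
    lengths = np.asarray(lengths, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(lengths, kernel, mode="valid")
\end{verbatim}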
