To determine whether an AGENT is learning (improving its performance over time), we must have some measure of performance. In general, this should be some aggregate of the REWARDs received. For this project, however, it will be sufficient to use TRIAL length as the measure. Over many TRIALs, the AGENT should be able to travel from the START STATE to the GOAL STATE in fewer and fewer steps, on average.
We define the value of a TRAJECTORY to be its length: the number
of SARS tuples it contains. The performance of the AGENT is the
average value it attains over multiple TRAJECTORIES, and the
AGENT's overall goal is to minimize this value. A plot of ``number
of trials'' versus ``performance'' is called a LEARNING CURVE; a
sketch of how such a curve might be computed appears at the end of
this section.

Examples of some (idealized) learning curves are given in Figure
(a). In
this figure, Algorithm A represents an AGENT that learns relatively
quickly and converges to a fixed performance level. Algorithm B
represents an AGENT that learns more slowly than A does, but converges
to a better overall solution (asymptotes to a shorter TRIAL length).
Algorithm C is an AGENT that doesn't learn at all: it
starts with a moderately strong performance but never improves.
Algorithm D is an AGENT that actually gets worse, or diverges,
with experience.
Figure (b) displays two learning
curves from a real RL experiment. Note that these curves are much
more ``jagged'' than those in Figure (a), reflecting the noise
introduced by the AGENT's stochastic dynamics and randomly selected
START or GOAL
STATEs. This figure portrays two AGENTs: a ``relational'' AGENT that
converges to its best solution in roughly 800 TRIALs (lower, solid
blue curve), and an ``atomic'' AGENT that diverges (upper, dashed red
curve). Note that both plots are labeled on both axes and that
efforts have been made to visually differentiate the outcomes of
different AGENTs.
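
To make the bookkeeping concrete, here is a minimal Python sketch of
turning a sequence of TRIAL lengths into a learning curve. The
run_trial() function is a hypothetical stand-in for your own
AGENT/ENVIRONMENT loop (it merely fakes noisy, shrinking trial
lengths), and the 50-trial smoothing window is an arbitrary choice;
only the averaging and plotting around it illustrate the performance
measure defined above.

    import random
    import matplotlib.pyplot as plt

    def run_trial(trial_number):
        # Hypothetical stand-in for one TRIAL of your own
        # AGENT/ENVIRONMENT loop: returns the TRAJECTORY length
        # (number of SARS tuples) needed to reach the GOAL STATE.
        # Here we simply fake a learner whose trials shrink noisily
        # with experience.
        return max(1, int(100.0 / (1 + 0.01 * trial_number)
                          + random.gauss(0, 5)))

    num_trials = 1000
    window = 50   # arbitrary smoothing window

    lengths = [run_trial(t) for t in range(num_trials)]

    # Performance at trial t = average TRAJECTORY length over the
    # most recent `window` trials. Smaller is better, since the
    # AGENT's goal is to minimize trial length.
    performance = []
    for t in range(num_trials):
        recent = lengths[max(0, t - window + 1):t + 1]
        performance.append(sum(recent) / len(recent))

    plt.plot(range(num_trials), performance)
    plt.xlabel("Trial number")
    plt.ylabel("Average trajectory length (last %d trials)" % window)
    plt.title("Learning curve")
    plt.show()

Averaging over a sliding window, rather than plotting the raw trial
lengths, smooths out the trial-to-trial noise that makes real curves
like those in Figure (b) jagged; plotting the raw lengths instead
would reproduce that jaggedness.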