Learning and Planning in Large Stochastic Environments

One research area that particularly interests me is learning to act: that is, learning to perform sequential tasks in a complex environment with sensory feedback. For example, we would much rather program a mobile robot to learn its way around a building (and even learn the dynamics of its own controllers) than program it directly to do the task. Direct programming is tedious and fraught with errors, as it's incredibly difficult to foresee all of the circumstances the robot will encounter. And when you intend to drop your rover onto, say, Mars and let it go, autonomous control and learning become much more important.

This problem spawned the field of reinforcement learning, which in turn found itself adopting (some might say reinventing) the Markov decision process (MDP) formalism and its nastier big sibling, the partially observable MDP, or POMDP. In their original operations research formulations, these models were used primarily for the control of stochastic processes that are discrete in time and/or space. The reinforcement learning community has added an interest in model identification and in acquiring optimal control policies through experience rather than through direct solution of the model.
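To make that last distinction concrete, here is a minimal sketch of tabular Q-learning on a hypothetical five-state chain MDP (the chain, its noisy dynamics, and all parameter values are my own illustrative choices, not details of any system described here). The point is simply that the agent acquires a control policy from sampled experience alone, without ever writing down or solving the transition model.

```python
# Illustrative sketch: learning a policy from experience (tabular Q-learning)
# on a hypothetical noisy 5-state chain. All numbers here are assumptions.
import random

N_STATES, ACTIONS = 5, (0, 1)          # actions: 0 = left, 1 = right
GOAL = N_STATES - 1

def step(state, action):
    """Noisy chain dynamics: the intended move succeeds 80% of the time."""
    move = (1 if action == 1 else -1) if random.random() < 0.8 else random.choice((-1, 1))
    next_state = min(max(state + move, 0), N_STATES - 1)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.1, 0.95, 0.2

for episode in range(500):
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection
        a = random.choice(ACTIONS) if random.random() < epsilon else max(ACTIONS, key=lambda x: q[(s, x)])
        s2, r, done = step(s, a)
        # Q-learning update: move Q(s,a) toward the one-step bootstrapped target
        target = r + (0.0 if done else gamma * max(q[(s2, a2)] for a2 in ACTIONS))
        q[(s, a)] += alpha * (target - q[(s, a)])
        s = s2

greedy_policy = {s: max(ACTIONS, key=lambda x: q[(s, x)]) for s in range(N_STATES)}
print(greedy_policy)   # should prefer "right" (action 1) in every non-terminal state
```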

Like many problems in AI and ML, the MDP and POMDP formulations of the reinforcement learning problem are well formed, plausible, mathematically compelling, very general and descriptive, and utterly intractable. We can solve small (roughly 100,000-state) MDP models directly and do reinforcement learning in spaces of similar size, but our methods depend largely on dynamic programming or matrix inversion over the entire, explicitly represented state space. When we move to propositional/factored state representations (i.e., each state is a product of propositional variables, or fluents), the state space explodes exponentially and our best-known methods quickly break down. Moving to relational/first-order representations makes things even worse, and POMDPs are intractable even for small state spaces.
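To illustrate where the scaling problem bites, here is a sketch of value iteration over an explicitly enumerated state space (the data layout and the tiny two-state example are my own illustrative assumptions). Every Bellman backup sweeps the entire state table, so if a state is a vector of n propositional fluents the table has 2^n entries and even a single sweep becomes infeasible.

```python
# Illustrative sketch: value iteration over an explicit (flat) state space.
# P[s][a] is a list of (probability, next_state, reward) triples; this layout
# is an assumption made for the example.
def value_iteration(P, n_states, n_actions, gamma=0.95, tol=1e-6):
    V = [0.0] * n_states
    while True:
        delta = 0.0
        for s in range(n_states):  # every sweep touches EVERY enumerated state
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in range(n_actions)
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

# Tiny two-state example: from state 0, action 1 reaches the rewarding state 1
# with probability 0.9; state 1 is absorbing and yields no further reward.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.9, 1, 1.0), (0.1, 0, 0.0)]},
    1: {0: [(1.0, 1, 0.0)], 1: [(1.0, 1, 0.0)]},
}
print(value_iteration(P, n_states=2, n_actions=2))
```

With this flat representation one sweep costs on the order of |S| times |A| times the branching factor; with just 40 binary fluents, |S| is about 10^12, which is the exponential blow-up described above.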

I'm interested in scaling techniques for planning and learning in the MDP formalism. My recent work in this area, done while I was a postdoctoral researcher working with Leslie Kaelbling at the MIT AI Lab, concerned finding and exploiting "nearly deterministic" subcomponents within navigational MDPs. By finding sub-goals that the agent can reach with very high probability (at least 1 - epsilon), we can reduce an exponentially large MDP path-planning problem to a polynomial number of tractably small MDPs plus a deterministic meta-planning problem. The meta-planning problem can then be handed to special-purpose deterministic planning software that runs much more quickly in large propositional state spaces. I'm currently extending this work to scheduling applications, to POMDPs and continuous-state MDPs, and to learning the models and sub-goals themselves.
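Below is a schematic sketch of the meta-level of that decomposition. The landmark names, success probabilities, and the choice of breadth-first search as the deterministic planner are all illustrative assumptions; in the actual work, each macro-action corresponds to solving a tractably small sub-MDP, which is omitted here. Local policies that reach their sub-goal with probability at least 1 - epsilon are treated as deterministic edges, and the meta-planner simply strings the reliable ones together.

```python
# Schematic sketch of two-level planning over "nearly deterministic" sub-goals.
# The specific landmarks, probabilities, and use of BFS are illustrative assumptions.
from collections import deque

EPSILON = 0.05

# success_prob[(a, b)] = probability that the local policy for "reach b from a" succeeds
success_prob = {
    ("lobby", "hallway"): 0.99,
    ("hallway", "lab"): 0.97,
    ("hallway", "stairs"): 0.70,   # too unreliable to treat as deterministic
    ("stairs", "lab"): 0.99,
}

# Keep only macro-actions that succeed with probability >= 1 - epsilon,
# and treat them as deterministic edges at the meta level.
edges = {}
for (src, dst), p in success_prob.items():
    if p >= 1.0 - EPSILON:
        edges.setdefault(src, []).append(dst)

def meta_plan(start, goal):
    """Deterministic meta-planning: breadth-first search over reliable sub-goal edges."""
    frontier, parent = deque([start]), {start: None}
    while frontier:
        node = frontier.popleft()
        if node == goal:
            plan = []
            while node is not None:
                plan.append(node)
                node = parent[node]
            return list(reversed(plan))
        for nxt in edges.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                frontier.append(nxt)
    return None

print(meta_plan("lobby", "lab"))   # e.g. ['lobby', 'hallway', 'lab']
```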

Resources