Reading List: MDPs and POMDPs for very large state spaces
Hierarchical Structure and Other Decompositions for Addressing Very Large State Spaces in MDPs and POMDPs
Markov decision processes (MDPs) and their partially observable counterparts (POMDPs) are powerful, general models for learning and planning in stochastic domains, but our current understanding limits their practical application to relatively small state spaces (e.g., those that can be enumerated explicitly). A great deal of recent and ongoing work has focused on hierarchical decompositions and approximations that represent "macro" actions or "meta-states", typically at the cost of optimality or of the Markov property. The autonomous mobile robotics group is currently surveying such approaches. If you have suggestions for additions to this list, please contact Terran.
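To make the scaling issue concrete, here is a minimal sketch (standard value-iteration notation, not drawn from any particular paper below) of the Bellman backup that exact MDP solvers repeat for every state on every sweep; its per-sweep cost is roughly O(|S|^2 |A|), which is only practical when the state space S can be enumerated and stored explicitly.

```latex
% One sweep of value iteration over an explicitly enumerated state space S
% (assumed: discount factor \gamma, reward R(s,a), transition model P(s'|s,a)):
% V_{k+1}(s) is computed for every s in S, backing up over all actions and successors.
V_{k+1}(s) \;=\; \max_{a \in A} \Big[\, R(s,a) \;+\; \gamma \sum_{s' \in S} P(s' \mid s, a)\, V_k(s') \,\Big]
% Per-sweep cost: O(|S|^2 |A|) time and an O(|S|) value table, hence the appeal of
% hierarchical, factored, or macro-action decompositions when |S| is very large.
```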
Note that this list has expanded somewhat beyond strictly hierarchy and decomposition work to include background and related material, such as Puterman and Kemeny & Snell.
- [Baum and Nicholson, 1998a]
- Baum, J. and Nicholson, A. E. (1998a). Dynamic non-uniform abstractions for approximate planning in large structured stochastic domains. Technical Report 1998/18, School of Computer Science and Software Engineering, Monash University, Melbourne.
- [Baum and Nicholson, 1998b]
- Baum, J. and Nicholson, A. E. (1998b). Dynamic non-uniform abstractions for approximate planning in large structured stochastic domains. In Proceedings of the 5th Pacific Rim International Conference on Artificial Intelligence, pages 587--598, Singapore.
- [Bernstein, 2002]
- Bernstein, D. S. (2002). Hierarchy and finite-state controllers for POMDPs. Technical Report UM-CS-2002-026, Department of Computer Science, University of Massachusetts, Amherst, MA.
- [Bertsekas, 1987]
- Bertsekas, D. P. (1987). Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall, Englewood Cliffs, NJ.
- [Bertsekas, 1995]
- Bertsekas, D. P. (1995). Dynamic Programming and Optimal Control, volume 1. Athena Scientific, Belmont, MA.
- [Bertsekas and Tsitsiklis, 1996]
- Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Optimization and neural computation series. Athena Scientific, Belmont, MA.
- [Boutilier et al., 1997]
- Boutilier, C., Brafman, R. I., and Geib, C. (1997). Prioritized goal decomposition of Markov decision processes: Toward a synthesis of classical and decision theoretic planning. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI-97), Nagoya, Japan. Morgan Kaufmann.
- [Boutilier et al., 1999]
- Boutilier, C., Dean, T., and Hanks, S. (1999). Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 11:1--94.
- [Boutilier and Dearden, 1994]
- Boutilier, C. and Dearden, R. (1994). Using abstractions for decision-theoretic planning with time constraints. In Proceedings of the Twelfth National Conference on Artificial Intelligence, pages 1016--1022. AAAI Press.
- [Boutilier et al., 1995]
- Boutilier, C., Dearden, R., and Goldszmidt, M. (1995). Exploiting structure in policy construction. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI-95).
- [Boutilier et al., 2000]
- Boutilier, C., Dearden, R., and Goldszmidt, M. (2000). Stochastic dynamic programming with factored representations. Artificial Intelligence, 121(1--2):49--107.
- [Dayan and Hinton, 1993]
- Dayan, P. and Hinton, G. E. (1993). Feudal reinforcement learning. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems 5. Morgan Kaufmann: San Mateo, CA.
- [Dean and Givan, 1997]
- Dean, T. and Givan, R. (1997). Model minimization in Markov decision processes. In Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI-97), pages 106--111, Providence, RI. AAAI Press/MIT Press.
- [Dean et al., 1995]
- Dean, T., Kaelbling, L. P., Kirman, J., and Nicholson, A. (1995). Planning under time constraints in stochastic domains. Artificial Intelligence, 76.
- [Dietterich, 2000]
- Dietterich, T. G. (2000). An overview of MAXQ hierarchical reinforcement learning. In Choueiry, B. Y. and Walsh, T., editors, Proceedings of the Symposium on Abstraction, Reformulation and Approximation SARA 2000, Lecture Notes in Artificial Intelligence. Springer Verlag, New York.
- [Dietterich and Flann, 1995]
- Dietterich, T. G. and Flann, N. S. (1995). Explanation-based learning and reinforcement learning: A unified view. In Proceedings of the Twelfth International Conference on Machine Learning, pages 176--184. Morgan Kaufmann.
- [Drummond, 1998]
- Drummond, C. (1998). Composing functions to speed up reinforcement learning in a changing world. In Nedellec, C. and Rouveirol, C., editors, Machine Learning: Proceedings of the 10th European Conference, pages 370--381, Chemnitz, Germany. Springer-Verlag.
- [Guestrin and Ormoneit, 2001]
- Guestrin, C. and Ormoneit, D. (2001). Robust combination of local controllers. In Breese, J. and Koller, D., editors, Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence (UAI-01), pages 178--185, Seattle, WA. Morgan Kaufmann.
- [Hauskrecht et al., 1998]
- Hauskrecht, M., Meuleau, N., Boutilier, C., Kaelbling, L. P., and Dean, T. (1998). Hierarchical solution of Markov decision processes using macro-actions. In Cooper, G. F. and Moral, S., editors, Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI-98). Morgan Kaufmann.
- [Kaelbling, 1993]
- Kaelbling, L. P. (1993). Hierarchical learning in stochastic domains: Preliminary results. In Proceedings of the Tenth International Conference on Machine Learning, pages 167--173.
- [Kaelbling et al., 1998]
- Kaelbling, L. P., Littman, M. L., and Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101.
- [Kaelbling et al., 1996]
- Kaelbling, L. P., Littman, M. L., and Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237--277.
- [Kearns and Koller, 1999]
- Kearns, M. and Koller, D. (1999). Efficient reinforcement learning in factored MDPs. In Dean, T., editor, Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-99), pages 740--747, Stockholm, Sweden. Morgan Kaufmann.
- [Kearns et al., 1999a]
- Kearns, M., Mansour, Y., and Ng, A. Y. (1999a). Approximate planning in large POMDPs via reusable trajectories. In Advances in Neural Information Processing Systems.
- [Kearns et al., 1999b]
- Kearns, M., Mansour, Y., and Ng, A. Y. (1999b). A sparse sampling algorithm for near-optimal planning in large Markov decision processes. In Dean, T., editor, Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-99), volume 2, pages 1324--1331, Stockholm, Sweden. Morgan Kaufmann.
- [Kearns et al., 2001]
- Kearns, M., Mansour, Y., and Ng, A. Y. (2001). A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learning. To appear.
- [Kearns and Singh, 2001]
- Kearns, M. and Singh, S. (2001). Near-optimal reinforcement learning in polynomial time. Machine Learning. To appear.
- [Kemeny and Snell, 1976]
- Kemeny, J. G. and Snell, J. L. (1976). Finite Markov Chains. Undergraduate Texts in Mathematics. Springer-Verlag, New York.
- [Koller and Parr, 2000]
- Koller, D. and Parr, R. (2000). Policy iteration for factored MDPs. In Uncertainty in Artificial Intelligence: Proceedings of the Sixteenth Conference (UAI 2000). Morgan Kaufmann.
- [Koller and Pfeffer, 1997]
- Koller, D. and Pfeffer, A. (1997). Object-oriented Bayesian networks. In Proceedings of the Thirteenth Annual Conference on Uncertainty in Artificial Intelligence, pages 302--313, Providence, Rhode Island.
- [Lane and Kaelbling, 2001]
- Lane, T. and Kaelbling, L. P. (2001). Toward hierarchical decomposition for planning in uncertain environments. In Proceedings of the IJCAI-01 Workshop on Planning Under Uncertainty and Incomplete Information, pages 1--7, Seattle, WA. International Joint Conferences on Artificial Intelligence, Inc (IJCAII).
- [Lane and Kaelbling, 2002]
- Lane, T. and Kaelbling, L. P. (2002). Nearly deterministic abstractions of Markov decision processes. In Proceedings of the Eighteenth National Conference on Artificial Intelligence (AAAI-02), Edmonton, Canada. AAAI Press. To appear.
- [Liberatore, 2002]
- Liberatore, P. (2002). The size of MDP factored policies. In Proceedings of the Eighteenth National Conference on Artificial Intelligence (AAAI-02), Edmonton, Canada. AAAI Press. To appear.
- [Littman, 1997]
- Littman, M. (1997). Probabilistic propositional planning: Representations and complexity. In Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI-97), pages 748--754, Providence, RI. AAAI Press/MIT Press.
- [Littman et al., 1995]
- Littman, M. L., Dean, T. L., and Kaelbling, L. P. (1995). On the complexity of solving Markov decision problems. In Proceedings of the Eleventh International Conference on Uncertainty in Artificial Intelligence (UAI-95).
- [Lusena et al., 2001]
- Lusena, C., Mundhenk, M., and Goldsmith, J. (2001). Nonapproximability results for partially observable Markov decision processes. Journal of Artificial Intelligence Research, 14:83--103.
- [McCallum, 1997]
- McCallum, A. K. (1997). Efficient exploration in reinforcement learning with hidden state. In AAAI Fall Symposium on "Model-directed Autonomous Systems". AAAI Press.
- [McCallum, 1994]
- McCallum, R. A. (1994). Reduced training time for reinforcement learning with hidden state. In The Proceedings of the Eleventh International Machine Learning Workshop (Robot Learning).
- [McGovern, 2002]
- McGovern, A. (2002). Autonomous Discovery of Temporal Abstractions from Interaction with an Environment. PhD thesis, University of Massachusetts Amherst.
- [McGovern et al., 1997]
- McGovern, A., Sutton, R. S., and Fagg, A. H. (1997). Roles of macro-actions in accelerating reinforcement learning. In Proceedings of the 1997 Grace Hopper Celebration of Women in Computing, pages 13--18.
- [Meuleau et al., 1998]
- Meuleau, N., Hauskrecht, M., Kim, K.-E., Peshkin, L., Kaelbling, L. P., Dean, T., and Boutilier, C. (1998). Solving very large weakly coupled Markov decision processes. In Proceedings of the Fifteenth National Conference on Artificial Intelligence. AAAI Press.
- [Moore et al., 1999]
- Moore, A. W., Baird, L. C., and Kaelbling, L. (1999). Multi-value-functions: Efficient automatic action hierarchies for multiple goal MDPs. In Dean, T., editor, Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-99), Stockholm, Sweden. Morgan Kaufmann.
- [Mundhenk et al., 2000]
- Mundhenk, M., Goldsmith, J., Lusena, C., and Allender, E. (2000). Complexity of finite-horizon Markov decision process problems. Journal of the ACM, 47(4):681--720.
- [Parr, 1998a]
- Parr, R. (1998a). Flexible decomposition algorithms for weakly coupled Markov decision problems. In Cooper, G. F. and Moral, S., editors, Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI-98). Morgan Kaufmann.
- [Parr and Russell, 1998]
- Parr, R. and Russell, S. (1998). Reinforcement learning with hierarchies of machines. In Jordan, M. I., Kearns, M. J., and Solla, S. A., editors, Advances in Neural Information Processing Systems, volume 10. The MIT Press.
- [Parr, 1998b]
- Parr, R. E. (1998b). Hierarchical Control and Learning for Markov Decision Processes. PhD thesis, University of California at Berkeley.
- [Perkins and Barto, 2001]
- Perkins, T. J. and Barto, A. G. (2001). Lyapunov-constrained action sets for reinforcement learning. In Brodley, C. E. and Danyluk, A. P., editors, Proceedings of the Eighteenth International Conference on Machine Learning, pages 409--416, Williams College, MA. Morgan Kaufmann.
- [Precup, 2000]
- Precup, D. (2000). Temporal Abstraction in Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst, Department of Computer Science.
- [Puterman, 1994]
- Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, New York.
- [Ravindran and Barto, 2001]
- Ravindran, B. and Barto, A. G. (2001). Symmetries and model minimization of Markov decision processes. Technical Report 01-43, Department of Computer Science, University of Massachusetts, Amherst, MA.
- [Ravindran and Barto, 2002]
- Ravindran, B. and Barto, A. G. (2002). Model minimization in hierarchical reinforcement learning. In Holte, R., editor, Proceedings of the 2002 Symposium on Abstraction, Reformulation, and Approximation (SARA-2002). To appear.
- [Singh and Cohn, 1998]
- Singh, S. and Cohn, D. (1998). How to dynamically merge Markov decision processes. In Jordan, M., Kearns, M., and Solla, S., editors, Advances in Neural Information Processing Systems 10, pages 1057--1063. MIT Press, Cambridge, MA.
- [Springer, 1979]
- Springer, M. D. (1979). The Algebra of Random Variables. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, New York.
- [Sutton et al., 1999]
- Sutton, R. S., Precup, D., and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181--211.
- [Tash and Russell, 1994]
- Tash, J. and Russell, S. (1994). Control strategies for a stochastic planner. In Proceedings of the Twelfth National Conference on Artificial Intelligence, Seattle, WA. AAAI Press.
- [Tash, 1996]
- Tash, J. K. (1996). Decision Theory Made Tractable: The Value of Deliberation, with Applications to Markov Decision Process Planning. PhD thesis, University of California, Berkeley.
- [Zhang and Dietterich, 1995]
- Zhang, W. and Dietterich, T. G. (1995). A reinforcement learning approach to job-shop scheduling. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI-95), Montreal, Canada. Morgan Kaufmann.
- [Zubek and Dietterich, 2001]
- Zubek, V. B. and Dietterich, T. (2001). Two heuristics for solving POMDPs having a delayed need to observe. In Proceedings of the IJCAI-01 Workshop on Planning Under Uncertainty and Incomplete Information, pages 51--60, Seattle, WA. International Joint Conferences on Artificial Intelligence, Inc (IJCAII).
People
Here are some of the people whose work is listed above. This is far from a complete list of those working in this area; these are just the ones I've come across recently. If you know a name (and URL) that belongs here, please contact me.
- Valentina Bayer
- Craig Boutilier
- Justin Boyan
- David Cohn
- Peter Dayan
- Tom Dean
- Richard Dearden
- Thomas G. Dietterich
- Geoffrey Hinton
- Bob Givan
- Judy Goldsmith
- Carlos Guestrin
- Leslie Pack Kaelbling
- Michael Kearns
- Daphne Koller
- Terran Lane
- Michael Littman
- Yishay Mansour
- Andrew McCallum
- Amy McGovern
- Nicolas Meuleau
- Andrew W. Moore
- Ron Parr
- Ted Perkins
- Avi Pfeffer
- Doina Precup
- Martin Puterman
- Stuart Russell
- Satinder Singh
- Rich Sutton
- Jonathan Tash
- Georgios Theocharous
