Actor-Critic-Type Learning Algorithms for Markov Decision Processes (Paper)

1999, SIAM Journal on Control and Optimization, cited by 238
Reinforcement Learning in Robotics; Adaptive Dynamic Programming Control; Advanced Control Systems Optimization

Abstract

Algorithms for learning the optimal policy of a Markov decision process (MDP) based on simulated transitions are formulated and analyzed. These are variants of the well-known "actor-critic" (or "adaptive critic") algorithm in the artificial intelligence literature. Distributed asynchronous implementations are considered. The analysis involves two time scale stochastic approximations.
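
To make the two-time-scale idea concrete, the following is a minimal sketch of a tabular actor-critic driven by simulated transitions. It is not the paper's algorithm: the two-state toy MDP, the discount factor, the softmax policy, the TD(0) critic, and the step-size schedules `a_n` and `b_n` are all illustrative assumptions. The one structural point it tries to show is that the critic (value estimate) uses a larger step size than the actor (policy parameters), with `b_n / a_n` tending to zero, so the actor sees an approximately converged critic.

```python
import math
import random


def softmax(prefs):
    """Convert action preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(x - m) for x in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def run_actor_critic(num_steps=50_000, gamma=0.5, seed=0):
    """Tabular actor-critic on a hypothetical 2-state, 2-action MDP."""
    rng = random.Random(seed)
    n_states, n_actions = 2, 2
    V = [0.0] * n_states                                  # critic: value estimates
    theta = [[0.0] * n_actions for _ in range(n_states)]  # actor: preferences
    s = 0
    for n in range(1, num_steps + 1):
        # Assumed step sizes: critic is faster, actor slower, b_n / a_n -> 0.
        a_n = 1.0 / (n + 10) ** 0.6
        b_n = 1.0 / (n + 10)

        pi = softmax(theta[s])
        a = rng.choices(range(n_actions), weights=pi)[0]

        # Simulated transition of the toy MDP (an assumption for illustration):
        # action 1 pays reward 1, action 0 pays 0, next state is uniform.
        r = float(a)
        s_next = rng.randrange(n_states)

        # Temporal-difference error, used by both critic and actor.
        delta = r + gamma * V[s_next] - V[s]

        # Critic update (fast time scale).
        V[s] += a_n * delta

        # Actor update (slow time scale): softmax policy-gradient step.
        for k in range(n_actions):
            grad_log_pi = (1.0 if k == a else 0.0) - pi[k]
            theta[s][k] += b_n * delta * grad_log_pi

        s = s_next
    return V, [softmax(t) for t in theta]


if __name__ == "__main__":
    values, policy = run_actor_critic()
    print("value estimates:", values)
    print("policy per state:", policy)
```

In this toy problem action 1 is better in every state, so the learned policy should shift probability toward it. Only the visited state is updated at each step, which loosely echoes the asynchronous flavor mentioned in the abstract, though a genuinely distributed implementation is outside the scope of this sketch.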