

Near-Optimal Reinforcement Learning in Polynomial Time
Authors: Kearns M.; Singh S.
Publisher: Springer Publishing Company
ISSN: 0885-6125
Source: Machine Learning, Vol. 49, Iss. 2-3, 2002-11, pp. 209-232
Abstract
We present new algorithms for reinforcement learning and prove that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes. After observing that the number of actions required to approach the optimal return is lower bounded by the mixing time T of the optimal policy (in the undiscounted case) or by the horizon time T (in the discounted case), we then give algorithms requiring a number of actions and total computation time that are only polynomial in T and the number of states and actions, for both the undiscounted and discounted cases. An interesting aspect of our algorithms is their explicit handling of the Exploration-Exploitation trade-off.
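The abstract's reference to explicit handling of the exploration-exploitation trade-off points to the paper's "explicit explore or exploit" construction: a state becomes "known" after enough visits, an empirical model is estimated over the known states, and the learner chooses between a policy that exploits that model and a policy that tries to escape to unknown states. Below is a minimal, hedged sketch of such a loop in Python; the env_step callback, the m_known threshold, the discounted value iteration, and the switching rule are illustrative assumptions for exposition, not the paper's exact algorithm or its polynomial-time analysis.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, iters=200):
    """Greedy policy from tabular value iteration.
    P: (S, A, S) transition probabilities, R: (S, A) expected rewards."""
    S, A, _ = P.shape
    V = np.zeros(S)
    Q = np.zeros((S, A))
    for _ in range(iters):
        Q = R + gamma * (P @ V)   # (S, A) one-step backup
        V = Q.max(axis=1)
    return Q.argmax(axis=1)       # greedy action per state

def explore_or_exploit(env_step, S, A, m_known=20, steps=5000, gamma=0.95):
    """Sketch of an explicit explore-or-exploit loop (assumed interface:
    env_step(s, a) -> (next_state, reward)).  States visited at least
    m_known times are treated as 'known'; on the known-state model we
    compare an exploitation policy (empirical rewards) with an exploration
    policy whose reward is the probability of escaping to an unknown state."""
    counts = np.zeros((S, A, S))
    rew_sum = np.zeros((S, A))
    s = 0
    for _ in range(steps):
        known = counts.sum(axis=(1, 2)) >= m_known
        if known[s]:
            n_sa = counts.sum(axis=2)                       # visits per (s, a)
            P_hat = counts / np.maximum(n_sa[..., None], 1)  # empirical model
            R_hat = rew_sum / np.maximum(n_sa, 1)
            R_escape = P_hat @ (~known).astype(float)        # chance of leaving the known set
            exploit_pi = value_iteration(P_hat, R_hat, gamma)
            explore_pi = value_iteration(P_hat, R_escape, gamma)
            # Crude placeholder switch: keep exploring while unknown states remain.
            a = int(explore_pi[s]) if (~known).any() else int(exploit_pi[s])
        else:
            a = np.random.randint(A)   # balanced wandering outside the known set
        s2, r = env_step(s, a)
        counts[s, a, s2] += 1
        rew_sum[s, a] += r
        s = s2
    return counts, rew_sum
```

In the paper itself the switch is principled rather than heuristic: roughly, exploitation is chosen only when the known-state model already certifies near-optimal return, and otherwise the exploration policy is guaranteed a non-negligible probability of reaching an unknown state; the simple "explore while anything is unknown" rule above is only a placeholder for that test.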