Variance-constrained actor-critic algorithms for discounted and average reward MDPs

Authors: L. A. Prashanth ; Mohammad Ghavamzadeh
Keywords: Markov decision process (MDP) ; Reinforcement learning (RL) ; Risk sensitive RL ; Actor-critic algorithms ; Multi-time-scale stochastic approximation ; Simultaneous perturbation stochastic approximation (SPSA) ; Smoothed functional (SF)
Journal: Machine Learning
Year: 2016
Publication date: December 2016
Volume: 105
Issue: 3
Pages: 367-417
Full-text size: 1,301 KB
Journal category: Computer Science
Journal subjects: Artificial Intelligence and Robotics ; Automation and Robotics ; Computing Methodologies ; Simulation and Modeling ; Language Translation and Linguistics
Publisher: Springer Netherlands
ISSN: 1573-0565

Abstract

In many sequential decision-making problems we may want to manage risk by minimizing some measure of variability in rewards in addition to maximizing a standard criterion. Variance-related risk measures are among the most common risk-sensitive criteria in finance and operations research. However, optimizing many such criteria is known to be a hard problem. In this paper, we consider both discounted and average reward Markov decision processes. For each formulation, we first define a measure of variability for a policy, which in turn gives us a set of risk-sensitive criteria to optimize. For each of these criteria, we derive a formula for computing its gradient. We then devise actor-critic algorithms that operate on three timescales: a TD critic on the fastest timescale, a policy gradient (actor) on the intermediate timescale, and a dual ascent for Lagrange multipliers on the slowest timescale. In the discounted setting, we point out the difficulty in estimating the gradient of the variance of the return and incorporate simultaneous perturbation approaches to alleviate this. The average setting, on the other hand, allows for an actor update using compatible features to estimate the gradient of the variance. We establish the convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in a traffic signal control application.
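
The simultaneous perturbation idea mentioned in the abstract can be illustrated with a minimal sketch. The Python below is a generic two-sided SPSA gradient estimator applied to a toy quadratic; it is not the estimator used in the paper. The names `spsa_gradient` and `toy_objective`, the perturbation size `delta`, and the objective itself are assumptions made purely for illustration.

```python
import random


def spsa_gradient(f, theta, delta=1e-2, rng=None):
    """Two-sided SPSA estimate of the gradient of f at theta.

    All coordinates are perturbed at once with random +/-1 signs, so only
    two evaluations of f are needed regardless of the dimension of theta.
    """
    if rng is None:
        rng = random  # fall back to the module-level generator
    # Rademacher perturbation: each coordinate is +1 or -1 with equal probability.
    perturb = [rng.choice((-1.0, 1.0)) for _ in theta]
    plus = [t + delta * d for t, d in zip(theta, perturb)]
    minus = [t - delta * d for t, d in zip(theta, perturb)]
    diff = f(plus) - f(minus)
    # The same scalar difference is divided by each coordinate's perturbation.
    return [diff / (2.0 * delta * d) for d in perturb]


def toy_objective(theta):
    # Hypothetical stand-in for a criterion whose gradient is hard to compute
    # directly; the true gradient here is simply 2 * theta.
    return sum(t * t for t in theta)


def averaged_spsa(f, theta, n_samples, rng):
    """Average several independent SPSA estimates to reduce their variance."""
    total = [0.0] * len(theta)
    for _ in range(n_samples):
        estimate = spsa_gradient(f, theta, rng=rng)
        total = [acc + g for acc, g in zip(total, estimate)]
    return [acc / n_samples for acc in total]
```

A single estimate is noisy in more than one dimension, because the cross terms between coordinates only cancel in expectation. Averaging many estimates, as `averaged_spsa` does, brings the result close to the true gradient for this toy objective.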
