Abstract
Policy gradient methods are successful for a wide range of reinforcement learning tasks. Traditionally, such methods utilize the score function as a stochastic gradient estimator. We investigate the effect of replacing the score function with a measure-valued derivative within an on-policy actor-critic algorithm. The hypothesis is that measure-valued derivatives reduce the need for the score-function variance reduction techniques that are common in policy gradient algorithms. We adapt the actor-critic to measure-valued derivatives and develop a novel algorithm. This method keeps the computational complexity of the measure-valued derivative within bounds by using a parameterized state-value function approximation. We show empirically that measure-valued derivatives perform comparably to score functions on the Pendulum and MountainCar environments. The empirical results of this study suggest that measure-valued derivatives can serve as a low-variance alternative to score functions in on-policy actor-critic algorithms and indeed reduce the need for variance reduction techniques.
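For readers unfamiliar with the two estimators contrasted in the abstract, the standard textbook forms are sketched below. These are the general definitions only; the paper's specific construction, including the choice of decomposition $(c_{\theta_i}, p_{\theta_i}^{+}, p_{\theta_i}^{-})$ for the policy distribution, is not reproduced here.

```latex
% Score-function (likelihood-ratio) gradient estimator:
\[
  \nabla_\theta \, \mathbb{E}_{x \sim p_\theta}\!\left[ f(x) \right]
    = \mathbb{E}_{x \sim p_\theta}\!\left[ f(x)\, \nabla_\theta \log p_\theta(x) \right]
\]

% Measure-valued derivative: for each parameter \theta_i, the derivative of the
% density decomposes into a positive and a negative component,
%   \partial_{\theta_i} p_\theta = c_{\theta_i} \bigl( p_{\theta_i}^{+} - p_{\theta_i}^{-} \bigr),
% which yields a difference-of-expectations estimator:
\[
  \partial_{\theta_i} \, \mathbb{E}_{x \sim p_\theta}\!\left[ f(x) \right]
    = c_{\theta_i} \Bigl( \mathbb{E}_{x \sim p_{\theta_i}^{+}}\!\left[ f(x) \right]
      - \mathbb{E}_{x \sim p_{\theta_i}^{-}}\!\left[ f(x) \right] \Bigr)
\]
```

In the on-policy actor-critic reading of the abstract, $p_\theta$ plays the role of the policy $\pi_\theta(a \mid s)$ and $f$ of a critic-based return estimate; because the measure-valued form requires separate expectations per parameter, a parameterized state-value approximation is what keeps the computation within bounds, as the abstract notes.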
Original language | English |
---|---|
Title of host publication | 2022 Winter Simulation Conference (WSC) |
Subtitle of host publication | [Proceedings] |
Editors | B. Feng, G. Pedrielli, Y. Peng, S. Shashaani, E. Song, C.G. Corlu, L.H. Lee, E.P. Chew, T. Roeder, P. Lendermann |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 2736-2747 |
Number of pages | 12 |
ISBN (Electronic) | 9798350309713 |
ISBN (Print) | 9781665476621 |
DOIs | |
Publication status | Published - 2022 |
Event | 2022 Winter Simulation Conference, WSC 2022 - Guilin, China |
Duration | 11 Dec 2022 → 14 Dec 2022 |
Publication series
Name | Proceedings - Winter Simulation Conference |
---|---|
Volume | 2022-December |
ISSN (Print) | 0891-7736 |
Conference
Conference | 2022 Winter Simulation Conference, WSC 2022 |
---|---|
Country/Territory | China |
City | Guilin |
Period | 11/12/22 → 14/12/22 |
Bibliographical note
Publisher Copyright: © 2022 IEEE.