Multi-objective reinforcement learning for the expected utility of the return

Diederik M. Roijers, Denis Steckelmacher, Ann Nowé

Research output: Contribution to Conference › Paper › Academic

Abstract

Real-world decision problems often have multiple, possibly conflicting, objectives. In multi-objective reinforcement learning, the effects of actions in terms of these objectives must be learned by interacting with an environment. Typically, multi-objective reinforcement learning algorithms optimise the utility of the expected value of the returns. This implies the underlying assumption that it is indeed the expected value of the returns (i.e., an average return over many runs) that is important to the user. However, this is not always the case. For example, in a medical treatment setting only the return of a single run matters to the patient. This return is expressed in terms of multiple objectives, such as maximising the probability of a full recovery and minimising the severity of side-effects. The utility of such a vector-valued return is often a non-linear combination of the return in each objective. In such cases, we should thus optimise the expected value of the utility of the returns, rather than the utility of the expected value of the returns. In this paper, we propose a novel method to do so, based on policy gradient, and show empirically that our method is key to learning good policies with respect to the expected value of the utility of the returns.
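The distinction the abstract draws can be illustrated numerically: for a non-linear utility function, the utility of the expected return (the SER criterion) generally differs from the expected utility of the returns (the ESR criterion optimised in this paper). The sketch below uses a hypothetical two-objective setting and an illustrative utility function; the specific returns and utility are assumptions for demonstration only, not taken from the paper.

```python
import numpy as np

# Two possible vector-valued returns of a single episode, each with
# probability 0.5: (chance of full recovery, negated side-effect severity).
# These numbers are purely illustrative.
returns = np.array([[1.0, -0.5],
                    [0.0, -0.1]])
probs = np.array([0.5, 0.5])

def utility(r):
    # A hypothetical non-linear utility: recovery chance scaled down
    # by side-effect severity (illustrative only).
    return r[0] * (1.0 + r[1])

# Utility of the expected return (SER criterion).
ser = utility(probs @ returns)

# Expected utility of the returns (ESR criterion).
esr = probs @ np.array([utility(r) for r in returns])

print(ser, esr)  # 0.35 vs 0.25: the two criteria disagree
```

Because `utility` is non-linear, a policy that maximises one criterion need not maximise the other, which is why a method targeting the expected utility of the returns directly is needed.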

Original language: English
Publication status: Published - 14 Jul 2020
Event: 2018 Adaptive Learning Agents, ALA 2018 - Co-located Workshop at the Federated AI Meeting, FAIM 2018 - Stockholm, Sweden
Duration: 14 Jul 2018 - 15 Jul 2018

Conference

Conference: 2018 Adaptive Learning Agents, ALA 2018 - Co-located Workshop at the Federated AI Meeting, FAIM 2018
Country: Sweden
City: Stockholm
Period: 14/07/18 - 15/07/18

Keywords

  • Expected Scalarised Return
  • Multi-Objective Reinforcement Learning
  • Policy Gradient

