On the value of learning for Bernoulli bandits with unknown parameters

Research output: Contribution to Journal › Article › Academic › peer-review

Abstract

In this paper we investigate the multi-armed bandit problem in which each arm generates an infinite sequence of Bernoulli-distributed rewards. The success parameters of these Bernoulli distributions are unknown and are initially assumed to be Beta-distributed. Each time a bandit is selected, its Beta distribution is updated with the new observation in a Bayesian way. The objective is to maximize the long-term discounted reward. We study the relationship between the need to acquire additional information and the reward by considering two extreme situations that arise after a bandit has been played N times: the situation in which the decision maker stops learning, and the situation in which the decision maker acquires full information about that bandit. We show that the difference in reward between this lower and upper bound goes to zero as N grows large.
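The Bayesian updating step described in the abstract can be sketched in a few lines. This is an illustrative toy, not the paper's analysis: the class name, the Beta(1, 1) starting prior, and the simulated success probability are all assumptions made here for demonstration.

```python
import random

class BernoulliArm:
    """One bandit arm with a Beta(alpha, beta) belief over its unknown
    Bernoulli success parameter (an illustrative sketch, not the paper's model)."""

    def __init__(self, alpha=1.0, beta=1.0):
        # Beta(1, 1) is the uniform prior; chosen here as an assumption.
        self.alpha = alpha
        self.beta = beta

    def update(self, reward):
        """Conjugate Bayesian update: a Bernoulli observation keeps the
        posterior in the Beta family (success -> alpha+1, failure -> beta+1)."""
        if reward == 1:
            self.alpha += 1
        else:
            self.beta += 1

    def posterior_mean(self):
        """Current Bayesian estimate of the arm's success probability."""
        return self.alpha / (self.alpha + self.beta)

# Play one arm N times with an (unknown to the learner) true parameter.
# As N grows, the posterior concentrates on the true value, which is the
# informal sense in which "stop learning" and "full information" converge.
random.seed(0)
p_true = 0.7  # assumed value for the simulation
arm = BernoulliArm()
for _ in range(5000):
    arm.update(1 if random.random() < p_true else 0)
print(arm.posterior_mean())
```

With many plays the posterior mean settles near the true parameter, mirroring the paper's observation that the value of further learning vanishes as N grows large.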
Original language: English
Pages (from-to): 2135-2140
Journal: IEEE Transactions on Automatic Control
Volume: 45
Publication status: Published - 2000

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 16 - Peace, Justice and Strong Institutions
