
Relation between reward variance estimation and losses for UCB strategy for Gaussian two-armed bandit

https://doi.org/10.34680/2076-8052.2021.2(123).17-20

Abstract

The Gaussian two-armed bandit problem is considered. Rewards are assumed to have unknown expected values and unknown variances. Gaussian two-armed bandits may prove useful in batch processing scenarios where two processing methods are available. It is demonstrated that the expected regret is a continuous function of the reward variance when the UCB1 strategy is used. Monte Carlo simulations are used to examine the relation between the variance estimate and the losses. It is shown that using an incorrect estimate is equivalent to using non-optimal parameters of the strategy, but the regret grows only slightly even when the estimation error is fairly large, which makes it possible to estimate the variance during the initial steps of the control process.
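The following is a minimal Monte Carlo sketch in Python (not taken from the paper) of the setting the abstract describes: a UCB1-type strategy on a Gaussian two-armed bandit whose exploration bonus is scaled by an assumed standard deviation. The index form mean + sigma_hat * sqrt(2 ln t / n), the function name, and all parameters are illustrative assumptions; they show how a wrong variance estimate acts like a non-optimal strategy parameter.

```python
import numpy as np

def ucb_regret(mu, sigma, sigma_hat, horizon, n_runs=1000, seed=0):
    """Monte Carlo estimate of expected regret for a UCB1-type strategy
    on a Gaussian two-armed bandit.

    mu        -- true expected rewards of the two arms
    sigma     -- true reward standard deviation (same for both arms)
    sigma_hat -- the strategy's estimate of sigma; a wrong value
                 rescales the exploration bonus, i.e. acts like a
                 non-optimal strategy parameter
    """
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    best = mu.max()
    total_regret = 0.0
    for _ in range(n_runs):
        counts = np.ones(2)               # play each arm once to initialize
        sums = rng.normal(mu, sigma)      # one reward sample per arm
        for t in range(3, horizon + 1):
            # UCB1-type index with a Gaussian exploration bonus
            index = sums / counts + sigma_hat * np.sqrt(2.0 * np.log(t) / counts)
            a = int(np.argmax(index))
            sums[a] += rng.normal(mu[a], sigma)
            counts[a] += 1
        # expected regret of this run, given how often each arm was played
        total_regret += best * horizon - (counts * mu).sum()
    return total_regret / n_runs

# Regret under the true variance vs. under a fairly large estimation error
print(ucb_regret(mu=[0.0, 0.2], sigma=1.0, sigma_hat=1.0, horizon=2000))
print(ucb_regret(mu=[0.0, 0.2], sigma=1.0, sigma_hat=1.5, horizon=2000))
```

Under these assumptions, sweeping sigma_hat around the true sigma reproduces the qualitative effect stated in the abstract: the regret changes continuously with the variance estimate and grows only slowly for moderate estimation errors.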

About the Author

S. V. Garbar
Yaroslav-the-Wise Novgorod State University
Russian Federation


References

1. Lattimore T., Szepesvari C. Bandit Algorithms. Cambridge University Press, 2020.

2. Kolnogorov A.V. Gaussovskiy dvurukiy bandit i optimizatsiya gruppovoy obrabotki dannykh [Gaussian Two-Armed Bandit and Optimization of Batch Data Processing]. Problems of Information Transmission, 2018, vol. 54, no. 1, pp. 27–44.

3. Kolnogorov A.V. Gaussovskiy dvurukiy bandit: predel'noe opisanie [Gaussian Two-Armed Bandit: Limiting Description]. Problems of Information Transmission, 2020, vol. 56, iss. 3, pp. 86–111.

4. Bather J.A. The Minimax Risk for the Two-Armed Bandit Problem. Mathematical Learning Models — Theory and Algorithms. Lecture Notes in Statistics, 1983, vol. 20, pp. 1–11. DOI: https://doi.org/10.1007/978-1-4612-5612-0_1

5. Lai T.L. Adaptive Treatment Allocation and the Multi-Armed Bandit Problem. The Annals of Statistics, 1987, vol. 15, pp. 1091–1114.

6. Auer P. Using Confidence Bounds for Exploitation-Exploration Trade-offs. Journal of Machine Learning Research, 2002, vol. 3, pp. 397–422.

7. Reverdy P.B., Srivastava V., Leonard N.E. Modeling Human Decision Making in Generalized Gaussian Multiarmed Bandits. Proceedings of the IEEE, 2014, vol. 102, no. 4, pp. 544–571. DOI: https://doi.org/10.1109/JPROC.2014.2307024

8. Reverdy P. Gaussian multi-armed bandit problems with multiple objectives. Proceedings of the American Control Conference (ACC). Boston, USA, 2016, pp. 5263–5269. DOI: https://doi.org/10.1109/ACC.2016.7526494

9. Kolnogorov A.V., Shiyan D.N. Parallel Version of the Mirror Descent Algorithm for the Two-Armed Bandit Problem. Proceedings of the 2016 Third International Conference on Mathematics and Computers in Sciences and in Industry (MCSI 2016), Chania, Crete, 27–29 August 2016, pp. 241–245. DOI: https://doi.org/10.1109/MCSI.2016.052

10. Garbar S.V. Invariant description for batch version of UCB strategy for multi-armed bandit. Journal of Physics: Conference Series, 2020, vol. 1658, article no. 012015. DOI: https://doi.org/10.1088/1742-6596/1658/1/012015

11. Garbar S., Kolnogorov A. Invariant description for regret of UCB strategy for Gaussian multi-arm bandit. Proceedings of the 6th Stochastic Modeling Techniques and Data Analysis International Conference with Demographics Workshop (SMTDA). Barcelona, Spain, 2–5 June 2020, pp. 251–260.

12. Vogel W. An Asymptotic Minimax Theorem for the Two-Armed Bandit Problem. The Annals of Mathematical Statistics, 1960, vol. 31, pp. 444–451.



For citations:


Garbar S.V. Relation between reward variance estimation and losses for UCB strategy for Gaussian two-armed bandit. Vestnik NovSU. 2021;(2(123)):17-20. (In Russ.) https://doi.org/10.34680/2076-8052.2021.2(123).17-20



This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2076-8052 (Print)