
Relation between reward variance estimation and losses for UCB strategy for Gaussian two-armed bandit

https://doi.org/10.34680/2076-8052.2021.2(123).17-20

Abstract

The Gaussian two-armed bandit problem is considered. Rewards are assumed to have unknown expected values and unknown variances. Gaussian two-armed bandits may prove useful in batch processing scenarios where two processing methods are available. It is demonstrated that the expected regret is a continuous function of the reward variance when the UCB1 strategy is used. Monte Carlo simulations are used to examine the relation between the variance estimate and the losses. It is shown that using an incorrect estimate is equivalent to using non-optimal parameters of the strategy, but the regret grows only slightly even when the estimation error is fairly large, which makes it possible to estimate the variance during the initial steps of the control process.
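The following is a minimal Monte Carlo sketch in Python (not taken from the paper) of the setting the abstract describes: a UCB1-type strategy on a Gaussian two-armed bandit whose exploration bonus is scaled by an assumed standard deviation. The index form mean + sigma_hat * sqrt(2 ln t / n), the function name, and all parameters are illustrative assumptions; they show how a wrong variance estimate acts like a non-optimal strategy parameter.

```python
import numpy as np

def ucb_regret(mu, sigma, sigma_hat, horizon, n_runs=1000, seed=0):
    """Monte Carlo estimate of expected regret for a UCB1-type strategy
    on a Gaussian two-armed bandit.

    mu        -- true expected rewards of the two arms
    sigma     -- true reward standard deviation (same for both arms)
    sigma_hat -- the strategy's estimate of sigma; a wrong value
                 rescales the exploration bonus, i.e. acts like a
                 non-optimal strategy parameter
    """
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    best = mu.max()
    total_regret = 0.0
    for _ in range(n_runs):
        counts = np.ones(2)               # play each arm once to initialize
        sums = rng.normal(mu, sigma)      # one reward sample per arm
        for t in range(3, horizon + 1):
            # UCB1-type index with a Gaussian exploration bonus
            index = sums / counts + sigma_hat * np.sqrt(2.0 * np.log(t) / counts)
            a = int(np.argmax(index))
            sums[a] += rng.normal(mu[a], sigma)
            counts[a] += 1
        # expected regret of this run, given how often each arm was played
        total_regret += best * horizon - (counts * mu).sum()
    return total_regret / n_runs

# Regret under the true variance vs. under a fairly large estimation error
print(ucb_regret(mu=[0.0, 0.2], sigma=1.0, sigma_hat=1.0, horizon=2000))
print(ucb_regret(mu=[0.0, 0.2], sigma=1.0, sigma_hat=1.5, horizon=2000))
```

Under these assumptions, sweeping sigma_hat around the true sigma reproduces the qualitative effect stated in the abstract: the regret changes continuously with the variance estimate and grows only slowly for moderate estimation errors.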

About the Author

S. V. Garbar
Yaroslav-the-Wise Novgorod State University
Russian Federation


References

1. Lattimore T., Szepesvari C. Bandit Algorithms. Cambridge University Press, 2020.

2. Kolnogorov A.V. Gaussovskiy dvurukiy bandit i optimizatsiya gruppovoy obrabotki dannykh [Gaussian Two-Armed Bandit and Optimization of Batch Data Processing]. Problems of Information Transmission, 2018, vol. 54, no. 1, pp. 27–44.

3. Kolnogorov A.V. Gaussovskiy dvurukiy bandit: predel'noe opisanie [Gaussian Two-Armed Bandit: Limiting Description]. Problems of Information Transmission, 2020, vol. 56, iss. 3, pp. 86–111.

4. Bather J.A. The Minimax Risk for the Two-Armed Bandit Problem. Mathematical Learning Models — Theory and Algorithms. Lecture Notes in Statistics, 1983, vol. 20, pp. 1–11. DOI: https://doi.org/10.1007/978-1-4612-5612-0_1

5. Lai T.L. Adaptive Treatment Allocation and the Multi-Armed Bandit Problem. The Annals of Statistics, 1987, vol. 15, pp. 1091–1114.

6. Auer P. Using Confidence Bounds for Exploitation-Exploration Trade-offs. Journal of Machine Learning Research, 2002, vol. 3, pp. 397–422.

7. Reverdy P.B., Srivastava V., Leonard N.E. Modeling Human Decision Making in Generalized Gaussian Multiarmed Bandits. Proceedings of the IEEE, 2014, vol. 102, no. 4, pp. 544–571. DOI: https://doi.org/10.1109/JPROC.2014.2307024

8. Reverdy P. Gaussian multi-armed bandit problems with multiple objectives. Proceedings of the American Control Conference (ACC). Boston, USA, 2016, pp. 5263–5269. DOI: https://doi.org/10.1109/ACC.2016.7526494

9. Kolnogorov A.V., Shiyan D.N. Parallel Version of the Mirror Descent Algorithm for the Two-Armed Bandit Problem. Proceedings of the 2016 Third International Conference on Mathematics and Computers in Sciences and in Industry (MCSI 2016), Chania, Crete, 27–29 August 2016, pp. 241–245. DOI: https://doi.org/10.1109/MCSI.2016.052

10. Garbar S.V. Invariant description for batch version of UCB strategy for multi-armed bandit. Journal of Physics: Conference Series, 2020, vol. 1658, article no. 012015. DOI: https://doi.org/10.1088/1742-6596/1658/1/012015

11. Garbar S., Kolnogorov A. Invariant description for regret of UCB strategy for Gaussian multi-arm bandit. Proceedings of the 6th Stochastic Modeling Techniques and Data Analysis International Conference with Demographics Workshop (SMTDA). Barcelona, Spain, 2–5 June 2020, pp. 251–260.

12. Vogel W. An Asymptotic Minimax Theorem for the Two-Armed Bandit Problem. The Annals of Mathematical Statistics, 1960, vol. 31, pp. 444–451.



For citations:


Garbar S.V. Relation between reward variance estimation and losses for UCB strategy for Gaussian two-armed bandit. Vestnik NovSU. 2021;(2(123)):17-20. (In Russ.) https://doi.org/10.34680/2076-8052.2021.2(123).17-20



This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2076-8052 (Print)