Supplementary Material

JDS

Journal of Data Science

1683-86021680-743X

1680-743X

School of Statistics, Renmin University of China

JDS1151

10.6339/24-JDS1151

Statistical Data Science

Efficient UCB-Based Assignment Algorithm Under Unknown Utility with Application in Mentor-Mentee Matching

Shi

Yuyang

syuyang123@gmail.com1∗† Mei

Yajun

2 1H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA, USA 2Department of Biostatistics, School of Global Public Health, New York University, New York, NY, USA

∗Corresponding author. Email: syuyang123@gmail.com.†

This paper is part of the first author’s PhD dissertation at Georgia Institute of Technology.

2024

2492024

00117

Supplementary Material

The supplementary materials online includes: Proofs of Lemmas and Theorems used in the paper, and Python code needed to reproduce the results.

18620242282024

2024 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.

2024

Open access article under the CC BY license.

The assignment problem, crucial in various real-world applications, involves optimizing the allocation of agents to tasks for maximum utility. While it has been well-studied in the optimization literature when the underlying utilities between all agent-task pairs are known, research is sparse when the utilities are unknown and need to be learned from data on the fly. This paper addresses this gap, as motivated by mentor-mentee matching programs at many U.S. universities. We develop an efficient sequential assignment algorithm, with the aim of nearly maximizing the overall utility simultaneously over different time periods. Our proposed algorithm is to use stochastic bandit feedback to adaptively estimate the unknown utilities through linear regression models, integrating the Upper Confidence Bound (UCB) algorithm in the multi-armed bandit problem with the Hungarian algorithm in the assignment problem. We provide theoretical bounds of our algorithm for both the estimation error and the total regret. Additionally, numerical studies are also conducted to demonstrate the practical effectiveness of our algorithm.

Keywords bandits estimation optimal assignment upper confidence bound

This research was supported in part by NSF grant DMS-2015405 and by the National Center for Advancing Translational Sciences of the National Institutes of Health under Award Number UL1TR002378. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

References

Abbasi-Yadkori

, Pál

, Szepesvári

(2011). Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 24: 2312–2320.

Anderson

(2008). The Theory and Practice of Online Learning. Athabasca University Press.

Anscombe

(1953). Sequential estimation. Journal of the Royal Statistical Society, Series B, Methodological, 15(1): 1–21.

Avis

(1983). A survey of heuristics for the weighted matching problem. Networks, 13(4): 475–493.

Biró

, Gyetvai

(2023). Online voluntary mentoring: Optimising the assignment of students and mentors. European Journal of Operational Research, 307(1): 392–405.

Cesa-Bianchi

, Lugosi

(2012). Combinatorial bandits. Journal of Computer and System Sciences, 78(5): 1404–1422.

Chen

, Wang

, Yuan

(2013). Combinatorial multi-armed bandit: General framework and applications. In: Dasgupta S and McAllester D (eds.), Proceedings of the 30th International Conference on Machine Learning, 151–159. PMLR.

Chu

, Li

, Reyzin

, Schapire

(2011). Contextual bandits with linear payoff functions. In: Gordon G, Dunson D, and Dudík M (eds.), Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 208–214. JMLR Workshop and Conference Proceedings.

Deshmukh

, Dogan

, Scott

(2017). Multi-task learning for contextual bandits. Advances in Neural Information Processing Systems, 30: 4848–4856.

DiCiccio

, Efron

(1996). Bootstrap confidence intervals. Statistical Science, 11(3): 189–228.

Duan

, Pettie

(2014). Linear-time approximation for maximum weight matching. Journal of the ACM, 61(1): 1–23.

Edmonds

, Karp

(1972). Theoretical improvements in algorithmic efficiency for network flow problems. Journal of the ACM, 19(2): 248–264.

Efron

, Tibshirani

(1994). An Introduction to the Bootstrap. CRC Press.

Erraqabi

, Lazaric

, Valko

, Brunskill

, Liu

(2017). Trading off rewards and errors in multi-armed bandits. In: Singh A and Zhu J (eds.), Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 709–717. PMLR.

Fang

, Zhu

(2022). Matching for peer support: Exploring algorithmic matching for online mental health communities. Proceedings of the ACM on Human–Computer Interaction, 6(CSCW2): 1–37.

Gai

, Krishnamachari

, Jain

(2012). Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations. IEEE/ACM Transactions on Networking, 20(5): 1466–1478.

Ghosh

, Mukhopadhyay

, Sen

(2011). Sequential Estimation. John Wiley & Sons.

Hazan

(2016). Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3–4): 157–325.

Kuhn

(1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1–2): 83–97.

Kurtzberg

(1962). On approximation methods for the assignment problem. Journal of the ACM, 9(4): 419–439.

Lai

, Robbins

(1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1): 4–22.

Perrault

, Boursier

, Valko

, Perchet

(2020). Statistical efficiency of Thompson sampling for combinatorial semi-bandits. Advances in Neural Information Processing Systems, 33: 5429–5440.

Pizzato

, Rej

, Chung

, Koprinska

Kay

(2010). RECON: A reciprocal recommender for online dating. In: Proceedings of the Fourth ACM Conference on Recommender Systems, 207–214.

Shalev-Shwartz

(2012). Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2): 107–194.

Shi

, Mei

(2022). Efficient sequential ucb-based Hungarian algorithm for assignment problems. In: 2022 58th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 1–8. IEEE.

Simchi-Levi

, Wang

(2023). Multi-armed bandit experimental design: Online decision-making and adaptive inference. In: Ruiz F, Dy J, and van de Meent J-W (eds.), Proceedings of the 26th International Conference on Artificial Intelligence and Statistics, 3086–3097. PMLR.

Tomizawa

(1971). On some techniques useful for solution of transportation network problems. Networks, 1(2): 173–194.

Vakili

, Zhao

, Zhou

(2014). Time-varying stochastic multi-armed bandit problems. In: 2014 48th Asilomar Conference on Signals, Systems and Computers, 2103–2107. IEEE.

Wen

, Kveton

, Ashkan

(2015). Efficient learning in large-scale combinatorial semi-bandits. In: Bach F and Blei D (eds.), Proceedings of the 32nd International Conference on Machine Learning, 1113–1122. PMLR.

Xia

, Liu

, Sun

, Chen

(2015). Reciprocal recommendation system for online dating. In: Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 234–241. IEEE.

, Dong

, Li

, He

, Li

(2020). Contextual-bandit based personalized recommendation with time-varying user interests. In: Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 6518–6525.

Yang

, Ramdas

, Jamieson

, Wainwright

(2017). A framework for multi-a(rmed)/b(andit) testing with online FDR control. Advances in Neural Information Processing Systems, 30: 5959–5968.