
Journal of Data Science


Efficient UCB-Based Assignment Algorithm Under Unknown Utility with Application in Mentor-Mentee Matching
Yuyang Shi†, Yajun Mei

https://doi.org/10.6339/24-JDS1151
Pub. online: 24 September 2024      Type: Statistical Data Science      Open Access

† This paper is part of the first author’s PhD dissertation at Georgia Institute of Technology.

Received: 18 June 2024
Accepted: 22 August 2024
Published: 24 September 2024

Abstract

The assignment problem, crucial in many real-world applications, involves optimizing the allocation of agents to tasks for maximum utility. While it has been well studied in the optimization literature when the underlying utilities between all agent-task pairs are known, research is sparse when the utilities are unknown and must be learned from data on the fly. This paper addresses this gap, motivated by mentor-mentee matching programs at many U.S. universities. We develop an efficient sequential assignment algorithm that aims to nearly maximize the overall utility simultaneously over different time periods. Our algorithm uses stochastic bandit feedback to adaptively estimate the unknown utilities through linear regression models, integrating the Upper Confidence Bound (UCB) algorithm from the multi-armed bandit problem with the Hungarian algorithm from the assignment problem. We provide theoretical bounds for both the estimation error and the total regret of our algorithm, and we conduct numerical studies to demonstrate its practical effectiveness.
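The abstract's recipe (ridge-regression utility estimates, UCB exploration bonuses, and a Hungarian assignment step each period) can be illustrated with a minimal sketch. This is not the paper's implementation: the problem sizes, feature model, noise level, and the parameters `lam` and `alpha` are all illustrative assumptions, and SciPy's `linear_sum_assignment` stands in for the Hungarian algorithm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)

# Illustrative setup (not from the paper): n mentors and n mentees,
# d-dimensional pair features, true utility = x' theta + noise.
n, d, T = 4, 3, 200
theta_true = rng.normal(size=d)
X = rng.normal(size=(n, n, d))        # feature vector for each mentor-mentee pair

lam, alpha = 1.0, 1.0                 # ridge penalty and UCB exploration weight
V = lam * np.eye(d)                   # regularized Gram matrix
b = np.zeros(d)                       # accumulated feature-weighted rewards

for t in range(T):
    theta_hat = np.linalg.solve(V, b)  # ridge-regression estimate of theta
    Vinv = np.linalg.inv(V)
    # UCB score for every pair: point estimate plus confidence-width bonus.
    ucb = X @ theta_hat + alpha * np.sqrt(np.einsum('ijd,de,ije->ij', X, Vinv, X))
    # Hungarian step: maximize total UCB score over one-to-one assignments.
    rows, cols = linear_sum_assignment(ucb, maximize=True)
    # Observe noisy utilities for the chosen pairs and update the regression.
    for i, j in zip(rows, cols):
        x = X[i, j]
        r = x @ theta_true + 0.1 * rng.normal()
        V += np.outer(x, x)
        b += r * x

# After learning, the assignment under theta_hat should be near-optimal.
theta_hat = np.linalg.solve(V, b)
est_rows, est_cols = linear_sum_assignment(X @ theta_hat, maximize=True)
opt_rows, opt_cols = linear_sum_assignment(X @ theta_true, maximize=True)
```

Each round observes the realized utilities of all matched pairs (semi-bandit feedback), so the design matrix grows by n observations per period and the confidence bonus shrinks, trading exploration against exploitation exactly as the UCB principle prescribes.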

Supplementary material

The supplementary material online includes proofs of the lemmas and theorems used in the paper, and the Python code needed to reproduce the results.

References

 
Abbasi-Yadkori Y, Pál D, Szepesvári C (2011). Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 24: 2312–2320.
 
Anderson T (2008). The Theory and Practice of Online Learning. Athabasca University Press.
 
Anscombe FJ (1953). Sequential estimation. Journal of the Royal Statistical Society, Series B, Methodological, 15(1): 1–21.
 
Avis D (1983). A survey of heuristics for the weighted matching problem. Networks, 13(4): 475–493.
 
Biró P, Gyetvai M (2023). Online voluntary mentoring: Optimising the assignment of students and mentors. European Journal of Operational Research, 307(1): 392–405.
 
Cesa-Bianchi N, Lugosi G (2012). Combinatorial bandits. Journal of Computer and System Sciences, 78(5): 1404–1422.
 
Chen W, Wang Y, Yuan Y (2013). Combinatorial multi-armed bandit: General framework and applications. In: Dasgupta S and McAllester D (eds.), Proceedings of the 30th International Conference on Machine Learning, 151–159. PMLR.
 
Chu W, Li L, Reyzin L, Schapire R (2011). Contextual bandits with linear payoff functions. In: Gordon G, Dunson D, and Dudík M (eds.), Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 208–214. JMLR Workshop and Conference Proceedings.
 
Deshmukh AA, Dogan U, Scott C (2017). Multi-task learning for contextual bandits. Advances in Neural Information Processing Systems, 30: 4848–4856.
 
DiCiccio TJ, Efron B (1996). Bootstrap confidence intervals. Statistical Science, 11(3): 189–228.
 
Duan R, Pettie S (2014). Linear-time approximation for maximum weight matching. Journal of the ACM, 61(1): 1–23.
 
Edmonds J, Karp RM (1972). Theoretical improvements in algorithmic efficiency for network flow problems. Journal of the ACM, 19(2): 248–264.
 
Efron B, Tibshirani RJ (1994). An Introduction to the Bootstrap. CRC Press.
 
Erraqabi A, Lazaric A, Valko M, Brunskill E, Liu YE (2017). Trading off rewards and errors in multi-armed bandits. In: Singh A and Zhu J (eds.), Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 709–717. PMLR.
 
Fang A, Zhu H (2022). Matching for peer support: Exploring algorithmic matching for online mental health communities. Proceedings of the ACM on Human–Computer Interaction, 6(CSCW2): 1–37.
 
Gai Y, Krishnamachari B, Jain R (2012). Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations. IEEE/ACM Transactions on Networking, 20(5): 1466–1478.
 
Ghosh M, Mukhopadhyay N, Sen PK (2011). Sequential Estimation. John Wiley & Sons.
 
Hazan E (2016). Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3–4): 157–325.
 
Kuhn HW (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1–2): 83–97.
 
Kurtzberg JM (1962). On approximation methods for the assignment problem. Journal of the ACM, 9(4): 419–439.
 
Lai TL, Robbins H (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1): 4–22.
 
Perrault P, Boursier E, Valko M, Perchet V (2020). Statistical efficiency of Thompson sampling for combinatorial semi-bandits. Advances in Neural Information Processing Systems, 33: 5429–5440.
 
Pizzato L, Rej T, Chung T, Koprinska I, Kay J (2010). RECON: A reciprocal recommender for online dating. In: Proceedings of the Fourth ACM Conference on Recommender Systems, 207–214.
 
Shalev-Shwartz S (2012). Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2): 107–194.
 
Shi Y, Mei Y (2022). Efficient sequential UCB-based Hungarian algorithm for assignment problems. In: 2022 58th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 1–8. IEEE.
 
Simchi-Levi D, Wang C (2023). Multi-armed bandit experimental design: Online decision-making and adaptive inference. In: Ruiz F, Dy J, and van de Meent J-W (eds.), Proceedings of the 26th International Conference on Artificial Intelligence and Statistics, 3086–3097. PMLR.
 
Tomizawa N (1971). On some techniques useful for solution of transportation network problems. Networks, 1(2): 173–194.
 
Vakili S, Zhao Q, Zhou Y (2014). Time-varying stochastic multi-armed bandit problems. In: 2014 48th Asilomar Conference on Signals, Systems and Computers, 2103–2107. IEEE.
 
Wen Z, Kveton B, Ashkan A (2015). Efficient learning in large-scale combinatorial semi-bandits. In: Bach F and Blei D (eds.), Proceedings of the 32nd International Conference on Machine Learning, 1113–1122. PMLR.
 
Xia P, Liu B, Sun Y, Chen C (2015). Reciprocal recommendation system for online dating. In: Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 234–241. IEEE.
 
Xu X, Dong F, Li Y, He S, Li X (2020). Contextual-bandit based personalized recommendation with time-varying user interests. In: Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 6518–6525.
 
Yang F, Ramdas A, Jamieson KG, Wainwright MJ (2017). A framework for multi-a(rmed)/b(andit) testing with online FDR control. Advances in Neural Information Processing Systems, 30: 5959–5968.


Copyright
2024 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.
Open access article under the CC BY license.

Keywords
bandits; estimation; optimal assignment; upper confidence bound

Funding
This research was supported in part by NSF grant DMS-2015405 and by the National Center for Advancing Translational Sciences of the National Institutes of Health under Award Number UL1TR002378. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.


Journal of Data Science

  • Online ISSN: 1683-8602
  • Print ISSN: 1680-743X
