Reinforcement Learning: A Statistical Perspective
Volume 24, Issue 1 (2026): Special Issue: Statistical aspects of Trustworthy Machine Learning, pp. 86–105
Pub. online: 10 December 2025
Type: Data Science Reviews
Open Access
Received: 1 January 2025
Accepted: 27 October 2025
Published: 10 December 2025
Abstract
Reinforcement Learning (RL) is a powerful framework for sequential decision-making, enabling agents to optimize actions through interaction with their environment. While widely studied in computer science, statisticians have advanced RL by addressing challenges such as uncertainty quantification, sample efficiency, and interpretability. These contributions are particularly impactful in healthcare, where RL complements Dynamic Treatment Regimes (DTRs), advancing personalized medicine by tailoring treatments to individuals based on their evolving characteristics. This paper serves as both a tutorial for statisticians new to RL and a review of its integration with statistical methodologies. It introduces foundational RL concepts, classical algorithms, and Q-learning variants, and shows how statistical perspectives, especially causal inference, address challenges in DTRs. By bridging the two fields, the paper highlights opportunities to enhance decision-making in high-stakes domains like healthcare.