Journal of Data Science

Reinforcement Learning: A Statistical Perspective
Volume 24, Issue 1 (2026): Special Issue: Statistical aspects of Trustworthy Machine Learning, pp. 86–105
Ying Zhou

https://doi.org/10.6339/25-JDS1205
Pub. online: 10 December 2025   •   Type: Data Science Reviews   •   Open Access

Received: 1 January 2025
Accepted: 27 October 2025
Published: 10 December 2025

Abstract

Reinforcement Learning (RL) is a powerful framework for sequential decision-making that enables agents to optimize their actions through interaction with an environment. While RL has been studied most extensively in computer science, statisticians have advanced the field by addressing challenges such as uncertainty quantification, sample efficiency, and interpretability. These contributions are particularly impactful in healthcare, where RL complements Dynamic Treatment Regimes (DTRs) and advances personalized medicine by tailoring treatments to individuals based on their evolving characteristics. This paper serves as both a tutorial for statisticians new to RL and a review of its integration with statistical methodologies. It introduces foundational RL concepts, classical algorithms, and Q-learning variants, and shows how statistical perspectives, especially causal inference, address challenges in DTRs. By bridging RL and statistics, the paper highlights opportunities to enhance decision-making in high-stakes domains such as healthcare.
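To make the flavor of the Q-learning algorithms the abstract mentions concrete, here is a minimal sketch of tabular Q-learning with epsilon-greedy exploration. The two-state, two-action MDP, the parameter values, and all variable names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP, purely for illustration: 2 states, 2 actions.
# transition[s, a] gives the next state; reward[s, a] the immediate reward.
n_states, n_actions = 2, 2
transition = np.array([[0, 1], [0, 1]])
reward = np.array([[0.0, 1.0], [2.0, 0.0]])

Q = np.zeros((n_states, n_actions))      # state-action value estimates
alpha, gamma, epsilon = 0.1, 0.9, 0.1    # step size, discount, exploration rate

s = 0
for _ in range(5000):
    # Epsilon-greedy: explore with probability epsilon, otherwise act greedily.
    a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
    r, s_next = reward[s, a], transition[s, a]
    # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print(np.round(Q, 2))  # learned values; the greedy policy is argmax over each row
```

In healthcare settings of the kind the abstract describes, such updates would typically be run offline on logged treatment trajectories rather than through live interaction, which is where off-policy evaluation and causal-inference considerations enter.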


Copyright
2026 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.
Open access article under the CC BY license.

Keywords
causal inference; dynamic treatment regimes; sequential decision-making
