Journal of Data Science

Reinforcement Learning: A Statistical Perspective
Volume 24, Issue 1 (2026): Special Issue: Statistical aspects of Trustworthy Machine Learning, pp. 86–105
Ying Zhou

https://doi.org/10.6339/25-JDS1205
Pub. online: 10 December 2025   •   Type: Data Science Reviews   •   Open Access

Received: 1 January 2025
Accepted: 27 October 2025
Published: 10 December 2025

Abstract

Reinforcement Learning (RL) is a powerful framework for sequential decision-making that enables agents to optimize their actions through interaction with an environment. While RL has been studied most extensively in computer science, statisticians have advanced the field by addressing challenges such as uncertainty quantification, sample efficiency, and interpretability. These contributions are particularly impactful in healthcare, where RL complements Dynamic Treatment Regimes (DTRs) and advances personalized medicine by tailoring treatments to individuals based on their evolving characteristics. This paper serves as both a tutorial for statisticians new to RL and a review of its integration with statistical methodologies. It introduces foundational RL concepts, classical algorithms, and Q-learning variants, and shows how statistical perspectives, especially causal inference, address challenges in DTRs. By bridging RL and statistics, the paper highlights opportunities to enhance decision-making in high-stakes domains such as healthcare.
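To make the flavor of the Q-learning algorithms the abstract mentions concrete, here is a minimal sketch of tabular Q-learning with epsilon-greedy exploration. The two-state, two-action MDP, the parameter values, and all variable names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP, purely for illustration: 2 states, 2 actions.
# transition[s, a] gives the next state; reward[s, a] the immediate reward.
n_states, n_actions = 2, 2
transition = np.array([[0, 1], [0, 1]])
reward = np.array([[0.0, 1.0], [2.0, 0.0]])

Q = np.zeros((n_states, n_actions))      # state-action value estimates
alpha, gamma, epsilon = 0.1, 0.9, 0.1    # step size, discount, exploration rate

s = 0
for _ in range(5000):
    # Epsilon-greedy: explore with probability epsilon, otherwise act greedily.
    a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
    r, s_next = reward[s, a], transition[s, a]
    # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print(np.round(Q, 2))  # learned values; the greedy policy is argmax over each row
```

In healthcare settings of the kind the abstract describes, such updates would typically be run offline on logged treatment trajectories rather than through live interaction, which is where off-policy evaluation and causal-inference considerations enter.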


Copyright
2026 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.
Open access article under the CC BY license.

Keywords
causal inference; dynamic treatment regimes; sequential decision-making
