<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">JDS</journal-id>
<journal-title-group><journal-title>Journal of Data Science</journal-title></journal-title-group>
<issn pub-type="epub">1683-8602</issn><issn pub-type="ppub">1680-743X</issn><issn-l>1680-743X</issn-l>
<publisher>
<publisher-name>School of Statistics, Renmin University of China</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">JDS1205</article-id>
<article-id pub-id-type="doi">10.6339/25-JDS1205</article-id>
<article-categories><subj-group subj-group-type="heading">
<subject>Data Science Reviews</subject></subj-group></article-categories>
<title-group>
<article-title>Reinforcement Learning: A Statistical Perspective</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Zhou</surname><given-names>Ying</given-names></name><email xlink:href="mailto:yzhou@uconn.edu">yzhou@uconn.edu</email><xref ref-type="aff" rid="j_jds1205_aff_001">1</xref><xref ref-type="fn" rid="cor1">∗</xref>
</contrib>
<aff id="j_jds1205_aff_001"><label>1</label>Department of Statistics, <institution>University of Connecticut</institution>, Storrs, CT 06269, <country>U.S.A.</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>∗</label>Email: <ext-link ext-link-type="uri" xlink:href="mailto:yzhou@uconn.edu">yzhou@uconn.edu</ext-link>.</corresp>
</author-notes>
<pub-date pub-type="ppub"><year>2026</year></pub-date><pub-date pub-type="epub"><day>10</day><month>12</month><year>2025</year></pub-date><volume>24</volume><issue>1</issue><fpage>86</fpage><lpage>105</lpage><history><date date-type="received"><day>1</day><month>1</month><year>2025</year></date><date date-type="accepted"><day>27</day><month>10</month><year>2025</year></date></history>
<permissions><copyright-statement>2026 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.</copyright-statement><copyright-year>2026</copyright-year>
<license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>Open access article under the <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">CC BY</ext-link> license.</license-p></license></permissions>
<abstract>
<p>Reinforcement Learning (RL) is a powerful framework for sequential decision-making, enabling agents to optimize their actions through interaction with an environment. While RL has been studied most extensively in computer science, statisticians have advanced the field by addressing challenges such as uncertainty quantification, sample efficiency, and interpretability. These contributions are particularly impactful in healthcare, where RL complements Dynamic Treatment Regimes (DTRs) and advances personalized medicine by tailoring treatments to individual patients based on their evolving characteristics. This paper serves as both a tutorial for statisticians new to RL and a review of its integration with statistical methodologies. It introduces foundational RL concepts, classical algorithms, and Q-learning variants, and highlights how statistical perspectives, especially causal inference, address challenges in DTRs. By bridging the two fields, the paper identifies opportunities to enhance decision-making in high-stakes domains such as healthcare.</p>
</abstract>
<kwd-group>
<label>Keywords</label>
<kwd>causal inference</kwd>
<kwd>dynamic treatment regimes</kwd>
<kwd>sequential decision-making</kwd>
</kwd-group>
</article-meta>
</front>
<back>
<ref-list id="j_jds1205_reflist_001">
<title>References</title>
<ref id="j_jds1205_ref_001">
<mixed-citation publication-type="other"> <string-name><surname>Agarwal</surname> <given-names>A</given-names></string-name>, <string-name><surname>Han</surname> <given-names>S</given-names></string-name>, <string-name><surname>Saha</surname> <given-names>D</given-names></string-name>, <string-name><surname>Syrgkanis</surname> <given-names>V</given-names></string-name>, <string-name><surname>Yoon</surname> <given-names>H</given-names></string-name> (<year>2025</year>). Synthetic blips: Generalizing synthetic controls for dynamic treatment effects. arXiv preprint: <uri>https://arxiv.org/abs/2210.11003v2</uri>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_002">
<mixed-citation publication-type="chapter"> <string-name><surname>Allen</surname> <given-names>C</given-names></string-name>, <string-name><surname>Parikh</surname> <given-names>N</given-names></string-name>, <string-name><surname>Gottesman</surname> <given-names>O</given-names></string-name>, <string-name><surname>Konidaris</surname> <given-names>G</given-names></string-name> (<year>2021</year>). <chapter-title>Learning Markov state abstractions for deep reinforcement learning</chapter-title>. In: <source><italic>Advances in Neural Information Processing Systems</italic></source>, volume <volume>34</volume>, <fpage>8229</fpage>–<lpage>8241</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_003">
<mixed-citation publication-type="journal"> <string-name><surname>Angrist</surname> <given-names>JD</given-names></string-name>, <string-name><surname>Imbens</surname> <given-names>GW</given-names></string-name>, <string-name><surname>Rubin</surname> <given-names>DB</given-names></string-name> (<year>1996</year>). <article-title>Identification of causal effects using instrumental variables</article-title>. <source><italic>Journal of the American Statistical Association</italic></source>, <volume>91</volume>(<issue>434</issue>): <fpage>444</fpage>–<lpage>455</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1080/01621459.1996.10476902" xlink:type="simple">https://doi.org/10.1080/01621459.1996.10476902</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_004">
<mixed-citation publication-type="journal"> <string-name><surname>Arulkumaran</surname> <given-names>K</given-names></string-name>, <string-name><surname>Deisenroth</surname> <given-names>MP</given-names></string-name>, <string-name><surname>Brundage</surname> <given-names>M</given-names></string-name>, <string-name><surname>Bharath</surname> <given-names>AA</given-names></string-name> (<year>2017</year>). <article-title>Deep reinforcement learning: A brief survey</article-title>. <source><italic>IEEE Signal Processing Magazine</italic></source>, <volume>34</volume>(<issue>6</issue>): <fpage>26</fpage>–<lpage>38</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1109/MSP.2017.2743240" xlink:type="simple">https://doi.org/10.1109/MSP.2017.2743240</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_005">
<mixed-citation publication-type="journal"> <string-name><surname>Barto</surname> <given-names>AG</given-names></string-name>, <string-name><surname>Sutton</surname> <given-names>RS</given-names></string-name>, <string-name><surname>Anderson</surname> <given-names>CW</given-names></string-name> (<year>1983</year>). <article-title>Neuronlike adaptive elements that can solve difficult learning control problems</article-title>. <source><italic>IEEE Transactions on Systems, Man and Cybernetics</italic></source>, <volume>SMC–13</volume>(<issue>5</issue>): <fpage>834</fpage>–<lpage>846</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1109/TSMC.1983.6313077" xlink:type="simple">https://doi.org/10.1109/TSMC.1983.6313077</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_006">
<mixed-citation publication-type="book"> <string-name><surname>Bellman</surname> <given-names>RE</given-names></string-name> (<year>1957</year>). <source><italic>Dynamic Programming</italic></source>. <publisher-name>Princeton University Press</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_007">
<mixed-citation publication-type="journal"> <string-name><surname>Bengio</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Courville</surname> <given-names>A</given-names></string-name>, <string-name><surname>Vincent</surname> <given-names>P</given-names></string-name> (<year>2013</year>). <article-title>Representation learning: A review and new perspectives</article-title>. <source><italic>IEEE Transactions on Pattern Analysis and Machine Intelligence</italic></source>, <volume>35</volume>(<issue>8</issue>): <fpage>1798</fpage>–<lpage>1828</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1109/TPAMI.2013.50" xlink:type="simple">https://doi.org/10.1109/TPAMI.2013.50</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_008">
<mixed-citation publication-type="journal"> <string-name><surname>Bennett</surname> <given-names>A</given-names></string-name>, <string-name><surname>Kallus</surname> <given-names>N</given-names></string-name> (<year>2024</year>). <article-title>Proximal reinforcement learning: Efficient off-policy evaluation in partially observed Markov decision processes</article-title>. <source><italic>Operations Research</italic></source>, <volume>72</volume>(<issue>3</issue>): <fpage>1071</fpage>–<lpage>1086</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1287/opre.2021.0781" xlink:type="simple">https://doi.org/10.1287/opre.2021.0781</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_009">
<mixed-citation publication-type="book"> <string-name><surname>Bertsekas</surname> <given-names>DP</given-names></string-name> (<year>2017</year>). <source><italic>Dynamic Programming and Optimal Control</italic></source>. <publisher-name>Athena Scientific</publisher-name>, <publisher-loc>Belmont, MA</publisher-loc>, <edition>4</edition>th edition.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_010">
<mixed-citation publication-type="book"> <string-name><surname>Breiman</surname> <given-names>L</given-names></string-name>, <string-name><surname>Friedman</surname> <given-names>JH</given-names></string-name>, <string-name><surname>Olshen</surname> <given-names>RA</given-names></string-name>, <string-name><surname>Stone</surname> <given-names>CJ</given-names></string-name> (<year>1984</year>). <source><italic>Classification and Regression Trees</italic></source>. <publisher-name>Chapman and Hall/CRC</publisher-name>, <publisher-loc>New York</publisher-loc>, <edition>1</edition>st edition.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_011">
<mixed-citation publication-type="chapter"> <string-name><surname>Cai</surname> <given-names>H</given-names></string-name>, <string-name><surname>Ren</surname> <given-names>K</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Malialis</surname> <given-names>K</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>Y</given-names></string-name>, <etal>et al.</etal> (<year>2017</year>). <chapter-title>Real-time bidding by reinforcement learning in display advertising</chapter-title>. In: <source><italic>Proceedings of the Tenth ACM International Conference on Web Search and Data Mining</italic></source>, <fpage>661</fpage>–<lpage>670</lpage>. <publisher-name>ACM</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_012">
<mixed-citation publication-type="book"> <string-name><surname>Chakraborty</surname> <given-names>B</given-names></string-name>, <string-name><surname>Moodie</surname> <given-names>EE</given-names></string-name> (<year>2013</year>). <source><italic>Statistical Methods for Dynamic Treatment Regimes</italic></source>. <publisher-name>Springer</publisher-name>, <publisher-loc>New York, NY</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_013">
<mixed-citation publication-type="journal"> <string-name><surname>Chen</surname> <given-names>S</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>B</given-names></string-name> (<year>2023</year>). <article-title>Estimating and improving dynamic treatment regimes with a time-varying instrumental variable</article-title>. <source><italic>Journal of the Royal Statistical Society, Series B, Statistical Methodology</italic></source>, <volume>85</volume>(<issue>2</issue>): <fpage>427</fpage>–<lpage>453</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1093/jrsssb/qkad011" xlink:type="simple">https://doi.org/10.1093/jrsssb/qkad011</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_014">
<mixed-citation publication-type="chapter"> <string-name><surname>Choi</surname> <given-names>E</given-names></string-name>, <string-name><surname>Bahadori</surname> <given-names>MT</given-names></string-name>, <string-name><surname>Kulas</surname> <given-names>JA</given-names></string-name>, <string-name><surname>Schuetz</surname> <given-names>A</given-names></string-name>, <string-name><surname>Stewart</surname> <given-names>WF</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>J</given-names></string-name> (<year>2016</year>). <chapter-title>RETAIN: An interpretable predictive model for healthcare using reverse time attention mechanism</chapter-title>. In: <source><italic>Proceedings of the 30th International Conference on Neural Information Processing Systems</italic></source>, <fpage>3512</fpage>–<lpage>3520</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_015">
<mixed-citation publication-type="journal"> <string-name><surname>Cook</surname> <given-names>RD</given-names></string-name> (<year>2007</year>). <article-title>Fisher lecture: Dimension reduction in regression</article-title>. <source><italic>Statistical Science</italic></source>, <volume>22</volume>(<issue>1</issue>): <fpage>1</fpage>–<lpage>26</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1214/088342306000000682" xlink:type="simple">https://doi.org/10.1214/088342306000000682</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_016">
<mixed-citation publication-type="chapter"> <string-name><surname>Covington</surname> <given-names>P</given-names></string-name>, <string-name><surname>Adams</surname> <given-names>J</given-names></string-name>, <string-name><surname>Sargin</surname> <given-names>E</given-names></string-name> (<year>2016</year>). <chapter-title>Deep neural networks for YouTube recommendations</chapter-title>. In: <source><italic>Proceedings of the 10th ACM Conference on Recommender Systems</italic></source>, <fpage>191</fpage>–<lpage>198</lpage>. <publisher-name>ACM</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_017">
<mixed-citation publication-type="journal"> <string-name><surname>Ernst</surname> <given-names>D</given-names></string-name>, <string-name><surname>Geurts</surname> <given-names>P</given-names></string-name>, <string-name><surname>Wehenkel</surname> <given-names>L</given-names></string-name> (<year>2005</year>). <article-title>Tree-based batch mode reinforcement learning</article-title>. <source><italic>Journal of Machine Learning Research</italic></source>, <volume>6</volume>: <fpage>503</fpage>–<lpage>556</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_018">
<mixed-citation publication-type="journal"> <string-name><surname>Ertefaie</surname> <given-names>A</given-names></string-name>, <string-name><surname>Strawderman</surname> <given-names>RL</given-names></string-name> (<year>2018</year>). <article-title>Constructing dynamic treatment regimes over indefinite time horizons</article-title>. <source><italic>Biometrika</italic></source>, <volume>105</volume>(<issue>4</issue>): <fpage>963</fpage>–<lpage>977</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1093/biomet/asy043" xlink:type="simple">https://doi.org/10.1093/biomet/asy043</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_019">
<mixed-citation publication-type="chapter"> <string-name><surname>Finn</surname> <given-names>C</given-names></string-name>, <string-name><surname>Abbeel</surname> <given-names>P</given-names></string-name>, <string-name><surname>Levine</surname> <given-names>S</given-names></string-name> (<year>2017</year>). <chapter-title>Model-agnostic meta-learning for fast adaptation of deep networks</chapter-title>. In: <source><italic>Proceedings of the 34th International Conference on Machine Learning</italic></source>, volume <volume>70</volume>, <fpage>1126</fpage>–<lpage>1135</lpage>. <publisher-name>PMLR</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_020">
<mixed-citation publication-type="journal"> <string-name><surname>Greensmith</surname> <given-names>E</given-names></string-name>, <string-name><surname>Bartlett</surname> <given-names>PL</given-names></string-name>, <string-name><surname>Baxter</surname> <given-names>J</given-names></string-name> (<year>2004</year>). <article-title>Variance reduction techniques for gradient estimates in reinforcement learning</article-title>. <source><italic>Journal of Machine Learning Research</italic></source>, <volume>5</volume>: <fpage>1471</fpage>–<lpage>1530</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_021">
<mixed-citation publication-type="chapter"> <string-name><surname>Gupta</surname> <given-names>P</given-names></string-name>, <string-name><surname>Puri</surname> <given-names>N</given-names></string-name>, <string-name><surname>Verma</surname> <given-names>S</given-names></string-name>, <string-name><surname>Kayastha</surname> <given-names>D</given-names></string-name>, <string-name><surname>Deshmukh</surname> <given-names>S</given-names></string-name>, <string-name><surname>Krishnamurthy</surname> <given-names>B</given-names></string-name>, <etal>et al.</etal> (<year>2020</year>). <chapter-title>Explain your move: Understanding agent actions using specific and relevant feature attribution</chapter-title>. In: <source><italic>International Conference on Learning Representations</italic></source>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_022">
<mixed-citation publication-type="book"> <string-name><surname>Hernán</surname> <given-names>MA</given-names></string-name>, <string-name><surname>Robins</surname> <given-names>JM</given-names></string-name> (<year>2024</year>). <source><italic>Causal Inference: What If</italic></source>. <publisher-name>Chapman &amp; Hall/CRC. CRC Press</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_023">
<mixed-citation publication-type="chapter"> <string-name><surname>Johansson</surname> <given-names>FD</given-names></string-name>, <string-name><surname>Shalit</surname> <given-names>U</given-names></string-name>, <string-name><surname>Sontag</surname> <given-names>D</given-names></string-name> (<year>2016</year>). <chapter-title>Learning representations for counterfactual inference</chapter-title>. In: <source><italic>Proceedings of the 33rd International Conference on Machine Learning</italic></source>, volume <volume>48</volume>, <fpage>3020</fpage>–<lpage>3029</lpage>. <publisher-name>PMLR</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_024">
<mixed-citation publication-type="chapter"> <string-name><surname>Kendall</surname> <given-names>A</given-names></string-name>, <string-name><surname>Hawke</surname> <given-names>J</given-names></string-name>, <string-name><surname>Janz</surname> <given-names>D</given-names></string-name>, <string-name><surname>Mazur</surname> <given-names>P</given-names></string-name>, <string-name><surname>Reda</surname> <given-names>D</given-names></string-name>, <string-name><surname>Allen</surname> <given-names>JM</given-names></string-name>, <etal>et al.</etal> (<year>2019</year>). <chapter-title>Learning to drive in a day</chapter-title>. In: <source><italic>2019 International Conference on Robotics and Automation</italic></source>, <fpage>8248</fpage>–<lpage>8254</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_025">
<mixed-citation publication-type="journal"> <string-name><surname>Kober</surname> <given-names>J</given-names></string-name>, <string-name><surname>Bagnell</surname> <given-names>JA</given-names></string-name>, <string-name><surname>Peters</surname> <given-names>J</given-names></string-name> (<year>2013</year>). <article-title>Reinforcement learning in robotics: A survey</article-title>. <source><italic>The International Journal of Robotics Research</italic></source>, <volume>32</volume>(<issue>11</issue>): <fpage>1238</fpage>–<lpage>1274</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1177/0278364913495721" xlink:type="simple">https://doi.org/10.1177/0278364913495721</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_026">
<mixed-citation publication-type="journal"> <string-name><surname>Komorowski</surname> <given-names>M</given-names></string-name>, <string-name><surname>Celi</surname> <given-names>LA</given-names></string-name>, <string-name><surname>Badawi</surname> <given-names>O</given-names></string-name>, <string-name><surname>Gordon</surname> <given-names>AC</given-names></string-name>, <string-name><surname>Faisal</surname> <given-names>AA</given-names></string-name> (<year>2018</year>). <article-title>The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care</article-title>. <source><italic>Nature Medicine</italic></source>, <volume>24</volume>(<issue>11</issue>): <fpage>1716</fpage>–<lpage>1720</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1038/s41591-018-0213-5" xlink:type="simple">https://doi.org/10.1038/s41591-018-0213-5</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_027">
<mixed-citation publication-type="chapter"> <string-name><surname>Konda</surname> <given-names>VR</given-names></string-name>, <string-name><surname>Tsitsiklis</surname> <given-names>JN</given-names></string-name> (<year>2000</year>). <chapter-title>Actor–critic algorithms</chapter-title>. In: <source><italic>Advances in Neural Information Processing Systems</italic></source>, volume <volume>12</volume>, <fpage>1008</fpage>–<lpage>1014</lpage>. <publisher-name>MIT Press</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_028">
<mixed-citation publication-type="journal"> <string-name><surname>Laber</surname> <given-names>EB</given-names></string-name>, <string-name><surname>Lizotte</surname> <given-names>DJ</given-names></string-name>, <string-name><surname>Qian</surname> <given-names>M</given-names></string-name>, <string-name><surname>Pelham</surname> <given-names>WE</given-names></string-name>, <string-name><surname>Murphy</surname> <given-names>SA</given-names></string-name> (<year>2014</year>). <article-title>Dynamic treatment regimes: Technical challenges and applications</article-title>. <source><italic>Electronic Journal of Statistics</italic></source>, <volume>8</volume>(<issue>1</issue>): <fpage>1225</fpage>–<lpage>1272</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1214/14-EJS920" xlink:type="simple">https://doi.org/10.1214/14-EJS920</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_029">
<mixed-citation publication-type="chapter"> <string-name><surname>Li</surname> <given-names>L</given-names></string-name>, <string-name><surname>Walsh</surname> <given-names>TJ</given-names></string-name>, <string-name><surname>Littman</surname> <given-names>ML</given-names></string-name> (<year>2006</year>). <chapter-title>Towards a unified theory of state abstraction for MDPs</chapter-title>. In: <source><italic>Proceedings of the Ninth International Symposium on Artificial Intelligence and Mathematics</italic></source>, <fpage>531</fpage>–<lpage>539</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_030">
<mixed-citation publication-type="journal"> <string-name><surname>Li</surname> <given-names>M</given-names></string-name>, <string-name><surname>Shi</surname> <given-names>C</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Fryzlewicz</surname> <given-names>P</given-names></string-name> (<year>2025</year>). <article-title>Testing stationarity and change point detection in reinforcement learning</article-title>. <source><italic>The Annals of Statistics</italic></source>, <volume>53</volume>(<issue>3</issue>): <fpage>1230</fpage>–<lpage>1256</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1214/25-AOS2501" xlink:type="simple">https://doi.org/10.1214/25-AOS2501</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_031">
<mixed-citation publication-type="journal"> <string-name><surname>Luckett</surname> <given-names>DJ</given-names></string-name>, <string-name><surname>Laber</surname> <given-names>EB</given-names></string-name>, <string-name><surname>Kahkoska</surname> <given-names>AR</given-names></string-name>, <string-name><surname>Maahs</surname> <given-names>DM</given-names></string-name>, <string-name><surname>Mayer-Davis</surname> <given-names>E</given-names></string-name>, <string-name><surname>Kosorok</surname> <given-names>MR</given-names></string-name> (<year>2020</year>). <article-title>Estimating dynamic treatment regimes in mobile health using V-learning</article-title>. <source><italic>Journal of the American Statistical Association</italic></source>, <volume>115</volume>(<issue>530</issue>): <fpage>692</fpage>–<lpage>706</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1080/01621459.2018.1537919" xlink:type="simple">https://doi.org/10.1080/01621459.2018.1537919</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_032">
<mixed-citation publication-type="other"> <string-name><surname>Luo</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Pan</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Watkinson</surname> <given-names>P</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>T</given-names></string-name> (<year>2024</year>). Reinforcement learning in dynamic treatment regimes needs critical reexamination. arXiv preprint: <uri>https://arxiv.org/abs/2405.18556</uri>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_033">
<mixed-citation publication-type="journal"> <string-name><surname>Lyu</surname> <given-names>L</given-names></string-name>, <string-name><surname>Cheng</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wahed</surname> <given-names>AS</given-names></string-name> (<year>2023</year>). <article-title>Imputation-based Q-learning for optimizing dynamic treatment regimes with right-censored survival outcome</article-title>. <source><italic>Biometrics</italic></source>, <volume>79</volume>(<issue>4</issue>): <fpage>3676</fpage>–<lpage>3689</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1111/biom.13872" xlink:type="simple">https://doi.org/10.1111/biom.13872</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_034">
<mixed-citation publication-type="chapter"> <string-name><surname>Madumal</surname> <given-names>P</given-names></string-name>, <string-name><surname>Miller</surname> <given-names>T</given-names></string-name>, <string-name><surname>Sonenberg</surname> <given-names>L</given-names></string-name>, <string-name><surname>Vetere</surname> <given-names>F</given-names></string-name> (<year>2020</year>). <chapter-title>Explainable reinforcement learning through a causal lens</chapter-title>. In: <source><italic>Proceedings of the AAAI Conference on Artificial Intelligence</italic></source>, volume <volume>34</volume>, <fpage>2493</fpage>–<lpage>2500</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_035">
<mixed-citation publication-type="journal"> <string-name><surname>Miao</surname> <given-names>W</given-names></string-name>, <string-name><surname>Geng</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Tchetgen Tchetgen</surname> <given-names>EJ</given-names></string-name> (<year>2018</year>). <article-title>Identifying causal effects with proxy variables of an unmeasured confounder</article-title>. <source><italic>Biometrika</italic></source>, <volume>105</volume>(<issue>4</issue>): <fpage>987</fpage>–<lpage>993</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1093/biomet/asy038" xlink:type="simple">https://doi.org/10.1093/biomet/asy038</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_036">
<mixed-citation publication-type="journal"> <string-name><surname>Miotto</surname> <given-names>R</given-names></string-name>, <string-name><surname>Li</surname> <given-names>L</given-names></string-name>, <string-name><surname>Kidd</surname> <given-names>BA</given-names></string-name>, <string-name><surname>Dudley</surname> <given-names>JT</given-names></string-name> (<year>2016</year>). <article-title>Deep patient: An unsupervised representation to predict the future of patients from the electronic health records</article-title>. <source><italic>Scientific Reports</italic></source>, <volume>6</volume>: <fpage>26094</fpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1038/srep26094" xlink:type="simple">https://doi.org/10.1038/srep26094</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_037">
<mixed-citation publication-type="journal"> <string-name><surname>Mnih</surname> <given-names>V</given-names></string-name>, <string-name><surname>Kavukcuoglu</surname> <given-names>K</given-names></string-name>, <string-name><surname>Silver</surname> <given-names>D</given-names></string-name>, <string-name><surname>Rusu</surname> <given-names>AA</given-names></string-name>, <string-name><surname>Veness</surname> <given-names>J</given-names></string-name>, <string-name><surname>Bellemare</surname> <given-names>MG</given-names></string-name>, <etal>et al.</etal> (<year>2015</year>). <article-title>Human-level control through deep reinforcement learning</article-title>. <source><italic>Nature</italic></source>, <volume>518</volume>(<issue>7540</issue>): <fpage>529</fpage>–<lpage>533</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1038/nature14236" xlink:type="simple">https://doi.org/10.1038/nature14236</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_038">
<mixed-citation publication-type="journal"> <string-name><surname>Murphy</surname> <given-names>SA</given-names></string-name> (<year>2003</year>). <article-title>Optimal dynamic treatment regimes</article-title>. <source><italic>Journal of the Royal Statistical Society, Series B, Statistical Methodology</italic></source>, <volume>65</volume>(<issue>2</issue>): <fpage>331</fpage>–<lpage>355</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1111/1467-9868.00389" xlink:type="simple">https://doi.org/10.1111/1467-9868.00389</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_039">
<mixed-citation publication-type="journal"> <string-name><surname>Murphy</surname> <given-names>SA</given-names></string-name> (<year>2005</year>). <article-title>A generalization error for Q-learning</article-title>. <source><italic>Journal of Machine Learning Research</italic></source>, <volume>6</volume>(<issue>37</issue>): <fpage>1073</fpage>–<lpage>1097</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_040">
<mixed-citation publication-type="chapter"> <string-name><surname>Ng</surname> <given-names>AY</given-names></string-name>, <string-name><surname>Harada</surname> <given-names>D</given-names></string-name>, <string-name><surname>Russell</surname> <given-names>S</given-names></string-name> (<year>1999</year>). <chapter-title>Policy invariance under reward transformations: Theory and application to reward shaping</chapter-title>. In: <source><italic>Proceedings of the International Conference on Machine Learning</italic></source>, volume <volume>99</volume>, <fpage>278</fpage>–<lpage>287</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_041">
<mixed-citation publication-type="journal"> <string-name><surname>Padakandla</surname> <given-names>S</given-names></string-name>, <string-name><surname>KJ</surname> <given-names>P</given-names></string-name>, <string-name><surname>Bhatnagar</surname> <given-names>S</given-names></string-name> (<year>2020</year>). <article-title>Reinforcement learning algorithm for non-stationary environments</article-title>. <source><italic>Applied Intelligence</italic></source>, <volume>50</volume>(<issue>11</issue>): <fpage>3590</fpage>–<lpage>3606</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1007/s10489-020-01758-5" xlink:type="simple">https://doi.org/10.1007/s10489-020-01758-5</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_042">
<mixed-citation publication-type="book"> <string-name><surname>Puterman</surname> <given-names>ML</given-names></string-name> (<year>1994</year>). <source><italic>Markov Decision Processes: Discrete Stochastic Dynamic Programming</italic></source>. <series><italic>Wiley Series in Probability and Mathematical Statistics</italic></series>. <publisher-name>John Wiley &amp; Sons</publisher-name>, <publisher-loc>New York</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_043">
<mixed-citation publication-type="journal"> <string-name><surname>Robins</surname> <given-names>J</given-names></string-name> (<year>1986</year>). <article-title>A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect</article-title>. <source><italic>Mathematical Modelling</italic></source>, <volume>7</volume>(<issue>9–12</issue>): <fpage>1393</fpage>–<lpage>1512</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/0270-0255(86)90088-6" xlink:type="simple">https://doi.org/10.1016/0270-0255(86)90088-6</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_044">
<mixed-citation publication-type="journal"> <string-name><surname>Robins</surname> <given-names>J</given-names></string-name>, <string-name><surname>Hernán</surname> <given-names>M</given-names></string-name>, <string-name><surname>Brumback</surname> <given-names>B</given-names></string-name> (<year>2000</year>). <article-title>Marginal structural models and causal inference in epidemiology</article-title>. <source><italic>Epidemiology</italic></source>, <volume>11</volume>(<issue>5</issue>): <fpage>550</fpage>–<lpage>560</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1097/00001648-200009000-00011" xlink:type="simple">https://doi.org/10.1097/00001648-200009000-00011</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_045">
<mixed-citation publication-type="journal"> <string-name><surname>Schulte</surname> <given-names>PJ</given-names></string-name>, <string-name><surname>Tsiatis</surname> <given-names>AA</given-names></string-name>, <string-name><surname>Laber</surname> <given-names>EB</given-names></string-name>, <string-name><surname>Davidian</surname> <given-names>M</given-names></string-name> (<year>2015</year>). <article-title>Q-and A-learning methods for estimating optimal dynamic treatment regimes</article-title>. <source><italic>Statistical Science</italic></source>, <volume>29</volume>(<issue>4</issue>): <fpage>640</fpage>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_046">
<mixed-citation publication-type="other"> <string-name><surname>Shahn</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Dukes</surname> <given-names>O</given-names></string-name>, <string-name><surname>Shamsunder</surname> <given-names>M</given-names></string-name>, <string-name><surname>Richardson</surname> <given-names>D</given-names></string-name>, <string-name><surname>Tchetgen Tchetgen</surname> <given-names>ET</given-names></string-name>, <string-name><surname>Robins</surname> <given-names>J</given-names></string-name> (<year>2025</year>). Structural nested mean models under parallel trends assumptions. arXiv preprint: <uri>https://arxiv.org/abs/2204.10291v8</uri>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_047">
<mixed-citation publication-type="chapter"> <string-name><surname>Sherman</surname> <given-names>E</given-names></string-name>, <string-name><surname>Arbour</surname> <given-names>D</given-names></string-name>, <string-name><surname>Shpitser</surname> <given-names>I</given-names></string-name> (<year>2020</year>). <chapter-title>General identification of dynamic treatment regimes under interference</chapter-title>. In: <source><italic>Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics</italic></source>, volume <volume>108</volume>, <fpage>3917</fpage>–<lpage>3927</lpage>. <publisher-name>PMLR</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_048">
<mixed-citation publication-type="chapter"> <string-name><surname>Shi</surname> <given-names>C</given-names></string-name>, <string-name><surname>Wan</surname> <given-names>R</given-names></string-name>, <string-name><surname>Song</surname> <given-names>R</given-names></string-name>, <string-name><surname>Lu</surname> <given-names>W</given-names></string-name>, <string-name><surname>Leng</surname> <given-names>L</given-names></string-name> (<year>2020</year>). <chapter-title>Does the Markov decision process fit the data: Testing for the Markov property in sequential decision making</chapter-title>. In: <source><italic>International Conference on Machine Learning</italic></source>, <fpage>8807</fpage>–<lpage>8817</lpage>. <publisher-name>PMLR</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_049">
<mixed-citation publication-type="chapter"> <string-name><surname>Silva</surname> <given-names>A</given-names></string-name>, <string-name><surname>Gombolay</surname> <given-names>M</given-names></string-name>, <string-name><surname>Killian</surname> <given-names>T</given-names></string-name>, <string-name><surname>Jimenez</surname> <given-names>I</given-names></string-name>, <string-name><surname>Son</surname> <given-names>SH</given-names></string-name> (<year>2020</year>). <chapter-title>Optimization methods for interpretable differentiable decision trees applied to reinforcement learning</chapter-title>. In: <source><italic>International Conference on Artificial Intelligence and Statistics</italic></source>, <fpage>1855</fpage>–<lpage>1865</lpage>. <publisher-name>PMLR</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_050">
<mixed-citation publication-type="journal"> <string-name><surname>Silver</surname> <given-names>D</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>A</given-names></string-name>, <string-name><surname>Maddison</surname> <given-names>CJ</given-names></string-name>, <string-name><surname>Guez</surname> <given-names>A</given-names></string-name>, <string-name><surname>Sifre</surname> <given-names>L</given-names></string-name>, <string-name><surname>van den Driessche</surname> <given-names>G</given-names></string-name>, <etal>et al.</etal> (<year>2016</year>). <article-title>Mastering the game of go with deep neural networks and tree search</article-title>. <source><italic>Nature</italic></source>, <volume>529</volume>(<issue>7587</issue>): <fpage>484</fpage>–<lpage>489</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1038/nature16961" xlink:type="simple">https://doi.org/10.1038/nature16961</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_051">
<mixed-citation publication-type="journal"> <string-name><surname>Spicker</surname> <given-names>D</given-names></string-name>, <string-name><surname>Wallace</surname> <given-names>MP</given-names></string-name> (<year>2020</year>). <article-title>Measurement error and precision medicine: Error-prone tailoring covariates in dynamic treatment regimes</article-title>. <source><italic>Statistics in Medicine</italic></source>, <volume>39</volume>(<issue>26</issue>): <fpage>3732</fpage>–<lpage>3755</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1002/sim.8690" xlink:type="simple">https://doi.org/10.1002/sim.8690</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_052">
<mixed-citation publication-type="book"> <string-name><surname>Sutton</surname> <given-names>RS</given-names></string-name>, <string-name><surname>Barto</surname> <given-names>AG</given-names></string-name> (<year>2018</year>). <source><italic>Reinforcement Learning: An Introduction</italic></source>. <publisher-name>MIT press</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_053">
<mixed-citation publication-type="chapter"> <string-name><surname>Sutton</surname> <given-names>RS</given-names></string-name>, <string-name><surname>McAllester</surname> <given-names>DA</given-names></string-name>, <string-name><surname>Singh</surname> <given-names>SP</given-names></string-name>, <string-name><surname>Mansour</surname> <given-names>Y</given-names></string-name> (<year>1999</year>). <chapter-title>Policy gradient methods for reinforcement learning with function approximation</chapter-title>. In: <source><italic>Advances in Neural Information Processing Systems</italic></source>, volume <volume>12</volume>, <fpage>1057</fpage>–<lpage>1063</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_054">
<mixed-citation publication-type="chapter"> <string-name><surname>Tennenholtz</surname> <given-names>G</given-names></string-name>, <string-name><surname>Shalit</surname> <given-names>U</given-names></string-name>, <string-name><surname>Mannor</surname> <given-names>S</given-names></string-name> (<year>2020</year>). <chapter-title>Off-policy evaluation in partially observable environments</chapter-title>. In: <source><italic>Proceedings of the AAAI Conference on Artificial Intelligence</italic></source>, volume <volume>34</volume>, <fpage>10276</fpage>–<lpage>10283</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_055">
<mixed-citation publication-type="journal"> <string-name><surname>Tibshirani</surname> <given-names>R</given-names></string-name> (<year>1996</year>). <article-title>Regression shrinkage and selection via the lasso</article-title>. <source><italic>Journal of the Royal Statistical Society, Series B, Statistical Methodology</italic></source>, <volume>58</volume>(<issue>1</issue>): <fpage>267</fpage>–<lpage>288</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1111/j.2517-6161.1996.tb02080.x" xlink:type="simple">https://doi.org/10.1111/j.2517-6161.1996.tb02080.x</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_056">
<mixed-citation publication-type="journal"> <string-name><surname>Tsitsiklis</surname> <given-names>J</given-names></string-name>, <string-name><surname>Van Roy</surname> <given-names>B</given-names></string-name> (<year>1997</year>). <article-title>An analysis of temporal-difference learning with function approximation</article-title>. <source><italic>IEEE Transactions on Automatic Control</italic></source>, <volume>42</volume>(<issue>5</issue>): <fpage>674</fpage>–<lpage>690</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1109/9.580874" xlink:type="simple">https://doi.org/10.1109/9.580874</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_057">
<mixed-citation publication-type="other"> <string-name><surname>Uehara</surname> <given-names>M</given-names></string-name>, <string-name><surname>Shi</surname> <given-names>C</given-names></string-name>, <string-name><surname>Kallus</surname> <given-names>N</given-names></string-name> (<year>2022</year>). A review of off-policy evaluation in reinforcement learning. arXiv preprint arXiv:<ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2212.06355">2212.06355</ext-link>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_058">
<mixed-citation publication-type="chapter"> <string-name><surname>van Hasselt</surname> <given-names>H</given-names></string-name> (<year>2010</year>). <chapter-title>Double Q-learning</chapter-title>. In: <source><italic>Advances in Neural Information Processing Systems</italic></source>, volume <volume>23</volume>, <fpage>2613</fpage>–<lpage>2621</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_059">
<mixed-citation publication-type="other"> <string-name><surname>van Hasselt</surname> <given-names>H</given-names></string-name>, <string-name><surname>Doron</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Strub</surname> <given-names>F</given-names></string-name>, <string-name><surname>Hessel</surname> <given-names>M</given-names></string-name>, <string-name><surname>Sonnerat</surname> <given-names>N</given-names></string-name>, <string-name><surname>Modayil</surname> <given-names>J</given-names></string-name> (<year>2018</year>). Deep reinforcement learning and the deadly triad. arXiv preprint arXiv:<ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1812.02648">1812.02648</ext-link>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_060">
<mixed-citation publication-type="chapter"> <string-name><surname>van Hasselt</surname> <given-names>H</given-names></string-name>, <string-name><surname>Guez</surname> <given-names>A</given-names></string-name>, <string-name><surname>Silver</surname> <given-names>D</given-names></string-name> (<year>2016</year>). <chapter-title>Deep reinforcement learning with double Q-learning</chapter-title>. In: <source><italic>Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence</italic></source>, volume <volume>30</volume>, <fpage>2094</fpage>–<lpage>2100</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_061">
<mixed-citation publication-type="chapter"> <string-name><surname>Wang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Schaul</surname> <given-names>T</given-names></string-name>, <string-name><surname>Hessel</surname> <given-names>M</given-names></string-name>, <string-name><surname>van Hasselt</surname> <given-names>H</given-names></string-name>, <string-name><surname>Lanctot</surname> <given-names>M</given-names></string-name>, <string-name><surname>Freitas</surname> <given-names>N</given-names></string-name> (<year>2016</year>). <chapter-title>Dueling network architectures for deep reinforcement learning</chapter-title>. In: <source><italic>International Conference on Machine Learning</italic></source>, volume <volume>48</volume>, <fpage>1995</fpage>–<lpage>2003</lpage>. <publisher-name>PMLR</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_062">
<mixed-citation publication-type="journal"> <string-name><surname>Watkins</surname> <given-names>CJ</given-names></string-name>, <string-name><surname>Dayan</surname> <given-names>P</given-names></string-name> (<year>1992</year>). <article-title>Q-learning</article-title>. <source><italic>Machine Learning</italic></source>, <volume>8</volume>: <fpage>279</fpage>–<lpage>292</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_063">
<mixed-citation publication-type="journal"> <string-name><surname>Williams</surname> <given-names>RJ</given-names></string-name> (<year>1992</year>). <article-title>Simple statistical gradient-following algorithms for connectionist reinforcement learning</article-title>. <source><italic>Machine Learning</italic></source>, <volume>8</volume>: <fpage>229</fpage>–<lpage>256</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1023/A:1022672621406" xlink:type="simple">https://doi.org/10.1023/A:1022672621406</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_064">
<mixed-citation publication-type="chapter"> <string-name><surname>Xu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Shi</surname> <given-names>C</given-names></string-name>, <string-name><surname>Luo</surname> <given-names>S</given-names></string-name>, <string-name><surname>Song</surname> <given-names>R</given-names></string-name> (<year>2023</year>). <chapter-title>An instrumental variable approach to confounded off-policy evaluation</chapter-title>. In: <source><italic>International Conference on Machine Learning</italic></source>, volume <volume>202</volume>, <fpage>38848</fpage>–<lpage>38880</lpage>. <publisher-name>PMLR</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_065">
<mixed-citation publication-type="journal"> <string-name><surname>Yu</surname> <given-names>C</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Nemati</surname> <given-names>S</given-names></string-name>, <string-name><surname>Yin</surname> <given-names>G</given-names></string-name> (<year>2021</year>). <article-title>Reinforcement learning in healthcare: A survey</article-title>. <source><italic>ACM Computing Surveys</italic></source>, <volume>55</volume>(<issue>1</issue>): <fpage>1</fpage>–<lpage>36</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_066">
<mixed-citation publication-type="journal"> <string-name><surname>Zeng</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Cai</surname> <given-names>R</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>F</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>L</given-names></string-name>, <string-name><surname>Hao</surname> <given-names>Z</given-names></string-name> (<year>2025</year>). <article-title>A survey on causal reinforcement learning</article-title>. <source><italic>IEEE Transactions on Neural Networks and Learning Systems</italic></source>, <volume>36</volume>(<issue>4</issue>): <fpage>5942</fpage>–<lpage>5962</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1109/TNNLS.2024.3403001" xlink:type="simple">https://doi.org/10.1109/TNNLS.2024.3403001</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_067">
<mixed-citation publication-type="journal"> <string-name><surname>Zhang</surname> <given-names>B</given-names></string-name>, <string-name><surname>Tsiatis</surname> <given-names>AA</given-names></string-name>, <string-name><surname>Laber</surname> <given-names>EB</given-names></string-name>, <string-name><surname>Davidian</surname> <given-names>M</given-names></string-name> (<year>2013</year>). <article-title>Robust estimation of optimal dynamic treatment regimes for sequential treatment decisions</article-title>. <source><italic>Biometrika</italic></source>, <volume>100</volume>(<issue>3</issue>): <fpage>681</fpage>–<lpage>694</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1093/biomet/ast014" xlink:type="simple">https://doi.org/10.1093/biomet/ast014</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_068">
<mixed-citation publication-type="journal"> <string-name><surname>Zhang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Laber</surname> <given-names>EB</given-names></string-name>, <string-name><surname>Davidian</surname> <given-names>M</given-names></string-name>, <string-name><surname>Tsiatis</surname> <given-names>AA</given-names></string-name> (<year>2018</year>). <article-title>Interpretable dynamic treatment regimes</article-title>. <source><italic>Journal of the American Statistical Association</italic></source>, <volume>113</volume>(<issue>524</issue>): <fpage>1541</fpage>–<lpage>1549</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1080/01621459.2017.1345743" xlink:type="simple">https://doi.org/10.1080/01621459.2017.1345743</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_069">
<mixed-citation publication-type="journal"> <string-name><surname>Zhao</surname> <given-names>YQ</given-names></string-name>, <string-name><surname>Zeng</surname> <given-names>D</given-names></string-name>, <string-name><surname>Laber</surname> <given-names>EB</given-names></string-name>, <string-name><surname>Kosorok</surname> <given-names>MR</given-names></string-name> (<year>2015</year>). <article-title>New statistical learning methods for estimating optimal dynamic treatment regimes</article-title>. <source><italic>Journal of the American Statistical Association</italic></source>, <volume>110</volume>(<issue>510</issue>): <fpage>583</fpage>–<lpage>598</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1080/01621459.2014.937488" xlink:type="simple">https://doi.org/10.1080/01621459.2014.937488</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_070">
<mixed-citation publication-type="journal"> <string-name><surname>Zhou</surname> <given-names>W</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>R</given-names></string-name>, <string-name><surname>Qu</surname> <given-names>A</given-names></string-name> (<year>2024</year>). <article-title>Estimating optimal infinite horizon dynamic treatment regimes via pT-learning</article-title>. <source><italic>Journal of the American Statistical Association</italic></source>, <volume>119</volume>(<issue>545</issue>): <fpage>625</fpage>–<lpage>638</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1080/01621459.2022.2138760" xlink:type="simple">https://doi.org/10.1080/01621459.2022.2138760</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_071">
<mixed-citation publication-type="journal"> <string-name><surname>Zhu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>K</given-names></string-name>, <string-name><surname>Jain</surname> <given-names>AK</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>J</given-names></string-name> (<year>2023</year>). <article-title>Transfer learning in deep reinforcement learning: A survey</article-title>. <source><italic>IEEE Transactions on Pattern Analysis and Machine Intelligence</italic></source>, <volume>45</volume>(<issue>11</issue>): <fpage>13344</fpage>–<lpage>13362</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1109/TPAMI.2023.3292075" xlink:type="simple">https://doi.org/10.1109/TPAMI.2023.3292075</ext-link></mixed-citation>
</ref>
</ref-list>
</back>
</article>
