<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">JDS</journal-id>
<journal-title-group><journal-title>Journal of Data Science</journal-title></journal-title-group>
<issn pub-type="epub">1683-8602</issn><issn pub-type="ppub">1680-743X</issn><issn-l>1680-743X</issn-l>
<publisher>
<publisher-name>School of Statistics, Renmin University of China</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">JDS1205</article-id>
<article-id pub-id-type="doi">10.6339/25-JDS1205</article-id>
<article-categories><subj-group subj-group-type="heading">
<subject>Data Science Reviews</subject></subj-group></article-categories>
<title-group>
<article-title>Reinforcement Learning: A Statistical Perspective</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Zhou</surname><given-names>Ying</given-names></name><email xlink:href="mailto:yzhou@uconn.edu">yzhou@uconn.edu</email><xref ref-type="aff" rid="j_jds1205_aff_001">1</xref><xref ref-type="fn" rid="cor1">∗</xref>
</contrib>
<aff id="j_jds1205_aff_001"><label>1</label>Department of Statistics, <institution>University of Connecticut</institution>, Storrs, CT 06269, <country>U.S.A.</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>∗</label>Email: <ext-link ext-link-type="uri" xlink:href="mailto:yzhou@uconn.edu">yzhou@uconn.edu</ext-link>.</corresp>
</author-notes>
<pub-date pub-type="ppub"><year>2026</year></pub-date><pub-date pub-type="epub"><day>10</day><month>12</month><year>2025</year></pub-date><volume>24</volume><issue>1</issue><fpage>86</fpage><lpage>105</lpage><history><date date-type="received"><day>1</day><month>1</month><year>2025</year></date><date date-type="accepted"><day>27</day><month>10</month><year>2025</year></date></history>
<permissions><copyright-statement>2026 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.</copyright-statement><copyright-year>2026</copyright-year>
<license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>Open access article under the <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">CC BY</ext-link> license.</license-p></license></permissions>
<abstract>
<p>Reinforcement Learning (RL) is a powerful framework for sequential decision-making, enabling agents to optimize their actions through interaction with an environment. While RL has been studied most extensively in computer science, statisticians have advanced the field by addressing challenges such as uncertainty quantification, sample efficiency, and interpretability. These contributions are particularly impactful in healthcare, where RL complements Dynamic Treatment Regimes (DTRs) and advances personalized medicine by tailoring treatments to individual patients based on their evolving characteristics. This paper serves as both a tutorial for statisticians new to RL and a review of its integration with statistical methodologies. It introduces foundational RL concepts, classical algorithms, and Q-learning variants, and highlights how statistical perspectives, especially causal inference, address challenges in DTRs. By bridging the two fields, the paper identifies opportunities to enhance decision-making in high-stakes domains such as healthcare.</p>
</abstract>
<kwd-group>
<label>Keywords</label>
<kwd>causal inference</kwd>
<kwd>dynamic treatment regimes</kwd>
<kwd>sequential decision-making</kwd>
</kwd-group>
</article-meta>
</front>
<back>
<ref-list id="j_jds1205_reflist_001">
<title>References</title>
<ref id="j_jds1205_ref_001">
<mixed-citation publication-type="other"> <string-name><surname>Agarwal</surname> <given-names>A</given-names></string-name>, <string-name><surname>Han</surname> <given-names>S</given-names></string-name>, <string-name><surname>Saha</surname> <given-names>D</given-names></string-name>, <string-name><surname>Syrgkanis</surname> <given-names>V</given-names></string-name>, <string-name><surname>Yoon</surname> <given-names>H</given-names></string-name> (<year>2025</year>). Synthetic blips: Generalizing synthetic controls for dynamic treatment effects. arXiv preprint: <uri>https://arxiv.org/abs/2210.11003v2</uri>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_002">
<mixed-citation publication-type="chapter"> <string-name><surname>Allen</surname> <given-names>C</given-names></string-name>, <string-name><surname>Parikh</surname> <given-names>N</given-names></string-name>, <string-name><surname>Gottesman</surname> <given-names>O</given-names></string-name>, <string-name><surname>Konidaris</surname> <given-names>G</given-names></string-name> (<year>2021</year>). <chapter-title>Learning Markov state abstractions for deep reinforcement learning</chapter-title>. In: <source><italic>Advances in Neural Information Processing Systems</italic></source>, volume <volume>34</volume>, <fpage>8229</fpage>–<lpage>8241</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_003">
<mixed-citation publication-type="journal"> <string-name><surname>Angrist</surname> <given-names>JD</given-names></string-name>, <string-name><surname>Imbens</surname> <given-names>GW</given-names></string-name>, <string-name><surname>Rubin</surname> <given-names>DB</given-names></string-name> (<year>1996</year>). <article-title>Identification of causal effects using instrumental variables</article-title>. <source><italic>Journal of the American Statistical Association</italic></source>, <volume>91</volume>(<issue>434</issue>): <fpage>444</fpage>–<lpage>455</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1080/01621459.1996.10476902" xlink:type="simple">https://doi.org/10.1080/01621459.1996.10476902</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_004">
<mixed-citation publication-type="journal"> <string-name><surname>Arulkumaran</surname> <given-names>K</given-names></string-name>, <string-name><surname>Deisenroth</surname> <given-names>MP</given-names></string-name>, <string-name><surname>Brundage</surname> <given-names>M</given-names></string-name>, <string-name><surname>Bharath</surname> <given-names>AA</given-names></string-name> (<year>2017</year>). <article-title>Deep reinforcement learning: A brief survey</article-title>. <source><italic>IEEE Signal Processing Magazine</italic></source>, <volume>34</volume>(<issue>6</issue>): <fpage>26</fpage>–<lpage>38</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1109/MSP.2017.2743240" xlink:type="simple">https://doi.org/10.1109/MSP.2017.2743240</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_005">
<mixed-citation publication-type="journal"> <string-name><surname>Barto</surname> <given-names>AG</given-names></string-name>, <string-name><surname>Sutton</surname> <given-names>RS</given-names></string-name>, <string-name><surname>Anderson</surname> <given-names>CW</given-names></string-name> (<year>1983</year>). <article-title>Neuronlike adaptive elements that can solve difficult learning control problems</article-title>. <source><italic>IEEE Transactions on Systems, Man and Cybernetics</italic></source>, <volume>SMC–13</volume>(<issue>5</issue>): <fpage>834</fpage>–<lpage>846</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1109/TSMC.1983.6313077" xlink:type="simple">https://doi.org/10.1109/TSMC.1983.6313077</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_006">
<mixed-citation publication-type="book"> <string-name><surname>Bellman</surname> <given-names>RE</given-names></string-name> (<year>1957</year>). <source><italic>Dynamic Programming</italic></source>. <publisher-name>Princeton University Press</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_007">
<mixed-citation publication-type="journal"> <string-name><surname>Bengio</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Courville</surname> <given-names>A</given-names></string-name>, <string-name><surname>Vincent</surname> <given-names>P</given-names></string-name> (<year>2013</year>). <article-title>Representation learning: A review and new perspectives</article-title>. <source><italic>IEEE Transactions on Pattern Analysis and Machine Intelligence</italic></source>, <volume>35</volume>(<issue>8</issue>): <fpage>1798</fpage>–<lpage>1828</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1109/TPAMI.2013.50" xlink:type="simple">https://doi.org/10.1109/TPAMI.2013.50</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_008">
<mixed-citation publication-type="journal"> <string-name><surname>Bennett</surname> <given-names>A</given-names></string-name>, <string-name><surname>Kallus</surname> <given-names>N</given-names></string-name> (<year>2024</year>). <article-title>Proximal reinforcement learning: Efficient off-policy evaluation in partially observed Markov decision processes</article-title>. <source><italic>Operations Research</italic></source>, <volume>72</volume>(<issue>3</issue>): <fpage>1071</fpage>–<lpage>1086</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1287/opre.2021.0781" xlink:type="simple">https://doi.org/10.1287/opre.2021.0781</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_009">
<mixed-citation publication-type="book"> <string-name><surname>Bertsekas</surname> <given-names>DP</given-names></string-name> (<year>2017</year>). <source><italic>Dynamic Programming and Optimal Control</italic></source>. <publisher-name>Athena Scientific</publisher-name>, <publisher-loc>Belmont, MA</publisher-loc>, <edition>4</edition>th edition.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_010">
<mixed-citation publication-type="book"> <string-name><surname>Breiman</surname> <given-names>L</given-names></string-name>, <string-name><surname>Friedman</surname> <given-names>JH</given-names></string-name>, <string-name><surname>Olshen</surname> <given-names>RA</given-names></string-name>, <string-name><surname>Stone</surname> <given-names>CJ</given-names></string-name> (<year>1984</year>). <source><italic>Classification and Regression Trees</italic></source>. <publisher-name>Chapman and Hall/CRC</publisher-name>, <publisher-loc>New York</publisher-loc>, <edition>1</edition>st edition.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_011">
<mixed-citation publication-type="chapter"> <string-name><surname>Cai</surname> <given-names>H</given-names></string-name>, <string-name><surname>Ren</surname> <given-names>K</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Malialis</surname> <given-names>K</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>Y</given-names></string-name>, <etal>et al.</etal> (<year>2017</year>). <chapter-title>Real-time bidding by reinforcement learning in display advertising</chapter-title>. In: <source><italic>Proceedings of the Tenth ACM International Conference on Web Search and Data Mining</italic></source>, <fpage>661</fpage>–<lpage>670</lpage>. <publisher-name>ACM</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_012">
<mixed-citation publication-type="book"> <string-name><surname>Chakraborty</surname> <given-names>B</given-names></string-name>, <string-name><surname>Moodie</surname> <given-names>EE</given-names></string-name> (<year>2013</year>). <source><italic>Statistical Methods for Dynamic Treatment Regimes</italic></source>. <publisher-name>Springer</publisher-name>, <publisher-loc>New York, NY</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_013">
<mixed-citation publication-type="journal"> <string-name><surname>Chen</surname> <given-names>S</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>B</given-names></string-name> (<year>2023</year>). <article-title>Estimating and improving dynamic treatment regimes with a time-varying instrumental variable</article-title>. <source><italic>Journal of the Royal Statistical Society, Series B, Statistical Methodology</italic></source>, <volume>85</volume>(<issue>2</issue>): <fpage>427</fpage>–<lpage>453</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1093/jrsssb/qkad011" xlink:type="simple">https://doi.org/10.1093/jrsssb/qkad011</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_014">
<mixed-citation publication-type="chapter"> <string-name><surname>Choi</surname> <given-names>E</given-names></string-name>, <string-name><surname>Bahadori</surname> <given-names>MT</given-names></string-name>, <string-name><surname>Kulas</surname> <given-names>JA</given-names></string-name>, <string-name><surname>Schuetz</surname> <given-names>A</given-names></string-name>, <string-name><surname>Stewart</surname> <given-names>WF</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>J</given-names></string-name> (<year>2016</year>). <chapter-title>RETAIN: An interpretable predictive model for healthcare using reverse time attention mechanism</chapter-title>. In: <source><italic>Proceedings of the 30th International Conference on Neural Information Processing Systems</italic></source>, <fpage>3512</fpage>–<lpage>3520</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_015">
<mixed-citation publication-type="journal"> <string-name><surname>Cook</surname> <given-names>RD</given-names></string-name> (<year>2007</year>). <article-title>Fisher lecture: Dimension reduction in regression</article-title>. <source><italic>Statistical Science</italic></source>, <volume>22</volume>(<issue>1</issue>): <fpage>1</fpage>–<lpage>26</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1214/088342306000000682" xlink:type="simple">https://doi.org/10.1214/088342306000000682</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_016">
<mixed-citation publication-type="chapter"> <string-name><surname>Covington</surname> <given-names>P</given-names></string-name>, <string-name><surname>Adams</surname> <given-names>J</given-names></string-name>, <string-name><surname>Sargin</surname> <given-names>E</given-names></string-name> (<year>2016</year>). <chapter-title>Deep neural networks for YouTube recommendations</chapter-title>. In: <source><italic>Proceedings of the 10th ACM Conference on Recommender Systems</italic></source>, <fpage>191</fpage>–<lpage>198</lpage>. <publisher-name>ACM</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_017">
<mixed-citation publication-type="journal"> <string-name><surname>Ernst</surname> <given-names>D</given-names></string-name>, <string-name><surname>Geurts</surname> <given-names>P</given-names></string-name>, <string-name><surname>Wehenkel</surname> <given-names>L</given-names></string-name> (<year>2005</year>). <article-title>Tree-based batch mode reinforcement learning</article-title>. <source><italic>Journal of Machine Learning Research</italic></source>, <volume>6</volume>: <fpage>503</fpage>–<lpage>556</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_018">
<mixed-citation publication-type="journal"> <string-name><surname>Ertefaie</surname> <given-names>A</given-names></string-name>, <string-name><surname>Strawderman</surname> <given-names>RL</given-names></string-name> (<year>2018</year>). <article-title>Constructing dynamic treatment regimes over indefinite time horizons</article-title>. <source><italic>Biometrika</italic></source>, <volume>105</volume>(<issue>4</issue>): <fpage>963</fpage>–<lpage>977</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1093/biomet/asy043" xlink:type="simple">https://doi.org/10.1093/biomet/asy043</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_019">
<mixed-citation publication-type="chapter"> <string-name><surname>Finn</surname> <given-names>C</given-names></string-name>, <string-name><surname>Abbeel</surname> <given-names>P</given-names></string-name>, <string-name><surname>Levine</surname> <given-names>S</given-names></string-name> (<year>2017</year>). <chapter-title>Model-agnostic meta-learning for fast adaptation of deep networks</chapter-title>. In: <source><italic>Proceedings of the 34th International Conference on Machine Learning</italic></source>, volume <volume>70</volume>, <fpage>1126</fpage>–<lpage>1135</lpage>. <publisher-name>PMLR</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_020">
<mixed-citation publication-type="journal"> <string-name><surname>Greensmith</surname> <given-names>E</given-names></string-name>, <string-name><surname>Bartlett</surname> <given-names>PL</given-names></string-name>, <string-name><surname>Baxter</surname> <given-names>J</given-names></string-name> (<year>2004</year>). <article-title>Variance reduction techniques for gradient estimates in reinforcement learning</article-title>. <source><italic>Journal of Machine Learning Research</italic></source>, <volume>5</volume>: <fpage>1471</fpage>–<lpage>1530</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_021">
<mixed-citation publication-type="chapter"> <string-name><surname>Gupta</surname> <given-names>P</given-names></string-name>, <string-name><surname>Puri</surname> <given-names>N</given-names></string-name>, <string-name><surname>Verma</surname> <given-names>S</given-names></string-name>, <string-name><surname>Kayastha</surname> <given-names>D</given-names></string-name>, <string-name><surname>Deshmukh</surname> <given-names>S</given-names></string-name>, <string-name><surname>Krishnamurthy</surname> <given-names>B</given-names></string-name>, <etal>et al.</etal> (<year>2020</year>). <chapter-title>Explain your move: Understanding agent actions using specific and relevant feature attribution</chapter-title>. In: <source><italic>International Conference on Learning Representations</italic></source>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_022">
<mixed-citation publication-type="book"> <string-name><surname>Hernán</surname> <given-names>MA</given-names></string-name>, <string-name><surname>Robins</surname> <given-names>JM</given-names></string-name> (<year>2024</year>). <source><italic>Causal Inference: What If</italic></source>. <publisher-name>Chapman &amp; Hall/CRC. CRC Press</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_023">
<mixed-citation publication-type="chapter"> <string-name><surname>Johansson</surname> <given-names>FD</given-names></string-name>, <string-name><surname>Shalit</surname> <given-names>U</given-names></string-name>, <string-name><surname>Sontag</surname> <given-names>D</given-names></string-name> (<year>2016</year>). <chapter-title>Learning representations for counterfactual inference</chapter-title>. In: <source><italic>Proceedings of the 33rd International Conference on Machine Learning</italic></source>, volume <volume>48</volume>, <fpage>3020</fpage>–<lpage>3029</lpage>. <publisher-name>PMLR</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_024">
<mixed-citation publication-type="chapter"> <string-name><surname>Kendall</surname> <given-names>A</given-names></string-name>, <string-name><surname>Hawke</surname> <given-names>J</given-names></string-name>, <string-name><surname>Janz</surname> <given-names>D</given-names></string-name>, <string-name><surname>Mazur</surname> <given-names>P</given-names></string-name>, <string-name><surname>Reda</surname> <given-names>D</given-names></string-name>, <string-name><surname>Allen</surname> <given-names>JM</given-names></string-name>, <etal>et al.</etal> (<year>2019</year>). <chapter-title>Learning to drive in a day</chapter-title>. In: <source><italic>2019 International Conference on Robotics and Automation</italic></source>, <fpage>8248</fpage>–<lpage>8254</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_025">
<mixed-citation publication-type="journal"> <string-name><surname>Kober</surname> <given-names>J</given-names></string-name>, <string-name><surname>Bagnell</surname> <given-names>JA</given-names></string-name>, <string-name><surname>Peters</surname> <given-names>J</given-names></string-name> (<year>2013</year>). <article-title>Reinforcement learning in robotics: A survey</article-title>. <source><italic>The International Journal of Robotics Research</italic></source>, <volume>32</volume>(<issue>11</issue>): <fpage>1238</fpage>–<lpage>1274</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1177/0278364913495721" xlink:type="simple">https://doi.org/10.1177/0278364913495721</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_026">
<mixed-citation publication-type="journal"> <string-name><surname>Komorowski</surname> <given-names>M</given-names></string-name>, <string-name><surname>Celi</surname> <given-names>LA</given-names></string-name>, <string-name><surname>Badawi</surname> <given-names>O</given-names></string-name>, <string-name><surname>Gordon</surname> <given-names>AC</given-names></string-name>, <string-name><surname>Faisal</surname> <given-names>AA</given-names></string-name> (<year>2018</year>). <article-title>The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care</article-title>. <source><italic>Nature Medicine</italic></source>, <volume>24</volume>(<issue>11</issue>): <fpage>1716</fpage>–<lpage>1720</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1038/s41591-018-0213-5" xlink:type="simple">https://doi.org/10.1038/s41591-018-0213-5</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_027">
<mixed-citation publication-type="chapter"> <string-name><surname>Konda</surname> <given-names>VR</given-names></string-name>, <string-name><surname>Tsitsiklis</surname> <given-names>JN</given-names></string-name> (<year>2000</year>). <chapter-title>Actor–critic algorithms</chapter-title>. In: <source><italic>Advances in Neural Information Processing Systems</italic></source>, volume <volume>12</volume>, <fpage>1008</fpage>–<lpage>1014</lpage>. <publisher-name>MIT Press</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_028">
<mixed-citation publication-type="journal"> <string-name><surname>Laber</surname> <given-names>EB</given-names></string-name>, <string-name><surname>Lizotte</surname> <given-names>DJ</given-names></string-name>, <string-name><surname>Qian</surname> <given-names>M</given-names></string-name>, <string-name><surname>Pelham</surname> <given-names>WE</given-names></string-name>, <string-name><surname>Murphy</surname> <given-names>SA</given-names></string-name> (<year>2014</year>). <article-title>Dynamic treatment regimes: Technical challenges and applications</article-title>. <source><italic>Electronic Journal of Statistics</italic></source>, <volume>8</volume>(<issue>1</issue>): <fpage>1225</fpage>–<lpage>1272</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1214/14-EJS920" xlink:type="simple">https://doi.org/10.1214/14-EJS920</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_029">
<mixed-citation publication-type="chapter"> <string-name><surname>Li</surname> <given-names>L</given-names></string-name>, <string-name><surname>Walsh</surname> <given-names>TJ</given-names></string-name>, <string-name><surname>Littman</surname> <given-names>ML</given-names></string-name> (<year>2006</year>). <chapter-title>Towards a unified theory of state abstraction for MDPs</chapter-title>. In: <source><italic>Proceedings of the Ninth International Symposium on Artificial Intelligence and Mathematics</italic></source>, <fpage>531</fpage>–<lpage>539</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_030">
<mixed-citation publication-type="journal"> <string-name><surname>Li</surname> <given-names>M</given-names></string-name>, <string-name><surname>Shi</surname> <given-names>C</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Fryzlewicz</surname> <given-names>P</given-names></string-name> (<year>2025</year>). <article-title>Testing stationarity and change point detection in reinforcement learning</article-title>. <source><italic>The Annals of Statistics</italic></source>, <volume>53</volume>(<issue>3</issue>): <fpage>1230</fpage>–<lpage>1256</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1214/25-AOS2501" xlink:type="simple">https://doi.org/10.1214/25-AOS2501</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_031">
<mixed-citation publication-type="journal"> <string-name><surname>Luckett</surname> <given-names>DJ</given-names></string-name>, <string-name><surname>Laber</surname> <given-names>EB</given-names></string-name>, <string-name><surname>Kahkoska</surname> <given-names>AR</given-names></string-name>, <string-name><surname>Maahs</surname> <given-names>DM</given-names></string-name>, <string-name><surname>Mayer-Davis</surname> <given-names>E</given-names></string-name>, <string-name><surname>Kosorok</surname> <given-names>MR</given-names></string-name> (<year>2020</year>). <article-title>Estimating dynamic treatment regimes in mobile health using V-learning</article-title>. <source><italic>Journal of the American Statistical Association</italic></source>, <volume>115</volume>(<issue>530</issue>): <fpage>692</fpage>–<lpage>706</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1080/01621459.2018.1537919" xlink:type="simple">https://doi.org/10.1080/01621459.2018.1537919</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_032">
<mixed-citation publication-type="other"> <string-name><surname>Luo</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Pan</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Watkinson</surname> <given-names>P</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>T</given-names></string-name> (<year>2024</year>). Reinforcement learning in dynamic treatment regimes needs critical reexamination. arXiv preprint: <uri>https://arxiv.org/abs/2405.18556</uri>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_033">
<mixed-citation publication-type="journal"> <string-name><surname>Lyu</surname> <given-names>L</given-names></string-name>, <string-name><surname>Cheng</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wahed</surname> <given-names>AS</given-names></string-name> (<year>2023</year>). <article-title>Imputation-based Q-learning for optimizing dynamic treatment regimes with right-censored survival outcome</article-title>. <source><italic>Biometrics</italic></source>, <volume>79</volume>(<issue>4</issue>): <fpage>3676</fpage>–<lpage>3689</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1111/biom.13872" xlink:type="simple">https://doi.org/10.1111/biom.13872</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_034">
<mixed-citation publication-type="chapter"> <string-name><surname>Madumal</surname> <given-names>P</given-names></string-name>, <string-name><surname>Miller</surname> <given-names>T</given-names></string-name>, <string-name><surname>Sonenberg</surname> <given-names>L</given-names></string-name>, <string-name><surname>Vetere</surname> <given-names>F</given-names></string-name> (<year>2020</year>). <chapter-title>Explainable reinforcement learning through a causal lens</chapter-title>. In: <source><italic>Proceedings of the AAAI Conference on Artificial Intelligence</italic></source>, volume <volume>34</volume>, <fpage>2493</fpage>–<lpage>2500</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_035">
<mixed-citation publication-type="journal"> <string-name><surname>Miao</surname> <given-names>W</given-names></string-name>, <string-name><surname>Geng</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Tchetgen Tchetgen</surname> <given-names>EJ</given-names></string-name> (<year>2018</year>). <article-title>Identifying causal effects with proxy variables of an unmeasured confounder</article-title>. <source><italic>Biometrika</italic></source>, <volume>105</volume>(<issue>4</issue>): <fpage>987</fpage>–<lpage>993</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1093/biomet/asy038" xlink:type="simple">https://doi.org/10.1093/biomet/asy038</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_036">
<mixed-citation publication-type="journal"> <string-name><surname>Miotto</surname> <given-names>R</given-names></string-name>, <string-name><surname>Li</surname> <given-names>L</given-names></string-name>, <string-name><surname>Kidd</surname> <given-names>BA</given-names></string-name>, <string-name><surname>Dudley</surname> <given-names>JT</given-names></string-name> (<year>2016</year>). <article-title>Deep patient: An unsupervised representation to predict the future of patients from the electronic health records</article-title>. <source><italic>Scientific Reports</italic></source>, <volume>6</volume>: <fpage>26094</fpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1038/srep26094" xlink:type="simple">https://doi.org/10.1038/srep26094</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_037">
<mixed-citation publication-type="journal"> <string-name><surname>Mnih</surname> <given-names>V</given-names></string-name>, <string-name><surname>Kavukcuoglu</surname> <given-names>K</given-names></string-name>, <string-name><surname>Silver</surname> <given-names>D</given-names></string-name>, <string-name><surname>Rusu</surname> <given-names>AA</given-names></string-name>, <string-name><surname>Veness</surname> <given-names>J</given-names></string-name>, <string-name><surname>Bellemare</surname> <given-names>MG</given-names></string-name>, <etal>et al.</etal> (<year>2015</year>). <article-title>Human-level control through deep reinforcement learning</article-title>. <source><italic>Nature</italic></source>, <volume>518</volume>(<issue>7540</issue>): <fpage>529</fpage>–<lpage>533</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1038/nature14236" xlink:type="simple">https://doi.org/10.1038/nature14236</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_038">
<mixed-citation publication-type="journal"> <string-name><surname>Murphy</surname> <given-names>SA</given-names></string-name> (<year>2003</year>). <article-title>Optimal dynamic treatment regimes</article-title>. <source><italic>Journal of the Royal Statistical Society, Series B, Statistical Methodology</italic></source>, <volume>65</volume>(<issue>2</issue>): <fpage>331</fpage>–<lpage>355</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1111/1467-9868.00389" xlink:type="simple">https://doi.org/10.1111/1467-9868.00389</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_039">
<mixed-citation publication-type="journal"> <string-name><surname>Murphy</surname> <given-names>SA</given-names></string-name> (<year>2005</year>). <article-title>A generalization error for Q-learning</article-title>. <source><italic>Journal of Machine Learning Research</italic></source>, <volume>6</volume>(<issue>37</issue>): <fpage>1073</fpage>–<lpage>1097</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_040">
<mixed-citation publication-type="chapter"> <string-name><surname>Ng</surname> <given-names>AY</given-names></string-name>, <string-name><surname>Harada</surname> <given-names>D</given-names></string-name>, <string-name><surname>Russell</surname> <given-names>S</given-names></string-name> (<year>1999</year>). <chapter-title>Policy invariance under reward transformations: Theory and application to reward shaping</chapter-title>. In: <source><italic>Proceedings of the International Conference on Machine Learning</italic></source>, volume <volume>99</volume>, <fpage>278</fpage>–<lpage>287</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_041">
<mixed-citation publication-type="journal"> <string-name><surname>Padakandla</surname> <given-names>S</given-names></string-name>, <string-name><surname>KJ</surname> <given-names>P</given-names></string-name>, <string-name><surname>Bhatnagar</surname> <given-names>S</given-names></string-name> (<year>2020</year>). <article-title>Reinforcement learning algorithm for non-stationary environments</article-title>. <source><italic>Applied Intelligence</italic></source>, <volume>50</volume>(<issue>11</issue>): <fpage>3590</fpage>–<lpage>3606</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1007/s10489-020-01758-5" xlink:type="simple">https://doi.org/10.1007/s10489-020-01758-5</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_042">
<mixed-citation publication-type="book"> <string-name><surname>Puterman</surname> <given-names>ML</given-names></string-name> (<year>1994</year>). <source><italic>Markov Decision Processes: Discrete Stochastic Dynamic Programming</italic></source>. <series><italic>Wiley Series in Probability and Mathematical Statistics</italic></series>. <publisher-name>John Wiley &amp; Sons</publisher-name>, <publisher-loc>New York</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_043">
<mixed-citation publication-type="journal"> <string-name><surname>Robins</surname> <given-names>J</given-names></string-name> (<year>1986</year>). <article-title>A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect</article-title>. <source><italic>Mathematical Modelling</italic></source>, <volume>7</volume>(<issue>9–12</issue>): <fpage>1393</fpage>–<lpage>1512</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/0270-0255(86)90088-6" xlink:type="simple">https://doi.org/10.1016/0270-0255(86)90088-6</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_044">
<mixed-citation publication-type="journal"> <string-name><surname>Robins</surname> <given-names>J</given-names></string-name>, <string-name><surname>Hernán</surname> <given-names>M</given-names></string-name>, <string-name><surname>Brumback</surname> <given-names>B</given-names></string-name> (<year>2000</year>). <article-title>Marginal structural models and causal inference in epidemiology</article-title>. <source><italic>Epidemiology</italic></source>, <volume>11</volume>(<issue>5</issue>): <fpage>550</fpage>–<lpage>560</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1097/00001648-200009000-00011" xlink:type="simple">https://doi.org/10.1097/00001648-200009000-00011</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_045">
<mixed-citation publication-type="journal"> <string-name><surname>Schulte</surname> <given-names>PJ</given-names></string-name>, <string-name><surname>Tsiatis</surname> <given-names>AA</given-names></string-name>, <string-name><surname>Laber</surname> <given-names>EB</given-names></string-name>, <string-name><surname>Davidian</surname> <given-names>M</given-names></string-name> (<year>2015</year>). <article-title>Q-and A-learning methods for estimating optimal dynamic treatment regimes</article-title>. <source><italic>Statistical Science</italic></source>, <volume>29</volume>(<issue>4</issue>): <fpage>640</fpage>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_046">
<mixed-citation publication-type="other"> <string-name><surname>Shahn</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Dukes</surname> <given-names>O</given-names></string-name>, <string-name><surname>Shamsunder</surname> <given-names>M</given-names></string-name>, <string-name><surname>Richardson</surname> <given-names>D</given-names></string-name>, <string-name><surname>Tchetgen Tchetgen</surname> <given-names>ET</given-names></string-name>, <string-name><surname>Robins</surname> <given-names>J</given-names></string-name> (<year>2025</year>). Structural nested mean models under parallel trends assumptions. arXiv preprint: <uri>https://arxiv.org/abs/2204.10291v8</uri>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_047">
<mixed-citation publication-type="chapter"> <string-name><surname>Sherman</surname> <given-names>E</given-names></string-name>, <string-name><surname>Arbour</surname> <given-names>D</given-names></string-name>, <string-name><surname>Shpitser</surname> <given-names>I</given-names></string-name> (<year>2020</year>). <chapter-title>General identification of dynamic treatment regimes under interference</chapter-title>. In: <source><italic>Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics</italic></source>, volume <volume>108</volume>, <fpage>3917</fpage>–<lpage>3927</lpage>. <publisher-name>PMLR</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_048">
<mixed-citation publication-type="chapter"> <string-name><surname>Shi</surname> <given-names>C</given-names></string-name>, <string-name><surname>Wan</surname> <given-names>R</given-names></string-name>, <string-name><surname>Song</surname> <given-names>R</given-names></string-name>, <string-name><surname>Lu</surname> <given-names>W</given-names></string-name>, <string-name><surname>Leng</surname> <given-names>L</given-names></string-name> (<year>2020</year>). <chapter-title>Does the Markov decision process fit the data: Testing for the Markov property in sequential decision making</chapter-title>. In: <source><italic>International Conference on Machine Learning</italic></source>, <fpage>8807</fpage>–<lpage>8817</lpage>. <publisher-name>PMLR</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_049">
<mixed-citation publication-type="chapter"> <string-name><surname>Silva</surname> <given-names>A</given-names></string-name>, <string-name><surname>Gombolay</surname> <given-names>M</given-names></string-name>, <string-name><surname>Killian</surname> <given-names>T</given-names></string-name>, <string-name><surname>Jimenez</surname> <given-names>I</given-names></string-name>, <string-name><surname>Son</surname> <given-names>SH</given-names></string-name> (<year>2020</year>). <chapter-title>Optimization methods for interpretable differentiable decision trees applied to reinforcement learning</chapter-title>. In: <source><italic>International Conference on Artificial Intelligence and Statistics</italic></source>, <fpage>1855</fpage>–<lpage>1865</lpage>. <publisher-name>PMLR</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_050">
<mixed-citation publication-type="journal"> <string-name><surname>Silver</surname> <given-names>D</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>A</given-names></string-name>, <string-name><surname>Maddison</surname> <given-names>CJ</given-names></string-name>, <string-name><surname>Guez</surname> <given-names>A</given-names></string-name>, <string-name><surname>Sifre</surname> <given-names>L</given-names></string-name>, <string-name><surname>van den Driessche</surname> <given-names>G</given-names></string-name>, <etal>et al.</etal> (<year>2016</year>). <article-title>Mastering the game of go with deep neural networks and tree search</article-title>. <source><italic>Nature</italic></source>, <volume>529</volume>(<issue>7587</issue>): <fpage>484</fpage>–<lpage>489</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1038/nature16961" xlink:type="simple">https://doi.org/10.1038/nature16961</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_051">
<mixed-citation publication-type="journal"> <string-name><surname>Spicker</surname> <given-names>D</given-names></string-name>, <string-name><surname>Wallace</surname> <given-names>MP</given-names></string-name> (<year>2020</year>). <article-title>Measurement error and precision medicine: Error-prone tailoring covariates in dynamic treatment regimes</article-title>. <source><italic>Statistics in Medicine</italic></source>, <volume>39</volume>(<issue>26</issue>): <fpage>3732</fpage>–<lpage>3755</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1002/sim.8690" xlink:type="simple">https://doi.org/10.1002/sim.8690</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_052">
<mixed-citation publication-type="book"> <string-name><surname>Sutton</surname> <given-names>RS</given-names></string-name>, <string-name><surname>Barto</surname> <given-names>AG</given-names></string-name> (<year>2018</year>). <source><italic>Reinforcement Learning: An Introduction</italic></source>. <publisher-name>MIT press</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_053">
<mixed-citation publication-type="chapter"> <string-name><surname>Sutton</surname> <given-names>RS</given-names></string-name>, <string-name><surname>McAllester</surname> <given-names>DA</given-names></string-name>, <string-name><surname>Singh</surname> <given-names>SP</given-names></string-name>, <string-name><surname>Mansour</surname> <given-names>Y</given-names></string-name> (<year>1999</year>). <chapter-title>Policy gradient methods for reinforcement learning with function approximation</chapter-title>. In: <source><italic>Advances in Neural Information Processing Systems</italic></source>, volume <volume>12</volume>, <fpage>1057</fpage>–<lpage>1063</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_054">
<mixed-citation publication-type="chapter"> <string-name><surname>Tennenholtz</surname> <given-names>G</given-names></string-name>, <string-name><surname>Shalit</surname> <given-names>U</given-names></string-name>, <string-name><surname>Mannor</surname> <given-names>S</given-names></string-name> (<year>2020</year>). <chapter-title>Off-policy evaluation in partially observable environments</chapter-title>. In: <source><italic>Proceedings of the AAAI Conference on Artificial Intelligence</italic></source>, volume <volume>34</volume>, <fpage>10276</fpage>–<lpage>10283</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_055">
<mixed-citation publication-type="journal"> <string-name><surname>Tibshirani</surname> <given-names>R</given-names></string-name> (<year>1996</year>). <article-title>Regression shrinkage and selection via the lasso</article-title>. <source><italic>Journal of the Royal Statistical Society, Series B, Statistical Methodology</italic></source>, <volume>58</volume>(<issue>1</issue>): <fpage>267</fpage>–<lpage>288</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1111/j.2517-6161.1996.tb02080.x" xlink:type="simple">https://doi.org/10.1111/j.2517-6161.1996.tb02080.x</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_056">
<mixed-citation publication-type="journal"> <string-name><surname>Tsitsiklis</surname> <given-names>J</given-names></string-name>, <string-name><surname>Van Roy</surname> <given-names>B</given-names></string-name> (<year>1997</year>). <article-title>An analysis of temporal-difference learning with function approximation</article-title>. <source><italic>IEEE Transactions on Automatic Control</italic></source>, <volume>42</volume>(<issue>5</issue>): <fpage>674</fpage>–<lpage>690</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1109/9.580874" xlink:type="simple">https://doi.org/10.1109/9.580874</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_057">
<mixed-citation publication-type="other"> <string-name><surname>Uehara</surname> <given-names>M</given-names></string-name>, <string-name><surname>Shi</surname> <given-names>C</given-names></string-name>, <string-name><surname>Kallus</surname> <given-names>N</given-names></string-name> (<year>2022</year>). A review of off-policy evaluation in reinforcement learning. arXiv preprint arXiv:<ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2212.06355">2212.06355</ext-link>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_058">
<mixed-citation publication-type="chapter"> <string-name><surname>van Hasselt</surname> <given-names>H</given-names></string-name> (<year>2010</year>). <chapter-title>Double Q-learning</chapter-title>. In: <source><italic>Advances in Neural Information Processing Systems</italic></source>, volume <volume>23</volume>, <fpage>2613</fpage>–<lpage>2621</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_059">
<mixed-citation publication-type="other"> <string-name><surname>van Hasselt</surname> <given-names>H</given-names></string-name>, <string-name><surname>Doron</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Strub</surname> <given-names>F</given-names></string-name>, <string-name><surname>Hessel</surname> <given-names>M</given-names></string-name>, <string-name><surname>Sonnerat</surname> <given-names>N</given-names></string-name>, <string-name><surname>Modayil</surname> <given-names>J</given-names></string-name> (<year>2018</year>). Deep reinforcement learning and the deadly triad. arXiv preprint arXiv:<ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1812.02648">1812.02648</ext-link>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_060">
<mixed-citation publication-type="chapter"> <string-name><surname>van Hasselt</surname> <given-names>H</given-names></string-name>, <string-name><surname>Guez</surname> <given-names>A</given-names></string-name>, <string-name><surname>Silver</surname> <given-names>D</given-names></string-name> (<year>2016</year>). <chapter-title>Deep reinforcement learning with double Q-learning</chapter-title>. In: <source><italic>Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence</italic></source>, volume <volume>30</volume>, <fpage>2094</fpage>–<lpage>2100</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_061">
<mixed-citation publication-type="chapter"> <string-name><surname>Wang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Schaul</surname> <given-names>T</given-names></string-name>, <string-name><surname>Hessel</surname> <given-names>M</given-names></string-name>, <string-name><surname>van Hasselt</surname> <given-names>H</given-names></string-name>, <string-name><surname>Lanctot</surname> <given-names>M</given-names></string-name>, <string-name><surname>Freitas</surname> <given-names>N</given-names></string-name> (<year>2016</year>). <chapter-title>Dueling network architectures for deep reinforcement learning</chapter-title>. In: <source><italic>International Conference on Machine Learning</italic></source>, volume <volume>48</volume>, <fpage>1995</fpage>–<lpage>2003</lpage>. <publisher-name>PMLR</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_062">
<mixed-citation publication-type="journal"> <string-name><surname>Watkins</surname> <given-names>CJ</given-names></string-name>, <string-name><surname>Dayan</surname> <given-names>P</given-names></string-name> (<year>1992</year>). <article-title>Q-learning</article-title>. <source><italic>Machine Learning</italic></source>, <volume>8</volume>: <fpage>279</fpage>–<lpage>292</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_063">
<mixed-citation publication-type="journal"> <string-name><surname>Williams</surname> <given-names>RJ</given-names></string-name> (<year>1992</year>). <article-title>Simple statistical gradient-following algorithms for connectionist reinforcement learning</article-title>. <source><italic>Machine Learning</italic></source>, <volume>8</volume>: <fpage>229</fpage>–<lpage>256</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1023/A:1022672621406" xlink:type="simple">https://doi.org/10.1023/A:1022672621406</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_064">
<mixed-citation publication-type="chapter"> <string-name><surname>Xu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Shi</surname> <given-names>C</given-names></string-name>, <string-name><surname>Luo</surname> <given-names>S</given-names></string-name>, <string-name><surname>Song</surname> <given-names>R</given-names></string-name> (<year>2023</year>). <chapter-title>An instrumental variable approach to confounded off-policy evaluation</chapter-title>. In: <source><italic>International Conference on Machine Learning</italic></source>, volume <volume>202</volume>, <fpage>38848</fpage>–<lpage>38880</lpage>. <publisher-name>PMLR</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_065">
<mixed-citation publication-type="journal"> <string-name><surname>Yu</surname> <given-names>C</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Nemati</surname> <given-names>S</given-names></string-name>, <string-name><surname>Yin</surname> <given-names>G</given-names></string-name> (<year>2021</year>). <article-title>Reinforcement learning in healthcare: A survey</article-title>. <source><italic>ACM Computing Surveys</italic></source>, <volume>55</volume>(<issue>1</issue>): <fpage>1</fpage>–<lpage>36</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1205_ref_066">
<mixed-citation publication-type="journal"> <string-name><surname>Zeng</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Cai</surname> <given-names>R</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>F</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>L</given-names></string-name>, <string-name><surname>Hao</surname> <given-names>Z</given-names></string-name> (<year>2025</year>). <article-title>A survey on causal reinforcement learning</article-title>. <source><italic>IEEE Transactions on Neural Networks and Learning Systems</italic></source>, <volume>36</volume>(<issue>4</issue>): <fpage>5942</fpage>–<lpage>5962</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1109/TNNLS.2024.3403001" xlink:type="simple">https://doi.org/10.1109/TNNLS.2024.3403001</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_067">
<mixed-citation publication-type="journal"> <string-name><surname>Zhang</surname> <given-names>B</given-names></string-name>, <string-name><surname>Tsiatis</surname> <given-names>AA</given-names></string-name>, <string-name><surname>Laber</surname> <given-names>EB</given-names></string-name>, <string-name><surname>Davidian</surname> <given-names>M</given-names></string-name> (<year>2013</year>). <article-title>Robust estimation of optimal dynamic treatment regimes for sequential treatment decisions</article-title>. <source><italic>Biometrika</italic></source>, <volume>100</volume>(<issue>3</issue>): <fpage>681</fpage>–<lpage>694</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1093/biomet/ast014" xlink:type="simple">https://doi.org/10.1093/biomet/ast014</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_068">
<mixed-citation publication-type="journal"> <string-name><surname>Zhang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Laber</surname> <given-names>EB</given-names></string-name>, <string-name><surname>Davidian</surname> <given-names>M</given-names></string-name>, <string-name><surname>Tsiatis</surname> <given-names>AA</given-names></string-name> (<year>2018</year>). <article-title>Interpretable dynamic treatment regimes</article-title>. <source><italic>Journal of the American Statistical Association</italic></source>, <volume>113</volume>(<issue>524</issue>): <fpage>1541</fpage>–<lpage>1549</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1080/01621459.2017.1345743" xlink:type="simple">https://doi.org/10.1080/01621459.2017.1345743</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_069">
<mixed-citation publication-type="journal"> <string-name><surname>Zhao</surname> <given-names>YQ</given-names></string-name>, <string-name><surname>Zeng</surname> <given-names>D</given-names></string-name>, <string-name><surname>Laber</surname> <given-names>EB</given-names></string-name>, <string-name><surname>Kosorok</surname> <given-names>MR</given-names></string-name> (<year>2015</year>). <article-title>New statistical learning methods for estimating optimal dynamic treatment regimes</article-title>. <source><italic>Journal of the American Statistical Association</italic></source>, <volume>110</volume>(<issue>510</issue>): <fpage>583</fpage>–<lpage>598</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1080/01621459.2014.937488" xlink:type="simple">https://doi.org/10.1080/01621459.2014.937488</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_070">
<mixed-citation publication-type="journal"> <string-name><surname>Zhou</surname> <given-names>W</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>R</given-names></string-name>, <string-name><surname>Qu</surname> <given-names>A</given-names></string-name> (<year>2024</year>). <article-title>Estimating optimal infinite horizon dynamic treatment regimes via pT-learning</article-title>. <source><italic>Journal of the American Statistical Association</italic></source>, <volume>119</volume>(<issue>545</issue>): <fpage>625</fpage>–<lpage>638</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1080/01621459.2022.2138760" xlink:type="simple">https://doi.org/10.1080/01621459.2022.2138760</ext-link></mixed-citation>
</ref>
<ref id="j_jds1205_ref_071">
<mixed-citation publication-type="journal"> <string-name><surname>Zhu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>K</given-names></string-name>, <string-name><surname>Jain</surname> <given-names>AK</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>J</given-names></string-name> (<year>2023</year>). <article-title>Transfer learning in deep reinforcement learning: A survey</article-title>. <source><italic>IEEE Transactions on Pattern Analysis and Machine Intelligence</italic></source>, <volume>45</volume>(<issue>11</issue>): <fpage>13344</fpage>–<lpage>13362</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1109/TPAMI.2023.3292075" xlink:type="simple">https://doi.org/10.1109/TPAMI.2023.3292075</ext-link></mixed-citation>
</ref>
</ref-list>
</back>
</article>
