Pub. online: 26 Jan 2026 · Type: Philosophies of Data Science · Open Access
Journal: Journal of Data Science
Volume 24, Issue 1 (2026): Special Issue: Statistical aspects of Trustworthy Machine Learning, pp. 4–25
Abstract
A central focus of data science is the transformation of empirical evidence into knowledge. By “knowledge,” we mean claims that are (i) supported by data through an explicit inferential procedure and (ii) accompanied by calibrated measures of uncertainty. As such, the scientific insights and attitudes of deep thinkers like Ronald A. Fisher, Karl R. Popper, and John W. Tukey are expected to inspire exciting new advances in machine learning and artificial intelligence in years to come. Along these lines, the present paper advances a novel typicality principle which states, roughly, that if the observed data is sufficiently “atypical” in a certain sense relative to a posited theory, then that theory is unwarranted. This emphasis on typicality brings familiar but often overlooked background notions like model-checking to the inferential foreground. One instantiation of the typicality principle is in the context of parameter estimation, where we propose a new typicality-based regularization strategy that leans heavily on goodness-of-fit testing. The effectiveness of this new regularization strategy is illustrated in three non-trivial examples where ordinary maximum likelihood estimation fails miserably. We also demonstrate how the typicality principle fits within a bigger picture of reliable and efficient uncertainty quantification.
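The abstract describes typicality-based regularization only at a high level, so the following is a minimal illustrative sketch, not the paper's actual procedure: assuming a normal working model and a Kolmogorov–Smirnov goodness-of-fit statistic, candidate parameter values under which the observed data are "atypical" (large KS distance) are ruled out, and likelihood is maximized only over the remaining "typical" candidates. All function names and the threshold value are invented for illustration.

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_distance(data, mu, sigma):
    """One-sample Kolmogorov-Smirnov distance between the empirical CDF
    and the N(mu, sigma^2) CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = normal_cdf(x, mu, sigma)
        d = max(d, abs(f - i / n), abs(f - (i + 1) / n))
    return d

def log_likelihood(data, mu, sigma):
    n = len(data)
    return (-n * math.log(sigma * math.sqrt(2.0 * math.pi))
            - sum((x - mu) ** 2 for x in data) / (2.0 * sigma ** 2))

def typicality_regularized_fit(data, candidates, threshold=0.25):
    """Keep only candidate (mu, sigma) pairs under which the observed data
    are 'typical' (KS distance below a threshold), then maximize likelihood
    over the surviving candidates. Returns None if every model is refuted."""
    typical = [(m, s) for m, s in candidates
               if ks_distance(data, m, s) <= threshold]
    if not typical:
        return None
    return max(typical, key=lambda p: log_likelihood(data, *p))
```

In this toy version, a candidate model centered far from the data is rejected outright by the goodness-of-fit screen before likelihood is ever compared, which captures the spirit of bringing model-checking into the inferential foreground.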
Journal: Journal of Data Science
Volume 24, Issue 1 (2026): Special Issue: Statistical aspects of Trustworthy Machine Learning, pp. 26–52
Abstract
Society’s capacity for algorithmic problem-solving has never been greater. Artificial Intelligence is now applied across more domains than ever, a consequence of powerful abstractions, abundant data, and accessible software. As capabilities have expanded, so have risks, with models often deployed without fully understanding their potential impacts. Interpretable and interactive machine learning aims to make complex models more transparent and controllable, enhancing user agency. This review synthesizes key principles from the growing literature in this field. We first introduce precise vocabulary for discussing interpretability, like the distinction between glass box and explainable models. We then explore connections to classical statistical and design principles, like parsimony and the gulfs of interaction. Basic explainability techniques – including learned embeddings, integrated gradients, and concept bottlenecks – are illustrated with a simple case study. We also review criteria for objectively evaluating interpretability approaches. Throughout, we underscore the importance of considering audience goals when designing interactive data-driven systems. Finally, we outline open challenges and discuss the potential role of data science in addressing them. Code to reproduce all examples can be found at https://go.wisc.edu/3k1ewe.
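Of the explainability techniques named above, integrated gradients has a particularly compact definition: attribute to each input feature the path integral of the model's gradient along the straight line from a baseline to the input. The sketch below is a generic numerical version using finite-difference gradients, not the paper's case-study code; the function names and step counts are illustrative.

```python
def integrated_gradients(f, x, baseline, steps=200, eps=1e-5):
    """Approximate integrated gradients for a scalar-valued model f
    by a Riemann sum over the straight-line path from baseline to x.
    Gradients are estimated by forward finite differences."""
    n = len(x)
    attributions = [0.0] * n
    for k in range(1, steps + 1):
        alpha = k / steps
        point = [baseline[i] + alpha * (x[i] - baseline[i]) for i in range(n)]
        for i in range(n):
            bumped = list(point)
            bumped[i] += eps
            grad_i = (f(bumped) - f(point)) / eps
            # accumulate grad * (x_i - baseline_i) * d_alpha
            attributions[i] += grad_i * (x[i] - baseline[i]) / steps
    return attributions
```

A useful sanity check is the completeness property: the attributions should sum (approximately) to the difference f(x) - f(baseline).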
Pub. online: 11 Feb 2026 · Type: Data Science Reviews · Open Access
Journal: Journal of Data Science
Volume 24, Issue 1 (2026): Special Issue: Statistical aspects of Trustworthy Machine Learning, pp. 53–85
Abstract
Causal inference is a central goal across many scientific disciplines. Over the past several decades, three major frameworks have emerged to formalize causal questions and guide their analysis: the potential outcomes framework, structural equation models, and directed acyclic graphs. Although these frameworks differ in language, assumptions, and philosophical orientation, they often lead to compatible or complementary insights. This paper provides a comparative introduction to the three frameworks, clarifying their connections, highlighting their distinct strengths and limitations, and illustrating how they can be used together in practice. The discussion is aimed at researchers and graduate students with some background in statistics or causal inference who are seeking a conceptual foundation for applying causal methods across a range of substantive domains.
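One concrete place where the three frameworks meet is the backdoor adjustment formula: in DAG language, conditioning on a set Z that blocks all backdoor paths identifies E[Y | do(T=t)]; in potential-outcomes language, the same computation is standardization under unconfoundedness given Z. The sketch below, with invented variable names and a discrete confounder, is an illustration of that shared estimand, not code from the paper.

```python
from collections import defaultdict

def adjusted_effect(records):
    """Backdoor-adjusted average treatment effect from (z, t, y) triples:
    E[Y|do(T=1)] - E[Y|do(T=0)]
      = sum_z P(Z=z) * (E[Y|T=1,Z=z] - E[Y|T=0,Z=z]),
    valid assuming Z blocks all backdoor paths (unconfoundedness given Z)
    and both treatment arms are observed in every stratum (positivity)."""
    by_z = defaultdict(lambda: {0: [], 1: []})
    for z, t, y in records:
        by_z[z][t].append(y)
    n = len(records)
    effect = 0.0
    for z, groups in by_z.items():
        pz = (len(groups[0]) + len(groups[1])) / n
        mean1 = sum(groups[1]) / len(groups[1])
        mean0 = sum(groups[0]) / len(groups[0])
        effect += pz * (mean1 - mean0)
    return effect
```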
Pub. online: 10 Dec 2025 · Type: Data Science Reviews · Open Access
Journal: Journal of Data Science
Volume 24, Issue 1 (2026): Special Issue: Statistical aspects of Trustworthy Machine Learning, pp. 86–105
Abstract
Reinforcement Learning (RL) is a powerful framework for sequential decision-making, enabling agents to optimize actions through interaction with their environment. While widely studied in computer science, statisticians have advanced RL by addressing challenges like uncertainty quantification, sample efficiency, and interpretability. These contributions are particularly impactful in healthcare, where RL complements Dynamic Treatment Regimes (DTRs), optimizing personalized medicine by tailoring treatments to individuals based on evolving characteristics. This paper serves as both a tutorial for statisticians new to RL and a review of its integration with statistical methodologies. It introduces foundational RL concepts, classical algorithms, and Q-learning variants, and highlights how statistical perspectives, especially causal inference, address challenges in DTRs. By bridging RL and statistical perspectives, the paper highlights opportunities to enhance decision-making in high-stakes domains like healthcare.
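Since the abstract mentions Q-learning as a foundational algorithm, here is the standard tabular update as a reference point (a generic textbook sketch, not the paper's formulation): each observed transition moves Q(s, a) toward the bootstrapped target r + gamma * max_a' Q(s', a').

```python
def q_learning(transitions, n_states, n_actions, alpha=0.5, gamma=0.9):
    """Tabular Q-learning from logged (s, a, r, s_next, done) transitions.
    alpha is the learning rate, gamma the discount factor."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for s, a, r, s_next, done in transitions:
        # terminal transitions bootstrap nothing beyond the reward
        target = r if done else r + gamma * max(Q[s_next])
        Q[s][a] += alpha * (target - Q[s][a])
    return Q
```

In the DTR setting the analogue is backward-recursive Q-learning over a finite number of treatment stages, with the Q-functions fit by regression rather than stored in a table.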
Pub. online: 2 Jan 2026 · Type: Data Science Reviews · Open Access
Journal: Journal of Data Science
Volume 24, Issue 1 (2026): Special Issue: Statistical aspects of Trustworthy Machine Learning, pp. 106–124
Abstract
Artificial intelligence (AI) has lately emerged as a transformative force in scientific discovery, capable of accelerating knowledge synthesis, automating experimentation, and enhancing interdisciplinary collaboration. As research challenges, ranging from climate change to rare disease treatments, grow increasingly complex, the rapid evolution of AI calls for a comprehensive examination of its current and future roles. Despite recent breakthroughs, the field remains fragmented, lacking a unified framework for understanding AI's progression in science and its implications for data science in particular. To address this gap, this review provides an analysis of AI for science and introduces a novel three-phase framework, comprising Keplerian (data-driven pattern recognition), Edisonian (autonomous experimentation), and Einsteinian (foundational innovation) phases, to conceptualize AI's evolving role in science. Additionally, we discuss the ethical, environmental, and data privacy challenges that accompany AI's integration into science, emphasizing the need for sustainable and responsible development. This review outlines how AI may transform scientific methods and aims to help researchers harness AI's potential to drive scientific innovation.
Pub. online: 16 Dec 2025 · Type: Statistical Data Science · Open Access
Journal: Journal of Data Science
Volume 24, Issue 1 (2026): Special Issue: Statistical aspects of Trustworthy Machine Learning, pp. 125–145
Abstract
Black-box machine learning models are recognized as useful tools for prediction applications, but the algorithmic complexity of some models causes interpretation challenges. Explainability methods have been proposed to provide insight into these models, but there is little research focused on supervised modeling with functional data inputs. We argue that, especially in applications of high consequence, it is important to explicitly model the functional dependence in a black-box analysis so as not to obscure or misrepresent patterns in explanations. As such, we propose the Variable importance Explainable Elastic Shape Analysis (VEESA) pipeline for training supervised machine learning models with functional inputs. The pipeline is an end-to-end analysis process encompassing data preprocessing, modeling, and post-hoc explanations. The preprocessing is done using elastic functional principal components analysis, which accounts for vertical and horizontal variability in functional data and, ultimately, allows for explanations in the original data space that identify the important functional variability without bias due to correlated variables. We demonstrate the pipeline on two high-consequence applications: explosives classification for national security and inkjet printer identification in forensic science. These applications demonstrate the VEESA pipeline's ability to provide an understanding of the characteristics of the functional data useful for prediction. Code for implementing the pipeline is available in the veesa R package (and supplemental python code).
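For orientation, the sketch below implements only the ordinary (non-elastic) FPCA step on discretized curves, extracting the leading principal component of vertical variability via power iteration on the sample covariance operator. The elastic version used in the VEESA pipeline additionally aligns curves to separate horizontal (phase) variability before this decomposition; that alignment step is not shown, and all names here are invented for illustration.

```python
def first_fpc(curves, iters=500):
    """Leading functional principal component of discretized curves
    (each a list of values on a common grid), computed by power
    iteration on the sample covariance operator C v = (1/n) X^T X v.
    Returns (mean_curve, unit-norm leading component)."""
    n, p = len(curves), len(curves[0])
    mean = [sum(c[j] for c in curves) / n for j in range(p)]
    centered = [[c[j] - mean[j] for j in range(p)] for c in curves]
    v = [1.0] * p
    for _ in range(iters):
        # scores = X v, then v_new = (1/n) X^T scores
        scores = [sum(row[j] * v[j] for j in range(p)) for row in centered]
        v_new = [sum(scores[i] * centered[i][j] for i in range(n)) / n
                 for j in range(p)]
        norm = sum(x * x for x in v_new) ** 0.5
        v = [x / norm for x in v_new]
    return mean, v
```

For curves differing only by a vertical shift, the leading component is the constant function, which is the kind of interpretable "mode of variability" that downstream variable-importance explanations operate on.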
Pub. online: 21 Oct 2025 · Type: Statistical Data Science · Open Access
Journal: Journal of Data Science
Volume 24, Issue 1 (2026): Special Issue: Statistical aspects of Trustworthy Machine Learning, pp. 146–166
Abstract
The extraordinary capabilities of large language models (LLMs) such as ChatGPT and GPT-4 are in part unleashed by aligning them with reward models that are trained on human preferences represented as rankings of responses to prompts. In this paper, we document the phenomenon of reward collapse, an empirical observation where the prevailing ranking-based approach results in an identical reward distribution for diverse prompts during the terminal phase of training. This outcome is undesirable as open-ended prompts like “write a short story about your best friend” should yield a continuous range of rewards for their completions, while specific prompts like “what is the capital city of New Zealand” should generate either high or low rewards. Our theoretical investigation reveals that reward collapse is primarily due to the insufficiency of the ranking-based objective function to incorporate prompt-related information during optimization. Then we derive closed-form expressions for the reward distribution associated with a set of utility functions in an asymptotic setting. Based on the reward distributions for different utility functions, we introduce a prompt-aware optimization scheme that provably admits a prompt-dependent reward distribution within the interpolating regime. Our experimental results suggest that our proposed prompt-aware utility functions significantly alleviate reward collapse during the training of reward models.
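The "ranking-based objective" referred to above is, in its simplest pairwise form, the Bradley–Terry style loss: minimize -log sigmoid(r_chosen - r_rejected) over preference pairs. The one-liner below shows that standard form (the paper's exact objective and its prompt-aware variants may differ); note the loss depends only on the reward *difference*, which is one way prompt-level information can drop out of the optimization.

```python
import math

def pairwise_ranking_loss(r_chosen, r_rejected):
    """Bradley-Terry style pairwise loss for reward-model training:
    -log sigmoid(r_chosen - r_rejected), written in a numerically
    direct form. Depends only on the difference of the two rewards."""
    return math.log(1.0 + math.exp(-(r_chosen - r_rejected)))
```

At a reward difference of zero the loss is log 2, and it decays monotonically as the chosen response is scored further above the rejected one.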
Pub. online: 13 Nov 2025 · Type: Statistical Data Science · Open Access
Journal: Journal of Data Science
Volume 24, Issue 1 (2026): Special Issue: Statistical aspects of Trustworthy Machine Learning, pp. 167–186
Abstract
Neuroimaging technology has received considerable attention in recent years. One of the key problems in imaging data analysis is the heterogeneity among individual subjects. In particular, the relationship between imaging biomarkers and clinical outcomes may vary across individuals. Popular existing statistical methodologies, such as functional linear regression and high-dimensional linear regression, can be inadequate because they assume a homogeneous regression relationship across all subjects. In this paper, we propose the Subject-Specific Scalar-on-Image Regression (S3IR) model to handle heterogeneous populations. Specifically, we utilize a binary subject-specific masking image to capture the heterogeneous sparsity among individuals. The proposed S3IR model incorporates the spatial structure of the imaging data and achieves both local smoothness and subject-specific sparsity of the estimated regression coefficients. Furthermore, we design an EM-type adaptive algorithm to estimate the model coefficients. Simulation studies show the superior performance of our proposed method over existing ones in handling heterogeneity. Finally, we apply the S3IR model to analyze data from the Alzheimer's Disease Neuroimaging Initiative (ADNI). The results show that our model can effectively identify interpretable and significant disease-related regions and improve prediction of cognitive scores.
Pub. online: 3 Oct 2025 · Type: Statistical Data Science · Open Access
Journal: Journal of Data Science
Volume 24, Issue 1 (2026): Special Issue: Statistical aspects of Trustworthy Machine Learning, pp. 187–202
Abstract
We propose a differentially private Bayesian framework for envelope regression, a technique that improves estimation efficiency by modelling the response as a function of a low-dimensional subspace of the predictors. Our method applies the analytic Gaussian mechanism to privatize sufficient statistics from the data, ensuring formal $(\epsilon ,\delta )$-differential privacy. We develop a tailored Gibbs sampling algorithm that performs valid Bayesian inference using only the noisy sufficient statistics. This approach leverages the envelope structure to isolate the variation in predictors that is relevant to the response, reducing estimation error compared to standard regression under the same privacy constraints. Through simulation studies, we demonstrate improved estimation accuracy and tighter credible intervals relative to a differentially private Bayesian linear regression baseline.
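The overall recipe, privatize the regression sufficient statistics X'X and X'y with Gaussian noise and run inference only on the noisy versions, can be sketched as below. Caveats: this sketch calibrates the noise with the classical Gaussian-mechanism formula sigma = Delta * sqrt(2 ln(1.25/delta)) / epsilon, whereas the paper uses the tighter analytic Gaussian mechanism; the sensitivity bound is taken as a given input; and all names are invented for illustration.

```python
import math
import random

def privatize_suff_stats(X, y, sensitivity, epsilon, delta, rng=random):
    """Release noisy X'X and X'y under (epsilon, delta)-DP via the
    classical Gaussian mechanism. `sensitivity` is an assumed L2
    sensitivity bound for the joint statistic (e.g. from bounded rows).
    Noise on X'X is added to the upper triangle and mirrored so the
    released matrix stays symmetric."""
    sigma = sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon
    n, p = len(X), len(X[0])
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
           for a in range(p)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    noisy_xtx = [row[:] for row in xtx]
    for a in range(p):
        for b in range(a, p):
            z = rng.gauss(0.0, sigma)
            noisy_xtx[a][b] += z
            if a != b:
                noisy_xtx[b][a] += z  # mirror to preserve symmetry
    noisy_xty = [v + rng.gauss(0.0, sigma) for v in xty]
    return noisy_xtx, noisy_xty
```

A downstream Gibbs sampler would then treat only (noisy_xtx, noisy_xty) as observed, which is what makes the resulting Bayesian inference compatible with the privacy guarantee.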