Pub. online: 26 Jan 2026 · Type: Philosophies of Data Science · Open Access
Journal: Journal of Data Science
Volume 24, Issue 1 (2026): Special Issue on Statistical Aspects of Trustworthy Machine Learning, pp. 4–25
Abstract
A central focus of data science is the transformation of empirical evidence into knowledge. By “knowledge,” we mean claims that are (i) supported by data through an explicit inferential procedure and (ii) accompanied by calibrated measures of uncertainty. As such, the scientific insights and attitudes of deep thinkers like Ronald A. Fisher, Karl R. Popper, and John W. Tukey are expected to inspire exciting new advances in machine learning and artificial intelligence in years to come. Along these lines, the present paper advances a novel typicality principle which states, roughly, that if the observed data is sufficiently “atypical” in a certain sense relative to a posited theory, then that theory is unwarranted. This emphasis on typicality brings familiar but often overlooked background notions like model-checking to the inferential foreground. One instantiation of the typicality principle is in the context of parameter estimation, where we propose a new typicality-based regularization strategy that leans heavily on goodness-of-fit testing. The effectiveness of this new regularization strategy is illustrated in three non-trivial examples where ordinary maximum likelihood estimation fails miserably. We also demonstrate how the typicality principle fits within a bigger picture of reliable and efficient uncertainty quantification.
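The paper's estimator is not detailed in the abstract, but the general idea of typicality-based regularization can be illustrated with a minimal sketch: restrict maximum likelihood estimation to parameter values under which the observed data are not flagged as atypical by a goodness-of-fit test. The exponential model, grid search, Kolmogorov–Smirnov test, and `typicality_constrained_mle` function below are all illustrative assumptions, not the authors' method.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=200)

def typicality_constrained_mle(data, scales, alpha=0.05):
    """Hypothetical sketch of typicality-based regularization:
    among candidate scale parameters, discard those under which a
    Kolmogorov-Smirnov goodness-of-fit test rejects (p < alpha),
    then return the surviving value with the highest log-likelihood."""
    best, best_ll = None, -np.inf
    for s in scales:
        ks = stats.kstest(data, "expon", args=(0, s))
        if ks.pvalue < alpha:
            continue  # data are "atypical" under this scale: exclude it
        ll = stats.expon(scale=s).logpdf(data).sum()
        if ll > best_ll:
            best, best_ll = s, ll
    return best

grid = np.linspace(0.5, 5.0, 200)
est = typicality_constrained_mle(data, grid)
```

Here the goodness-of-fit test acts as the regularizer: parameter values that render the data atypical are excluded before likelihood maximization, which is the kind of model-checking-in-the-foreground the abstract describes.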