Pub. online: 21 Apr 2023; Type: Statistical Data Science; Open Access
Journal: Journal of Data Science
Volume 21, Issue 2 (2023): Special Issue: Symposium Data Science and Statistics 2022, pp. 412–427
Abstract
The use of error spending functions and stopping rules has become a powerful tool for conducting interim analyses. Interim analyses are broadly desired not only in traditional clinical trials but also in A/B tests. Although many papers have summarized error spending approaches, limited work has been done in the context of large-scale data to assist in finding the “optimal” boundary. In this paper, we summarize fifteen boundaries that consist of five error spending functions that allow early termination for futility, difference, or both, as well as a fixed sample size design without interim monitoring. The simulation is based on a practical A/B testing problem comparing two independent proportions. We examine sample sizes across a range of values from 500 to 250,000 per arm to reflect different settings where A/B testing may be utilized. The choices of optimal boundaries are summarized using a proposed loss function that incorporates different weights for the expected sample size under a null experiment with no difference between variants, the expected sample size under an experiment with a difference in the variants, and the maximum sample size needed if the A/B test did not stop early at an interim analysis. The results are presented for simulation settings based on adequately powered, under-powered, and over-powered designs with recommendations for selecting the “optimal” design in each setting.
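As a rough illustration of what an error spending function does, the sketch below implements two widely used Lan-DeMets spending functions. The abstract does not name its five functions, so the choice of the O'Brien-Fleming-type and Pocock-type forms, the function names, and the default two-sided level of 0.05 are all assumptions made here for illustration only.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def obrien_fleming_spend(t, alpha=0.05):
    """Cumulative two-sided type I error spent at information fraction t
    (Lan-DeMets O'Brien-Fleming-type function; an assumed example)."""
    if not 0.0 < t <= 1.0:
        raise ValueError("information fraction t must lie in (0, 1]")
    z = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return 2.0 * (1.0 - _STD_NORMAL.cdf(z / math.sqrt(t)))

def pocock_spend(t, alpha=0.05):
    """Cumulative type I error spent at information fraction t
    (Lan-DeMets Pocock-type function; an assumed example)."""
    if not 0.0 < t <= 1.0:
        raise ValueError("information fraction t must lie in (0, 1]")
    return alpha * math.log(1.0 + (math.e - 1.0) * t)

# Hypothetical schedule: four equally spaced interim looks.
for t in (0.25, 0.5, 0.75, 1.0):
    print(f"t={t:.2f}  OBF-type={obrien_fleming_spend(t):.5f}  "
          f"Pocock-type={pocock_spend(t):.5f}")
```

Both functions spend the full error budget by the final look; the O'Brien-Fleming-type form spends very little early on, whereas the Pocock-type form spends more evenly across looks.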
Pub. online: 31 Mar 2023; Type: Computing In Data Science; Open Access
Journal: Journal of Data Science
Volume 21, Issue 2 (2023): Special Issue: Symposium Data Science and Statistics 2022, pp. 333–353
Abstract
High-Order Markov Chains (HOMC) are conventional models, based on transition probabilities, that are used by the United States Department of Agriculture (USDA) National Agricultural Statistics Service (NASS) to study crop-rotation patterns over time. However, HOMCs routinely suffer from sparsity and identifiability issues because the categorical data are represented as indicator (or dummy) variables. In fact, the dimension of the parametric space increases exponentially with the order of HOMCs required for analysis. While parsimonious representations reduce the number of parameters, as has been shown in the literature, they often result in less accurate predictions. Most parsimonious models are trained on big data structures, which can be compressed and efficiently processed using alternative algorithms. Consequently, a thorough evaluation and comparison of the prediction results obtained using a new HOMC algorithm and different types of Deep Neural Networks (DNN) across a range of agricultural conditions is warranted to determine which model is most appropriate for operational crop-specific land cover prediction of United States (US) agriculture. In this paper, six neural network models are applied to crop rotation data between 2011 and 2021 from six agriculturally intensive counties, which reflect the range of major crops grown and a variety of crop rotation patterns in the Midwest and southern US. The six counties include: Renville, North Dakota; Perkins, Nebraska; Hale, Texas; Livingston, Illinois; McLean, Illinois; and Shelby, Ohio. Results show the DNN models achieve higher overall prediction accuracy for all counties in 2021. The proposed DNN models allow for the ingestion of long time series data, and robustly achieve higher accuracy values than a new HOMC algorithm considered for predicting crop-specific land cover in the US.
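The exponential growth of the parameter space can be made concrete with a short count. The sketch below assumes a fully parameterized chain: an order-m chain over k categories has k^m possible histories, each with k - 1 free transition probabilities. The function name and the choice of 10 crop categories are hypothetical and are not taken from the paper.

```python
def homc_free_parameters(num_states, order):
    """Free transition probabilities in a fully parameterized Markov chain
    of the given order: one (num_states - 1)-dimensional distribution for
    each of the num_states ** order possible histories."""
    if num_states < 2 or order < 1:
        raise ValueError("need at least 2 states and order >= 1")
    return num_states ** order * (num_states - 1)

# Hypothetical setting: 10 crop categories, orders 1 through 5.
for order in range(1, 6):
    print(order, homc_free_parameters(10, order))
```

With 10 categories, each additional year of history multiplies the count by 10, which is one way to see why sparsity sets in quickly for long rotation histories.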
Pub. online: 21 Mar 2023; Type: Education In Data Science; Open Access
Journal: Journal of Data Science
Volume 21, Issue 2 (2023): Special Issue: Symposium Data Science and Statistics 2022, pp. 428–441
Abstract
Although there are various ways to represent data patterns and models, visualization has been the primary method taught in many data science courses because of its efficiency. Such vision-dependent output may create critical barriers for people who are blind or visually impaired and for people with learning disabilities. We argue that instructors need to teach multiple data representation methods so that all students can produce data products that are more accessible. We further argue that accessibility should be taught as early as the introductory course as part of the data science curriculum so that, regardless of whether learners major in data science or not, they can have foundational exposure to accessibility. As data science educators who teach accessibility as part of our lower-division courses in two different institutions, we share specific examples that can be utilized by other data science instructors.
Pub. online: 15 Mar 2023; Type: Computing In Data Science; Open Access
Journal: Journal of Data Science
Volume 21, Issue 2 (2023): Special Issue: Symposium Data Science and Statistics 2022, pp. 295–309
Abstract
The reputation of the maximum pseudolikelihood estimator (MPLE) for Exponential Random Graph Models (ERGM) has undergone a drastic change over the past 30 years. While first receiving broad support, mainly due to its computational feasibility and the lack of alternatives, general opinions started to change with the introduction of approximate maximum likelihood estimator (MLE) methods that became practicable due to increasing computing power and the introduction of MCMC methods. Previous comparison studies appear to yield contradictory results regarding the preference between these two point estimators; however, there is consensus that the prevailing method of obtaining an MPLE’s standard error from the inverse Hessian matrix generally underestimates standard errors. We propose replacing the inverse Hessian matrix by an approximation of the Godambe matrix that results in confidence intervals with appropriate coverage rates and that, in addition, enables checking for model degeneracy. Our results also provide empirical evidence for the asymptotic normality of the MPLE under certain conditions.
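For readers unfamiliar with the Godambe matrix, its general form in composite likelihood theory is the sandwich below. The notation ($\ell_{\mathrm{PL}}$ for the log-pseudolikelihood, $H$ for the sensitivity matrix, $J$ for the variability matrix) is assumed here and is not taken from the paper, whose specific approximation may differ.

```latex
\[
  G(\theta) = H(\theta)\, J(\theta)^{-1} H(\theta),
  \qquad
  H(\theta) = \mathbb{E}\!\left[-\nabla^{2}_{\theta}\, \ell_{\mathrm{PL}}(\theta)\right],
  \quad
  J(\theta) = \operatorname{Var}\!\left[\nabla_{\theta}\, \ell_{\mathrm{PL}}(\theta)\right],
\]
\[
  \widehat{\operatorname{Var}}\bigl(\hat{\theta}_{\mathrm{MPLE}}\bigr)
  \approx G(\hat{\theta})^{-1}
  = H(\hat{\theta})^{-1} J(\hat{\theta})\, H(\hat{\theta})^{-1}.
\]
```

Under a correctly specified full likelihood, $H = J$ and the sandwich collapses to the inverse Hessian; for a pseudolikelihood the two generally differ, so using $H^{-1}$ alone can misstate the uncertainty.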
Pub. online: 14 Mar 2023; Type: Statistical Data Science; Open Access
Journal: Journal of Data Science
Volume 21, Issue 2 (2023): Special Issue: Symposium Data Science and Statistics 2022, pp. 354–367
Abstract
We propose a scalable Bayesian network learning algorithm based on sparse Cholesky decomposition. Our approach requires only observational data and a user-specified confidence level as inputs and can estimate networks with thousands of variables. The computational complexity of the proposed method is $O(p^{3})$ for a graph with $p$ vertices. Extensive numerical experiments illustrate the usefulness of our method with promising results. In simulation, the initial step in our approach also improves an alternative Bayesian network structure estimation method that uses an undirected graph as an input.
Pub. online: 13 Mar 2023; Type: Computing In Data Science; Open Access
Journal: Journal of Data Science
Volume 21, Issue 2 (2023): Special Issue: Symposium Data Science and Statistics 2022, pp. 255–280
Abstract
Causal inference can estimate causal effects, but unless data are collected experimentally, statistical analyses must rely on pre-specified causal models. Causal discovery algorithms are empirical methods for constructing such causal models from data. Several asymptotically correct discovery methods already exist, but they generally struggle on smaller samples. Moreover, most methods focus on very sparse causal models, which may not always be a realistic representation of real-life data generating mechanisms. Finally, while causal relationships suggested by the methods often hold true, their claims about causal non-relatedness have high error rates. This non-conservative error trade-off is not ideal for observational sciences, where the resulting model is directly used to inform causal inference: a causal model with many missing causal relations entails overly strong assumptions and may lead to biased effect estimates. We propose a new causal discovery method that addresses these three shortcomings: Supervised learning discovery (SLdisco). SLdisco uses supervised machine learning to obtain a mapping from observational data to equivalence classes of causal models. We evaluate SLdisco in a large simulation study based on Gaussian data and we consider several choices of model size and sample size. We find that SLdisco is more conservative, only moderately less informative, and less sensitive to sample size than existing procedures. We furthermore provide a real epidemiological data application. We use random subsampling to investigate real data performance on small samples and again find that SLdisco is less sensitive to sample size and hence seems to better utilize the information available in small datasets.
Vaccine efficacy is a key index for evaluating vaccines in initial clinical trials during vaccine development. In particular, it plays a crucial role in authorizing Covid-19 vaccines. It has been reported that Covid-19 vaccine efficacy varies with a number of factors, including population demographics, time after vaccine administration, and virus strains. By examining clinical trial data of three Covid-19 vaccine studies, we find that the current approach to evaluating vaccines with an overall efficacy does not provide the desired accuracy. It requires no time frame during which a candidate vaccine is evaluated, and is subject to misuse, resulting in potentially misleading information and interpretation. In particular, we illustrate with clinical trial data that the variability of vaccine efficacy is underestimated. We demonstrate that a new method may help to address these caveats. It leads to accurate estimation of the variation of efficacy, provides useful information to define a reasonable time frame to evaluate vaccines, and avoids misuse of vaccine efficacy and misleading information.
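For context, the overall efficacy that the abstract critiques is conventionally defined as one minus the ratio of attack rates in the vaccinated and placebo arms. The sketch below computes only that standard quantity; it is not the new method the abstract proposes, and the function name and counts are hypothetical rather than data from any trial.

```python
def overall_vaccine_efficacy(cases_vaccine, n_vaccine, cases_placebo, n_placebo):
    """Overall efficacy as 1 - (attack rate, vaccine arm) / (attack rate, placebo arm).
    Note that no follow-up time frame enters this calculation."""
    if n_vaccine <= 0 or n_placebo <= 0 or cases_placebo <= 0:
        raise ValueError("arm sizes and placebo case count must be positive")
    risk_vaccine = cases_vaccine / n_vaccine
    risk_placebo = cases_placebo / n_placebo
    return 1.0 - risk_vaccine / risk_placebo

# Hypothetical counts, for illustration only.
ve = overall_vaccine_efficacy(cases_vaccine=8, n_vaccine=20_000,
                              cases_placebo=160, n_placebo=20_000)
print(f"overall efficacy: {ve:.1%}")
```

Because the result is a single number pooled over the whole follow-up, it carries no information about how efficacy changes with time since vaccination, which is the kind of limitation the abstract points to.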
Pub. online: 2 Mar 2023; Type: Computing In Data Science; Open Access
Journal: Journal of Data Science
Volume 21, Issue 2 (2023): Special Issue: Symposium Data Science and Statistics 2022, pp. 310–332
Abstract
Analyzing “large p small n” data is becoming increasingly important in a wide range of application fields. As a projection pursuit index, the Penalized Discriminant Analysis ($\mathrm{PDA}$) index, built upon the Linear Discriminant Analysis ($\mathrm{LDA}$) index, was devised by Lee and Cook (2010) to classify high-dimensional data, with promising results. Yet, there is little information available about its performance compared with the popular Support Vector Machine ($\mathrm{SVM}$). This paper conducts extensive numerical studies to compare the performance of the $\mathrm{PDA}$ index with the $\mathrm{LDA}$ index and $\mathrm{SVM}$, demonstrating that the $\mathrm{PDA}$ index is robust to outliers and able to handle high-dimensional datasets with extremely small sample sizes, few important variables, and multiple classes. Analyses of several motivating real-world datasets reveal the practical advantages and limitations of individual methods, suggesting that the $\mathrm{PDA}$ index provides a useful alternative tool for classifying complex high-dimensional data. These new insights, along with the hands-on implementation of the $\mathrm{PDA}$ index functions in the R package classPP, help statisticians and data scientists make effective use of both sets of classification tools.
The least squares (LS) estimator of the autoregressive coefficient in the bifurcating autoregressive (BAR) model was recently shown to suffer from substantial bias, especially for small to moderate samples. This study investigates the impact of the bias in the LS estimator on the behavior of various types of bootstrap confidence intervals for the autoregressive coefficient and introduces methods for constructing bias-corrected bootstrap confidence intervals. We first describe several bootstrap confidence interval procedures for the autoregressive coefficient of the BAR model and present their bias-corrected versions. The behavior of uncorrected and corrected confidence interval procedures is studied empirically through extensive Monte Carlo simulations and two real cell lineage data applications. The empirical results show that the bias in the LS estimator can have a significant negative impact on the behavior of bootstrap confidence intervals and that bias correction can significantly improve the performance of bootstrap confidence intervals in terms of coverage, width, and symmetry.
Deep residual networks (ResNets) have shown state-of-the-art performance in various real-world applications. Recently, the ResNet model was reparameterized and interpreted as the solution to a continuous ordinary differential equation, or Neural-ODE model. In this study, we propose a neural generalized ordinary differential equation (Neural-GODE) model with layer-varying parameters to further extend the Neural-ODE to approximate the discrete ResNets. Specifically, we use nonparametric B-spline functions to parameterize the Neural-GODE so that the trade-off between model complexity and computational efficiency can be easily balanced. It is demonstrated that ResNets and Neural-ODE models are special cases of the proposed Neural-GODE model. Based on two benchmark datasets, MNIST and CIFAR-10, we show that the layer-varying Neural-GODE is more flexible and general than the standard Neural-ODE. Furthermore, the Neural-GODE enjoys computational and memory benefits while performing comparably to ResNets in prediction accuracy.
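The link between residual networks and differential equations can be seen in a toy scalar setting: a residual update x ← x + h·f(x, θ_l) is one forward Euler step of dx/dt = f(x, θ(t)). The sketch below is an illustration of that correspondence only. The function names, the linear vector field, and the step size are assumptions, and the B-spline parameterization of the layer-varying parameters described in the abstract is not implemented here.

```python
import math

def residual_forward(x, thetas, f, step=1.0):
    """Apply one residual update per entry of thetas:
    x <- x + step * f(x, theta). Each update is a forward Euler step
    of dx/dt = f(x, theta(t)); thetas may differ from layer to layer."""
    for theta in thetas:
        x = x + step * f(x, theta)
    return x

def linear_field(x, theta):
    """Toy vector field f(x, theta) = theta * x (hypothetical choice)."""
    return theta * x

# Shared parameters across layers, as in a standard Neural-ODE:
# dx/dt = x on [0, 1] with x(0) = 1 has the exact solution x(1) = e.
n_layers = 1000
x_final = residual_forward(1.0, [1.0] * n_layers, linear_field,
                           step=1.0 / n_layers)
print(x_final, math.e)

# Layer-varying parameters: theta ramps linearly from 0 to 1 over depth.
ramp = [layer / n_layers for layer in range(n_layers)]
print(residual_forward(1.0, ramp, linear_field, step=1.0 / n_layers))
```

Passing the same θ to every layer corresponds to the shared-parameter Neural-ODE view, while a list that varies with depth corresponds to the layer-varying case that the Neural-GODE is designed to cover.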