Pub. online: 23 Nov 2022 | Type: Data Science In Action | Open Access
Journal: Journal of Data Science
Volume 21, Issue 2 (2023): Special Issue: Symposium Data Science and Statistics 2022, pp. 177–192
Abstract
Clinical risk prediction models are commonly developed in a post-hoc and passive fashion, capitalizing on convenient data from completed clinical trials or retrospective cohorts. The impact of such models often ends at publication rather than reaching patients. The field of clinical risk prediction is rapidly improving in a progressively more transparent data science era. Based on the collective experience of the Prostate Biopsy Collaborative Group (PBCG) over the past decade, this paper proposes four data science-driven strategies for improving clinical risk prediction to the benefit of clinical practice and research. The first strategy is to actively design prospective data collection, monitoring, analysis, and validation of risk tools following the same standards as for clinical trials, in order to elevate the quality of training data. The second is to make risk tools and model formulas available online. User-friendly risk tools bring quantitative information to patients and their clinicians for improved knowledge-based decision-making. As past experience testifies, online tools expedite independent validation, providing helpful information as to whether the tools generalize to new populations. The third is to dynamically update and localize risk tools to adapt to changing demographic and clinical landscapes. The fourth is to accommodate systematic missing data patterns across cohorts in order to maximize statistical power in model training, and to accommodate missing information on the end-user side in order to maximize utility for the public.
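To make the fourth strategy concrete, here is a minimal, hypothetical sketch (not the PBCG implementation) of pooling two cohorts when one cohort never recorded a predictor: simple imputation plus a missingness indicator lets both cohorts contribute to model training. All variable names (e.g., psa_density) and distributions are illustrative assumptions.

```python
# Hypothetical sketch: pooling cohorts with a systematically missing predictor.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical cohorts: cohort B never measured "psa_density".
cohort_a = pd.DataFrame({
    "age": rng.normal(65, 7, 200),
    "psa_density": rng.lognormal(-2, 0.5, 200),
    "cancer": rng.integers(0, 2, 200),
})
cohort_b = pd.DataFrame({
    "age": rng.normal(63, 8, 150),
    "psa_density": np.nan,          # systematically missing in this cohort
    "cancer": rng.integers(0, 2, 150),
})

pooled = pd.concat([cohort_a, cohort_b], ignore_index=True)
pooled["psa_density_missing"] = pooled["psa_density"].isna().astype(int)
pooled["psa_density"] = pooled["psa_density"].fillna(pooled["psa_density"].median())

X = pooled[["age", "psa_density", "psa_density_missing"]]
y = pooled["cancer"]
model = LogisticRegression(max_iter=1000).fit(X, y)
print(model.coef_)
```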
Pub. online: 9 Nov 2022 | Type: Statistical Data Science | Open Access
Journal: Journal of Data Science
Volume 20, Issue 4 (2022): Special Issue: Large-Scale Spatial Data Science, pp. 566–584
Abstract
The article presents a methodology for supervised regionalization of data on a spatial domain. Defining a spatial process at multiple scales leads to the well-known ecological fallacy problem. Here, we use the ecological fallacy as the basis for a minimization criterion to obtain the intended regions. The Karhunen-Loève expansion of the spatial process maintains the relationship between realizations at multiple resolutions. Specifically, we use the Karhunen-Loève expansion to define the regionalization error so that the ecological fallacy is minimized. Contiguous regionalization is carried out using the minimum spanning tree formed from the spatial locations and the data; regionalization then reduces to pruning edges from the minimum spanning tree. The methodology is demonstrated using simulated and real data examples.
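As an illustration of the pruning step only (not the authors' Karhunen-Loève-based criterion), the sketch below builds a minimum spanning tree over spatial locations with edge weights mixing spatial distance and data dissimilarity, then removes the heaviest edges so that the remaining connected components form contiguous regions. The weighting constant and covariates are assumptions for illustration.

```python
# Illustrative MST-pruning regionalization sketch.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

rng = np.random.default_rng(1)
coords = rng.uniform(0, 1, size=(100, 2))                      # spatial locations
values = np.sin(4 * coords[:, 0]) + rng.normal(0, 0.1, 100)    # observed process

# Edge weights: spatial distance plus scaled data dissimilarity.
d_space = squareform(pdist(coords))
d_data = np.abs(values[:, None] - values[None, :])
weights = d_space + 2.0 * d_data

mst = minimum_spanning_tree(csr_matrix(weights)).tocoo()       # n - 1 edges

# "Prune" the k heaviest edges; the resulting components are the regions.
k = 4                                                          # -> k + 1 regions
keep = np.argsort(mst.data)[:-k]                               # drop the k largest weights
pruned = csr_matrix((mst.data[keep], (mst.row[keep], mst.col[keep])),
                    shape=mst.shape)
n_regions, labels = connected_components(pruned, directed=False)
print(n_regions, np.bincount(labels))
```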
Pub. online: 8 Nov 2022 | Type: Statistical Data Science | Open Access
Journal: Journal of Data Science
Volume 20, Issue 4 (2022): Special Issue: Large-Scale Spatial Data Science, pp. 439–460
Abstract
In the last few decades, the size of spatial and spatio-temporal datasets in many research areas has rapidly increased with the development of data collection technologies. As a result, classical methods in spatial statistics are facing computational challenges. For example, the kriging predictor in geostatistics becomes prohibitive on traditional hardware architectures for large datasets, as it requires substantial computing power and a large memory footprint for dense matrix operations. Over the years, various approximation methods have been proposed to address these computational issues; however, the community lacks a holistic process for assessing their approximation efficiency. To provide a fair assessment, in 2021 we organized the first competition on spatial statistics for large datasets, generated by our ExaGeoStat software, and asked participants to report estimation and prediction results. Thanks to its widely acknowledged success and at the request of many participants, we organized a second competition in 2022 focusing on predictions for more complex spatial and spatio-temporal processes, including univariate nonstationary spatial processes, univariate stationary space-time processes, and bivariate stationary spatial processes. In this paper, we describe the data generation procedure in detail and make the datasets publicly available for wider adoption. We then review the methods submitted by fourteen teams worldwide, analyze the competition outcomes, and assess the performance of each team.
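For context, the computational bottleneck the competition targets can be seen in the exact (dense) simple-kriging predictor: the n x n covariance solve below costs O(n^3) time and O(n^2) memory, which is what the approximation methods under comparison seek to avoid. The exponential covariance and synthetic data are illustrative stand-ins, not the competition's Matérn settings.

```python
# Exact dense kriging: the computational bottleneck for large n.
import numpy as np
from scipy.spatial.distance import cdist

def exp_cov(d, sigma2=1.0, range_=0.2):
    """Exponential covariance; a stand-in for the Matern families used in practice."""
    return sigma2 * np.exp(-d / range_)

rng = np.random.default_rng(2)
n = 1000
obs_locs = rng.uniform(0, 1, size=(n, 2))
new_locs = rng.uniform(0, 1, size=(10, 2))
z = rng.normal(size=n)                      # placeholder observations

Sigma = exp_cov(cdist(obs_locs, obs_locs))  # n x n dense covariance
c = exp_cov(cdist(new_locs, obs_locs))      # cross-covariance

# Kriging weights require a dense solve: this is the O(n^3) bottleneck.
pred = c @ np.linalg.solve(Sigma, z)
print(pred)
```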
Single-index models are becoming increasingly popular in many scientific applications, as they offer flexibility in regression modeling as well as interpretable covariate effects. In the context of survival analysis, single-index hazards models are natural extensions of the Cox proportional hazards model. In this paper, we propose a novel estimation procedure for single-index hazards models under a monotonicity constraint on the index. We apply the profile likelihood method to obtain the semiparametric maximum likelihood estimator; the novelty of the procedure lies in estimating the unknown monotone link function by embedding the problem in isotonic regression with exponentially distributed random variables. The consistency of the proposed semiparametric maximum likelihood estimator is established under suitable regularity conditions. Numerical simulations are conducted to examine the finite-sample performance of the proposed method. An analysis of breast cancer data is presented for illustration.
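As a small illustration of the isotonic-regression building block (not the full profile-likelihood procedure), the sketch below fits a monotone link to noisy index values with the pool-adjacent-violators algorithm; the simulated index and exponential noise are assumptions for illustration only.

```python
# Monotone link estimation via isotonic regression (illustrative only).
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(3)
index = np.sort(rng.uniform(-2, 2, 200))                   # single-index values
noisy = np.exp(0.8 * index) * rng.exponential(1.0, 200)    # exponential-type noise

iso = IsotonicRegression(increasing=True)
link_hat = iso.fit_transform(index, noisy)                 # monotone estimate of the link
print(link_hat[:5], link_hat[-5:])
```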
Multiclass probability estimation is the problem of estimating the conditional probability of a data point belonging to each class given its covariate information. It has broad applications in statistical analysis and data science. Recently, a class of weighted Support Vector Machines (wSVMs) has been developed to estimate class probabilities through ensemble learning for K-class problems (Wu et al., 2010; Wang et al., 2019), where K is the number of classes. These estimators are robust and achieve high accuracy for probability estimation, but their learning is implemented through pairwise coupling, which demands time that grows polynomially in K. In this paper, we propose two new learning schemes, baseline learning and One-vs-All (OVA) learning, to further improve wSVMs in terms of computational efficiency and estimation accuracy. In particular, baseline learning has optimal computational complexity in the sense that it is linear in K. Though not the most computationally efficient, the OVA scheme is found to have the best estimation accuracy among all procedures compared. The resulting estimators are distribution-free and shown to be consistent. We further conduct extensive numerical experiments to demonstrate their finite-sample performance.
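To illustrate why an OVA scheme scales linearly in K, here is a generic One-vs-All sketch using calibrated SVMs from scikit-learn; it is not the paper's wSVM ensemble, only a demonstration of the K binary subproblems and the final renormalization.

```python
# Generic One-vs-All probability estimation: one binary SVM per class, O(K).
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
K = len(np.unique(y))

probs = np.zeros((X.shape[0], K))
for k in range(K):                                   # one binary problem per class
    clf = CalibratedClassifierCV(SVC(kernel="rbf"), cv=3)
    clf.fit(X, (y == k).astype(int))
    probs[:, k] = clf.predict_proba(X)[:, 1]

probs /= probs.sum(axis=1, keepdims=True)            # renormalize to probability vectors
print(probs[:3])
```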
Pub. online: 3 Nov 2022 | Type: Statistical Data Science | Open Access
Journal: Journal of Data Science
Volume 20, Issue 4 (2022): Special Issue: Large-Scale Spatial Data Science, pp. 533–544
Abstract
Spatial probit generalized linear mixed models (spGLMM) with a linear fixed effect and a spatial random effect, endowed with a Gaussian process prior, are widely used for the analysis of binary spatial data. However, the canonical Bayesian implementation of this hierarchical mixed model can involve protracted Markov chain Monte Carlo sampling. Alternate approaches have been proposed that circumvent this by directly representing the marginal likelihood from spGLMM in terms of multivariate normal cumulative distribution functions (cdf). We present a direct and fast rendition of this latter approach for prediction from a spatial probit linear mixed model. We show that the covariance matrix of the multivariate normal cdf characterizing the marginal distribution of binary spatial data from spGLMM is amenable to approximation using Nearest Neighbor Gaussian Processes (NNGP). This facilitates a scalable prediction algorithm for spGLMM using NNGP that involves only sparse or small matrix computations and can be deployed in an embarrassingly parallel manner. We demonstrate the accuracy and scalability of the algorithm via numerous simulation experiments and an analysis of species presence-absence data.
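The embarrassingly parallel, small-matrix flavor of nearest-neighbor prediction can be sketched as follows; this is a generic Gaussian NNGP-style predictor for illustration, not the probit spGLMM algorithm of the paper, and the exponential covariance and neighbor count m are assumptions.

```python
# NNGP-style prediction sketch: each site conditions on m nearest observations,
# so every solve is m x m instead of n x n and sites can be processed in parallel.
import numpy as np
from scipy.spatial import cKDTree
from scipy.spatial.distance import cdist

def exp_cov(d, sigma2=1.0, range_=0.2):
    return sigma2 * np.exp(-d / range_)

rng = np.random.default_rng(4)
n, m = 2000, 15
obs_locs = rng.uniform(0, 1, size=(n, 2))
z = rng.normal(size=n)
new_locs = rng.uniform(0, 1, size=(5, 2))

tree = cKDTree(obs_locs)
preds = []
for s in new_locs:                                     # each site is independent
    _, idx = tree.query(s, k=m)                        # m nearest observations
    S = exp_cov(cdist(obs_locs[idx], obs_locs[idx]))   # m x m covariance
    c = exp_cov(cdist(s[None, :], obs_locs[idx]))[0]
    preds.append(c @ np.linalg.solve(S, z[idx]))
print(preds)
```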
Pub. online: 3 Nov 2022 | Type: Statistical Data Science | Open Access
Journal: Journal of Data Science
Volume 20, Issue 4 (2022): Special Issue: Large-Scale Spatial Data Science, pp. 512–532
Abstract
Large or very large spatial (and spatio-temporal) datasets have become commonplace in many environmental and climate studies. These data are often collected in non-Euclidean spaces (such as planet Earth) and often exhibit nonstationary anisotropies. This paper proposes a generic approach to modeling Gaussian Random Fields (GRFs) on compact Riemannian manifolds that bridges the gap between existing works on nonstationary GRFs and random fields on manifolds. The approach can be applied to any smooth compact manifold, and in particular to any compact surface. By defining a Riemannian metric that accounts for the preferential directions of correlation, our approach yields an interpretation of nonstationary geometric anisotropies as resulting from local deformations of the domain. We provide scalable algorithms for parameter estimation and for optimal prediction by kriging and simulation that are able to tackle very large grids. Stationary and nonstationary illustrations are provided.
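A flat-space caricature of geometric anisotropy may help fix ideas: a metric matrix stretches separations along a preferred direction before evaluating the covariance, which mimics the "local deformation of the domain" interpretation. The Riemannian-manifold machinery of the paper is not reproduced here, and all parameter values are illustrative.

```python
# Geometric anisotropy via a deformation (metric) matrix, flat-space illustration.
import numpy as np

def anisotropic_cov(x1, x2, angle=np.pi / 4, ratio=3.0, range_=0.3):
    """Exponential covariance with an elliptical (geometric) anisotropy."""
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s], [s, c]])                   # rotate to principal axes
    A = R @ np.diag([1.0, 1.0 / ratio]) @ R.T         # metric / local deformation
    h = (x1 - x2) @ A                                 # deformed separation vector
    return np.exp(-np.linalg.norm(h) / range_)

x = np.array([0.0, 0.0])
print(anisotropic_cov(x, np.array([0.1, 0.1])),       # direction with shorter range
      anisotropic_cov(x, np.array([0.1, -0.1])))      # direction with longer range
```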
Many undergraduate students who matriculate in Science, Technology, Engineering and Mathematics (STEM) degree programs drop out or switch their major. Previous studies indicate that student performance in prerequisite courses is an important predictor of attrition in STEM. This study analyzed demographic information, ACT/SAT scores, and performance in freshman-year courses to develop machine learning models predicting student success in earning a bachelor's degree in biology. Predictive models based on Random Forest (RF) and Extreme Gradient Boosting (XGBoost) showed better performance in terms of AUC (Area Under the Curve), with more balanced sensitivity and specificity, than Logistic Regression (LR), K-Nearest Neighbor (KNN), and Neural Network (NN) models. An explainable machine learning approach called break-down was employed to identify important freshman-year courses that have a larger impact on student success, at both the degree-program and individual-student levels. Courses identified as important at the program level can help program coordinators prioritize their efforts in addressing student attrition, while courses identified as important at the student level can help academic advisors provide more personalized, data-driven guidance to students.
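As a minimal, hypothetical sketch of the modeling step (synthetic features standing in for the demographic, ACT/SAT, and course-grade data), the snippet below trains a random forest and reports hold-out AUC; break-down explanations as used in the paper could then be computed with an explainability package, a step omitted here.

```python
# Random forest degree-completion classifier on synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for demographic, ACT/SAT, and freshman course features.
X, y = make_classification(n_samples=1200, n_features=15, n_informative=8,
                           weights=[0.35, 0.65], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=500, class_weight="balanced",
                            random_state=0).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))
```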
Pub. online: 14 Oct 2022 | Type: Computing In Data Science | Open Access
Journal: Journal of Data Science
Volume 20, Issue 4 (2022): Special Issue: Large-Scale Spatial Data Science, pp. 475–492
Abstract
We describe our implementation of the multivariate Matérn model for multivariate spatial datasets, using Vecchia's approximation and a Fisher scoring optimization algorithm. We consider various parameterizations of the multivariate Matérn that have been proposed in the literature to ensure model validity, as well as an unconstrained model. A strength of our study is that the code is tested on many real-world multivariate spatial datasets. We use it to study the effect of ordering and conditioning in Vecchia's approximation and the restrictions imposed by the various parameterizations. We also consider a model in which co-located nuggets are correlated across components and find that forcing this cross-component nugget correlation to be zero can have a serious impact on the other model parameters, so we suggest allowing cross-component correlation in co-located nugget terms.
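A univariate sketch of Vecchia's approximation (not the authors' multivariate Matérn implementation) shows the mechanism the abstract refers to: after ordering the locations, each observation is conditioned only on a small set of previously ordered nearest neighbors, so the joint log-likelihood becomes a sum of cheap conditional terms. The exponential covariance and the default ordering are assumptions for illustration.

```python
# Univariate Vecchia log-likelihood sketch with m nearest previously ordered neighbors.
import numpy as np
from scipy.spatial.distance import cdist

def exp_cov(d, sigma2=1.0, range_=0.2):
    return sigma2 * np.exp(-d / range_)

def vecchia_loglik(locs, z, m=10):
    n = len(z)
    ll = 0.0
    for i in range(n):
        if i == 0:
            mean, var = 0.0, exp_cov(np.zeros((1, 1)))[0, 0]
        else:
            # m nearest neighbors among previously ordered points
            d_prev = cdist(locs[i:i + 1], locs[:i])[0]
            nn = np.argsort(d_prev)[:m]
            S = exp_cov(cdist(locs[nn], locs[nn]))
            c = exp_cov(d_prev[nn])
            w = np.linalg.solve(S, c)
            mean = w @ z[nn]
            var = exp_cov(np.zeros((1, 1)))[0, 0] - c @ w
        ll += -0.5 * (np.log(2 * np.pi * var) + (z[i] - mean) ** 2 / var)
    return ll

rng = np.random.default_rng(5)
locs = rng.uniform(0, 1, size=(500, 2))
z = rng.normal(size=500)
print(vecchia_loglik(locs, z))
```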