Pub. online: 4 Aug 2022
Type: Research Article
Open Access
Journal: Journal of Data Science
Volume 18, Issue 5 (2020): Special Issue S1 in Chinese (with abstract in English), pp. 875–888
Abstract
In the wake of the COVID-19 outbreak, the public turned to Sina Weibo as a major platform for following the course of the pandemic. Research on public sentiment and topic mining for major public opinion events, based on Sina Weibo comment data, is therefore important for understanding how public opinion evolves during major epidemic outbreaks. Based on a psychological classification of Chinese emotion categories, we use open-source tools to build naive Bayesian models to classify Weibo comments. Visualization of comment topics is achieved with word co-occurrence networks, and the topics themselves are mined with a latent Dirichlet allocation model. The results show that the psychological sentiment classification combined with the naive Bayesian model can reflect the evolution of public sentiment during the epidemic, and that the latent Dirichlet allocation model and the word co-occurrence network can effectively mine the topics of public concern.
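To make the pipeline concrete, here is a minimal sketch of the two modeling steps the abstract describes, assuming the scikit-learn and jieba libraries and a handful of hypothetical labeled comments; the paper's own tools, emotion taxonomy, and corpus are not reproduced here.

```python
# A minimal sketch: naive Bayes emotion classification plus LDA topic mining.
# The comments, labels, and library choices are illustrative assumptions.
import jieba
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical training data: Weibo comments labeled with emotion categories.
comments = ["今天很开心", "疫情让人担忧", "医护人员辛苦了"]
labels = ["joy", "fear", "gratitude"]

# Segment Chinese text into words before vectorizing.
tokenized = [" ".join(jieba.lcut(c)) for c in comments]

# Step 1: naive Bayes emotion classifier on bag-of-words counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(tokenized)
clf = MultinomialNB().fit(X, labels)

# Step 2: latent Dirichlet allocation to mine comment topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
for k, topic in enumerate(lda.components_):
    top_words = [vectorizer.get_feature_names_out()[i]
                 for i in topic.argsort()[-5:]]
    print(f"topic {k}: {top_words}")
```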
Pub. online: 4 Aug 2022
Type: Research Article
Open Access
Journal: Journal of Data Science
Volume 18, Issue 5 (2020): Special Issue S1 in Chinese (with abstract in English), pp. 875–888
Abstract
Surveillance of the development of COVID-19 is a complex and challenging task whose foundation is timely and accurate epidemic data. Quality control for releasing COVID-19 data is therefore very important, accounting for the releasing agent, the content to be released, and the impact of the released data. We suggest that quality requirements for the release of COVID-19 data be based on the global perspective that the goal of open epidemic data is to create a valuable ecological chain involving all stakeholders. As such, the collection, aggregation, and release of COVID-19 data should meet not only the data quality standards of official statistics and health statistics, but also the characteristics of epidemic statistics and the needs of pandemic prevention. The quality requirements should follow the unique characteristics of the epidemic and be open to public scrutiny. Integrating the perspectives of official statistics, health statistics, and open government data, we propose five quality dimensions for releasing COVID-19 data: accuracy, timeliness, systematicness, user-friendliness, and security. Through case studies of the official websites of Chinese provincial health commissions, we report quality problems in the current data release process and suggest improvements.
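As an illustration only, the sketch below shows how two of the five proposed dimensions (accuracy and timeliness) could be checked mechanically for a daily release; the field names and toy records are our own assumptions, not taken from the provincial websites studied in the paper.

```python
# A minimal sketch of automated checks for two quality dimensions:
# accuracy (internal consistency of counts) and timeliness (no release gaps).
from datetime import date, timedelta

def check_release(records):
    """records: list of dicts with 'date', 'cumulative_cases', 'new_cases'."""
    issues = []
    for prev, cur in zip(records, records[1:]):
        # Accuracy: cumulative counts must be internally consistent.
        if cur["cumulative_cases"] != prev["cumulative_cases"] + cur["new_cases"]:
            issues.append(f"{cur['date']}: cumulative total inconsistent")
        # Timeliness: releases should appear every day without gaps.
        if cur["date"] - prev["date"] != timedelta(days=1):
            issues.append(f"gap before {cur['date']}")
    return issues

print(check_release([
    {"date": date(2020, 2, 1), "cumulative_cases": 100, "new_cases": 10},
    {"date": date(2020, 2, 2), "cumulative_cases": 112, "new_cases": 12},
    {"date": date(2020, 2, 4), "cumulative_cases": 120, "new_cases": 8},
]))
```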
Many software reliability growth models (SRGMs) based upon a non-homogeneous Poisson process (NHPP) have been proposed to measure and assess the reliability of a software system quantitatively. Generally, the error detection rate and the fault content function during software testing are considered to depend on the elapsed testing time. In this paper we propose three SRGMs that incorporate the notion of error generation over time, as extensions of the delayed S-shaped software reliability growth model based on an NHPP. The model parameters are estimated using the maximum likelihood method for interval domain data, and three data sets are used to illustrate the estimation technique. The proposed models are compared with the existing delayed S-shaped model in terms of error sum of squares, mean sum of squares, predictive ratio risk, and Akaike's information criterion on the three data sets. We show that the proposed models perform noticeably better than the existing models.
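For reference, here is a minimal sketch of maximum likelihood estimation for the classical delayed S-shaped NHPP model, whose mean value function is m(t) = a(1 - (1 + bt)exp(-bt)), fitted to hypothetical grouped (interval domain) fault counts with SciPy; the paper's extended models with error generation add parameters not shown here.

```python
# MLE for the delayed S-shaped NHPP model on grouped fault-count data.
# The interval endpoints and counts below are made up for illustration.
import numpy as np
from scipy.optimize import minimize

t = np.array([1., 2., 3., 4., 5., 6.])      # interval endpoints (e.g. weeks)
n = np.array([5, 9, 12, 8, 5, 2])           # faults detected per interval

def mean_value(t, a, b):
    return a * (1.0 - (1.0 + b * t) * np.exp(-b * t))

def neg_loglik(params):
    a, b = params
    if a <= 0 or b <= 0:
        return np.inf
    m = mean_value(np.concatenate(([0.0], t)), a, b)
    dm = np.diff(m)                          # expected faults per interval
    if np.any(dm <= 0):
        return np.inf
    # Grouped NHPP log-likelihood (constant terms dropped).
    return -(np.sum(n * np.log(dm)) - m[-1])

fit = minimize(neg_loglik, x0=[n.sum() * 1.5, 0.5], method="Nelder-Mead")
print("a_hat, b_hat =", fit.x)
```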
Abstract: The Lee-Carter model and its extensions are the most popular methods in the field of mortality forecasting. However, despite the many methods introduced so far, there is no general method applicable to all situations. Singular spectrum analysis (SSA) is a relatively new, powerful, nonparametric time series technique whose forecasting ability has been demonstrated across various fields. In this paper, we investigate the feasibility of using SSA to construct mortality forecasts. We use the Hyndman-Ullah model, a recent extension of the Lee-Carter model, as a benchmark to evaluate the performance of SSA for mortality forecasting on French data sets.
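A minimal sketch of basic SSA with recurrent forecasting follows, applied to a toy series; the window length, number of components, and data are illustrative choices rather than the settings used for the French mortality data.

```python
# Basic SSA: embed, SVD, reconstruct with leading components, then forecast
# via the standard linear recurrence. Toy data, not the paper's series.
import numpy as np

def ssa_forecast(x, L, r, steps):
    N = len(x)
    K = N - L + 1
    # Embedding: build the L-by-K trajectory matrix.
    X = np.column_stack([x[i:i + L] for i in range(K)])
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Keep the r leading components, then diagonally average back to a series.
    Xr = (U[:, :r] * s[:r]) @ Vt[:r]
    rec = np.array([np.mean(Xr[::-1].diagonal(k)) for k in range(-L + 1, K)])
    # Linear recurrence coefficients from the leading left singular vectors.
    P = U[:, :r]
    pi = P[-1, :]
    R = P[:-1, :] @ pi / (1.0 - pi @ pi)
    out = list(rec)
    for _ in range(steps):
        out.append(R @ np.array(out[-(L - 1):]))
    return np.array(out[N:])

# Toy declining (log-mortality-like) series with noise.
rng = np.random.default_rng(0)
x = np.exp(-0.02 * np.arange(50)) + 0.01 * rng.standard_normal(50)
print(ssa_forecast(x, L=12, r=2, steps=5))
```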
Pub. online: 4 Aug 2022
Type: Research Article
Open Access
Journal: Journal of Data Science
Volume 18, Issue 5 (2020): Special Issue S1 in Chinese (with abstract in English), pp. 849–859
Abstract
Millions of people traveled from Wuhan to other cities between January 1 and January 23, 2020. Taking advantage of masked software development kit data from Aurora Mobile Ltd. and open epidemic data released by health authorities, we analyze the relationship between the number of confirmed COVID-19 cases in a region and the number of people who traveled from Wuhan to that region during this period. Further, we identify high-risk carriers of COVID-19 to improve the control of the disease. The key findings are threefold: (1) in each region, the number of high-risk carriers is highly positively correlated with the severity of the epidemic; (2) a history of visits to the 62 designated hospitals is the foremost index of risk; and (3) the second most important index is the traveler's duration of stay in Wuhan. Based on our analysis, we estimate that, as of February 4, 2020, (a) among the 8.5 million people held up in Wuhan, there were 425 thousand high-risk carriers; and (b) among the 3.5 million migrant workers held up in Hubei, there were 175 thousand high-risk carriers. The disease control authorities should closely monitor these groups.
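The sketch below illustrates, on made-up toy numbers, the kind of region-level correlation check behind finding (1) and a naive two-feature risk score reflecting findings (2) and (3); the actual indices are built from proprietary Aurora Mobile data, and the weights here are arbitrary.

```python
# Toy illustration of a region-level correlation check and a naive risk score.
import numpy as np

# Hypothetical per-region counts (not the paper's data).
high_risk = np.array([120, 340, 80, 510, 60])
confirmed = np.array([35, 110, 20, 160, 15])
print("Pearson r:", round(np.corrcoef(high_risk, confirmed)[0, 1], 3))

# Naive score: hospital-visit history dominates, then duration of stay in Wuhan.
def risk_score(visited_designated_hospital, days_in_wuhan):
    return 2.0 * visited_designated_hospital + 0.1 * min(days_in_wuhan, 20)

print(risk_score(visited_designated_hospital=True, days_in_wuhan=14))
```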
Abstract: Applying linear mixed models or generalized linear mixed models to large databases in which the level-2 units (hospitals) have a wide variety of characteristics is a problem frequently encountered in studies of medical quality. Accurate estimation of model parameters and standard errors requires accounting for the grouping of outcomes within hospitals, and including the hospitals as a random effect in the model is a common way of doing so. However, in a large, diverse population, the required assumptions are not satisfied, which can lead to inconsistent and biased parameter estimates. One solution is to use cluster analysis, with clustering variables distinct from the model covariates, to group the hospitals into smaller, more homogeneous groups, and then carry out the analysis within these groups. We illustrate this approach with a study of hemoglobin A1c control among diabetic patients in a national database of United States Department of Veterans Affairs (VA) hospitals.
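A minimal sketch of this two-stage strategy follows: cluster hospitals on characteristics not used as model covariates, then fit a random-intercept model within each cluster. The data, variable names, and cluster count are hypothetical, not the VA database.

```python
# Two-stage analysis: k-means on hospital characteristics, then a mixed model
# within each cluster. All data below are simulated for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
hospitals = pd.DataFrame({
    "hospital": range(40),
    "size": rng.normal(300, 80, 40),   # clustering variables, deliberately
    "rural": rng.integers(0, 2, 40),   # distinct from the model covariates
})
hospitals["cluster"] = KMeans(n_clusters=3, n_init=10,
                              random_state=0).fit_predict(
    hospitals[["size", "rural"]])

# Patient-level outcomes nested within hospitals.
patients = pd.DataFrame({
    "hospital": rng.integers(0, 40, 2000),
    "age": rng.normal(65, 10, 2000),
})
patients["a1c"] = 7 + 0.01 * patients["age"] + rng.normal(0, 0.5, 2000)
patients = patients.merge(hospitals[["hospital", "cluster"]], on="hospital")

# Fit a random-intercept model separately within each homogeneous cluster.
for c, grp in patients.groupby("cluster"):
    fit = smf.mixedlm("a1c ~ age", grp, groups=grp["hospital"]).fit()
    print(f"cluster {c}: age effect = {fit.params['age']:.3f}")
```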
We fit a Cox proportional hazards (PH) model to interval-censored survival data by first subdividing each individual's failure interval into non-overlapping sub-intervals. From the set of all interval endpoints in the data set, those that fall into an individual's interval are used as the cut points for that individual's sub-intervals. Each sub-interval has an accompanying weight calculated from a parametric Weibull model based on the current parameter estimates. A weighted PH model is then fit with multiple lines of observations corresponding to the sub-intervals for each individual, where the lower end of each sub-interval is used as the observed failure time and the accompanying weights are incorporated. Right-censored observations are handled in the usual manner. We iterate between estimating the baseline Weibull distribution and fitting the weighted PH model until the regression parameters of interest converge; the regression parameter estimates are fixed as an offset when we update the Weibull estimates and recalculate the weights. Our approach is similar to Satten et al.'s (1998) method for interval-censored survival analysis, which used imputed failure times generated from a parametric model in a PH model. Simulation results demonstrate apparently unbiased parameter estimation for the correctly specified Weibull model and little to no bias for a mis-specified log-logistic model. Breast cosmetic deterioration data and ICU hyperlactemia data are analyzed.
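Here is a minimal sketch of one iteration of this weighting scheme, using the lifelines library for the weighted PH fit; the outer convergence loop, the offset-based update of the Weibull baseline, and right-censored rows are omitted, and all data are toy values.

```python
# One iteration: expand interval-censored rows into weighted sub-intervals,
# then fit a weighted Cox PH model. Toy data; Weibull parameters assumed given.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

# Failure known only to lie in (left, right]; x is a covariate.
data = pd.DataFrame({"left": [1., 2., 1., 3.],
                     "right": [3., 4., 2., 6.],
                     "x": [0., 1., 1., 0.]})

# Current Weibull baseline estimates (shape k, scale lam).
k, lam = 1.2, 4.0
S = lambda t: np.exp(-(t / lam) ** k)       # Weibull survival function

cuts = np.unique(data[["left", "right"]].values.ravel())
rows = []
for _, r in data.iterrows():
    # Endpoints falling inside this individual's interval define sub-intervals.
    pts = [r["left"]] + [c for c in cuts if r["left"] < c <= r["right"]]
    for lo, hi in zip(pts, pts[1:]):
        # Weight = conditional Weibull probability of failing in (lo, hi].
        w = (S(lo) - S(hi)) / (S(r["left"]) - S(r["right"]))
        # Lower end of the sub-interval serves as the observed failure time.
        rows.append({"time": lo, "event": 1, "x": r["x"], "w": w})

expanded = pd.DataFrame(rows)
cph = CoxPHFitter().fit(expanded, duration_col="time", event_col="event",
                        weights_col="w", robust=True)
print(cph.params_)
```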
In this article, we introduce a class of distributions with heavier tails than the Pareto distribution of the third kind, which we term the heavy-tailed Pareto (HP) distribution. Various structural properties of the new distribution are derived. It is shown that the HP distribution is in the minimum domain of attraction of the Weibull distribution, and a representation of the HP distribution in terms of a Weibull random variable is obtained, along with two characterizations. The method of maximum likelihood is used to estimate the model parameters, and simulation results are presented to assess the performance of the new model. The Marshall-Olkin heavy-tailed Pareto (MOHP) distribution is also introduced and some of its properties are studied; in particular, the MOHP distribution is shown to be geometric extreme stable. An autoregressive time series model with the new distribution as its marginal is developed and its properties are studied.
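For orientation, the sketch below shows the standard Marshall-Olkin construction that underlies families like MOHP: given any baseline survival function Fbar, the Marshall-Olkin survival function is alpha*Fbar(x) / (1 - (1-alpha)*Fbar(x)). The baseline used here is the Pareto type III survival function; the paper's HP distribution modifies the baseline in ways not reproduced here.

```python
# Generic Marshall-Olkin transform applied to a Pareto type III baseline.
import numpy as np

def pareto3_sf(x, sigma, gamma):
    # Pareto (type III) survival function: [1 + (x/sigma)^(1/gamma)]^(-1).
    return 1.0 / (1.0 + (x / sigma) ** (1.0 / gamma))

def marshall_olkin_sf(x, alpha, base_sf, *args):
    # Marshall-Olkin survival function built from any baseline survival fn.
    fbar = base_sf(x, *args)
    return alpha * fbar / (1.0 - (1.0 - alpha) * fbar)

x = np.linspace(0.1, 50, 5)
print(marshall_olkin_sf(x, 0.5, pareto3_sf, 1.0, 2.0))
```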
Abstract: In this paper we introduce a Bayesian analysis of a spherical distribution applied to rock joint orientation data, in the presence or absence of a vector of covariates, where the response variable is the angle from the mean direction and the covariates are the components of the normal upwards vector. Standard MCMC (Markov chain Monte Carlo) simulation methods, implemented in the WinBUGS software, are used to obtain the posterior summaries of interest. The proposed methodology is illustrated with a simulated data set and a real spherical data set of rock joints from a hydroelectric site.
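As a rough illustration of the kind of MCMC involved, here is a hand-rolled random-walk Metropolis sampler in Python rather than WinBUGS, with a simple von Mises likelihood for the angles standing in for the paper's spherical model (an assumption on our part, not the authors' specification).

```python
# Random-walk Metropolis for the concentration kappa of a von Mises model.
# The likelihood, prior, and toy data are illustrative assumptions.
import numpy as np
from scipy.special import i0   # modified Bessel I0, the vM normalizer

rng = np.random.default_rng(0)
theta = rng.vonmises(0.0, 3.0, size=200)     # toy angles from the mean

def log_post(kappa):
    if kappa <= 0:
        return -np.inf
    # von Mises log-likelihood (mean direction fixed at 0) + Exp(0.01) prior.
    return np.sum(kappa * np.cos(theta) - np.log(2 * np.pi * i0(kappa))) \
        - 0.01 * kappa

kappa, draws = 1.0, []
for _ in range(5000):
    prop = kappa + 0.3 * rng.standard_normal()
    if np.log(rng.uniform()) < log_post(prop) - log_post(kappa):
        kappa = prop
    draws.append(kappa)

print("posterior mean of kappa:", np.mean(draws[1000:]))
```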