Pub. online: 27 Apr 2021 | Type: Philosophies of Data Science
Journal:Journal of Data Science
Volume 19, Issue 2 (2021): Special issue: Continued Data Science Contributions to COVID-19 Pandemic, pp. 219–242
Abstract
The coronavirus disease 2019 (COVID-19) pandemic caused by the novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has placed epidemic modeling at the center of public policymaking. Predicting the severity and speed of COVID-19 transmission is crucial for resource management and for developing strategies to deal with this epidemic. Based on the available data from current and previous outbreaks, many efforts have been made to develop epidemiological models, including statistical models, computer simulations, and mathematical representations of the virus and its impacts, among others. Despite their usefulness, modeling and forecasting the spread of COVID-19 remain a challenge. In this article, we give an overview of the unique features and issues of COVID-19 data and how they impact epidemic modeling and projection. In addition, we illustrate how various models can be connected to each other. Moreover, we provide new data science perspectives on the challenges of COVID-19 forecasting, from data collection, curation, and validation to the limitations of models and the uncertainty of forecasts. Finally, we discuss some data science practices that are crucial to more robust and accurate epidemic forecasting.
Researchers and public officials tend to agree that until a vaccine is readily available, stopping SARS-CoV-2 transmission is the name of the game. Testing is the key to preventing the spread, especially by asymptomatic individuals. With testing capacity restricted, group testing is an appealing alternative for comprehensive screening and has recently received FDA emergency authorization. This technique tests pools of individual samples, often requiring fewer testing resources while potentially providing a multiple-fold speedup. We approach group testing from a data science perspective and offer two contributions. First, we provide an extensive empirical comparison of modern group testing techniques based on simulated data. Second, we propose a simple one-round method based on $\ell_1$-norm sparse recovery, which outperforms current state-of-the-art approaches at certain disease prevalence rates.
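The one-round $\ell_1$ recovery idea can be sketched in a few lines. Below is a minimal illustration, not the authors' exact formulation: pool measurements are assumed noiseless and quantitative, the pooling design is a random Bernoulli matrix, and recovery is basis pursuit solved as a linear program with SciPy.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m, k = 100, 30, 3                # individuals, pools, infected individuals

x_true = np.zeros(n)                # sparse infection vector
x_true[rng.choice(n, size=k, replace=False)] = 1.0

A = rng.binomial(1, 0.1, size=(m, n)).astype(float)   # random pooling design
y = A @ x_true                      # noiseless quantitative pool measurements

# Basis pursuit: minimize ||x||_1 subject to Ax = y and x >= 0.
# With nonnegativity, ||x||_1 = sum(x), so this is a linear program.
res = linprog(c=np.ones(n), A_eq=A, b_eq=y,
              bounds=[(0, None)] * n, method="highs")
print("true infected:", np.flatnonzero(x_true))
print("recovered:    ", np.flatnonzero(res.x > 0.5))
```

With only 30 pooled measurements for 100 individuals, the sparse infection vector is typically recovered exactly at this prevalence, which is the resource saving that motivates group testing.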
It is hypothesized that short-term exposure to air pollution may influence the transmission of aerosolized pathogens such as SARS-CoV-2. We used data from 23 provinces in Italy to build a generalized additive model investigating the association between the effective reproductive number of the disease and air quality, while controlling for ambient environmental variables and changes in human mobility. The model finds a positive, nonlinear relationship between the density of airborne particulate matter and COVID-19 transmission, consistent with similar studies of other respiratory illnesses.
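As a rough sketch of the modeling idea, a GAM with one smooth term per covariate can be fit with the pygam package; the input file and column names below (pm25, temp, humidity, mobility, Rt) are hypothetical stand-ins for the paper's variables.

```python
import pandas as pd
from pygam import LinearGAM, s

# One row per province-day; column names are hypothetical placeholders for
# particulate density, ambient conditions, mobility change, and effective R.
df = pd.read_csv("italy_provinces.csv")

X = df[["pm25", "temp", "humidity", "mobility"]].to_numpy()
y = df["Rt"].to_numpy()

# One smooth (spline) term per covariate, as in a generalized additive model.
gam = LinearGAM(s(0) + s(1) + s(2) + s(3)).fit(X, y)
gam.summary()

# The partial dependence of Rt on particulate matter shows the shape
# (possibly nonlinear) of the fitted association.
XX = gam.generate_X_grid(term=0)
effect = gam.partial_dependence(term=0, X=XX)
```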
The coronavirus disease of 2019 (COVID-19) is a pandemic. To characterize its transmissibility, we propose a Bayesian change point detection model using daily counts of actively infectious cases. Our model builds on a Bayesian Poisson segmented regression model that 1) captures the epidemiological dynamics under changing conditions caused by external or internal factors; 2) provides uncertainty estimates of both the number and locations of change points; and 3) can adjust for time-varying covariate effects. Our model can be used to evaluate public health interventions, identify latent events associated with spreading rates, and yield better short-term forecasts.
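For intuition, here is a minimal single-change-point Poisson model in PyMC on simulated counts; the paper's model is richer, handling multiple change points, segmented (piecewise) trends, and covariates.

```python
import numpy as np
import pymc as pm

# Toy daily counts with one shift in the underlying rate at day 40.
rng = np.random.default_rng(1)
counts = np.concatenate([rng.poisson(5, 40), rng.poisson(20, 40)])
t = np.arange(len(counts))

with pm.Model():
    tau = pm.DiscreteUniform("tau", lower=0, upper=len(counts) - 1)
    lam1 = pm.Exponential("lam1", 0.1)      # rate before the change point
    lam2 = pm.Exponential("lam2", 0.1)      # rate after the change point
    lam = pm.math.switch(t < tau, lam1, lam2)
    pm.Poisson("obs", mu=lam, observed=counts)
    trace = pm.sample(2000, tune=1000)      # posterior of tau quantifies
                                            # change point uncertainty
```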
Pub. online: 23 Feb 2021 | Type: Statistical Data Science
Journal:Journal of Data Science
Volume 19, Issue 2 (2021): Special issue: Continued Data Science Contributions to COVID-19 Pandemic, pp. 348–364
Abstract
It is widely acknowledged that the reported numbers of COVID-19 infections were incomplete. We propose a structured approach that distinguishes between cases that are reflected later in the confirmed counts and cases with mild or no symptoms that are never captured by any reporting system. The number of infected cases in the US is estimated to be 220.54% of the reported count as of April 20, 2020. This implies an overall infection ratio of 0.53% and a case mortality rate of 2.85%, which is close to the 3.4% suggested by the WHO in March 2020.
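The headline numbers can be reproduced with back-of-the-envelope arithmetic; the reported case count and US population below are assumed round figures for illustration only.

```python
reported_cases = 790_000        # assumed US reported cases, Apr 20, 2020
us_population = 330_000_000     # assumed US population

multiplier = 2.2054             # estimated infections = 220.54% of reported
estimated_infections = reported_cases * multiplier

infection_ratio = estimated_infections / us_population
print(f"overall infection ratio: {infection_ratio:.2%}")    # ~0.53%

case_mortality = 0.0285         # deaths as a share of estimated infections
print(f"implied deaths: {estimated_infections * case_mortality:,.0f}")
```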
Pub. online: 22 Feb 2021 | Type: Data Science in Action
Journal:Journal of Data Science
Volume 19, Issue 2 (2021): Special issue: Continued Data Science Contributions to COVID-19 Pandemic, pp. 334–347
Abstract
Coronavirus and the COVID-19 pandemic have substantially altered the ways in which people learn, interact, and discover information. In the absence of everyday in-person interaction, how do people self-educate while living in isolation during such times? More specifically, do communities emerge in Google search trends related to coronavirus? Using a suite of network and community detection algorithms, we scrape and mine all Google search trends in America related to an initial search for “coronavirus,” from the first Google search on the term (January 16, 2020) through August 11, 2020. Results indicate a near-constant shift in how people educate themselves about coronavirus. Queries in the earliest days focus on “Wuhan” and “China,” then shift to “stimulus checks” at the height of the virus in the U.S., and finally to queries about local surges of new cases. A few communities emerge around terms more overtly related to coronavirus (e.g., “cases,” “symptoms”). Yet, given the shift in related queries and the broader information environment, no clear community structure emerges for the full search space.
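The community detection step can be illustrated with networkx on a toy query graph; the edge list below is hypothetical, whereas the paper builds its graph from scraped related-query data.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Nodes are search queries; an edge links a query to a related query
# suggested by Google Trends (toy edge list).
edges = [
    ("coronavirus", "wuhan"), ("coronavirus", "china"),
    ("coronavirus", "symptoms"), ("symptoms", "fever"),
    ("coronavirus", "stimulus checks"), ("stimulus checks", "irs"),
]
G = nx.Graph(edges)

# Modularity-based communities group queries that co-occur as related terms.
for i, community in enumerate(greedy_modularity_communities(G)):
    print(f"community {i}: {sorted(community)}")
```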
Pub. online: 22 Feb 2021 | Type: COVID-19 Special Issue
Journal:Journal of Data Science
Volume 19, Issue 2 (2021): Special issue: Continued Data Science Contributions to COVID-19 Pandemic, pp. 314–333
Abstract
As the major target of many vaccines and neutralizing antibodies against SARS-CoV-2, the spike (S) protein is observed to mutate over time. In this paper, we present statistical approaches to tackle some challenges associated with the analysis of S-protein data. We build a Bayesian hierarchical model to study the temporal and spatial evolution of S-protein sequences, after grouping the sequences into representative clusters. We then apply sampling methods to investigate possible changes to the S-protein’s 3-D structure as a result of commonly observed mutations. While the increasing spread of D614G variants has been noted in other research, our results also show that the co-occurring mutations of D614G together with S477N or A222V may spread even more rapidly, as quantified by our model estimates.
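A much simpler cousin of the temporal part of such an analysis is a binomial GLM on the weekly share of sequences carrying a mutation; the counts below are simulated, and the paper itself fits a Bayesian hierarchical model over clustered sequences rather than this reduction.

```python
import numpy as np
import statsmodels.api as sm

# Simulated weekly counts: sequences sampled and sequences carrying a variant.
weeks = np.arange(20)
n_seqs = np.full(20, 200)
n_variant = np.round(n_seqs / (1 + np.exp(-(weeks - 10) / 2))).astype(int)

# Binomial GLM with logit link: the slope on `weeks` is the logistic
# growth rate of the variant's share over time.
X = sm.add_constant(weeks)
endog = np.column_stack([n_variant, n_seqs - n_variant])
fit = sm.GLM(endog, X, family=sm.families.Binomial()).fit()
print(fit.params)   # a positive slope indicates the variant is spreading
```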
Pub. online: 22 Feb 2021 | Type: Computing in Data Science
Journal:Journal of Data Science
Volume 19, Issue 2 (2021): Special issue: Continued Data Science Contributions to COVID-19 Pandemic, pp. 293–313
Abstract
The COVID-19 (COrona VIrus Disease 2019) pandemic has had profound global consequences for health, the economy, social behavior, and almost every other major aspect of human life. It is therefore of great importance to model COVID-19 and other pandemics in terms of the broader social contexts in which they take place. We present the architecture of an artificial-intelligence-enhanced COVID-19 analysis (AICov for short), which provides an integrative deep learning framework for COVID-19 forecasting with population covariates, some of which may serve as putative risk factors. We have integrated multiple strategies into AICov, including deep learning based on Long Short-Term Memory (LSTM) networks and event modeling. To demonstrate our approach, we introduce a framework that integrates population covariates from multiple sources. AICov thus includes not only data on COVID-19 cases and deaths but, more importantly, the population’s socioeconomic, health, and behavioral risk factors at their specific locations. Feeding these compiled data into AICov yields improved predictions compared with a model that uses only case and death data. Because the models use deep learning, they adapt over time as they learn from past data.
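A stripped-down version of the forecasting idea, an LSTM over sliding windows of cases and deaths augmented with location covariates, can be written in Keras. The shapes and random placeholder data below are assumptions for illustration, not the AICov architecture itself.

```python
import numpy as np
import tensorflow as tf

window, n_features = 14, 2 + 8   # 14-day window; cases, deaths, 8 covariates
n_samples = 5000                 # windows pooled across locations

# Random placeholder data with the right shapes: static covariates are
# repeated along the time axis next to the daily case/death series.
X = np.random.rand(n_samples, window, n_features).astype("float32")
y = np.random.rand(n_samples, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(window, n_features)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),    # next-day forecast
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=64)
```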
Large volumes of trajectory data collected from human and vehicle mobility are highly sensitive due to privacy concerns. Generating synthetic yet plausible trajectory data is therefore pivotal in many location-based studies and applications. However, existing LSTM-based methods are unsuitable for modeling large-scale sequences because of the vanishing-gradient problem, and existing GAN-based methods are coarse-grained. Considering the geographical and sequential features of trajectories, we propose a map-based Two-Stage GAN method (TSG) to tackle these challenges and generate fine-grained, plausible large-scale trajectories. In the first stage, we convert GPS point data into a discrete grid representation that serves as input to a modified deep convolutional generative adversarial network, which learns the general pattern. In the second stage, inside each grid cell, we design an effective encoder-decoder network as the generator to extract road information from the map image and embed it into two parallel Long Short-Term Memory networks that generate GPS point sequences. A discriminator conditioned on the encoded map image constrains the generated point sequences so that they do not deviate from the corresponding road networks. Experiments on real-world data demonstrate the effectiveness of our model in preserving geographical features and hidden mobility patterns. Moreover, the generated trajectories not only match the distribution of the real data but also achieve satisfactory road-network matching accuracy.
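The stage-one preprocessing, converting GPS points into a discrete grid, might look like the following sketch; the grid size, bounding box, and occupancy encoding are assumptions.

```python
import numpy as np

def points_to_grid(points, bbox, n_cells=64):
    """Map (lat, lon) points into an n_cells x n_cells occupancy grid."""
    lat_min, lat_max, lon_min, lon_max = bbox
    grid = np.zeros((n_cells, n_cells), dtype=np.float32)
    for lat, lon in points:
        i = int((lat - lat_min) / (lat_max - lat_min) * (n_cells - 1))
        j = int((lon - lon_min) / (lon_max - lon_min) * (n_cells - 1))
        grid[i, j] = 1.0           # occupied cell, input to the stage-one GAN
    return grid

# Toy trajectory inside an assumed bounding box.
trajectory = [(39.91, 116.40), (39.92, 116.41), (39.93, 116.43)]
grid = points_to_grid(trajectory, bbox=(39.80, 40.00, 116.30, 116.50))
```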
A key step in building regression models is selecting appropriate covariates to explain the behavior of the response variable, a task in which stepwise procedures occupy a prominent position. In this paper we perform several simulation studies to investigate whether a specific stepwise approach, namely Strategy A, properly selects authentic variables within the generalized additive models for location, scale and shape (GAMLSS) framework, considering Gaussian, zero-inflated Poisson, and Weibull distributions. Continuous explanatory variables (with linear and nonlinear relationships) and categorical explanatory variables are considered, and they are selected through goodness-of-fit statistics. Overall, we conclude that Strategy A performs well.
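For readers outside the GAMLSS ecosystem, a generic forward-stepwise selection by AIC conveys the flavor of such procedures. This is a Python sketch of the general technique; "Strategy A" is the specific GAMLSS stepwise procedure studied in the paper, and plain OLS stands in for the GAMLSS fit.

```python
import statsmodels.formula.api as smf

def forward_select(df, response, candidates):
    """Greedily add the term that most lowers AIC; stop when none helps."""
    selected, best_aic = [], float("inf")
    while candidates:
        scores = [
            (smf.ols(f"{response} ~ " + " + ".join(selected + [term]),
                     data=df).fit().aic, term)
            for term in candidates
        ]
        aic, term = min(scores)
        if aic >= best_aic:        # no candidate improves the fit criterion
            break
        best_aic = aic
        selected.append(term)
        candidates.remove(term)
    return selected
```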