Home
Search

Journal of Data Science

Submit your article Information

Journal home
To appear
Current issue
All issues
More
Journal home To appear Current issue All issues

Detailed search

Title

Author

Types

Abstract

Keywords

Published

Pages

Volumes

Issues

DOI

Affiliation

Search results 877

Order by:

Select: All None Download:

Covid-19 Vaccine Efficacy: Accuracy Assessment, Comparison, and Caveats

Wenjiang Fu Jieni Li Paul A. Scheet

https://doi.org/10.6339/23-JDS1089

Pub. online: 7 Mar 2023 Type: Statistical Data Science

Open Access

Journal: Journal of Data Science Volume 22, Issue 1 (2024), pp. 45–55

Abstract

Vaccine efficacy is a key index to evaluate vaccines in initial clinical trials during the development of vaccines. In particular, it plays a crucial role in authorizing Covid-19 vaccines. It has been reported that Covid-19 vaccine efficacy varies with a number of factors, including demographics of population, time after vaccine administration, and virus strains. By examining clinical trial data of three Covid-19 vaccine studies, we find that current approach to evaluating vaccines with an overall efficacy does not provide desired accuracy. It requires no time frame during which a candidate vaccine is evaluated, and is subject to misuse, resulting in potential misleading information and interpretation. In particular, we illustrate with clinical trial data that the variability of vaccine efficacy is underestimated. We demonstrate that a new method may help to address these caveats. It leads to accurate estimation of the variation of efficacy, provides useful information to define a reasonable time frame to evaluate vaccines, and avoids misuse of vaccine efficacy and misleading information.

Assessment of Projection Pursuit Index for Classifying High Dimension Low Sample Size Data in R

Zhaoxing Wu Chunming Zhang

https://doi.org/10.6339/23-JDS1096

Pub. online: 2 Mar 2023 Type: Computing In Data Science

Open Access

Journal: Journal of Data Science Volume 21, Issue 2 (2023): Special Issue: Symposium Data Science and Statistics 2022, pp. 310–332

Abstract

Analyzing “large p small n” data is becoming increasingly paramount in a wide range of application fields. As a projection pursuit index, the Penalized Discriminant Analysis ($\mathrm{PDA}$) index, built upon the Linear Discriminant Analysis ($\mathrm{LDA}$) index, is devised in Lee and Cook (2010) to classify high-dimensional data with promising results. Yet, there is little information available about its performance compared with the popular Support Vector Machine ($\mathrm{SVM}$). This paper conducts extensive numerical studies to compare the performance of the $\mathrm{PDA}$ index with the $\mathrm{LDA}$ index and $\mathrm{SVM}$, demonstrating that the $\mathrm{PDA}$ index is robust to outliers and able to handle high-dimensional datasets with extremely small sample sizes, few important variables, and multiple classes. Analyses of several motivating real-world datasets reveal the practical advantages and limitations of individual methods, suggesting that the $\mathrm{PDA}$ index provides a useful alternative tool for classifying complex high-dimensional data. These new insights, along with the hands-on implementation of the $\mathrm{PDA}$ index functions in the R package classPP, facilitate statisticians and data scientists to make effective use of both sets of classification tools.

Impact of Bias Correction of the Least Squares Estimation on Bootstrap Confidence Intervals for Bifurcating Autoregressive Models

Tamer Elbayoumi Sayed Mostafa

https://doi.org/10.6339/23-JDS1092

Pub. online: 24 Feb 2023 Type: Statistical Data Science

Open Access

Journal: Journal of Data Science Volume 22, Issue 1 (2024), pp. 25–44

Abstract

The least squares (LS) estimator of the autoregressive coefficient in the bifurcating autoregressive (BAR) model was recently shown to suffer from substantial bias, especially for small to moderate samples. This study investigates the impact of the bias in the LS estimator on the behavior of various types of bootstrap confidence intervals for the autoregressive coefficient and introduces methods for constructing bias-corrected bootstrap confidence intervals. We first describe several bootstrap confidence interval procedures for the autoregressive coefficient of the BAR model and present their bias-corrected versions. The behavior of uncorrected and corrected confidence interval procedures is studied empirically through extensive Monte Carlo simulations and two real cell lineage data applications. The empirical results show that the bias in the LS estimator can have a significant negative impact on the behavior of bootstrap confidence intervals and that bias correction can significantly improve the performance of bootstrap confidence intervals in terms of coverage, width, and symmetry.

Neural Generalized Ordinary Differential Equations with Layer-Varying Parameters

Duo Yu Hongyu Miao Hulin Wu

https://doi.org/10.6339/23-JDS1093

Pub. online: 23 Feb 2023 Type: Statistical Data Science

Open Access

Journal: Journal of Data Science Volume 22, Issue 1 (2024), pp. 10–24

Abstract

Deep residual networks (ResNets) have shown state-of-the-art performance in various real-world applications. Recently, the ResNets model was reparameterized and interpreted as solutions to a continuous ordinary differential equation or Neural-ODE model. In this study, we propose a neural generalized ordinary differential equation (Neural-GODE) model with layer-varying parameters to further extend the Neural-ODE to approximate the discrete ResNets. Specifically, we use nonparametric B-spline functions to parameterize the Neural-GODE so that the trade-off between the model complexity and computational efficiency can be easily balanced. It is demonstrated that ResNets and Neural-ODE models are special cases of the proposed Neural-GODE model. Based on two benchmark datasets, MNIST and CIFAR-10, we show that the layer-varying Neural-GODE is more flexible and general than the standard Neural-ODE. Furthermore, the Neural-GODE enjoys the computational and memory benefits while performing comparably to ResNets in prediction accuracy.

Analyzing the Rainfall Pattern in Honduras Through Non-Homogeneous Hidden Markov Models

Gustavo Alexis Sabillón Daiane Aparecida Zuanetti

https://doi.org/10.6339/23-JDS1091

Pub. online: 22 Feb 2023 Type: Data Science In Action

Open Access

Journal: Journal of Data Science Volume 21, Issue 4 (2023), pp. 799–817

Abstract

One of the major climatic interests of the last decades has been to understand and describe the rainfall patterns of specific areas of the world as functions of other climate covariates. We do it for the historical climate monitoring data from Tegucigalpa, Honduras, using non-homogeneous hidden Markov models (NHMMs), which are dynamic models usually used to identify and predict heterogeneous regimes. For estimating the NHMM in an efficient and scalable way, we propose the stochastic Expectation-Maximization (EM) algorithm and a Bayesian method, and compare their performance in synthetic data. Although these methodologies have already been used for estimating several other statistical models, it is not the case of NHMMs which are still widely fitted by the traditional EM algorithm. We observe that, under tested conditions, the performance of the Bayesian and stochastic EM algorithms is similar and discuss their slight differences. Analyzing the Honduras rainfall data set, we identify three heterogeneous rainfall periods and select temperature and humidity as relevant covariates for explaining the dynamic relation among these periods.

Binary Classification of Malignant Mesothelioma: A Comparative Study

Ted Si Yuan Cheng Xiyue Liao

https://doi.org/10.6339/23-JDS1090

Pub. online: 14 Feb 2023 Type: Data Science In Action

Open Access

Journal: Journal of Data Science Volume 21, Issue 2 (2023): Special Issue: Symposium Data Science and Statistics 2022, pp. 205–224

Abstract

Malignant mesotheliomas are aggressive cancers that occur in the thin layer of tissue that covers most commonly the linings of the chest or abdomen. Though the cancer itself is rare and deadly, early diagnosis will help with treatment and improve outcomes. Mesothelioma is usually diagnosed in the later stages. Symptoms are similar to other, more common conditions. As such, predicting and diagnosing mesothelioma early is essential to starting early treatment for a cancer that is often diagnosed too late. The goal of this comprehensive empirical comparison is to determine the best-performing model based on recall (sensitivity). We particularly wish to avoid false negatives, as it is costly to diagnose a patient as healthy when they actually have cancer. Model training will be conducted based on k-fold cross validation. Random forest is chosen as the optimal model. According to this model, age and duration of asbestos exposure are ranked as the most important features affecting diagnosis of mesothelioma.

Random Forest of Interaction Trees for Estimating Individualized Treatment Regimes with Ordered Treatment Levels in Observational Studies

Justin Thorp Richard A. Levine

Luo Li All authors (4)

https://doi.org/10.6339/23-JDS1084

Pub. online: 2 Feb 2023 Type: Statistical Data Science

Open Access

Journal: Journal of Data Science Volume 21, Issue 2 (2023): Special Issue: Symposium Data Science and Statistics 2022, pp. 391–411

Abstract

Traditional methods for evaluating a potential treatment have focused on the average treatment effect. However, there exist situations where individuals can experience significantly heterogeneous responses to a treatment. In these situations, one needs to account for the differences among individuals when estimating the treatment effect. Li et al. (2022) proposed a method based on random forest of interaction trees (RFIT) for a binary or categorical treatment variable, while incorporating the propensity score in the construction of random forest. Motivated by the need to evaluate the effect of tutoring sessions at a Math and Stat Learning Center (MSLC), we extend their approach to an ordinal treatment variable. Our approach improves upon RFIT for multiple treatments by incorporating the ordered structure of the treatment variable into the tree growing process. To illustrate the effectiveness of our proposed method, we conduct simulation studies where the results show that our proposed method has a lower mean squared error and higher optimal treatment classification, and is able to identify the most important variables that impact the treatment effect. We then apply the proposed method to estimate how the number of visits to the MSLC impacts an individual student’s probability of passing an introductory statistics course. Our results show that every student is recommended to go to the MSLC at least once and some can drastically improve their chance of passing the course by going the optimal number of times suggested by our analysis.

Identifying Drone Web Sites in Multiple Countries and Languages with a Single Model

Piet Daas

Blanca de Miguel

Maria de Miguel

https://doi.org/10.6339/23-JDS1087

Pub. online: 26 Jan 2023 Type: Data Science In Action

Open Access

Journal: Journal of Data Science Volume 21, Issue 2 (2023): Special Issue: Symposium Data Science and Statistics 2022, pp. 225–238

Abstract

A text-based, bag-of-words, model was developed to identify drone company websites for multiple European countries in different languages. A collection of Spanish drone and non-drone websites was used for initial model development. Various classification methods were compared. Supervised logistic regression (L2-norm) performed best with an accuracy of 87% on the unseen test set. The accuracy of the later model improved to 88% when it was trained on texts in which all Spanish words were translated into English. Retraining the model on texts in which all typical Spanish words, such as names of cities and regions, and words indicative for specific periods in time, such as the months of the year and days of the week, were removed did not affect the overall performance of the model and made it more generally applicable. Applying the cleaned, completely English word-based, model to a collection of Irish and Italian drone and non-drone websites revealed, after manual inspection, that it was able to detect drone websites in those countries with an accuracy of 82 and 86%, respectively. The classification of Italian texts required the creation of a translation list in which all 1560 English word-based features in the model were translated to their Italian analogs. Because the model had a very high recall, 93, 100, and 97% on Spanish, Irish and Italian drone websites respectively, it was particularly well suited to select potential drone websites in large collections of websites.

Comparing Extreme Value Estimation Techniques for Short-Term Snow Accumulations

Kenneth Pomeyie

Brennan Bean Yan Sun

https://doi.org/10.6339/23-JDS1086

Pub. online: 25 Jan 2023 Type: Statistical Data Science

Open Access

Journal: Journal of Data Science Volume 21, Issue 2 (2023): Special Issue: Symposium Data Science and Statistics 2022, pp. 368–390

Abstract

The potential weight of accumulated snow on the roof of a structure has long been an important consideration in structure design. However, the historical approach of modeling the weight of snow on structures is incompatible for structures with surfaces and geometry where snow is expected to slide off of the structure, such as standalone solar panels. This paper proposes a “storm-level” adaptation of previous structure-related snow studies that is designed to estimate short-term, rather than season-long, accumulations of the snow water equivalent or snow load. One key development associated with this paper includes a climate-driven random forests model to impute missing snow water equivalent values at stations that measure only snow depth in order to produce continuous snow load records. Additionally, the paper compares six different approaches of extreme value estimation on short-term snow accumulations. The results of this study indicate that, when considering the 50-year mean recurrence interval (MRI) for short-term snow accumulations across different weather station types, the traditional block maxima approach, the mean-adjusted quantile method with a gamma distribution approach, and the peak over threshold Bayesian approach tend to most often provide MRI estimates near the median of all six approaches considered in this study. Further, this paper also shows, via bootstrap simulation, that the peak over threshold extreme value estimation using automatic threshold selection approaches tend to have higher variance compared to the other approaches considered. The results suggest that there is no one-size-fits-all option for extreme value estimation of short-term snow accumulations, but highlights the potential value from integrating multiple extreme value estimation approaches.

Central Posterior Envelopes for Bayesian Functional Principal Component Analysis

Joanna Boland Donatello Telesca Catherine Sugar All authors (8)

https://doi.org/10.6339/23-JDS1085

Pub. online: 19 Jan 2023 Type: Statistical Data Science

Open Access

Journal: Journal of Data Science Volume 21, Issue 4 (2023), pp. 715–734

Abstract

Bayesian methods provide direct uncertainty quantification in functional data analysis applications without reliance on bootstrap techniques. A major tool in functional data applications is the functional principal component analysis which decomposes the data around a common mean function and identifies leading directions of variation. Bayesian functional principal components analysis (BFPCA) provides uncertainty quantification on the estimated functional model components via the posterior samples obtained. We propose central posterior envelopes (CPEs) for BFPCA based on functional depth as a descriptive visualization tool to summarize variation in the posterior samples of the estimated functional model components, contributing to uncertainty quantification in BFPCA. The proposed BFPCA relies on a latent factor model and targets model parameters within a hierarchical modeling framework using modified multiplicative gamma process shrinkage priors on the variance components. Functional depth provides a center-outward order to a sample of functions. We utilize modified band depth and modified volume depth for ordering of a sample of functions and surfaces, respectively, to derive at CPEs of the mean and eigenfunctions within the BFPCA framework. The proposed CPEs are showcased in extensive simulations. Finally, the proposed CPEs are applied to the analysis of a sample of power spectral densities from resting state electroencephalography where they lead to novel insights on diagnostic group differences among children diagnosed with autism spectrum disorder and their typically developing peers across age.

9 10 11 12 13

Items per page

Export citation

Copy and paste formatted citation

Formatted citation

Placeholder

Citation style

Download citation in file

Export format

Authors

Placeholder

RSS

Journal of data science

Online ISSN: 1683-8602
Print ISSN: 1680-743X

About

About journal

For contributors

Submit
OA Policy
Become a Peer-reviewer

Contact us

JDS@ruc.edu.cn
No. 59 Zhongguancun Street, Haidian District Beijing, 100872, P.R. China