Journal of Data Science


Keywords: missing data


Search results 14


Pseudo Partial Likelihood Method for Proportional Hazards Models when Time Origin Is Missing for Control Group with Applications to SARS-CoV-2 Seroprevalence Study

Yunro Chung

Vel Murugan Kassu Mehari Beyene All authors (4)

https://doi.org/10.6339/25-JDS1199

Pub. online: 7 Oct 2025 Type: Statistical Data Science

Open Access

Journal: Journal of Data Science

Abstract

Time-to-event data analysis without a well-defined time origin commonly occurs in observational studies that retrospectively collect survival endpoints. For instance, after enrolling participants who have or have not received a specific treatment, an event status can be observed for all participants; however, the start date of treatment is only observable for the treatment group. The corresponding time origin does not exist for the control group, resulting in missing survival time data. Complete-case analysis is often considered the standard approach, but it disregards information from all participants in the control group and does not allow us to compare their survival distributions. To address this challenge, we propose a novel semiparametric proportional hazards model by regarding these missing time origins as nuisance parameters. We approximate the risk sets as cumulative normal distributions to deal with these nuisance parameters and develop estimation and inference procedures for our proposed estimator. We study the asymptotic properties of this model and conduct simulation studies to validate its finite-sample properties. Analysis of data from a recent SARS-CoV-2 seroprevalence study illustrates the applicability of our methods. The proposed methods are implemented in the R package coxphm.

Variable Selection with FDR Control for Noisy Data – An Application to Screening Metabolites that Are Associated with Breast Cancer and Colorectal Cancer

Runqiu Wang Ran Dai

Ying Huang All authors (8)

https://doi.org/10.6339/25-JDS1166

Pub. online: 11 Jun 2025 Type: Statistical Data Science

Open Access

Journal: Journal of Data Science Volume 23, Issue 3 (2025): Special Issue: 2024 WNAR/IMS/Graybill Annual Meeting, pp. 499–520

Abstract

The rapidly expanding field of metabolomics presents an invaluable resource for understanding the associations between metabolites and various diseases. However, the high dimensionality, presence of missing values, and measurement errors associated with metabolomics data can present challenges in developing reliable and reproducible approaches for disease association studies. Therefore, there is a compelling need for robust statistical analyses that can navigate these complexities to achieve reliable and reproducible disease association studies. In this paper, we construct algorithms to perform variable selection for noisy data and control the False Discovery Rate when selecting mutual metabolomic predictors for multiple disease outcomes. We illustrate the versatility and performance of this procedure in a variety of scenarios, dealing with missing data and measurement errors. As a specific application of this novel methodology, we target two of the most prevalent cancers among US women: breast cancer and colorectal cancer. By applying our method to the Women’s Health Initiative data, we successfully identify metabolites that are associated with either or both of these cancers, demonstrating the practical utility and potential of our method in identifying consistent risk factors and understanding shared mechanisms between diseases.

An Innovative Method of Singular Spectrum Analysis to Conduct Gap-filling and Denoising on Time Series Data

James J. Yang

Anne Buu

https://doi.org/10.6339/25-JDS1164

Pub. online: 28 Jan 2025 Type: Statistical Data Science

Open Access

Journal: Journal of Data Science

Abstract

Heart rate data collected from wearable devices – one type of time series data – could provide insights into activities, stress levels, and health. Yet, consecutive missing segments (i.e., gaps) that commonly occur due to improper device placement or device malfunction could distort the temporal patterns inherent in the data and undermine the validity of downstream analyses. This study proposes an innovative iterative procedure to fill gaps in time series data that capitalizes on the denoising capability of Singular Spectrum Analysis (SSA) and eliminates SSA’s requirement of pre-specifying the window length and number of groups. The results of simulations demonstrate that the performance of SSA-based gap-filling methods depends on the choice of window length, number of groups, and the percentage of missing values. In contrast, the proposed method consistently achieves the lowest rates of reconstruction error and gap-filling error across a variety of combinations of the factors manipulated in the simulations. The simulation findings also highlight that the commonly recommended long window length – half of the time series length – may not apply to time series with varying frequencies such as heart rate data. The initialization step of the proposed method that involves a large window length and the first four singular values in the iterative singular value decomposition process not only avoids convergence issues but also facilitates imputation accuracy in subsequent iterations. The proposed method provides the flexibility for researchers to conduct gap-filling solely or in combination with denoising on time series data and thus widens the applications.
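For readers unfamiliar with SSA-based gap-filling, the following is a minimal Python sketch of the conventional iterative scheme that the abstract compares against, not the authors' proposed procedure. It assumes a user-chosen window length and rank (the very pre-specification the paper aims to remove), and the function names `ssa_reconstruct` and `ssa_gap_fill` are hypothetical.

```python
import numpy as np

def ssa_reconstruct(x, window, rank):
    """Rank-limited SSA reconstruction of a complete series x."""
    n = len(x)
    k = n - window + 1
    # Trajectory (Hankel) matrix: each column is one lagged window.
    traj = np.column_stack([x[i:i + window] for i in range(k)])
    u, s, vt = np.linalg.svd(traj, full_matrices=False)
    low = (u[:, :rank] * s[:rank]) @ vt[:rank]
    # Diagonal averaging maps the low-rank matrix back to a series.
    recon = np.zeros(n)
    counts = np.zeros(n)
    for j in range(k):
        recon[j:j + window] += low[:, j]
        counts[j:j + window] += 1
    return recon / counts

def ssa_gap_fill(x, window, rank, n_iter=100, tol=1e-8):
    """Fill NaN gaps by repeatedly replacing them with the SSA reconstruction."""
    x = np.asarray(x, dtype=float)
    gaps = np.isnan(x)
    # Start from the mean of the observed values (an assumed initialization).
    filled = np.where(gaps, np.nanmean(x), x)
    for _ in range(n_iter):
        recon = ssa_reconstruct(filled, window, rank)
        change = np.max(np.abs(recon[gaps] - filled[gaps])) if gaps.any() else 0.0
        filled[gaps] = recon[gaps]  # observed values are never overwritten
        if change < tol:
            break
    return filled
```

In this sketch only the gap positions are updated, so the observed samples are preserved exactly; denoising would instead return the full reconstruction.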

Predictive Mean Matching Imputation Procedure Based on Machine Learning Models for Complex Survey Data

Sixia Chen Chao Xu

https://doi.org/10.6339/24-JDS1135

Pub. online: 10 Jul 2024 Type: Statistical Data Science

Open Access

Journal: Journal of Data Science Volume 22, Issue 3 (2024): Special issue: The Government Advances in Statistical Programming (GASP) 2023 conference, pp. 456–468

Abstract

Missing data is a common occurrence in various fields, spanning social science, education, economics, and biomedical research. Disregarding missing data in statistical analyses can introduce bias to study outcomes. To mitigate this issue, imputation methods have proven effective in reducing nonresponse bias and generating complete datasets for subsequent analysis of secondary data. The efficacy of imputation methods hinges on the assumptions of the underlying imputation model. While machine learning techniques such as regression trees, random forest, XGBoost, and deep learning have demonstrated robustness against model misspecification, their optimal performance may necessitate fine-tuning under specific conditions. Moreover, imputed values generated by these methods can sometimes deviate unnaturally, falling outside the normal range. To address these challenges, we propose a novel Predictive Mean Matching imputation (PMM) procedure that leverages popular machine learning-based methods. PMM strikes a balance between robustness and the generation of appropriate imputed values. In this paper, we present our innovative PMM approach and conduct a comparative performance analysis through Monte Carlo simulation studies, assessing its effectiveness against other established methods.
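The core idea of predictive mean matching can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure from the paper: ordinary least squares stands in for the machine learning predictor, survey weights are ignored, and the function name `pmm_impute` and the donor count `k` are hypothetical.

```python
import numpy as np

def pmm_impute(X, y, k=5, seed=0):
    """Impute NaNs in y by predictive mean matching on covariates X."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = ~np.isnan(y)
    # Stand-in predictor: least squares with an intercept (an assumption;
    # any regression model fitted on the observed cases could be used).
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design[observed], y[observed], rcond=None)
    pred = design @ beta
    donor_pred = pred[observed]
    donor_y = y[observed]
    out = y.copy()
    for i in np.flatnonzero(~observed):
        # The k donors whose predicted means are closest to this case's.
        nearest = np.argsort(np.abs(donor_pred - pred[i]))[:k]
        # Borrow the observed value of one randomly chosen donor.
        out[i] = donor_y[rng.choice(nearest)]
    return out
```

Because every imputed value is copied from an observed donor, the imputations cannot fall outside the observed range, which is the property the abstract contrasts with direct model predictions.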

Association Between Body Fat and Body Mass Index from Incomplete Longitudinal Proportion Data: Findings from the Fels Study

Xin Tong Seohyun Kim Dipankar Bandyopadhyay

All authors (4)

https://doi.org/10.6339/23-JDS1104

Pub. online: 15 Jun 2023 Type: Data Science In Action

Open Access

Journal: Journal of Data Science Volume 22, Issue 1 (2024), pp. 116–137

Abstract

Obesity rates continue to exhibit an upward trajectory, particularly in the US, and obesity is the underlying cause of several comorbidities, including but not limited to high blood pressure, high cholesterol, diabetes, heart disease, stroke, and cancers. To monitor obesity, body mass index (BMI) and proportion body fat (PBF) are two commonly used measurements. Although BMI and PBF change over an individual’s lifespan and their relationship may also change dynamically, existing work has mostly remained cross-sectional or has modeled BMI and PBF separately. A combined longitudinal assessment is expected to be more effective in unravelling their complex interplay. To address this gap, we consider Bayesian cross-domain latent growth curve models within a structural equation modeling framework, which simultaneously handle issues such as individually varying time metrics, proportion data, and potential missing not at random data for joint assessment of the longitudinal changes of BMI and PBF. Through simulation studies, we observe that our proposed models and estimation method yield parameter estimates with small bias and mean squared error in general; however, a misspecified missing data mechanism may cause inaccurate and inefficient parameter estimates. Furthermore, we demonstrate application of our method to a motivating longitudinal obesity study, controlling for both time-invariant (such as sex) and time-varying (such as diastolic and systolic blood pressure, biceps skinfold, bioelectrical impedance, and waist circumference) covariates in separate models. Under time-invariance, we observe that the initial BMI level and the rate of change in BMI influenced PBF. However, in the presence of time-varying covariates, only the initial BMI level influenced the initial PBF. The added-on selection model estimation indicated that observations with higher PBF values were less likely to be missing.

Active Data Science for Improving Clinical Risk Prediction

Donna P. Ankerst

Matthias Neumair

https://doi.org/10.6339/22-JDS1078

Pub. online: 23 Nov 2022 Type: Data Science In Action

Open Access

Journal: Journal of Data Science Volume 21, Issue 2 (2023): Special Issue: Symposium Data Science and Statistics 2022, pp. 177–192

Abstract

Clinical risk prediction models are commonly developed in a post-hoc and passive fashion, capitalizing on convenient data from completed clinical trials or retrospective cohorts. Impacts of the models often end at their publication rather than with the patients. The field of clinical risk prediction is rapidly improving in a progressively more transparent data science era. Based on collective experience over the past decade by the Prostate Biopsy Collaborative Group (PBCG), this paper proposes the following four data science-driven strategies for improving clinical risk prediction to the benefit of clinical practice and research. The first proposed strategy is to actively design prospective data collection, monitoring, analysis and validation of risk tools following the same standards as for clinical trials in order to elevate the quality of training data. The second suggestion is to make risk tools and model formulas available online. User-friendly risk tools will bring quantitative information to patients and their clinicians for improved knowledge-based decision-making. As past experience testifies, online tools expedite independent validation, providing helpful information as to whether the tools are generalizable to new populations. The third proposal is to dynamically update and localize risk tools to adapt to changing demographic and clinical landscapes. The fourth strategy is to accommodate systematic missing data patterns across cohorts in order to maximize the statistical power in model training, as well as to accommodate missing information on the end-user side too, in order to maximize utility for the public.

Tree-Based Missing Value Imputation Using Feature Selection

Heizel Rosado-Galindo Saylisse Dávila-Padilla

https://doi.org/10.6339/JDS.202010_18(4).0002

Pub. online: 4 Aug 2022 Type: Research Article

Open Access

Journal: Journal of Data Science Volume 18, Issue 4 (2020), pp. 606–631

Abstract

Researchers and practitioners in many areas of knowledge frequently struggle with missing data. Missing data is a problem because almost all standard statistical methods assume that the information is complete. Consequently, missing value imputation offers a solution to this problem. The main contribution of this paper lies in the development of a random forest-based imputation method (TI-FS) that can handle any type of data, including high-dimensional data with nonlinear complex interactions. The premise behind the proposed scheme is that a variable can be imputed considering only those variables that are related to it, as identified through feature selection. This work compares the performance of the proposed scheme with two other imputation methods commonly used in the literature: KNN and missForest. The results suggest that the proposed method can be useful in complex scenarios with categorical variables and a high volume of missing values, while reducing the number of variables used and their corresponding preliminary imputations.

Edition and Imputation of Multiple Time Series Data Generated by Repetitive Surveys

Victor M. Guerrero Blanca I. Gaspar

https://doi.org/10.6339/JDS.2010.08(4).623

Pub. online: 4 Aug 2022 Type: Research Article

Open Access

Journal: Journal of Data Science Volume 8, Issue 4 (2010), pp. 555–577

Abstract

This paper considers the statistical problems of editing and imputing data of multiple time series generated by repetitive surveys. The case under study is that of the Survey of Cattle Slaughter in Mexico’s Municipal Abattoirs. The proposed procedure consists of two phases. First, the data of each abattoir are edited to correct gross inconsistencies. Second, the missing data are imputed by means of restricted forecasting. This method uses all the historical and current information available for the abattoir, as well as multiple time series models from which efficient estimates of the missing data are obtained. Some empirical examples are shown to illustrate the usefulness of the method in practice.

Imputation Methods for Missing Categorical Questionnaire Data: A Comparison of Approaches

W. Holmes Finch

https://doi.org/10.6339/JDS.2010.08(3).612

Pub. online: 4 Aug 2022 Type: Research Article

Open Access

Journal: Journal of Data Science Volume 8, Issue 3 (2010), pp. 361–378

Abstract

Missing data are a common problem for researchers working with surveys and other types of questionnaires. Often, respondents do not respond to one or more items, making the conduct of statistical analyses, as well as the calculation of scores, difficult. A number of methods have been developed for dealing with missing data, though most of these have focused on continuous variables. It is not clear that these techniques for imputation are appropriate for the categorical items that make up surveys. However, methods of imputation specifically designed for categorical data are either limited in terms of the number of variables they can accommodate, or have not been fully compared with the continuous data approaches used with categorical variables. The goal of the current study was to compare the performance of these explicitly categorical imputation approaches with the better-established continuous method used with categorical item responses. Results of the simulation study based on real data demonstrate that the continuous-based imputation approach and a categorical method based on stochastic regression appear to perform well in terms of creating data that match the complete datasets in terms of logistic regression results.

Parametric Fractional Imputation for Longitudinal Data with Intermittent Missing Values

Ahmed M. Gad Hanan E. G. Ahmed

https://doi.org/10.6339/JDS.201904_17(2).0005

Pub. online: 4 Aug 2022 Type: Research Article

Open Access

Journal: Journal of Data Science Volume 17, Issue 2 (2019), pp. 331–348

Abstract

Longitudinal data analysis has been widely developed in the past three decades. Longitudinal data are common in many fields such as public health, medicine, and the biological and social sciences. Longitudinal data have a special nature, as an individual may be observed over a long period of time. Hence, missing values are common in longitudinal data. The presence of missing values leads to biased results and complicates the analysis. The missing values have two patterns: intermittent and dropout. The missing data mechanisms are missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). The appropriate analysis relies heavily on the assumed mechanism and pattern. The parametric fractional imputation is developed to handle longitudinal data with an intermittent missing pattern. The maximum likelihood estimates are obtained, and the jackknife method is used to obtain the standard errors of the parameter estimates. Finally, a simulation study is conducted to validate the proposed approach. The proposed approach is also applied to a real data set.
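As a rough illustration of how jackknife standard errors are computed in general, here is a minimal delete-one sketch in Python. It is a generic version that assumes independent observations and an arbitrary estimator function; the variant used with fractional imputation in the paper may differ, and the name `jackknife_se` is hypothetical.

```python
import numpy as np

def jackknife_se(data, estimator):
    """Delete-one jackknife standard error of estimator(data)."""
    data = np.asarray(data)
    n = len(data)
    # Re-estimate with each observation left out in turn.
    leave_one_out = np.array(
        [estimator(np.delete(data, i, axis=0)) for i in range(n)]
    )
    centered = leave_one_out - leave_one_out.mean(axis=0)
    # Jackknife variance: (n - 1) / n times the sum of squared deviations.
    return np.sqrt((n - 1) / n * np.sum(centered ** 2, axis=0))
```

For the sample mean, this reproduces the familiar standard error (sample standard deviation divided by the square root of n), which makes a convenient sanity check before applying it to a more complex estimator.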
