Pub. online: 23 Apr 2025 · Type: Statistical Data Science · Open Access
Journal: Journal of Data Science
Volume 23, Issue 2 (2025): Special Issue: the 2024 Symposium on Data Science and Statistics (SDSS), pp. 312–331
Abstract
The rapid accumulation and release of data have fueled research across various fields. While numerous methods exist for data collection and storage, data distribution presents challenges, as some datasets are restricted, and certain subsets may compromise privacy if released unaltered. Statistical disclosure control (SDC) aims to maximize data utility while minimizing the disclosure risk, i.e., the risk of individual identification. A key SDC method is data perturbation, with General Additive Data Perturbation (GADP) and Copula General Additive Data Perturbation (CGADP) being two prominent approaches. Both leverage multivariate normal distributions to generate synthetic data while preserving statistical properties of the original dataset. Given the increasing use of machine learning for data modeling, this study compares the performance of various machine learning models on GADP- and CGADP-perturbed data. Using Monte Carlo simulations with three data-generating models and a real dataset, we evaluate the predictive performance and robustness of ten machine learning techniques under data perturbation. Our findings provide insights into the machine learning techniques that perform robustly on GADP- and CGADP-perturbed datasets, extending previous research that primarily focused on simple statistics such as means, variances, and correlations.
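As a rough illustration (not the GADP or CGADP algorithm from the paper itself), the core idea behind moment-preserving perturbation can be sketched with NumPy: release values whose mean vector and covariance matrix match those estimated from the confidential data. The variable names and the simple moment-matching scheme below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Original confidential data: 3 correlated attributes, 5000 records
# (simulated here; a real agency would start from its microdata).
n, p = 5000, 3
true_cov = np.array([[1.0, 0.6, 0.3],
                     [0.6, 1.0, 0.4],
                     [0.3, 0.4, 1.0]])
X = rng.multivariate_normal(np.zeros(p), true_cov, size=n)

# Simplified perturbation in the spirit of GADP: release multivariate
# normal draws whose mean and covariance are estimated from the original
# data, so first and second moments are preserved in expectation while
# no released record corresponds to a real individual.
mu_hat = X.mean(axis=0)
cov_hat = np.cov(X, rowvar=False)
Y = rng.multivariate_normal(mu_hat, cov_hat, size=n)

# Moments of the released data track the originals closely.
print(np.max(np.abs(Y.mean(axis=0) - mu_hat)))
print(np.max(np.abs(np.cov(Y, rowvar=False) - cov_hat)))
```

Because only the first two moments are matched, statistics such as means, variances, and correlations survive the perturbation, which is exactly why the study asks how more flexible machine learning models behave on such data.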
Approximately 15% of adults in the United States (U.S.) are afflicted with chronic kidney disease (CKD). For CKD patients, the progressive decline of kidney function is intricately related to hospitalizations due to cardiovascular disease and eventual “terminal” events, such as kidney failure and mortality. To unravel the mechanisms underlying the disease dynamics of these interdependent processes, including identifying influential risk factors and tailoring decision-making to individual patient needs, we develop a novel Bayesian multivariate joint model for the intercorrelated outcomes of kidney function (as measured by longitudinal estimated glomerular filtration rate), recurrent cardiovascular events, and the competing-risk terminal events of kidney failure and death. The proposed joint modeling approach not only facilitates the exploration of risk factors associated with each outcome, but also allows dynamic updates of cumulative incidence probabilities for each competing risk for future subjects based on their baseline characteristics and a combined history of longitudinal measurements and recurrent events. We propose efficient and flexible estimation and prediction procedures within a Bayesian framework employing Markov chain Monte Carlo methods. The predictive performance of our model is assessed through dynamic areas under the receiver operating characteristic curve and the expected Brier score. We demonstrate the efficacy of the proposed methodology through extensive simulations. The proposed methodology is applied to data from the Chronic Renal Insufficiency Cohort study, established by the National Institute of Diabetes and Digestive and Kidney Diseases to address the rising epidemic of CKD in the U.S.
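The Brier score used above to assess dynamic predictions can be illustrated in a minimal form: compare predicted event probabilities at a horizon against observed binary outcomes. This sketch omits the censoring weights that a real expected Brier score for survival data requires, and all names and simulated quantities are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Predicted cumulative incidence probabilities for 200 subjects at some
# prediction horizon (here simulated; in the paper they come from the
# posterior of the joint model).
p_hat = rng.uniform(0.05, 0.95, size=200)

# Simulated observed outcomes: 1 if the event occurred by the horizon.
events = rng.binomial(1, p_hat)

# Brier score: mean squared difference between outcome and prediction.
# Lower is better; a perfectly calibrated, perfectly sharp predictor
# would score 0.
brier = np.mean((events - p_hat) ** 2)
print(round(brier, 3))
```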
Pub. online: 22 May 2024 · Type: Statistical Data Science · Open Access
Journal: Journal of Data Science
Volume 22, Issue 2 (2024): Special Issue: 2023 Symposium on Data Science and Statistics (SDSS): “Inquire, Investigate, Implement, Innovate”, pp. 259–279
Abstract
Predictive modeling often ignores interaction effects among predictors in high-dimensional data because of analytical and computational challenges. Research in interaction selection has been galvanized by methodological and computational advances. In this study, we investigate the performance of two types of predictive algorithms that can perform interaction selection. Specifically, we compare the predictive performance and interaction selection accuracy of penalty-based and tree-based predictive algorithms. The penalty-based algorithms included in our comparative study are the regularization path algorithm under the marginality principle (RAMP), the least absolute shrinkage and selection operator (LASSO), the smoothly clipped absolute deviation (SCAD) penalty, and the minimax concave penalty (MCP). The tree-based algorithms considered are random forest (RF) and iterative random forest (iRF). We evaluate the effectiveness of these algorithms under various regression and classification models with varying structures and dimensions. We assess predictive performance using the mean squared error for regression, and accuracy, sensitivity, specificity, balanced accuracy, and the F1 score for classification. We use interaction coverage to judge each algorithm's efficacy for interaction selection. Our findings reveal that the effectiveness of the selected algorithms varies with the number of predictors (data dimension) and the structure of the data-generating model, i.e., linear or nonlinear, hierarchical or non-hierarchical. At least one scenario favored each of the algorithms included in this study; nevertheless, general patterns allow us to recommend specific algorithms for certain scenarios. Our analysis clarifies each algorithm's strengths and limitations, offering guidance to researchers and data analysts in choosing an appropriate algorithm for their predictive modeling task based on their data structure.
This paper presents an empirical study of a recently compiled workforce analytics dataset modeling employment outcomes of Engineering students. The contributions reported in this paper won the data challenge of the ACM IKDD 2016 Conference on Data Science. Two problems are addressed: regression using heterogeneous information types, and the extraction of insights and trends from the data to make recommendations; both goals are supported by a range of visualizations. Although the dataset is specific to one nation, the underlying techniques and visualization methods are generally applicable. Gaussian processes (GPs) are proposed to model and predict salary as a function of heterogeneous independent attributes. The key novelties the GP approach brings to the domain of workforce analytics are (a) a statistically sound, data-dependent notion of prediction uncertainty; (b) automatic relevance determination of the various independent attributes with respect to the dependent variable (salary); (c) seamless incorporation of both numeric and string attributes within the same regression framework without dichotomization, where string attributes include single-word categorical attributes (e.g., gender), nominal attributes (e.g., college tier), and multi-word attributes (e.g., specialization); and (d) treatment of all data as correlated when making predictions. Insights from both the predictive modeling approaches and the data analysis were used to suggest factors that, if improved, might lead to better starting salaries for Engineering students. A range of visualization techniques were used to extract key employment patterns from the data.
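Points (a) and (b) can be sketched with scikit-learn's Gaussian process implementation (not necessarily what the authors used): an anisotropic RBF kernel learns one length-scale per attribute, so a large learned length-scale flags an irrelevant input (automatic relevance determination), and the posterior standard deviation provides data-dependent prediction uncertainty. Handling of string attributes, point (c), would require an additional kernel and is omitted here; the simulated data and names are illustrative assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(3)
n = 200

# Two numeric attributes; only the first actually drives the response.
X = rng.uniform(-2, 2, size=(n, 2))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=n)

# Anisotropic RBF kernel: one length-scale per input dimension (ARD).
# The WhiteKernel absorbs observation noise.
kernel = RBF(length_scale=[1.0, 1.0]) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

# After fitting, the irrelevant attribute's length-scale is driven large,
# and predictions come with a posterior standard deviation.
ls = gp.kernel_.k1.length_scale
mean, std = gp.predict(np.array([[0.5, 0.0]]), return_std=True)
print(ls)
print(mean[0], std[0])
```

The same mechanism, applied to salary data, is what lets the paper rank which student attributes matter for predicted salary while attaching honest uncertainty to each prediction.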