Impact of Data Perturbation for Statistical Disclosure Control on the Predictive Performance of Machine Learning Techniques
Pub. online: 23 April 2025
Type: Statistical Data Science
Open Access
Received: 17 August 2024
Accepted: 9 April 2025
Published: 23 April 2025
Abstract
The rapid accumulation and release of data have fueled research across various fields. While numerous methods exist for data collection and storage, data distribution presents challenges, as some datasets are restricted, and certain subsets may compromise privacy if released unaltered. Statistical disclosure control (SDC) aims to maximize data utility while minimizing the disclosure risk, i.e., the risk of individual identification. A key SDC method is data perturbation, with General Additive Data Perturbation (GADP) and Copula General Additive Data Perturbation (CGADP) being two prominent approaches. Both leverage multivariate normal distributions to generate synthetic data while preserving statistical properties of the original dataset. Given the increasing use of machine learning for data modeling, this study compares the performance of various machine learning models on GADP- and CGADP-perturbed data. Using Monte Carlo simulations with three data-generating models and a real dataset, we evaluate the predictive performance and robustness of ten machine learning techniques under data perturbation. Our findings provide insights into the machine learning techniques that perform robustly on GADP- and CGADP-perturbed datasets, extending previous research that primarily focused on simple statistics such as means, variances, and correlations.
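For concreteness, the following is a minimal GADP-style sketch in R, not the authors' implementation; the function gadp_perturb and the variable names X, S, and Y are illustrative assumptions. It replaces confidential variables X with their multivariate regression on the non-confidential variables S plus multivariate normal noise whose covariance matches the residual covariance, so that means, covariances, and the X-S relationships are preserved in expectation.

# Illustrative GADP-style perturbation sketch (simplified; not the paper's code).
library(MASS)  # for mvrnorm()

gadp_perturb <- function(X, S) {
  X <- as.matrix(X); S <- as.matrix(S)
  fit <- lm(X ~ S)                  # multivariate regression of X on S
  resid_cov <- cov(residuals(fit))  # residual covariance to reproduce
  noise <- mvrnorm(n = nrow(X), mu = rep(0, ncol(X)), Sigma = resid_cov)
  fitted(fit) + noise               # perturbed (masked) values Y
}

# Small simulated example
set.seed(1)
S <- matrix(rnorm(200 * 2), ncol = 2)                  # non-confidential attributes
X <- S %*% matrix(c(1, 0.5, -0.3, 0.8), nrow = 2) +    # confidential attributes
  matrix(rnorm(200 * 2, sd = 0.5), ncol = 2)
Y <- gadp_perturb(X, S)
round(cov(X) - cov(Y), 2)  # covariance structure approximately preserved

The perturbed values Y would then stand in for X when fitting the machine learning models whose predictive performance the paper compares; CGADP extends this idea to non-normal data through a copula transformation.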
Supplementary material
The supplementary material includes the following: (1) README: a brief explanation of the supplementary material; (2) a detailed description of the predictive machine learning techniques compared in this paper and additional simulation results; and (3) R code files.