Impact of Data Perturbation for Statistical Disclosure Control on the Predictive Performance of Machine Learning Techniques
Pub. online: 23 April 2025
Type: Statistical Data Science
Open Access
Received: 17 August 2024
Accepted: 9 April 2025
Published: 23 April 2025
Abstract
The rapid accumulation and release of data have fueled research across various fields. While numerous methods exist for data collection and storage, data distribution presents challenges, as some datasets are restricted, and certain subsets may compromise privacy if released unaltered. Statistical disclosure control (SDC) aims to maximize data utility while minimizing the disclosure risk, i.e., the risk of individual identification. A key SDC method is data perturbation, with General Additive Data Perturbation (GADP) and Copula General Additive Data Perturbation (CGADP) being two prominent approaches. Both leverage multivariate normal distributions to generate synthetic data while preserving statistical properties of the original dataset. Given the increasing use of machine learning for data modeling, this study compares the performance of various machine learning models on GADP- and CGADP-perturbed data. Using Monte Carlo simulations with three data-generating models and a real dataset, we evaluate the predictive performance and robustness of ten machine learning techniques under data perturbation. Our findings provide insights into the machine learning techniques that perform robustly on GADP- and CGADP-perturbed datasets, extending previous research that primarily focused on simple statistics such as means, variances, and correlations.
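For concreteness, the following is a minimal GADP-style sketch in R, not the authors' implementation; the function gadp_perturb and the variable names X, S, and Y are illustrative assumptions. It replaces confidential variables X with their multivariate regression on the non-confidential variables S plus multivariate normal noise whose covariance matches the residual covariance, so that means, covariances, and the X-S relationships are preserved in expectation.

# Illustrative GADP-style perturbation sketch (simplified; not the paper's code).
library(MASS)  # for mvrnorm()

gadp_perturb <- function(X, S) {
  X <- as.matrix(X); S <- as.matrix(S)
  fit <- lm(X ~ S)                  # multivariate regression of X on S
  resid_cov <- cov(residuals(fit))  # residual covariance to reproduce
  noise <- mvrnorm(n = nrow(X), mu = rep(0, ncol(X)), Sigma = resid_cov)
  fitted(fit) + noise               # perturbed (masked) values Y
}

# Small simulated example
set.seed(1)
S <- matrix(rnorm(200 * 2), ncol = 2)                  # non-confidential attributes
X <- S %*% matrix(c(1, 0.5, -0.3, 0.8), nrow = 2) +    # confidential attributes
  matrix(rnorm(200 * 2, sd = 0.5), ncol = 2)
Y <- gadp_perturb(X, S)
round(cov(X) - cov(Y), 2)  # covariance structure approximately preserved

The perturbed values Y would then stand in for X when fitting the machine learning models whose predictive performance the paper compares; CGADP extends this idea to non-normal data through a copula transformation.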
Supplementary material
The supplementary material includes the following: (1) README: a brief explanation of the supplementary material; (2) a detailed description of the predictive machine learning techniques compared in this paper and additional simulation results; and (3) R code files.