Journal of Data Science


Impact of Data Perturbation for Statistical Disclosure Control on the Predictive Performance of Machine Learning Techniques
Volume 23, Issue 2 (2025): Special Issue: the 2024 Symposium on Data Science and Statistics (SDSS), pp. 312–331
Thomas Johnson III, Sayed A. Mostafa

https://doi.org/10.6339/25-JDS1186
Pub. online: 23 April 2025 · Type: Statistical Data Science · Open Access

Received: 17 August 2024
Accepted: 9 April 2025
Published: 23 April 2025

Abstract

The rapid accumulation and release of data have fueled research across various fields. While numerous methods exist for data collection and storage, data distribution presents challenges, as some datasets are restricted, and certain subsets may compromise privacy if released unaltered. Statistical disclosure control (SDC) aims to maximize data utility while minimizing the disclosure risk, i.e., the risk of individual identification. A key SDC method is data perturbation, with General Additive Data Perturbation (GADP) and Copula General Additive Data Perturbation (CGADP) being two prominent approaches. Both leverage multivariate normal distributions to generate synthetic data while preserving statistical properties of the original dataset. Given the increasing use of machine learning for data modeling, this study compares the performance of various machine learning models on GADP- and CGADP-perturbed data. Using Monte Carlo simulations with three data-generating models and a real dataset, we evaluate the predictive performance and robustness of ten machine learning techniques under data perturbation. Our findings provide insights into the machine learning techniques that perform robustly on GADP- and CGADP-perturbed datasets, extending previous research that primarily focused on simple statistics such as means, variances, and correlations.
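
To make the perturbation idea concrete, the sketch below illustrates a GADP-style scheme in R under simplifying assumptions. It is not the authors' implementation (their R code is provided in the supplementary material); the function name gadp_perturb and the moment-matching steps are illustrative only. Confidential numeric attributes X are replaced by draws from a multivariate normal whose conditional mean and covariance given the non-confidential attributes S are estimated from the original data, so first and second moments are preserved in expectation.

library(MASS)  # mvrnorm() for multivariate normal draws

gadp_perturb <- function(X, S) {
  # X: confidential numeric attributes (n x p); S: non-confidential attributes (n x q)
  X <- as.matrix(X); S <- as.matrix(S)
  n <- nrow(X); p <- ncol(X); q <- ncol(S)
  # Estimate the joint mean vector and covariance matrix of (X, S)
  mu    <- colMeans(cbind(X, S))
  Sigma <- cov(cbind(X, S))
  Sxx <- Sigma[1:p, 1:p, drop = FALSE]
  Sxs <- Sigma[1:p, p + (1:q), drop = FALSE]
  Sss <- Sigma[p + (1:q), p + (1:q), drop = FALSE]
  # Regression of X on S implied by the estimated covariances
  B <- Sxs %*% solve(Sss)                          # p x q coefficient matrix
  # Conditional mean of X given each record's S values
  cond_mean <- sweep(S, 2, mu[p + (1:q)]) %*% t(B)
  cond_mean <- sweep(cond_mean, 2, mu[1:p], "+")
  # Conditional covariance of X given S (identical for every record)
  cond_cov <- Sxx - B %*% t(Sxs)
  # Perturbed values: conditional mean plus multivariate normal noise
  Y <- cond_mean + MASS::mvrnorm(n, mu = rep(0, p), Sigma = cond_cov)
  colnames(Y) <- colnames(X)
  Y
}

For a data frame dat with confidential columns named in xs and non-confidential columns named in ss (hypothetical names), dat[xs] <- gadp_perturb(dat[xs], dat[ss]) would release moment-matched perturbed values in place of the originals; CGADP applies the same idea after a copula transformation so that non-normal marginals are better preserved.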

Supplementary material

The supplementary material includes the following: (1) README: a brief explanation of the supplementary material; (2) a detailed description of the predictive machine learning techniques compared in this paper and additional simulation results; and (3) R code files.



Copyright
2025 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.
Open access article under the CC BY license.

Keywords
data confidentiality; data perturbation; machine learning; predictive modeling; statistical disclosure control

