Inference for Optimal Differential Privacy Procedures for Frequency Tables

Li, Chengcheng; Wang, Naisyin; Xu, Gongjun

doi:10.6339/22-JDS1044

Journal of Data Science

Volume 20, Issue 2 (2022), pp. 253–276

Pub. online: 20 April 2022

Type: Statistical Data Science

Open Access

Received: 28 March 2022

Accepted: 30 March 2022

Published: 20 April 2022

Abstract

When data are released to the public, a vital concern is the risk of exposing personal information about the individuals who contributed to the data set. Many mechanisms have been proposed to protect individual privacy, but less attention has been paid to conducting valid inference in practice on the altered, privacy-protected data sets. For frequency tables, privacy-protecting perturbations often produce negative cell counts, and releasing such tables can undermine users’ confidence in the usefulness of the data. This paper focuses on releasing one-way frequency tables. We recommend an optimal mechanism that satisfies ϵ-differential privacy (DP) without producing negative cell counts. The procedure is optimal in the sense that the expected utility is maximized under a given privacy constraint. Valid inference procedures for testing goodness-of-fit are also developed for the DP-protected data. In particular, we propose a de-biased test statistic for the optimal procedure and derive its asymptotic distribution. We also introduce testing procedures for the commonly used Laplace and Gaussian mechanisms, which provide a good finite-sample approximation for the null distributions. Moreover, we provide the decay-rate requirements on the privacy regime under which the inference procedures are valid. We further consider common user practices, such as merging related or neighboring cells or integrating statistical information obtained across different data sources, and derive valid testing procedures when these operations occur. Simulation studies show that our inference results hold well even when the sample size is relatively small. Comparisons are carried out with the current field standards, including the Laplace, the Gaussian (both with and without post-processing that replaces negative cell counts with zeros), and the Binomial-Beta McClure-Reiter mechanisms. Finally, we apply our method to the National Center for Early Development and Learning’s (NCEDL) multi-state studies data to demonstrate its practical applicability.
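To make the negative-count issue mentioned in the abstract concrete, the following is a minimal sketch of the standard Laplace mechanism applied to a one-way frequency table, followed by the post-processing step that replaces negative counts with zeros. It is not the optimal mechanism proposed in the paper. The function names, the example counts, and the default sensitivity of 1 (which corresponds to neighboring data sets that differ by adding or removing one record) are assumptions made for illustration only.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential draws with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_release(counts, epsilon, sensitivity=1.0, rng=None):
    # Standard Laplace mechanism: add independent Laplace noise with
    # scale = sensitivity / epsilon to every cell of the table.
    # sensitivity=1.0 is an assumption (add/remove-one-record neighbors).
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    return [c + laplace_noise(scale, rng) for c in counts]


def clamp_at_zero(noisy_counts):
    # Post-processing: replace negative cell counts with zeros.
    # This keeps the privacy guarantee but biases small cells upward.
    return [max(0.0, x) for x in noisy_counts]


if __name__ == "__main__":
    rng = random.Random(2022)
    true_counts = [3, 0, 12, 1]  # hypothetical one-way table
    noisy = laplace_release(true_counts, epsilon=0.5, rng=rng)
    print("noisy  :", [round(x, 2) for x in noisy])
    print("clamped:", [round(x, 2) for x in clamp_at_zero(noisy)])
```

With small true counts and a small ϵ, some noisy cells will typically fall below zero. Clamping removes the negative values but shifts the mean of those cells, which is one reason a naive goodness-of-fit test on the released table may no longer have its nominal null distribution.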

Supplementary Material

The Supplementary Material, available online, includes proofs of the theoretical results and additional simulation results on inter- and intra-table merging.

References

Abowd JM, Vilhuber L (2008). How protective are synthetic data? In: International Conference on Privacy in Statistical Databases, 239–246. Springer, New York, U.S.

Avella-Medina M (2021). Privacy-preserving parametric inference: A case for robust statistics. Journal of the American Statistical Association, 116(534): 969–983.

Awan J, Slavković A (2018). Differentially private uniformly most powerful tests for binomial data. In: Advances in Neural Information Processing Systems (S Bengio, H Wallach, H Larochelle, K Grauman, N Cesa-Bianchi, R Garnett, eds.), volume 31. Curran Associates, Inc., New York, U.S.

Barrientos AF, Reiter JP, Machanavajjhala A, Chen Y (2019). Differentially private significance tests for regression coefficients. Journal of Computational and Graphical Statistics, 28(2): 440–453.

Bowen CM, Liu F (2020). Comparative study of differentially private data synthesis methods. Statistical Science, 35(2): 280–307.

Campbell Z, Bray A, Ritz A, Groce A (2018). Differentially private ANOVA testing. In: 2018 1st International Conference on Data Intelligence and Security (ICDIS), 281–285.

Canonne CL, Kamath G, Steinke T (2020). The discrete Gaussian for differential privacy. Advances in Neural Information Processing Systems, 33: 15676–15688.

Charest AS (2011). How can we analyze differentially private synthetic datasets? Journal of Privacy and Confidentiality, 2(2).

Chaudhuri K, Sarwate A, Sinha K (2012). Near-optimal differentially private principal components. In: Advances in Neural Information Processing Systems, 989–997.

Couch S, Kazan Z, Shi K, Bray A, Groce A (2019). Differentially private nonparametric hypothesis testing. In: Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, 737–751.

Degue KH, Le Ny J (2018). On differentially private Gaussian hypothesis testing. In: 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 842–847. IEEE.

Ding B, Nori H, Li P, Allen J (2018). Comparing population means under local differential privacy: with significance and power. In: Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.

Drechsler J (2011). Synthetic Datasets for Statistical Disclosure Control: Theory and Implementation, volume 201. Springer Science & Business Media, Berlin, Germany.

Dwork C, Kenthapadi K, McSherry F, Mironov I, Naor M (2006a). Our data, ourselves: privacy via distributed noise generation. In: Annual International Conference on the Theory and Applications of Cryptographic Techniques, 486–503. Springer.

Dwork C, McSherry F, Nissim K, Smith A (2006b). Calibrating noise to sensitivity in private data analysis. In: Theory of Cryptography Conference, 265–284. Springer.

Dwork C, Naor M, Pitassi T, Rothblum GN, Yekhanin S (2010). Pan-private streaming algorithms. In: ICS, 66–80.

Dwork C, Roth A (2014). The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4): 211–407.

Ferrando C, Wang S, Sheldon D (2020). Parametric bootstrap for differentially private confidence intervals. arXiv preprint: https://arxiv.org/abs/2006.07749.

Friedman A, Berkovsky S, Kaafar MA (2016). A differential privacy framework for matrix factorization recommender systems. User Modeling and User-Adapted Interaction, 26(5): 425–458.

Gaboardi M, Lim H, Rogers R, Vadhan S (2016). Differentially private Chi-squared hypothesis testing: Goodness of fit and independence testing. In: International Conference on Machine Learning, 2111–2120. PMLR.

Geng Q, Viswanath P (2014). The optimal mechanism in differential privacy. In: 2014 IEEE International Symposium on Information Theory, 2371–2375. IEEE.

Geng Q, Viswanath P (2015). The optimal noise-adding mechanism in differential privacy. IEEE Transactions on Information Theory, 62(2): 925–951.

Ghosh A, Roughgarden T, Sundararajan M (2012). Universally utility-maximizing privacy mechanisms. SIAM Journal on Computing, 41(6): 1673–1693.

Golle P, Partridge K (2009). On the anonymity of home/work location pairs. In: Pervasive Computing, 390–397. Springer, Berlin, Heidelberg.

Hay M, Machanavajjhala A, Miklau G, Chen Y, Zhang D (2016). Principled evaluation of differentially private algorithms using dpbench. In: Proceedings of the 2016 International Conference on Management of Data, 139–154.

Johnson A, Shmatikov V (2013). Privacy-preserving data exploration in genome-wide association studies. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1079–1087.

Kairouz P, Bonawitz K, Ramage D (2016). Discrete distribution estimation under local privacy. In: International Conference on Machine Learning, 2436–2444. PMLR.

Karwa V, Krivitsky PN, Slavković AB (2015). Sharing social network data: differentially private estimation of exponential family random graph models. arXiv preprint: https://arxiv.org/abs/1511.02930.

Karwa V, Slavković A (2016). Inference using noisy degrees: differentially private model and synthetic graphs. The Annals of Statistics, 44(1): 87–112.

Little RJ (1993). Statistical analysis of masked data. Journal of Official Statistics, 9(2): 407–426.

Liu C, He X, Chanyaswad T, Wang S, Mittal P (2019). Investigating statistical privacy frameworks from the perspective of hypothesis testing. Proceedings on Privacy Enhancing Technologies, 3: 233–254.

Clifford RM, Bryant D, Burchinal M, Barbarin O, Early D, Howes C, et al. (2017). National Center for Early Development and Learning Multistate Study of Pre-Kindergarten. Inter-university Consortium for Political and Social Research [distributor].

Machanavajjhala A, Kifer D, Abowd J, Gehrke J, Vilhuber L (2008). Privacy: Theory meets practice on the map. In: 2008 IEEE 24th International Conference on Data Engineering, 277–286. IEEE.

McClure D, Reiter JP (2012). Differential privacy and statistical disclosure risk measures: an investigation with binary synthetic data. Transactions on Data Privacy, 5(3): 535–552.

Mohammed N, Chen R, Fung BC, Yu PS (2011). Differentially private data release for data mining. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 493–501.

Narayanan A, Shmatikov V (2008). Robust de-anonymization of large sparse datasets. In: 2008 IEEE Symposium on Security and Privacy (sp 2008), 111–125.

Quick H (2019). Generating Poisson-distributed differentially private synthetic data. arXiv preprint: https://arxiv.org/abs/1906.00455.

Raab GM, Nowok B, Dibben C (2016). Practical data synthesis for large samples. Journal of Privacy and Confidentiality, 7(3): 67–97.

Raghunathan TE, Reiter JP, Rubin DB (2003). Multiple imputation for statistical disclosure limitation. Journal of Official Statistics, 19(1): 1–16.

Reiter JP (2005). Using CART to generate partially synthetic public use microdata. Journal of Official Statistics, 21(3): 441–462.

Rinott Y, O’Keefe CM, Shlomo N, Skinner C (2018). Confidentiality and differential privacy in the dissemination of frequency tables. Statistical Science, 33(3): 358–385.

Rogers R, Roth A, Smith A, Thakkar O (2016). Max-information, differential privacy, and post-selection hypothesis testing. In: 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), 487–494. IEEE.

Rubin DB (1993). Statistical disclosure limitation. Journal of Official Statistics, 9(2): 461–468.

Sheffet O (2017). Differentially private ordinary least squares. In: International Conference on Machine Learning, 3105–3114. PMLR.

Snoke J, Raab G, Nowok B, Dibben C, Slavkovic A (2016). General and specific utility measures for synthetic data. arXiv preprint: https://arxiv.org/abs/1604.06651.

Sweeney L (2013). Matching known patients to health records in Washington state data. arXiv preprint: https://arxiv.org/abs/1307.1370.

Task C, Clifton C (2016). Differentially private significance testing on paired-sample data. In: Proceedings of the 2016 SIAM International Conference on Data Mining (SDM), 153–161.

Vu D, Slavkovic A (2009). Differential privacy for clinical trial data: Preliminary evaluations. In: 2009 IEEE International Conference on Data Mining Workshops, 138–143.

Wang R, Li YF, Wang X, Tang H, Zhou X (2009). Learning your identity and disease from research papers: information leaks in genome wide association study. In: Proceedings of the 16th ACM Conference on Computer and Communications Security, 534–544.

Wang Y, Lee J, Kifer D (2015a). Revisiting differentially private hypothesis tests for categorical data. arXiv preprint: https://arxiv.org/abs/1511.03376.

Wang YX, Fienberg S, Smola A (2015b). Privacy for free: posterior sampling and stochastic gradient Monte Carlo. In: International Conference on Machine Learning, 2493–2502.

Wasserman L, Zhou S (2010). A statistical framework for differential privacy. Journal of the American Statistical Association, 105(489): 375–389.

Yu F, Fienberg SE, Slavković AB, Uhler C (2014). Scalable privacy-preserving data sharing methodology for genome-wide association studies. Journal of Biomedical Informatics, 50: 133–141.

© 2022 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.

Open access article under the CC BY license.

Keywords

goodness-of-fit; hypothesis testing; optimality; table merging

Funding

This research was partially supported by NSF SES-1846747.
