Inference for Optimal Differential Privacy Procedures for Frequency Tables
Volume 20, Issue 2 (2022), pp. 253–276
Pub. online: 20 April 2022
Type: Statistical Data Science
Open Access
Received
28 March 2022
Accepted
30 March 2022
Published
20 April 2022
Abstract
When releasing data to the public, a vital concern is the risk of exposing personal information about the individuals who contributed to the data set. Many mechanisms have been proposed to protect individual privacy, but less attention has been paid to conducting valid inference in practice on the altered, privacy-protected data sets. For frequency tables, the perturbations introduced for privacy protection often lead to negative cell counts, and releasing such tables can undermine users’ confidence in the usefulness of the data. This paper focuses on releasing one-way frequency tables. We recommend an optimal mechanism that satisfies ϵ-differential privacy (DP) without producing negative cell counts. The procedure is optimal in the sense that the expected utility is maximized under a given privacy constraint. Valid inference procedures for testing goodness-of-fit are also developed for the DP-protected data. In particular, we propose a de-biased test statistic for the optimal procedure and derive its asymptotic distribution. We also introduce testing procedures for the commonly used Laplace and Gaussian mechanisms, which provide a good finite-sample approximation to the null distributions. Moreover, we provide the decay-rate requirements on the privacy regime under which the inference procedures are valid. We further consider common user practices, such as merging related or neighboring cells or integrating statistical information obtained from different data sources, and derive valid testing procedures for when these operations occur. Simulation studies show that our inference results hold well even when the sample size is relatively small. We compare our method with the current field standards, including the Laplace and Gaussian mechanisms (both with and without post-processing that replaces negative cell counts with zeros) and the Binomial-Beta McClure-Reiter mechanism. Finally, we apply our method to the National Center for Early Development and Learning’s (NCEDL) multi-state studies data to demonstrate its practical applicability.
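As a rough illustration of the negative-count issue the abstract describes, the sketch below applies the standard Laplace mechanism to a one-way frequency table and then replaces negative cells with zeros. It is not the optimal mechanism proposed in the paper. The function names, the example counts, and the sensitivity of 2 (which assumes neighboring data sets differ by changing one individual's category) are assumptions made only for this example.

```python
import numpy as np


def laplace_release(counts, epsilon, sensitivity=2.0, rng=None):
    """Add independent Laplace noise with scale sensitivity/epsilon to each cell.

    sensitivity=2.0 is an assumption: changing one individual's category
    moves one unit out of one cell and into another.
    """
    rng = np.random.default_rng() if rng is None else rng
    counts = np.asarray(counts, dtype=float)
    scale = sensitivity / epsilon
    return counts + rng.laplace(loc=0.0, scale=scale, size=counts.shape)


def truncate_at_zero(noisy_counts):
    """Post-processing step: replace negative cell counts with zeros."""
    return np.maximum(noisy_counts, 0.0)


# Hypothetical one-way table with some small and empty cells.
rng = np.random.default_rng(0)
true_counts = np.array([120, 45, 3, 0, 12])

noisy = laplace_release(true_counts, epsilon=0.5, rng=rng)
cleaned = truncate_at_zero(noisy)

print("noisy counts:  ", np.round(noisy, 2))
print("after clipping:", np.round(cleaned, 2))
```

Cells with small true counts are the ones most likely to go negative after noise is added. Clipping them to zero removes the negative values but shifts the expected released count upward for those cells, which is one reason inference on post-processed tables needs its own treatment.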
Supplementary material
The Supplementary Material, available online, includes proofs of the theoretical results and additional simulation results on inter- and intra-table merging.