Privacy-preserving machine learning methods seek to train useful models that do not disclose information about the data on which they were trained. Such methods are vital when organizations train neural networks on sensitive individual-level data and seek to release the models publicly. This goal entails a trade-off between predictive performance (utility) and privacy protection. That trade-off makes privacy-preserving machine learning methods difficult to apply in practice, usually requiring extensive iteration and hyperparameter tuning. Yet practitioners often have little guidance for navigating competing statistical, computational, and privacy demands. We present an implementation algorithm for the Stochastic Weight Averaging–Gaussian Pseudo Posterior Mechanism (SWAG-PPM), a Bayesian differentially private deep learning method. The implementation algorithm focuses on the joint tuning of two key hyperparameters whose interaction governs model convergence and the privacy–utility trade-off. We introduce novel diagnostic tools to evaluate convergence and guide hyperparameter adjustments. Using a transformer model for occupational injury classification, we demonstrate that diagnostic-guided tuning with SWAG-PPM can achieve strong privacy protection and utility. While our case study uses a specific dataset and model architecture, all methodological steps can be applied to other settings where privacy risk is heterogeneously distributed.
Pub. online: 4 Aug 2022
Type: Research Article
Open Access
Journal: Journal of Data Science
Volume 18, Issue 5 (2020): Special Issue S1 in Chinese (with abstract in English), pp. 875–888
Abstract
Monitoring the development of COVID-19 is complex and challenging. Such surveillance rests on timely and accurate epidemic data, so quality control for the release of COVID-19 data is very important; it must account for the releasing agent, the content to be released, and the impact of the released data. We suggest that quality requirements for the release of COVID-19 data be based on a global perspective: the goal of open epidemic data is to create a valuable ecological chain in which all stakeholders are involved. As such, the collection, aggregation, and release of COVID-19 data should meet not only the data quality standards of official statistics and health statistics, but also the characteristics of epidemic statistics and the needs of pandemic prevention. The quality requirements should reflect the unique characteristics of the epidemic and be subject to public scrutiny. Integrating the perspectives of official statistics, health statistics, and open government data, we propose five quality dimensions for releasing COVID-19 data: accuracy, timeliness, systematicness, user-friendliness, and security. Through case studies of the official websites of Chinese provincial health commissions, we report quality problems in the current data release process and suggest improvements.