Privacy-Preserving Inference on the Ratio of Two Gaussians Using Sums

The ratio of two Gaussians is useful in many contexts of statistical inference. We discuss statistically valid inference of the ratio under Differential Privacy (DP). We use the delta method to derive the asymptotic distribution of the ratio estimator and use the Gaussian mechanism to provide (epsilon, delta)-DP guarantees. Like many statistics, the quantities involved in the inference of a ratio can be re-written as functions of sums, and sums are easy to work with for many reasons. In the context of DP, the sensitivity of a sum is easy to calculate. We focus on getting the correct coverage probability of 95% confidence intervals (CIs) of the DP ratio estimator. Our simulations show that the no-correction method, which ignores the DP noise, gives CIs that are too narrow to provide proper coverage for small samples. In our specific simulation scenario, the coverage of 95% CIs can fall below 10%. We propose two methods to mitigate the under-coverage issue, one based on Monte Carlo simulation and the other based on analytical correction. We show that the CIs of our methods have much better coverage with reasonable privacy budgets. In addition, our methods can handle weighted data when the weights are fixed and bounded.


Introduction
Ratios are used in many types of statistical analyses. Examples include the ratio of regression coefficients (Hirschberg and Lye, 2007), the therapeutic safety ratio (Dunlap and Silver, 1986), and the percent difference of the outcome metric between two arms in randomized experiments (Deng et al., 2018). The use case we focus on is model calibration. Being well calibrated is widely regarded as a desirable characteristic of a classification model (DeGroot and Fienberg, 1983). A model is said to be calibrated if the predicted scores (s) match the average of the true labels (y). Specifically, among observations with prediction s, the actual percent of positive labels is equal to s, for all values of s. This is intuitive: If a weather forecaster predicts the chance of rain is 80%, then we expect to observe rain in about 80 out of 100 such predictions for the forecaster to be considered valid (Miller, 1962) or reliable (Murphy, 1972). In practice, calibration curves are often used to visually check how calibrated a model is, where the observations are bucketed into K (usually 5, 10, or 20) groups by model score (s), and then the average of y for each group is plotted against the average of the model score s. A well calibrated model will have all points close to the 45-degree line. Equivalently, we want each ratio of average s and average y, which we call the calibration ratio, to be close to 1.
Statistical analysis has started to face another requirement: privacy protection. Online privacy, in particular, has become front and center for many organizations' analytical tasks (Abowd et al., 2019). Organizations and corporations are exploring potential solutions that preserve analytical functionality while protecting user privacy. Differential Privacy (DP) has become one of the more popular formal definitions of privacy (Dwork et al., 2006b), which can be achieved by adding random noise. DP by noise addition comes in two general variants: local DP (Kasiviswanathan et al., 2008), where random noise is added to the individual input data points, and central/global DP, where noise is added to the intermediate or final output. For example, Google uses local DP to collect the Chrome web browser's usage data (Erlingsson et al., 2014). Meta (formerly Facebook) has shared its plan to assess fairness in relation to race in the U.S. in privacy-preserving ways via a combination of Secure Multiparty Computation (SMPC) and global DP (Alao et al., 2021). In the fast-growing literature on differentially private confidence intervals, there are two main approaches. One approach relies on distributional assumptions (either directly or via large-sample theory), and the other uses resampling and simulation methods to numerically approximate the sampling distribution of estimators. Here we mention a small subset of works in the field. D'Orazio et al. (2015) examined the DP confidence interval for the difference-in-means estimator, and derived the sensitivity for the standard error of the difference-in-means to avoid adding noise separately to intermediate summary statistics. Movahedi et al. (2021) describe an industry deployment in a randomized controlled experiment setting, also focusing on the difference-in-means estimator, but using an alternative approach based on noisy intermediate sufficient statistics and an approximate sampling distribution.
Vu and Slavkovic (2009), Gaboardi et al. (2016), and Awan and Slavkovic (2020) study DP hypothesis testing for multinomial and binomial data. Karwa and Vadhan (2017) show how to construct conservative DP confidence intervals under normality without knowing the bounds in advance, but the resulting confidence intervals are usually too wide to be practically useful. Du et al. (2020) and Ferrando et al. (2020) improve upon Karwa and Vadhan (2017) using simulation to get practical confidence intervals for the mean estimation problem under normality. Brawner and Honaker (2018) use bootstrapping to compute DP confidence intervals along with a point estimate, without additional privacy budget under Concentrated DP. Covington et al. (2021), Evans et al. (2019), and Evans et al. (2021) are some recent efforts to provide unbiased DP inference to offset the biases introduced by some DP procedures such as winsorization.
To the best of the authors' knowledge, there is no existing work on Differentially Private statistical inference on the ratio of two random Gaussian variables. This work is an attempt to fill this gap. We propose a methodology to conduct statistically valid inference of ratio estimators under DP, with a specific focus on preventing under-coverage of confidence intervals. We also examine the case when the data is weighted (e.g., in a complex survey design). Our methods apply as long as the conditions for using the delta method are satisfied (see Deng et al. (2018) for a recent discussion).

Definitions and Methodology
This section defines the quantity of interest and the privacy semantics. We use n for the sample size, y for the label, and s for the score (the probability prediction from a classification model). Both y and s are non-negative. Further, l_y, u_y, l_s, and u_s are the lower and upper bounds on y and s, respectively. We focus on binary classification models, where the bounds on both y and s are [0, 1].
When the data is weighted, we use l_w, u_w for the lower and upper bounds of w, the sample weights, which are assumed to be fixed (e.g., design weights). We also assume that u_w is known, which is the case, for example, when the bounds are specified in the weight calibration step.

Calibration Ratio
Given a model, the calibration ratio is simply r = µ_s/µ_y, where µ_s and µ_y are the true means of s and y. An estimator of r is r̂ = s̄/ȳ. Note this estimator is statistically biased, but its bias is of order 1/n and vanishes quickly as the sample size increases. What's more interesting is its variance. We will use the fact that s̄/ȳ = (Σ_{i=1}^n s_i)/(Σ_{i=1}^n y_i) to work with the ratio of sums instead of means, since the ratio of sums is easier to handle for inference. With a slight abuse of notation, we use s̄ and ȳ to denote both the means when the data is not weighted and the weighted means when the data is weighted.

Differential Privacy
DP has grown to be one of the most influential privacy definitions in recent years. In this section we introduce the basic privacy semantics, definitions, and properties of DP. In this paper we focus on the classical Pure DP definition (ε-DP) and the Approximate DP definition, also known as (ε, δ)-DP. For a more complete treatment we refer readers to Dwork and Roth (2014).
A randomized algorithm satisfies the requirement of DP (Dwork et al., 2006b) if, for every two neighboring datasets that differ on exactly one record, and for every possible output, the probabilities of the output are close up to a multiplicative factor of e^ε ≈ 1 + ε, whether the randomized algorithm is applied to one dataset or the other. This is often called ε-DP or pure DP.
As we can see from the informal definition above, DP requires that the neighboring datasets result in essentially indistinguishable distributions of data releases; or more succinctly, close datasets have close outputs. This requires formal measures for 1) the distance between two datasets, and 2) the distance between two distributions of output. The choice of these two distance relations defines the flavor of DP.
There are two popular notions of neighboring datasets in the DP literature. One is called "add/remove-one," where we can get the neighboring dataset by either adding or removing one observation. The other one is called "change-one," where we get the neighboring dataset by changing the value of an observation, instead of adding/removing it to/from the dataset. The change-one definition can be seen as the result of removing one observation and then adding another (or in the reverse order). In this paper we use the "add/remove-one" definition of neighboring, because we intend to protect the sample sizes as well, in order to prevent certain privacy attacks such as membership inference attacks or tracing attacks (Dwork et al., 2017). Dwork et al. (2006a) relax the DP requirement by allowing for the violation of ε-DP with a (cryptographically) small probability δ. This is often called (ε, δ)-DP or Approximate DP. Formally, a randomized algorithm M : X^n → Y is (ε, δ)-DP if for all neighboring datasets X, X′ ∈ X^n and all outcomes T ⊆ Y we have Pr(M(X) ∈ T) ≤ e^ε Pr(M(X′) ∈ T) + δ.
Two properties of DP algorithms are relevant to this paper (Dwork and Roth, 2014):
1. Closure under composition: The composition of K differentially private mechanisms, where the kth mechanism is (ε_k, δ_k)-DP for 1 ≤ k ≤ K, is (Σ_{k=1}^K ε_k, Σ_{k=1}^K δ_k)-DP. This is known as basic composition, which we use in this paper. There are more advanced theorems with tighter composition bounds than the basic composition (Kairouz et al., 2017).
2. Immunity to post-processing: If an algorithm is (ε, δ)-DP, then any post-processing of its outputs (i.e., without going back and looking at the raw data again) is still (ε, δ)-DP.
DP provides a strong privacy guarantee for the worst-case scenario, at the cost of utility degradation. The privacy guarantee holds no matter how the data is distributed and what type of attack occurs, but the added noise makes statistical inference less precise. DP makes intuitive sense for robust predictive modeling or statistical inference (Dwork and Lei, 2009). The ultimate goal of a predictive model is to have accurate predictions out of sample, not in sample. Similarly, the ultimate goal of statistical inference is to generalize the conclusion beyond the sample at hand. As a result, a small change in the sample, or one observation in the DP case, should not change the model or the inference much.

Inference
For inference, the point estimate of the ratio is simply the ratio of the two (weighted) means, which is biased, but the bias goes away quickly as the sample size increases. So we instead focus on the confidence interval (CI), usually at the 95% confidence level. Due to the Central Limit Theorem, both the numerator and the denominator of r̂ are means of independent and identically distributed variables and are thus asymptotically Gaussian. For a ratio of two Gaussians, the delta method shows that the asymptotic distribution of r̂ is itself a Gaussian with variance

σ²_r̂ = σ²_s̄/µ²_ȳ − 2 (µ_s̄/µ³_ȳ) σ_ȳs̄ + (µ²_s̄/µ⁴_ȳ) σ²_ȳ,    (1)

where µ_s̄ and µ_ȳ are the means of s̄ and ȳ, σ²_s̄ and σ²_ȳ are their variances, and σ_ȳs̄ is their covariance. See Casella and Berger (2002) for a derivation.
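As a small illustration, the delta-method variance above can be coded directly. This is a sketch of the standard delta-method formula for a ratio of two correlated Gaussians; the function and argument names are ours, not from the paper's code:

```python
def ratio_variance(mu_s, mu_y, var_s, var_y, cov_sy):
    """Delta-method (asymptotic) variance of the ratio s_bar / y_bar.

    mu_s, mu_y: means of the numerator and denominator;
    var_s, var_y: their variances; cov_sy: their covariance.
    """
    return (var_s / mu_y**2
            - 2 * mu_s * cov_sy / mu_y**3
            + mu_s**2 * var_y / mu_y**4)
```

For example, with a noiseless, uncorrelated denominator (var_y = cov_sy = 0) the formula reduces to var_s / mu_y**2, as expected.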
As a result of the fact that both y and s are non-negative, the distribution of r̂ is often right-skewed. However, the CIs constructed using the delta method are symmetric. As a result, people sometimes either directly use log(r̂), or construct a CI for log(r̂) and exponentiate both limits back to the original scale. The asymptotic variance of log(r̂) can also be obtained using the delta method:

σ²_log(r̂) = σ²_s̄/µ²_s̄ − 2 σ_ȳs̄/(µ_s̄ µ_ȳ) + σ²_ȳ/µ²_ȳ,

where the quantities needed are the same as in Equation (1). In the rest of the paper, we will focus on the ratio scale and only briefly discuss the log scale in the Simulations and Results sections.

DP Mechanism
In statistics, many quantities of interest can be written as functions of sums, a fact we make use of here. In particular, sums are appealing in the context of DP because their sensitivity can be easily calculated. It is straightforward to re-write the plug-in estimators of the quantities in Equation (1) in terms of sums, where x is a placeholder for either s or y:

µ̂_x̄ = (Σ_{i=1}^n w_i x_i)/(Σ_{i=1}^n w_i),    (2)

σ̂²_x̄ = [(Σ_{i=1}^n w²_i)/(Σ_{i=1}^n w_i)²] [(Σ_{i=1}^n w_i x²_i)/(Σ_{i=1}^n w_i) − µ̂²_x̄],    (3)

σ̂_ȳs̄ = [(Σ_{i=1}^n w²_i)/(Σ_{i=1}^n w_i)²] [(Σ_{i=1}^n w_i s_i y_i)/(Σ_{i=1}^n w_i) − µ̂_s̄ µ̂_ȳ].    (4)

To be explicit, up to 7 sums are needed: Σ w_i, Σ w²_i, Σ w_i s_i, Σ w_i y_i, Σ w_i s²_i, Σ w_i y²_i, and Σ w_i s_i y_i. However, for a binary classification model, y is either 0 or 1, so Σ_{i=1}^n w_i y_i = Σ_{i=1}^n w_i y²_i, leading to 6 sums needed. Further, when the data is not weighted, i.e., w_i = 1 for all i, both Σ w_i and Σ w²_i reduce to n. The factor (Σ w²_i)/(Σ w_i)² in Equations (3) and (4) is the inverse of the effective sample size (Kish, 1965). Without weights, the effective sample size is simply n. The effective sample size indicates the loss of efficiency due to weighting.
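The sums and the plug-in estimates they yield can be computed as below. This is a sketch under our reading of the weighted plug-in forms (with the inverse Kish factor scaling the variance and covariance of the weighted means); the function and key names are ours:

```python
import numpy as np

def summary_sums(s, y, w):
    """The (up to) 7 sums that drive the inference; for binary y,
    sum(w*y) equals sum(w*y**2), so only 6 are distinct."""
    return {
        "w": np.sum(w), "w2": np.sum(w**2),
        "ws": np.sum(w * s), "wy": np.sum(w * y),
        "ws2": np.sum(w * s**2), "wy2": np.sum(w * y**2),
        "wsy": np.sum(w * s * y),
    }

def plug_in_estimates(t):
    """Plug-in means, variances, and covariance of the weighted means,
    from the dict of sums t; the inverse Kish effective sample size
    scales the second moments."""
    mu_s, mu_y = t["ws"] / t["w"], t["wy"] / t["w"]
    inv_neff = t["w2"] / t["w"]**2  # 1 / Kish effective sample size
    var_s = inv_neff * (t["ws2"] / t["w"] - mu_s**2)
    var_y = inv_neff * (t["wy2"] / t["w"] - mu_y**2)
    cov_sy = inv_neff * (t["wsy"] / t["w"] - mu_s * mu_y)
    return mu_s, mu_y, var_s, var_y, cov_sy
```

With unit weights, inv_neff reduces to 1/n and var_s reduces to the familiar (population variance of s)/n for the variance of a sample mean.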
Recall that one reason we use the sums is that their sensitivity can be easily obtained. Under the add/remove-one definition of neighboring datasets, the sensitivity of each sum is simply the summand with s, y, and w replaced by their (positive) upper bounds. In the binary classification case, the bounds for s and y are [0, 1], so the sensitivity of every sum is simply u_w.
We use the Gaussian mechanism to achieve (ε, δ)-DP (Dwork et al., 2006a), which adds Gaussian noise with standard deviation

σ = ∆ √(2 ln(1.25/δ)) / ε,    (5)

where ∆ is the sensitivity of the quantity being released. Improved methods are available so that less noise is needed (Balle and Wang, 2018), where the variance of the noise may have to be obtained numerically. Here, Σ_{i=1}^n w_i y_i, for example, will be released as (Σ_{i=1}^n w_i y_i)_dp = Σ_{i=1}^n w_i y_i + e, where we use a subscript dp to indicate the noisy quantity that can be released. The noise term e comes from a Gaussian distribution, e ∼ Gaussian(0, σ²), where σ is obtained by plugging ∆ = u_w into Equation (5). Due to composition, the global budget is split among the quantities released. For example, if 6 sums are released, then each one gets to use 1/6 of the total privacy budget: (ε/6, δ/6). Tighter composition theorems can be used for a large number of composition rounds, but here we use the basic composition for easier exposition.
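A minimal sketch of the release step under the classical Gaussian mechanism with an even basic-composition split of the budget (the helper names are ours; production code should use a vetted DP library):

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def gaussian_sigma(sensitivity, eps, delta):
    """Noise scale of the classical Gaussian mechanism."""
    return sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / eps

def release_sums(sums, u_w, eps, delta):
    """Release each sum with an even split of the (eps, delta) budget.
    In the binary-classification case every sum has sensitivity u_w."""
    k = len(sums)
    sigma = gaussian_sigma(u_w, eps / k, delta / k)
    return {name: value + rng.normal(0.0, sigma) for name, value in sums.items()}
```

Post-processing of the released dictionary (ratios, variances, CIs) consumes no further privacy budget.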
In model calibration exercises, multiple tests are common; e.g., across many models, many subgroups, and many time periods. We focus on the Gaussian mechanism due to its better utility under a large number of compositions. For smaller numbers of compositions, the Laplace mechanism tends to have better utility. We include simulations based on the Laplace mechanism in the Appendix and briefly discuss the conditions under which the Gaussian or Laplace mechanism is more appropriate. When Laplace noises are added, the numerator and denominator of the ratio are no longer Gaussian, which violates the assumptions of the Analytical correction method to be introduced in Section 2.4.3. However, the method appears robust against this violation.

CI Calculation
Once the DP versions of the (up to) 7 sums are released, all calculations based on them are post-processing, so the privacy guarantee remains the same, by the post-processing property of (ε, δ)-DP. The point estimate is simply r̂_dp = (Σ_{i=1}^n w_i s_i)_dp / (Σ_{i=1}^n w_i y_i)_dp. What's more interesting is its variance. Instead of ignoring the DP noise added, we propose two methods that appropriately account for it in the CI calculation.
Once the point estimates and variances are obtained via any of the three methods below, hypothesis testing of the equality of two ratios r_1 and r_2 can be easily carried out, since r̂_1 − r̂_2 is approximately Gaussian with mean r_1 − r_2 and variance σ²_r̂₁ + σ²_r̂₂ (assuming the two estimates are independent), where r̂_1 and r̂_2 are the point estimates and σ²_r̂₁ and σ²_r̂₂ are their variances.
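The two-ratio test can be sketched as follows, assuming the two DP ratio estimates are independent; the function name is ours:

```python
import math

def two_ratio_z_test(r1, var1, r2, var2):
    """Wald z-statistic and two-sided p-value for H0: r1 == r2,
    treating the two ratio estimates as independent Gaussians."""
    z = (r1 - r2) / math.sqrt(var1 + var2)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided Gaussian tail probability
    return z, p
```

The variances var1 and var2 can come from any of the three methods below (no correction, Monte Carlo, or analytical correction).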

No Correction
What is often done in practice is to simply ignore the DP noise added and apply no correction. To be explicit, the DP versions of the sums are plugged into Equations (2) through (4) to get the mean and variance/covariance estimates, which are then plugged into Equation (1) to get the final variance estimate. We call the variance obtained this way σ²_no_correction; it ignores the uncertainty due to DP noise and thus gives CIs that are expected to be too narrow in small-sample settings.

Monte Carlo
To estimate the variance injected by the DP mechanism into the ratio estimate, we can use Monte Carlo simulations. Recall that the ratio of means is the same as the ratio of sums. The procedure is as follows:
1. calculate the point estimate r̂_dp = (Σ_{i=1}^n w_i s_i)_dp / (Σ_{i=1}^n w_i y_i)_dp;
2. for b = 1, …, B, where B is a large integer (e.g., 200):
(a) generate independent Gaussian noises e_{s,b} and e_{y,b} for (Σ_{i=1}^n w_i s_i)_dp and (Σ_{i=1}^n w_i y_i)_dp, respectively. Noises are from distributions with the same variances as in the original DP mechanism, according to Equation (5);
(b) compute the perturbed ratio r̂_b = [(Σ_{i=1}^n w_i s_i)_dp + e_{s,b}] / [(Σ_{i=1}^n w_i y_i)_dp + e_{y,b}];
3. the extra variance due to DP is then estimated as σ̂²_extra = (1/B) Σ_{b=1}^B (r̂_b − r̂_dp)².
Note that we are not looking at the raw data beyond the released DP sums and thus not consuming additional privacy budget, due to the post-processing property of DP. The Monte Carlo method is easy to implement. In addition, the computation is fairly cheap since it can be vectorized.
The σ̂²_extra term is not an unbiased estimator: In step 2(b) noises are added to the DP sums, whereas in the DP mechanism that produces the point estimate the random noises are added to the non-DP sums. The potential bias decreases with increasing sample size as the DP sums approach the non-DP sums. In the simulations, we will test the method's robustness by including cases where the privacy budget is small, so that the noise tends to be big.
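A vectorized sketch of the Monte Carlo correction, taking the extra variance as the mean squared deviation of the simulated ratios from the released point estimate (the exact estimator form in the paper's code may differ):

```python
import numpy as np

rng = np.random.default_rng(1)

def mc_extra_variance(ws_dp, wy_dp, sigma_s, sigma_y, B=200):
    """Monte Carlo estimate of the extra ratio variance due to DP noise.

    ws_dp, wy_dp: released noisy sums of w*s and w*y;
    sigma_s, sigma_y: the noise scales used by the mechanism.
    """
    r_hat = ws_dp / wy_dp
    r_b = (ws_dp + rng.normal(0.0, sigma_s, B)) / (wy_dp + rng.normal(0.0, sigma_y, B))
    return np.mean((r_b - r_hat) ** 2)
```

The whole computation operates only on released quantities, so it is post-processing and consumes no extra privacy budget.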

Analytical Correction
Recall from Equation (1) that the variance of r̂ depends on the means and variances/covariance of s̄ and ȳ. For convenience we again use the ratio of sums instead of means.
How do the Gaussian noises added to Σ_{i=1}^n w_i s_i and Σ_{i=1}^n w_i y_i change their variances? The noise term is independent of the true quantity, so the variance of the released quantity, which is the sum of the two, is simply the sum of the variances. Further, the independent noises do not change the covariance term. As a result, all we need to do is add the variance of the noise to the variance terms.
We follow the steps below to analytically adjust the variance of the ratio estimator in Equation (1):
1. Plug the DP sums into Equations (2) through (4) to get estimates µ̂_s̄, µ̂_ȳ, σ̂²_s̄, σ̂²_ȳ, and σ̂_ȳs̄.
2. Translate those to the corresponding estimates for sums: µ̂_s̄ · (Σ w_i)_dp, µ̂_ȳ · (Σ w_i)_dp, σ̂²_s̄ · (Σ w_i)²_dp, σ̂²_ȳ · (Σ w_i)²_dp, and σ̂_ȳs̄ · (Σ w_i)²_dp.
3. Analytically correct the variance terms by adding to each the variance of the DP noise based on Equation (5): σ̂²_s̄ · (Σ w_i)²_dp + σ² and σ̂²_ȳ · (Σ w_i)²_dp + σ².
4. Plug those corrected terms for the sums, in place of the terms for the means, into Equation (1) to get the corrected variance estimate.
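The correction can be sketched as a single function on the sum-level quantities, assuming the delta-method form of Equation (1) applied to the sums; the names are ours:

```python
def corrected_ratio_variance(S_s, S_y, var_Ss, var_Sy, cov_SsSy,
                             sigma_s, sigma_y):
    """Delta-method variance of the ratio of released sums S_s / S_y,
    after adding the DP noise variances to the variance terms.
    The covariance is unchanged because the noises are independent."""
    var_Ss = var_Ss + sigma_s**2  # analytical correction, numerator
    var_Sy = var_Sy + sigma_y**2  # analytical correction, denominator
    return (var_Ss / S_y**2
            - 2 * S_s * cov_SsSy / S_y**3
            + S_s**2 * var_Sy / S_y**4)
```

Setting sigma_s = sigma_y = 0 recovers the no-correction variance, which is how the correction reduces to the naive method when no DP noise is added.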

Simulations
With a sample size of 5,000 or 10,000, we simulated s ∼ Beta(2, 2), y ∼ Bernoulli(s/1.1) (so that the true calibration ratio was 1.1), and w as Exponential(1) clipped to the range [1/3, 3]. Values of ε used included {0.2, 0.5, 1.0, 4.0}, δ = 1e-6, and both weighted and unweighted data were analyzed. For many use cases, a calibration ratio of 1.0 corresponds to the null hypothesis. Here a calibration ratio of 1.1 was used to represent the situation where the alternative hypothesis is true. We did, however, carry out simulations with a true calibration ratio of 1.0, based on which the main conclusions would not change and the widths of the CIs were narrower than for a value of 1.1. For each simulated dataset, we generated the 95% Wald confidence intervals, obtained the width of the intervals, checked whether each covered the true (log) calibration ratio, and calculated the interval score (the smaller the better) using Equation 43 of Gneiting and Raftery (2007) for the following methods:
• Public: the public method without DP
• No_correction: the method without correction for DP noise
• Monte Carlo: the correction based on Monte Carlo simulation
• Analytical correction: the correction based on modified variance terms
We also calculated the effective sample size, which gave us a rough idea of how variable the weights are, using the Kish formula (Σ_{i=1}^n w_i)²/(Σ_{i=1}^n w²_i) (Kish, 1965). Recall that the inverse of Kish's effective sample size appeared in Equations (3) and (4). We repeated the simulation 1,000 times. The Python code for the simulation can be found at https://github.com/miaojingang/private_ratio.
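A minimal sketch of the data-generating process described above (the full simulation lives in the linked repository):

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_dataset(n):
    """One simulated dataset matching the paper's setup:
    s ~ Beta(2, 2), y ~ Bernoulli(s / 1.1) so the true calibration
    ratio is 1.1, and w ~ Exponential(1) clipped to [1/3, 3]."""
    s = rng.beta(2, 2, n)
    y = rng.binomial(1, s / 1.1)
    w = np.clip(rng.exponential(1.0, n), 1 / 3, 3)
    return s, y, w
```

With a large n, the ratio of the sample means of s and y concentrates near the true calibration ratio of 1.1.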
The results for ratio estimation are summarized in Table 1. The public version, as expected, has coverage fairly close to the nominal level of 95%, and its CI score is the best among all methods.
The no-correction method under-covers in most cases, and its CIs are similar to or only slightly wider than those of the public method. This is because the no-correction method does not account for the extra variability introduced by the DP mechanism. As a result, its CIs are too narrow, especially for cases with small sample sizes, small privacy budgets, and/or weighted samples. For example, on the weighted data with n = 5,000 and ε = 0.2, its CIs cover the true value only 7.6% of the time, grossly below the nominal coverage level. Its CI scores are the worst among all methods. Both correction methods have much better coverage. As ε gets smaller, more noise is injected by the DP mechanism, and both correction methods correctly account for that by giving wider CIs that have the right coverage. The correction methods' CI scores are worse than the public method's but better than the no-correction method's. With a large sample size and a larger privacy budget, the DP CIs are only slightly wider than the public ones; for example, with n = 10,000, ε = 4.0, and no weights, both correction methods have a mean CI width of 0.044, barely larger than the public method's 0.043. The CI scores are also virtually the same as that of the public method. Privacy was preserved almost for free. On the other hand, the increase in CI width is more pronounced for smaller sample sizes, smaller privacy budgets, and weighted data. Further, when the privacy budget is too small relative to the sample size, the methods could still under-cover and the CIs can be too wide. For example, with n = 5,000, ε = 0.2, and weighted data, the Monte Carlo method's coverage is only 91.5% for the log ratio (Table 2), and its CIs are too wide to be useful for the ratio (Table 1). In situations like this, practitioners could explore larger samples and/or larger privacy budgets, in addition to the potential optimizations we enumerate in Section 4.
For the estimation of the log ratio (Table 2), the comparisons among the methods are similar to those for the ratio.

Discussion
We explored the ratio estimation problem and proposed a DP mechanism based on adding noise to summary statistics. We also proposed two variance correction methods that give statistically valid CIs under DP. Our simulations confirmed that the DP noise should not be ignored in ratio inference unless the sample size is large and/or the privacy budget is generous; otherwise, the CIs can be too narrow to cover the true values at the nominal level. The proposal has a few nice features. It is simple: The sums are easy to compute, their sensitivity is trivial to calculate, and the variance corrections needed for valid CIs are straightforward. It is flexible: Suppose the data has a hierarchical structure, for example, inference is done at the state level and one later wants to aggregate to the national level; the sums can be trivially added up. It is extensible: The variance correction methods can be extended to inference on other quantities. Sums are the building blocks of many statistics, including the moments and, in turn, more complex quantities that depend on the moments. Therefore, DP mechanisms based on noising sums can be applied to other statistics.
This work represents an early effort on ratio estimation under DP. Further optimizations may achieve a better privacy-utility trade-off. Balle and Wang (2018) propose an Analytic Gaussian Mechanism that reduces the noise variance compared to the classical Gaussian mechanism in Equation (5), especially in the high-privacy (ε → 0) and high-dimensional regimes. Similarly, alternative DP mechanisms such as the truncated Laplace mechanism for (ε, δ)-DP (Geng et al., 2020) could achieve more precise measurements than Gaussian mechanisms. In cases where many summary statistics need to be privatized, advanced composition of privacy loss (Kairouz et al., 2017) or alternative privacy definitions such as Renyi DP (Mironov, 2017), zero-concentrated DP (Bun and Steinke, 2016), and Gaussian DP (Dong et al., 2019) can provide tighter accounting of the accumulated privacy loss. In addition, there may be smarter ways of allocating the privacy budget than splitting it evenly among summary statistics, improving utility without incurring additional privacy cost. In use cases with tight privacy budgets and high accuracy requirements, it may help to release fewer intermediary quantities when possible, so that each quantity gets a bigger privacy budget.
Simulation and resampling have also been used to account for DP noise. Du et al. (2020) and Ferrando et al. (2020) use simulations to directly measure the combined uncertainty from sampling and DP noise, as opposed to our methods, which account for DP uncertainty separately from the sampling uncertainty. Resampling methods, such as non-parametric bootstrapping, have also been proposed to get the standard error of DP statistics without additional privacy loss (Brawner and Honaker, 2018). When sample sizes are huge, subsampling could also help reduce the computational cost (Kleiner et al., 2014).
Finally, we briefly discuss sampling and weighting. Further privacy amplification is possible in certain use cases: When the dataset is a sample from a larger dataset, and the individual identities in the sample are kept secret, we could improve the privacy analysis via privacy amplification by subsampling (Balle et al., 2020). It would be interesting to explore how sampling weights and different sampling schemes affect privacy in inference. Another direction is to explore how DP may work with more generic types of weights that are not necessarily fixed or that have no known bounds. One popular example is calibration weights (Deville and Sarndal, 1992), which are random since they depend on the sample at hand.

Appendix
The Laplace mechanism draws random noise from the Laplace distribution to achieve an ε-DP guarantee. The probability density function of the Laplace distribution (centered at 0) with scale b is f(x) = (1/(2b)) exp(−|x|/b). Given an L1 global sensitivity of ∆ and a privacy loss parameter of ε, the Laplace noise is drawn from a Laplace distribution with scale ∆/ε.
Gaussian noise has a few advantages over Laplace noise: 1) the Gaussian mechanism calibrates the noise to the L2 sensitivity, which is often much smaller than the L1 sensitivity used by the Laplace mechanism for vector-output functions; 2) for the same variance, the Gaussian distribution's tails decay much faster than the Laplace distribution's; 3) in many applications, other sources of noise or measurement error are often (approximately) Gaussian, so Gaussian noise works better due to closure under addition; 4) the Gaussian mechanism tends to work better under a large number of compositions due to tighter composition theorems.
However, for a small number of queries/compositions, the Laplace mechanism may have an edge in accuracy: for the same value of ε and typical values of δ (on which the Laplace mechanism does not depend), Laplace noise has a smaller variance than Gaussian noise. Another advantage of the Laplace mechanism is that it achieves ε-DP instead of (ε, δ)-DP, which may be preferred in some applications.
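The comparison can be made concrete by computing the noise standard deviations of the two mechanisms for a single query (a sketch; the helper names are ours):

```python
import math

def laplace_sd(sensitivity, eps):
    """Laplace mechanism: scale b = sensitivity/eps, sd = sqrt(2)*b."""
    return math.sqrt(2) * sensitivity / eps

def gaussian_sd(sensitivity, eps, delta):
    """Classical Gaussian mechanism noise standard deviation."""
    return sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / eps
```

For example, with ∆ = 1, ε = 1, and δ = 1e-6, the Laplace standard deviation is about 1.41 versus about 5.30 for the classical Gaussian mechanism, consistent with the Laplace edge for a small number of compositions; as the budget is split across many releases, the Gaussian mechanism's tighter composition eventually wins.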
Under the same simulation settings other than switching to the Laplace mechanism, Tables 3 and 4 show the same patterns as Tables 1 and 2: e.g., the no-correction method under-covers, and both proposed methods have much better coverage. In particular, although using the Laplace mechanism violates the assumption of the Analytical method that the numerator and denominator are Gaussian, the method's coverages are still close to the nominal level. Also, compared with the Gaussian mechanism, smaller amounts of noise are needed for the Laplace mechanism in this particular simulation setting, which yields narrower CIs. If the number of compositions increases, however, for example in an application with multiple testing, the Gaussian mechanism will start to provide narrower CIs. Practitioners are encouraged to compute the variance of the noise under both mechanisms and choose the winner.