Using Occupancy Models to Estimate the Number of Duplicate Cases in a Data System without Unique Identifiers

Data systems collecting information from different sources or over long periods of time can receive multiple reports from the same individual. An important example is public health surveillance systems that monitor conditions with long natural histories. Several state-level systems for surveillance of one such condition, the human immunodeficiency virus (HIV), use codes composed of combinations of non-unique personal characteristics such as birth date, soundex (a code based on last name), and sex as patient identifiers. As a result, these systems cannot distinguish between several different individuals having identical codes and a unique individual erroneously represented several times. We applied results for occupancy models to estimate the potential magnitude of duplicate case counting for AIDS cases reported to the Centers for Disease Control and Prevention with only non-unique partial personal identifiers. Occupancy models with equal and unequal occupancy probabilities are considered. Unbiased estimators for the numbers of true duplicates within and between case reporting areas are provided. Formulas to calculate estimators’ variances are also provided. These results can be applied to evaluating duplicate reporting in other data systems that have no unique identifier for each individual.


Introduction
Public health surveillance systems that monitor conditions with long natural histories can receive multiple reports from different sources regarding the same affected individual. For example, an individual may change his/her place of residence and seek care for the disease under surveillance, likely resulting in case reports from both places. If there is a unique identifier for each individual submitted with surveillance reports, then duplicate reports can be easily identified and removed from the surveillance system. However, because of confidentiality concerns, national surveillance systems do not collect information on variables that can uniquely identify a person. For example, name and social security number may be reported to a state surveillance system as a part of routine reporting, but are not reported to CDC (Centers for Disease Control and Prevention) for national HIV/AIDS surveillance purposes. Instead, an identifier is often created based on several descriptors. This identifier will not be unique. When information submitted to a surveillance system cannot uniquely identify an individual, and the potential for duplicate reports being submitted to the system exists, the system must use additional information to determine if cases with the same non-unique identifiers represent the same person. For this discussion, we call reports with the same partial personal identifiers "potential duplicates". Among these, we classify reports representing the same person as "true duplicates" and those representing different persons as "non-duplicates".
National AIDS surveillance data in the United States have the potential for duplicate reporting and do not have unique identifiers to identify and remove duplicate reports. Using the data available at the national level (cases reported to CDC), one cannot determine whether cases with the same partial personal identifiers represent the same person and therefore are true duplicates. However, it is possible to estimate the expected number of non-duplicates among the potential duplicates in a surveillance system based on the probability of matching on these partial personal identifiers. Larsen (1994) considered this problem in a register of HIV infected persons, using a method to estimate the number of distinct individuals in the register based on the date of birth of each entry and classical occupancy theory, where each ball has the same chance of falling into any one of the cells. While this method may be applicable to the date of birth within a given year, it cannot be applied to identifiers where individuals have an unequal chance of taking each possible value of the identifier, e.g., the soundex (a code based on last name using a method of encryption, see Fenna, 1984).
Under the classical occupancy model where each ball has the same chance of falling into each cell, the explicit formula for the expected number of empty cells is available. However, the explicit formula for the variance associated with the observed number of empty cells is not available. A similar situation occurs under the model with unequal occupancy probabilities. Only approximate formulas for the variance are available; see Chistyakov (1967), Holst (1971), and Sevastyanov (1972). In this paper, we provide exact variance formulas for the observed number of empty cells under the two occupancy models. They are presented in Sections 2 and 3, respectively. In Section 4, we consider a model with cells filled by colored balls. All of these results can be applied to evaluating duplicates in a data system. As an example, we use occupancy models to evaluate duplication in AIDS case reporting. Results are presented in Section 5. Finally, some concerns and recommendations are presented in the discussion section.

Occupancy Model with Equal Occupancy Probabilities
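Suppose that $r$ balls are randomly distributed to $n$ cells. Assume that each ball has an equal chance of being distributed to each cell. Let $M_{r,n}$ be the number of cells remaining empty. According to occupancy theory (see Feller, 1968, page 102), the probability distribution of $M_{r,n}$ is given by

$$P(M_{r,n} = m) = \binom{n}{m} \sum_{k=0}^{n-m} (-1)^k \binom{n-m}{k} \left(1 - \frac{m+k}{n}\right)^r, \qquad m = 0, 1, \ldots, n, \tag{2.1}$$

where $\binom{x}{k}$ is the binomial coefficient equal to the number of combinations of $k$ items selected from $x$ items. Note that this formula is difficult to evaluate numerically because the alternating sum is prone to rounding error. A useful recursive formula is available (see Feller, 1968, page 60):

$$P(M_{r+1,n} = m) = \frac{n-m}{n}\, P(M_{r,n} = m) + \frac{m+1}{n}\, P(M_{r,n} = m+1). \tag{2.2}$$

Based on this recursive formula, one can derive the mean and variance of $M_{r,n}$. An alternative but simpler way to derive the mean and variance is presented in Section 3. As a special case of equations (3.1) and (3.6), the mean and variance of $M_{r,n}$ defined in (2.1) are:

$$E(M_{r,n}) = n\left(1 - \frac{1}{n}\right)^r, \tag{2.3}$$

$$\mathrm{Var}(M_{r,n}) = n\left(1 - \frac{1}{n}\right)^r + n(n-1)\left(1 - \frac{2}{n}\right)^r - n^2\left(1 - \frac{1}{n}\right)^{2r}. \tag{2.4}$$

Since we are interested in the number of occupied cells and the number of balls that exceed the minimum necessary to fill the occupied cells, we consider the variables $K_{r,n}$, the number of occupied cells, and $D_{r,n}$, the number of balls $r$ minus the number of cells occupied by the $r$ balls: $D_{r,n} = r - K_{r,n} = r - (n - M_{r,n})$. Therefore, we have

$$E(K_{r,n}) = n - n\left(1 - \frac{1}{n}\right)^r, \qquad E(D_{r,n}) = r - n + n\left(1 - \frac{1}{n}\right)^r, \tag{2.5}$$

and, given $n$ and $r$,

$$\mathrm{Var}(K_{r,n}) = \mathrm{Var}(D_{r,n}) = \mathrm{Var}(M_{r,n}).$$

Given $n$ and an observed number $k < n$ of occupied cells, equating $E(K_{r,n})$ in (2.5) to $k$ and solving for $r$ gives the estimator

$$\hat{r} = \frac{\ln(1 - k/n)}{\ln(1 - 1/n)}.$$

Using the delta method, the variance of the above estimator is approximately

$$\mathrm{Var}(\hat{r}) \approx \frac{\mathrm{Var}(M_{r,n})}{\left[E(M_{r,n})\,\ln(1 - 1/n)\right]^2},$$

evaluated at $r = \hat{r}$. The above results can be applied to situations where cells do not all have the same probability of being occupied, but can be divided into subgroups such that within each subgroup each cell has an equal probability of being occupied by a ball.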
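The quantities in this section are simple to evaluate numerically. The following is a minimal Python sketch, assuming the standard closed forms $E(M_{r,n}) = n(1-1/n)^r$ and $\mathrm{Var}(M_{r,n}) = n(1-1/n)^r + n(n-1)(1-2/n)^r - n^2(1-1/n)^{2r}$ for equal occupancy probabilities. The function names are illustrative and are not taken from the original analysis.

```python
import math


def empty_cells_mean(r: int, n: int) -> float:
    """Expected number of empty cells, E(M) = n * (1 - 1/n)**r."""
    return n * (1 - 1 / n) ** r


def empty_cells_var(r: int, n: int) -> float:
    """Exact variance of the number of empty cells.

    Var(M) = n*a + n*(n-1)*b - (n*a)**2, with a = (1 - 1/n)**r and
    b = (1 - 2/n)**r (sum of indicator variances and covariances).
    """
    a = (1 - 1 / n) ** r
    b = (1 - 2 / n) ** r
    return n * a + n * (n - 1) * b - (n * a) ** 2


def expected_excess_balls(r: int, n: int) -> float:
    """E(D) = r - E(K) = r - n + E(M): balls beyond one per occupied cell."""
    return r - n + empty_cells_mean(r, n)


def estimate_balls(k: float, n: int) -> float:
    """Moment estimate of r from k occupied cells: solve n - n(1-1/n)**r = k."""
    if n < 2 or not 0 <= k < n:
        raise ValueError("need n >= 2 and 0 <= k < n")
    return math.log(1 - k / n) / math.log(1 - 1 / n)


if __name__ == "__main__":
    n, r = 365, 100  # e.g. 100 persons spread over 365 birth days
    print("E(M)   =", empty_cells_mean(r, n))
    print("Var(M) =", empty_cells_var(r, n))
    print("E(D)   =", expected_excess_balls(r, n))
```

For very large $n$ the variance expression subtracts nearly equal terms, so a production implementation may need extended precision; this sketch does not address that.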

Occupancy Model with Unequal Occupancy Probabilities
In this section, we assume that the occupancy probabilities differ from cell to cell. Let $M_{r,n}$ denote the number of empty cells after $r$ balls have been placed into $n$ cells with occupancy probabilities $p_1, \ldots, p_n$, where $\sum_{i=1}^{n} p_i = 1$. Then, the expected number of empty cells can be expressed as

$$E(M_{r,n}) = \sum_{i=1}^{n} (1 - p_i)^r. \tag{3.1}$$

Given $m$, the number of empty cells, we can estimate $r$, the number of balls, by solving the above equation for $r$ with $E(M_{r,n}) = m$.
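Because $\sum_i (1 - p_i)^r$ is strictly decreasing in $r$ when all $p_i > 0$, the equation $E(M_{r,n}) = m$ can be solved numerically. The sketch below uses plain bisection on a real-valued $r$; the function names, the tolerance, and the choice of bisection are assumptions made for illustration, not the authors' procedure.

```python
from typing import Sequence


def expected_empty(r: float, p: Sequence[float]) -> float:
    """E(M) = sum_i (1 - p_i)**r for occupancy probabilities p."""
    return sum((1 - pi) ** r for pi in p)


def solve_balls(m: float, p: Sequence[float], tol: float = 1e-9) -> float:
    """Find real r >= 0 with expected_empty(r, p) == m by bisection."""
    if any(pi <= 0 for pi in p):
        raise ValueError("all occupancy probabilities must be positive")
    if not 0 < m <= len(p):
        raise ValueError("need 0 < m <= number of cells")
    lo, hi = 0.0, 1.0
    # Grow the bracket until the expected number of empty cells drops to m.
    while expected_empty(hi, p) > m:
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_empty(mid, p) > m:
            lo = mid  # too few balls: still too many empty cells expected
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]
    m = expected_empty(4, p)
    print(m, solve_balls(m, p))
```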
Using the binomial expansion followed by interchanging the order of summation in (3.1), we have:

$$E(M_{r,n}) = \sum_{t=0}^{r} (-1)^t \binom{r}{t} q_t = n - r + \sum_{t=2}^{r} (-1)^t \binom{r}{t} q_t, \tag{3.2}$$

where

$$q_t = \sum_{i=1}^{n} p_i^t, \tag{3.3}$$

so that $q_0 = n$ and $q_1 = 1$. Therefore, the expected number of excess balls is given by

$$E(D_{r,n}) = r - n + E(M_{r,n}) = \sum_{t=2}^{r} (-1)^t \binom{r}{t} q_t. \tag{3.4}$$

Given $s$ ($2 \le s < r$), the above formula can be approximated by

$$E(D_{r,n}) \approx \sum_{t=2}^{s} (-1)^t \binom{r}{t} q_t. \tag{3.5}$$

The difference between (3.4) and (3.5) is the sum of smaller terms $\sum_{t=s+1}^{r} (-1)^t \binom{r}{t} q_t$. Because of the cancellation of positive and negative terms, the difference can be small. If $\binom{r}{t} q_t$ decreases with $t$ when $t \ge s$, then the above approximation has a maximum error less than $\binom{r}{s+1} q_{s+1}$. Since $q_{t+1} \le p_{\max}\, q_t$, where $p_{\max} = \max\{p_1, \ldots, p_n\}$, it follows that $\binom{r}{t} q_t \ge \binom{r}{t+1} q_{t+1}$ if $t \ge (r p_{\max} - 1)/(1 + p_{\max})$.

Similar to the occupancy model with equal occupancy probabilities, the variances of $M_{r,n}$, $K_{r,n}$ and $D_{r,n}$ are all the same for given $n$ and $r$. The common variance is:

$$\mathrm{Var}(M_{r,n}) = \sum_{i=1}^{n} (1 - p_i)^r \left[1 - (1 - p_i)^r\right] + \sum_{i \ne j} \left[(1 - p_i - p_j)^r - (1 - p_i)^r (1 - p_j)^r\right]. \tag{3.6}$$

This can be easily proved by considering the number of empty cells as a sum of binary variables: $M_{r,n} = \sum_{i=1}^{n} X_i$, where $X_i$ is the indicator variable that the $i$-th cell is empty after $r$ balls have been distributed into the $n$ cells. The probability that the $i$-th cell is empty is given by $\Pr(X_i = 1) = (1 - p_i)^r$ and the probability that two cells, say the $i$-th and $j$-th cells, are both empty is $\Pr(X_i = 1, X_j = 1) = (1 - p_i - p_j)^r$. Therefore, the variance and covariance of these binary variables are

$$\mathrm{Var}(X_i) = (1 - p_i)^r \left[1 - (1 - p_i)^r\right], \tag{3.7}$$

$$\mathrm{Cov}(X_i, X_j) = (1 - p_i - p_j)^r - (1 - p_i)^r (1 - p_j)^r, \qquad i \ne j. \tag{3.8}$$

Combining (3.7) and (3.8) gives (3.6).

Three approximate formulas for the variance can be found in the literature. The first approximation is given by Chistyakov (1967): if $\log(r/n)$ is bounded, then $M_{r,n}$ has an asymptotic normal distribution with variance This approximation is quite accurate. It is slightly greater than the true variance and its maximum relative error is small: Under the condition of equal occupancy probabilities, it can be shown that for fixed $r$, If $n$ is large but $r/n$ is small, Holst (1971) gave a simpler approximation for the variance of $M_{r,n}$: If $n$ is large, but $r/n$ is not small or the expected number of empty cells is small, Sevastyanov (1972) proved that $M_{r,n}$ has an asymptotic Poisson distribution with a variance equal to its mean: $\mathrm{Var}(M_{r,n}) \approx E(M_{r,n})$.

To see how accurate these approximations are, we compared the approximate variances with the true value for occupancy probability distributions $p_i = i^a/c$, where $c = \sum_{i=1}^{n} i^a$. When $a = 0$, occupancy probabilities are equal. The distributions for $a = 0$, $1/2$, $1$, and $2$ are shown in Figure 1. Throughout, $\sigma = \sqrt{\mathrm{Var}(M_{r,n})}$ denotes the standard deviation of the number of empty cells.

Figure 1: Distributions of occupancy probabilities that are proportional to $i^a$ for $n = 1000$.
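The indicator argument of this section translates directly into code. The sketch below computes the exact variance of the number of empty cells as the sum of $\mathrm{Var}(X_i) = (1-p_i)^r[1-(1-p_i)^r]$ and $\mathrm{Cov}(X_i, X_j) = (1-p_i-p_j)^r - (1-p_i)^r(1-p_j)^r$, builds the power-law probabilities $p_i \propto i^a$, and evaluates the truncated alternating series for the expected number of excess balls. The function names are hypothetical, integer $r$ is assumed, and the pair loop costs $O(n^2)$ operations.

```python
import math
from typing import List, Sequence


def power_law_probs(n: int, a: float) -> List[float]:
    """Occupancy probabilities p_i = i**a / c for i = 1..n."""
    weights = [i ** a for i in range(1, n + 1)]
    c = sum(weights)
    return [w / c for w in weights]


def exact_variance(r: int, p: Sequence[float]) -> float:
    """Var(M) = sum_i Var(X_i) + 2 * sum_{i<j} Cov(X_i, X_j)."""
    e = [(1 - pi) ** r for pi in p]  # Pr(cell i is empty)
    var = sum(ei * (1 - ei) for ei in e)
    for i in range(len(p)):
        for j in range(i + 1, len(p)):
            # Clamp at 0 to guard against tiny negative rounding residue.
            both_empty = max(0.0, 1 - p[i] - p[j]) ** r
            var += 2 * (both_empty - e[i] * e[j])
    return var


def excess_balls_series(r: int, p: Sequence[float], s: int) -> float:
    """Truncated series sum_{t=2}^{s} (-1)**t * C(r, t) * q_t, q_t = sum p_i**t."""
    total = 0.0
    for t in range(2, s + 1):
        q_t = sum(pi ** t for pi in p)
        total += (-1) ** t * math.comb(r, t) * q_t
    return total


if __name__ == "__main__":
    p = power_law_probs(200, 1.0)
    r = 100
    mean = sum((1 - pi) ** r for pi in p)
    print("exact sd:", math.sqrt(exact_variance(r, p)))
    print("Poisson-type sd (variance = mean):", math.sqrt(mean))
```

With $s = r$ the series reproduces $r - n + E(M_{r,n})$ exactly, which gives a convenient check on an implementation.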

Occupancy Problem When Cells Are Filled with Balls of Different Colors
In this section, we consider an occupancy model with cells filled by different colored balls. For simplicity, suppose that balls are colored either black or white and there are $r_1$ black and $r_2$ white balls. We are interested in the expected number of black balls falling in cells that have white balls. Suppose that balls of both colors are distributed into $n$ cells with the same distribution probabilities $p_1, \ldots, p_n$. Suppose that the $r_2$ white balls are distributed in $k_2$ cells labeled $i_1, \ldots, i_{k_2}$. The probability that a ball falls in these cells is

$$p_{\text{white}} = \sum_{j=1}^{k_2} p_{i_j}. \tag{4.1}$$

Let $R_{12}$ be the number of black balls occupying cells $i_1, \ldots, i_{k_2}$. Then $R_{12}$ has a binomial distribution $\mathrm{Bin}(r_1, p_{\text{white}})$. Therefore, we have

$$E(R_{12}) = r_1\, p_{\text{white}} \quad \text{and} \quad \mathrm{Var}(R_{12}) = r_1\, p_{\text{white}} (1 - p_{\text{white}}). \tag{4.2}$$

In the equal occupancy probability situation, all $p_i = 1/n$ and $p_{\text{white}} = k_2/n$. The mean and variance are

$$E(R_{12}) = \frac{r_1 k_2}{n} \quad \text{and} \quad \mathrm{Var}(R_{12}) = \frac{r_1 k_2}{n}\left(1 - \frac{k_2}{n}\right). \tag{4.3}$$
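As a small illustration of the two-color model, the following sketch computes $p_{\text{white}}$ as the total probability of the cells already holding white balls, together with the binomial mean and variance of $R_{12}$. The function name and the zero-based cell indices are assumptions made for this example.

```python
from typing import Iterable, Sequence, Tuple


def black_in_white_cells(
    r1: int, p: Sequence[float], white_cells: Iterable[int]
) -> Tuple[float, float, float]:
    """Return (p_white, E(R12), Var(R12)) for r1 black balls.

    white_cells holds zero-based indices of cells occupied by white balls;
    repeated indices are counted once.
    """
    p_white = sum(p[i] for i in set(white_cells))
    mean = r1 * p_white
    var = r1 * p_white * (1 - p_white)
    return p_white, mean, var


if __name__ == "__main__":
    # Four equally likely cells, white balls in two of them, ten black balls.
    print(black_in_white_cells(10, [0.25] * 4, [0, 2]))
```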

Application to Analysis of Duplicates in AIDS Case Reporting
As we mentioned earlier, in AIDS surveillance, the partial personal identifiers reported to CDC cannot uniquely identify an individual. Among data elements reported to CDC, sex, date of birth, and soundex (a code based on last name using a method of encryption, see Fenna, 1984) are used for duplicate evaluation. Since the frequency of letters in last names is not uniform, the soundex does not have a uniform distribution over its possible values in the general population. This is also true among persons with AIDS. Although sex is fairly uniformly distributed in the general population, persons with AIDS are more likely to be male than female. Dates of birth for persons with AIDS do not have a uniform distribution over calendar dates, so we cannot directly apply the occupancy model with equal occupancy probabilities to evaluate the duplicate reporting in AIDS surveillance. Note, however, that the date of birth can be considered a combination of birth year and birth day in a year. Among persons diagnosed with AIDS, birth year does not have a uniform distribution over the calendar years. However, the birth day is quite uniformly distributed within a calendar year. Therefore, if we stratify the reported AIDS cases by sex, soundex, and birth year, then we can apply the occupancy results developed in Section 2 to evaluate duplicate reporting based on matched birth days within each stratum. If there were no true duplicates (multiple cases reported for the same person) in the AIDS surveillance system, then given the number of cases reported to the system (which would equal the number of persons with AIDS reported to the system), the number of distinct combinations of sex, soundex, and date of birth would satisfy the equations provided in the previous sections. In this application, persons or cases are considered as balls, and cells comprise the various combinations of sex, soundex, and date of birth. Since the number of sex, soundex, and date of birth combinations is observable and not affected by true duplicate reporting, we can work backwards to estimate the number of reported persons with AIDS. If the actual number of reported cases is greater than this estimated number of distinct persons reported, then there exists evidence of duplicate reporting (true duplicates) in the AIDS case reporting system.
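The "working backwards" step can be sketched per stratum as follows, under the equal-probability model with birth days as cells. Within a stratum with $n$ possible birth days and $k$ distinct observed birth days, the number of distinct persons is estimated by $\ln(1 - k/n)/\ln(1 - 1/n)$, and the excess of reported cases over that estimate is attributed to true duplicates. The `Stratum` record, its field names, and the example figures are hypothetical and are not surveillance data.

```python
import math
from typing import Iterable, NamedTuple


class Stratum(NamedTuple):
    cells: int     # possible birth days in the stratum (n), e.g. 365
    occupied: int  # distinct birth days observed among reports (k)
    reported: int  # case reports received, true duplicates included


def distinct_persons(s: Stratum) -> float:
    """Estimate the number of distinct persons from occupied cells."""
    if s.occupied == 0:
        return 0.0
    if s.occupied >= s.cells:
        raise ValueError("estimate undefined when every cell is occupied")
    return math.log(1 - s.occupied / s.cells) / math.log(1 - 1 / s.cells)


def estimated_true_duplicates(strata: Iterable[Stratum]) -> float:
    """Sum over strata of reported cases minus estimated distinct persons."""
    return sum(s.reported - distinct_persons(s) for s in strata)


if __name__ == "__main__":
    example = [Stratum(365, 1, 3), Stratum(365, 10, 12), Stratum(366, 4, 4)]
    print(estimated_true_duplicates(example))
```

The estimate for a single small stratum can be negative (for instance, 4 reports on 4 distinct days), so it is the aggregate over many strata that is meaningful.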
As of June 2004, there were 922,835 AIDS cases reported to CDC. After eliminating 573 cases with unknown sex or soundex, or incomplete date of birth, the remaining 922,262 cases are used in our duplicate analysis; they "occupy" 840,416 distinct combinations of sex, soundex, and date of birth, thus revealing 81,846 potential duplicates. By following the procedure described in Section 2, we estimate that 37,724 or 4.09% of total reported cases are true duplicates with a 95% confidence interval (4.04%, 4.14%). This percentage varied by sex (see Table 2). We next analyzed the potential duplicates for cases diagnosed within each state. If there are no true duplicates in each state, then the number of potential duplicates can be estimated using the formulas provided in Section 2. Based on AIDS cases reported to CDC by June of 2004, we compare the observed numbers of potential duplicates in each state to the expected number of potential duplicates (Figure 2). States with a small number of AIDS cases tend to have more true duplicates (the observed number of potential duplicates is greater than the expected number of potential duplicates), while in states with a large number of AIDS cases, the observed numbers of potential duplicates are consistently less than the expected numbers of potential duplicates.
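The headline counts in this paragraph can be checked with a few lines of arithmetic. The input figures below are those quoted in the text; the last quantity, the number of potential duplicates attributable to chance matches between different persons, is a derived value implied by those figures rather than one reported in the text. Variable names are illustrative.

```python
reported = 922_835                      # AIDS cases reported to CDC as of June 2004
excluded = 573                          # unknown sex/soundex or incomplete birth date
analyzed = reported - excluded          # cases used in the duplicate analysis
distinct_codes = 840_416                # distinct sex/soundex/birth-date combinations

potential_duplicates = analyzed - distinct_codes
estimated_true_duplicates = 37_724      # estimate quoted in the text
percent_true = round(100 * estimated_true_duplicates / analyzed, 2)

# Derived here (not stated in the text): matches between different persons.
chance_matches = potential_duplicates - estimated_true_duplicates

if __name__ == "__main__":
    print(analyzed, potential_duplicates, percent_true, chance_matches)
```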
We also evaluated the problem of duplicate reporting of AIDS cases between states. Cases diagnosed in one state may have the same partial personal identifier as cases diagnosed in other states. The number or proportion of true inter-state duplicates can be estimated using the method described in Section 4. Results based on AIDS cases reported to CDC by June of 2004 are shown in Figure 3. States with a smaller number of potential duplicates tend to have a higher proportion of true duplicates.

Summary and Discussion
In this paper, we provided formulas to calculate the exact variance of the number of empty cells in occupancy problems with equal or unequal occupancy probabilities. We also considered a generalized occupancy problem with balls of different colors. These results are useful for evaluating duplicate reporting in case surveillance of a disease in which a unique identifier is not available but for which duplicate case reporting is likely to occur.
Although the duplication analysis cannot tell whether a particular pair of potential duplicates is a pair of true duplicates or non-duplicates, it does help identify problems and can be used to estimate the magnitude of true duplicates in a surveillance system. The precision of the duplicate analysis depends on the number $n$ of cells and the number $r$ of cases in each stratum. More precisely, it depends on $r$ and the probability that two distinct cases have the same code composed of a combination of personal identification variables. This probability takes on a minimum value of $1/n$ when all possible codes are equally likely. On the other hand, if, for example, soundex (with more than 8000 possible outcomes) is the only partial personal identifier considered, then two persons have approximately a 1/450 probability of having the same code.
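The probability that two distinct cases share a code equals $\sum_i p_i^2$ when codes are drawn independently with probabilities $p_1, \ldots, p_n$, and it attains its minimum $1/n$ for the uniform distribution. The sketch below computes this quantity and its reciprocal; calling the reciprocal an "effective number of codes" is terminology introduced here for illustration, and the example distribution is made up.

```python
from typing import Sequence


def match_probability(p: Sequence[float]) -> float:
    """Probability that two independent draws from p coincide: sum p_i**2."""
    return sum(pi * pi for pi in p)


def effective_codes(p: Sequence[float]) -> float:
    """Reciprocal of the match probability; equals n when p is uniform."""
    return 1 / match_probability(p)


if __name__ == "__main__":
    uniform = [0.25] * 4
    skewed = [0.7, 0.1, 0.1, 0.1]
    print(match_probability(uniform), match_probability(skewed))
    print(effective_codes(uniform), effective_codes(skewed))
```

Read this way, a code with more than 8000 possible values but a match probability near 1/450 behaves like roughly 450 equally likely codes.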
An assumption underlying our work is that the data elements used as a partial personal identifier, or stratification variables, have values that do not vary over time for each individual. If, for some reason, a person's partial personal identifiers used for duplication assessment change, then the method will underestimate the number of true duplicates. For example, a woman's soundex could change due to marriage. Also, data entry errors in partial personal identifiers could result in underestimating the number of true duplicates unless the same error occurs consistently in reporting.
Using this occupancy model, we estimate that approximately 4% of AIDS cases reported in the national AIDS surveillance system up to June 2004 represent duplicate reports. This is consistent with previous findings (page 40, Centers for Disease Control and Prevention, 2004) that less than 5% of HIV and AIDS cases in the national surveillance database are duplicates, and is in compliance with recommended performance standards for well-functioning surveillance systems, which set a minimum performance standard for duplicate reports of ≤ 5%. These methods could be used to monitor national and state surveillance systems as a performance indicator: should estimated duplication rates exceed 5%, evaluation of surveillance practices for intrastate duplication efforts and communication between states for interstate duplication assessment can be undertaken.

Figure 2: Observed number of potential duplicates vs. expected number of potential duplicates (without true duplicates) within each diagnosis state by sex. Results are based on cases reported to CDC by June of 2004.

Figure 3: Expected proportion of true duplicates among potential duplicates in a state with other states. Results are based on cases reported to CDC by June of 2004.

Table 1: True and approximate values of the standard deviation of the number of occupied cells for n = 1000.
True and approximate values of the standard deviation are provided in Table 1. From Table 1, we can

Table 2: Estimates related to duplicate reporting of AIDS cases reported to CDC by June of 2004.
* Combination of sex, soundex, and date of birth.