Efficient Sampling Design in Audit Data

Auditors often review samples drawn from special populations. In one such population, invoices fall into two categories according to whether or not they are qualified. In other words, the qualified amount follows a nonstandard mixture distribution in which it is either zero, with a certain probability, or equal to the known invoice amount, with the complementary probability. In the other population, some invoices are partially qualified, that is, their qualified amount lies between zero and the full invoice amount. For these settings, the typical sample design is stratified random sampling, with estimation by a ratio type method. This paper focuses on efficient sample design for these settings and provides guidelines for setting up stratum boundaries, calculating the sample size, and allocating the sample optimally across strata.


Introduction
Much of traditional sampling theory was developed in the household survey context. Sampling business records presents very different challenges and often requires different solutions. Most commonly, the quantity to be estimated is financial. It may be, for example, the amount subject to sales tax, the amount deductible from income tax, or the amount that is in error in the business records. The sampling unit is frequently the invoice. The quantities being estimated have a lower bound of zero but can take on large positive values, sometimes millions of dollars. In addition, there are always requirements to minimize the impact of the sampling on company operations and to keep the sample size as small as possible, while still achieving good precision. Whether we are reviewing for the traditional audit purpose of identifying and quantifying errors in business records or for determining taxable amounts, we can generally classify our sampling as audit sampling, where we begin with a recorded amount and make some quantitative determination about that original amount.
There are two types of populations that we often face in auditing. One is the special population where invoices are divided into two categories according to whether or not they are qualified. In other words, the qualified amount is either zero or the same as the known invoice amount, depending on which category the invoice falls into. This type of population is called Population One. Figure 1 (left half) shows the scatterplot of the qualified amount against the invoice amount for population one. The other population type arises when some invoices have a qualified amount between zero and the full invoice amount. This type of population is called Population Two. Figure 1 (right half) shows the scatterplot of the qualified amount against the invoice amount for population two.
[Figure 1: scatterplots of qualified amount against invoice amount (x). Left panel: qualified amount (y), population one. Right panel: qualified amount (z), population two.]
For these two populations, the typical sample design is a stratified random sample design using the known invoice amount as the stratifying variable. In this paper, we assume the cases with the largest recorded amounts (or potential 'outliers') are taken with certainty.
We first summarize the characteristics of population one. Suppose that the population includes $N$ invoices, each with a known invoice amount. The invoices are divided into two classes: qualified class $C$ and non-qualified class $\bar{C}$. If an invoice is in class $C$, then its qualified amount is equal to its invoice amount; otherwise the qualified amount is zero. In this paper, we assume that the percentage of invoices in either class is in a reasonable range. If the percentage of invoices in one class is extreme, either very small or very large, a hypergeometric estimation method is recommended (Liu, Batcher and Rotz, 2001). Here, however, we will assume a binomial model applies.
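Further, we assume that qualified invoices and non-qualified invoices are randomly distributed among the $N$ population units. Let $x_i$ be the known invoice amount for invoice $i$ and $y_i$ be the unknown qualified amount for invoice $i$. According to Roberts (1978), the $N$ population units may be characterized as a realization of the following process:

$$y_i = \begin{cases} x_i & \text{with probability } p, \\ 0 & \text{with probability } 1 - p. \end{cases} \tag{1.1}$$

The properties of this process in terms of averages over all possible realizations, denoted as $E_p$, lead to some useful applications. We first outline these properties, as summarized by Roberts (1978). The population parameter to be estimated is the ratio:

$$R = \frac{\sum_{i=1}^{N} y_i}{\sum_{i=1}^{N} x_i} = \frac{\bar{Y}}{\bar{X}}. \tag{1.2}$$

The corresponding sample estimate under simple random sampling is:

$$\hat{R} = \frac{\bar{y}}{\bar{x}}, \tag{1.3}$$

where $\bar{y} = n^{-1} \sum_{i=1}^{n} y_i$ and $\bar{x} = n^{-1} \sum_{i=1}^{n} x_i$. The variance of $\hat{R}$, for large $n$, is approximately:

$$V(\hat{R}) \approx \frac{1 - f}{n \bar{X}^2} S_d^2, \tag{1.4}$$

where $f = n/N$ is the sampling fraction, $S_d^2$ is the variance of $d_i = y_i - R x_i$ and

$$S_d^2 = \frac{1}{N - 1} \sum_{i=1}^{N} (y_i - R x_i)^2. \tag{1.5}$$

Under the realization process of population units described in equation (1.1),

$$E_p(R) = p \tag{1.6}$$

and

$$E_p(S_d^2) \approx p(1 - p)(S_x^2 + \bar{X}^2), \tag{1.7}$$

when the population size, $N$, is also reasonably large. Here $S_x^2$ and $\bar{X}$ are the population variance and mean of the invoice amount $x$.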
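As a rough numerical check of the approximation in equation (1.7), the following Python sketch simulates the process in equation (1.1) and compares the realized $S_d^2$ with $p(1-p)(S_x^2 + \bar{X}^2)$. The lognormal invoice amounts, the seed, and the function names are illustrative assumptions, not data or code from the paper.

```python
import random
import statistics

def simulate_population_one(x, p, rng):
    """Process (1.1): an invoice is fully qualified with probability p, else 0."""
    return [xi if rng.random() < p else 0.0 for xi in x]

def residual_variance(x, y):
    """Return R = sum(y) / sum(x) and S_d^2, the variance of d_i = y_i - R * x_i."""
    R = sum(y) / sum(x)
    d = [yi - R * xi for xi, yi in zip(x, y)]
    return R, statistics.variance(d)

def roberts_approximation(x, p):
    """Equation (1.7): p(1-p)(S_x^2 + Xbar^2), using only known invoice amounts."""
    return p * (1 - p) * (statistics.variance(x) + statistics.mean(x) ** 2)

rng = random.Random(12345)
# Hypothetical invoice amounts (lognormal); not the paper's simulation population.
x = [rng.lognormvariate(6.0, 0.5) for _ in range(20000)]
p = 0.2
y = simulate_population_one(x, p, rng)
R, s2_d = residual_variance(x, y)
approx = roberts_approximation(x, p)
print(f"R = {R:.3f}, S_d^2 = {s2_d:.1f}, approximation (1.7) = {approx:.1f}")
```

For a single large realization, the realized ratio should sit near $p$ and the two variance figures should be of similar size; how close they are depends on $N$ and on the skewness of $x$.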
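We now extend the above properties to population two, where some invoices are partially qualified. In order to relate population two to population one and make use of the results from population one, we assume the same average ratio for population two, i.e., $E_p(R) = p$. Many scenarios are possible for the relationship between the qualified amount, denoted as $z_i$ (in order to distinguish it from $y_i$ in population one), and the invoice amount $x_i$. One scenario is that the points $(x_i, z_i)$ are randomly scattered around the line $z = px$. The population units can then be characterized as a realization of the following process:

$$z_i = \begin{cases} p x_i + u(1 - p) x_i & \text{with probability } p, \\ p x_i - u p x_i & \text{with probability } 1 - p, \end{cases} \tag{1.8}$$

where $u$ is a random number from Uniform(0, 1). Under the realization process of population units described in equation (1.8), we still have $E_p(R) = p$, and, with $R$ replaced by its expectation $p$,

$$d_{i(z)} = z_i - p x_i = \begin{cases} u(1 - p) x_i & \text{with probability } p, \\ -u p x_i & \text{with probability } 1 - p. \end{cases} \tag{1.9}$$

Rewrite $d_i$ in population one as:

$$d_i = y_i - p x_i = \begin{cases} (1 - p) x_i & \text{with probability } p, \\ -p x_i & \text{with probability } 1 - p. \end{cases} \tag{1.10}$$

Comparing equations (1.9) and (1.10), we have $d_{i(z)} = u d_i$ and hence $E_p(d_{i(z)}^2) = E_p(u^2) E_p(d_i^2)$, since $u$ and $d$ are independent. $E_p(u^2) = 1/3$, since $u \sim$ Uniform(0, 1). Therefore,

$$S_{d(z)}^2 \approx S_d^2 / 3. \tag{1.11}$$

Note that most scenarios of population two fall between the process characterized in equation (1.1) and the process characterized in equation (1.8). Therefore, we may expect the value of $S_{d(z)}^2$ to lie between $S_d^2/3$ and $S_d^2$ for most scenarios of population two.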
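The factor of one third can be checked by simulation. The sketch below assumes that process (1.8) shrinks each population-one residual toward the line $z = px$ by an independent Uniform(0, 1) factor $u$, which is one reading consistent with $E_p(u^2) = 1/3$ and $E_p(R) = p$; the invoice amounts and function names are hypothetical.

```python
import random
import statistics

def simulate_both_populations(x, p, rng):
    """Return (y, z): population one from (1.1), population two from the assumed (1.8)."""
    y, z = [], []
    for xi in x:
        qualified = rng.random() < p
        u = rng.random()  # Uniform(0, 1), drawn independently of the class indicator
        y.append(xi if qualified else 0.0)
        # Population-one residual around p * x, shrunk by the factor u.
        residual = (1 - p) * xi if qualified else -p * xi
        z.append(p * xi + u * residual)
    return y, z

def residual_variance(x, v):
    """Variance of d_i = v_i - R * x_i with R = sum(v) / sum(x)."""
    R = sum(v) / sum(x)
    return statistics.variance([vi - R * xi for xi, vi in zip(x, v)])

rng = random.Random(2024)
# Hypothetical invoice amounts; not the paper's simulation population.
x = [rng.lognormvariate(6.0, 0.5) for _ in range(20000)]
p = 0.2
y, z = simulate_both_populations(x, p, rng)
s2_y = residual_variance(x, y)
s2_z = residual_variance(x, z)
ratio = s2_z / s2_y
print(f"S_d(z)^2 / S_d^2 = {ratio:.3f} (theory: about 1/3)")
```

Under this assumed form every $z_i$ stays between zero and $x_i$, so the simulated invoices are partially qualified in the sense used above.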

Determination of Stratum Boundaries
At the design stage, we only have knowledge about the invoice amount. In practice, the Dalenius-Hodges method (Cochran 1977, pp. 127-131, and Särndal et al. 1991, pp. 463-464) is often used to set up stratum boundaries based on the values of $x$. The sample is then allocated by the Neyman rule (Cochran 1977, Chapter 5), again based on knowledge of $x$. This works well only if the correlation between $x$ and $y$ is strong, say a correlation coefficient of 0.9 or more, which is often not the case in practice. Therefore, for our special ratio type data, we develop a new method to determine stratum boundaries and sample size allocation using the special relationship between $x$ and $y$. Specifically, we use equation (1.7) as the approximation of $S_d^2$. Because Neyman allocation assigns stratum sample sizes in proportion to $N_h S_{hd}$, given the number of strata $L$ and the same sample size per stratum, stratum boundaries under Neyman optimum allocation can be determined such that $N_h S_{hd}$ ($h = 1, 2, \ldots, L$) is about the same for all strata. That is,

$$N_h \sqrt{p_h (1 - p_h)(S_{hx}^2 + \bar{X}_h^2)} = C, \quad h = 1, 2, \ldots, L, \tag{2.1}$$

where $C$ is a constant. If we are comfortable with the assumption that all the qualified invoices are evenly distributed in the population, $p_h$ is about the same across all the strata. We can, therefore, use the known $(S_{hx}^2 + \bar{X}_h^2)$. With the common factor $\sqrt{p(1 - p)}$ absorbed into the constant, equation (2.1) is reduced to:

$$N_h \sqrt{S_{hx}^2 + \bar{X}_h^2} = C, \quad h = 1, 2, \ldots, L. \tag{2.2}$$

Since $N_h \bar{X}_h = X_h$, the total invoice amount in stratum $h$, we can rewrite equation (2.2) as:

$$X_h \sqrt{1 + CV_{hx}^2} = C, \quad h = 1, 2, \ldots, L, \tag{2.3}$$

where $CV_{hx} = S_{hx} / \bar{X}_h$ is the coefficient of variation of $x$ for stratum $h$. Equation (2.3) leads to an important application in setting up stratum boundaries. First, it is straightforward to set up stratum boundaries under Neyman allocation using equation (2.3). Further, note that $CV_{hx}^2$ is much smaller than 1 in many accounting applications. Therefore, equation (2.3) can be approximated by $X_h = C$, $h = 1, 2, \ldots, L$, if the distribution of invoice amount $x$ is not highly skewed. In other words, since $X_h$ is the total value of the invoices in stratum $h$, equalizing the total invoice amount per stratum gives approximate stratum boundaries for the same sample size per stratum under Neyman allocation. To be more accurate, we may first set up stratum boundaries based on equal invoice amounts and then adjust the boundaries based on the coefficient of variation per stratum.
The above guidelines for optimum stratum boundaries also apply to population two described by equation (1.8), which is supported by equation (1.11). Note that there are many scenarios for population two and equation (1.8) is only one of them. The stratum boundaries for the same sample size per stratum under Neyman allocation may vary for different scenarios, but the equal invoice amount criterion can provide a useful approximation for other scenarios as long as the CV's are small and the qualified invoices are randomly scattered in the population. If the qualified percentage tends to increase or decrease with the invoice amount, we may incorporate information about different qualified percentages in different strata into equation (2.3). That is, we can set up stratum boundaries using:

$$X_h \sqrt{p_h (1 - p_h)(1 + CV_{hx}^2)} = C, \quad h = 1, 2, \ldots, L. \tag{2.4}$$
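The equal-total criterion can be sketched in a few lines of Python. The example sorts the invoice amounts, cuts the cumulative total into $L$ equal parts, and then reports $X_h \sqrt{1 + CV_{hx}^2}$ per stratum so that the size of any adjustment suggested by equation (2.3) is visible. The function names and the lognormal test data are assumptions made for illustration only.

```python
import random
import statistics

def equal_total_strata(x, L):
    """Split sorted invoice amounts into L strata with about equal total dollars."""
    xs = sorted(x)
    target = sum(xs) / L
    strata, current, cumulative = [], [], 0.0
    for xi in xs:
        current.append(xi)
        cumulative += xi
        # Close stratum k once the running total passes k * target;
        # the last stratum simply takes whatever remains.
        if len(strata) < L - 1 and cumulative >= target * (len(strata) + 1):
            strata.append(current)
            current = []
    strata.append(current)
    return strata

def boundary_criterion(stratum):
    """X_h * sqrt(1 + CV_hx^2), the left-hand side of equation (2.3)."""
    cv = statistics.stdev(stratum) / statistics.mean(stratum)
    return sum(stratum) * (1 + cv ** 2) ** 0.5

rng = random.Random(7)
# Hypothetical skewed invoice amounts; not the paper's simulation population.
x = [rng.lognormvariate(6.0, 1.0) for _ in range(3000)]
strata = equal_total_strata(x, 5)
totals = [sum(s) for s in strata]
criteria = [boundary_criterion(s) for s in strata]
for h, (s, X_h, c) in enumerate(zip(strata, totals, criteria), start=1):
    print(f"stratum {h}: N_h = {len(s):4d}, X_h = {X_h:12.0f}, criterion = {c:12.0f}")
```

Strata whose criterion value is noticeably larger than the others are candidates for a narrower boundary, which is the adjustment step described above.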

Sample Size Determination and Allocation
The above stratum boundary criterion yields equal stratum sample sizes for all strata. The sample size formula for population one is:

$$n = \frac{t^2 \left( \sum_{h=1}^{L} N_h S_{hd} \right)^2}{A^2 + t^2 \sum_{h=1}^{L} N_h S_{hd}^2}, \tag{3.1}$$

where $t$ is the t-value corresponding to the confidence level and $A$ is the desired absolute precision or margin of error. For population two, as described by the model of equation (1.8), the sample size is:

$$n = \frac{t^2 \left( \sum_{h=1}^{L} N_h S_{hd(z)} \right)^2}{B^2 + t^2 \sum_{h=1}^{L} N_h S_{hd(z)}^2}, \tag{3.2}$$

where $B$ is the desired absolute precision. Since $S_{hd(z)}^2 \approx S_{hd}^2 / 3$ by equation (1.11), we have:

$$n \approx \frac{t^2 \left( \sum_{h=1}^{L} N_h S_{hd} \right)^2}{3B^2 + t^2 \sum_{h=1}^{L} N_h S_{hd}^2}. \tag{3.3}$$

Comparing equations (3.1) and (3.3), the same sample size leads to $3B^2 = A^2$, that is, $B = A/\sqrt{3} \approx 0.58A$. In other words, the same sample size gives better precision for population two than for population one. For the assumed qualified percent $p$, the sample size needed to achieve a certain precision under population one is a conservative estimate of the sample size needed to achieve the same precision for some unknown scenario of population two. We should caution that it may sometimes be too conservative. As in the above analysis, the sample size calculated under population one gives a 42% shorter margin of error for the scenario described in equation (1.8).
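The relation $B \approx 0.58A$ can be verified numerically. The sketch below assumes the standard Neyman-allocation sample size for a stratified estimate of a total, $n = (\sum N_h S_h)^2 / (A^2/t^2 + \sum N_h S_h^2)$ (Cochran 1977, Chapter 5), and uses made-up stratum sizes and standard deviations; it then confirms that dividing every $S_{hd}$ by $\sqrt{3}$ and the margin by $\sqrt{3}$ leaves $n$ unchanged.

```python
import math

def neyman_sample_size(N_h, S_h, margin, t):
    """n = (sum N_h S_h)^2 / (margin^2 / t^2 + sum N_h S_h^2) under Neyman allocation."""
    numerator = sum(N * S for N, S in zip(N_h, S_h)) ** 2
    denominator = (margin / t) ** 2 + sum(N * S ** 2 for N, S in zip(N_h, S_h))
    return numerator / denominator

# Hypothetical stratum sizes and residual standard deviations S_hd (population one).
N_h = [1200, 700, 400, 250, 150]
S_hd = [40.0, 75.0, 120.0, 200.0, 450.0]
t = 1.645      # normal approximation for 90% confidence (an assumption)
A = 20000.0    # assumed absolute margin of error for population one

n_one = neyman_sample_size(N_h, S_hd, A, t)

# Population two under (1.11): every S_hd(z) is S_hd / sqrt(3).
S_hd_z = [s / math.sqrt(3) for s in S_hd]
B = A / math.sqrt(3)
n_two = neyman_sample_size(N_h, S_hd_z, B, t)

print(f"n (population one, margin A) = {n_one:.2f}")
print(f"n (population two, margin B) = {n_two:.2f}")
print(f"B / A = {B / A:.3f}")
```

The two sample sizes agree, so a design sized for margin $A$ under population one delivers a margin of roughly $0.58A$ under the scenario of equation (1.8).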

Simulation
The simulation population includes 3,231 invoices after removing the largest invoices, which are taken with certainty. Figure 2 gives the histogram of the invoice amount, the design variable $x$. The population is divided into five strata with equal stratum total dollar amounts on $x$. The population summary is presented in Table 1. Variable $y$ is created based on equation (1.1) to represent population one, and variable $z$ is created based on equation (1.8) to represent one of the scenarios in population two. The value $p = 0.2$ is used in creating variables $y$ and $z$.
The Neyman allocations across strata based on different variables are given in Table 2.
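The sample size allocation across strata would be best determined by the variable of interest, $y$ or $z$. In ratio type estimation, the Neyman allocation percentages are calculated for variable $y$ by

$$\frac{N_h S_{hd}}{\sum_{h} N_h S_{hd}}.$$

The results are given in column (a). Column (b) gives the Neyman allocation percentages based on variable $z$. These percentages are calculated using

$$\frac{N_h S_{hd(z)}}{\sum_{h} N_h S_{hd(z)}}.$$

By equations (1.7) and (1.11), when $p_h$ is about the same across strata, both expressions are approximated by

$$\frac{N_h \sqrt{S_{hx}^2 + \bar{X}_h^2}}{\sum_{h} N_h \sqrt{S_{hx}^2 + \bar{X}_h^2}}. \tag{4.1}$$

In summary, Neyman allocations can be calculated using equation (4.1) for both population one and population two.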
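To illustrate how an allocation built from $N_h \sqrt{S_{hx}^2 + \bar{X}_h^2}$ differs from one built from $N_h S_{hx}$ alone, the sketch below computes both sets of percentages for four small hypothetical strata that each total 200 in invoice amount. The data and function names are invented for the example, and the x-only weight is the assumed form of formula (4.1).

```python
import statistics

def allocation_percentages(weights):
    """Convert stratum weights into allocation percentages that sum to 100."""
    total = sum(weights)
    return [100.0 * w / total for w in weights]

def ratio_type_weights(strata):
    """N_h * sqrt(S_hx^2 + Xbar_h^2): x-only weight for ratio type data."""
    return [len(s) * (statistics.variance(s) + statistics.mean(s) ** 2) ** 0.5
            for s in strata]

def design_variable_weights(strata):
    """N_h * S_hx: Neyman allocation on the design variable x alone."""
    return [len(s) * statistics.stdev(s) for s in strata]

# Hypothetical strata with equal total invoice amounts (200 each).
strata = [
    [10, 12, 15, 18, 20, 22, 25, 28, 30, 20],
    [35, 40, 45, 38, 42],
    [60, 70, 70],
    [95, 105],
]

ratio_pct = allocation_percentages(ratio_type_weights(strata))
design_pct = allocation_percentages(design_variable_weights(strata))
for h, (a, b) in enumerate(zip(ratio_pct, design_pct), start=1):
    print(f"stratum {h}: ratio type {a:5.1f}%   x-only Neyman {b:5.1f}%")
```

With equal stratum totals the ratio type percentages come out nearly equal, while the allocation driven by $S_{hx}$ alone concentrates the sample in whichever strata happen to have the most spread in $x$.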
The allocation percentages across strata in column (a) are very close to one another, which indicates that an equal sample size across strata is appropriate. This confirms our earlier finding that stratum boundaries by equation (2.4) are well approximated by setting an equal invoice amount per stratum if the distribution of $x$ is not highly skewed and the qualifying percentage $p_h$ is about the same across all the strata.
The above simulation is based on $p = 0.2$. Other simulations using $p = 0.5$ and $p = 0.8$ lead to the same conclusion.
Using formula (3.1), the sample sizes needed to reach a relative precision of 10% at the 90% confidence level are given in Table 3. The sample sizes using Roberts' (1978) formula are obtained by substituting equation (1.7) into sample size formula (3.1). As shown in Table 3, Roberts' formula gives sample sizes very close to those obtained using the simulated variable $y$. The simulated variable $z$ achieves the same relative precision with smaller sample sizes. For many situations in practice, the variable of interest is between $y$ and $z$. Therefore, Roberts' formula gives somewhat conservative sample sizes for these situations. As the value of $p$ increases, the sample sizes decrease. However, even when the overall sample size needed to achieve the desired precision is very small, the stratum sample sizes should not be allowed to become too small, so as to limit bias and stabilize the variance estimation.
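The dependence of the sample size on $p$ can be sketched by plugging the approximation $S_{hd}^2 \approx p(1-p)(S_{hx}^2 + \bar{X}_h^2)$ into a Neyman-allocation sample size formula. The example below is only illustrative: the stratum summaries are hypothetical, the margin is taken as a relative precision times the expected qualified total $pX$, and $t = 1.645$ is a normal approximation for 90% confidence.

```python
import math

def roberts_s_hd(p, s_hx, xbar_h):
    """Approximate S_hd from x alone: sqrt(p(1-p)(S_hx^2 + Xbar_h^2))."""
    return math.sqrt(p * (1 - p) * (s_hx ** 2 + xbar_h ** 2))

def sample_size(p, N_h, xbar_h, s_hx, rel_precision, t):
    """Neyman-allocation sample size for a relative margin on the qualified total."""
    X = sum(N * m for N, m in zip(N_h, xbar_h))
    margin = rel_precision * p * X  # absolute margin, using E_p(Y) = p * X
    S = [roberts_s_hd(p, s, m) for s, m in zip(s_hx, xbar_h)]
    numerator = sum(N * Sh for N, Sh in zip(N_h, S)) ** 2
    denominator = (margin / t) ** 2 + sum(N * Sh ** 2 for N, Sh in zip(N_h, S))
    return math.ceil(numerator / denominator)

# Hypothetical stratum summaries with roughly equal stratum totals.
N_h = [2000, 700, 300, 150, 50]
xbar_h = [100.0, 285.0, 670.0, 1330.0, 4000.0]
s_hx = [40.0, 60.0, 150.0, 300.0, 1500.0]

sizes = {p: sample_size(p, N_h, xbar_h, s_hx, rel_precision=0.10, t=1.645)
         for p in (0.2, 0.5, 0.8)}
for p, n in sizes.items():
    print(f"p = {p}: n = {n}")
```

Because the relative margin scales with $p$ while the residual variance scales with $p(1-p)$, the required sample size falls as $p$ rises, which matches the pattern described for Table 3.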

Conclusion
For our special ratio type data, assuming the qualified amounts are randomly spread throughout the population, the stratum boundaries with equal stratum sample size under Neyman allocation can be obtained approximately by setting up equal total stratum amounts on the design variable $x$. The stratum boundaries can then be modified by considering the coefficient of variation of $x$ per stratum, using equation (2.3). Further modification can be made using equation (2.4) if there is prior knowledge about different values of $p$ for different strata. The sample size calculated from the Roberts (1978) formula tends to be conservative in practice for many scenarios of population two.

Future Work
We plan to analyze the effectiveness of different numbers of strata and the stratum sample size. For example, for a fixed sample size of 100 units, we may compare the setting of 4 strata with 25 units per stratum and the setting of

Figure 1: Population one (left part) and population two (right part)

Figure 2: Histogram of the simulated population

Table 1: Simulation Population Summary by Stratum

Table 2: Neyman Allocation Comparison

Formula (4.1) involves only the known values of variable $x$. The results are shown in column (c) of Table 2. Comparing the numbers in column (c) to those in column (a), there are only minor differences. Therefore, we can achieve Neyman allocation with respect to the variable of interest ($y$ or $z$) at the design stage without knowing the variable of interest. As a comparison, the Neyman allocation percentages with respect to the design variable $x$, using $N_h S_{hx} / \sum_h N_h S_{hx}$, are also presented in column (d). The numbers in column (d) are quite different from those in the other three columns. This indicates that the Neyman allocation based on the variance of the design variable $x$ alone is very inefficient. It under-allocates for certain strata and over-allocates for other strata by a large degree.

Table 3: Sample Size Comparison
Columns: assumed p; sample size using simulated y; sample size using simulated z; sample size using Roberts' formula.