Sampling Random Variables: A Paradigm Shift for Opinion Polling

Conventional sampling in biostatistics and economics posits an individual in a fixed observable state (e.g., diseased or not, poor or not). Social, market, and opinion research, however, require a cognitive sampling theory that recognizes that a respondent has a choice between two options (e.g., yes versus no). This new theory posits the survey respondent as a personal probability. Once the sample is drawn, a series of independent non-identically distributed Bernoulli trials is carried out, the outcome of each trial being a momentary binary choice governed by this unobserved probability. Liapunov's extended central limit theorem (Lehmann, 1999) and the Horvitz-Thompson (1952) theorem are then brought to bear on sampling unobservables, in contrast to sampling observations. This formulation reaffirms the usefulness of a weighted sample proportion, which is now seen to estimate a different target parameter than that of conventional design-based sampling theory.


Introduction
Estimating a population proportion is commonplace in psychological assessment, experimentation, and opinion surveys. In this vast area of research and application the target parameter is invariably posed as a single probability that governs a sequence of binary respondent outcomes such as "right vs. wrong", "agree vs. disagree", etc. The classical central limit theorem is then used to support the normality of the sample proportion.
Oddly, the above status quo in social research and policy application ignores a well-known fact; namely, the existence of broad individual differences on almost any psychological variable imaginable. From a statistical point of view, data collectors and analysts have been content with the central limit theorem of Laplace in 1810 rather than progressing to that of Liapunov in 1901. (These sources are referenced by Lehmann, 1999, pp. 600-601.) When applied to opinion polling, the earlier theorem assumes a population of individuals, each having an a priori agreement or disagreement with a statement not yet heard. Thus the presentation of this statement to a sample of individuals from this population is tantamount to the selection of red and white balls from an urn, with the goal of estimating the proportion of red. That is, successive individual-by-individual response solicitations are regarded as independent identically distributed (i.i.d.) Bernoulli trials, each with a common success probability equal to the population proportion.
The present paper argues for the application of the more realistic Liapunov central limit theorem. This relaxes the status quo to independent non-identically distributed (i.n.d.) Bernoulli trials, each with an individual-specific (case) weight and response probability. First, a sample of size n is drawn without replacement from a population of N individuals. The sample design determines the inclusion probability for each individual in the population. Next, conditioning on the selected sample, the data collection consists of n solicited i.n.d. Bernoulli responses. Rather than revealing a predetermined individual state, each Bernoulli trial generates a momentary response (a one or zero) driven by an individual's unobserved probability. Given this alternative representation of the survey respondent, the Liapunov central limit theorem and the Horvitz-Thompson (1952) theorem are then invoked to estimate the mean of the population of personal probabilities.
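The two-stage process just described can be sketched in a short simulation. The population size, sample size, and personal probabilities p_i below are invented for illustration; only the structure (sampling without replacement, then one Bernoulli trial per sampled individual) comes from the text.

```python
import random

random.seed(1)

# Hypothetical population: each of N individuals carries an unobserved
# personal probability p_i of answering "yes" (values invented for illustration).
N, n = 1000, 100
p = [random.uniform(0.2, 0.8) for _ in range(N)]

# Stage 1: draw a sample of n individuals without replacement.
sample = random.sample(range(N), n)

# Stage 2: conditional on the sample, solicit n i.n.d. Bernoulli responses;
# each trial reveals a momentary 1 or 0, not a predetermined fixed state.
responses = [1 if random.random() < p[i] else 0 for i in sample]

y_bar = sum(responses) / n   # observed sample proportion
P_N = sum(p) / N             # target parameter: mean of the N personal probabilities
print(y_bar, round(P_N, 3))
```

The observed data are still zeros and ones, but the quantity being estimated is the mean of the unobservable p_i, not a fixed population proportion of ones.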
Section 2 lays out a triangular array of i.n.d. random variables and its central limit theorem. Section 3 defines a weighted Bernoulli variate and inserts it into this array. Section 4 restricts this formulation to design-based sampling (without replacement) from a finite population of random variables and interprets it accordingly. This gives the conditional and unconditional expectations of the (approximately normal) sample mean of n weighted Bernoulli variates, along with its conditional variance. Section 5 treats the important and frequently used case of self-weighted, or epsem (equal probability of selection method), samples. In this case the classical estimate of the standard error of the sample proportion is advocated as reasonable rather than the severe underestimate it is widely believed to be. Finally, Section 6 gives some concluding remarks relevant to this new world of survey sampling.

The Central Limit Theorem for I.N.D. Random Variables
We begin with a sequence of sets of random variables of size n in a triangular array, where n → ∞. The following notation will be used to describe Y_n1, Y_n2, ..., Y_nn:

Ȳn = (1/n) Σ_{i=1}^{n} Y_ni,   s_n² = (1/n) Σ_{i=1}^{n} Var(Y_ni).

Using this notation, a convenient lemma to Liapunov's theorem (cf. Lehmann, 1999, pp. 97-102, 571-573) may then be invoked:

Lemma 2.2.1. If the random variables within each row of the array are independent, then

(2.1)   √n (Ȳn − E(Ȳn)) / s_n → N(0, 1) in distribution,

provided that

(2.2)   the Y_ni are uniformly bounded, |Y_ni| ≤ M for all i and n, and

(2.3)   s_n² is bounded away from zero, s_n² ≥ c for some constant c > 0

(Lehmann, 1999, pp. 98, 101). The more restrictive conditions (2.2) and (2.3) are used here because they are easily satisfied by the weighted Bernoulli variates defined next.
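The content of the lemma can be checked by simulation. The sketch below uses arbitrarily chosen probabilities, bounded away from 0 and 1 as the conditions require, standardizes sums of i.n.d. Bernoulli trials, and confirms that the standardized sums are close to standard normal.

```python
import random
import statistics

random.seed(0)

# i.n.d. Bernoulli trials with unequal probabilities p_i (illustrative values),
# bounded away from 0 and 1 as condition (2.3) requires.
n = 50
p = [0.2 + 0.6 * i / (n - 1) for i in range(n)]   # p_i ranges over [0.2, 0.8]
mu = sum(p)                                        # mean of the sum
s = sum(q * (1 - q) for q in p) ** 0.5             # sd of the sum

# Standardize the sum of the n i.n.d. trials, many times over.
reps = 20000
z = []
for _ in range(reps):
    total = sum(1 for q in p if random.random() < q)
    z.append((total - mu) / s)

# By the lemma, the standardized sums should be close to N(0, 1).
print(round(statistics.mean(z), 2), round(statistics.stdev(z), 2))
```

Note that the trials are independent but not identically distributed, which is exactly the situation the classical (Laplace) theorem does not cover.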

I.N.D. Bernoulli Trials
For individual i let X_ni be a Bernoulli variate taking the values 1 or 0 with probabilities p_ni and 1 − p_ni. Now define i's weighted Bernoulli variate as

(3.1)   Y_ni = w_ni X_ni,

where w_ni > 0 and the weights sum to the sample size, Σ_{i=1}^{n} w_ni = n. The value taken by Y_ni is uniformly bounded, satisfying (2.2). Moreover, in this binary situation

(3.2)   s_n² = (1/n) Σ_{i=1}^{n} w_ni² p_ni (1 − p_ni)

satisfies (2.3) when the p_ni are bounded away from zero and one. That is, suppose there exists a constant a > 0 such that a < p_ni < 1 − a for all i and n. Also, the weights w_ni vary about one, and there exists a constant b > 0 such that w_ni > b for all i and n. Then s_n² > b² a(1 − a) > 0, and condition (2.3) holds (cf. Lehmann, 1999, p. 99).
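A quick numerical check of the lower bound just derived. The weights and probabilities below are invented for illustration, with a = 0.2 (all p_i in (a, 1 − a)) and b = 0.4 (all w_i > b) as assumed constants.

```python
# Check of condition (2.3) for weighted Bernoulli variates, using invented
# weights and probabilities with a = 0.2 and b = 0.4.
n = 5
w = [0.5, 0.8, 1.0, 1.2, 1.5]        # weights vary about one and sum to n
p = [0.25, 0.40, 0.50, 0.60, 0.75]   # bounded away from 0 and 1
assert abs(sum(w) - n) < 1e-12

a, b = 0.2, 0.4
s2 = sum(w[i] ** 2 * p[i] * (1 - p[i]) for i in range(n)) / n   # (3.2)
lower = b ** 2 * a * (1 - a)         # the bound b^2 * a * (1 - a) = 0.0256
print(s2 > lower)                    # prints True
```

Any choice of weights and probabilities meeting the stated constants keeps s_n² away from zero, which is all that (2.3) asks.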

A finite triangular array
With conditions (2.2) and (2.3) satisfied, the limiting distribution (2.1) is now established for the n weighted Bernoulli variates in (3.1). These i.n.d. random variables Y_n1, ..., Y_ni, ..., Y_nn in lemma 2.2.1 are now interpreted as arising from a particular sample from a finite population. Therefore we must now restrict the array in Section 2 to a finite set of random variables and interpret these as a sequence of samples of size n = 1, ..., N. This setup is depicted by the array

Y_11
Y_21  Y_22
...
Y_n1  ...  Y_ni  ...  Y_nn
...
Y_N1  ...  Y_Ni  ...  Y_NN,

where the n-th member of this sequence is a sample of n random variables drawn (without replacement) from the N-th member, which is a population of N random variables. It is reiterated that respondent i in this n-th sample is represented here as a random variable Y_ni upon which an observation is to be realized during later response solicitation.
Curtailing the infinite sequence in Section 2 to the finite population Y_N1, ..., Y_Ni, ..., Y_NN weakens the asymptotic normality of Ȳn to its approximate normality. Also, any simple or complex sampling design determines a probability π_ni > 0 that individual i is included in the sample. For example, if the design is self-weighting this inclusion probability is n/N for each individual in the population. (See Section 5.) Finally, in this special sampling case of lemma 2.2.1, the weights of the i.n.d. Bernoulli variates take the form

(4.1)   w_ni = n / (N π_ni),

where again w_ni > 0 and Σ_{i∈S} w_ni = n, the sum being over the individuals i in the selected sample S.

The conditional and unconditional expectations of Ȳn
We now condition on the i.n.d. random variables Y_n1, ..., Y_ni, ..., Y_nn actually drawn, observing that their (approximately) normal mean,

(4.2)   Ȳn = (1/n) Σ_{i∈S} Y_ni,

has the sample-specific expectation

(4.3)   p̄_n = E(Ȳn | S) = (1/n) Σ_{i∈S} w_ni p_ni = (1/N) Σ_{i∈S} p_ni / π_ni.

The sample sum in (4.3) is, by the Horvitz-Thompson (1952) theorem, an unbiased estimate of the total of the N response propensities p_Ni in the population. This is stated in the following lemma:

Lemma 4.2.1. Let T_N = Σ_{i=1}^{N} p_Ni be the sum of the N unobserved probabilities in the population. Then

(4.4)   E( Σ_{i∈S} p_Ni / π_ni ) = T_N.

The Horvitz-Thompson estimator in the parentheses in (4.4) is an unbiased estimate of the population total for an arbitrary sampling design (Thompson, 1997, pp. 12-15; Lohr, 1999, pp. 196-199, 204-210). Thus lemma 4.2.1 implies that the sample-specific expectation p̄_n under lemma 2.2.1 itself has the expectation E(p̄_n) = T_N / N = P_N over all samples of size n. This latter expectation is the population mean of the N unobservable probabilities p_Ni. This population mean is also the unconditional expectation of the observed sample mean Ȳn in lemma 2.2.1.
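Lemma 4.2.1 can be illustrated by simulation. The sketch below uses a self-weighting design, so every inclusion probability is n/N and the Horvitz-Thompson estimate of the total is (N/n) times the sample sum; the population probabilities are invented for illustration.

```python
import random

random.seed(3)

# Invented population of N unobserved probabilities.
N, n = 200, 20
p = [random.uniform(0.1, 0.9) for _ in range(N)]
total = sum(p)   # the target total T_N of the response propensities

# Under simple random sampling each inclusion probability is pi_i = n/N,
# so the Horvitz-Thompson estimate of the total is (N/n) * (sample sum).
reps = 5000
ht = []
for _ in range(reps):
    s = random.sample(range(N), n)
    ht.append((N / n) * sum(p[i] for i in s))

# Averaging over many samples recovers the population total (unbiasedness).
mean_ht = sum(ht) / reps
print(round(mean_ht, 1), round(total, 1))
```

The individual p_i are never observed in practice; the point of the lemma is that unbiasedness holds for the unobservable propensities exactly as it does for observable values.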

A population census
In a census n = N, and every inclusion probability π_Ni equals one. Substituting the inclusion probability of one for π_Ni in (4.3) gives w_Ni = 1 and

E(ȲN) = (1/N) Σ_{i=1}^{N} p_Ni = P_N.

Interestingly, the target parameter P_N, which is the mean of the population of N personal probabilities, is not realized but only expected in a census. That is, in a census the mean ȲN is still a random variable because each individual response Y_Ni is stochastic, taking the value 0 with probability 1 − p_Ni and the value 1 with probability p_Ni, for i = 1, ..., N. With w_Ni = 1 the variance of this census mean is easily seen to be

(4.6)   Var(ȲN) = (1/N²) Σ p_Ni (1 − p_Ni),

where the summation is over i = 1, ..., N for the entire population. The minuscule variance in (4.6) shows that the census mean ȲN is a random variable that is distributed tightly around the target parameter P_N. In contrast, in standard design-based surveys (Lohr, 1999) the census mean is the fixed proportion of 1's (versus 0's) in the population.
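A quick simulation of repeated censuses (with invented probabilities) illustrates both points: the census mean varies from one realization to the next, yet its variance matches the minuscule value in (4.6).

```python
import random
import statistics

random.seed(4)

# Invented population of N personal probabilities.
N = 300
p = [random.uniform(0.3, 0.7) for _ in range(N)]

# Theoretical census variance (4.6): (1/N^2) * sum p_i (1 - p_i).
theory = sum(q * (1 - q) for q in p) / N ** 2

# Repeat the census many times; even with every individual included,
# the census mean is still random because each response is a Bernoulli trial.
reps = 5000
means = [sum(1 for q in p if random.random() < q) / N for _ in range(reps)]
sim = statistics.pvariance(means)
print(round(theory, 6), round(sim, 6))
```

Since p(1 − p) ≤ 1/4, the census variance is at most 1/(4N), which makes precise the sense in which ȲN is "distributed tightly" around P_N.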

The conditional variance of Ȳn
The variance of the sample mean Ȳn in (4.2), which is conditioned on the sample, may alternatively be expressed as

(4.7)   Var(Ȳn | S) = s_n² / n = (1/n²) Σ_{i∈S} w_ni² p_ni (1 − p_ni),

where s_n² is given in (3.2). Writing (4.7) as

(4.8)   Var(Ȳn | S) = (1/n) [ (1/n) Σ_{i∈S} w_ni² p_ni (1 − p_ni) ],

the variance of the sample mean is seen to be the mean of the n variances divided by the sample size n. This is a generalization of the classic special case, where the variance of the sample mean is the (single) population variance divided by the sample size. Finally, the Horvitz-Thompson (1952) theorem can also be applied to the unobserved individual variances p_ni(1 − p_ni) in (4.8):

(4.9)   E( Σ_{i∈S} p_ni (1 − p_ni) / π_ni ) = Σ_{i=1}^{N} p_Ni (1 − p_Ni).

Writing the expectation of the variance of the sample mean in (4.8) as

(4.10)   E[Var(Ȳn | S)] = (1/N²) Σ_{i=1}^{N} π_ni⁻¹ p_Ni (1 − p_Ni),

it is seen to exceed the census variance in (4.6) due to the multiplier π_ni⁻¹ > 1 in (4.10). In the important case of self weighting in Section 5, π_ni⁻¹ = N/n and therefore

(4.11)   E[Var(Ȳn | S)] = (1/(nN)) Σ_{i=1}^{N} p_Ni (1 − p_Ni) = (N/n) Var(ȲN).

Equation (4.11) shows that the conditional variance of the mean Ȳn is of a different order of magnitude than the (minuscule) variance of the census mean ȲN. A particular estimate of Var(Ȳn) is suggested below for the case of self weighting.

Self Weighting
Complex surveys commonly use stratified multi-stage sampling with all units selected with probability proportional to size except at the final stage. In this last stage a fixed number of individuals are drawn from the last unit (e.g., voting district) by simple random sampling without replacement. This sampling design is self-weighting in the sense that each individual in the population has the same probability n/N of being included in the sample (Skinner, Holt, and Smith, 1989, pp. 16, 40; Thompson, 1997, pp. 12-15). Other types of epsem designs are used in random-digit-dialing telephone surveys in marketing research. Self-weighting also occurs in simple surveys, where n individuals are drawn directly from a population of size N by simple random sampling without replacement.

The conditional epsem variance of Ȳn
Substituting n/N for π_ni in (4.1) gives w_ni = 1. Replacing w_ni, in turn, by one in (3.2) and (4.7) gives

(5.1)   Var(Ȳn | S) = (1/n²) Σ_{i∈S} p_ni (1 − p_ni).

Formula (5.1) is also found by substituting n/N for π_ni in (4.8). It is then easily shown that

(5.2)   Σ_{i∈S} p_ni (1 − p_ni) = n p̄_n (1 − p̄_n) − Σ_{i∈S} (p_ni − p̄_n)².

Replacing π_ni by n/N in (4.3) reveals that p̄_n in (5.2) has the structure

p̄_n = (1/n) Σ_{i∈S} p_ni,

which is the mean of the n individual probabilities controlling the sampled Bernoulli variates Y_n1, ..., Y_ni, ..., Y_nn. Finally, dividing both sides of (5.2) by n² gives

(5.3)   Var(Ȳn | S) = [ p̄_n (1 − p̄_n) − (1/n) Σ_{i∈S} (p_ni − p̄_n)² ] / n,

showing that the conditional variance of Ȳn increases as p_n1, ..., p_nn become more homogeneous, maximizing when these probabilities are all equal. Therefore, p̄_n(1 − p̄_n)/n is an upper bound for Var(Ȳn | S) in (5.3).
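The decomposition in (5.2) is a purely algebraic identity, easily verified numerically; the probabilities below are arbitrary illustrative values.

```python
# Identity behind (5.2): sum p_i(1-p_i) = n*pbar*(1-pbar) - sum (p_i - pbar)^2.
p = [0.2, 0.35, 0.5, 0.65, 0.8]   # arbitrary illustrative probabilities
n = len(p)
pbar = sum(p) / n

lhs = sum(q * (1 - q) for q in p)
rhs = n * pbar * (1 - pbar) - sum((q - pbar) ** 2 for q in p)
print(abs(lhs - rhs) < 1e-12)     # prints True: the identity holds exactly

# The heterogeneity term is nonnegative, so pbar*(1-pbar)/n bounds the
# conditional variance (5.3) from above, with equality when all p_i coincide.
var_cond = lhs / n ** 2           # (5.1)
bound = pbar * (1 - pbar) / n
print(var_cond <= bound)          # prints True
```

The identity makes the direction of the bias transparent: heterogeneity in the p_ni can only subtract from p̄_n(1 − p̄_n), never add to it.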

Inferences from Ȳn to p̄_n and P_N
Equation (5.3) suggests the classical statistic Ȳn(1 − Ȳn)/n as an overestimate of the conditional (sample-dependent) variance in the self-weighted case. Moreover, this conservative variance estimate holds for all sample sizes up to and including the population size N. Thus, even in a census, where n = N, the statistic ȲN(1 − ȲN)/N is an overestimate of Var(ȲN) in (4.6). Overestimating Var(Ȳn) in (5.3) as Ȳn(1 − Ȳn)/n sets up the very conservative confidence interval

Ȳn ± 1.96 √( Ȳn (1 − Ȳn) / n ),

whose coverage is greater than 95% for p̄_n. This interval is less conservative for covering P_N, which is generally more distant from Ȳn than p̄_n is. Finally, it is well known that Ȳn(1 − Ȳn)/n severely underestimates the variance of the sample proportion in conventional complex sampling, where each fixed-state respondent is represented by a 0 or 1. In this standard design-based situation the true variance of the proportion is inflated by the homogeneous clustering of 0's and 1's in the population. In contrast, the present construction represents the respondent by a Bernoulli trial that is driven by an unobserved personal probability. In this alternative, more realistic representation of the respondent, the statistic Ȳn(1 − Ȳn)/n is a reasonable estimate of the unconditional variance of Ȳn over all samples of size n.
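The conservatism of the classical formula can be seen in a simulation sketch. For a fixed sample with heterogeneous (invented) probabilities in the self-weighted case, the simulated conditional variance of Ȳn matches (5.3) and sits below the p̄(1 − p̄)/n bound that the classical statistic estimates.

```python
import random
import statistics

random.seed(5)

# A fixed sample of n heterogeneous personal probabilities (invented values).
n = 40
p = [0.1 + 0.8 * i / (n - 1) for i in range(n)]
pbar = sum(p) / n

# True conditional variance (5.3) and the classical p(1-p)/n bound.
true_var = (pbar * (1 - pbar) - sum((q - pbar) ** 2 for q in p) / n) / n
bound = pbar * (1 - pbar) / n   # what Ybar(1-Ybar)/n estimates, an overestimate

# Simulate repeated response solicitations from this same fixed sample.
reps = 20000
ybars = [sum(1 for q in p if random.random() < q) / n for _ in range(reps)]
sim_var = statistics.pvariance(ybars)
print(round(true_var, 5), round(sim_var, 5), round(bound, 5))
```

The gap between the bound and the true conditional variance is exactly the heterogeneity term of (5.2) divided by n², so the more the p_ni differ, the more conservative the classical interval becomes.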

Discussion
The present paper advocates the design-based sampling of random Bernoulli variates versus the conventional design-based sampling of 1's and 0's. Both procedures produce an observed sample of 1's and 0's and a sample proportion, but their generating processes are very different. A sample of random variables, with subsequent Bernoulli trials, gives a sample proportion that estimates the mean of a population of probabilities. In contrast, numerical samples give a sample proportion that estimates the mean of a population of 1's and 0's. Although these two sample proportions are computed identically, their variances and target parameters are quite distinct.
In the case of Bernoulli variate sampling, individual differences are treated in two senses. First, they are regarded as varying response dispositions governed by individual-specific probabilities p_ni. Second, these dispositions are attended by individual-specific weights w_ni that differentially represent the individuals in a population. In this setup individual i's personal probability takes a value in the open interval (0, 1), in contrast to i being (extremely) represented by either zero or one. Hence, the sampling is from a population of unobservable probabilities rather than a population of observable zeros and ones. The latter convention is appropriate in biomedical and economic research, where individual i is in a fixed and noticeable state, such as diseased or not, poor or not, etc. In social, market, and opinion research, however, an individual has a choice of responding one way or the other. In the present formulation, this choice is under the control of a personal response disposition p_ni that is activated upon stimulus presentation. The response observed is still a zero or one, but now these two values are taken by a random Bernoulli variate at the individual level.
An important strength of the present approach is that the individual propensity p_ni remains unobserved, allowing us to sidestep its estimation by complex numerical iterations or lengthy experimental replications. For example, item response theory requires computationally intensive methods to estimate distinct individual probabilities for saying "yes" to a survey question. Similarly, signal detection theory uses arduous replications to estimate subject-specific probabilities for saying "yes" that a tone is present amid noise. Calculations such as these may be necessary for individual evaluation in psychology and education, but the Liapunov and Horvitz-Thompson theorems allow us to circumvent them for group assessment at the population level.
Finally, equation (5.3) suggests a conventional estimate of Var(Ȳn) in the context of the sampling theory developed in Sections 2 through 5. In the case of self weighting, the homogeneity of the p_ni reduces the second term in (5.3), increasing Var(Ȳn). Thus the estimate of this conditional variance, suggested in Section 5.2, is very conservative, and it provides a reasonable estimate of the unconditional variance of Ȳn. In the unweighted case this is a reassuring property for an ordinary sample mean, because the overestimate Ȳn(1 − Ȳn)/n of Ȳn's conditional variance holds for all sample sizes up to and including the population size N. In contrast, this classical statistic severely underestimates the variance of the sample mean in standard complex sampling from populations with homogeneous clusters of zeros and ones. Thus, replacing a respondent's spurious zero or one by a Bernoulli trial driven by his or her personal probability provides a fresh look at the important issue of variance estimation in opinion surveys.