Distribution-Free Regression: Reinterpreting Design-Based Sampling

An individual in a finite population is represented by a random variable whose expectation is linearly composed of explanatory variables and a personal effect. This expectation locates her (his) random variable on a scale when s(he) responds to a questionnaire item or physical instrument. This formulation reinterprets design-based sampling, which represents an individual as a constant waiting to be observed. Retaining constant expectations, however, along with fixed realizations of random variables, preserves and strengthens design-based theory through the Horvitz-Thompson (1952) theorem. This interpretation reaffirms the usual design-based regression estimates, whose normality is seen to be free of any assumptions about the distribution of the outcome variable. It also formulates response error in a way that renders a superpopulation, postulated by model-based sampling, unnecessary. The value of distribution-free regression is illustrated with an analysis of American presidential approval.


A Paradigm Shift for Survey Sampling
Design-based sampling postulates an individual in a fixed observable state that may be subjectively measured as a discrete rating, e.g., 0 1 2 3 4, or physically measured, say, on a continuous blood pressure scale of 0 to 300 mmHg. Thus, the value recorded in a survey interview or clinical trial is regarded as a fixed number in waiting. More realistically, however, an individual may be represented as a random variable that is realized in response to a questionnaire item or physical instrument.
The present paper favors this more plausible interpretation and extends Bechtel's (2005) treatment of survey proportions to survey regressions of any discrete or continuous dependent variable. The individual is posited here as a pair of fixed parameters; namely, a mean and variance that (partially) determine an idiosyncratic probability distribution. Each mean is composed of individual-specific explanatory values along with an individual effect. Thus, a population of $N$ individuals generates realizations of $N$ non-identical distributions. Each individual realization is a momentary numerical value governed by an idiosyncratic mean and variance (and perhaps higher moments).
This approach brings the Horvitz-Thompson theorem to bear on a sample (without replacement) of $n$ realizations from a momentary population of $N$ realizations. It also enhances design-based regression, whose intercept and slopes are normally distributed (over samples) for any idiosyncratic distributions (over realizations) that prevail when humans respond to survey items and instruments.
Section 2 describes doubly bounded random variables, and Section 3 lays out a survey regression that "explains" their expectations. Section 4 demonstrates the asymptotic normality of the estimated regression effects in the presence of any distributions that underlie the survey responses. Section 5 completes the present formulation with a treatment of variance estimation in this new context.
This paradigm is then applied in Section 6 to the important case of the population mean, which is simply a regression intercept in the absence of explanatory variables. Section 7 describes STATA commands for computing a distribution-free regression, and Section 8 illustrates this computation with American polling data. Section 9 sums up renewed design-based regression, noting its broad reach across opinion polling, economic surveys, and clinical trials.

Stochastic response error
For each individual $i = 1, \ldots, N$ in a population let $Y_i$ be a random variable such that
$$Y_i = \eta_i + E_i,$$
where $E_i$ is a response error with $E(E_i) = 0$ and $\mathrm{Var}(E_i) = \sigma_i^2$, for $i = 1, 2, \ldots, N$.
The expectation and variance of the $E_i$ are understood to be over realizations of non-identically distributed random errors $E_1, \ldots, E_N$. In a survey the random variable $Y_i$ may take the discrete values 0 1 2 3 4 5 6 for the responses terrible, unhappy, mostly dissatisfied, mixed, mostly satisfied, pleased, or delighted to a question about life quality (Andrews and Withey, 1976).
In a clinical trial $Y_i$ may take any value in the interval from 0 to 300 mmHg on a blood pressure scale. This continuous random variable, like the preceding discrete one, generates a measurement $Y_i$ that departs from individual $i$'s true value $\eta_i$. The expectation $\eta_i$ denotes a personal location on the response scale, and the standard deviation $\sigma_i$ (over realizations) denotes a personal uncertainty. For example, in a public opinion survey a low $\sigma_i$ implies the crystallization of one's attitude. In a clinical trial a low $\sigma_i$ denotes stability (or consistency) of one's blood pressure.
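As an illustration of the personal parameters just described, the following minimal sketch computes the location and uncertainty of one hypothetical respondent on the seven-point life-quality scale. The response probabilities are invented for illustration and are not taken from any survey; the function name `personal_parameters` is likewise an assumption.

```python
from math import sqrt

def personal_parameters(values, probs):
    """Return (eta, sigma): the mean and standard deviation over realizations."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    eta = sum(y * p for y, p in zip(values, probs))
    var = sum((y - eta) ** 2 * p for y, p in zip(values, probs))
    return eta, sqrt(var)

# Hypothetical respondent on the 0..6 scale (assumed probabilities).
scale = [0, 1, 2, 3, 4, 5, 6]
probs = [0.0, 0.0, 0.05, 0.15, 0.50, 0.25, 0.05]
eta_i, sigma_i = personal_parameters(scale, probs)
```

The expectation need not coincide with any admissible response value: here the respondent can only answer with an integer, while the personal location lies between two scale points.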

A linear characterization of $\eta_i$
Let $\eta_p = [\eta_1, \ldots, \eta_N]^T$ be the vector of expectations in population $p$ to be "explained" by an $N \times (k + 1)$ population matrix $X_p$. The vector $\eta_p$ and matrix $X_p$ define the finite population characteristic
$$\beta = \beta(X_p, \eta_p) = (X_p^T X_p)^{-1} X_p^T \eta_p, \tag{2.1}$$
which is the target parameter here. The function $\beta(X_p, \eta_p)$ in (2.1) then defines
$$\alpha_p = \eta_p - X_p \beta, \tag{2.2}$$
which is a population vector $[\alpha_1, \ldots, \alpha_N]^T$ of residual idiosyncratic effects on $[\eta_1, \ldots, \eta_N]^T$ over and above $X_p$. The $\eta_i$ and $\alpha_i$ are well-defined hidden variables in the present analysis. In this setup, then, each individual $i = 1, \ldots, N$ in population $p$ is represented as
$$\eta_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki} + \alpha_i, \tag{2.3}$$
where $\eta_i$ is $i$'s expected response to a survey instrument, $X_{1i}, \ldots, X_{ki}$ are $i$'s values of $k$ variables that carry $\eta_i$ through their effects $\beta_1, \ldots, \beta_k$, and $\alpha_i$ is $i$'s residual effect on $\eta_i$.
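The finite population target and the residual effects can be sketched for the simplest case of one explanatory variable. The population values below are invented, and the closed-form simple regression stands in for the general matrix expression in (2.1); it is a sketch under those assumptions only.

```python
def population_beta(x, eta):
    """Least-squares intercept and slope of eta on x over the whole population."""
    n = len(x)
    x_bar = sum(x) / n
    eta_bar = sum(eta) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (ei - eta_bar) for xi, ei in zip(x, eta))
    beta1 = sxy / sxx
    beta0 = eta_bar - beta1 * x_bar
    return beta0, beta1

# Hypothetical population of N = 5 individuals (values invented for illustration).
x_p = [0, 1, 2, 3, 3]
eta_p = [0.4, 1.1, 1.7, 2.9, 2.4]
beta0, beta1 = population_beta(x_p, eta_p)

# Residual idiosyncratic effects, as in (2.2): alpha_i = eta_i - (beta0 + beta1 * x_i).
alpha_p = [e - (beta0 + beta1 * xi) for xi, e in zip(x_p, eta_p)]
```

By construction the residual effects sum to zero over the population and are orthogonal to the explanatory variable, so they carry only what the explanatory variable leaves unexplained.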

Weighting data in the presence of nonresponse
Missing data for all variables in $Y_i, X_{1i}, \ldots, X_{ki}$, called unit nonresponse, results in an absent survey protocol. For $i = 1, \ldots, N$ let $\pi_i$ be the probability of $i$'s inclusion in a selected sample and $\phi_i$ be the probability of $i$'s survey participation given s(he) has been drawn. Regarding $i$'s participation as the last (self-selection) stage of sampling, the probability that s(he) is in the subsample of $n$ observed realizations is $\pi_i \phi_i$. In the sequel sample $s$ denotes this subsample of $n$ survey participants, each with case weight $w_i = 1/(\pi_i \phi_i)$. This case weight adjusts the sample design weight $1/\pi_i$ upward by the factor $1/\phi_i$ to compensate for any population under-representation in the sample $s$ of observed realizations (Särndal and Lundström, 2005, pp. 43-44, 49-53).
The present reinterpretation of design-based sampling holds strictly under true case weights $w_i = 1/(\pi_i \phi_i)$. The probability $\phi_i$ (unlike $\pi_i$), however, is not known and is usually estimated for each unit in sample $s$ by "weighting class" or "poststratification" adjustments (Lohr, 1999, pp. 264-272). If these $n$ estimates approximate the true participation probabilities $\phi_i$, the formulas below give nearly unbiased regression coefficients when the number of respondents $n$ is large. Nevertheless, Lohr (1999, p. 272) cautions that "Weights may improve many of the estimates, but they rarely eliminate all nonresponse bias."
Missing data for some variables in $\{Y_i, X_{1i}, \ldots, X_{ki} : i \in s\}$, called item nonresponse, gives an incomplete protocol for individual $i$ in a survey regression. Various imputation procedures are available for filling in missing values in incomplete protocols (Lohr, 1999, pp. 272-278). However, because the theory here assumes that all responses have been realized in $\{Y_i, X_{1i}, \ldots, X_{ki} : i \in s\}$, imputation can add bias to the regression formulas below. Even so, Lohr (1999, p. 277) notes, "If the nonresponse is missing at random given the covariates used in the imputation procedure, imputation substantially reduces the bias due to item nonresponse". Section 8.3 uses a regression imputation that avoids the loss of 29% of the cases due to missing item data in an American national survey (StataCorp., 2001, Volume 2, pp. 69-73; Särndal and Lundström, 2005, pp. 153-155, 158-161).
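The case weight defined above can be sketched as a small function. The inclusion and participation probabilities used here are hypothetical, and `case_weight` is an illustrative name rather than part of any survey package.

```python
def case_weight(pi_i, phi_i):
    """Case weight w_i = 1 / (pi_i * phi_i) for one respondent."""
    if not (0 < pi_i <= 1 and 0 < phi_i <= 1):
        raise ValueError("probabilities must lie in (0, 1]")
    return 1.0 / (pi_i * phi_i)

# Hypothetical respondent: drawn with probability 0.001, participates with probability 0.5.
design_weight = 1.0 / 0.001      # sample design weight 1/pi_i
w = case_weight(0.001, 0.5)      # design weight adjusted upward by 1/phi_i = 2
```

In practice the participation probability is unknown and would be replaced by an estimate, so the resulting weight is itself only approximate.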

Estimating the target parameter β
Each element of the $(k + 1) \times (k + 1)$ matrix $X_p^T X_p$ is a population sum of products, as is each element of the $(k + 1) \times 1$ vector $X_p^T \eta_p$ (Lohr, 1999, p. 360). Therefore, due to Horvitz and Thompson (1952), unbiased estimates of these matrices are given by $X_s^T W_s X_s$ and $X_s^T W_s \eta_s$, where $X_s$ is the known $n \times (k + 1)$ matrix of explanatory values in sample $s$, $\eta_s = [\eta_1, \ldots, \eta_n]^T$ is the unknown respondent vector of individual expectations, and $W_s = \mathrm{diag}(w_1, w_2, \ldots, w_n)$ is the known $n \times n$ diagonal matrix of case weights.
In a large sample $s$ the unobserved Horvitz-Thompson (HT) estimator
$$b = (X_s^T W_s X_s)^{-1} X_s^T W_s \eta_s \tag{3.1}$$
is consistent and almost unbiased for $\beta$, its unbiasedness being approximate because $b$ is the product of estimators (Binder, 1983; Nathan, 1988, pp. 255-256; Thompson, 1997, pp. 106-107; Valliant, Dorfman, and Royall, 1999, pp. 40-41; Lohr, 1999, pp. 354-361). Similarly, for the realized (but unobserved) response error $E_s = [E_1, \ldots, E_n]^T$,
$$v = (X_s^T W_s X_s)^{-1} X_s^T W_s E_s \tag{3.2}$$
is consistent and almost unbiased for $\upsilon = (X_p^T X_p)^{-1} X_p^T E_p$, where $E_p = [E_1, \ldots, E_N]^T$ consists of the realized response errors in $p$ which are transformed to $\upsilon$. Also, as seen in (3.5) below, the error transform $v$ in (3.2) delivers respondent error $E_s$ to the manifest regression effects.
Finally, for the realized and observed measurement $Y_s = [Y_1, \ldots, Y_n]^T$,
$$B = (X_s^T W_s X_s)^{-1} X_s^T W_s Y_s \tag{3.3}$$
is consistent and almost unbiased for
$$\theta = (X_p^T X_p)^{-1} X_p^T Y_p, \tag{3.4}$$
where $Y_p = [Y_1, \ldots, Y_N]^T$ consists of the realized measurements in $p$. Formula (3.3) is the estimator of the conventional target (3.4) in design-based regression (Frankel, 1971, pp. 7-25; Lohr, 1999, pp. 359-361; StataCorp., 2001, Volume 4, pp. 29-30; Chaudhuri and Stenger, 2005, pp. 264-265). However, in the present reinterpretation $B$ also, and more profoundly, estimates the new target $\beta$ in (2.1). Thus, given $Y_s$ as a subvector of the fixed vector $Y_p$ of realizations,
$$B = b + v, \tag{3.5}$$
where, over samples, $b$ is almost unbiased for $\beta$ and $v$ is almost unbiased for $\upsilon$, which is negligible in a large population. Therefore, $B$ is almost unbiased for $\beta$ in large-sample surveys.
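The relation between the manifest estimate in (3.3), the latent estimate in (3.1), and the error transform in (3.2) follows from the linearity of weighted least squares in the dependent vector. A minimal sketch for one explanatory variable, with invented sample values and a hypothetical helper `wls`, checks this numerically. In a real survey the expectations and errors are hidden, so only the last of the three fits could actually be computed.

```python
def wls(x, y, w):
    """Weighted least-squares (intercept, slope) for a single predictor."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical sample of n = 5 respondents (all numbers invented).
x_s = [0, 1, 2, 3, 3]
w_s = [120.0, 80.0, 150.0, 100.0, 90.0]          # case weights
eta_s = [0.4, 1.1, 1.7, 2.9, 2.4]                # latent expectations
y_s = [0, 1, 2, 3, 2]                            # observed realizations
e_s = [y - eta for y, eta in zip(y_s, eta_s)]    # latent response errors

b = wls(x_s, eta_s, w_s)   # latent estimate, as in (3.1)
v = wls(x_s, e_s, w_s)     # error transform, as in (3.2)
B = wls(x_s, y_s, w_s)     # manifest estimate, as in (3.3)
```

Adding the intercepts and slopes of `b` and `v` reproduces `B` up to rounding, which is the content of the decomposition of the manifest estimate into a latent estimate plus transformed response error.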

Normality of the Regression Effects
The expectation of (3.4) over realizations of the stochastic $Y_p$ is
$$E(\theta) = (X_p^T X_p)^{-1} X_p^T E(Y_p) = (X_p^T X_p)^{-1} X_p^T \eta_p = \beta.$$
Because $E(\theta_j) = \beta_j$ and $\mathrm{Var}(\theta_j) \to 0$ as $N \to \infty$ for $j = 0, \ldots, k$, the difference $\theta_j - \beta_j = \upsilon_j$ is infinitesimal for a given realization of $Y_p$. Fixing this momentary realization $\{Y_1, \ldots, Y_N\}$, the resulting reals $\theta_0, \ldots, \theta_k$ become the classic target parameters of design-based regression. Therefore, a strict design-based argument using the $\theta_j$ can be given for the normality over samples of each element $B_j$ in $B$. This provides a statistic for testing hypotheses about the target parameter $\beta_j$ against the observed coefficient $B_j$. First, given the realization $\{Y_1, \ldots, Y_N\}$, the coefficient $\theta_j$ ($j = 0, \ldots, k$) can be written as a smooth function of population totals of cross products in $\{Y_i, 1, X_{1i}, \ldots, X_{ki} : i \in p\}$. Then, from the subset $\{Y_i, 1, X_{1i}, \ldots, X_{ki} : i \in s\}$ the estimate $B_j$ can be written as the same function of HT estimators of these population totals. The HT estimators are corresponding sample totals of cross products with each term case weighted by $w_i$. For example, $\sum_{i \in s} w_i X_{1i} Y_i$ is an HT estimator of $\sum_{i \in p} X_{1i} Y_i$ (cf. Lohr, 1999, pp. 352-360; Thompson, 1997, pp. 106-108).
In applying (4.2) it is reassuring to recall that the asymptotic normality of $B_j$ (over samples) does not depend on the distributions of $Y_1, \ldots, Y_N$ (over realizations). Large-sample normality of the estimated intercept and slopes prevails in the presence of any idiosyncratic distributions of survey responses.
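A large-sample test based on this normality can be sketched as follows. The sketch assumes the usual standardized statistic, the observed coefficient minus its hypothesized value divided by its estimated standard error, referred to the standard normal distribution; this is an assumption about the form of the statistic rather than a reproduction of the paper's own display, and the numerical inputs are hypothetical.

```python
from math import erfc, sqrt

def z_test(B_j, beta_j0, se_j):
    """Return (z, two-sided p-value) under large-sample normality of B_j."""
    if se_j <= 0:
        raise ValueError("standard error must be positive")
    z = (B_j - beta_j0) / se_j
    # Two-sided tail area of the standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p = erfc(abs(z) / sqrt(2.0))
    return z, p

# Hypothetical slope of 0.30 with standard error 0.10, tested against zero.
z, p = z_test(0.30, 0.0, 0.10)
```

Nothing in this calculation refers to the distribution of the responses themselves; only the coefficient and its standard error enter.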

Variance Estimation
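Using "linearization" in design-based sampling, an estimate $\widehat{\mathrm{Var}}(b)$ of the covariance matrix of $b$ in (3.1) is given by (5.1), where $w_i$ is the case weight of respondent $i$ and $\eta_i - X_i^T b$ is $i$'s core residual, with $X_i = [1, X_{1i}, \ldots, X_{ki}]^T$. The unobserved matrix $\widehat{\mathrm{Var}}(b)$ in (5.1) is based on a strict application of design-based regression (cf. Lohr, 1999, pp. 359-361; StataCorp., 2001, Volume 4, pp. 29-30). Thus a vector $\eta_s = [\eta_1, \ldots, \eta_n]^T$ is sampled from a population vector $\eta_p = [\eta_1, \ldots, \eta_N]^T$ of constants, and $b$ is computed from $\eta_s$ using (3.1). It is not possible to observe the core residual $\eta_i - X_i^T b$.
The covariance matrix of $v$ in (3.2) can be estimated like $\widehat{\mathrm{Var}}(b)$. Replacing the core residual in (5.1) by the "residual" $E_i - X_i^T v$ gives the estimate $\widehat{\mathrm{Var}}(v)$ in (5.2). Because the $E_i$ are not observed, $\widehat{\mathrm{Var}}(v)$ is also a latent estimate. Simulated response errors $E_i$ demonstrate that the sum of the latent estimates in (5.1) and (5.2) closely approximates the manifest estimate $\widehat{\mathrm{Var}}(B)$ in (5.3), which is computed from the observed residual $Y_i - X_i^T B$. The variance estimator in (5.3) is identical to that in conventional design-based regression (Lohr, 1999, pp. 359-361; StataCorp., 2001, Volume 4, pp. 29-30). Its reinterpretation here is seen by writing $i$'s manifest residual as
$$Y_i - X_i^T B = (\alpha_i + E_i) - X_i^T (u + v), \tag{5.4}$$
with $u = b - \beta$ and $v = B - b$. Finally, rewriting (5.4) as
$$Y_i - X_i^T B = (\eta_i - X_i^T b) + (E_i - X_i^T v) \tag{5.5}$$
shows that this manifest residual also equals $i$'s residual in (5.1) plus her (his) residual in (5.2).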
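For orientation, the linearization estimator referred to above can be sketched in its simplest form. The display below assumes a single stratum, respondents treated as the primary sampling units, and no finite population correction; the symbols $D$, $\hat{G}$, and $d_i$ are introduced here for illustration, and the paper's own displays may differ in detail.

```latex
\widehat{\mathrm{Var}}(B) = D \, \hat{G} \, D,
\qquad
D = (X_s^T W_s X_s)^{-1},
\qquad
\hat{G} = \frac{n}{n-1} \sum_{i \in s} (d_i - \bar{d})(d_i - \bar{d})^T,

d_i = w_i \, (Y_i - X_i^T B) \, X_i,
\qquad
\bar{d} = \frac{1}{n} \sum_{i \in s} d_i .
```

Under the same assumptions, the latent estimates would be obtained by replacing the manifest residual $Y_i - X_i^T B$ in $d_i$ with the core residual $\eta_i - X_i^T b$ or with the error residual $E_i - X_i^T v$.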

The Important Case of the Mean
If the explanatory variables $X_{1i}, \ldots, X_{ki}$ are deleted from (2.3), then $X_p$ becomes the unit vector containing $N$ ones. In this special case the target parameter (2.1) is the population mean expectation
$$\beta_0 = \sum_{i \in p} \eta_i / N.$$
Correspondingly, substituting the unit vector of $n$ ones for $X_s$ in (3.1) gives the latent estimate
$$b_0 = \sum_{i \in s} w_i \eta_i \Big/ \sum_{i \in s} w_i,$$
which is exactly unbiased for $\beta_0$ when the case weights sum to $N$, because its numerator is the HT estimator of the population total $\sum_{i \in p} \eta_i$. Next, substituting this unit vector for $X_s$ in (3.2) and (3.3) gives
$$v_0 = \sum_{i \in s} w_i E_i \Big/ \sum_{i \in s} w_i \quad \text{and} \quad B_0 = \sum_{i \in s} w_i Y_i \Big/ \sum_{i \in s} w_i.$$
The latter formula for $B_0$ is well-known in design-based sampling as the estimate of
$$\theta_0 = \sum_{i \in p} Y_i / N$$
(Lohr, 1999, p. 198; StataCorp., 2001, Volume 4, p. 70). This target $\theta_0$ is found here by substituting the unit vector of $N$ ones for $X_p$ in (3.4). In conventional design-based theory $\theta_0$ is the mean of fixed constants $Y_1, \ldots, Y_N$. Here it is the mean of $N$ realizations of random variables.
With the inclusion of response error in the present reinterpretation, $B_0$ is also seen to estimate $\beta_0$, which is the population mean of expectations $\eta_1, \ldots, \eta_N$. Thus using (3.5), $B_0 = b_0 + v_0$, and taking expectations over samples $s$ of size $n$ and using (3.6) and (3.7), $E_s(B_0) = E_s(b_0) + E_s(v_0) \approx \beta_0$. Therefore, in large samples $B_0$ is almost unbiased for $\beta_0$. Finally, substituting the unit vector for $X_s$ in (5.3) gives an estimate of the variance of $B_0$:
$$\widehat{\mathrm{Var}}(B_0) = \frac{n}{n-1} \sum_{i \in s} (U_i - \bar{U})^2 \Big/ \Big( \sum_{i \in s} w_i \Big)^2,$$
where $U_i = w_i (Y_i - B_0)$ and $\bar{U} = \sum_{i \in s} U_i / n$ (StataCorp., 2001, Volume 4, pp. 29-30, 70).
The formulas in this section show that the mean of a survey variable is the intercept of a distribution-free regression whose slopes are set to zero. In this special case too the intercept $\beta_0$, its latent estimate $b_0$, and its manifest estimate $B_0$ are defined without reference to a superpopulation.
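The special case of the mean can be sketched in a few lines. The variance function below assumes the single-stratum linearization form, with respondents as sampling units and no finite population correction; the data are invented, and the function names are illustrative.

```python
def weighted_mean(y, w):
    """Manifest estimate B_0: case-weighted mean of the observed responses."""
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

def linearized_variance_of_mean(y, w):
    """Assumed single-stratum linearization estimate of Var(B_0)."""
    n = len(y)
    b0 = weighted_mean(y, w)
    u = [wi * (yi - b0) for wi, yi in zip(w, y)]   # weighted residuals U_i
    u_bar = sum(u) / n
    return n / (n - 1) * sum((ui - u_bar) ** 2 for ui in u) / sum(w) ** 2

# Hypothetical responses on a 0..3 scale with invented case weights.
y_s = [0, 1, 2, 3, 2]
w_s = [120.0, 80.0, 150.0, 100.0, 90.0]
B0 = weighted_mean(y_s, w_s)
se_B0 = linearized_variance_of_mean(y_s, w_s) ** 0.5
```

With equal case weights this variance reduces to the familiar sample variance divided by the sample size, which offers a quick check on the assumed form.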

Software for Distribution-Free Regression
The regression coefficients in (3.3), along with their standard errors from (5.3), are easily computed with two STATA commands, svyset (7.1) and svyreg (7.2) (StataCorp., 2001, Volume 4, pp. 18-31). In (7.1) and (7.2) weight is a user-supplied variable containing case weights, $Y$ is the survey response variable, and $X_1, \ldots, X_k$ are the predictors on which the regression is conditioned. This STATA setup returns the regression effects $B_0, B_1, \ldots, B_k$ and their standard errors. As shown by (5.4) and (5.5), the $k + 1$ standard errors delivered by (7.1) and (7.2) reflect the effects of the response errors $E_i$ on the variances of $B_0, B_1, \ldots, B_k$.
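For readers without access to STATA, the point estimates in (3.3) can be sketched in plain Python by solving the weighted normal equations. This is an illustrative stand-in and not a reproduction of the STATA commands or their syntax; it returns coefficients only, without standard errors, and all data values and function names are invented.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def weighted_regression(rows, y, w):
    """Coefficients solving (X'WX) B = X'WY, with an intercept column prepended."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    xtwx = [[sum(wi * xi[j] * xi[k] for wi, xi in zip(w, X)) for k in range(p)]
            for j in range(p)]
    xtwy = [sum(wi * xi[j] * yi for wi, xi, yi in zip(w, X, y)) for j in range(p)]
    return solve(xtwx, xtwy)

# Hypothetical respondents: two predictors coded 0..3, a 0..3 response, invented weights.
rows = [(0, 1), (1, 0), (2, 3), (3, 2), (3, 1), (1, 2)]
y = [0, 1, 2, 3, 2, 1]
w = [120.0, 80.0, 150.0, 100.0, 90.0, 110.0]
B = weighted_regression(rows, y, w)   # [B_0, B_1, B_2]
```

The design choice here is to solve the normal equations directly rather than invert the weighted cross-product matrix, which keeps the sketch short and free of external libraries.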

The quaternary regression model
This section uses survey data that sharply depart from the (usually assumed) continuity, normality, and homoscedasticity of the $Y_i$. The breakdown for our coded survey measure is
$$Y_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki} + \alpha_i + E_i, \tag{8.1}$$
where $Y_i$ is $i$'s observed realization on the integers 0 1 2 3, $X_{1i}, \ldots, X_{ki}$ are $i$'s values on $k$ explanatory variables, $\alpha_i$ is $i$'s residual effect on her (his) expectation $\eta_i$, and $E_i$ is $i$'s response error. The target $\beta$ is estimated by $B$ in (3.3). This manifestly estimates $i$'s regression residual as $Y_i - X_i^T B$, whose latent components are given in (5.4) and (5.5).

The survey items and sample
The response values 0 1 2 3 taken by $Y_i$ in (8.1) code four response options (poor, fair, good, and excellent) to the following item: "Overall, how would you rate President Bush's performance on the job?"
This item is administered monthly by Zogby International, which monitors the perceived performance of the American President. Presidential approval is a closely watched variable that is also tracked by the Gallup Organization, CBS News/New York Times, ABC/Washington Post, NBC News/Wall Street Journal, and the American National Election Studies (Clarke, Stewart, and Rodgers, 2005). This non-normal, discrete, and heteroscedastic variable was regressed on nine predictors also measured in the Zogby poll. Responses to these nine explanatory items in Table 1 are also coded 0 1 2 3. Therefore, the nine regression slopes in Tables 2 and 3 are comparable in magnitude. The overall opinion item in Table 1, along with the eight specific performance items, serves as a mutual control in predicting presidential job approval.
The 1009 respondents to these items were selected by probability sampling and contacted by computer-assisted telephone interviewing (CATI) between February 25 and 27, 2005.This was one month into the second presidential term of George W. Bush, who was reelected in the autumn of 2004.Case weights for the 1009 respondents were obtained from a demographic profile geared to the American population.These weights reflect region, political party, age, race, religion, and gender in order to more accurately represent this population.

Analysis and results
Missing rates for the predictors in Table 1 are 4% or less, except for the environment, foreign policy, and taxes, which have 6%, 12%, and 13% missing responses. Because six of these rates are very low, and in order to preserve sample size, a regression imputation was carried out for each of the nine predictors against the other eight (cf. StataCorp., 2001, Volume 2, pp. 69-73). Using the STATA commands in (7.1) and (7.2), 979 non-missing ratings of George W. Bush's job performance $Y$ were regressed on the imputed predictors $X_1, \ldots, X_9$. The dependent variable $Y$ was not imputed due to its low missing rate of 3%. The nine estimated slopes $B_1, \ldots, B_9$ are exhibited in Table 2, where the predictors are ranked in the order of their effects on perceived job performance. As already noted, these slopes are comparable in magnitude due to the 0 1 2 3 coding of $X_1, \ldots, X_9$.
The $R^2$ of .72 indicates that almost three-quarters of the variance in presidential job approval is explained by these nine predictors in the Zogby poll. Overall favorability toward George W. Bush is the strongest predictor. Controlling for this general opinion, the quaternary regression also shows that jobs and the economy and the Iraq war are the most specifically predictive of overall job performance. (These two issues remain paramount for the American public at the present writing.) The environment, foreign policy, social security and Medicare, and education show an evenly descending gradient in the strength of their regression effects. Foreign policy, surprisingly, is negative in sign, suggesting that the American public looks unfavorably on presidential efforts in this direction. Finally, Table 2 shows that in February 2005 taxes and the war on terrorism were unimportant issues. This held despite the administration's emphasis on the importance of lowering taxes and its asserted link between its wars on terrorism and Iraq.

A binary regression
Departures from continuity, normality, and homoscedasticity for $Y_i$ in (8.1) are now pressed to the most extreme case in which $Y_i$ is dichotomous on the integers 0 1. This alternative dependent variable was generated by recoding the four response options of the Zogby item "Overall, how would you rate President Bush's performance on the job?" into two categories.
The resulting 979 binary measures were also regressed on the nine imputed predictors in Table 1 using the STATA commands in (7.1) and (7.2). The nine regression slopes, exhibited in Table 3, are plotted against their quaternary counterparts in Figure 1. The near perfect linearity (through the origin) of this plot demonstrates that distribution-free regression delivers valid slopes, even in its most extreme case of one step from "negative" to "positive". Conversely, as noted in Section 2.1, the equivalence of these quaternary and binary slopes demonstrates the robustness of assuming three equal steps from "poor" to "fair" to "good" to "excellent". Tables 2 and 3 show that quaternary slopes enjoy larger $t$ statistics and a greater $R^2$ than binary slopes. Evidently quaternary regressions are to be preferred, especially since they also offer an easy choice task to the respondent.

The Reach of Renewed Design-based Regression
The present work replaces a population of constants with a population of random variables. Both of these populations produce an observed sample of numbers, but their generating processes are very different. In conventional design-based sampling, fixed individual states are believed to be selected and observed directly. In the present reinterpretation, stochastic response error generates $N$ individual random variables that are realized in a population. Subsequently, $n$ of these realizations are observed in a sample from this population. The status-quo theory (unrealistically) regards these population and sample realizations as fixed and immutable constants (cf. Lehmann, 1999, pp. 115-116).

Binary responses
Bechtel (2005) introduced the distinction between these two types of populations in binary applications. In this case a population of Bernoulli variates, governed by personal probabilities, replaces a population of fixed 0's and 1's. The sample proportion produced by survey solicitations estimates the population mean of these probabilities, which are individual expectations of responding 0 or 1. In contrast, conventional design-based sampling interprets this same sample proportion as the mean of a population of 0's and 1's that are fixed and observable states.
The application in Section 8.4 extends Bechtel (2005) by analysing personal probabilities $\eta_i$ with equation (8.1). Here $E_i$ is a binary response error causing $\eta_i$ to manifest as $Y_i = 0$ or $Y_i = 1$. Even with this most extreme departure of survey data from continuity, normality, and homoscedasticity, normality (over samples) of regression effects on personal probabilities is justified in Section 4.

Equal-step response scales
The regression of binary variates generalizes immediately to discrete random variables that code ordered responses in public opinion polls. This type of survey item solicits individual choice behavior (cf. Luce, 1959) over a set of options such as poor, fair, good, and excellent in Section 8.2. There $E_i$ is a quaternary response error causing $i$'s expectation $\eta_i$ to manifest as $Y_i$ = 0 1 2 or 3. In the case of three response options, such as disagree, neutral, and agree, $i$'s expectation $\eta_i$ is continuous but her (his) potential realizations $Y_i$ are limited to the integers 0 1 or 2.
Discrete dependent variables may also arise from multiple-item scales in survey questionnaires. For example, the ternary coding 0 1 2 may be used for each of three items that measure a subjective attribute. Summing these three item scores generates a discrete random variable $Y_i$ whose expected value for individual $i$ is $\eta_i = \eta_{i1} + \eta_{i2} + \eta_{i3}$, where $\eta_{ij}$ is $i$'s expectation on item $j$. The true value $\eta_i$ is continuous, whereas $Y_i$ is restricted to the integers 0 1 2 3 4 5 or 6. The assumption in Section 2.1 that $Y_i = \eta_i + E_i$, i.e. that a fixed individual $i$'s observed scale score equals her (his) true score plus a random error, has been used in psychological test theory by Lord and Novick (1968, pp. 27-38). The variance of $Y_i$ (over realizations) on the three-item scale in the present illustration is $\sigma_{i1}^2 + \sigma_{i2}^2 + \sigma_{i3}^2 + 2(\gamma_{i12} + \gamma_{i13} + \gamma_{i23})$, where $\sigma_{ij}^2$ is the variance of $i$'s response to item $j$ and, for example, $\gamma_{i12}$ is the covariance of $i$'s responses to items 1 and 2. These inter-item covariances relax the implausible assumption of "local independence" in item response theory, which requires that $\gamma_{i12} = \gamma_{i13} = \gamma_{i23} = 0$, i.e. that individual $i$'s responses to successive questionnaire items be independent (Embretson and Reise, 2000).
Single and multiple-item scales have been a mainstay of psychological measurement since Sir Francis Galton (1883) first introduced the rating of subjective attributes. Coombs (1964, pp. 211-212) gave various reasons for the ubiquity of rating scales, whose common use dates back to the early 1900s (Thurstone, 1925; Likert, 1932). In the 1970s survey ratings underlay the measurement of life quality. This effort was stimulated by Levy and Guttman (1975), Andrews and Withey (1976), and Clogg's (1979) latent class analysis of the 1975 General Social Survey. In that decade the rating scale was also the vessel for consumer satisfaction (Bechtel, 1977). Subsequently the quality-control revolution, stimulated by the earlier work of W. Edwards Deming (Mann, 1994), led to worldwide preoccupation with satisfaction. In the public and private sectors this concern surfaced as "outcome evaluation", where rated satisfaction is solicited in national surveys and clinical trials.

Continuous scales
In contrast to discrete scales for measuring subjective variables, clinical and economic measures tap physical properties such as blood pressure and wealth. Here too status quo sampling theory unrealistically postulates fixed blood pressures in the population, rather than realizations of individual random variables. The alternative here samples these realizations, which are continuous in mmHg units. Each reading $Y_i$ departs from $i$'s true pressure $\eta_i$ due to response error $E_i$. Instead of being equally spaced these $Y_i$, like their $\eta_i$, are continuous on the scale from 0 to 300 mmHg.

Explaining individual idiosyncrasies
In both its discrete and continuous applications reinterpreted design-based theory represents respondent $i$ as an idiosyncratic probability distribution. Her (his) random variable $Y_i$ differs from its true value $\eta_i$ due to a stochastic response error $E_i$ defined in Section 2.1. A continuous $Y_i$, along with its mean $\eta_i$, can take any value on the response scale. Discrete $Y_i$, however, are restricted to equally spaced response values. In Section 8.2 the values 0 1 2 3 code the well known Zogby scale of poor, fair, good, or excellent presidential performance.
When $Y_1, \ldots, Y_n$ are sampled from a population realization $\{Y_1, \ldots, Y_N\}$, the regression estimate $B$ in (3.3) is asymptotically normal over samples $s$ and almost unbiased for $\beta$ in (2.1). The target $\beta$ partially accounts for response expectations $\eta_1, \ldots, \eta_N$ in a finite population of individuals. The variances of coefficients $B_0, B_1, \ldots, B_k$ in $B$ are estimated by the diagonals of the matrix in (5.3). These same diagonals are used in conventional design-based theory, where $Y_1, \ldots, Y_n$ are (implausibly) regarded as drawn from a population $\{Y_1, \ldots, Y_N\}$ of human constants. Alternatively, the renewed theory here interprets each $Y_i$ as a realization of a random variable (partially) governed by $i$'s personal parameters $\eta_i$ and $\sigma_i^2$ defined in Section 2.1. This interpretation better justifies formulas (3.3) and (5.3), long used to estimate survey regression coefficients and their standard errors. It also strengthens the foundation of design-based theory by realistically representing human populations as finite sets of unique individuals who are subject to idiosyncratic response errors. The errors considered here occur in the absence of a hypothetical superpopulation with particular distribution and covariance structures. These arbitrarily distributed errors lend credibility to the widely-used formulas of design-based regression theory.
Equation (5.4), in which $u = b - \beta$ and $v = B - b$, shows that $i$'s observed residual equals her (his) actual residual in addition to a component containing two estimation errors; namely, the departure of the latent estimate $b$ from $\beta$ and the deviation of the manifest estimate $B$ from $b$.
Thus $\widehat{\mathrm{Var}}(b)$ expands to $\widehat{\mathrm{Var}}(B)$ due to response error $E_i$ and estimation error $v = B - b$. In particular, the $k + 1$ diagonals of $\widehat{\mathrm{Var}}(B)$, which are the manifestly estimated variances of the observed $B_0, B_1, \ldots, B_k$, are larger than the $k + 1$ diagonals of $\widehat{\mathrm{Var}}(b)$. These latter diagonals are the latently estimated variances of the unobserved $b_0, b_1, \ldots, b_k$, which are generated by sampling $\eta_1, \ldots, \eta_n$ from the population $\{\eta_1, \ldots, \eta_N\}$.

Table 1: Predictors of presidential performance

Table 2: Quaternary regression of presidential job performance ($R^2$ = .72)
The values in this table were obtained from the STATA commands svyset and svyreg described in the text. This linear survey regression was carried out on a STATA spreadsheet translated from an SPSS data file supplied by Zogby International. The translation was done with STAT/TRANSFER software obtained from Circle Systems, Inc.

Table 3: Binary regression of presidential job performance ($R^2$ = .62)
The values in this table were obtained from the STATA commands svyset and svyreg described in the text. This linear survey regression was carried out on a STATA spreadsheet translated from an SPSS data file supplied by Zogby International. The translation was done with STAT/TRANSFER software obtained from Circle Systems, Inc.