Panel Regression of Arbitrarily Distributed Responses

The primary advantage of panel over cross-sectional regression stems from its control for the effects of omitted variables, or "unobserved heterogeneity". However, panel regression is based on the strong assumptions that measurement errors are independently and identically distributed (i.i.d.) and normal. These assumptions are evaded by design-based regression, which dispenses with measurement errors altogether by regarding the response as a fixed real number. The present paper establishes a middle ground between these extreme interpretations of longitudinal data. The individual is now represented as a panel of responses containing dependently non-identically distributed (d.n.d.) measurement errors. Modeling the expectations of these responses preserves the Neyman randomization theory, rendering panel regression slopes approximately unbiased and normal in the presence of arbitrarily distributed measurement error. The generality of this reinterpretation is illustrated with German Socio-Economic Panel (GSOEP) responses that are discretely distributed on a 3-point scale.


Random Individual-Wave Variables
Design-based sampling postulates the respondent in a fixed observable state that s(he) reports as a discrete rating, such as 0, 1, 2, or recalls on a continuous monetary scale. Thus, the value recorded on an opinion poll or economic survey is regarded as a real number in waiting. More realistically, however, the survey response may be viewed as a random variable containing measurement error (cf. Diggle, Liang, and Zeger, 1994). The present paper favors this more plausible interpretation and extends Bechtel's (2007) treatment of cross-sectional regressions to longitudinal regressions involving repeated measurements. These measurements make up a panel of random variables, which may be dependently non-identically distributed (d.n.d.) within each respondent. A finite population of these panels then gives rise to a finite population of realized random individual-wave variables. Each realization is a momentary numerical value governed by an individual-wave-specific mean and variance. This approach retains and enhances design-based regression, whose slopes are still normally distributed (over samples) for any stochastic distributions (over realizations) that prevail for individual-wave-specific responses.
Section 2 describes an unbalanced longitudinal population along with a single-stage sample of panels. Section 3 regresses response expectations over this population, defining new target parameters as functions of these expectations. Sections 4, 5, and 6 show that these new parameters are estimated by the well-known design-based coefficients. Section 7 describes a user-friendly computation of these coefficients with STATA software. Section 8 uses this software to evaluate predictors of environmental concern in the German Socio-Economic Panel. The final section summarizes distribution-free panel regression and reemphasizes its applicability to arbitrarily distributed survey responses.

The Population and Sample of Panels
The term panel is used here to denote an intra-individual sequence of wave measures $Y_{it}$. This sequence is illustrated by a single row in Table 1, where $t = 1$ for individual $i$'s first appearance. A population of panels is a finite set of panels exemplified by the seven rows in Table 1. This population is "unbalanced" because different individuals make different numbers of wave appearances. An unbalanced population of panels is also a series of incomplete censuses, such as the four columns in Table 1. The boldface rows in Table 1 exhibit an unbalanced sample of three panels drawn without replacement from our population of seven panels. Because every wave appearance in each sampled panel is measured, Table 1 illustrates single-stage cluster sampling (Lohr, 1999, pp. 136-145), which is called single-stage panel sampling in the sequel. In this example a sample of eight individual-wave measures is drawn from a population of eighteen individual-wave values by single-stage panel sampling.
In the sequel $Y_{it}$ in Table 1 plays three roles: a realizable random variable, its realized value (a real number), and an observed (i.e. sampled) realized value. A panel can be viewed, therefore, as a cluster of random variables or as a cluster of fixed realizations. The following sections emphasize the importance of stochastic measurement error in these distinctions.

Stochastic measurement error
The present paper uses survey data that depart sharply from the (usually assumed) continuity, normality, and homoscedasticity of the panel response variable (cf. Baltagi, 2001; Hsiao, 2003). Here $Y_{it} = 0, 1, 2$ denotes a rating of environmental concern by German panelist $i$ on wave $t$. The response options and coding for the GSOEP's three-point scale are:

Not concerned at all (0)
Somewhat concerned (1)
Very concerned (2)

This score is a discrete random variable that may be decomposed as

$$Y_{it} = H_{it} + E_{it} = \alpha^*_i + \beta^*_1 X_{1it} + \cdots + \beta^*_k X_{kit} + \gamma^*_{it} + E_{it}, \qquad (3.1)$$

where $E_{it} = Y_{it} - H_{it}$ is a measurement error, and the mean $H_{it}$ and variance $\sigma^2_{it}$ determine an idiosyncratic, wave-specific probability distribution on the scale 0, 1, 2. The mean $H_{it}$ is continuous in the interval [0, 2] and is composed of an individual intercept, individual-wave-specific predictors, and an effect that is unique to individual $i$ on wave $t$. This latter effect $\gamma^*_{it}$ saturates the linear model for the $H_{it}$ in (3.1), i.e. the structure fits the $H_{it}$ exactly without constraining these expectations.
The $E_{it}$ in (3.1) may be dependently non-identically distributed (d.n.d.) over waves within individuals. Fixing individual $i$ and wave $t$, the random variable $E_{it}$ can be displayed as follows:

Value of $E_{it}$:   $0 - H_{it}$   $1 - H_{it}$   $2 - H_{it}$
Probability:         $p_{0it}$      $p_{1it}$      $p_{2it}$

The (unknown) response probabilities $p_{0it}$, $p_{1it}$, and $p_{2it}$ for not concerned at all, somewhat concerned, and very concerned are arbitrarily distributed over the points $0 - H_{it}$, $1 - H_{it}$, and $2 - H_{it}$. The standard deviation $\sigma_{it}$ on this 3-point error scale denotes uncertainty in $i$'s rating of environmental concern. A small $\sigma_{it}$ represents a precisely reporting individual with a narrow error distribution. A broad error distribution has a large $\sigma_{it}$, characterizing an individual with less consistent ratings over repeated realizations of the random variable $Y_{it}$.
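The relation between the response probabilities and the pair of parameters for one individual-wave can be sketched in a few lines of Python. This is only an illustration: the function name and the two probability vectors are hypothetical and are not GSOEP values.

```python
import math

def rating_moments(p0, p1, p2):
    """Mean H and standard deviation sigma of a rating on the scale 0, 1, 2."""
    probs = (p0, p1, p2)
    if min(probs) < 0 or abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must be non-negative and sum to 1")
    # H_it is the expectation of Y_it over the three scale points.
    mean = sum(y * p for y, p in enumerate(probs))
    # Var(E_it) is the variance of the error points y - H_it.
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, math.sqrt(var)

# Two hypothetical respondents with the same mean but different precision.
precise = rating_moments(0.05, 0.90, 0.05)   # narrow error distribution
diffuse = rating_moments(0.40, 0.20, 0.40)   # broad error distribution
print(precise, diffuse)
```

Both hypothetical respondents have the same expectation, yet the second reports far less consistently, which is the distinction that the standard deviation is meant to capture.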

New target parameters for design-based regression
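In equation (3.1) the intercept $\alpha^*_i$, the slopes $\beta^*_1, \ldots, \beta^*_k$, and the effects $\gamma^*_{it}$ will be uniquely identified by the ordinary-least-squares (OLS) condition, under which the slope vector $\beta = (\beta_1, \ldots, \beta_k)^T$ is given by the following function of these expectations:

$$\beta = \Big[\sum_{it} (X_{it} - X_{i\cdot})^T (X_{it} - X_{i\cdot})\Big]^{-1} \sum_{it} (X_{it} - X_{i\cdot})^T (H_{it} - H_{i\cdot}), \qquad (3.2)$$

with sums over $i = 1, \ldots, N$ and $t = 1, \ldots, T_i$. In (3.2) $X_{it} = (X_{1it}, \ldots, X_{kit})$, and $X_{i\cdot}$ and $H_{i\cdot}$ are the means of $X_{it}$ and $H_{it}$ within panel $i$ (StataCorp. 2001, p. 437; Hsiao 2003, pp. 30-33). Thus $\beta$ in (3.2) is expressed in terms of the deviations of response expectations and predictors from their panel means. Equation (3.2) selects the unique parameterization $\alpha_i$, $\beta_1, \ldots, \beta_k$, $\gamma_{it}$ of (3.1). This defines the new target parameters of design-based regression as $\beta_1, \ldots, \beta_k$.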
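For a single predictor ($k = 1$), the target slope described above reduces to a ratio of two sums of panel deviations. The following Python sketch assumes this within-panel form; the function name and the expectations in the example are hypothetical, chosen so that every panel has slope 0.5 and its own intercept.

```python
def within_slope(panels):
    """Slope for k = 1 from deviations of x and h around their panel means.

    panels maps a panel id to a list of (x, h) pairs, one pair per wave.
    """
    sxx = 0.0
    sxh = 0.0
    for waves in panels.values():
        x_bar = sum(x for x, _ in waves) / len(waves)
        h_bar = sum(h for _, h in waves) / len(waves)
        for x, h in waves:
            sxx += (x - x_bar) ** 2
            sxh += (x - x_bar) * (h - h_bar)
    return sxh / sxx

# Hypothetical expectations H_it = alpha_i + 0.5 * X_it, alpha_1 = 0.2, alpha_2 = 0.5.
panels = {
    1: [(0.0, 0.2), (1.0, 0.7), (2.0, 1.2)],
    2: [(1.0, 1.0), (2.0, 1.5)],
}
beta = within_slope(panels)
print(beta)
```

Because each panel is centered on its own means, the differing intercepts drop out and only the common slope remains.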

Single-stage panel sampling
Our clustered population of individuals, each containing $T_i$ survey waves for $i = 1, \ldots, N$, is anchored by the expectations $H_{it}$. In Table 1, for example, there are $i = 1, \ldots, 7$ panels and $\sum_i T_i = 18$ individual-waves. Now let the random variable $Y_{it}$ be realized for every individual-wave in the population of panels. This population realization occurs in a hypothetical (but possible) series of incomplete censuses. A single-stage cluster sample of $n$ panels is then drawn without replacement from this population of $N$ panels. The sample consists of $\sum_i T_i$ ratings $Y_{it}$ for $i = 1, \ldots, n$. In Table 1, $\sum_i T_i = 8$ individual-waves are drawn from the $i = 1, 2, 3$ sampled panels. This setup reinterprets conventional design-based sampling, which treats the $Y_{it}$ as constants rather than realizations of random variables.
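The sampling scheme just described can be mimicked in Python as follows. This is a sketch under simplifying assumptions: panels are drawn with equal probability, and the population below is hypothetical (seven panels and eighteen individual-waves, but not the entries of Table 1).

```python
import random

def sample_panels(population, n, seed=0):
    """Draw n whole panels without replacement; every wave of a drawn panel is kept."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(population), n)
    return {i: population[i] for i in chosen}

# Hypothetical realized ratings: panel id -> ratings over that panel's waves.
population = {
    1: [1, 2, 1],
    2: [0, 1],
    3: [2, 2, 1],
    4: [1, 1, 0, 2],
    5: [2],
    6: [0, 1, 1],
    7: [1, 2],
}
sample = sample_panels(population, n=3)
print(sample)
```

The panel is the only sampling unit here, so the number of sampled individual-waves is random even though the number of sampled panels is fixed.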

Longitudinal weights
The sample inclusion probability for a panel is the cross-sectional inclusion probability of its initial wave multiplied by the retention probabilities for its subsequent waves. These retention probabilities are "the conditional probabilities of remaining in the panel" over these remaining waves (Haisken-DeNew and Frick 2005, p. 171). For example, the sample inclusion probability $\pi_3$ for individual 3 in Table 1 is her (his) cross-sectional inclusion probability in wave 2 multiplied by her (his) retention probabilities for waves 3 and 4. The sample inclusion probability $\pi_5$ for individual 5, however, is simply her (his) cross-sectional inclusion probability in wave 2. The final longitudinal weights for individuals 3 and 5 are the reciprocals of their inclusion probabilities, i.e. $w_3 = 1/\pi_3$ and $w_5 = 1/\pi_5$.
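The weight construction above amounts to one product and one reciprocal. A minimal Python sketch follows; the function name and all probabilities are hypothetical and are not taken from the GSOEP.

```python
def longitudinal_weight(initial_inclusion_prob, retention_probs=()):
    """Reciprocal of the panel inclusion probability pi_i."""
    pi = initial_inclusion_prob
    for r in retention_probs:
        pi *= r  # multiply in the retention probability of each later wave
    if not 0.0 < pi <= 1.0:
        raise ValueError("inclusion probability must lie in (0, 1]")
    return 1.0 / pi

# Hypothetical values: individual 3 stays for two further waves, individual 5 for none.
w3 = longitudinal_weight(0.001, [0.9, 0.8])
w5 = longitudinal_weight(0.001)
print(w3, w5)
```

A panel observed over more waves has a smaller inclusion probability and hence a larger weight, all else being equal.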
In the German Socio-Economic Panel each respondent is assigned a cross-sectional weight and a longitudinal weight for each wave. The cross-sectional weight for panel $i$'s first participating year is multiplied by the longitudinal weights for her (his) subsequent participating years. Each longitudinal weight is the reciprocal of $i$'s "staying" probability for that subsequent year, i.e. the conditional probability that s(he) participates in that wave and in the previous waves of her (his) panel (Haisken-DeNew and Frick 2005, p. 180). The product of panel $i$'s initial cross-sectional weight and subsequent longitudinal weights produces $i$'s final longitudinal weight $w_i$. This weight $w_i$ covers the sequence of years individual $i$ is monitored within the time span 1999-2005.

(Lohr, 1999, p. 360). Each sum of products contains panel deviation scores

The new stationary target
The important result here is that the conventional formula (5.1) also estimates the more profound and anchored target parameter (3.2), which is a function of constant expectations $H_{it}$ rather than momentary realizations $Y_{it}$. To obtain this result we take the expected value of (5.2) over realizations of the stochastic ratings $Y_{it}$, which gives $E(\theta_j) = \beta_j$. Because $E(\theta_j) = \beta_j$ and $\mathrm{Var}(\theta_j) \to 0$ as the number of panels $N \to \infty$, the differences $\theta_j - \beta_j$ for $j = 1, \ldots, k$ are infinitesimal for a given large population realization. Thus $B$ in (5.1), which is almost unbiased for $\theta$ in (5.2), is almost unbiased for $\beta$ in (3.2) as well.
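The expectation step can be sketched as follows. The sketch assumes that (5.2) takes the usual within-panel form, with deviation scores $x_{it} = (X_{it} - X_{i\cdot})^T$ and $y_{it} = Y_{it} - Y_{i\cdot}$; it uses only the fixed predictors, $E(Y_{it}) = H_{it}$, and linearity of expectation, so no independence among the $Y_{it}$ is required.

```latex
\begin{align*}
\theta &= \Big(\sum_{it} x_{it} x_{it}^T\Big)^{-1} \sum_{it} x_{it}\, y_{it}, \\
E(y_{it}) &= E(Y_{it}) - \frac{1}{T_i}\sum_{s=1}^{T_i} E(Y_{is}) = H_{it} - H_{i\cdot}, \\
E(\theta) &= \Big(\sum_{it} x_{it} x_{it}^T\Big)^{-1} \sum_{it} x_{it}\,(H_{it} - H_{i\cdot}) = \beta .
\end{align*}
```

The last equality holds because the target $\beta$ is the same function of the deviations of expectations and predictors from their panel means.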

Normality and Variances of the Estimated Coefficients
Fixing the momentary population realizations $Y_{it}$ for $i = 1, \ldots, N$ and $t = 1, \ldots, T_i$, the resulting reals $\theta_1, \ldots, \theta_k$ in (5.2) become the classic target parameters of design-based regression. Therefore, a strict design-based argument using the $\theta_j$ can be given for the normality over large samples of each element $B_j$ in $B$. This provides a statistic for testing hypotheses about the new target parameter $\beta_j$ against the conventional estimate $B_j$.
First, given the population realizations $Y_{it}$, the coefficient $\theta_j$ for $j = 1, \ldots, k$ can be written as a smooth function of cross-product totals in the population $\{y_{it}, x_{1it}, \ldots, x_{kit} : i = 1, \ldots, N;\ t = 1, \ldots, T_i\}$ of deviation scores. Then, from the sample $\{y_{it}, x_{1it}, \ldots, x_{kit} : i = 1, \ldots, n;\ t = 1, \ldots, T_i\}$, the estimate $B_j$ can be written as the same function of HT estimators of these population totals. The HT estimators are the corresponding sample totals of cross products, with each term weighted by $w_i$. For example, the sample total $\sum_{it} w_i x_{1it} y_{it}$ for $i = 1, \ldots, n$; $t = 1, \ldots, T_i$ is an HT estimator of the population total $\sum_{it} x_{1it} y_{it}$ for $i = 1, \ldots, N$; $t = 1, \ldots, T_i$ (cf. Lohr, 1999, pp. 352-360; Thompson, 1997, pp. 106-108).
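For one predictor, the estimate is simply the ratio of two weighted sample totals of deviation cross products. The Python sketch below illustrates this under the assumption $k = 1$; the function name, weights, and ratings are hypothetical.

```python
def weighted_within_slope(sample):
    """B for k = 1 as a ratio of weighted totals of panel deviation cross products.

    sample is a list of (w_i, [(x, y), ...]) pairs, one pair per sampled panel.
    """
    sxx = 0.0
    sxy = 0.0
    for w, waves in sample:
        x_bar = sum(x for x, _ in waves) / len(waves)
        y_bar = sum(y for _, y in waves) / len(waves)
        for x, y in waves:
            sxx += w * (x - x_bar) ** 2           # weighted total of x * x
            sxy += w * (x - x_bar) * (y - y_bar)  # weighted total of x * y
    return sxy / sxx

# Hypothetical sample of three panels with ratings on the 0, 1, 2 scale.
sample = [
    (1200.0, [(0, 0), (1, 1), (2, 1)]),
    (800.0, [(1, 1), (2, 2)]),
    (1500.0, [(0, 1), (2, 2), (1, 2)]),
]
B = weighted_within_slope(sample)
print(B)
```

Multiplying every weight by the same constant leaves the estimate unchanged, because the constant cancels in the ratio of the two totals.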
The asymptotic normality of HT estimators (Sen, 1988, pp. 313-328) may now be used to justify the asymptotic normality of $B_j$, which is a nonlinear function of these estimators. A "linearization" of the error $B_j - \theta_j$ is provided by the first-order approximation $B_j - \theta_j \approx L_j$, where $L_j$ is the linear term in a Taylor series expansion of this error. Asymptotic multivariate normality of the HT estimators then implies that $(B_j - \theta_j)/\sqrt{\mathrm{Var}(L_j)}$ is asymptotically $N(0, 1)$ (Lehmann, 1999, pp. 253-269, 309-315; Lohr, 1999, pp. 290-293, 310, 352-360; Thompson, 1997, pp. 58-64, 106-111). The estimate $\widehat{\mathrm{Var}}(L_j)$ of $\mathrm{Var}(L_j)$ is given by the $j$-th diagonal element of the matrix $\widehat{\mathrm{Var}}(B)$ in (6.2) and is computed by software described in Section 7. Due to the infinitesimal difference between $\theta_j$ and $\beta_j$, the statistic

$$\frac{B_j - \beta_{j0}}{\sqrt{\widehat{\mathrm{Var}}(L_j)}} \qquad (6.1)$$

may be used to test a hypothesis $H: \beta_j = \beta_{j0}$ about our target coefficient $\beta_j$. This test for $\beta_{j0} = 0$ is illustrated for the regression coefficients in Table 2 below. Finally, again using "linearization", an estimate $\widehat{\mathrm{Var}}(B)$ of the entire covariance matrix of $B$ is given by (6.2), where $i = 1, \ldots, n$ and $t = 1, \ldots, T_i$ (Lohr, 1999, pp. 359-361; StataCorp., 2001, Volume 4, pp. 29-30). As described in Section 4.2, the longitudinal weight $w_i$ in (6.2) is fixed over the $T_i$ waves in individual $i$'s panel. The $j$-th diagonal of $\widehat{\mathrm{Var}}(B)$ is the estimated variance of $B_j$ in the denominator of (6.1). Again note that the covariance estimator in (6.2) is expressed in panel deviation scores.
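Given a coefficient and its estimated standard error, the large-sample test described above is a standard normal comparison. The Python sketch below assumes a two-sided alternative; the function name and the numbers are hypothetical and are not entries of Table 2.

```python
import math

def z_test(estimate, std_error, null_value=0.0):
    """Large-sample z statistic and two-sided p-value from the N(0, 1) reference."""
    z = (estimate - null_value) / std_error
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Hypothetical coefficient 0.12 with standard error 0.04, tested against zero.
z, p = z_test(0.12, 0.04)
print(z, p)
```

The standard error passed in would be the square root of the relevant diagonal element of the estimated covariance matrix of the coefficients.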

Software
The estimated regression coefficients, along with their standard errors and test statistics, are easily calculated with the STATA commands:

    svyset pweight longitudinal_weight          (7.1)
    svyset psu panel                            (7.2)
    svyreg devY devX1 ... devXk, noconstant     (7.3)

(StataCorp. 2001, Volume 4, pp. 18-31). In (7.1) longitudinal_weight is the variable containing the longitudinal weights. In (7.2) panel is the variable containing the panel identifications. The definitions of the deviation variables in (7.3) are:

    devY = $Y_{it} - Y_{i\cdot}$,   devXj = $X_{jit} - X_{ji\cdot}$ for $j = 1, \ldots, k$,

where $Y_{i\cdot}$ and $X_{ji\cdot}$ are the means of the response and the $j$-th predictor within panel $i$. The option noconstant in (7.3) suppresses the intercept because the response variable and its predictors are deviations from their panel means.
For large samples these three STATA commands return (approximately) normal and unbiased estimates $B_1, \ldots, B_k$ in the presence of any distributions of the measurement errors $E_{it}$ in (3.1). The standard errors of the estimated coefficients reflect the effects of these measurement errors on coefficient variance.
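The deviation variables that enter the regression must be built before it is run. One way to construct them from long-format records is sketched below in Python rather than STATA; the record layout and field names (panel, y, x, devY, devX) are hypothetical, and a single predictor is assumed.

```python
from collections import defaultdict

def panel_deviations(rows):
    """Add devY and devX (deviations from panel means) to long-format rows."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # panel -> [sum of y, sum of x, waves]
    for r in rows:
        s = sums[r["panel"]]
        s[0] += r["y"]
        s[1] += r["x"]
        s[2] += 1
    out = []
    for r in rows:
        sum_y, sum_x, n_waves = sums[r["panel"]]
        out.append({**r,
                    "devY": r["y"] - sum_y / n_waves,
                    "devX": r["x"] - sum_x / n_waves})
    return out

rows = [
    {"panel": 1, "y": 0, "x": 1},
    {"panel": 1, "y": 2, "x": 2},
    {"panel": 2, "y": 1, "x": 0},  # a single-wave panel
]
dev = panel_deviations(rows)
print(dev)
```

A panel with a single wave has deviation scores of zero, so it adds nothing to the cross-product totals that determine the slopes.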

The GSOEP for 1999-2005
Because Germany has been at the forefront of environmental protection, the present investigation of environmental concern relies upon the well-established German Socio-Economic Panel. The first wave of the GSOEP was carried out in 1984 in the Federal Republic of Germany. The panelists studied here are residents of the former Federal Republic living in private households whose head is not Turkish, Greek, Yugoslavian, Spanish, or Italian. These respondents are known as the "west German sample" of the GSOEP (Haisken-DeNew and Frick 2005, p. 19).
The GSOEP interviews are conducted face-to-face with all persons in a household aged 16 and over. Our west German sample consists of 6634 respondents measured within the seven years of the present study, i.e. 1999-2005. Further details on the English Language Public Use File of the GSOEP, including instructions for obtaining the data, have been given by Wagner, Burkhauser, and Behringer (1993).
The survey firm Infratest Burke Sozialforschung in Munich carries out the fieldwork for the GSOEP. In addition to demographic information, the GSOEP questionnaire contains "objective" measures such as income and unemployment, as well as "subjective" ratings of satisfactions, worries, and fears of the German population.

The GSOEP items for rating worry
The GSOEP contains ratings of concern, or worry, about various living conditions in Germany, Europe, and the world. These items are prefaced with the question: What is your attitude toward the following areas - are you concerned about them?
The areas of concern studied here are: Environmental protection; General economic development; Your health; Maintaining peace; Crime in Germany; Hostility toward foreigners or minorities in Germany. The response scale and coding for these items were described in Section 3.1 as: Not concerned at all (0); Somewhat concerned (1); Very concerned (2).

The weighted panel regression
Using the STATA commands (7.1), (7.2), and (7.3), equation (3.1) is estimated as $\hat{Y}_{it} = A_i + B_1 X_{1it} + \cdots + B_k X_{kit}$. Five significant predictors of environmental concern, along with age, are exhibited in Table 2. This west German regression was run over 34269 individual-wave measures generated by 6634 panels from 1999 to 2005. These panels ranged from one to seven waves, with an average of 5.2 waves. The estimates $A_1, \ldots, A_{6634}$ of the individual effects are not included in this report.
The five concern coefficients in Table 2 are commensurate because they share the three-point rating scale in Section 6.2. The strongest predictor of environmental concern is worry about maintaining peace, followed by worries about your health and crime in Germany. The negative age coefficient reveals that younger Germans have greater environmental concern.
The predictors in Table 2, except for your health, suggest that German environmental concern has an altruistic societal, rather than selfish individualistic, orientation. This is supported by the finding that potential regressors, such as personal dwelling satisfaction and concern with your own economic situation, failed to reach statistical significance in predicting environmental concern.

Summary
An unbalanced longitudinal population of real numbers is reinterpreted as a set of momentary realizations of random variables $Y_{it}$, each governed by the parameters $H_{it}$ and $\sigma^2_{it}$ for individual $i$ on wave $t$. (See Table 1.) This reinterpretation better justifies the usual design-based regression estimates and their standard errors. It opens up panel regression to design-based theory, response weighting, and arbitrary stochastic responding without reference to an abstract superpopulation (cf. Skinner, Holt, and Smith 1989).
The primary advantage of panel over cross-sectional regression lies in the possibility of bringing variable intercepts $\alpha^*_i$ into the model. These individual effects, which reside in the error term of a cross-sectional model, bias regression coefficients if they are related to both the response and its predictors. This potential bias is removed by (3.1), which contains $\alpha^*_i$ as an estimable effect. However, this individual effect, and the non-estimable individual-wave effect $\gamma^*_{it}$ in (3.1), are not needed in defining our target $\beta$ in (3.2) and its estimate $B$ in (5.1).
The present panel extension of Bechtel (2007) differs from model-based sampling, where the finite population of realizations is itself a sample from a "superpopulation" with assumed distribution and covariance properties (cf. Binder 1983; Nathan 1988; Skinner, Holt, and Smith 1989; Thompson 1997; Valliant, Dorfman, and Royall 1999; Binder and Roberts 2003). Here this "superpopulation" is simply a finite set of arbitrarily distributed wave variables that are clustered by individuals. These random variables are realized as responses to a hypothetical (but possible) sequence of incomplete censuses. The targets of inference are population regression coefficients that are functions of the expectations of individual-wave realizations. This longitudinal population, and its limited target parameters, establish a plausible bridge between design- and model-based regression theory.
Finally, the estimate $B$ in (5.1) of the target $\beta$ in (3.2) is asymptotically normal and almost unbiased (over samples) whatever the distribution (over realizations) of $Y_{it}$ in the panel population. Thus, the reinterpretation of $Y_{it}$ as a stochastic response rather than a fixed real number is a step forward in the Neyman paradigm (Bellhouse, 1988). By allowing this response to be arbitrarily stochastic, formulas (3.2) and (5.1) also strip away the distribution assumptions thought to be necessary for panel regression (cf. Baltagi, 2001; Hsiao, 2003).

Table 1: An unbalanced longitudinal population of panels

$X_{kit}$ are fixed individual-wave-specific predictors, $\gamma^*_{it}$ is a fixed individual-wave effect on $H_{it}$, and $E_{it} = Y_{it} - H_{it}$ is a measurement error for individual $i$ on wave $t$, with $E(E_{it}) = 0$ and $\mathrm{Var}(E_{it}) = \sigma^2_{it}$. In (3.1) our unit of interest, individual $i$ on wave $t$, is represented by a pair of parameters; namely, a mean $H_{it}$ and variance $\sigma^2_{it}$
, unbiased estimates of the matrix $\sum_{it} x_{it} x_{it}^T$ and the vector $\sum_{it} x_{it} y_{it}$ are given by $\sum_{it} w_i x_{it} x_{it}^T$ and $\sum_{it} w_i x_{it} y_{it}$ for $i = 1, \ldots, n$; $t = 1, \ldots, T_i$. The weight $w_i$ is individual $i$'s final longitudinal weight described in Section 4.2. When the sample size $n$ is large, the Horvitz-Thompson (HT) estimator