Generalized Poisson-Poisson Mixture Model for Misreported Counts with an Application to Smoking Data

When count data are modeled, the response variable, which is the count, is usually assumed to be correctly reported. In practice, some counts may be over- or under-reported. We derive the generalized Poisson-Poisson mixture regression (GPPMR) model, which can handle accurately reported, underreported and overreported counts. The parameters of the model are estimated by the maximum likelihood method. We apply the GPPMR model to a real-life data set.


Introduction
Many real-world applications involve count data, and many regression models have been used to model such data. Some of these models have been applied to data on the number of bottles of port wine purchased (Ramos, 1999); absenteeism in the workplace (Barmby et al., 1991); underreporting of needlestick injuries by medical students (Watermann, Jankowski and Madan, 1994); and the frequency of criminal victimization (Li, Trivedi and Guo, 2003), to mention only a few. A good compilation on regression analysis of count data is given by Cameron and Trivedi (1998). In most of these cases, the counts could have been overreported, underreported or correctly reported. When the counts are correctly reported, an appropriate count data regression model such as the negative binomial, Poisson or generalized Poisson model can be applied. In real life, however, misreporting is possible, and count data should be checked for it.

Winkelmann (1996) proposed a Poisson regression model that takes underreporting into account. This model is a mixture of the Poisson and the binomial distributions. The number of events, which arise only if absenteeism occurs, is assumed to be Poisson distributed with mean $\mu_i$, and each individual event is reported with probability $\pi_i$, which is captured by the binomial distribution. The number of reported events, $y_i$, then follows the Poisson regression model for underreported counts,
\[ P(Y_i = y_i) = \frac{e^{-\pi_i \mu_i} (\pi_i \mu_i)^{y_i}}{y_i!}, \qquad y_i = 0, 1, 2, \ldots, \tag{1.1} \]
with mean $E(Y_i) = \pi_i \mu_i$.

The negative binomial regression model that takes underreporting into account (Mukhopadhyay, 1997) was derived as a mixture of the negative binomial and the binomial distributions. The resulting regression model for underreported counts, given by Mukhopadhyay (1997), is referred to as model (1.2). In this model, $\alpha$ is the dispersion parameter and $0 < \pi_i < 1$ is the probability of underreporting an event, conditional on some covariates $z_i = (z_{i1}, z_{i2}, \ldots, z_{im})$. The probability $\pi_i$ is modeled through the logit link function. The mean and variance of this model are given by Mukhopadhyay (1997), who applied the model to a data set from the National Longitudinal Survey for Youth for the year 1980, with the response variable being the number of times one had been convicted of some illegal activity.
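The Poisson-binomial mixture behind (1.1) can be checked numerically. The sketch below is an illustration only: it assumes that the true count is Poisson with mean `mu` and that each event is reported independently with probability `pi`, then sums over the unobserved true count and compares the result with a Poisson probability with mean `pi * mu`. The function names and the parameter values are hypothetical.

```python
import math

def poisson_pmf(k, mean):
    # Poisson probability, computed on the log scale for numerical stability.
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))

def binomial_pmf(k, n, p):
    # Probability that k of n true events are reported.
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def reported_pmf(y, mu, pi, max_true=150):
    # Marginal probability of y reported events: sum over the true count n >= y.
    # max_true truncates the infinite sum (assumed large enough for small mu).
    return sum(poisson_pmf(n, mu) * binomial_pmf(y, n, pi)
               for n in range(y, max_true + 1))

mu, pi = 4.0, 0.6  # hypothetical values
max_gap = max(abs(reported_pmf(y, mu, pi) - poisson_pmf(y, pi * mu))
              for y in range(15))
print(max_gap)  # should be at the level of floating-point rounding
```

Under these assumptions the two probability functions agree up to rounding error, which is why the reported count can be modeled directly as Poisson with mean $\pi_i \mu_i$.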
The generalized Poisson regression (GPR) model (Famoye, 1993) is given by
\[ P(Y_i = y_i) = \left( \frac{\mu_i}{1 + \alpha \mu_i} \right)^{y_i} \frac{(1 + \alpha y_i)^{y_i - 1}}{y_i!} \exp\left( -\frac{\mu_i (1 + \alpha y_i)}{1 + \alpha \mu_i} \right) \tag{1.3} \]
for $y_i = 0, 1, 2, \ldots$, where the mean $\mu_i$ is specified through the log link function. When $\alpha = 0$, the GPR model becomes the Poisson regression model. When $\alpha > 0$, the GPR model can be used for overdispersed data, and when $\alpha < 0$, it can be used for underdispersed data. The generalized Poisson regression model for underreported counts (GPRUC model) was derived by Pararai, Famoye and Lee (2006) and was applied to data on the number of sexual partners.

The models mentioned so far are appropriate if the counts are underreported. Li, Trivedi and Guo (2003) suggested a mixture of the Poisson and negative binomial regression models that can handle data that are under-, over- and accurately reported. In the regression model by Li et al. (2003), misreporting occurs when individual $i$ reports the number of events as $y_i$, which may differ from the true count $y_i^*$, $i = 1, 2, \ldots, n$. The negative binomial regression model takes care of the accurate counts, while the Poisson regression model takes care of the underreported and overreported counts. The means of the accurate, overreported and underreported counts are given respectively by $\lambda_i = \exp(x_{ij} \gamma_j)$, $\mu_i = \exp(z_{ij} \delta_j)$ and $\psi_i = y_i^* \exp(x_{ij} \beta_j)$, where $x_{ij}$ and $z_{ij}$ represent the covariates on which these means depend. The regression model for accurately reported, overreported and underreported counts derived by Li et al. (2003) is referred to as model (1.4); it is the NBPMR model used for comparison later in this paper.

Li et al. (2003) applied the regression model in (1.4) to school crime victimization data drawn from the National Crime Victimization Survey for the year 1995. The response variable was the number of items stolen from one's locker in school.
In many cases the negative binomial regression model and the generalized Poisson regression model are competitors when fitting count data. It is therefore reasonable to derive a generalized Poisson regression model that accommodates misreported counts in the same way as its negative binomial counterpart by Li, Trivedi and Guo (2003).
The remainder of the paper is organized as follows. Section 2 describes the National Pregnancy and Health Survey (NPHS) data. Section 3 outlines how the GPPMR model is developed. The parameters of the model are estimated by the maximum likelihood method, as explained in Section 4. Some goodness-of-fit tests are given in Section 5. The GPPMR model is applied to the NPHS data in Section 6, where the results are also discussed. Concluding remarks are given in Section 7.

National Pregnancy and Health Survey Data
The data were collected from the National Pregnancy and Health Survey: Drug Use Among Women Delivering Live Births, 1992. The data can be accessed from http://webapp.icpsr.umich.edu/cocoon/icpsr-study/02835.xml. One of the objectives of the study was to describe the use of illegal drugs by expectant mothers. The data on substance use were collected through a questionnaire that was administered to women during pregnancy. One of the variables measured was the number of cigarettes a woman smoked each day during the first trimester of pregnancy. To demonstrate the GPPMR model, the NPHS data are considered with this variable as the response. The explanatory variables and the response variable used in modeling the data are described in Table 1, and the descriptive statistics are shown in Table 2. The mean of the number of cigarettes smoked in Table 2 is less than its variance, showing that the data are overdispersed. The variables chosen to illustrate the GPPMR model pertain to the source of income of the respondent.

Generalized Poisson-Poisson Mixture Model
The generalized Poisson-Poisson mixture regression (GPPMR) model accommodating over-, under- and accurately reported counts is a mixture of the generalized Poisson regression model in (1.3) and the Poisson regression model. The justification for mixing the Poisson and generalized Poisson models is that we want two data generating processes that result in count data. The Poisson model provided the most reasonable choice after trying other models such as the negative binomial and generalized Poisson. Also, in the simulations that were carried out, convergence was much quicker when mixing the generalized Poisson and Poisson models. The assumptions used in deriving the GPPMR model are the same as those used by Li et al. (2003) in deriving the NBPMR model in (1.4).

Let $y_i^*$ denote the total number of true events for individual $i$, where $i = 1, 2, \ldots, n$. Assume that $y_i^*$, conditional on covariates $x_i = (x_{i1}, x_{i2}, \ldots, x_{ik})$, follows the generalized Poisson distribution with probability function
\[ P(y_i^* \mid x_i) = \left( \frac{\lambda_i}{1 + \alpha \lambda_i} \right)^{y_i^*} \frac{(1 + \alpha y_i^*)^{y_i^* - 1}}{y_i^*!} \exp\left( -\frac{\lambda_i (1 + \alpha y_i^*)}{1 + \alpha \lambda_i} \right), \qquad y_i^* = 0, 1, 2, \ldots, \]
where the mean function is $\lambda_i = \exp(x_i \gamma)$ and $\gamma$ is a $k$-dimensional vector of unknown regression coefficients. The variance of this regression model is $\lambda_i (1 + \alpha \lambda_i)^2$.

The counts are reported incorrectly when an individual reports the number of events as $y_i$, different from $y_i^*$, $i = 1, 2, \ldots, n$. Assume that when $y_i^* = 0$, the observed count $y_i$ is Poisson distributed with mean and variance $\mu_i = \exp(z_i \delta)$, denoted by $P(\mu_i)$, where $z_i = (z_{i1}, \ldots, z_{ip})$ are some explanatory variables and $\delta$ is a vector of unknown parameters. This is a situation in which overreporting may occur, since an individual reports a value $y_i$ while the true value is $y_i^* = 0$. Furthermore, conditional on $y_i^* > 0$, $y_i$ is Poisson distributed with mean and variance given by $y_i^* \exp(z_i \beta)$ (Li, Trivedi and Guo, 2003). This distribution is denoted by $P(y_i^* \xi_i)$, where $\xi_i = \exp(z_i \beta)$ depends on the covariates $z_i = (z_{i1}, \ldots, z_{ip})$ and $\beta$ is a vector of unknown parameters. This is a situation in which potential underreporting of events occurs. The covariates used in modeling the accurate portion of the regression model may be the same as those used in modeling the over- and underreported portions.

These assumptions from Li et al. (2003) can be summarized as:
(1) Overreporting occurs for $y_i^* = 0$, with $y_i \mid y_i^* = 0 \sim P(\mu_i)$ and $\mu_i = \exp(z_i \delta)$.
(2) Underreporting occurs for $y_i^* > 0$, with $y_i \mid y_i^* > 0 \sim P(y_i^* \xi_i)$ and $\xi_i = \exp(z_i \beta)$.

The probability distribution of the reported count $y_i$ is obtained as the marginal distribution of $y_i$ from the joint distribution of the generalized Poisson and the Poisson regression models. The resulting model for the reported counts is referred to as (3.3), and its mean and variance are referred to as (3.4).
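The marginalization described above can be sketched numerically. The code below is an illustration under stated assumptions, not the paper's implementation: it takes the generalized Poisson probability function in the mean-parameterized form usually attributed to Famoye (1993), restricts attention to $\alpha \ge 0$, and computes $P(y_i) = P(y_i^* = 0)\,\mathrm{Pois}(y_i; \mu_i) + \sum_{t \ge 1} P(y_i^* = t)\,\mathrm{Pois}(y_i; t \xi_i)$ with a truncated sum. All function names, the truncation point and the parameter values are hypothetical.

```python
import math

def gp_pmf(k, lam, alpha):
    # Generalized Poisson probability with mean lam and dispersion alpha >= 0
    # (assumed Famoye-type parameterization); alpha = 0 gives the Poisson case.
    log_p = (k * math.log(lam / (1 + alpha * lam))
             + (k - 1) * math.log(1 + alpha * k)
             - math.lgamma(k + 1)
             - lam * (1 + alpha * k) / (1 + alpha * lam))
    return math.exp(log_p)

def poisson_pmf(k, mean):
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))

def gppmr_pmf(y, lam, alpha, mu, xi, max_true=80):
    # True count 0: the reported count is Poisson(mu) (overreporting part).
    p = gp_pmf(0, lam, alpha) * poisson_pmf(y, mu)
    # True count t > 0: the reported count is Poisson(t * xi).
    for t in range(1, max_true + 1):
        p += gp_pmf(t, lam, alpha) * poisson_pmf(y, t * xi)
    return p

lam, alpha, mu, xi = 2.0, 0.1, 1.5, 0.8  # hypothetical values
probs = [gppmr_pmf(y, lam, alpha, mu, xi) for y in range(121)]
total = sum(probs)
mean_y = sum(y * p for y, p in enumerate(probs))
# Mean implied by the two assumptions: mu * P(y* = 0) + xi * E(y*).
implied_mean = mu * gp_pmf(0, lam, alpha) + xi * lam
print(total, mean_y, implied_mean)
```

With these values the probabilities sum to one up to truncation error, and the numerically computed mean matches the mean implied by the overreporting and underreporting assumptions. In a regression setting, `lam`, `mu` and `xi` would be replaced by $\exp(x_i \gamma)$, $\exp(z_i \delta)$ and $\exp(z_i \beta)$.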

Estimation of Model Parameters
The log-likelihood function of the GPPMR model in (3.3) is $\log L(\alpha, \beta, \delta, \gamma) = \sum_{i=1}^{n} \log P(y_i \mid x_i, z_i)$, where $P(y_i \mid x_i, z_i)$ denotes the probability function in (3.3). To estimate the parameters $\alpha$, $\beta$, $\delta$ and $\gamma$, the Statistical Analysis Software (SAS, 1999) was used. The NLPNRA subroutine, which performs nonlinear optimization based on the Newton-Raphson method, was used to obtain the estimates. The variance-covariance matrix of the estimated parameters was obtained from the NLPFDD subroutine in SAS. This subroutine approximates derivatives by finite differences and computes the gradient vector and the Hessian matrix $H$, all evaluated at the estimates $\hat{\alpha}$, $\hat{\beta}$, $\hat{\delta}$ and $\hat{\gamma}$.
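The estimation strategy described here, Newton-Raphson steps with derivatives approximated by finite differences, can be illustrated on a much simpler problem. The sketch below is not the SAS code used in the paper and does not fit the GPPMR model: it maximizes an intercept-only Poisson log-likelihood in $\theta = \log(\text{mean})$, for which the maximum likelihood estimate is known in closed form. The data, the function names and the step size `h` are hypothetical.

```python
import math

def poisson_loglik(theta, counts):
    # Log-likelihood of an intercept-only Poisson model with mean exp(theta).
    mean = math.exp(theta)
    return sum(y * theta - mean - math.lgamma(y + 1) for y in counts)

def fd_derivatives(loglik, theta, h=1e-4):
    # Central finite-difference approximations to the gradient and Hessian.
    f_plus, f_mid, f_minus = loglik(theta + h), loglik(theta), loglik(theta - h)
    grad = (f_plus - f_minus) / (2 * h)
    hess = (f_plus - 2 * f_mid + f_minus) / h ** 2
    return grad, hess

def newton_raphson(loglik, theta, tol=1e-8, max_iter=50):
    for _ in range(max_iter):
        grad, hess = fd_derivatives(loglik, theta)
        step = grad / hess
        theta -= step  # Newton-Raphson update
        if abs(step) < tol:
            break
    # Standard error from the inverse of the negative Hessian at the estimate.
    _, hess = fd_derivatives(loglik, theta)
    return theta, math.sqrt(-1.0 / hess)

counts = [0, 2, 1, 3, 0, 1, 4, 2]  # hypothetical data
theta_hat, se_hat = newton_raphson(lambda t: poisson_loglik(t, counts), 0.0)
print(theta_hat, se_hat)
```

For this toy model the estimate should agree with $\log(\bar{y})$ and the standard error with $1/\sqrt{\sum_i y_i}$. For the GPPMR model the same idea applies to the full parameter vector, with a gradient vector and Hessian matrix in place of the scalar derivatives.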

Goodness-of-fit Tests
The GPPMR model in (3.3) reduces to the Poisson-Poisson mixture regression model when the dispersion parameter $\alpha = 0$. To assess the appropriateness of the GPPMR model over the Poisson-Poisson mixture regression model, one can test the hypothesis $H_0: \alpha = 0$ against $H_a: \alpha \neq 0$. To carry out the test, one fits the GPPMR model and uses the asymptotic Wald t-test. The statistic to be computed is $t = \hat{\alpha} / \mathrm{se}(\hat{\alpha})$, where $\hat{\alpha}$ is the maximum likelihood estimate of $\alpha$ and $\mathrm{se}(\hat{\alpha})$ is its corresponding standard error. This statistic is compared to the t distribution with $n - \nu - 1$ degrees of freedom, where $\nu$ is the total number of parameters in the GPPMR model.
The GPPMR and NBPMR models for underreported, overreported and accurate counts are non-nested. To discriminate between the two non-nested models, the Vuong (1989) test is used. The hypothesis to be tested is $H_0$: the GPPMR and NBPMR models are equivalent, against the two alternatives $H_f$: the GPPMR model is better than the NBPMR model, or $H_g$: the NBPMR model is better than the GPPMR model, where $H_f$ and $H_g$ are the two competing alternative hypotheses for the GPPMR and NBPMR models respectively.
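The Vuong comparison can be sketched as follows. This is a minimal illustration that assumes the basic, uncorrected form of the statistic, $\sqrt{n}\,\bar{m}/s_m$, where $m_i$ is the difference between the two models' log-likelihood contributions for observation $i$; the paper does not state whether a correction for the number of parameters was applied. The function names and the per-observation log-likelihood values are hypothetical.

```python
import math

def vuong_statistic(loglik_f, loglik_g):
    # loglik_f, loglik_g: per-observation log-likelihoods under models f and g.
    n = len(loglik_f)
    m = [f - g for f, g in zip(loglik_f, loglik_g)]
    mean_m = sum(m) / n
    var_m = sum((mi - mean_m) ** 2 for mi in m) / n
    return math.sqrt(n) * mean_m / math.sqrt(var_m)

def vuong_decision(stat, critical=1.96):
    # Two-sided comparison with a standard normal critical value.
    if stat > critical:
        return "prefer f"
    if stat < -critical:
        return "prefer g"
    return "no preference"

# Hypothetical per-observation log-likelihoods for four observations.
ll_f = [-1.0, -2.0, -1.5, -0.5]
ll_g = [-1.2, -1.8, -1.6, -0.9]
stat = vuong_statistic(ll_f, ll_g)
print(stat, vuong_decision(stat))
```

A large positive value favors model f, a large negative value favors model g, and a value inside the critical band means the hypothesis that the two models are equivalent cannot be rejected.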

Results and Discussion
The independent variables described in Table 1 are taken as the covariates that affect $\lambda_i$, the mean of the accurately reported counts. The covariates marital status, Hispanic, Black, wages and salaries, food stamps, and unemployment income are used to model $\mu_i$ and $y_i^* \exp(z_i \beta)$, which represent the means of the overreported and underreported counts respectively. The proportion of zeros in the response variable was 76.42%. The results obtained from fitting the GPPMR and NBPMR models are shown in Table 3. The asymptotic Wald t-test of the null hypothesis $\alpha = 0$, reported in Table 3, shows that the dispersion parameter $\alpha$ is different from zero. Based on this result, the Poisson-Poisson mixture regression model is not appropriate and cannot be used to describe these data. A comparison between the fitted NBPMR and GPPMR models is made using the Vuong (1989) test. The calculated Vuong statistic is $T^* = -0.0363$. Since $|T^*| = 0.0363 < Z_{0.025} = 1.96$, the null hypothesis that the GPPMR and NBPMR models are equivalent cannot be rejected. This result shows that the NBPMR and GPPMR models are similar in their performance. In Table 3 the log-likelihood values for the NBPMR and GPPMR models are -3116.535 and -3128.56 respectively, showing no significant difference in the performance of the two models.

Results on accurate reporting
The probability of accurately reporting the number of cigarettes smoked in a day by a woman during the first 3 months of pregnancy is given by $P(y_i^* \mid x_i, \gamma)$. The results from the GPPMR model in Table 3 suggest that this probability is greater among women who lived with a smoker, tried to get help to quit smoking, smoked in the last 3 months of pregnancy, received food stamps, were unmarried, were non-Hispanic, were non-Black, had no college education, or did not have wages and salaries as a source of income. This probability does not seem to be affected by whether a woman receives social security income, public assistance, housing assistance or unemployment income.

Results on overreporting and underreporting
The probability of overreporting of events is given by $P(y_i \mid y_i^* = 0, z_i, \delta)$. The GPPMR model shows that the only covariate that affects the probability of overreporting the number of cigarettes smoked by a pregnant woman in the first 3 months of pregnancy is wages and salaries. The probability of overreporting an event is negatively related to wages and salaries. Women who did not have wages and salaries as a source of income tend to overreport the number of cigarettes smoked in a day during the first 3 months of pregnancy.
The probability of underreporting the number of cigarettes smoked is positively related to marital status and to wages and salaries. Married women who smoked during the first 3 months of pregnancy tend to underreport the number of cigarettes smoked in a day compared to their unmarried counterparts.

Concluding Remarks
In this paper we presented and examined a mixture of the Poisson and generalized Poisson regression models, namely the generalized Poisson-Poisson mixture regression model. This was an attempt to develop a model for counts that could potentially be misreported. The model is capable of capturing all three potential forms of reporting, namely accurate, under- and over-reporting. Methods of estimating the parameters other than maximum likelihood could also be explored. The issue of variable selection could be explored further, in particular how to choose the variables that affect the accurate, under- and over-reported portions of the model. Other data sets could possibly yield results in which the GPPMR model outperforms the NBPMR model.

Table 1: Description of variables for the NPHS data

Table 2: Descriptive statistics for the NPHS data

Table 3: Estimates for NBPMR and GPPMR models