On the Zero-Inflated Count Models with Application to Modelling Annual Trends in Incidences of Some Occupational Allergic Diseases in France

This paper reviews zero-inflated count models and applies them to modelling annual trends in the incidences of occupational allergic asthma, dermatitis and rhinitis in France. Based on data collected from 2001 to 2009, the study interprets the incidence rate ratios (IRR) as percentage changes in incidence and plots them as functions of the years to obtain trends. The investigation reveals that the trend is decreasing for asthma and rhinitis, and increasing for dermatitis, and that there is a possible positive association between the three diseases.


Introduction
By count data one generally means data obtained by counting the number of occurrences of an event of interest. Examples include the number of medical visits per month for a patient, the number of vehicles produced by a firm per year, and the number of failures of a machine during a given period. It is well known that count data may exhibit over- or under-dispersion and/or contain more zeros than expected. These properties suggest the use of ad hoc models such as the so-called zero-inflated regression models or hurdle regression models, rather than the usual Poisson regression model, which assumes that the mean and the variance of the observations are equal. Zero-inflated models and hurdle models are reviewed, for instance, in Gschlößl and Czado (2008) and Ridout et al. (1998). The reader can also refer to Grumu (1997) and Hall (2000), and references therein. These models, whose history goes back at least to Mullahy (1986), have been used successfully in econometrics, demography, medicine, public health, epidemiology, biology and many other fields. One of their most interesting features is that they fit well data arising from a particular mixture of two populations: one in which only zero counts occur and another in which the counts are realizations of a discrete distribution. An example in public health is a population composed of a group of persons at risk and a group of persons not at risk. Zero-inflated models allow zeros to occur in both groups, while hurdle models allow zeros only in the group of persons not at risk. Both classes of models therefore assume that the data arise from a mixture of two processes: one generating zero counts and the other generating counts from a discrete distribution. Lambert (1992) provides a motivating application of these models and discusses the case of zero-inflated Poisson (ZIP) models. Other papers dealing with these count models include Mullahy (1986), Hall and Berenhaut (2002), Jansakul and Hinde (2001), Gupta and Gupta (2004) and Deng and Paul (2005).
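Zero-inflated and hurdle models can be summarized as follows:
$$P(Y = y) = \omega\,\delta_0(y) + (1 - \omega)\,f(y), \quad y = 0, 1, 2, \ldots \qquad (1.1)$$
where $Y$ is the count variable, $\omega$ is the proportion of excess zeros, $\delta_0(y) = 1$ if $y = 0$ and $\delta_0(y) = 0$ otherwise, and $f(y)$ is the probability mass function of a count distribution.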
One can easily observe that for $f(0) = 0$ and $\omega \neq 0$, (1.1) is a hurdle model, while for $f(0) \neq 0$ and $\omega \neq 0$, it is a zero-inflated model. For $\omega = 0$, one retrieves a classical count distribution such as the Poisson or the binomial. For $\omega > 0$, (1.1) is either a zero-inflated model or a hurdle model. For $\omega < 0$, (1.1) is a zero-deflated model and is no longer considered a mixture model. In the literature, $f(y)$ is a binomial, geometric, Poisson, negative binomial or generalized Poisson distribution.
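As an illustration of the mixture form $P(Y = y) = \omega\,\delta_0(y) + (1 - \omega) f(y)$, the following minimal Python sketch evaluates a zero-inflated Poisson probability function. The function names and the values of `mu` and `omega` are hypothetical and chosen only for illustration; they do not come from the data analysed in this paper.

```python
import math

def poisson_pmf(y, mu):
    """Poisson probability P(Y = y) with mean mu."""
    return math.exp(-mu) * mu ** y / math.factorial(y)

def zero_inflated_pmf(y, omega, base_pmf):
    """Mixture omega * delta_0(y) + (1 - omega) * f(y)."""
    point_mass = 1.0 if y == 0 else 0.0
    return omega * point_mass + (1.0 - omega) * base_pmf(y)

# Hypothetical parameter values, for illustration only.
mu, omega = 1.5, 0.3
f = lambda y: poisson_pmf(y, mu)

p_zero_poisson = f(0)                        # zeros produced by the count process alone
p_zero_zip = zero_inflated_pmf(0, omega, f)  # zeros produced by both sources
total = sum(zero_inflated_pmf(y, omega, f) for y in range(60))

print(p_zero_poisson, p_zero_zip, total)
```

With $\omega > 0$ the probability of a zero exceeds the Poisson value $e^{-\mu}$, while the probabilities still sum to one (up to truncation of the sum).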
Once the proportion of excess zeros is estimated, their number can easily be estimated. This estimate can in turn be interpreted as a lower bound on the number of occurrences of the event of interest that were not counted. Indeed, an excess zero corresponds to an occurrence which, for one reason or another, was not recorded. Therefore, in epidemiology for example, knowing the proportion of excess zeros in data on the incidence of a given disease can help improve the analysis of these data.
In statistics, a trend can be defined as the general direction of the curve describing a relationship between two variables. This notion is very familiar in the modelling of economic and financial time series, where it is known as temporal or time trend. It is, however, also widely studied in genetics (see, e.g., Texier and Sellier, 1986; Zamudio et al., 2002; Bokor et al., 2007; Mourao et al., 2008; Bakir et al., 2009) and epidemiology (see, e.g., Bassetti et al., 2006; Zaghloul et al., 2008; Hothorn et al., 2009; McNamee et al., 2009; Bateman et al., 2010). Estimating a trend can be very useful for prediction. For example, in epidemiology, knowledge of the trend in the incidence of a disease can help prepare the resources needed to contain this disease. McNamee et al. (2009) study temporal trends in some work-related skin and respiratory diseases in the United Kingdom. These authors do not use zero-inflated models. Instead, they use a Poisson model with a gamma random effect to model a data set containing possible extra zeros.
The aim of this paper is to present zero-inflated count models and to apply them to modelling annual trends in the incidences of some occupational allergic diseases in France. Our study is based on the idea developed in McNamee et al. (2009), with an application to the data collected from 2001 to 2009 by the Réseau National de Vigilance et de Prévention des Pathologies Professionnelles (RNV3P).
This paper is organized as follows. In Section 2, we give a survey of zero-inflated models. In Section 3, we apply these models to the study of trends in the incidences of occupational asthma, rhinitis and dermatitis in France.

The Common Count Models
We first present the count models commonly encountered in the literature. The most common one is undoubtedly the Poisson regression model. The Poisson distribution with parameter $\mu > 0$, denoted by $P(\mu)$, is defined by:
$$P(Y = y \mid \mu) = \frac{e^{-\mu}\mu^{y}}{y!}, \quad y = 0, 1, 2, \ldots$$
It is well known that for this distribution the expectation equals the variance, that is, $E(Y \mid \mu) = Var(Y \mid \mu) = \mu$. In Poisson regression the responses $Y_i$ are independent, and each $Y_i \sim P(\mu_i)$, $\mu_i > 0$, $i = 1, 2, \ldots, n$, with mean expressed in terms of some covariates $x_i$ and the unknown regression parameter vector $\beta$, usually through the log link:
$$\mu_i = \exp(x_i^{\top}\beta).$$
An alternative to the Poisson regression model is the negative binomial regression model, which takes into account a possible over-dispersion of the data. The negative binomial distribution with parameters $r > 0$ and $\mu > 0$, denoted by $NB(r, \mu)$, is given by:
$$P(Y = y \mid r, \mu) = \frac{\Gamma(y + r)}{\Gamma(r)\, y!}\left(\frac{r}{r + \mu}\right)^{r}\left(\frac{\mu}{r + \mu}\right)^{y}, \quad y = 0, 1, 2, \ldots$$
For this distribution, one can easily prove that $E(Y \mid r, \mu) = \mu$ and $Var(Y \mid r, \mu) = \mu(1 + \mu/r) = \mu\phi$. This last equality shows that $\phi = 1 + \mu/r$ is the over-dispersion factor. As $r \to \infty$ one retrieves the Poisson distribution with parameter $\mu$. Note that other parametrizations use $r = a^{-1}$ for $a > 0$.
From simple computations, one finds that if is not a positive integer, the negative binomial distribution has a unique mode at [y 0 ] (the integer part of y 0 ), and that if k is an integer, this distribution has two modes at y 0 and y 0 + 1.
In negative binomial regression, the responses $Y_i$ are independent, and each $Y_i \sim NB(r, \mu_i)$, with mean expressed in terms of some covariates $x_i$ and an unknown regression parameter vector $\beta$ as in Poisson regression:
$$\mu_i = \exp(x_i^{\top}\beta).$$
Here, the over-dispersion factor $\phi_i = 1 + \mu_i/r$ depends on $i$.
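The moments of the negative binomial distribution can be checked numerically. The sketch below assumes the usual $NB(r, \mu)$ probability function $\Gamma(y + r)/(\Gamma(r)\,y!)\,(r/(r+\mu))^{r}(\mu/(r+\mu))^{y}$; the helper names and the parameter values are hypothetical.

```python
import math

def neg_binomial_pmf(y, r, mu):
    """NB(r, mu) probability, computed on the log scale for stability."""
    log_p = (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
             + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))
    return math.exp(log_p)

def moments(pmf, upper=2000):
    """Mean and variance of a count distribution by truncated summation."""
    mean = sum(y * pmf(y) for y in range(upper))
    second = sum(y * y * pmf(y) for y in range(upper))
    return mean, second - mean ** 2

r, mu = 2.0, 3.0  # hypothetical values, for illustration only
mean, var = moments(lambda y: neg_binomial_pmf(y, r, mu))
print(mean, var, mu * (1 + mu / r))  # the variance should match mu * (1 + mu / r)

# For a very large r, the probabilities approach those of the Poisson distribution.
poisson_p2 = math.exp(-mu) * mu ** 2 / math.factorial(2)
print(neg_binomial_pmf(2, 1e6, mu), poisson_p2)
```

The second comparison illustrates the limit $r \to \infty$, in which the over-dispersion factor $1 + \mu/r$ tends to one.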
Another alternative to the Poisson regression model is the generalized Poisson regression model. A random variable $Y$ is said to have a generalized Poisson distribution with parameters $\theta$ and $\lambda$, denoted by $GP(\theta, \lambda)$, if:
$$P(Y = y \mid \theta, \lambda) = \frac{\theta(\theta + \lambda y)^{y-1}e^{-\theta - \lambda y}}{y!}, \quad y = 0, 1, 2, \ldots$$
where $\theta > 0$, $\max(-1, -\theta/m) \le \lambda \le 1$ and $m$ ($\ge 4$) is the largest positive integer for which $\theta + \lambda m > 0$ when $\lambda < 0$. It is easy to show that $E(Y \mid \theta, \lambda) = \theta/(1 - \lambda) = \theta\phi$ and $Var(Y \mid \theta, \lambda) = E(Y \mid \theta, \lambda)\,\phi^{2}$, with $\phi = 1/(1 - \lambda)$. One can remark that $\phi^{2}$ represents an over-dispersion factor. For $\lambda = 0$ this distribution reduces to the Poisson distribution $P(\theta)$. For $\lambda > 0$ it is over-dispersed, and for $\lambda < 0$ it is under-dispersed. Here, in contrast to the negative binomial model, the dispersion factor is the same for all observations. Another important remark is that the generalized Poisson distribution $GP(\theta, \lambda)$ is unimodal regardless of the values of $\theta$ and $\lambda$.
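A similar numerical check can be made for the generalized Poisson distribution. The sketch assumes the probability function $\theta(\theta + \lambda y)^{y-1}e^{-\theta - \lambda y}/y!$ and is restricted, for simplicity, to $0 \le \lambda < 1$; the function name and parameter values are hypothetical.

```python
import math

def generalized_poisson_pmf(y, theta, lam):
    """GP(theta, lam) probability; this sketch only handles 0 <= lam < 1."""
    log_p = (math.log(theta) + (y - 1) * math.log(theta + lam * y)
             - theta - lam * y - math.lgamma(y + 1))
    return math.exp(log_p)

theta, lam = 2.0, 0.25  # hypothetical values, for illustration only
support = range(400)    # truncation point; the tail beyond it is negligible here
probs = [generalized_poisson_pmf(y, theta, lam) for y in support]

total = sum(probs)
mean = sum(y * p for y, p in zip(support, probs))
var = sum(y * y * p for y, p in zip(support, probs)) - mean ** 2

phi = 1.0 / (1.0 - lam)
print(total, mean, theta * phi)   # mean should match theta * phi
print(var, mean * phi ** 2)       # variance should match mean * phi^2
```

Setting `lam = 0` in this sketch gives back the Poisson probabilities with mean `theta`.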
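In the generalized Poisson regression model, the responses $Y_i$ are independent, and each $Y_i \sim GP(\theta_i, \lambda)$, with mean $\mu_i = \theta_i/(1 - \lambda) = \exp(x_i^{\top}\beta)$ for covariates $x_i$ and parameter $\beta$. One can also observe that the equality $\mu_i = \theta_i/(1 - \lambda) = \theta_i\phi$ leads to the following new parametrization of the distribution:
$$P(Y = y \mid \mu, \phi) = \frac{\mu\,(\mu + (\phi - 1)y)^{y-1}}{y!}\,\phi^{-y}\exp\!\left(-\frac{\mu + (\phi - 1)y}{\phi}\right), \quad y = 0, 1, 2, \ldots$$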

Inference in Parametric Zero-Inflated Models
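Rewriting (1.1) with $f(y) = f(y \mid \varphi)$ depending on an unknown parameter $\varphi$, one has:
$$P(Y = y \mid \omega, \varphi) = \omega\,\delta_0(y) + (1 - \omega)\,f(y \mid \varphi), \quad y = 0, 1, 2, \ldots \qquad (2.1)$$
From simple computations, one finds that the mean and the variance of this distribution are given by:
$$E(Y) = (1 - \omega)\,\mu_L, \qquad Var(Y) = (1 - \omega)\left(\sigma_L^{2} + \omega\,\mu_L^{2}\right), \qquad (2.2)$$
where $\mu_L$ and $\sigma_L^{2}$ denote the mean and the variance of the distribution $L$ with probability function $f(\cdot \mid \varphi)$. It results from (2.2) that for the zero-inflated Poisson $ZIP(\mu_P)$ model, the mean equals $(1 - \omega)\mu_P$ and the variance equals the mean times $\omega\mu_P + 1$. For the zero-inflated generalized Poisson $ZIGP(\theta, \lambda)$ model, the expectation is $(1 - \omega)\mu_{GP}$ while the variance equals this number times $\omega\mu_{GP} + 1/(1 - \lambda)^{2}$. Finally, for the zero-inflated negative binomial $ZINB(r, \mu)$ model, the mean is $(1 - \omega)\mu_{NB}$ and the variance is this quantity times $\omega\mu_{NB} + 1 + \mu_{NB}/r$. From these results, one can see that the dispersion can result from $\omega$, $r$ or $\lambda$.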
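The moment formulas of a zero-inflated distribution follow from a short computation, sketched here under the assumption that the base distribution $L$ has mean $\mu_L$ and a finite variance, written $\sigma_L^{2}$ (a symbol introduced here for illustration):

```latex
\begin{align*}
E(Y)   &= \omega \cdot 0 + (1-\omega)\,\mu_L = (1-\omega)\,\mu_L, \\
E(Y^2) &= (1-\omega)\,\bigl(\sigma_L^2 + \mu_L^2\bigr), \\
\operatorname{Var}(Y)
       &= E(Y^2) - E(Y)^2
        = (1-\omega)\bigl(\sigma_L^2 + \mu_L^2\bigr) - (1-\omega)^2\mu_L^2 \\
       &= (1-\omega)\,\bigl(\sigma_L^2 + \omega\,\mu_L^2\bigr).
\end{align*}
```

In the Poisson case $\sigma_L^{2} = \mu_P$, so the variance is $(1-\omega)\mu_P(1 + \omega\mu_P)$, which is the mean times $\omega\mu_P + 1$. The negative binomial and generalized Poisson cases are obtained in the same way from $\sigma_L^{2} = \mu_{NB}(1 + \mu_{NB}/r)$ and $\sigma_L^{2} = \mu_{GP}/(1-\lambda)^{2}$.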
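Zero-inflated regression models are generally built as follows. Let $Y_1, \ldots, Y_n$ be independent random variables following one of the above distributions, with expectation $\mu_{L,i}$ and proportion of excess zeros $\omega_i$ depending on the individuals:
$$\omega_i = G(z_i^{\top}\alpha), \qquad \mu_{L,i} = \exp(x_i^{\top}\beta), \qquad (2.3)$$
where $z_i$ and $x_i$ are the covariates, $\alpha$ and $\beta$ the corresponding parameter vectors, and the link function $G$ is either the logistic function or the cumulative distribution function of a standard normal random variable:
$$G(x) = \frac{e^{x}}{1 + e^{x}} \quad \text{or} \quad G(x) = \Phi(x).$$
In many situations, $\omega_i$ and $\mu_{L,i}$ are assumed to be linked by some relation which can considerably reduce the number of parameters in the model. The most common example, with the logistic link as in Lambert (1992), is that where for all $i = 1, \ldots, n$,
$$\omega_i = \frac{1}{1 + \mu_{L,i}^{\gamma}}, \quad \text{that is,} \quad \operatorname{logit}(\omega_i) = -\gamma\, x_i^{\top}\beta,$$
for some real parameter $\gamma$. For positive values of $\gamma$, the zero state becomes less likely as the mean increases, and for negative values, excess zeros become more likely.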
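The likelihood of (1.1), for independent observations $y_1, \ldots, y_n$ with proportions of excess zeros $\omega_i$ and parameters $\varphi_i$, has the form:
$$L = \prod_{i=1}^{n}\left[\omega_i + (1 - \omega_i)\,f(0 \mid \varphi_i)\right]^{\delta_0(y_i)}\left[(1 - \omega_i)\,f(y_i \mid \varphi_i)\right]^{1 - \delta_0(y_i)}. \qquad (2.4)$$
When the $\omega_i$ are expressed in terms of the covariates $z_i$ and the parameter $\alpha$, and when the $\varphi_i = \mu_{L,i}$ are expressed in terms of the covariates $x_i$ and the parameter $\beta$, one obtains another parametrization of the likelihood on the basis of which inference can be done.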
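To make the structure of such a likelihood concrete, here is a minimal Python sketch of the log-likelihood of a ZIP regression. It assumes a logistic link for the proportion of excess zeros and a log link for the Poisson mean; the function names and the toy data are hypothetical and are not the data of this paper.

```python
import math

def logistic(t):
    """Logistic function G(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def zip_log_likelihood(alpha, beta, y, Z, X):
    """Log-likelihood of a ZIP regression with logit and log links."""
    total = 0.0
    for y_i, z_i, x_i in zip(y, Z, X):
        omega_i = logistic(dot(z_i, alpha))  # probability of an excess zero
        mu_i = math.exp(dot(x_i, beta))      # Poisson mean
        if y_i == 0:
            # A zero can come from either component of the mixture.
            total += math.log(omega_i + (1.0 - omega_i) * math.exp(-mu_i))
        else:
            # A positive count can only come from the Poisson component.
            total += (math.log(1.0 - omega_i) - mu_i
                      + y_i * math.log(mu_i) - math.lgamma(y_i + 1))
    return total

# Hypothetical toy data: intercept-only inflation part, one covariate in the count part.
y = [0, 0, 3, 1, 0, 2]
Z = [[1.0]] * 6
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
print(zip_log_likelihood([0.0], [0.2, -0.1], y, Z, X))
```

In practice this function would be passed (with its sign changed) to a numerical optimizer such as a Newton-Raphson routine; the optimizer itself is not shown here.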
Parameter estimation in these models is generally done by the maximum likelihood method, that is, by maximizing (2.4) or its logarithm after plugging in (2.3). This usually requires numerical optimization methods such as Gauss-Newton or Newton-Raphson. A relevant paper is Lambert (1992), where this estimation is considered for the ZIP model, together with the study of its standard errors and confidence intervals. However, parameter estimation by the maximum likelihood method had been discussed in many earlier papers. Fahrmeir and Kaufmann (1985) study the consistency and the asymptotic normality of the maximum likelihood estimator in generalized linear models. Lawless (1987) estimates the parameters of a negative binomial model by a likelihood method and by the approach of Breslow (1984). A more recent paper in this field is Famoye and Singh (2006), which investigates likelihood estimators in zero-inflated generalized Poisson regression models. Many other papers dealing with maximum likelihood estimation in these models can be found in the references given in the papers cited above.
As far as hypothesis testing is concerned, the tests used in zero-inflated models are score-type tests. The main hypotheses tested are the inflation of zeros, the over-dispersion, or both jointly. Such tests are used, for instance, in Mullahy (1986) for testing a general class of count models, and in Lawless (1987) for testing a Poisson model against a negative binomial model. Most of the existing papers are, however, concerned with testing the excess of zeros. Among others, van den Broek (1995) studies a score test for zero inflation in a Poisson distribution; Deng and Paul (2000) present a score test of goodness-of-fit for discrete generalized linear models against zero-inflated models; Hall and Berenhaut (2002) propose a score test for heterogeneity and over-dispersion in zero-inflated and binomial regression models; Famoye and Singh (2006) apply a score test for the excess of zeros in zero-inflated regression models; Gupta and Gupta (2004) apply a score test to zero-inflated generalized Poisson regression models; and Deng and Paul (2000) use a score test for the inflation of zeros, the over-dispersion, and both jointly in zero-inflated generalized linear regression models.

Some Existing Applications
Zero-inflated models have been applied to many real data sets from various sources. In Mullahy (1986) these models are applied to survey micro data on beverage consumption. In Lawless (1987) such a model is fitted to data on ship damage incidents (see McCullagh and Nelder, 1983). In Lambert (1992) zero-inflated models are applied to defects in manufacturing, while in van den Broek (1995) they are applied to data from HIV-infected men (see Hoepelman et al., 1992). Using a hurdle model, Bohara and Krieg (1996) examine migration frequency in the United States of America. In Böhning et al. (1997) zero-inflated count models are used for modelling four data sets from dental epidemiology, traffic accidents, crime sociology and graphic epidemiology, respectively. In Deng and Paul (2000) they are fitted to data concerning patients who experienced frequent premature ventricular contractions. In Famoye and Singh (2006) such a model is fitted to a set of domestic violence data with many zeros. Ridout et al. (1998) illustrate their work with an example from horticultural research, and review a large number of papers treating biological data sets modelled by zero-inflated count models. Another relevant paper is Gschlößl and Czado (2008), where these models are applied to invasive meningococcal disease in Germany. As one can see, the scope of application of these models is large. In the next section we give more applications.

Trends in Genetics and Epidemiology Data
Trends have been studied in many fields of genetics, including cattle and trees. Some relevant works on this subject are Texier and Sellier (1986), who estimate genetic trends for growth and carcass traits in two French pig breeds; Zamudio et al. (2002), who study trends in wood density and radial growth with cambial age in a radiata pine progeny test; Bokor et al. (2007), who investigate trends in the Hungarian racehorse populations; Mourao et al. (2008), who estimate trends of meat quality traits in a male broiler line; and Bakir et al. (2009), who estimate trends in days yield in Holstein Friesian cattle. The statistical tool used in these papers for the study of trends is the classical linear model or its extension to random effects or fixed effects models. The reason is that the response variables and the covariates are real-valued.
Trends in general, and temporal trends in particular, have also been investigated in epidemiology. For instance, Hothorn et al. (2009) present some trend tests for evaluating exposure-response relationships in epidemiological exposure studies. Using a chi-square test, Bassetti et al. (2006) study epidemiological trends in nosocomial candidemia in intensive care. Zaghloul et al. (2008) study temporal trends in patients with bladder cancer who underwent definitive surgery over an extended period of 17 years. The tools used in that study are ANOVA, Student and chi-square tests.
The study in McNamee et al. (2009) is of great interest to us as it is very similar to what we wish to do. In that paper, the authors measure temporal trends in the incidence of some work-related diseases in the United Kingdom from 1996 to 2005 on the basis of count data with possible extra zero counts, collected by three groups of reporters spread all over the country. The authors use a Poisson regression model with a gamma random effect, which is equivalent to using a negative binomial regression. The dependent variable is the number of cases per reporter per month. The main covariates are the months or seasons, and the years, treated either as categorical or as numerical variables. The authors express the effects of the calendar years in the regression as incidence rate ratios (IRR). They then interpret these IRRs as percentage changes in incidence, and plot them as functions of the calendar years to display annual trends. They model trends in the probability of non-response separately. However, we think that it could be very interesting to treat both aspects within one single model, by using zero-inflated models.

The Data and the Methods
As mentioned earlier, one of our main objectives is to model annual trends in the incidences of some occupational skin and respiratory diseases in France from 2001 to 2009. Our work is based on data collected by the RNV3P from the 32 French centres of occupational diseases, named Centre de Consultation de Pathologies Professionnelles (CCPP). The diseases involved are allergic asthma, dermatitis and rhinitis.
The organization and goals of the RNV3P were described in Bonneterre et al. (2008). Briefly, the occupational disease departments of French university hospitals have reported since 2001 all cases of diseases thought to be related to work exposures. Each occupational health report is a structured expert clinical report whose principal coded items are: principal disease and co-morbid diseases (ICD-10), principal nuisance and four other possible nuisances (INRS-CNAM), professional position (ISCO-88, edited by ILO) and sector of professional activity (NAF, edited by INSEE). The plausibility of each association between the principal disease and the nuisances was rated by an expert. The present work included all cases of asthma (J45 to J45.9 ICD-10 codes), allergic rhinitis (J30.0 to J31.0 ICD-10 codes) and contact dermatitis (L23.0 to L23.9 ICD-10 codes) reported between 2001 and 2009 with an at least probable or certain association with one occupational exposure.
For the study of the annual trends in the incidences of these diseases, we follow the approach developed in McNamee et al. (2009). But rather than using a Poisson regression model with random effects, we use the zero-inflated regression models described in the preceding section. The dependent variable is the number of cases per centre per month. The covariates are the months, labelled Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec; the years, considered as categorical variables labelled Year1, Year2, ..., Year9; and the 32 centres, labelled C01, C02, ..., C31, C32. The reference month is August, the reference year is 2004 and the reference centre is C18. We checked that these arbitrary choices do not affect the trends in our data.
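The layout of the covariates can be sketched as follows. The code builds one row per centre, year and month, with indicator coding and the stated reference levels dropped. It assumes that Year1 corresponds to 2001, so that the reference year 2004 is Year4; this correspondence, the helper name `dummy_row` and the `Intercept` column are assumptions made for illustration.

```python
from itertools import product

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
years = [f"Year{k}" for k in range(1, 10)]     # assumed: Year1 = 2001, ..., Year9 = 2009
centres = [f"C{k:02d}" for k in range(1, 33)]  # C01, ..., C32

# Reference levels from the text (Year4 = 2004 under the assumption above).
reference = {"month": "Aug", "year": "Year4", "centre": "C18"}

def dummy_row(month, year, centre):
    """Indicator coding of one centre-month cell, reference levels dropped."""
    row = {"Intercept": 1}
    row.update({m: int(m == month) for m in months if m != reference["month"]})
    row.update({y: int(y == year) for y in years if y != reference["year"]})
    row.update({c: int(c == centre) for c in centres if c != reference["centre"]})
    return row

cells = list(product(centres, years, months))
design = [dummy_row(m, y, c) for c, y, m in cells]

print(len(design), len(design[0]))  # number of observations, number of columns
```

Under these assumptions the grid has 32 × 9 × 12 = 3456 rows, and the count part of the model has 1 + 11 + 8 + 31 = 51 columns.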

Numerical Results
Each data set contains n = 3456 observations that can be assumed to be independent. Examining these data, one sees that they comprise a large number of zeros: 2162 for asthma, 2156 for dermatitis and 2698 for rhinitis. See also the histograms of Figures 1-3. Given this number of zeros, it is natural to ask whether a proportion of them are extra zeros. Next, one finds that for asthma, the mean is 0.835, the variance is 2.55 and the maximal value is 22. For dermatitis, the mean is 0.977, the variance is 3.735 and the maximal value is 21. Finally, for rhinitis, the mean is 0.349, the variance is 0.730 and the maximal value is 12. One can see that the data are overdispersed, as the variances are larger than the means.
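A crude way to see why these figures point to zero inflation is to compare the observed number of zeros with the number expected under a single Poisson distribution having the same mean. The sketch below uses only the summary statistics quoted above; it ignores all covariates, so it is a rough marginal check and not part of the fitted models.

```python
import math

n = 3456  # observations per disease (32 centres x 9 years x 12 months)

# Summary statistics quoted in the text.
summary = {
    "asthma":     {"mean": 0.835, "variance": 2.55,  "zeros": 2162},
    "dermatitis": {"mean": 0.977, "variance": 3.735, "zeros": 2156},
    "rhinitis":   {"mean": 0.349, "variance": 0.730, "zeros": 2698},
}

results = {}
for disease, s in summary.items():
    dispersion_ratio = s["variance"] / s["mean"]   # equals 1 for a Poisson distribution
    expected_zeros = n * math.exp(-s["mean"])      # Poisson zeros with the same mean
    results[disease] = (dispersion_ratio, expected_zeros, s["zeros"])
    print(f"{disease}: variance/mean = {dispersion_ratio:.2f}, "
          f"expected zeros = {expected_zeros:.0f}, observed zeros = {s['zeros']}")
```

For each of the three diseases the variance-to-mean ratio exceeds one and the observed number of zeros exceeds the marginal Poisson expectation.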
These features of our data suggest the use of zero-inflated models. Although we presented three classes of these count models, we only used ZIP and ZINB regression models. The main reason is that only these models are available in the SAS software that we use. We also used the R software for the trend tests and for plotting the graphics.
The link function $G$ used in the zero-inflated part was the logistic function. The covariates in this part of the model were the months and the centres, while the main part included the years in addition to these. We made this choice because including the years in the zero-inflated part gives a non-linear function of the years, whose effects, considered as IRRs, are difficult to compute. In this situation, studying the trends in the data in the spirit of McNamee et al. (2009), as we wish to do, is not easy. Although a relation linking $\omega_i$ to $\mu_{L,i}$ provides more parsimonious models, it also leads to a non-linear function of the years and can induce the difficulty mentioned above. Moreover, the option of using such a relation is not available in SAS. For these reasons, we do not use it.
We first fitted a ZINB regression model to each of the three data sets. For asthma and dermatitis, the estimated dispersion parameter was very large, meaning that the overdispersion observed in the data likely comes from the excess of zeros rather than from heterogeneity among observations. Since, in addition, the p-value of the associated Student test was significant, we decided to model these two data sets by ZIP regression models, and the rhinitis data by a ZINB regression model. For each data set, the likelihood, the AIC (Akaike Information Criterion) and the BIC (Bayesian Information Criterion) of the corresponding model (the one with months, years and centres in the main part and months and centres in the zero-inflated part) compared favourably with those of many other competing regression models. The competing models included models without the zero-inflated part and the centres, models without the zero-inflated part, the months and the centres, models without the zero-inflated part, the years and the centres, and naive models such as the Poisson and negative binomial models without any covariate.
To check the suitability of the models fitted to each data set, we plotted the residual series and their histograms. These series are obtained as the differences between the observations and the values predicted by the zero-inflated models. Figure 4 shows that for the three diseases, more than 85% of the residuals lie within [−1, 1]. This indicates that the zero-inflated regression models used predict the three data sets well. For the estimation of the parameters of our models, we used the Newton-Raphson algorithm. This algorithm converged for each data set, yielding reasonable standard deviations for the parameter estimators. Table 3 presents the estimates of the parameters of the zero-inflated models for the three diseases. To save space, we do not present the associated standard deviations. It can be seen from this table that, for the three diseases, the magnitudes of the estimates associated with C06 and C10 differ markedly from those of the other centre covariates. The same remark can be made for the estimates associated with Inf.C04, Inf.C05, Inf.C07, Inf.C09-Inf.C11, Inf.C22 and Inf.C25. We could not find any explanation for this phenomenon.
The lower plots in Figures 1-3 are those of the trends. On these plots, the IRR on the y-axis, obtained from the coefficients of the years in the main part of the zero-inflated model, is multiplied by 100. It can be seen from the figures that the trend in asthma and rhinitis is decreasing with calendar time, while it is nearly constant but slightly increasing in dermatitis. Kendall's τ and the associated test, used for trend detection, provide further evidence to support these conclusions. Indeed, for asthma and rhinitis respectively, we obtained τ = −0.7222222 and τ = −0.6666667, showing a negative association between the IRRs and the years, a result confirmed by the p-values 0.005886 and 0.01267. For dermatitis, τ = 0.3333333 shows a weak positive association between the IRRs and the years, but with a p-value of 0.2595 the null hypothesis of no association between the IRRs and the years cannot be rejected. That is, for dermatitis, there is no evidence that the IRRs change over the years.
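The reported p-values can be recovered from the τ values alone. The sketch below assumes nine yearly IRRs without ties and an exact two-sided Kendall test, in which the number of discordant pairs is distributed, under the null hypothesis, as the number of inversions of a random permutation. The function names are hypothetical.

```python
from math import comb, factorial

def inversion_counts(n):
    """Number of permutations of n items having k inversions, for each k."""
    counts = [1]
    for m in range(2, n + 1):
        new = [0] * (len(counts) + m - 1)
        for k, c in enumerate(counts):
            # Inserting the m-th item adds between 0 and m - 1 inversions.
            for shift in range(m):
                new[k + shift] += c
        counts = new
    return counts

def kendall_exact_p_value(tau, n):
    """Two-sided exact p-value for Kendall's tau with n points and no ties."""
    pairs = comb(n, 2)
    discordant = round((1 - tau) * pairs / 2)
    tail = min(discordant, pairs - discordant)
    counts = inversion_counts(n)
    return min(1.0, 2 * sum(counts[: tail + 1]) / factorial(n))

for disease, tau in [("asthma", -0.7222222), ("rhinitis", -0.6666667),
                     ("dermatitis", 0.3333333)]:
    print(disease, round(kendall_exact_p_value(tau, 9), 6))
```

Under these assumptions the computed values agree with the p-values 0.005886, 0.01267 and 0.2595 quoted above to the printed precision.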
We also computed the estimated proportions of excess zeros. To save space, we only present the results for rhinitis. The model used was a ZINB regression model including months, years and centres in the main part, and months and centres in the zero-inflated part. The results are gathered in Table 1, from which it can be seen that some of these proportions are very small. In other words, the probability of an extra zero count in some centres in some months of the year is almost nil for rhinitis. But for many other centres, such as C25, C26, C27 and C30, the probability of an extra zero in January, February and March is substantial.
We also studied the case where the proportions of excess zeros were functions of the months only. That is, we considered models whose zero-inflated part does not include the centres. The results given in Table 2 show that the probability of an excess zero in France for asthma is small for all months other than August, remaining far below the August value of 0.305. These probabilities are generally higher for dermatitis, with 0.31 in August and 0.214 in December. The same observation can be made for rhinitis, with a value of 0.257 in March. It is interesting to note that the values in Tables 1 and 2 can be used to correct the incidence estimates. For example, concerning the incidence of occupational allergic asthma, one estimates from Table 2 that in France during August, if η zeros are observed among the 32 centres in a given year, then about 0.305 × η of these zeros are in excess. In other words, at least 0.305 × η cases of occupational allergic asthma are missing during that August. The same reasoning can be used to find a lower bound for the missing cases of the other diseases in a given month.
The Kendall and Spearman tests applied to pairs of the three data sets show that there is a positive association between them. Indeed, the Kendall and Spearman coefficients vary between 0.3 and 0.5, and the tests reject the null hypothesis that these coefficients are zero.

Conclusion
We have reviewed zero-inflated count models, a class of models widely applied to count data in various fields. Using the approach of McNamee et al. (2009), we have applied these models to modelling trends in occupational allergic asthma, dermatitis and rhinitis in France on the basis of data collected from 2001 to 2009. Our study shows that the trends are decreasing for asthma and rhinitis and almost constant for dermatitis. We checked that these trends did not change whether or not the centres were used as covariates, and that they do not depend on the reference levels chosen. We also estimated the probabilities of obtaining excess zeros. Although in our study they seemed to depend on the reference levels chosen, they can help correct the incidence estimates of the diseases studied.
The Kendall and Spearman tests applied to pairs of the three data sets suggest that there is a possible positive association between asthma, dermatitis and rhinitis. This result suggests studying these three diseases jointly. However, such a study is beyond the scope of this paper, and will be the subject of a forthcoming paper.

Figure 1: Histogram and year trend for asthma

Table 1: Rhinitis: probability of having an excess of zero for a center at a

Table 2: Probability of having an excess of zero for a disease at a given month

Table 3: Parameter estimates in the ZINB regression model for the three diseases
