Exponentiated Weibull-geometric distribution and its application to count data

An exponentiated Weibull-geometric distribution is defined and studied. A new count data regression model, based on the exponentiated Weibull-geometric distribution, is also defined. The regression model can be applied to fit under-dispersed or over-dispersed count data. The exponentiated Weibull-geometric regression model is fitted to two numerical data sets. The new model provided a better fit than its competitors.


Introduction
Many techniques for generating families of discrete distributions have been developed in the literature. See, for example, the books by Balakrishnan and Nevzorov (2003), Johnson et al. (2005), Consul and Famoye (2006), and the references therein. These discrete distributions have been found useful in many different areas of life. Frome et al. (1973) considered the Poisson distribution in the context of non-linear regression analysis for count data where the sample mean and sample variance are about equal. When the sample mean and sample variance are about equal, we have an equi-dispersion situation. When the sample mean is smaller (or greater) than the sample variance, we have an over-dispersion (or under-dispersion) situation.
Many researchers have obtained discrete distributions by discretizing continuous distributions. Nekoukhou and Bidram (2015) gave a long list of these works. Another method to generalize an existing distribution is to add parameters to the distribution to form an exponentiated family. By exponentiating the cumulative distribution function of the discrete Weibull distribution (Nakagawa and Osaki, 1975), Nekoukhou and Bidram (2015) defined the exponentiated discrete Weibull distribution. Mahmoudi and Shiran (2012) defined an exponentiated Weibull-geometric distribution by compounding the exponentiated Weibull and geometric distributions to form a continuous distribution. In this paper, we define an exponentiated Weibull-geometric distribution by using the T-R framework proposed by Alzaatreh et al. (2013) and recently used by Hamed et al. (2018). This new distribution is discrete and is the discrete analogue of the continuous exponentiated Weibull distribution, in the same way that the geometric distribution is the discrete analogue of the exponential distribution.
The distribution in (1.1) belongs to the T-R family. Many continuous distributions have been defined and studied by using the result in (1.1). In particular, Alzaatreh et al. (2012) defined the T-geometric family. This family consists of the discrete analogues of the distribution of the non-negative continuous random variable T. Furthermore, the authors defined and studied the exponentiated-exponential geometric distribution (EEGD). The EEGD, with one shape parameter, provided excellent fits to many count data sets. This observation motivated the definition and study of the EWGD. The EWGD, characterized by two shape parameters, is a generalization of the EEGD and the geometric distribution.
In this article, an exponentiated Weibull-geometric distribution (EWGD) is defined and studied. The paper is organized as follows: In Section 2, the definition and some properties of EWGD are given. In Section 3, estimation of the parameters is considered along with some test and goodness-of-fit statistics for EWGD. An exponentiated Weibull-geometric regression (EWGR) model to fit a count response variable that follows the EWGD is defined in Section 4. A zero-inflated EWGR is also given in Section 4. In Section 5, the EWGR model is applied to two real life data sets and the results are compared with other count data regression models. Some concluding remarks are provided in Section 6.


The cumulative distribution function (CDF) of the EWGD is
$$F(y) = \left\{1 - \theta^{(y+1)^c}\right\}^a, \quad \text{for } y = 0, 1, 2, 3, \ldots \tag{2.1}$$
The corresponding probability mass function (PMF) for EWGD is given as
$$P(Y = y) = \left\{1 - \theta^{(y+1)^c}\right\}^a - \left\{1 - \theta^{y^c}\right\}^a. \tag{2.2}$$
The EWGD in (2.1) is the same as the exponentiated discrete Weibull distribution (Nekoukhou and Bidram, 2015). The two distributions are derived through different methods. In this paper, different properties and applications to count data modeling are emphasized. When c = 1, the EWGD reduces to the exponentiated exponential-geometric distribution (EEGD) defined and studied by Alzaatreh et al. (2012). When c = a = 1, the EWGD reduces to the geometric distribution with parameter $\theta$. When a = 1, the EWGD reduces to the discrete Weibull distribution defined and studied by Nakagawa and Osaki (1975). When a = 1 and c = 2, the EWGD reduces to the discrete Rayleigh distribution defined by Roy (2004).

Transformations:
The following propositions show the relationships between EWGD and some continuous distributions. These relationships can be used to simulate random variates from the EWGD.
follows an EWGD with parameters a, c, and . follows an EWGD with parameters a, c, and .

Quantile Function:
By using Proposition 1, the quantile function of EWGD is $y = Q(u) = \left[\left\{\log_\theta\left(1 - u^{1/a}\right)\right\}^{1/c}\right]$, where [v] is the largest integer less than or equal to v. This result can be used to simulate a random sample from EWGD. In order to do this, simulate a random variate u from the uniform (0, 1) distribution and compute $y = Q(u)$ to obtain a random variate y from the EWGD.
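The simulation step can be illustrated with a minimal Python sketch. It assumes the quantile function has the form $Q(u) = [\{\log_\theta(1 - u^{1/a})\}^{1/c}]$ with $0 < \theta < 1$, which corresponds to the CDF $F(y) = \{1 - \theta^{(y+1)^c}\}^a$. The function names `ewgd_quantile` and `ewgd_sample` are hypothetical and are not taken from the paper.

```python
import math
import random


def ewgd_quantile(u, a, c, theta):
    """Assumed EWGD quantile: floor({log_theta(1 - u**(1/a))}**(1/c))."""
    if not 0.0 < u < 1.0:
        raise ValueError("u must lie strictly between 0 and 1")
    v = u ** (1.0 / a)
    if v >= 1.0:
        # u**(1/a) rounded to 1 in floating point; the quantile is not computable
        raise ValueError("u is too close to 1 for this value of a")
    # log base theta of (1 - v); both logarithms are non-positive, so x >= 0
    x = math.log1p(-v) / math.log(theta)
    return math.floor(x ** (1.0 / c))


def ewgd_sample(n, a, c, theta, seed=None):
    """Draw n variates by inverting uniform(0, 1) draws."""
    rng = random.Random(seed)
    sample = []
    while len(sample) < n:
        try:
            sample.append(ewgd_quantile(rng.random(), a, c, theta))
        except ValueError:
            continue  # discard draws on which the quantile cannot be evaluated
    return sample


if __name__ == "__main__":
    draws = ewgd_sample(10, a=2.0, c=1.5, theta=0.6, seed=1)
    print(draws)
```

With a = c = 1 the sketch reduces to inversion sampling from the geometric distribution, which gives a simple way to check it.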
The exponentiated Weibull distribution with the PDF Note that there are other values of the parameters c and a for which the EWGD is monotonically decreasing even though the distribution of T is not monotonically decreasing.
The hazard function of EWGD is given by $h(y) = P(Y = y)/P(Y \ge y)$. Nekoukhou and Bidram (2015) illustrated the hazard rate function of EWGD for different values of the parameters a, c and $\theta$. They noted that the hazard rate function could be decreasing, increasing, bathtub-shaped, or upside-down bathtub-shaped. This shows that the EWGD, characterized by two shape parameters, is more flexible than many other discrete distributions.
By using Theorem 2 in Alzaatreh et al. (2012), if the distribution of T (i.e., the exponentiated Weibull distribution) is unimodal, then so is the corresponding T-geometric distribution. Nassar and Eissa (2003) showed that the exponentiated Weibull distribution is unimodal. Hence, the EWGD is unimodal.

Moments and dispersion:
The moments and the moment generating function cannot be expressed in closed forms. However, the r th central moments can be computed numerically by evaluating  Table 1.
When 2 a  and 1 c  , the EWGD is over-dispersed. For all other values of a and c, the distribution is either under-dispersed, equi-dispersed or over-dispersed.
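As an illustration of the numerical evaluation, the following Python sketch computes the mean and variance by truncated summation and reports the dispersion index (variance divided by mean). It assumes the CDF $F(y) = \{1 - \theta^{(y+1)^c}\}^a$ and uses the tail-sum identities $E(Y) = \sum_{y \ge 0} P(Y > y)$ and $E(Y^2) = \sum_{y \ge 0} (2y + 1) P(Y > y)$. The function names, the tolerance and the truncation rule are assumptions made for this sketch.

```python
import math


def ewgd_survival(y, a, c, theta):
    """P(Y > y) = 1 - {1 - theta**((y+1)**c)}**a under the assumed CDF."""
    t = theta ** ((y + 1) ** c)
    if t >= 1.0:
        return 1.0
    # -expm1(a * log1p(-t)) equals 1 - (1 - t)**a but is stable for tiny t
    return -math.expm1(a * math.log1p(-t))


def ewgd_mean_var(a, c, theta, tol=1e-12, max_terms=1_000_000):
    """Mean and variance from truncated tail sums.

    The loop stops once a term falls below tol; heavy-tailed cases
    (small c) may need a smaller tol or a larger max_terms.
    """
    m1 = 0.0  # running sum for E(Y)
    m2 = 0.0  # running sum for E(Y^2)
    for y in range(max_terms):
        s = ewgd_survival(y, a, c, theta)
        term = (2 * y + 1) * s
        m1 += s
        m2 += term
        if term < tol:
            break
    return m1, m2 - m1 * m1


if __name__ == "__main__":
    for a, c in [(1.0, 1.0), (3.0, 1.0), (1.0, 2.0)]:
        mean, var = ewgd_mean_var(a, c, theta=0.5)
        print(a, c, round(mean, 4), round(var, 4), round(var / mean, 4))
```

For a = c = 1 and $\theta$ = 0.5 the sketch returns a mean of about 1 and a variance of about 2, the values for the geometric distribution, so the dispersion index exceeds 1 in that case.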

Statistical Inference
We consider parameter estimation, hypothesis testing and goodness-of-fit tests. In Subsection 3.1, we address the maximum likelihood estimation of the three parameters of EWGD. In Subsection 3.2, we compare the EWGD with its sub-models and briefly describe some goodness-of-fit statistics.

Maximum likelihood estimation
Suppose a random sample $Y_1, Y_2, \ldots, Y_n$ of size n is taken from the EWGD. The log-likelihood function of the EWGD in Equation (2.2) is given by
$$\ell(a, c, \theta) = \sum_{i=1}^{n} \log\left[\left\{1 - \theta^{(y_i+1)^c}\right\}^a - \left\{1 - \theta^{y_i^c}\right\}^a\right]. \tag{3.1}$$
The partial derivatives of Equation (3.1) with respect to a, c, and $\theta$ give the likelihood equations. The maximum likelihood estimates $\hat{c}$, $\hat{a}$, and $\hat{\theta}$ of the parameters are obtained by using PROC NLMIXED in SAS to maximize the log-likelihood function in Equation (3.1).
When a = c = 1, the EWGD reduces to the geometric distribution. We consider the data to be from the geometric distribution and use the moment estimate of the geometric distribution to obtain the initial estimate of $\theta$. Thus, the initial estimate of $\theta$ is given by equating the sample mean from the data to the geometric population mean. This is given as f and 1 f are non-zero.
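The log-likelihood and the moment-based starting value can be sketched in Python as follows. This is an illustration only and is not the SAS PROC NLMIXED code used in the paper. It assumes the PMF $P(Y = y) = \{1 - \theta^{(y+1)^c}\}^a - \{1 - \theta^{y^c}\}^a$, so that for a = c = 1 the geometric mean is $\theta/(1 - \theta)$ and the moment estimate is $\bar{y}/(1 + \bar{y})$. The function names and the small data vector are hypothetical.

```python
import math


def ewgd_pmf(y, a, c, theta):
    """Assumed EWGD PMF: difference of the CDF at y and at y - 1."""
    upper = (1.0 - theta ** ((y + 1) ** c)) ** a
    lower = (1.0 - theta ** (y ** c)) ** a if y > 0 else 0.0
    return upper - lower


def ewgd_loglik(data, a, c, theta):
    """Log-likelihood: sum of log PMF values over the sample.

    math.log raises ValueError if a PMF value underflows to zero,
    which can happen for extreme counts or parameter values.
    """
    return sum(math.log(ewgd_pmf(y, a, c, theta)) for y in data)


def initial_theta(data):
    """Moment estimate of theta under the geometric model (a = c = 1)."""
    ybar = sum(data) / len(data)
    return ybar / (1.0 + ybar)


if __name__ == "__main__":
    counts = [0, 1, 0, 2, 3, 0, 1]  # hypothetical counts, for illustration only
    theta0 = initial_theta(counts)
    print(theta0, ewgd_loglik(counts, 1.0, 1.0, theta0))
```

A general-purpose optimizer could then be started at a = c = 1 and the moment estimate of $\theta$ to maximize `ewgd_loglik`.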

Tests and goodness-of-fit statistics
The EWGD reduces to EEGD when c = 1. To compare the EWGD with EEGD, we test the hypothesis $H_0: c = 1$ against $H_1: c \neq 1$. The null hypothesis can be tested by using the

Count data regression
Suppose that Y is a count response variable that follows the EWGD in Equation (2.2) and Y is associated with a set of predictors. We wish to fit the response variable Y by using the predictors. Suppose we have a k -1 row vector of predictors ( 1, , , , ) In count data modeling, it is common to model the mean by a log-linear relationship. The mean of EWGD is not in closed form, but it is a function of parameter  . We assume that the parameter  of EWGD is a function of i x given by ( ) ( , ) This leads to the exponentiated Weibull-geometric regression (EWGR) model given by where () ii x   is given in Equation (4.1). The estimation of the parameters can be carried out by using the maximum likelihood estimation method. The log-likelihood function is given by A count data may have an inflated number of k value in the data. The most common k value is the zero which leads to zero-inflated regression model. Similarly, the count data may not have a zero count and this leads to zero-truncated regression model. In this section, we will define a zero-inflated regression model for the EWGR model. A zero-inflated EWGR (ZIEWGR) model is a mixture model with the probability mass function (1 )

Applications
In this section, we apply the generalized Poisson regression (GPR) model defined by Famoye (1993), the exponentiated exponential geometric regression (EEGR) model defined by Famoye and Lee (2017) and the EWGR model to two count data sets. The GPR and EEGR models are chosen as competitors because both can accommodate over- or under-dispersion. Because the data sets have a high proportion of zeros, the zero-inflated versions of the models were also applied and the results are compared. Kamalja and Wagh (2018) pointed out that ignoring the zero-inflated nature of a data set can result in an underestimation of the parameters, which may lead to insignificant findings.
Cameron et al. (1988) used the data from the 1977-78 Australian Health Survey to analyze various measures of health-care utilization. The data can be obtained from the Journal of Applied Econometrics 1997 Data Archive. Many authors, including Mullahy (1997) and Cameron and Johansson (1997), fitted the data to univariate regression models. A detailed description of the predictor variables can be found in Gurmu and Elder (2000). Summary statistics for the predictor variables were provided in Cameron et al. (1988).

Health Care Data
We model the response variable y, the total number of non-prescribed medications used in the past two days. The complete data set has six response variables. All six variables were adequately fitted by the EWGR and ZIEWGR models. The SAS NLMIXED procedure was used to fit the regression models to the response variables. We consider a fit adequate when the optimization program converged and the gradient for each of the parameter estimates is less than 1.0E-6. The GPR and EEGR models and their zero-inflated versions adequately fitted only the response variable y and one other response variable (the number of admissions to a hospital, psychiatric hospital, nursing or convalescent home in the past 12 months). The results for this other response variable are similar to those for the variable y reported in Table 2. The response variable y ranges from 0 to 8 with a mean of 0.3557 and a variance of 0.507. The variable is over-dispersed and it is highly skewed to the right with skewness of 3.05 and kurtosis of 15.11.
The results of fitting ZIEEGR and ZIEWGR are presented in Table 2. For all models (the results for ZIGPR are not provided in the table), the predictors sex, age and illness are positively associated with the total number of non-prescribed medications used. However, the predictor freerepa is negatively associated with the response variable. The dispersion parameter a in both ZIGPR and ZIEEGR is significantly different from 1. In the ZIEWGR model, the dispersion parameter c is significantly different from 1 but the parameter a is not significantly different from 1. The ZIEEGR is nested within the ZIEWGR model. Thus, we can compare ZIEWGR with ZIEEGR by testing the null hypothesis that c = 1. Since the null hypothesis is rejected, one should use the ZIEWGR to model the data. The log-likelihood statistics for the ZIEEGR and ZIEWGR models in Table 2 support this assertion. The AIC, BIC and RPS for the ZIGPR model are respectively 7863.1, 8040.1 and 5.7E-5. By using these statistics, we notice that ZIEWGR provided the best fit among the three models.
The log-likelihood statistics for the GPR, EEGR and EWGR models are respectively -3930.16, -3929.87, and -3918.31. In comparing these values with the corresponding ones for the zero-inflated models, we observe that the zero-inflated models performed better than the non-inflated (ordinary) models. The ordinary models are all nested within the zero-inflated models. The likelihood ratio tests reject, at the 5% level and for all models, the null hypothesis that all the parameters of the zero-inflation part are zero. The observed proportion of zeros in the data is 73.49%. After fitting the ZIGPR, ZIEEGR and ZIEWGR models, the predicted proportions of zeros are respectively 73.83%, 73.56% and 73.48%. The ZIEWGR provided the best predicted probability of zero. We also calculated the chi-square values by combining the last three classes in the frequency table. The chi-square values for ZIGPR, ZIEEGR and ZIEWGR are respectively 16.41, 16.68 and 2.20. Note that we have a total of 7 classes after the last three classes were combined. The goal of computing the chi-square values is not to check if these values are significant, but to see which of these models provides the closest expected frequencies. In this analysis, the ZIEWGR model provided the best fit according to the goodness-of-fit statistics.

Violence Data
The National Violence Against Women (NVAW) Survey of 1995-1996 was conducted to obtain a public-use data set. Interviews were completed with men and women, but the data used in this sub-section are a subset of the 8000 interviews completed with women who were at least 18 years old and living in US households. Respondents were asked questions on various topics including physical assault they had experienced as adults by any type of perpetrator. The response variable used in the data analysis is physical assault or violence. This is the total number, out of twelve possible violent physical actions, directed toward a woman by her current and/or past partners. A high score on this variable indicates that a woman experienced severe violence.
In the analysis, seven predictor variables were used. The variables are age in years; level of education, one of seven school levels (0 = no schooling to 6 = postgraduate); race (1 = white, 0 = others); number of children under 18 years of age (Nchid); respondent's income level, one of 10 levels (1 = below $5,000 to 10 = over $1,000,000); health level, one of 5 levels (0 = poor to 4 = excellent); and drug, a binary variable that indicates illicit drug use with 1 = yes and 0 = no. The variable drug indicates if a woman has used marijuana, cocaine, heroin, angel dust, etc. in the past month. After excluding the cases with missing information on any of the predictor variables or the response variable, we have 6110 observations.
The descriptive statistics for the response and predictor variables are given in Table 3. The response variable, violence, is positively skewed (skewness = 2.24, kurtosis = 4.70). Tjaden and Thoennes (1999) provided a detailed description of the variables and the most recent publications on the data. The results of fitting the ZIEEGR and ZIEWGR models are presented in Table 4. For the ZIGPR (not included in Table 4) and ZIEEGR models, the variables education and health are significantly associated with the response variable violence. The higher the level of education (or the better the health condition), the lower the number of violent acts a respondent experienced. The other five predictor variables are not significantly related to violence. In the ZIEWGR model, the predictor variables education and health are also negatively associated with the number of violent acts. In addition to these two predictor variables, drug is positively related to the number of violent acts under the ZIEWGR model. The respondents who used illicit drugs in the month before the survey tend to have experienced a higher number of violent acts.
The chi-square values from the ZIGPR, ZIEEGR and ZIEWGR models are respectively 58.73, 46.84 and 31.23. The ZIEWGR model provided the closest expected frequencies. The observed proportion of zeros for the response variable violence is 67.05%. The predicted proportions of zeros from the ZIGPR, ZIEEGR and ZIEWGR models are respectively 67.07%, 67.08% and 66.93%. The ZIGPR provided the best expected zero frequency.
The AIC, BIC and RPS for the ZIGPR model are respectively 16224.0, 16338.0 and 6.221E-4. In comparing these values with the corresponding values for the ZIEEGR and ZIEWGR models in Table 4, we observe that the ZIEWGR model provided the best fit followed by the ZIEEGR model. The log-likelihood statistics for the GPR, EEGR and EWGR models are respectively -8416.13, -8224.73 and -8151.13. In comparing the ordinary regression models with their corresponding zero-inflated regression models, we observe that the zero-inflated models performed better. The results from the data analysis show that the ZIEWGR provided the best fit according to the goodness-of-fit statistics.
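As a worked illustration, the log-likelihood statistics reported for the ordinary EEGR and EWGR models in the two applications can be turned into likelihood ratio statistics for the hypothesis c = 1. The Python sketch below assumes that the EEGR model is the c = 1 special case of the EWGR model with the same predictors, so that the statistic can be compared with the upper 5% point of the chi-square distribution with one degree of freedom (3.841). This comparison is an illustration and is not reported in this form in the analysis above; the function names are hypothetical.

```python
CHI2_1DF_5PCT = 3.841  # upper 5% point of the chi-square distribution, 1 df


def likelihood_ratio(loglik_null, loglik_alt):
    """Likelihood ratio statistic 2 * (l_alt - l_null) for nested models."""
    return 2.0 * (loglik_alt - loglik_null)


def rejects_null(statistic, critical_value=CHI2_1DF_5PCT):
    """True when the statistic exceeds the chosen critical value."""
    return statistic > critical_value


# Log-likelihoods quoted in the text: (EEGR as null, EWGR as alternative)
reported = {
    "health care": (-3929.87, -3918.31),
    "violence": (-8224.73, -8151.13),
}

if __name__ == "__main__":
    for name, (l_eegr, l_ewgr) in reported.items():
        stat = likelihood_ratio(l_eegr, l_ewgr)
        print(name, round(stat, 2), rejects_null(stat))
```

The statistics are about 23.12 for the health care data and about 147.20 for the violence data, both well above 3.841.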

Summary and conclusions
The exponentiated Weibull-geometric distribution has two shape parameters and can be applied to fit count data with over-dispersion or under-dispersion. The distribution has a closed-form PMF and CDF. One limitation of the distribution is that its moments cannot be expressed in closed form. However, the moments can easily be computed numerically. Quite often, the negative binomial distribution (NBD) and/or the generalized Poisson distribution (GPD) are used to fit count data. Both distributions have one shape parameter and only the GPD can be used to fit both under-dispersed and over-dispersed data. The EWGD studied in this paper, with two shape parameters, is more flexible than these two distributions.
Consul (1989, page 1) pointed out that a natural event leading to the Poisson distribution follows the principle of complete randomness. When this principle does not hold, the generalized Poisson distribution is applied. The same can be said of the geometric distribution. When the principle of complete randomness fails, distributions like the negative binomial distribution, EEGD or EWGD are applied. Among these three, only EEGD and EWGD can accommodate both under-dispersion and over-dispersion, and EEGD is a sub-model of EWGD.
A count data regression model, the exponentiated Weibull-geometric regression model, is defined. A modified version, the ZIEWGR model, is defined and illustrated with two numerical data sets. The goodness-of-fit of the ZIEWGR model is compared with that of ZIEEGR and ZIGPR by using the AIC and the ranked probability scores, among other statistics. In the two numerical examples, the ZIEWGR performed better than the other two count data regression models.