Bayesian Inference for Spatial Count Data that May be Over-Dispersed or Under-Dispersed with Application to the 2016 US Presidential Election

We propose a method of spatial prediction using count data that can be reasonably modeled assuming the Conway-Maxwell Poisson distribution (COM-Poisson). The COM-Poisson model is a two parameter generalization of the Poisson distribution that allows for the ﬂexibility needed to model count data that are either over or under-dispersed. The computationally limiting factor of the COM-Poisson distribution is that the likelihood function contains multiple intractable normalizing constants and is not always feasible when using Markov Chain Monte Carlo (MCMC) techniques. Thus, we develop a prior distribution of the parameters associated with the COM-Poisson that avoids the intractable normalizing constant. Also, allowing for spatial random eﬀects induces additional variability that makes it unclear if a spatially correlated Conway-Maxwell Poisson random variable is over or under-dispersed. We propose a computationally eﬃcient hierarchical Bayesian model that addresses these issues. In particular, in our model, the parameters associated with the COM-Poisson do not include spatial random eﬀects (leading to additional variability that changes the dispersion properties of the data), and are then spatially smoothed in subsequent levels of the Bayesian hierarchical model. Furthermore, the spatially smoothed parameters have a simple regression interpretation that facilitates computation. We demonstrate the applicability of our approach using simulated examples, and a motivating application using 2016 US presidential election voting data in the state of Florida obtained from the Florida Division of Elections.


Introduction
In United States presidential elections there are many states that historically vote for the same political party. For example, from recent past electoral results, Republican candidates tend to win most of the mountain states and Great Plains, and most of the South. Similarly, Democratic candidates often win the Mid-Atlantic states along with New England and the West Coast states. There are also so-called swing states, which refers to any state that could reasonably be won by either the Democratic or Republican presidential candidates. For example, Texas is the key to outcome of the 1960 election, Florida and New Hampshire are key in deciding the 2000 election, and Ohio was important during the 2004 election (Duquette et al., 2017). Thus, there is reason to suggest under-dispersion for some regions and over-dispersion in others, where under (over) dispersion refers to when the mean is larger (smaller) than the variance of the data. That is, historically consistent voters suggest that variance within certain regions might be very small and vice versa.
In the context of modeling spatial count data, such as US voting data, the Poisson distribution has gained widespread popularity (Cressie, 1993;Banerjee et al., 2014). In general, spatial statistical models for spatial count data have become a common choice in a large number of scientific disciplines. For example, spatial count data have been modeled in areas as diverse as small-area samples from surveys, meteorological observations, epidemiological data, transportation data, and relative abundance of various species in ecological monitoring studies, among others (Guikema and Goffelt, 2008;Gupta et al., 2014;Sellers and Raim, 2016). Poisson regression is a standard framework for modeling covariate dependent count data through a log-linear link function. Nevertheless, the distributional assumption of having the mean equal to the variance over the entire spatial domain (i.e., equi-dispersion) is rarely satisfied for the types of processes that typically are observed in practice. Count data can exhibit over-dispersion and under-dispersion causing traditional count data regression models to violate this property of the data. Several approaches for addressing over-dispersion have been developed such as quasilikelihood methods, Poisson regression with random effects, and models based on the negative binomial distribution (Manton et al., 1981;Cameron and Trivedi, 2013;Ver Hoef and Boveng, 2007;Hilbe, 2011). Models based on the negative binomial distribution fit reasonably well for over-dispersed data, but often under perform in under-dispersion settings (Lindén and Mäntyniemi, 2011). Hence, methods that allow for both under-dispersion and over-dispersion are important contributions to the analysis of count-valued data.
There are several recent proposed flexible models that use the Conway-Maxwell Poisson distribution (COM-Poisson) (Conway and Maxwell, 1962), and have the potential to overcome the limitations of traditional count models. The COM-Poisson distribution was introduced into the statistics literature by (Shmueli et al., 2005) with several follow-up papers (e.g., see Daly and Gaunt, 2016;Chakraborty and Imoto, 2016;) Shmueli et al. (2005 demonstrate the probabilistic and statistical properties of the COM-Poisson distribution and describe several methods of parameter estimation. In particular, the COM-Poisson distribution is a member of the exponential family and can be seen as an extension to the Poisson distribution, with an extra parameter that flexibly controls the level of dispersion. Additionally, this distribution has the Poisson and geometric distributions as special cases and the Bernoulli distribution as a limiting case. The COM-Poisson provides a promising and flexible approach for performing count data regression. This distribution has become widely utilized in a variety of applications and has great practical interest. However, the statistical performance of this model has not yet been fully characterized. A comprehensive overview regarding the COM-Poisson model is provided by Sellers et al. (2012). The COM-Poisson distribution has since been used in more sophisticated Bayesian hierarchical models formulated to dynamically accommodate varying levels of spatial dispersion (Wu et al., 2013).
The application of hierarchical Bayesian models for spatial and spatio-temporal count data has become increasingly popular over the past decades. For example, Wikle and Hooten (2006) and Hooten et al. (2007) proposed spatio-temporal Poisson models. Bradley et al. (2018) also develops multivariate spatio-temporal models for high-dimensional count data. Moreover, there are many examples in the epidemiological field (see, e.g. Waller et al., 1997;Carlin and Banerjee, 2003, and beyond). Wu et al. (2013) developed a space-time COM-Poisson model. However, the overdispersion from the spatial random effects obfuscate the dispersion of the data. In this article, we propose a computationally efficient hierarchical Bayesian spatial COM-Poisson model that preserves the dispersion properties of the data and simultaneously include spatial random effects. To do this, we assume conditional independence between the data and the spatial process conditioned on a set of parameters. We refer to this model as Over or Underdispersed Regression for Spatial (OURS) count data.
We consider a modified version of the conjugate prior distribution for the COM-Poisson in Kadane et al. (2006). This modification completely avoids the need to compute the normalizing constant in the COM-Poisson. Furthermore, we show that our modified prior distribution leads to a well defined posterior distribution for the parameters in a COM-Poisson.
Another contribution of our method is that it is matrix free, which is a growing area in computational statistics (Dai et al., 2020;Yang and Bradley, 2021). To achieve a computationally efficient approach in the Bayesian setting we introduce latent random variables that allow one to model the dependency completely in the mean term. The first step of our procedure is using slice sampling (Neal, 2003) to sample from the aforementioned modified conjugate posterior from Kadane et al. (2006). The next step uses a posterior predictive distribution to a model with functional spatial dependencies defined in the mean. This two step procedure shows that we do not need to store a large matrix and nor do we need to compute the intractable normalizing constant in a COM-Poisson, which aids in computational efficiency.
The remaining sections of this article are organized as follows. We first review the COM-Poisson distribution then propose the methodology and our model framework in Section 2. In Section 3, we present simulated examples for over/under-dispersed counts. In Section 4, we implement our model using 2016 US presidential voting data in the state of Florida. Finally, Section 5 contains a discussion.

Review of Bayesian Analysis of COM-Poisson Distributed Data
In this section, we review the COM-Poisson distribution. The COM-Poisson distribution generalizes the Poisson distribution to model over-dispersion or under-dispersion. Define an ndimensional data vector Z = (z 1 . . . z n ) , where z i ∈{0, 1, 2, . . .} is the count-valued outcome associated with the i-th region for every i = 1 . . . n. For example, z i could represent the number of votes for a Republican candidate in county i. The probability mass function (p.m.f) is where Q(λ i , ν i ) serves as a normalization constant and ν i is called the dispersion parameter. The level of dispersion can be conveniently characterized with ν i = 1, ν i < 1, and ν i > 1 corresponding to equal-dispersion, over-dispersion, and under-dispersion, respectively. COM-Poisson is a member of the exponential family and it has the Poisson distribution (when ν i = 1) and geometric distribution (when ν i = 0 and λ i < 1) as special cases and the Bernoulli distribution (when ν i → ∞) as a limiting case. When ν i = 0 and λ i 1, Q(λ i , ν i ) does not converge, and the distribution is undefined. Kadane et al. (2006) used the exponential family structure of the COM-Poisson to establish a conjugate prior density of the form for λ i > 0 and ν i 0, where κ(a, b, c) is the integration constant. The posterior has the same form as (3), with a = a + z i , b = b + log(z i !) and c = c + 1. The values of a, b, and c are restricted so that κ −1 (a, b, c)< ∞. Using Jensen's inequality and the convexity of the log-gamma function, a necessary and sufficient condition for a finite where m denotes the floor of m (see, Kadane et al., 2006, for a complete proof). A spatial(-time) alternative to the Kadane et al. (2006)'s prior was introduced in Wu et al. (2013). Here, they assume the vector λ = (λ 1 , . . . , λ n ) has a basis function representation: log(λ) = μ+ α+ , where Gaussian priors are given to the n-dimensional vector μ, r-dimensional vector α, and n-dimensional vector . Here represents a n × r matrix of spatial basis functions. Specification of basis functions is a key tool used for spatial and spatio-temporal models (Wikle, 2010). Basis functions imply dependence, since cov (log(λ)) = cov(α) , which is not equal to a zero matrix.

A Modified Kadane et al. (2006) Prior Distribution
One can avoid the normalizing constant in the posterior distribution by setting c = −1. However, the implied posterior distribution is not proper (i.e., does not integrate to one). We gain flexibility in specifying c = −1 by truncating the support of λ i . Namely consider the following prior distribution for (λ i , ν i ): where I (·) is an indicator function, ν i 0, 0 < w < ∞. Then the posterior distribution for where ν i 0. The posterior means of λ i and ν i are given by ( w Z i +a+1 )Z i + a( w Z i +a+1 ) and 1/ b + log(Z i !) , respectively. Thus, the choice of w, a, and b are important from the perspective of point estimation. For example, if w ≈ Z i + a + 1 and a ≈ 0 we see that the posterior mean is roughly an unbiased estimator for λ i . When b < 1 and Z i is equal to zero or one, we obtain a poster mean of ν i greater than one suggesting under-dispersion. We note that in later stages in our Bayesian hierarchical models we effectively smooth values from this posterior predictive distribution (see Section 2.4), and hence, these smoothed parameter estimates have different properties.

Over or Underdispersed Regression for Spatial Data
In order to incorporate dependence in an efficient manner, we introduce latent random effects into a hierarchical model. The joint distribution is given by where is the prior distribution in Equation (5), and we specify f (β λ , β ν |λ, ν, θ ) to be a multivariate normal distribution. The parameter λ i and ν i are the location and dispersion parameters in a COM-Poisson distribution, and β λ and β ν will be defined as smoothed version of these parameters that take into accounts covariates and spatial dependence. We let so that θ = (σ 2 λ , σ 2 ν ) , where σ 2 λ > 0, σ 2 ν > 0, and the n-dimensional random vectors where the n-dimensional vectors λ = (λ 1 , . . . λ n ) and ν = (ν 1 , . . . ν n ) and "Normal(μ, )" is a shorthand for a multivariate normal distribution with mean μ and positive definite covariance matrix . Let x i be a known p-dimensional vector of covariates, X = {x 1 . . . x n } , ∈ R n × R r is defined to be a matrix of basis functions which is of dimension n × r (r n). We also assume σ 2 λ ∼ IG(1, 1) and σ 2 ν ∼ IG(1, 1) where "IG(α, κ)" is a shorthand for the inverse gamma distribution with shape α > 0 and scale κ > 0. Notice that the means of β λ and β ν are a series of projections of log(λ) and log(ν) onto the column space of X and , respectively. In Equation (9), we see that the mean of β λ is a smoothed (i.e., projection) version of the location parameter λ. Consequently, we interpret β λ as a location parameter that accounts for covariate (i.e., X) and spatial behavior (i.e., ). Similarly, one can interpret β ν as a dispersion parameter that incorporates covariates and spatial behavior. Consequently, we are interested in inference on β λ and β ν to adjust for covariate and spatial effects when learning about location and dispersion of the data. Sellers and Shmueli (2013) and Wu et al. (2013) introduced extensions of COM-Poisson regression that allows for different group and spatial levels of dispersion by modeling ν with additional spatial random effects. However the random effects in these models compete with the over-dispersion parameter ν. We remove this competition by assuming conditional independence between {λ, ν} and {β λ , β ν }, where β λ and β ν include spatial dependence.

Posterior Predictive Distribution
The posterior predictive distribution for {β λ , β ν } is given by Although this integral does not have a closed form, Equation (10) leads to straightforward implementation via a Gibbs sampler. See Algorithm 1 for an outline.
Thus, our implementation involves generating values from two easy to sample from conditional distributions. That is, first generate λ [b] and ν [b] from f (λ, ν|Z) and then generate the value from f (β λ , β ν |λ [b] , ν [b] ). It is important to emphasize that Step 3-8 in Algorithm 1 does not require computationally difficult covariance matrices to store and invert. However, this does not imply that we do not model spatial dependence. That is, let P x = X(X X) −1 X and P = ( ) −1 so that imply non-zero means and non-diagonal convariance matrices, where the expectations are taken with respect to (7). Thus, predictions from our incorporate spatial dependence, and are computed from Algorithm 1 withẐ where b 0 is a burn-in, and in Step 8 we have generated from posterior predictive distribution based β λ and β ν instead of λ and ν. The use of β λ and β ν is preferable because, again, these parameters incorporate spatial and covariate information into the prediction of the mean of Z.

Illustrations Using Simulated Examples
In this section, we present two different simulation scenarios. The first specifies ν i so that z i is under-dispersed and the other specifies z i to be over-dispersed. We simulate z i from a COM-Poisson distribution, and our main goal is to assess whether or not β λ can be used to accurately estimate the true location parameter. We fix the dispersion parameter ν i as constant (e.g., ν i = 2 for all i) in both simulation settings. Values of ν > 1 indicate under-dispersion relative to the Poisson, while ν < 1 indicates over-dispersion. Also, there are many possible choices for basis functions (see Section 2.1). In this paper, we use the thin plates basis function (Wahba, 1990). The thin plates basis function are defined as where s is a centroid of one of the n study regions (e.g., counties in Florida) and {c ij : i = 1, . . . r, j = 1, 2, 3} represent the knot locations defined over resolutions j = 1, 2, 3. Specifically, for j = 1, 2, 3 let {c ij : i = 1, . . . , ω j } be equally spaced centroids over the n regions. This implies r = ω 1 + ω 2 + ω 3 and α i > 0 is called a bandwidth so that = {ψ(c 11 ), . . . , ψ(c ω 3 3 )} . Specifying {c ij } in this way is referred to as a multi-resolutional choice of knots, and is a common choice in spatial statistics (e.g., see Cressie and Johannesson, 2008). Let · denote the usual Euclidean norm. Also, the choice of hyperparameters (i.e., a, b, w, and r) are set constant across simulated replicates. For each simulation, we implement with Algorithm 1 with B = 5000 and treated the first 2000 iterations as a burn-in. Consider generating data in the following way. Generate 3,000 observations over a onedimensional spatial domain (i.e., n = 3,000), such that z i is distributed according to a COM-Poisson with location λ i and dispersion parameter ν i . Define the true λ i = 2.1 sin(2πs i ) + 4.7 and where the centroid of region i is denoted as s i ∈ {s 1 . . . s 3000 } ⊂ [0, 1] and are equally spaced. We fix the true dispersion parameter as a constant. When generating under-dispersed data we set ν = (2, . . . , 2) , and when generating over-dispersed data we set ν = (0.95, . . . , 0.95) . Then to generate from a COM-Poisson the infinite sum in (2) is truncated, and we truncate at 100-th term. This is done using the R-package CompGLM. In (4), we set a = 2, b = 2, and w = 30. We implement the model in (7) with x i = (1, sin(2πs i )) and set ω 1 = 4, ω 2 = 5, and ω 3 = 6 so that we have a total of r = 15 thin plate splines over the three resolutions α 1 = 3.01, α 2 = 3.97, and α 3 = 5.02. In Figure 1, we plot the true log(λ) and 95% pointwise credible intervals associated with β λ using replicates from Algorithm 1. Here we see that the credible intervals display the pattern of the truth in both dispersion settings, and tend to contain log(λ). This pattern is consistent when simulating 50 replications. We provide the average performance of parameter  Table 1 by simulation set-up (i.e., over or under dispersion data). We use different metrics to evaluate the posterior performance. Those metrics including the mean (over 50 simulated data vectors) of mean (over elements of our data vector) absolute error (MMAE), the mean standard deviation (MSD), the mean of mean squared error (MMSE). The estimation performance is stable, in the sense that the metrics also small for both parameters. In general, we tend to perform better in the over-dispersion setting.

Data Description
As an illustration, we analyze voting data from the 2016 United States presidential election. We focus on data in the state of Florida at the county-level. In this application, we ignore the third parties and only focus on two main political parties, the Republican and Democratic parties. The candidate for the Republican party was Donald J. Trump and the candidate for the Democratic party was Hillary Clinton. In 2016, there were 67 counties in Florida. A transformation of the voting results can be seen of Figure 2. This dataset is made publically available on the Florida Division of Elections (https://results.elections.myflorida.com).

Analysis
We present an analysis of the election dataset using our method. In this analysis, we run 30,000 iterations and treated the first 20,000 iterations as a burn-in. We informally check trace plots for convergence, and no lack of convergence was detected. The main goals of our analysis is to compare the predictive performance of OURS relative to competing methods, and to infer values of over and under-dispersion. Recall, this is not immediately possible to do for our competitors. The response in this analysis is a transformation of the difference in the number of votes between Trump and Clinton. We shift this difference so the smallest value is zero, we rescale (12) by 1 30,000 for a numeric reason, and we take ceiling so that the response in integer-valued.
where V i,1 is the number of votes for Trump over each county, V i,2 is the number of votes for Clinton over each county, "min" is the minimum, and · represent the ceiling function. For illustration, we specify an intercept only model (i.e., x i = 1). In our model, we set r = 67 vector ψ(s) was chosen to consist of thin plates basis functions (see, Equation (11)) associated with the centroids of each county. The bandwidth is set to 0.75 which is chosen based on minimizing a criterion over a range of choices of the bandwidth.  We compare our model to a Poisson model with latent multivariate log gamma random effects (Bradley et al., 2018) and a Poisson model with latent Gaussian random effects (Hadfield et al., 2010). The multivariate log-gamma distribution is a type of conjugate multivariate (CM) distribution (Bradley et al., 2020). The Poisson model with latent Gaussian random effects is sometimes called a Latent Gaussian Process (LGP) (Gelfand and Schliep, 2016). The same covariates and basis functions were used in both of the competing models, and public-use code for both methods were used. We use the Deviance Information Criterion (DIC) (Spiegelhalter et al., 2002) and the logarithm pseudo marginal likelihood (LPML) (Chen et al., 2008) as an overall model fit measure. These two criteria involve a tradeoff between the goodness of fit In Figure 3, we plot predictions of Z using OUR (Algorithm 1), predictions from a Poisson CM model, and a Poisson LGP model. All predictions look similar, except the Poisson CM appears to overfit. This is verified by the value of DIC and LPML, see Table 3. Our model has the smallest DIC and largest LPML marginally. This gives motivation for our method because OURS not only gives competitive predictions, but also allows us to assess over-and-under dispersion more readily (i.e., through estimates of λ).
For this dataset, we find that eight counties appear to be under-dispersed (i.e., the pointwise 95% credible interval of β ν,i is greater than 1). The eight counties are Clay, Collier, Franklin, Holmes, Indian River, Lee, Putnam, and Wakulla counties. Considering the past four presidential elections from 2000 to 2012, these counties consistently favored the Republican party in each election. Four of these counties are rural areas and the remaining four are adjacent to rural areas.

Discussion
The COM-Poisson model has received a great deal of attention in recent years in many fields of application. This because COM-Poisson regression is a popular model for count data due to its ability to capture both under-dispersion and over-dispersion. In practice, modeling spatial count data using COM-Poisson is challenging, since spatial random effects compete with the dispersion parameter.
We have proposed a new "matrix free" Bayesian approach for modeling dependent count data. The first contribution is that we modify the prior distribution from Kadane et al. (2006) so that one can avoid computing/approximating the normalizing constant. Additionally, we impose a conditional independence assumption between the COM-Poisson parameters, and a spatially smoothed version of these parameters to avoid inflating variance of the data with spatial random effects changing the dispersion properties of the data. Thus, our approach can model both over-dispersed and under-dispersed count data (unlike negative binomial and many other count models). We refer to this model as Over or Under-dispersed Regression for Spatial (OURS) count data. Another contribution is that our approach is "matrix free" and computationally efficient. To achieve computationally efficient implementation of OURS, in the Bayesian setting, we model spatial dependency through latent random variables and two step procedure. We first sample from posterior distribution of the COM-Poisson parameters based on the modified prior distribution then use a posterior predictive distribution to model functional spatial dependence in the mean of latent processes. Consequently, there is no inversion or storage of a large matrix in our approach.
In Section 3, we present different simulation scenarios. It includes one-dimensional spatial locations with both under-dispersed and over-dispersed data and two-dimensional spatial locations for under-dispersed data. In each scenario, we find that we can accurately estimate the true location and dispersion parameters. We see that the credible interval displays the pattern of the truth in each scenario. In Section 4, we present a real data application of Florida voting data in the 2016 US presidential election. OURS produces better measures of out-of-sample error than Poisson CM model and Poisson LGP model. Furthermore, OURS allow us to assess over-and-under dispersion while the Poisson CM and Poisson LGP only allow for overdispersion.
There are several possibilities of interesting future work. For example, the COM-Poisson distribution's structure allows for a variety of generalizations such as zero-inflated data. Its appeal from a practical point of view is even stronger: it is easy to use, flexible for fitting overdispersed and under-dispersed data, and the second step of Algorithm 1 could easily be adapted to other setting such as time series or the spatio-temporal settings.