An Application of a Bayesian Model Averaging Approach to Traffic Accident Data Over Hierarchical Log-Linear Models

Abstract: In this article, a Bayesian model averaging approach for hierarchical log-linear models is considered. Posterior model probabilities are calculated approximately for hierarchical log-linear models. The dimension of the model space of interest is reduced by using the Occam's window and Occam's razor approaches. The 2002 road traffic accident data of Turkey are analyzed by using the considered approach.


Introduction
Many fields of scientific investigation include the analysis of qualitative data. It is important to discover the association structure of the categorical variables of interest. On this point, log-linear models are widely used and very flexible. Bayesian methods give the opportunity of combining sample information with expert information; in this way, all available information is included in the analysis. The association structure can be displayed by a Bayesian approach to log-linear modelling. The Bayesian approach is more advantageous than the classical setting because the inference is exact rather than asymptotic. The Bayesian setting gives an entire posterior distribution for each element of the model, whereas the classical setting yields a point estimate and a precision estimated via an asymptotic method. In addition, the Bayesian approach would give better estimates of variability than the likelihood analysis (Gelfand and Mallick, 1995). All of these advantages are also valid for Bayesian approaches to log-linear modelling.
Estimation of the model parameters does not complete the analysis process. Model selection should be considered to obtain the most parsimonious model. When the Bayesian approach is used along with log-linear modelling, Bayesian model selection procedures take the place of the classical ones. The general advantages of the Bayesian approach are also valid here.
In this sense, prior information can be induced on the log-linear parameters, the expected cell counts or the cell probabilities. In each case, the induced prior information should be consistent with the other two cases. Posterior inferences are also drawn by using different algorithms for each case. In this article, we put the prior information on the log-linear parameters.
There is a huge literature on model selection. Most of the methods can be used for a large class of models, including log-linear models. Some of the methods are based on a series of significance tests, some include prior information and use Markov chain Monte Carlo (MCMC) methods, and some are based on Bayes factors. Almost every method has its own problems. The most important problem concerns model uncertainty: when a single model is selected and inferences are conditionally based on the selected model, model uncertainty is ignored (Raftery, 1996). This case is especially seen in the classical setting.
This difficulty can be overcome by including the information provided by all suitable models in the analysis process. The most common way of combining information provided by different sources is to use an average. From the Bayesian point of view, this averaging is applied such that the posterior distribution of the quantity of interest is obtained under each suitable model, and these distributions are weighted by the posterior model probabilities (Leamer, 1978; Raftery, 1996). Leamer (1978) extended the model averaging idea of Roberts (1965). However, computational difficulties handicapped the progress of the model averaging idea. Draper (1995) and Raftery (1995) reviewed Bayesian model averaging (BMA) and the cost of ignoring model uncertainty. Madigan and Raftery (1994) also considered BMA; they give the Occam's razor and Occam's window approaches to reduce the number of candidate models. The work of Hoeting et al. (1999) is a good tutorial on BMA. They discuss the implementation of BMA for graphical, regression and generalized linear models, survival analysis, software for BMA, prior model probabilities and the predictive performance of BMA, and give several examples.
In this article, we aim to present a BMA approach for model selection in hierarchical log-linear (HLL) models and to analyze road traffic accident data by using the BMA approach. We use the normal likelihood, which is used by Leighty and Johnson (1990), and a normal prior distribution, following the approach given in Leighty and Johnson (1990) and Demirhan and Hamurkaroglu (2006). We calculate a required integral, which is discussed in detail in Section 4, to develop a BMA approach for HLL models. Calculation of this integral is one of the main difficulties of the general BMA approach.
Our approach is useful and more advantageous than the classical setting. In the classical setting, the model that gives the best fit to the data is determined by using a criterion or a model selection procedure, and classical estimates of the model parameters are obtained conditionally on the predetermined model. In addition, estimation of another quantity, such as the quartiles of parameters, is not possible in the classical setting. In the Bayesian setting, however, one can estimate other quantities and the model parameters simultaneously, and obtain their entire posterior distributions. Moreover, model uncertainty is included in the resulting estimates at the same time. Because of this, the standard error estimates of the parameters and the considered quantities will be smaller, and one can draw inferences on the distribution of the model parameters. These are significant gains of the BMA approach in general. Specifically, for the road traffic accident data, we are able to determine the best fitting model more confidently in the Bayesian setting. We obtain estimates of the model parameters with smaller standard errors. This is important, because our inferences are mainly based on these estimates.
In addition, we are able to see the posterior distributions of model parameters.
Details of the analysis of the data set are given in Section 5.
Section 2 introduces the log-linear model notation and the hierarchy principle. The BMA approach is outlined in Section 3. Section 4 is on BMA for HLL models. In Section 5, the 2002 road traffic data of Turkey are analyzed by using the given BMA approach.

Hierarchical Log-linear Models and Notation
The number of terms of a log-linear model increases with the number of categorical variables, and standard notations then become cumbersome. Instead, King and Brooks (2001a) give very flexible and practical notations, which are also used in this work.
The set of sources, where the data come from, is denoted by S. More detailed notations for the elements of the design matrix, the order of parameters and cells, and examples are given in King and Brooks (2001a, 2001b).
The family of hierarchical models is such that if any β^c term is not included in the model, then none of its higher order relatives may be included, and all of its lower order relatives must be in the model at the same time (Bishop, Fienberg and Holland, 1975, chap. 2). The hierarchy principle helps us to decrease the number of models of interest in model selection.
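The hierarchy principle can be checked mechanically: a model is hierarchical exactly when every non-empty lower-order relative of each included term is also included (which, in turn, rules out a higher relative appearing without the term itself). A minimal sketch, where the function name and the encoding of terms as tuples of source labels are our own illustrative choices:

```python
from itertools import combinations

def is_hierarchical(model):
    """Check the hierarchy principle: for every term in the model,
    all of its lower-order relatives (non-empty proper subsets)
    must also be in the model."""
    terms = {frozenset(t) for t in model}
    for t in terms:
        for r in range(1, len(t)):
            for sub in combinations(t, r):
                if frozenset(sub) not in terms:
                    return False
    return True

# The two-way term {OT, AP} demands both main effects OT and AP:
print(is_hierarchical([("OT",), ("AP",), ("OT", "AP")]))  # True
print(is_hierarchical([("OT",), ("OT", "AP")]))           # False: AP missing
```

Enumerating only the models that pass this check is what reduces the model space relative to the full set of log-linear models.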

Bayesian Model Averaging
Underestimation due to model uncertainty can lead to very risky false decisions (Hodges, 1987). BMA provides a way to deal with model uncertainty, because all elements of the model space ℘(S) are considered for the estimation of a quantity of interest. Here, the quantity of interest can be a group of parameters, an effect size, an odds ratio, etc. Let the quantity of interest be ∆ and let D be the data; then

P(∆|D) = Σ_{m ∈ ℘(S)} P(∆|D, m) P(m|D), (3.1)

where P(m|D) ∝ P(D|m)P(m), P(m) is the prior model probability, and P(D|m) is the likelihood under model m, obtained as follows:

P(D|m) = ∫ P(D|β_m, m) P(β_m|m) dβ_m. (3.2)

The posterior mean and variance of ∆ are obtained by averaging as follows:

E(∆|D) = Σ_{m ∈ ℘(S)} ∆̂_m P(m|D), (3.3)

Var(∆|D) = Σ_{m ∈ ℘(S)} [Var(∆|D, m) + ∆̂_m²] P(m|D) − E(∆|D)², (3.4)

where ∆̂_m = E(∆|D, m) (Hoeting et al., 1999).
Although BMA is seen as a solution to the model uncertainty problem, it is not used as a standard analysis toolkit, because it has some difficulties:
1. The integral in (3.2) is in general hard to compute.
2. The dimension of the model space can be enormous, which prevents considering the whole model space.
3. The specification of P(m) is not clear, especially if there is prior information on some of the models of ℘(S).
To relax the effect of the second difficulty, the dimension of the considered model space can be reduced. Madigan and Raftery (1994) proposed an approach that works on a subset of the whole model space, namely 𝒜′(S) ⊆ ℘(S). 𝒜′(S) includes the models that do not predict the data far less well than the best model (Madigan and Raftery, 1994). We should find a subset of ℘(S) over which to apply (3.1). Let 𝒜′(S) be the mentioned subset of ℘(S); it is defined as follows:

𝒜′(S) = { m ∈ ℘(S) : max_{m′} P(m′|D) / P(m|D) ≤ C },

where m′ is the model that has the maximum posterior model probability and C is an arbitrary constant. Models not belonging to 𝒜′(S) are excluded from ℘(S). This is the first principle of Madigan and Raftery (1994), called Occam's window. According to their second principle, which is called Occam's razor, complex models receiving less support from the data than their simpler sub-models are excluded from 𝒜′(S) by the following equation:

P(m₀|D) / P(m|D) > 1, (3.5)

where m₀ is a simpler sub-model of m.
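Both principles can be sketched as a filter over posterior model probabilities. In this illustrative encoding (our own, not the paper's), a model is a frozenset of its terms, so that sub-models are proper subsets, and `C = 20` follows the value the paper later adopts:

```python
def occams_window(post, C=20.0):
    """Apply Occam's window, then Occam's razor.
    `post` maps a model (frozenset of terms) to its posterior
    probability P(m|D). Occam's window keeps models within a factor
    C of the best model; Occam's razor then drops any model that has
    a simpler nested sub-model with higher posterior probability."""
    best = max(post.values())
    window = {m: p for m, p in post.items() if best / p <= C}
    keep = {}
    for m, p in window.items():
        # A proper-subset model with higher posterior dominates m.
        dominated = any(m0 < m and p0 > p for m0, p0 in window.items())
        if not dominated:
            keep[m] = p
    return keep

# Hypothetical posteriors: the saturated-style model is both outside
# the window and dominated; the middle model is dominated only.
post = {frozenset({"A"}): 0.50,
        frozenset({"A", "B"}): 0.30,
        frozenset({"A", "B", "C"}): 0.01}
kept = occams_window(post, C=20.0)
```

Here only the simplest model survives: the factor-20 window removes the worst model, and the razor removes the middle one because its sub-model fits better.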

BMA for HLL Models
One of the difficulties of BMA, which is also valid for HLL models, is the computation of the integral (3.2). We calculated the required integral.
When (3.2) is rewritten as follows, computation of the integral becomes simpler:

P(D|m) = ∫ P(D|β_m, m) P(β_m|m) dβ_m. (4.1)

In (4.1), P(D|β_m, m) is the likelihood of the data, and P(β_m|m) is the prior distribution of the log-linear parameter vector under the model m. Leighty and Johnson (1990) used the multivariate normal density N(β_m; b_m, V_bm) to approximate the likelihood of the data. Here b_m is the maximum likelihood estimate (MLE) of β_m and V_bm is the corresponding covariance matrix. The approach of Leighty and Johnson (1990) is followed for the specification of P(β_m|m), which is taken as N(β_m; μ_m, Σ_m). Then (4.1) is rewritten as follows:

P(D|m) = ∫ (2π)^{-p} [det(V_bm) det(Σ_m)]^{-1/2} exp{ -[(β_m - b_m)^T V_bm^{-1} (β_m - b_m) + (β_m - μ_m)^T Σ_m^{-1} (β_m - μ_m)]/2 } dβ_m, (4.2)

where det(·) denotes the determinant of the inner matrix, p is the dimension of β_m, V_bm^{-1} is the inverse of the covariance matrix of the MLEs, and μ_m and Σ_m^{-1} are the prior mean vector and the inverse of the prior covariance matrix of β_m, respectively. After some algebra, with

A = (V_bm^{-1} + Σ_m^{-1})^{-1}, z = A (V_bm^{-1} b_m + Σ_m^{-1} μ_m), h = b_m^T V_bm^{-1} b_m + μ_m^T Σ_m^{-1} μ_m - z^T A^{-1} z,

it is obtained that

P(D|m) = (2π)^{-p/2} [det(A) / (det(V_bm) det(Σ_m))]^{1/2} exp(-h/2) ∫ N(β_m; z, A) dβ_m. (4.3)

The integral part of (4.3) is the integral of the density of a multivariate normally distributed random vector β_m with mean vector z and covariance matrix A, and equals 1. Thus, P(D|m) is obtained as follows:

P(D|m) = (2π)^{-p/2} [det(A) / (det(V_bm) det(Σ_m))]^{1/2} exp(-h/2). (4.4)

The approach of Leighty and Johnson (1990) is used to represent the degree of belief in the prior information. In this approach, a prior distribution for Σ_m is specified in two stages. In the first stage, the covariance matrix of the prior distribution is taken as Σ_m = αC_m = αcI_m, where I_m is the identity matrix of dimension p, and c = p / tr(V_bm^{-1}) (Leighty and Johnson, 1990). The distribution of the general precision parameter α is given by the second stage prior. It is taken that τ = 1/(1 + α) and τ ∼ uniform(0, 1) to make the calculations easier. Values of τ represent the degree of our belief in the prior. Leonard (1975) and Leighty and Johnson (1990) state that values of this precision parameter close to zero represent disbelief. When these definitions are applied to h, the following is obtained:

h = b_m^T V_bm^{-1} b_m + μ_m^T μ_m / (αc) - z^T A^{-1} z, with A = (V_bm^{-1} + I_m/(αc))^{-1} and z = A (V_bm^{-1} b_m + μ_m/(αc)).

If a noninformative prior distribution is defined, τ → 0. This implies that α → ∞, and hence lim_{α→∞} h = 0; then P(D|m, τ) → 0 for all m ∈ ℘(S). When equation (4.4) is used to find P(D|m, τ), models of greater dimension are penalized by the term (2π)^{-p/2}. As a result of this penalization, simpler models have greater model probability than complex models, even if the complex models fit the data better. For these reasons, it is more appropriate to use the result of (4.4) up to a proportionality constant, such that

P(D|m, τ) ∝ [det(A) / (det(V_bm) det(Σ_m))]^{1/2} exp(-h/2). (4.5)

After the integration problem, we deal with the huge number of HLL models. The number of log-linear models grows rapidly with the number of considered categorical variables. Although this growth is slower for HLL models, the problem still remains.
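Up to the dropped constant, (4.5) is a closed-form Gaussian computation. The sketch below assumes a diagonal V_bm, a simplification of our own so that no linear-algebra library is needed; the two-stage prior enters through τ via α = (1 − τ)/τ and c = p / tr(V_bm^{-1}), with μ_m supplied by the caller (function and argument names are ours):

```python
import math

def log_marginal(b, v_diag, mu, tau):
    """log P(D|m, tau) up to an additive constant, in the style of
    (4.5), for a DIAGONAL V_bm (a simplifying assumption of this
    sketch). Prior covariance Sigma_m = alpha*c*I with
    c = p / tr(V_bm^{-1}) and tau = 1/(1 + alpha)."""
    p = len(b)
    alpha = (1.0 - tau) / tau
    c = p / sum(1.0 / v for v in v_diag)   # c = p / tr(V^{-1})
    sigma = alpha * c                       # scalar prior variance
    log_det_term = 0.0
    h = 0.0
    for bi, vi, mui in zip(b, v_diag, mu):
        a = 1.0 / (1.0 / vi + 1.0 / sigma)  # A = (V^{-1} + Sigma^{-1})^{-1}
        z = a * (bi / vi + mui / sigma)     # z = A(V^{-1} b + Sigma^{-1} mu)
        log_det_term += 0.5 * (math.log(a) - math.log(vi) - math.log(sigma))
        h += bi * bi / vi + mui * mui / sigma - z * z / a
    return log_det_term - 0.5 * h
```

In the diagonal case this collapses, coordinate by coordinate, to -0.5·log(v + σ) - 0.5·(b − μ)²/(v + σ), which is a quick way to sanity-check the algebra behind (4.3) and (4.4).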
The Occam's window approach of Madigan and Raftery (1994) provides a solution to this problem. We adapted the approach for HLL models. Let Ω(S) and Υ(S) be subsets of ℘(S); set Ω(S) = ∅ and let Υ(S) be the set of starting models. Then the following down algorithm and up algorithm are applied successively.
Here, dim(·) denotes the dimension of the inner vector. Models satisfying (4.6) are also excluded (Madigan and Raftery, 1994). After all of these steps, Ω(S) contains the acceptable models, and the number of possible models is noticeably reduced. Then the set of models over which to average is 𝒜(S) = Ω(S).

Analysis of Road Traffic Accidents Data
The log-linear parameter vector of the saturated model is β_m = {∅, (β^{c_1})^T, ..., (β^{c_14})^T}^T. After these definitions, BMA is applied over the given down and up algorithms. For these algorithms, O_R and O_L are taken as 20 and 0.00001, respectively. The Newton-Raphson algorithm is used to obtain V_bm and b_m. P(m|D, τ) is obtained by using (4.5) and P(m).
The dimension of the model space is 168. The prior model probabilities were taken equal to 0.00479 for the models other than m∗, and 0.2 for m∗. The reason for this choice is the prior information that the result of an accident should be related with OT, AP and SP; however, the state of the person does not have an association with the overlay type or the place of accident. The prior distribution of the log-linear parameter vector is defined by using the approach of Leighty and Johnson (1990), the τ of which is taken as 10⁻⁷. This setting induces a diffuse prior on the log-linear parameters. Gibbs sampling is employed to obtain the posterior estimate of each model parameter (β̂^c_m) for a given model. The total number of iterations was 30000, 10000 of which were discarded as burn-in. A record was made at the end of every 200th cycle to reduce the autocorrelation of the Gibbs sequence. The full conditional distributions given in Demirhan and Hamurkaroglu (2006) are used in the mentioned Gibbs sampling scheme.
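The burn-in and thinning scheme described above amounts to simple slicing of the recorded draws. A sketch, where the chain is a stand-in list rather than real Gibbs output:

```python
def burn_and_thin(chain, burn_in, step):
    """Discard the first `burn_in` draws, then keep every `step`-th
    remaining draw to reduce the autocorrelation of the sequence."""
    return chain[burn_in::step]

# With the settings used here (30000 iterations, 10000 burn-in,
# one record per 200 cycles), 100 draws per parameter are retained.
draws = list(range(30000))  # stand-in for one parameter's Gibbs output
kept = burn_and_thin(draws, burn_in=10000, step=200)
```

The retained draws are what the per-model posterior means, standard errors and percentiles reported below are computed from.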
At the end of the down and up algorithms, the elements of 𝒜(S) and the corresponding posterior model probabilities were obtained as given in Table 1.
Table 1: Elements of 𝒜(S) and corresponding posterior model probabilities.
As seen in Table 1, three models remain as possibly acceptable models in Ω(S). Therefore it is not necessary to go on with the application of (3.5), and 𝒜(S) = Ω(S).
After the determination of 𝒜(S), the ∆ of (3.1) is taken as each element of β_{m_i}, i = 1, 2, 3, and the β̂^c_m, c ∈ m_i, m^c ∈ M_c, are obtained. It is also taken as the 5th, 25th, 50th, 75th and 95th percentiles of the distribution of each parameter of each m_i. The results, obtained over the Gibbs sampling, are combined to reach the BMA estimates by using (3.1), and presented in Table 2. The MLEs of the parameters of m_2 (b_{m_2}) and their standard errors are also presented in Table 2 to show what is gained over the classical setting by using the presented approach. As seen in Table 1, the model m_2, which is a conditional association model, has the greatest posterior model probability. Thus it is the most appropriate model for the data. According to m_2, the two-way interactions between overlay type and place of accident, overlay type and state of people, and place of accident and state of people are independent given the result of the accident. In addition, the associations between the levels of overlay type and place of accident, overlay type and state of people, and place of accident and state of people are homogeneous given the result of an accident. In the classical approach, m_2 appears as the best fitting model, but there are two more models (m_1 and the model including the {S_1, S_2, S_3}, {S_1, S_3, S_4} and {S_2, S_3, S_4} interactions) that also fit well. When our approach is compared to the classical setting, we can conclude more confidently that m_2 gives the best fit and that there is no other candidate model, because of its very high posterior model probability.
It can be seen from Table 2 that all of the posterior distributions of the log-linear parameters, reached using BMA, are narrow and approximately symmetric. This inference cannot be drawn in the classical setting. The gain of drawing it is that if the posterior distribution of one of the model parameters is skewed, then we may decide to use the median of the resulting Gibbs sequences as the posterior estimate instead of the mean. Although {S_1, S_2, S_3} is not included in the model m_2, the associated parameters are given in Table 2. Due to the low posterior model probabilities of the models m_1 and m_3, the estimates of the elements of β^{c_11} are close to zero. In contrast to the classical setting, the standard error estimates of the model parameters are much smaller in the Bayesian setting, because the inclusion of the model uncertainty decreases the uncertainty on the model parameters and hence the standard errors. Comparison of the values of the parameter estimates of the Bayesian and classical settings is not appropriate, but similar inferences should be drawn in both settings when a diffuse prior distribution is used, as in our case. The inferences drawn from the Bayesian and classical estimates are different from each other. The reason for this is also the inclusion of the model uncertainty in the posterior estimates. Consequently, the BMA approach to the road accident data set is more reliable and provides better estimates and inferences than its classical counterparts.
The following inferences are drawn from the Bayesian estimates of the model parameters. A weak negative association exists between asphalt roads and being killed in an accident for all places and all levels of SP. There is a positive association between having an accident on inter-city roads and being killed in an accident for all overlay types and all levels of SP. The association between being a driver and being killed in an accident is negative for all overlay types and all places. A positive association exists between asphalt roads, inter-city roads and being killed in an accident for all levels of SP. A positive association exists between asphalt roads, being a driver and being killed in an accident for both city and inter-city roads. There is a weak negative association between inter-city roads, being a driver and being killed in an accident for all of the overlay types.
All calculations required for the application were done with a computer program written by the authors in the Delphi 6 application development environment.

Discussion
When HLL models are used to discover the association structure of categorical variables, Bayesian model averaging provides a suitable approach. The general difficulties of BMA are also valid for HLL models. For HLL models, the number of considered models is reduced by the algorithm given by Madigan and Raftery (1994), and the integral used to find the posterior model probabilities is obtained up to a proportionality constant. Although an approximation is used to find the posterior model probabilities, reasonable results were obtained in the application. The approach of Leighty and Johnson (1990) is used to obtain the likelihood function and to determine the prior distribution of the log-linear parameters.
The algorithm of Madigan and Raftery (1994) works well for the HLL models; for instance, the number of possible models was reduced from 168 to 3 in the road traffic data application. In addition, the whole procedure was applied for various τ values, and it is seen that the results are not sensitive to the choice of τ. However, the results are sensitive to the choice of O_R and O_L. For their choice, we suggest running the whole procedure several times and carefully investigating the B values of the down and up algorithms, because smaller values of O_R cause the exclusion of appropriate complex models in the down algorithm, and greater values of O_L cause the exclusion of appropriate parsimonious models in the up algorithm. Application of equation (4.6) is optional; if the number of possibly acceptable models is small enough, it need not be applied.
In conclusion, BMA is an effective way of including model uncertainty in the analysis. As seen in the application, some parameters are not included in the best model but still have a positive effect on the estimation of the expected cell counts, even if they are small. In addition, BMA is used not only for the estimation of the log-linear parameters but also for the estimation of various percentiles of their posterior distributions. As future work, the results of this work can be extended to other likelihood-prior settings.
so the cells are indexed by k ∈ K. Expected and observed cell counts are denoted by n_k and y_k for k ∈ K, respectively. The set of subsets of S is defined by ℘(S) = {s : s ⊆ S}. Then m ⊆ ℘(S) is used to represent a log-linear model, where m lists the log-linear terms present in the model. Each element of the model m is included in a set c such that c ∈ m ⊆ ℘(S). The constant term of the log-linear model is represented by ∅ ∈ ℘(S). M_c contains all possible combinations of the levels of the sources included in c. In general, the highest level is not included by the elements of M_c. Thus the set M_c is {m^c_1, ..., m^c_{|M_c|}}. Then the posterior distribution of the quantity of interest, under each model in ℘(S), is weighted by the posterior model probability of the corresponding model. The posterior model probability of m is obtained as follows:

P(m|D) = P(D|m)P(m) / Σ_{m′ ∈ ℘(S)} P(D|m′)P(m′).

If a model m has a simpler sub-model m₀ and the posterior model probability of the sub-model is higher, then the considered model is excluded from 𝒜′(S) by equation (3.5). Finally, the ℘(S)s seen in (3.1), (3.3) and (3.4) are replaced by 𝒜(S) = 𝒜′(S) \ ℬ(S), where ℬ(S) is the set of models excluded by (3.5).
Down Algorithm. 1. Select a model m from Υ(S). 2. Υ(S) ← Υ(S) \ {m} and Ω(S) ← Ω(S) ∪ {m}. 3. Select a hierarchical submodel m₀ of m such that dim(β_{m₀}) = dim(β_m) − 1. 4. Compute B = log[P(m₀|D)/P(m|D)]. 5. If B > O_R, then Ω(S) ← Ω(S) \ {m}; if m₀ ∉ Υ(S), then Υ(S) ← Υ(S) ∪ {m₀}. 6. If O_L ≤ B ≤ O_R, then if m₀ ∉ Υ(S), Υ(S) ← Υ(S) ∪ {m₀}. 7. If there are more submodels of m, then go to 3. 8. If Υ(S) ≠ ∅, then go to 1.

Up Algorithm. 1. Select a model m from Υ(S). 2. Υ(S) ← Υ(S) \ {m} and Ω(S) ← Ω(S) ∪ {m}. 3. Select a hierarchical supermodel m₁ of m such that dim(β_{m₁}) = dim(β_m) + 1. 4. Compute B = log[P(m|D)/P(m₁|D)]. 5. If B < O_L, then Ω(S) ← Ω(S) \ {m}; if m₁ ∉ Υ(S), then Υ(S) ← Υ(S) ∪ {m₁}. 6. If O_L ≤ B ≤ O_R, then if m₁ ∉ Υ(S), Υ(S) ← Υ(S) ∪ {m₁}. 7. If there are more supermodels of m, then go to 3. 8. If Υ(S) ≠ ∅, then go to 1.

Here, the choice of O_L and O_R is based on the considered data set. Raftery, Madigan and Volinsky (1994) suggest taking 1/20 and 20 for O_L and O_R, respectively. After the application of the above algorithms, (3.5) is applied by replacing 1 with exp(O_R). In addition, models satisfying

max_{m′} {P(m′|D)} / P(m|D) > exp(−O_L) (4.6)

are excluded.

Road traffic accidents of Turkey in 2002 are taken into consideration. The considered factors are the overlay type of the road (OT), the place of the accident (AP), the state of the people included (SP), and the result of the accident (RA). The considered overlay types are concrete, asphalt, parquet, stabilized and dirt. The considered places of accident are city and inter-city roads. The included people are classified as driver, passenger and pedestrian, and the results of an accident are taken as killed or injured. The data set was recorded by The Department of Traffic Training and Research of the General Directorate of Security Affairs of Turkey in 2002 and taken from Traffic Statistics Annual - 2002.

The number of elements of a set is denoted by |·|, so each source is labelled such that S = {S_ζ : ζ = 1, ..., |S|}. The set of levels for source S_ζ is K_ζ, for ζ = 1, ..., |S|. The cells of a contingency table can be represented by the set
Thus the log-linear parameter vector for the model m is β_m = {(β^{c_1})^T, (β^{c_2})^T, ..., (β^{c_|m|})^T}^T. The design matrix or model matrix corresponding to the model m ⊆ ℘(S) is denoted by X_m. Using the design matrix and the parameter vector, the log-linear model is represented as follows: log n = X_m β_m.
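Given X_m and β_m, the expected cell counts follow by exponentiating the linear predictor. A small sketch for a hypothetical 2 × 2 table with effect coding; the design matrix and parameter values are invented for illustration:

```python
import math

def expected_counts(X, beta):
    """Expected cell counts under log n = X_m beta_m:
    n_k = exp(sum_j X[k][j] * beta[j])."""
    return [math.exp(sum(x * b for x, b in zip(row, beta)))
            for row in X]

# 2x2 table, independence model: constant plus one effect per source,
# with effect (+1/-1) coding of the two levels of each source.
X = [[1,  1,  1],
     [1,  1, -1],
     [1, -1,  1],
     [1, -1, -1]]
beta = [2.0, 0.5, 0.3]
n = expected_counts(X, beta)
```

Adding a column for the two-way term {S_1, S_2} to X would turn this independence model into the saturated model for the 2 × 2 table.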

Table 2: Bayesian and classical parameter estimates.