Indirect Area Estimates of Disease Prevalence: Bayesian Evidence Synthesis with an Application to Coronary Heart Disease

Risks for many chronic diseases (coronary heart disease, cancer, mental illness, diabetes, asthma, etc.) are strongly linked to both socioeconomic and ethnic group, and so prevalence varies considerably between areas. Variations in prevalence are important in assessing health care needs and in comparing health care provision (e.g. surgical intervention rates) to health need. This paper focuses on estimating prevalence of coronary heart disease and uses a Bayesian approach to synthesise information of different types to make indirect prevalence estimates for geographic units where prevalence data are not otherwise available. One source is information on prevalence risk gradients from national health survey data; such data typically provide only regional identifiers (for confidentiality reasons), and so gradients by age, sex, ethnicity, broad region, and socio-economic status may be obtained by regression methods. Often a series of health surveys is available and one may consider pooling strength over surveys by using information on prevalence gradients from earlier surveys (e.g. via a power prior approach). The second source of information is population totals by age, sex, ethnicity, etc. from censuses or intercensal population estimates, to which survey based prevalence rates are applied. The third potential data source is information on area mortality, since for heart disease and some other major chronic diseases there is a positive correlation over areas between prevalence of disease and mortality from that disease. A case study considers the development of estimates of coronary heart disease prevalence in 354 English areas using (a) data from the Health Surveys for England for 2003 and 1999, (b) population data from the 2001 UK Census, and (c) area mortality data for 2003.


Introduction: Need for Spatially Disaggregated Prevalence Estimates
Often small area prevalence data for major diseases are not collected, or if collected are subject to measurement and administrative biases. However, many countries have regular national health surveys which provide an indication of national trends in prevalence. Such surveys typically provide only broad regional identifiers, whereas health planners require estimates at a much more spatially disaggregated scale, and for those strata (age, sex, ethnicity) by which area populations are recorded in censuses or by intercensal population estimates. Additionally, the estimates should take account of the impact of socioeconomic factors on chronic disease prevalence. In geographic applications, measures of the socioeconomic status of an area's residents include what are known as deprivation indices, where deprivation refers to hardship due to low income, poor housing, high rates of unemployment, etc. In the UK there have been significant developments in the methodology for measuring neighbourhood deprivation (e.g. Noble et al., 2000; Bailey et al., 2003), especially in small neighbourhoods of around 1500-2000 people, there being around 32500 such neighbourhoods in England (ONS, 2006).
This paper describes a Bayesian methodology for obtaining prevalence estimates for chronic disease for 354 English areas, with a particular focus on heart disease. The first source of information is provided by national health surveys. In many countries, a series of health surveys (often annual) is available (among many examples are the Swedish National Public Health Survey, the Italian National Health Survey, and the Taiwan National Health Interview Survey), and one may consider pooling strength over surveys. The analysis here uses the 2003 Health Survey for England, with an earlier 1999 survey providing historical data under a power prior approach (Chen et al., 2000). Except for neighbourhood deprivation category (the quintile rank among 32500 neighbourhoods, with no further identifying information), the spatial scale in the two surveys used consists of nine government regions (the North East of England, the North West, Yorkshire & Humberside, the East Midlands, the West Midlands, Eastern England, London, South East England, and South West England); see Table 1 for a summary of regional differences. A binomial regression is used to model survey evidence on gradients by age, sex, ethnicity, broad region, and neighbourhood deprivation.
The second source of information is population totals by age, sex, ethnicity, etc. from censuses or intercensal population estimates, to which survey based prevalence rates are to be applied. The populations used here are specific for age, sex and ethnic group, and are drawn from the UK 2001 Census; intercensal population estimates by age, sex and ethnicity are not made in the UK, though they are in other countries such as the US (Smith, 1998).
The third source of relevant information is mortality data which typically (unlike prevalence) are well recorded at a disaggregated spatial level. Evidence is presented of a positive correlation between heart disease prevalence and mortality, which points to the benefit of adjusting survey based estimates of area prevalence to take account of proxy information on prevalence provided by mortality data over the 354 areas. When mortality is only infrequently linked to a particular type of morbidity (e.g. asthma, psychiatric illness), other sources of area data can be used as proxies for morbidity; examples are hospital admissions or referrals to community care.
The methodology therefore provides an approach to indirect prevalence estimation, applying survey based gradients for heart disease over those stratifiers by which populations for areas are available (e.g. age, sex, ethnic group), while also taking account of neighbourhood deprivation, and of proxy information on prevalence (from mortality) at the required area level. The methodology adopts a fully Bayesian strategy, with prior densities on parameters updated via the likelihood of the observed data. Iterative Markov chain Monte Carlo (MCMC) techniques (Gelfand and Smith, 1990) are used to estimate models, as implemented in the WinBUGS program (Spiegelhalter et al., 2003).
The following four sections outline the survey based component of the prevalence estimation procedure. They are followed by a section considering how area mortality and prevalence are jointly modelled so that prevalence estimates can incorporate information on spatial mortality patterns. The final section considers possible developments to the methodology.

Survey Model: Populations, Survey Variables and Choice of Binomial Link
To apply survey evidence on disease gradients to estimate prevalence in area populations requires equivalent variables to be available both in Census populations (or in intercensal population estimates) and for respondents in national health surveys. Many countries provide population data by age, sex, and ethnicity; for example, the UK 2001 Census includes a tabulation of populations by age, sex and ethnic group.
It is also typically necessary to take account of socioeconomic gradients in disease prevalence (e.g. gradients by individual occupational status or by the deprivation level of the neighbourhood in which individuals live). This suggests that ideally one would require populations by age, sex, ethnicity and occupation, or populations by age, sex, ethnicity and neighbourhood deprivation level. However, in many countries populations are not available to this level of detail. For example, the UK Census does not provide a four way disaggregation by age, sex, ethnicity and neighbourhood deprivation. In these circumstances, it is proposed here that prevalence estimates by age, sex and ethnicity are scaled by a survey based prevalence gradient over deprivation levels.
A binomial regression is applied to data from the 2003 and 1999 Health Surveys for England to provide model based rates of heart disease prevalence. Survey subjects are classed as having coronary disease if they reported (in the previous year) having angina or a heart attack, confirmed by a doctor. The survey categorisations relevant to estimating area prevalence are age ($a = 1, \ldots, 7$, namely ages 0-34, 35-44, 45-54, 55-64, 65-74, 75-84, 85+), sex ($s = 1, 2$; namely male, female), ethnicity ($e = 1, \ldots, 4$; namely white, black, south Asian, all other ethnic groups), and region ($r = 1, \ldots, 9$, as in the columns of Table 1). Additionally the 2003 survey includes neighbourhood deprivation quintile ($d = 1, \ldots, 5$, with $d = 5$ for most deprived). Respondents are aggregated by risk category cells; Greenland (2001) refers to these as distinct covariate patterns. So the observations become numbers at risk, $n_{aserd}$, and diseased subjects, $y_{aserd}$, both taking account of survey weighting for differential non-response (JHSU, 2004). A log-binomial regression is applied, allowing inferences on prevalence proportion ratios rather than prevalence odds ratios (Skov et al., 1998; Zocchetti et al., 1997); this is also called the log-linear binomial (Greenland, 2004). For example, using this link permits estimation of the prevalence relative risk gradient over neighbourhood deprivation quintiles, whereas logit coefficients only provide relative risks under a rare disease assumption. To avoid probabilities above 1, an upper limit of 0.999 on cell probabilities was imposed. MCMC sampling hit this upper limit only in the first 100 to 200 iterations.
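The contrast between the two links can be illustrated with a minimal sketch. The numbers below (a 10% baseline prevalence and a doubling of risk) are hypothetical and the function names are illustrative only; they are not taken from the survey model. The sketch shows that, under a log link, the exponentiated coefficient is the prevalence ratio directly, whereas the odds ratio for the same pair of prevalences overstates it when the outcome is not rare. It also shows the 0.999 cap on cell probabilities described above.

```python
import math

P_MAX = 0.999  # upper limit on cell probabilities, as described in the text


def log_link_probability(eta: float) -> float:
    """Inverse log link, p = exp(eta), truncated at P_MAX."""
    return min(math.exp(eta), P_MAX)


def prevalence_ratio(p1: float, p0: float) -> float:
    """Ratio of two prevalence proportions."""
    return p1 / p0


def odds_ratio(p1: float, p0: float) -> float:
    """Ratio of the odds corresponding to two prevalence proportions."""
    return (p1 / (1.0 - p1)) / (p0 / (1.0 - p0))


# Hypothetical values: baseline log-prevalence and a log relative risk.
eta_baseline = math.log(0.10)
log_rr = math.log(2.0)

p0 = log_link_probability(eta_baseline)
p1 = log_link_probability(eta_baseline + log_rr)

pr = prevalence_ratio(p1, p0)          # equals exp(log_rr) under the log link
or_ = odds_ratio(p1, p0)               # larger than pr when prevalence is not rare
capped = log_link_probability(0.5)     # exp(0.5) > 1, so the cap applies

print(f"prevalence ratio = {pr:.3f}, odds ratio = {or_:.3f}, capped p = {capped}")
```

With these illustrative inputs the prevalence ratio is 2, while the odds ratio is 2.25.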
To assess possible interactions between risk factors, the prevalence model for the 2003 survey data includes main effects in all variables and second order interactions for which there is evidence in the health outcomes literature, not necessarily specific to heart disease. The historical data model (for the 1999 survey) is the same except for excluding the main deprivation effect and any interactions involving deprivation. The second-order interactions included are for age-sex, sex-region, sex-ethnicity, sex-deprivation, age-deprivation, age-ethnicity, and ethnicity-deprivation. The sex-region interaction is suggested by Table 1, while different gender-age heart disease risk profiles have been reported as well as sex-ethnicity interactions (Primatesta and Brookes, 2000). While completeness in modelling terms might indicate including several interactions, some studies of cardiovascular outcomes that include area deprivation and area type report few interactions as significant (e.g. Martinez et al., 2003). Thus for parsimony, the age and deprivation variables when included in interactions are reframed as binary: ages up to 64 ($a^* = 1$) are compared with ages 65 and above ($a^* = 2$), and the top two deprivation quintiles ($d^* = 2$) are contrasted with the lower three ($d^* = 1$). Substantive justification for such a contraction exists: for example, the main impact of deprivation is on premature ill health and mortality (e.g. Barnett et al., 2001).

Survey Model Specification and Pooling over Surveys
A model including main effects and the above-mentioned interactions is then given by (3.1) and (3.2), where the parameters treated as fixed effects are {$\alpha_{2a}, \alpha_{3e}, \alpha_{4d}, \beta_{1es}, \beta_{2a^*s}, \beta_{3ea^*}, \ldots$}. Priors on {$\alpha$, $\beta$} in (3.1) and (3.2) are based on accumulated epidemiological evidence, such as that provided by UK studies of treated heart disease prevalence. Data from the Key Health Statistics from General Practice (ONS, 2000) give heart disease prevalence rates of 0.1 per 1000 at ages under 34 (for both males and females) ranging to 205 per 1000 (males) and 172 per 1000 (females) at ages over 85. Because of this wide range in risk, normal priors $N(-9, 5)$ and $N(0, 5)$ are adopted for $\alpha_{1s}$ and $\alpha_{2a}$ respectively. For the remaining risk factors (for ethnic and deprivation categories) and the interactions, accumulated evidence (e.g. Hoare, 2003) is that $N(0, 1)$ priors will encompass likely ranges in relative risk. This corresponds to a prior belief that the associated relative risks will be between 0.14 and 7.1 with 95% certainty. It might well be possible to justify more informative elicited priors on relative risk, and it is straightforward to include these when a log link is used in the binomial regression (e.g. Greenland, 2001).
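The stated 95% range for relative risks under an $N(0, 1)$ prior can be checked with a short calculation. The sketch below assumes that the second argument of $N(\cdot, \cdot)$ is a variance and that the coefficient is a log relative risk, which is consistent with the log link; the function name is illustrative.

```python
import math
from typing import Tuple

Z_975 = 1.959964  # 97.5th percentile of the standard normal distribution


def relative_risk_interval(mean: float, variance: float) -> Tuple[float, float]:
    """Central 95% prior interval for exp(beta) when beta ~ N(mean, variance)."""
    sd = math.sqrt(variance)
    return math.exp(mean - Z_975 * sd), math.exp(mean + Z_975 * sd)


# N(0, 1) prior on a log relative risk (variance parameterisation assumed).
rr_low, rr_high = relative_risk_interval(0.0, 1.0)
print(f"95% prior interval for the relative risk: ({rr_low:.2f}, {rr_high:.1f})")
```

This reproduces the interval of about 0.14 to 7.1 quoted in the text.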
The regional effects $\gamma_{rs}$ are treated as random and follow a bivariate spatial conditional autoregressive prior (see Appendix 1), with the multivariate CAR precision matrix $\Phi_\gamma^{-1}$ assumed to follow a Wishart prior with 2 degrees of freedom and identity scale matrix. Reasons for expecting spatial correlation in regional relativities include the north-south contrast in prevalence (Table 1), as well as environmental factors, such as water hardness (Shaper et al., 1980; Catling et al., 2005).
Let $\theta = \{\alpha, \beta, \gamma, \Phi_\gamma\}$ denote the parameters, and let $0 \le \delta \le 1$ be a precision parameter (with beta prior) that weights the historical data $D_h$ relative to the likelihood of the current study data $D$. Following Chen et al. (2000, p. 124), the power prior takes the form

$$\pi(\theta, \delta \mid D_h) \propto [P(D_h \mid \theta)]^{\delta}\, \pi(\theta)\, \delta^{a_\delta - 1} (1 - \delta)^{b_\delta - 1},$$

where $P(D_h \mid \theta)$ is the binomial likelihood, $\pi(\theta)$ is the initial prior on $\theta$, and $(a_\delta, b_\delta)$ are pre-specified beta density hyperparameters. With $\delta$ an unknown, the joint posterior density for $(\theta, \delta)$ is then

$$P(\theta, \delta \mid D, D_h) \propto P(D \mid \theta)\, [P(D_h \mid \theta)]^{\delta}\, \pi(\theta)\, \delta^{a_\delta - 1} (1 - \delta)^{b_\delta - 1}.$$

For the current analysis there is expected to be considerable continuity between the two surveys in prevalence differentials, and the 1999 survey data include relevant information on ethnic prevalence gradients; on the other hand, the model forms for 2003 and 1999 differ (because only the 2003 model includes neighbourhood deprivation) and so some downweighting is appropriate. Here three alternative beta priors for $\delta$ are considered, namely Be(250,1), Be(100,1) and Be(50,1).
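How the power prior downweights the historical survey can be seen in a minimal numerical sketch. It assumes the standard Chen et al. form in which the historical log-likelihood is multiplied by $\delta$ and $\delta$ has a beta prior. The cell counts and probabilities are toy values, not survey data, and the function names are illustrative rather than taken from the WinBUGS implementation.

```python
import math
from typing import Sequence, Tuple

Cells = Tuple[Sequence[float], Sequence[float], Sequence[float]]  # (y, n, p)


def binomial_loglik(y, n, p) -> float:
    """Binomial log-likelihood summed over cells (binomial coefficients omitted)."""
    return sum(
        yi * math.log(pi) + (ni - yi) * math.log(1.0 - pi)
        for yi, ni, pi in zip(y, n, p)
    )


def beta_log_kernel(delta: float, a: float, b: float) -> float:
    """Log of delta^(a-1) * (1-delta)^(b-1), valid for 0 < delta < 1."""
    return (a - 1.0) * math.log(delta) + (b - 1.0) * math.log(1.0 - delta)


def log_power_posterior(delta: float, a: float, b: float,
                        current: Cells, historical: Cells,
                        log_prior_theta: float = 0.0) -> float:
    """Unnormalised log joint posterior of (theta, delta) at fixed cell probabilities."""
    return (
        binomial_loglik(*current)
        + delta * binomial_loglik(*historical)  # historical data downweighted by delta
        + log_prior_theta
        + beta_log_kernel(delta, a, b)
    )


def beta_mean(a: float, b: float) -> float:
    return a / (a + b)


# Toy cells (hypothetical counts and probabilities).
current = ([12, 30], [200, 250], [0.06, 0.12])
historical = ([10, 28], [180, 240], [0.06, 0.12])

# Prior means of delta under the three beta priors considered in the text.
prior_means = {a: beta_mean(a, 1.0) for a in (250.0, 100.0, 50.0)}
log_post = log_power_posterior(0.99, 250.0, 1.0, current, historical)
print(prior_means, log_post)
```

The three priors have means of roughly 0.996, 0.990 and 0.980, so all of them place most of their mass on near-full use of the 1999 data, with Be(50,1) allowing the most downweighting.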

Survey Model Results
Inferences are based on iterations 1000-5000 of two-chain sampling runs starting from dispersed starting values, with convergence achieved by iteration 1000 using Gelman-Rubin criteria (Gelman et al., 1995). Comparisons of models use the deviance information criterion (DIC) of Spiegelhalter et al. (2002), namely the posterior mean deviance $\bar{D}$ plus a complexity measure $p_e$, derived as the difference between $\bar{D}$ and the deviance, $\mathrm{Dev}(\bar{\Psi})$, at the posterior mean $\bar{\Psi}$ of $\Psi = (\theta, \delta)$. So the DIC can be obtained as $\mathrm{Dev}(\bar{\Psi}) + 2p_e$. For model checking, new data ($y_{\mathrm{new},aserd}$) are sampled from the model and compatibility with actual data assessed by the extent to which 95% intervals for new data include the actual data (Gelfand, 1996).
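The two equivalent expressions for the DIC can be verified with a small sketch. The deviance values below are made up for illustration, and the function name is hypothetical; in practice the sampled deviances and the deviance at the posterior mean would come from the MCMC output.

```python
from typing import Sequence, Tuple


def dic_summary(deviance_samples: Sequence[float],
                deviance_at_posterior_mean: float) -> Tuple[float, float, float]:
    """Return (posterior mean deviance, complexity p_e, DIC)."""
    d_bar = sum(deviance_samples) / len(deviance_samples)
    p_e = d_bar - deviance_at_posterior_mean   # effective number of parameters
    dic = d_bar + p_e                          # mean deviance plus complexity
    return d_bar, p_e, dic


# Hypothetical sampled deviances and deviance at the posterior mean.
samples = [102.0, 98.0, 101.0, 99.0]
dev_at_mean = 96.0

d_bar, p_e, dic = dic_summary(samples, dev_at_mean)
print(d_bar, p_e, dic)
```

Here the mean deviance is 100, the complexity is 4, and the DIC is 104, which equals the deviance at the posterior mean plus twice the complexity (96 + 8).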
Table 2 shows parameter estimates (log relative risks) under alternative $\delta$ priors. Average deviances are similar across the three options on $\delta$, though DICs decrease slightly with larger values of $\delta$ because of lower $p_e$. The posterior predictive checking procedure of Gelfand (1996) is satisfactory, with actual cases $y_{aserd}$ in all cells covered by 95% intervals for replicate data sampled from $P(y_{\mathrm{new}} \mid y)$, regardless of the prior on $\delta$.
Age and sex effects are significant under all options, while ethnic group effects for $\delta \sim$ Be(250,1) show lower risk for blacks and higher risk for south Asians (cf. Primatesta and Brookes, 2000). Under all $\delta$ priors, the deprivation effects (centred around their average) show the main contrast is between extremes of neighbourhood deprivation with a relatively flat intermediate effect.
The regional effects support a north-south contrast in prevalence within England; three coefficients are significant under $\delta \sim$ Be(250,1), and the parameter contrasts $\gamma_{31} - \gamma_{91}$, $\gamma_{12} - \gamma_{92}$, and $\gamma_{32} - \gamma_{92}$ have posterior means (95% intervals) of 0.37 (0.07, 0.69), 0.61 (0.19, 1.07) and 0.45 (0.08, 0.85). The fact that regional effects exist after controlling for population composition and neighbourhood deprivation suggests genuine contextual variation in heart disease risk. Of the interactions, $\beta_{122}$ is significant in terms of its 95% credible interval under the two more informative priors on $\delta$, reflecting higher prevalence among black women as compared to men. Both $\beta_{322}$ and $\beta_{332}$ are significantly negative under $\delta \sim$ Be(250,1), showing black and south Asian elders to have lower risk.

Relevant Survey Outputs for Prevalence-Mortality Model
The goal of the analysis is to estimate heart disease prevalence in 354 English areas, for which Census population counts $N_{iase}$ by age, sex and ethnic group are obtainable. Further population disaggregation by neighbourhood deprivation quintile is not available. However, to ensure the area prevalence estimates take account of neighbourhood deprivation, it is possible to obtain population totals $N_{id}$, providing proportions $w_{id} = N_{id}/N_i$ of total area population living in each deprivation quintile. Let $r_i \in \{1, \ldots, 9\}$ denote the region in which the $i$-th area is located; then the impact of neighbourhood deprivation in area $i$ is based on averaging the probabilities $\rho_{a s e r_i d}$ according to the population split $w_{id}$. The age-sex-ethnic specific prevalence rates $R_{iase}$ are then estimated as a weighted average

$$R_{iase} = \sum_{d=1}^{5} w_{id}\, \rho_{a s e r_i d} \qquad (5.1)$$

and prevalent cases in each area and for age, sex and ethnic groups are estimated as $P_{iase} = R_{iase} N_{iase}$. English area mortality data are not specific to ethnic group (see section 4). To generate a prevalence rate that can be modelled jointly with mortality, the $P_{iase}$ are aggregated over ethnic groups (at each MCMC iteration) to form area-age-sex totals $P_{ias}$ that are in turn divided by area-age-sex specific populations $N_{ias}$ to give area-age-sex prevalence rates $R_{ias}$. These are then applied to European standard populations $S_a$ (e.g. Hedman et al., 1999) to provide age standardised prevalence rates $\pi_{is}$ for each area by sex (and for ages over 35). To provide a suitable input to the joint prevalence-mortality analysis, the transforms $x_{is} = \mathrm{logit}(\pi_{is})$ are monitored, as these are more likely to be approximately normal than the prevalence rates themselves. Posterior means and variances of the $x_{is}$ are denoted $X_{is}$ and $V_{is}$ respectively.
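The chain of calculations in this section, for a single area and sex at a single MCMC iteration, can be sketched as follows. All inputs are toy values (two age bands, two ethnic groups, hypothetical standard population weights), and the standardisation step assumes ordinary direct standardisation, $\pi = \sum_a S_a R_a / \sum_a S_a$, which the text implies but does not write out. Function names are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def weighted_rate(rho_by_quintile: Sequence[float], w: Sequence[float]) -> float:
    """Deprivation-weighted prevalence rate: sum over quintiles of w_d * rho_d."""
    return sum(wd * rd for wd, rd in zip(w, rho_by_quintile))


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def standardised_logit_prevalence(rho: List[List[List[float]]],
                                  w: Sequence[float],
                                  pop: List[List[float]],
                                  std_pop: Sequence[float]) -> Tuple[float, float]:
    """rho[a][e][d], w[d], pop[a][e], std_pop[a] for one area and one sex."""
    rates_by_age = []
    for a in range(len(pop)):
        # Prevalent cases summed over ethnic groups, then divided by the age-band population.
        cases = sum(weighted_rate(rho[a][e], w) * pop[a][e] for e in range(len(pop[a])))
        rates_by_age.append(cases / sum(pop[a]))
    # Direct age standardisation (assumed form).
    pi = sum(s * r for s, r in zip(std_pop, rates_by_age)) / sum(std_pop)
    return pi, logit(pi)


# Toy inputs: 2 age bands, 2 ethnic groups, 5 deprivation quintiles.
w = [0.2, 0.2, 0.2, 0.2, 0.2]
rho = [
    [[0.02] * 5, [0.04] * 5],
    [[0.10, 0.10, 0.10, 0.20, 0.20], [0.10] * 5],
]
pop = [[900.0, 100.0], [800.0, 200.0]]
std_pop = [3.0, 1.0]  # hypothetical standard population weights

pi, x = standardised_logit_prevalence(rho, w, pop, std_pop)
print(f"standardised prevalence = {pi:.4f}, logit = {x:.3f}")
```

In the full analysis this calculation would be repeated at every iteration, so that the posterior mean and variance of each logit can be carried forward to the joint model.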

Joint Prevalence-Mortality Model
As argued above, evidence on variation in area prevalence is provided indirectly by area heart disease mortality; it is likely that mortality will closely reflect prevalence, though other factors may be involved. Evidence justifying a joint analysis is obtained from heart disease registers associated with a new payment scheme for English general practitioners (Strong et al., 2006). These data may be subject to under- or over-registration (especially at lower spatial scales) and so do not necessarily provide a "gold standard" prevalence estimate. However, Table 3 shows a clear correlation between prevalence and the selected mortality index at regional level; the Pearson correlation is 0.9.
Let $D_{i1}$ and $D_{i2}$ be area deaths for males and females (ages over 35), and $E_{Di1}$ and $E_{Di2}$ be expected deaths (using England age-specific rates for 2003). Assuming Poisson sampling, one has

$$D_{is} \sim \mathrm{Poisson}(E_{Dis}\, \mu_{is}),$$

where $\mu_{is}$ are relative mortality risks for sex $s$ (1 = M, 2 = F) and area $i$. The goal is to adjust survey based logit prevalence rate estimates $x_{is}$ to reflect spatial patterning of these mortality relative risks. Let $z$ be the underlying true (logits of) prevalence, measured with error. The bivariate model can be seen as following a form

$$P(\mu, z \mid x) = P(\mu \mid z)\, P(z \mid x)\, P(x),$$

namely a marginal prevalence model and a model including an impact of prevalence on mortality. Similar models are mentioned by Xia and Carlin (1998) and Bernardinelli et al. (1997), and both of these studies incorporate spatial correlation in the underlying rate.
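The Poisson assumption for deaths can be illustrated with a minimal sketch for a single area and sex. The death count and expected count are hypothetical, and the function names are illustrative. The sketch shows that the unsmoothed estimate of the relative mortality risk is the ratio of observed to expected deaths, which is the value that the joint model then smooths and links to prevalence.

```python
import math


def poisson_loglik(deaths: int, expected: float, mu: float) -> float:
    """Log-likelihood of D ~ Poisson(E * mu) for one area and sex."""
    lam = expected * mu
    return deaths * math.log(lam) - lam - math.lgamma(deaths + 1)


def crude_relative_risk(deaths: int, expected: float) -> float:
    """Maximum likelihood estimate of mu: observed over expected deaths."""
    return deaths / expected


# Hypothetical area: 120 observed deaths against 100 expected.
deaths, expected = 120, 100.0
mu_hat = crude_relative_risk(deaths, expected)

# The log-likelihood is higher at mu_hat than at nearby values.
ll_hat = poisson_loglik(deaths, expected, mu_hat)
ll_low = poisson_loglik(deaths, expected, 1.0)
ll_high = poisson_loglik(deaths, expected, 1.4)
print(mu_hat, ll_hat, ll_low, ll_high)
```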
In the application here, spatially correlated effects pool information over areas and genders. There are many reasons to expect unmeasured risk factors to be spatially correlated, for example between adjacent urban as against rural areas, or between adjacent areas in northern as against southern regions of England. Thus, urban air pollution is a risk for heart disease (Chen et al., 2005), while regional differences in smoking and physical activity are reported by Morris et al. (2003). Regional differences in drinking-water hardness have also been linked to cardiovascular disease variations (Monarca et al., 2004; Catling et al., 2005). So for sexes $s = 1, 2$ the following joint model is postulated

where, in model (6.2), the spatially correlated errors $(u_{i1}, u_{i2})$ follow a bivariate conditional autoregressive prior (see Appendix 1), with the effects $u_{i1}$ and $u_{i2}$ centred around their respective means at each iteration. From the form of the model it is apparent that $z$ will be affected by $\mu$, as well as vice versa, by virtue of the reverse regression implicit in many measurement error models (Maddala, 2001, Ch 11). The precision matrix $\Phi_u^{-1}$ is assumed to follow a Wishart prior with 2 degrees of freedom and identity scale matrix; the intercepts $\eta_s$ and slopes $\omega_s$ are assigned $N(0, 100)$ priors. The last 15000 iterations of a two-chain run of 25000 iterations show the coefficients $\omega_1$ and $\omega_2$ in model (6.2) as clearly significant, with means (95% intervals) of 0.42 (0.36, 0.48) and 0.23 (0.17, 0.31). The correlation between $(u_{i1}, u_{i2})$ is estimated at 0.71 (0.54, 0.84).
Compared to the mean prevalences $\pi_{is}$ from the survey model, the adjusted prevalences $\zeta_{is} = \exp(z_{is})/(1 + \exp(z_{is}))$ from the joint model show greater inequality (i.e. higher coefficients of variation), possibly as they reflect local variations in mortality relative risks. Figures 1 and 2 contain quintile maps for the posterior mean prevalence rates $\zeta_{i1}$ and $\zeta_{i2}$, with the North-South contrast again visible. The fact that this contrast remains after controlling for neighbourhood deprivation and ethno-demographic structure suggests that health behaviours (e.g. diet) and environmental factors are also relevant to prevalence differences.
As an application with policy relevance, the prevalence rates $\zeta_{is}$ are compared to revascularisation rates for the 354 areas in 2002. There are recognised variations in provision of revascularisation (namely coronary artery bypass grafts and percutaneous transluminal coronary angioplasty) that other studies suggest are not explained by morbidity (Hippisley-Cox and Pringle, 2000; Payne and Saul, 1997). In fact, the correlation between provision and prevalence rates is −0.02 for males and 0.08 for females, indicating possible variations in access to surgical care not matched to need for such care (as reflected in prevalence).

Discussion
While mortality and hospitalisation data are often used as proxies for prevalence (morbidity) and hence health need (Ebrahim et al., 2002), there is value in using available survey evidence to provide direct estimates of prevalence and morbidity. The present study has outlined a methodology to combine spatially aggregated survey evidence with information on spatially disaggregated patterns in heart disease mortality, which reflect geographic variations in prevalence (e.g. see Table 3).
This methodology can be seen as a form of meta-analysis over different forms of evidence that can be applied to other types of morbidity. The pooling of information over surveys (here the 1999 and 2003 Health Surveys for England) can be performed using the power prior method. An alternative analysis to the one adopted in the paper could arguably input more informative priors to the power prior likelihood, further developing the theme of evidence pooling. There is accumulated epidemiological evidence on heart disease risk factors that could justify more informative priors, especially on main effects. For example, south Asian ethnicity is often reported as associated with higher relative heart disease risk in the UK. So one could for instance, following Greenland (2001, p. 665), assume a prior relative risk between 1 and 3 for this group, translating into a $N(0.55, 0.08)$ prior on the log relative risk. On the other hand, there may be relatively little prior evidence on certain interactions (especially in a particular geographical setting such as England with its distinct health care system), and adopting an informative prior approach may also imply the need to run a sensitivity analysis over different informative priors.
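The translation from a relative risk range to a normal prior can be sketched as follows. The sketch assumes that the range of 1 to 3 is treated as a central 95% prior interval for the relative risk and that the second argument of $N(\cdot, \cdot)$ is a variance; the function name is illustrative.

```python
import math
from typing import Tuple

Z_975 = 1.959964  # 97.5th percentile of the standard normal distribution


def normal_prior_from_rr_range(rr_low: float, rr_high: float) -> Tuple[float, float]:
    """Mean and variance of a normal prior on log(RR) whose central 95%
    interval matches (rr_low, rr_high) on the relative risk scale."""
    lo, hi = math.log(rr_low), math.log(rr_high)
    mean = (lo + hi) / 2.0
    sd = (hi - lo) / (2.0 * Z_975)
    return mean, sd ** 2


prior_mean, prior_var = normal_prior_from_rr_range(1.0, 3.0)
print(f"N({prior_mean:.2f}, {prior_var:.2f})")
```

Under these assumptions the calculation gives a mean of about 0.55 and a variance of about 0.08, matching the prior quoted in the text.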
There are other options for modelling that might be considered. One option is to introduce information on prevalence from hospital admission data. These are sometimes suspect as indicators of morbidity because they reflect supply of care, but for events where hospitalisation is usually unavoidable (e.g. myocardial infarction) they may improve the estimation of morbidity. One might also seek to jointly model, and so make indirect area estimates for, more than one type of prevalence (e.g. smoking, diabetes or obesity prevalence) in conjunction with modelling heart disease prevalence. In the UK prevalence of these behaviours or conditions is also monitored by the Health Survey for England and they are known risk factors for heart disease. Similar potentialities exist for using national health survey data of other countries to indirectly estimate area prevalence in conjunction with other relevant and locally disaggregated information (e.g. on mortality, hospital admissions).

Appendix. Univariate and Multivariate CAR priors
To explain the form of the CAR prior, first consider a univariate conditional autoregressive prior. Let $(e_1, \ldots, e_n)$ be a vector of effects associated with areas $1, \ldots, n$, such as relative mortality risks. Then a univariate conditional autoregressive prior or CAR prior (Rue and Held, 2005) involves specifying the $n$ full conditionals

$$e_i \mid e_{[i]} \sim N\Big(\sum_{j \neq i} c_{ij} e_j / c_{i+},\; \phi_e / c_{i+}\Big) \qquad (A.1)$$

where $e_{[i]} = (e_1, e_2, \ldots, e_{i-1}, e_{i+1}, \ldots, e_n)$ is the collection of effects excluding area $i$, $C = [c_{ij}]$ is an $n \times n$ matrix of spatial interactions $c_{ij}$, often known but sometimes involving unknown parameters, the sums $c_{i+} = \sum_j c_{ij}$ total over rows in this matrix, and $\phi_e$ is a conditional variance. The conditional density uniquely determines the joint density of the effects $(e_1, \ldots, e_n)$, a feature noted by Besag (1974) and Jin et al. (2005). The multivariate normal CAR is the multivariate generalisation of the prior in (A.1). If there are $K$ outcomes, then $\phi_e$ is replaced by a $K \times K$ covariance matrix $\Phi_e$. A common practice is to define $c_{ij} = 1$ if areas $i$ and $j$ are adjacent, and $c_{ij} = 0$ otherwise, in which case $c_{i+}$ is the number of areas adjacent to area $i$.

Figure 2: Female prevalence of heart disease
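The univariate CAR full conditional (A.1) in the Appendix can be illustrated with a minimal sketch under binary adjacency weights. The four-area layout, the effect values and the conditional variance below are hypothetical, and the function name is illustrative; the sketch only computes the conditional mean and variance for one area and is not the multivariate prior used in the paper.

```python
from typing import List, Sequence, Tuple


def car_conditional(i: int,
                    effects: Sequence[float],
                    adjacency: List[List[int]],
                    phi: float) -> Tuple[float, float]:
    """Mean and variance of e_i given all other effects under the CAR prior.

    adjacency[i][j] holds the spatial interaction c_ij (1 if areas i and j
    are adjacent, 0 otherwise); phi is the conditional variance parameter.
    """
    others = [j for j in range(len(effects)) if j != i]
    c_plus = sum(adjacency[i][j] for j in others)          # row total c_{i+}
    mean = sum(adjacency[i][j] * effects[j] for j in others) / c_plus
    return mean, phi / c_plus


# Four hypothetical areas arranged in a line: 0 - 1 - 2 - 3.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
effects = [0.2, -0.1, 0.4, 0.0]
phi = 0.5

cond_mean, cond_var = car_conditional(1, effects, adjacency, phi)
print(cond_mean, cond_var)
```

For area 1, whose neighbours are areas 0 and 2, the conditional mean is the average of the two neighbouring effects and the conditional variance is the variance parameter divided by the number of neighbours.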

Table 1: Age standardised CHD prevalence (ages 35+) with 95% confidence intervals by sex and government office region, 2003 Health Survey for England.

Table 3: Associations between prevalence and mortality, regional level