Estimating Small Area Diabetes Prevalence in the US Using the Behavioral Risk Factor Surveillance System

Information regarding small area prevalence of chronic disease is important for public health strategy and resourcing equity. This paper develops a prevalence model taking account of survey and census data to derive small area prevalence estimates for diabetes. The application involves 32000 small area subdivisions (ZIP Code Tabulation Areas) of the US, with the prevalence estimates taking account of information from the US-wide Behavioral Risk Factor Surveillance System (BRFSS) survey on population prevalence differentials by age, gender, ethnic group and education. The effects of such aspects of population composition on prevalence are widely recognized. However, the model also incorporates spatial or contextual influences via spatially structured effects for each US state; such contextual effects are allowed to differ between ethnic groups and other demographic categories using a multivariate spatial prior. A Bayesian estimation approach is used, and the analysis demonstrates the considerably improved fit of a fully specified compositional-contextual model as compared to simpler 'standard' approaches, which are typically limited to age and area effects.


Introduction
Information regarding area prevalence of diabetes is important for ensuring that resources for diabetes care match need and for effective targeting of diabetes-prevention services. In the US there is evidence of a growth in diabetes levels over time (Mokdad et al, 2001), of wide geographic contrasts in prevalence, and of considerable differences in relative risk between the main ethnic groups (Davidson, 2001; Harris, 1998). Thus in 1999-2000, the age-adjusted US-wide prevalence of previously diagnosed diabetes among adults was estimated as 11.7% among blacks, 9.6% among Hispanics, and 4.8% among non-Hispanic whites (CDC, 2003). This paper develops a binary regression model taking account of 2005 survey data and 2000 US census data to derive small area prevalence estimates for previously diagnosed diabetes in 32000 small area subdivisions of the US.
These estimates take account of information from the US-wide Behavioral Risk Factor Surveillance System (BRFSS) surveys on prevalence differentials by age, gender, ethnic group and education (e.g. Mukhtar et al, 2003). These are random-digit-dialed telephone surveys that determine the prevalence among adults (ages 18 and over) of major illnesses and health behaviors which are related to the leading causes of death in the US. To determine diabetes status, respondents were asked "Have you ever been told by a doctor that you have diabetes?", encompassing both types of diabetes.
The estimates described in this paper are based on around 360,000 survey responses to the 2005 BRFSS, and on a binary regression model expressing the impact on diabetes of major individual level risk factors measured by the survey. However, since the ultimate goal of the analysis is small area prevalence estimation, inclusion of risk factors (and interactions between them) in the model is subject to the constraint that included risks are available also as tabulations for small area populations. The regression model adjusts for US state level relationships between diabetes and the levels of rurality and poverty, and for unmeasured state level influences. The latter are modelled using a multivariate random effects approach that allows state level contextual effects to be differentiated by ethnic group. The areas for which prevalence is estimated are 32000 ZIP Code Tabulation Areas (ZCTAs) for which selected Census 2000 statistics have been provided by the US Census Bureau (cf. Grubesic & Matisziw, 2006).

Individual Level Risk Factors: Compatibility between Survey and Small Area Variable Frames
The survey regression model for diabetes prevalence includes major individual level risk factors (age, gender, ethnicity, education level) that are known to be significant sources of varying diabetes prevalence. A pronounced gradient in diabetes prevalence by age is reported by CDC (2003) and Mokdad et al (2001), while Maty et al (2005) report that socioeconomic disadvantage, especially low educational attainment, is a significant predictor of incident Type 2 diabetes. Prevalence variations by education level are also reported by CDC (2004). However, since the ultimate goal of the analysis is small area prevalence estimation, inclusion of risk factors (and interactions between them) in the survey model is subject to the constraint that included risks are available both in the BRFSS and as tabulations for ZCTA populations; any assumed interaction between risk factors requires a matching cross-tabulation in the ZCTA population.
Demographic risk categories, namely age group, gender and ethnic group (white non-hispanic, black, hispanic, other), are available both as BRFSS variables and in a ZCTA level tabulation which cross-tabulates adult populations by ethnicity, quinquennial age and gender. For comparably defined demographic risk groups (e.g. age-ethnic-gender subgroups), parameters from the survey model (e.g. relative risk for hispanic males aged 45-49) can then be transferred to the ZCTA sub-population. As mentioned in the description of the model below, age gradients may vary both by gender and between ethnic groups, and it is important to model such variation while also taking account of correlation between the shapes of age profiles for different groups.
For other individual level risk variables (e.g. education), either primary ZCTA tabulations are available from the 2000 census, or a limited cross tabulation (e.g. male adults by education, and female adults by education), but not tabulations involving cross-hatching against all other risk factors. For example, there is not a ZCTA level census table that cross-tabulates the adult population simultaneously by education, quinquennial age, ethnicity and gender. A small area prevalence adjustment can then be applied only for the main effect of such variables, or for a partial interaction. For example, the survey regression models show gender-specific education gradients in relative risk of diabetes prevalence, and these gradients can be applied to ZCTA male and female adult populations subdivided by education level.

Geographic Influences
As is now well known, individual risk factors and contextual factors (including the impact of geographic location) interact in their impact on many chronic diseases. Although prevalence is to be estimated at ZCTA level, the ZCTA of residence is not available for BRFSS respondents for confidentiality reasons, so it is not possible to take account of the impact of (say) poverty rates for ZCTAs on small area diabetes prevalence.
However, one may model the impact of broad scale geographic influences on diabetes prevalence operating at the level of US states, since state of residence ($s = 1, \ldots, 53$, including the District of Columbia, Virgin Islands and Puerto Rico) is available for all respondents. Some directly measured state level predictors may have a significant influence on diabetes prevalence; those used here are the percent of population in poverty and the percentage of rural population. Rural location in the US is in fact a positive risk factor for diabetes prevalence and an adverse influence on access to diabetes care (Mainous et al, 2004; AHRQ, 2005).
Many geographic influences are likely to be unobserved and these are proxied in the regression model by state level random effects. These influences may reflect environmental factors such as climate (Franz & Bailey, 2004), or the aggregate effect of variables representing health behaviours. Such effects are taken to be a sum of two effects, one of which is spatially correlated to reflect smoothly varying risk factors in space that straddle arbitrary state boundaries (Richardson & Monfort, 2000), while the other is unstructured in the sense of not incorporating spatial structure. In the disease mapping literature this approach, due to Besag et al (1991), is known as a convolution model. Both types of random state effects are differentiated by ethnic group (i.e. are multivariate), since contrasts in diabetes prevalence between ethnic groups are likely to differ by state. Thus CDC (2004) report that "Hispanics continued to have a higher prevalence of diabetes than non-Hispanic whites and that disparities in diabetes between these two populations varied by area of residence". For spatial isolates such as Alaska and Puerto Rico, the impact on prevalence of state of residence is confined to the unstructured random effect.
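The convolution idea can be illustrated with a minimal univariate sketch. It assumes the standard intrinsic conditional autoregressive form, in which the conditional mean of a spatially structured effect is the average over adjacent areas; the paper's priors are multivariate across ethnic groups, which this sketch does not attempt. The area labels, adjacency list and effect values are made up for illustration.

```python
# Univariate illustration only: total area effect = structured + unstructured part.
# Area labels, adjacencies and values below are hypothetical.

def icar_conditional_mean(effects, neighbours, area):
    """Conditional mean of a structured effect under an intrinsic CAR prior:
    the average of the effects in adjacent areas. Returns None for an
    isolate (an area with no neighbours), which has no structured effect."""
    adjacent = neighbours[area]
    if not adjacent:
        return None
    return sum(effects[a] for a in adjacent) / len(adjacent)


def total_effect(structured, unstructured, area):
    """Convolution effect: structured part (zero for isolates) plus unstructured part."""
    return structured.get(area, 0.0) + unstructured[area]


neighbours = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"], "D": []}
c = {"A": 0.10, "B": 0.30, "C": -0.10}            # structured effects (no entry for isolate D)
u = {"A": 0.02, "B": -0.01, "C": 0.00, "D": 0.05}  # unstructured effects

for area in neighbours:
    print(area, icar_conditional_mean(c, neighbours, area), total_effect(c, u, area))
```

For the isolate `D` the sketch returns no spatial mean and the total effect reduces to the unstructured term, mirroring the treatment of spatial isolates described above.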

Survey Model Specification
The analysis is based on the 2005 BRFSS survey, with 136 thousand male and 217 thousand female respondents. As well as including relevant risk variables, the model should incorporate survey weights $w_i$ for respondents $i$ to account for differential response between demographic categories, including a lower response rate for males as against females, and for hispanics and blacks as against whites. Because of the large number of respondents, separate binary regressions are carried out for males and females. Cases with diabetes status not reported or refused are excluded; missing status applies to under 0.1% of subjects (CDC, 2008). Separate analysis by gender is also supported by evidence from other studies of gender effect modification over a wide range of risk factors (Cabrera et al, 2003).
Let $y_i = 1$ if a subject reports doctor diagnosed diabetes, with $y_i = 0$ otherwise ($i = 1, \ldots, N$), and define $\pi_i = \Pr(y_i = 1)$ as the probability that a respondent reports diagnosed diabetes. The analysis here then follows studies such as Graubard et al (1997) in using a weighted likelihood, namely, on the log scale,
$$L = \sum_{i=1}^{N} w_i \left[ y_i \log(\pi_i) + (1 - y_i) \log(1 - \pi_i) \right].$$
To facilitate straightforward application of survey model parameters across to ZCTA populations, a relative risk interpretation was sought for parameters, which is achieved using a log link (Robbins et al, 2002).
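As a minimal sketch of the weighted likelihood idea under a log link, the following assumes the standard weighted Bernoulli form, in which each respondent's log-likelihood contribution is multiplied by the survey weight. The function names and the toy inputs are illustrative and are not taken from the paper.

```python
import math


def weighted_log_likelihood(y, w, eta):
    """Weighted Bernoulli log-likelihood with a log link, pi_i = exp(eta_i).

    y   : 0/1 diabetes indicators
    w   : survey weights
    eta : linear predictors on the log scale (must be negative so pi_i < 1)
    """
    total = 0.0
    for y_i, w_i, eta_i in zip(y, w, eta):
        pi_i = math.exp(eta_i)
        if not 0.0 < pi_i < 1.0:
            raise ValueError("a log link needs eta < 0 so that pi is a probability")
        total += w_i * (y_i * math.log(pi_i) + (1 - y_i) * math.log(1.0 - pi_i))
    return total


def relative_risk(coefficient):
    """Under a log link, exponentiating a coefficient gives a relative risk."""
    return math.exp(coefficient)


# Toy data: two respondents with probabilities 0.1 and 0.2.
print(weighted_log_likelihood([1, 0], [2.0, 1.0], [math.log(0.1), math.log(0.2)]))
print(relative_risk(-0.5))
```

The explicit check on `pi_i` reflects a known cost of the log link: unlike the logit, it does not by itself keep fitted probabilities below 1.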
The regression model for each gender then involves the following features, with associated parameters in brackets:
a) an overall intercept ($\alpha$);
b) differential risks for black, hispanic and other ethnic groups as against whites as reference (unknowns $\beta_g$, $g = 2, 3, 4$, with $\beta_1 = 0$ as reference);
c) differential education risks, according to education level $e$, namely 1=never attended, elementary only, or some high school; 2=high school graduate; 3=some college or technical school; 4=college graduate (unknowns $\eta_e$, $e = 2, \ldots, 4$, with $\eta_1 = 0$ as reference);
d) effects of state level predictors, namely poverty rate $Pov_s$ and percent rural $Rur_s$, where $s = 1, \ldots, 53$ denotes the BRFSS respondent's state of residence ($\delta_1$, $\delta_2$). These predictors are centred so that their average over all states is zero;
e) age effects differentiated by ethnic group ($\gamma_{xg}$);
f) spatially structured state effects differentiated by ethnic group ($c_{sg}$);
g) unstructured state effects differentiated by ethnic group ($u_{sg}$).
Effects under (e) and (f) are modelled via multivariate normal conditional autoregressive priors (of dimension $G = 4$), respectively a multivariate first order random walk and a multivariate spatial scheme (Fahrmeir & Lang, 2001). A constraint is applied during estimation to ensure that these effects sum to zero within ethnic groups, so that, writing $\gamma_{xg}$ for the age effects under (e) and $c_{sg}$ for the spatial effects under (f),
$$\sum_x \gamma_{xg} = \sum_s c_{sg} = 0, \qquad g = 1, \ldots, G.$$
The area effects $u_{sg}$ under heading (g) are multivariate normal (with means of zero over all states) of dimension $G$, allowing for correlated effects across ethnic groups, but without any form of autocorrelation over areas. The differentiation of area effects by ethnicity reflects evidence such as that from CDC (2004) that disparities in diabetes between ethnic sub-groups in populations vary by area of residence.
Let $S_i$ denote the state of residence for respondent $i$. Also let $\{x_i, g_i, e_i\}$ denote the age, ethnicity and education level of respondent $i$. Then one may write the survey prevalence model as
$$\log(\pi_i) = \alpha + \beta_{g_i} + \eta_{e_i} + \gamma_{x_i g_i} + \delta_1 Pov_{S_i} + \delta_2 Rur_{S_i} + c_{S_i g_i} + u_{S_i g_i}, \tag{4.1}$$
where the $c_{sg}$ terms are not included for Alaska, Hawaii, Puerto Rico and the Virgin Islands. This model is run separately for males and females. For simplicity of presentation, gender $r = 1, 2$ (1=males, 2=females) is omitted from (4.1), but the complete parameterisation has the form
$$\log(\pi_{ir}) = \alpha_r + \beta_{r g_i} + \eta_{r e_i} + \gamma_{r x_i g_i} + \delta_{1r} Pov_{S_i} + \delta_{2r} Rur_{S_i} + c_{r S_i g_i} + u_{r S_i g_i}, \tag{4.2}$$
for $i = 1, \ldots, N_r$, where $N_1 = 135038$ and $N_2 = 217280$. The parameters in (4.1) operate on the log relative risk scale. In particular, smoothed state level relative risks by ethnic group $\rho_{sg}$ may be obtained by exponentiating the total area effect, namely $\rho_{sg} = \exp(c_{sg} + u_{sg})$.
Excess risk can be defined in different ways, but one definition is that the 95% credible interval for $\rho_{sg}$ is confined to values above 1. The smoothing of state risks under this model follows the general principle of other hierarchical shrinkage methods that the smoothed estimate for each area "borrows strength" (precision) from data in other areas, with shrinkage greater for areas with low event counts. Except in the spatially isolated states, two forms of smoothing are invoked: local smoothing towards the average of neighbouring states, and global smoothing of all prevalence risks toward the same US-wide mean (Clayton & Kaldor, 1987). The smoothing is multivariate and so also incorporates a within-state correlation between prevalence rates of different ethnic groups. Such smoothed prevalence estimates are more precise and more robust against false-positive inferences (e.g. regarding excess risk) than are unpooled prevalence rate estimators.
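The excess risk rule can be sketched from MCMC output as follows. This is an illustration under stated assumptions: the draws of the structured and unstructured effects for one state and ethnic group are supplied as paired lists, the 95% interval is taken as the equal-tailed 2.5% to 97.5% range with linear interpolation, and all names and values are hypothetical.

```python
import math


def percentile(sorted_values, q):
    """Percentile by linear interpolation, for q between 0 and 1."""
    position = q * (len(sorted_values) - 1)
    lower = int(math.floor(position))
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def summarise_relative_risk(c_draws, u_draws):
    """Posterior summary of rho = exp(c + u) from paired MCMC draws."""
    rho = sorted(math.exp(c + u) for c, u in zip(c_draws, u_draws))
    lower = percentile(rho, 0.025)
    upper = percentile(rho, 0.975)
    return {
        "mean": sum(rho) / len(rho),
        "lower": lower,
        "upper": upper,
        "excess": lower > 1.0,   # whole interval above 1
        "deficit": upper < 1.0,  # whole interval below 1
    }


# Made-up draws for one state and ethnic group.
summary = summarise_relative_risk([0.20, 0.25, 0.30, 0.35], [0.00, 0.05, -0.05, 0.00])
print(summary)
```

Exponentiating each draw before summarising, rather than exponentiating the posterior mean of the effects, keeps the interval and the mean on the relative risk scale.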
Some concerns have been raised that Bayesian risk estimates may tend to oversmooth variations in disease or mortality risks, particularly when data are sparse or there are discontinuities in the spatial pattern of risk (Green and Richardson, 2002). For such reasons, the area units for the survey model have been chosen as US states rather than US counties (of which there are circa three thousand across the US) to avoid data sparseness. As for distortion due to discontinuities (states with prevalence unlike that of their neighbours), these are reduced by including unstructured effects $u_{sg}$ as well as spatially structured effects $c_{sg}$. Assuming local smoothing via spatially configured $c_{sg}$ as the sole relevant principle guiding smoothed prevalence estimation is inappropriate when there are discontinuities. It is possible that more elaborate "adaptive" priors (e.g. Congdon, 2007) could be applied to account for any discontinuities. However, it is important to use information on geographic adjacencies, since some spatial pooling of strength is likely to be relevant. Accumulated evidence indicates a clear spatial patterning in US diabetes prevalence, and in mortality from diabetes and related conditions, with elevated diabetes prevalence in the south eastern US, and lower prevalence in the mountain and northern states (see for example Ahluwalia et al, 2003, Table 20). Such evidence supports the inclusion of a mechanism for spatial pooling of strength in the survey model.

Model with Age and State Effects Only
To provide a benchmark against more conventional prevalence rate estimation approaches, and to assess the gain in fit (if any) from using the detailed model in (4.1), we also consider a simple approach (though still a model) with age and state effects only. This provides estimates of relative diabetes risk for different states that adjust only for differences in population age structure between states.
Under this simplified model, the model for respondents $i$ (again within each gender) is
$$\log(\pi_i) = \alpha + \gamma_{x_i} + u_{S_i}, \tag{5.1}$$
where the age parameters $\gamma_x$ are fixed effects with $\gamma_1 = 0$ for identification, and the log relative risks $\{u_s, s = 1, \ldots, 53\}$ for states are unstructured normal random effects with zero mean. Age adjusted state prevalence rates for each gender are obtained from this model as $\rho_s = \exp(\alpha + u_s)$.
Thus a model based approach to estimating geographic relativities is retained under this simpler option, but this model is similar to conventional estimation techniques for obtaining age-adjusted prevalence rates for states.Note that in the conventional demographic approach, state rates are in effect treated as 'fixed effects' parameters, though the implicit statistical assumptions are typically not stated.

Small Area Prevalence Estimates
Translating the survey model parameters into small area estimates requires disaggregated populations that match the risk categorisations used in that model. Thus let $S_j$ denote the state in which ZCTA $j$ is located, with $j = 1, \ldots, m_s$ within state $s$ and with $\sum_s m_s = 31986$, the total number of ZCTAs across the US. From the estimates of the full survey prevalence model parameters, one may extract ZCTA level estimated prevalence probabilities (here called rates for simplicity) specific for age group, ethnicity and gender $r$ as
$$p_{jrxg} = \exp(\alpha_r + \beta_{rg} + \gamma_{rxg} + \delta_{1r} Pov_{S_j} + \delta_{2r} Rur_{S_j} + c_{r S_j g} + u_{r S_j g}), \tag{6.1}$$
and these may be applied to gender-specific populations $P_{jrxg}$ for ZCTA areas to obtain estimated prevalence totals. Summary ethnic specific rates may be obtained by weighting the age bands according to the 2000 US Standard Population (National Cancer Institute, 2008). Thus with weights $\{w_x, x = 1, \ldots, X\}$ for the $X = 12$ adult age bands in the diabetes prevalence model, and subject to $\sum_x w_x = 1$, overall diabetes prevalence rates for the four ethnic groups in ZCTA $j$ are
$$p_{jrg} = \sum_{x=1}^{X} w_x \, p_{jrxg}. \tag{6.2}$$
One may adjust the estimated rates (6.1) and (6.2) to take account of the impact on diabetes prevalence of the education attainment mix in each ZCTA. The education mix in a small area is one measure of the impact of socioeconomic structure on health outcomes (cf. Catelan et al, 2008). Thus, let $h_e = \exp(\eta_e)$ be the survey model estimate of the relative diabetes risk at education level $e$ after controlling for age, ethnicity and gender. Then a measure of relative risk associated with the educational mix in the $j$th ZCTA is
$$H_j = \sum_e h_e \, \phi_{je},$$
where $\phi_{je}$ are the relative proportions at education level $e$ in each gender's adult population in ZCTA $j$, with $\sum_e \phi_{je} = 1$.
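The small area calculations described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: age-specific rates for one ZCTA, gender and ethnic group are taken as given, the education mix measure is assumed to be a share-weighted average of the education relative risks, and the three age bands, populations and coefficient values are made up (the paper uses twelve age bands).

```python
import math


def age_standardised_rate(age_rates, std_weights):
    """Directly standardised rate: sum over age bands of weight times age-specific rate."""
    if len(age_rates) != len(std_weights):
        raise ValueError("one standard weight is needed per age band")
    if abs(sum(std_weights) - 1.0) > 1e-9:
        raise ValueError("standard population weights must sum to 1")
    return sum(w * p for w, p in zip(std_weights, age_rates))


def expected_cases(age_rates, populations):
    """Estimated prevalence total: age-specific rates applied to age-specific populations."""
    return sum(p * n for p, n in zip(age_rates, populations))


def education_mix_risk(eta, shares):
    """Assumed form of the education mix measure: sum of exp(eta_e) times population share."""
    if abs(sum(shares) - 1.0) > 1e-9:
        raise ValueError("education shares must sum to 1")
    return sum(math.exp(b) * q for b, q in zip(eta, shares))


# Hypothetical inputs for one ZCTA, one gender and one ethnic group.
rates = [0.02, 0.08, 0.15]           # age-specific prevalence rates
weights = [0.5, 0.3, 0.2]            # standard population weights
populations = [1000, 600, 400]       # age-specific adult populations
eta = [0.0, -0.2, -0.4, -0.8]        # log relative risks by education level (level 1 = reference)
shares = [0.25, 0.25, 0.25, 0.25]    # education mix in the ZCTA

print(age_standardised_rate(rates, weights))
print(expected_cases(rates, populations))
print(education_mix_risk(eta, shares))
```

Because the reference education level has a relative risk of 1, an area whose adults are all in the reference category has an education mix measure of exactly 1 under this assumed form.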

Model Results
Fitting of the models (4.1-4.2) and (5.1), and assessment of their goodness of fit, follow a Bayesian approach, under which existing evidence on parameters is expressed via prior densities on such parameters, with posterior evidence provided by combining the prior evidence with the observed data. A Bayesian strategy is advantageous for estimating models with several sets of random effects, including random effects which are spatially clustered. Goodness of fit (see Appendix 1 for details) is assessed by the DIC (Spiegelhalter et al, 2002) and an approximate marginal likelihood (Ibrahim et al, 2001), while the ability of the model to reproduce the data is assessed via a posterior predictive check involving the deviance $D = -2L$ (e.g. Lynch & Western, 2004). A model will be preferred if it both (a) successfully reproduces the data and (b) has best fit among those models compatible with the data. Estimation uses iterative Markov chain Monte Carlo (MCMC) sampling methods (Gelfand and Smith, 1990), as provided in the WinBUGS program (Lunn et al, 2000). Prior specifications are considered in Appendix 2. Posterior summaries of parameters are based on the second half of runs of 5000 iterations, using two chains starting from dispersed starting values. Convergence was achieved in all models using Brooks-Gelman-Rubin criteria (Brooks & Gelman, 1998).
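The three fit measures named above can be computed from stored MCMC output along the following lines. This is a hedged sketch rather than the WinBUGS implementation: the conditional predictive ordinate is estimated here by the harmonic mean of per-case likelihoods over draws, which is one common Monte Carlo estimator and an assumption about the method used, and the predictive check counts iterations in which the replicate deviance exceeds the observed deviance. All function names and numbers are illustrative.

```python
import math


def dic(deviance_draws, deviance_at_posterior_mean):
    """Return (DIC, complexity), where complexity is the mean deviance
    minus the deviance evaluated at the posterior mean of the parameters."""
    mean_deviance = sum(deviance_draws) / len(deviance_draws)
    complexity = mean_deviance - deviance_at_posterior_mean
    return mean_deviance + complexity, complexity


def log_pseudo_marginal_likelihood(likelihood_draws_by_case):
    """Sum of log CPOs; each CPO is the harmonic mean over draws of p(y_i | theta_t)."""
    total = 0.0
    for draws in likelihood_draws_by_case:
        cpo = len(draws) / sum(1.0 / value for value in draws)
        total += math.log(cpo)
    return total


def posterior_predictive_pvalue(replicate_deviances, observed_deviances):
    """Share of iterations in which the replicate-data deviance exceeds the observed-data deviance."""
    exceed = sum(1 for new, obs in zip(replicate_deviances, observed_deviances) if new > obs)
    return exceed / len(replicate_deviances)


# Made-up MCMC output.
print(dic([100.0, 102.0, 104.0], 99.0))
print(log_pseudo_marginal_likelihood([[0.5, 0.5], [0.25, 0.25]]))
print(posterior_predictive_pvalue([10.0, 12.0, 9.0, 11.0], [11.0, 11.0, 10.0, 10.0]))
```

A lower DIC and a higher log pseudo marginal likelihood both point to better fit, while a predictive p-value close to 0 or 1 suggests the model does not reproduce the data.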
Table 1 shows gender-specific estimates of the fixed effect parameters $\{\alpha, \beta_g, \eta_e, \delta_k\}$ from the full survey model (4.1). It can be seen that there is a steeper educational gradient for females, for whom the relative risk for college graduates of $\exp(\eta_4) = 0.42$ is under a half that of the first education category, those with limited education (elementary education only or did not graduate from high school). There are also clearly significant ethnic effects for both genders, with elevated relative risk for black and hispanic persons.
Age profiles for the four ethnic groups in a typical state (one with average poverty and rurality levels), with the rates also specific for education level $e$, are obtainable as
$$p_{xge} = \exp(\alpha + \beta_g + \eta_e + \gamma_{xg}),$$
where $\gamma_{xg}$ is the age effect for age band $x$ and ethnic group $g$. For example, the left and right panels of Figure 1 show estimated age prevalence profiles differentiated by ethnic group, with the rates specific for high school graduates, obtained via
$$p_{xg2} = \exp(\alpha + \beta_g + \eta_2 + \gamma_{xg}),$$
where $\eta_2$ is the parameter for high school graduates. Peak rates for nonwhite groups occur for slightly younger age bands than the oldest age band in the model, namely the over 75s. This may reflect cohort effects (Gilliland et al, 1997), linked to the sharp rise in diabetes prevalence since the 1950s.
The overall age adjusted prevalence for ethnic groups $g$ at education level $e$ is obtainable (for a state with average poverty and rurality) as
$$p_{ge} = \sum_x w_x \exp(\alpha + \beta_g + \eta_e + \gamma_{xg}),$$
where $w_x$ are the standard population weights for the age bands. Table 2 contains posterior summaries (expressed as percents) for the $p_{ge}$ over the four ethnic groups and four education levels. The widest contrast is among women, exemplified by the rates for white, college-educated women (posterior mean prevalence of 0.037), as opposed to black women with limited education (posterior mean prevalence of 0.176).
State relative risks $\rho_{sg}$ for diabetes among males and females may be obtained by exponentiating the total area effects $c_{sg} + u_{sg}$ by ethnic groups $g$. These amount to residual effects after controlling for the age and educational composition of state populations, and also for state levels of poverty and rurality. Despite this there are consistent patterns, such as multiple elevated area impacts (two or more $\rho_{sg}$ significantly above 1, and none significantly below 1) in Maine and Georgia, and multiple diminished area impacts (two or more $\rho_{sg}$ significantly below 1, and none above 1) in Colorado, Iowa, Louisiana, Nevada, North Carolina, Utah, Wisconsin and Wyoming. Table 3 shows states with the lowest and highest posterior mean $\rho_{sg}$ for groups formed according to sex and ethnicity; it is apparent that low relative risks tend to be concentrated in the mountain states, and high risks in the south and east, and also that risk contrasts are greater for blacks and hispanics than for white non-hispanics.
For estimates at ZCTA level, one important feature is the extent of variation across areas and demographic groups. Thus ranges under model (4.1) in posterior mean ethnic group prevalences $p_{jg}$ (i.e. adjusted for education mix) are lowest for whites. The minima and maxima of the posterior mean $p_{jg}$ are {0.048, 0.121} for white males and {0.027, 0.153} for white females. By contrast, for black males and black females the extrema are {0.056, 0.263} and {0.055, 0.240}.
A summary expression of state level geographic differentials applicable across all ethnic groups is obtainable from the additive age and area effects model (5.1). The simplicity of this model is appealing, and it is sufficient to reproduce the data according to the posterior predictive check based on the deviance (see Table 4). However, there is a clear deterioration in fit compared to model (4.1), both in terms of a lower marginal likelihood and higher DIC. Despite its worse fit, it is of interest to consider the state level relative prevalence risks $\rho_s = \exp(u_s)$ obtained from model (5.1), which are adjusted for age, but not adjusted for population differences in ethnic composition and education levels, or for state poverty or rurality measures; see Table 5 for a summary of highest and lowest state level relative risks according to sex. High relative risks, namely those significantly exceeding 1 (in the sense that the 95% credible interval is confined to values over 1), occur in several southern states (Alabama, Georgia, Louisiana, Mississippi, North and South Carolina) as well as in Puerto Rico and Oklahoma. Low relative risks, those significantly under 1, occur in west central and northern states such as Colorado, Montana, North Dakota, Wisconsin, Alaska, Rhode Island and Massachusetts. A pattern with some similarities (albeit for crude rates, not adjusted for age) is reported by the CDC at http://apps.nccd.cdc.gov/gisbrfss/default.aspx.

Conclusion
Variations in prevalence of chronic diseases between geographic areas will reflect variations in the attributes of area populations, sometimes termed 'compositional' effects due to the demographic and social structure of area populations (Duncan et al, 1998).However, prevalence variations are also likely to show spatial structure, reflecting what are sometimes termed 'contextual' effects (Sacker et al, 2006), or unobserved risk factors that vary smoothly over space (Richardson & Monfort, 2000).Such contextual effects are likely to be differentiated between ethnic groups and other demographic categories.
This paper has presented a binary regression model that takes account of individual level risk factors and the spatial context for a particular chronic disease, diabetes.Contextual effects are represented by spatially structured and unstructured area random effects, as well as by known state level influences such as poverty levels.Area random effects are differentiated by ethnic group, reflecting evidence from other sources that ethnic relativities are not constant spatially.Age effects are also differentiated by ethnic group using a multivariate autoregressive prior.
Elaborations to the model presented in (4.1) are possible, such as state as well as ethnic group differentiation in age gradients, or state differentiation in education gradients. One might also consider spatially varying priors for the impacts of the known state level predictors, such as state poverty rate (Gamerman et al, 2003). Varying impacts of such predictors by ethnic group or age are also possible, if for instance, poverty has a greater influence on middle age prevalence contrasts. However, model variations are constrained to some extent in that the ultimate goal of the analysis is small area prevalence estimation, so that inclusion of risk factor interactions is subject to the constraint that any assumed interaction between risk factors requires a matching cross-tabulation in the small area population.
The greatly improved fit for a model that includes both major individual risk factors, and a full specification for contextual factors whether known or unobserved, has been demonstrated. Results for the full model (4.1) show significant spatial effects (Table 3) even after adjusting for age, education, ethnicity and known state predictors. This may reflect climatic influences (Franz and Bailey, 2004), unmeasured behavioral influences or the effectiveness of health care systems.

Appendix 1: Assessing goodness of fit
Comparisons of model fit are based on the Deviance Information Criterion (DIC) of Spiegelhalter et al (2002), and an approximate marginal likelihood, denoted the pseudo marginal likelihood. The DIC is obtained as the posterior mean deviance (minus twice the log likelihood) plus a measure of complexity $d_e$. The latter is in turn derived as the difference between the mean deviance $\bar{D}$ over MCMC iterations and the deviance $Dev(\bar{\theta})$ at the posterior mean $\bar{\theta}$ of the parameter set $\theta$. Lower values of the DIC indicate better fitting models.
The pseudo marginal likelihood is based on Monte Carlo estimates of the conditional predictive ordinate or CPO, $p(y_i | y_{[i]})$, where $y_{[i]}$ denotes the dataset with the $i$th subject excluded (Dey et al, 1997). The conditional predictive ordinate amounts to a cross validation measure for each case, with the remainder of the data forming the 'training data'. Totalling the logs of the CPOs over all cases provides the logged pseudo marginal likelihood, and models with higher log pseudo marginal likelihoods provide better fits (Ibrahim et al, 2001).
The ability of models to reproduce the data is assessed via a posterior predictive check involving the deviance $D = -2L$ (e.g. Lynch & Western, 2004). Let $y_{new,i}$ be replicates (predictions) sampled from the posterior predictive density $p(y_{new,i} | y)$. Then at each MCMC iteration $t = 1, \ldots, T$ the deviances $\{D_{new}^{(t)}, D_{obs}^{(t)}\}$ of the replicate and observed data are compared via the indicator $C^{(t)} = I(D_{new}^{(t)} > D_{obs}^{(t)})$, where $I(A) = 1$ when $A$ is true and $I(A) = 0$ when $A$ is false. Posterior predictive p-values $\sum_t C^{(t)}/T$ exceeding 0.9 or under 0.1 are generally regarded as casting doubt on the model (Meng, 1994).

Figure 1: Left panel: Female Prevalence Rates by Age and Ethnic Group (High School Graduates); right panel: Male Prevalence Rates by Age and Ethnic Group (High School Graduates)

Table 1: Fixed effect coefficients, full survey model

Table 2: Age adjusted diabetes prevalence (percent) 2005, by ethnic group and education level
* Never attended, elementary only, or some high school.

Table 3: Highest and lowest state relative risks by sex-ethnic category

Table 4: Model fit, full and simple survey models

Table 5: Area effects (relative risks) from simple age-area model, posterior summary