Modeling County-Level Rare Disease Prevalence Using Bayesian Hierarchical Sampling Weighted Zero-Inflated Regression

Xie, Hui; Rolka, Deborah B.; Barker, Lawrence E.

doi:10.6339/22-JDS1049

Journal of Data Science

Volume 21, Issue 1 (2023), pp. 145–157

Pub. online: 22 June 2022
Type: Statistical Data Science

Open Access

The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.

Received: 14 December 2021
Accepted: 26 April 2022
Published: 22 June 2022

Abstract

Estimates of county-level disease prevalence have a variety of applications. Such estimation is often done via model-based small-area estimation using survey data. However, for conditions with low prevalence (i.e., rare diseases or newly diagnosed diseases), counties with a high fraction of zero counts in surveys are common. Such zeros are often more frequent than the model in use would predict; these are called ‘excess zeros’. Excess zeros can be structural (there are no cases to find) or sampling (there are cases, but none were selected for sampling). These issues are often addressed by combining multiple years of data. However, this approach can obscure trends in annual estimates and prevent estimates from being timely. Using single-year survey data, we propose a Bayesian weighted Binomial Zero-inflated (BBZ) model to estimate county-level rare disease prevalence. The BBZ model accounts for excess zero counts, incorporates the sampling weights, and uses a power prior. We evaluated BBZ with American Community Survey results and simulated data. BBZ yielded less bias and smaller variance than estimates based on the binomial distribution, a common approach to this problem. Because BBZ uses only a single year of survey data, it produces more timely county-level incidence estimates. These timely estimates help pinpoint areas of particular county-level need and help medical researchers and public health practitioners promptly evaluate rare disease trends and associations with other health conditions.
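To make the distinction between structural and sampling zeros concrete, the following is a minimal sketch of a generic zero-inflated binomial probability mass function. It is not the BBZ model itself: it omits the hierarchy, the sampling weights, and the power prior, and the function names and the values of `n`, `p`, and `pi` are assumptions chosen only for illustration.

```python
from math import comb


def binom_pmf(y, n, p):
    """Ordinary binomial probability of y cases among n sampled people."""
    return comb(n, y) * p**y * (1 - p) ** (n - y)


def zib_pmf(y, n, p, pi):
    """Generic zero-inflated binomial (illustrative, not the paper's BBZ).

    With probability pi the county is a structural zero (no cases to find);
    otherwise the count is binomial, which can still yield a sampling zero.
    """
    base = (1 - pi) * binom_pmf(y, n, p)
    return pi + base if y == 0 else base


# Hypothetical values: 50 respondents, 1% prevalence, 30% structural zeros.
n, p, pi = 50, 0.01, 0.3

p0_binom = binom_pmf(0, n, p)   # zeros a plain binomial model expects
p0_zib = zib_pmf(0, n, p, pi)   # zeros once structural zeros are allowed
total = sum(zib_pmf(y, n, p, pi) for y in range(n + 1))

print(f"P(Y = 0) under binomial:      {p0_binom:.4f}")
print(f"P(Y = 0) under zero-inflated: {p0_zib:.4f}")
print(f"Zero-inflated pmf sums to:    {total:.6f}")
```

Even with a small sample and a rare condition, the plain binomial already expects many zeros; the mixing probability `pi` adds the extra mass at zero that a binomial model alone cannot represent.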

Supplementary material

Figure 4: Agreement between BRFSS model-based estimates and ACS 1-year reports of county-level DDRS based on 225 selected counties in 2015. The reference line marks where model-based estimates and the standard reference (e.g., the ACS 1-year report) would be identical. Among the four models (BHBI, BZBI, BPLW and BBZ), estimates from BHBI and BZBI show both large variance and large bias; most counties have a positive estimated bias. Estimates from BBZ tend to stay closer to the reference line, with the least bias and variance. These results match those for 2019.

Figure 5: Agreement between BRFSS model-based estimates and ACS 1-year reports of county-level DDRS based on 225 selected counties in 2016. The reference line marks where model-based estimates and the standard reference (e.g., the ACS 1-year report) would be identical. Among the four models (BHBI, BZBI, BPLW and BBZ), estimates from BHBI and BZBI show both large variance and large bias; most counties have a positive estimated bias. Estimates from BBZ tend to stay closer to the reference line, with the least bias and variance. These results match those for 2019.

© 2023 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.

Open access article under the CC BY license.

Keywords

excess zeros; incidence; PLOW; power prior; small area estimate
