Estimating Disease Prevalence from Preferentially Sampled, Pooled Data
Pub. online: 11 June 2025
Type: Statistical Data Science
Open Access
Received: 31 October 2024
Accepted: 30 May 2025
Published: 11 June 2025
Abstract
Since the onset of the COVID-19 pandemic, scientific interest in coronaviruses endemic in animal populations has increased dramatically. However, investigating the prevalence of disease in animal populations across the landscape, which requires finding and capturing animals, can be difficult. Spatial random sampling over a grid can be extremely inefficient because animals are hard to locate, so the total number of samples may be small. Alternatively, preferential sampling, which uses existing knowledge to inform sample locations, can guarantee larger numbers of samples, but estimates derived from this sampling scheme may be biased if the probability of sampling a location is related to disease prevalence there. Sample specimens are also commonly grouped and tested in pools, which poses an added challenge when combined with preferential sampling. Here we present a Bayesian method for estimating disease prevalence from preferentially sampled, pooled presence-absence data, motivated by estimating factors related to coronavirus infection among Mexican free-tailed bats (Tadarida brasiliensis) in California. We demonstrate the efficacy of our approach in a simulation study in which a naive model that does not account for preferential sampling returns biased estimates of parameter values, whereas our model returns unbiased results regardless of the degree of preferential sampling. We then apply our model framework to data from California to estimate factors related to coronavirus prevalence. After accounting for preferential sampling, our model suggests small prevalence differences between male and female bats.