Journal of Data Science


Estimating Disease Prevalence from Preferentially Sampled, Pooled Data
Clinton P. Pollock, Andrew Hoegh, Kathryn M. Irvine, et al. (5 authors)

https://doi.org/10.6339/25-JDS1191
Pub. online: 11 June 2025 | Type: Statistical Data Science | Open Access

Received: 31 October 2024
Accepted: 30 May 2025
Published: 11 June 2025

Abstract

After the onset of the COVID-19 pandemic, scientific interest in coronaviruses endemic in animal populations increased dramatically. However, investigating the prevalence of disease in animal populations across the landscape, which requires finding and capturing animals, can be difficult. Spatial random sampling over a grid can be extremely inefficient because animals are hard to locate, so the total number of samples may be small. Alternatively, preferential sampling, which uses existing knowledge to inform sample locations, can guarantee larger numbers of samples, but estimates derived from this sampling scheme may be biased if disease prevalence is related to the probability that a location is sampled. Sample specimens are also commonly grouped and tested in pools, which poses an added challenge when combined with preferential sampling. Here we present a Bayesian method for estimating disease prevalence from preferentially sampled, pooled presence-absence data, motivated by estimating factors related to coronavirus infection among Mexican free-tailed bats (Tadarida brasiliensis) in California. We demonstrate the efficacy of our approach in a simulation study in which a naive model that does not account for preferential sampling returns biased parameter estimates, whereas our model returns unbiased results regardless of the degree of preferential sampling. We then apply our model framework to data from California to estimate factors related to coronavirus prevalence. After accounting for the effects of preferential sampling, our model suggests small differences in prevalence between male and female bats.
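As a minimal illustration of why pooled presence-absence data need special handling, the sketch below uses the standard group-testing relationship between individual-level prevalence and the probability that a pool tests positive. It is not the Bayesian model described in the abstract: it assumes a perfect assay, independent specimens, equal pool sizes, and no preferential sampling, and the function names are hypothetical.

```python
def pool_positive_probability(prevalence: float, pool_size: int) -> float:
    """Probability that a pool tests positive.

    Assumes (for this sketch only) a perfect assay and independent
    specimens, so a pool is negative only if every specimen is negative.
    """
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must be in [0, 1]")
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    return 1.0 - (1.0 - prevalence) ** pool_size


def pooled_prevalence_mle(positive_pools: int, total_pools: int, pool_size: int) -> float:
    """Maximum likelihood estimate of individual-level prevalence
    from equal-sized pools, obtained by inverting the relationship above."""
    if total_pools <= 0 or pool_size <= 0:
        raise ValueError("total_pools and pool_size must be positive")
    if not 0 <= positive_pools <= total_pools:
        raise ValueError("positive_pools must be between 0 and total_pools")
    positive_fraction = positive_pools / total_pools
    return 1.0 - (1.0 - positive_fraction) ** (1.0 / pool_size)


if __name__ == "__main__":
    # With 5% prevalence and pools of 5, far more than 5% of pools are positive.
    print(pool_positive_probability(0.05, 5))
    # Treating the pool-positive fraction as prevalence would overstate it;
    # the inverted formula recovers the individual-level value.
    print(pooled_prevalence_mle(19, 100, 2))
```

The point of the sketch is only that the fraction of positive pools is not itself a prevalence estimate; any model for pooled data, including one that also corrects for preferential sampling, has to account for pool size in the likelihood.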



Copyright
2025 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.
Open access article under the CC BY license.

Keywords
Bayesian modeling; pooled testing; spatial sampling

Journal of data science

  • Online ISSN: 1683-8602
  • Print ISSN: 1680-743X

Contact us

  • JDS@ruc.edu.cn
  • No. 59 Zhongguancun Street, Haidian District Beijing, 100872, P.R. China