An Estimation Framework for Combining Probability and Non-probability Samples
Pub. online: 8 June 2026
Type: Computing In Data Science
Open Access
Received
8 September 2025
Accepted
29 May 2026
Published
8 June 2026
Abstract
Survey researchers are increasingly adopting hybrid sampling designs to address the limitations of traditional probability sampling, especially when studying rare or hard-to-reach populations. Challenges such as high screening costs, low statistical efficiency, and operational constraints make purely probability-based approaches impractical in many contexts. This article uses public data from the National Health and Nutrition Examination Survey to demonstrate how one can make population estimates from a hybrid sampling strategy that combines data from a stratified, multistage probability sample with data from a non-probability sample within the same primary sampling units as the probability sample. We outline a framework and discuss methods for analyzing data from a hybrid sample such as this, where covariates and survey outcomes are observed in both the probability and non-probability samples. We present a case study to illustrate the framework. We provide the case study R code in the supplementary material.
Supplementary material
The online supplementary material contains annotated R syntax and results to illustrate estimation from non-probability samples and combining estimates from probability and non-probability samples.