Journal of Data Science

An Estimation Framework for Combining Probability and Non-probability Samples
Mahmoud Elkasabi, Taylor Lewis, Matthew Williams

https://doi.org/10.6339/26-JDS1234
Pub. online: 8 June 2026      Type: Computing In Data Science      Open Access

Received: 8 September 2025
Accepted: 29 May 2026
Published: 8 June 2026

Abstract

Survey researchers are increasingly adopting hybrid sampling designs to address the limitations of traditional probability sampling, especially when studying rare or hard-to-reach populations. High screening costs, low statistical efficiency, and operational constraints make purely probability-based approaches impractical in many contexts. This article uses public data from the National Health and Nutrition Examination Survey to demonstrate how to produce population estimates from a hybrid sampling strategy that combines data from a stratified, multistage probability sample with data from a non-probability sample drawn within the same primary sampling units as the probability sample. We outline a framework and discuss methods for analyzing data from such a hybrid sample, where covariates and survey outcomes are observed in both the probability and non-probability samples. We present a case study to illustrate the framework and provide its R code in the supplementary material.

Supplementary material

The online supplementary material contains annotated R syntax and results to illustrate estimation from non-probability samples and combining estimates from probability and non-probability samples.
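
As a rough illustration of what "combining estimates" can look like, the sketch below blends a probability-sample estimate with a non-probability-sample estimate using inverse-variance weights. This is a generic composite estimator chosen for illustration, not necessarily the approach taken in the article or its R code. The function name and the input numbers are hypothetical, and the sketch assumes the two estimates are independent and that the non-probability estimate has already been adjusted for selection bias.

```python
def combine_estimates(est_prob, var_prob, est_nonprob, var_nonprob):
    """Inverse-variance weighted composite of two independent estimates.

    Hypothetical helper for illustration only. Assumes both inputs are
    (approximately) unbiased; any residual selection bias in the
    non-probability estimate is NOT accounted for here.
    """
    if var_prob <= 0 or var_nonprob <= 0:
        raise ValueError("variances must be positive")

    # Precision (1 / variance) of each estimate.
    prec_prob = 1.0 / var_prob
    prec_nonprob = 1.0 / var_nonprob

    # Weight on the probability-sample estimate.
    weight_prob = prec_prob / (prec_prob + prec_nonprob)

    # Composite point estimate and its variance under independence.
    est = weight_prob * est_prob + (1.0 - weight_prob) * est_nonprob
    var = 1.0 / (prec_prob + prec_nonprob)
    return est, var, weight_prob


if __name__ == "__main__":
    # Made-up inputs: a prevalence estimate from each sample.
    est, var, w = combine_estimates(0.30, 0.0004, 0.34, 0.0001)
    print(f"composite estimate={est:.4f}, variance={var:.6f}, weight_prob={w:.2f}")
```

The more precise estimate receives the larger weight, so a large non-probability sample can dominate the composite. That is only desirable when its bias adjustment is credible.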



Copyright
2026 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.
Open access article under the CC BY license.

Keywords
hard-to-reach populations; non-probability sample; rare populations


Journal of data science

  • Online ISSN: 1683-8602
  • Print ISSN: 1680-743X
