
Journal of Data Science


Matched Mass Imputation for Survey Data Integration
Jeremy Flood, Sayed A. Mostafa

https://doi.org/10.6339/25-JDS1179
Pub. online: 17 April 2025      Type: Statistical Data Science      Open Access

Received: 16 August 2024
Accepted: 20 March 2025
Published: 17 April 2025

Abstract

Analysis of nonprobability survey samples has gained much attention in recent years due to their wide availability and the declining response rates of their costly probability-sample counterparts. Still, valid population inferences cannot be drawn from nonprobability samples without additional information, which typically takes the form of a smaller survey sample with a shared set of covariates. In this paper, we propose the matched mass imputation (MMI) approach as a means of integrating data from probability and nonprobability samples when common covariates are present in both samples but the variable of interest is available only in the nonprobability sample. The proposed approach borrows strength from the ideas of statistical matching and mass imputation to provide robustness against potential nonignorable bias in the nonprobability sample. Specifically, MMI is a two-step approach: first, a novel application of statistical matching identifies a subset of the nonprobability sample that closely resembles the probability sample; second, mass imputation is performed using these matched units. Our empirical results, from simulations and a real data application, demonstrate the effectiveness of the MMI estimator under nearest-neighbor matching, which almost always outperformed other imputation estimators in the presence of nonignorable bias. We also explore the effectiveness of a bootstrap variance estimation procedure for the proposed MMI estimator.
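
The two-step idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (their R code is in the supplementary material). The function names, the Euclidean distance used for matching, the linear imputation model, and the weighted-mean (Hájek-type) estimator are all assumptions made here for illustration only.

```python
import numpy as np


def nearest_neighbor_match(x_prob, x_nonprob):
    """For each probability-sample unit, return the index of the closest
    nonprobability-sample unit (Euclidean distance; an assumed choice)."""
    # Pairwise squared distances, shape (n_prob, n_nonprob).
    diffs = x_prob[:, None, :] - x_nonprob[None, :, :]
    dist2 = np.sum(diffs ** 2, axis=2)
    return np.argmin(dist2, axis=1)


def mmi_mean(x_prob, weights, x_nonprob, y_nonprob):
    """Hypothetical MMI-style estimate of a population mean."""
    # Step 1: keep only the nonprobability units matched to some
    # probability-sample unit.
    donors = np.unique(nearest_neighbor_match(x_prob, x_nonprob))

    # Step 2: mass imputation. Fit an outcome model on the matched units
    # (ordinary least squares here, purely as an assumption) and predict
    # the outcome for every probability-sample unit.
    design = np.column_stack([np.ones(len(donors)), x_nonprob[donors]])
    beta, *_ = np.linalg.lstsq(design, y_nonprob[donors], rcond=None)
    y_hat = np.column_stack([np.ones(len(x_prob)), x_prob]) @ beta

    # Design-weighted mean of the imputed values.
    return np.sum(weights * y_hat) / np.sum(weights)


# Toy data: the outcome is observed only in the nonprobability sample.
x_prob = np.array([[0.0], [1.0], [2.0]])
weights = np.array([1.0, 1.0, 2.0])          # survey design weights
x_nonprob = np.array([[0.1], [0.9], [2.2], [10.0]])
y_nonprob = 2.0 + 3.0 * x_nonprob[:, 0]

estimate = mmi_mean(x_prob, weights, x_nonprob, y_nonprob)
print(round(float(estimate), 2))  # 5.75
```

In this toy example the unit at x = 10 is never matched, so it does not influence the imputation model. That is the intuition behind restricting imputation to units that resemble the probability sample.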

Supplementary material

The supplementary material includes the following: (1) additional simulation results showing the RMSER and ABR results in tabular format to complement the visualizations in Figures 1 and 2, (2) R code, and (3) README: a brief explanation of how to run the code.



Copyright
2025 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.
Open access article under the CC BY license.

Keywords
data integration; mass imputation; nonignorable missingness; nonprobability samples; statistical matching

Funding
The work of Jeremy Flood was funded by the North Carolina A&T State University Chancellor’s Distinguished Fellowship, a Title III HBGI grant from the U.S. Department of Education.


Journal of data science

  • Online ISSN: 1683-8602
  • Print ISSN: 1680-743X
