Matched Mass Imputation for Survey Data Integration
Pub. online: 17 April 2025
Type: Statistical Data Science
Open Access
Received: 16 August 2024
Accepted: 20 March 2025
Published: 17 April 2025
Abstract
Analysis of nonprobability survey samples has gained much attention in recent years due to their wide availability and the declining response rates within their costly probabilistic counterparts. Still, valid population inference cannot be deduced from nonprobability samples without additional information, which typically takes the form of a smaller survey sample with a shared set of covariates. In this paper, we propose the matched mass imputation (MMI) approach as a means for integrating data from probability and nonprobability samples when common covariates are present in both samples but the variable of interest is available only in the nonprobability sample. The proposed approach borrows strength from the ideas of statistical matching and mass imputation to provide robustness against potential nonignorable bias in the nonprobability sample. Specifically, MMI is a two-step approach: first, a novel application of statistical matching identifies a subset of the nonprobability sample that closely resembles the probability sample; second, mass imputation is performed using these matched units. Our empirical results, from simulations and a real data application, demonstrate the effectiveness of the MMI estimator under nearest-neighbor matching, which almost always outperformed other imputation estimators in the presence of nonignorable bias. We also explore the effectiveness of a bootstrap variance estimation procedure for the proposed MMI estimator.
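The two-step idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the distance metric, the outcome model, or the final estimator, so everything below is an assumption made for illustration: Euclidean nearest-neighbor matching on the shared covariates, a linear outcome model fitted to the matched nonprobability units, and a weighted (Hájek-type) mean over the probability sample. The function names `nearest_neighbor_match` and `mmi_mean` and the simulated data are hypothetical and are not taken from the authors' R code.

```python
import numpy as np


def nearest_neighbor_match(x_prob, x_nonprob):
    """For each probability-sample unit, return the index of the closest
    nonprobability-sample unit (Euclidean distance on shared covariates)."""
    # Pairwise squared distances, shape (n_prob, n_nonprob)
    diff = x_prob[:, None, :] - x_nonprob[None, :, :]
    dist2 = np.sum(diff ** 2, axis=2)
    return np.argmin(dist2, axis=1)


def mmi_mean(x_prob, w_prob, x_nonprob, y_nonprob):
    """Illustrative matched-mass-imputation estimate of a population mean.

    Assumed design choices (not from the paper): linear outcome model,
    1-nearest-neighbor matching, weighted mean of imputed values.
    """
    # Step 1: keep only the nonprobability units that are a nearest
    # neighbor of at least one probability-sample unit.
    matched = np.unique(nearest_neighbor_match(x_prob, x_nonprob))

    # Step 2: fit the outcome model on the matched units only ...
    design_matched = np.column_stack([np.ones(len(matched)), x_nonprob[matched]])
    beta, *_ = np.linalg.lstsq(design_matched, y_nonprob[matched], rcond=None)

    # ... then impute the outcome for every probability-sample unit.
    design_prob = np.column_stack([np.ones(len(x_prob)), x_prob])
    y_imputed = design_prob @ beta

    # Design-weighted mean of the imputed outcomes.
    return np.sum(w_prob * y_imputed) / np.sum(w_prob)


# Toy data (simulated, purely illustrative).
rng = np.random.default_rng(0)
x_nonprob = rng.normal(1.0, 1.0, size=(500, 2))   # shifted covariates
y_nonprob = 2.0 + x_nonprob @ np.array([1.0, -0.5]) + rng.normal(0.0, 0.5, 500)
x_prob = rng.normal(0.0, 1.0, size=(100, 2))      # reference sample, no outcome
w_prob = np.full(100, 50.0)                       # equal design weights

estimate = mmi_mean(x_prob, w_prob, x_nonprob, y_nonprob)
print(f"Illustrative MMI estimate of the mean: {estimate:.3f}")
```

In this toy setup the outcome is observed only in the nonprobability sample, while the probability sample contributes covariates and design weights, mirroring the data structure described in the abstract. Restricting the model fit to matched units is the step that distinguishes this sketch from plain mass imputation, which would fit the model on the full nonprobability sample.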
Supplementary material
The supplementary material includes the following: (1) additional simulation results showing the RMSER and ABR results in tabular format to complement the visualizations in Figures 1 and 2, (2) R code, and (3) a README with a brief explanation of how to run the code.