Predictive Mean Matching Imputation Procedure Based on Machine Learning Models for Complex Survey Data
Volume 22, Issue 3 (2024): Special issue: The Government Advances in Statistical Programming (GASP) 2023 conference, pp. 456–468
Pub. online: 10 July 2024
Type: Statistical Data Science
Open Access
Received: 22 November 2023
Accepted: 16 April 2024
Published: 10 July 2024
Abstract
Missing data is a common occurrence across fields including social science, education, economics, and biomedical research. Disregarding missing data in statistical analyses can bias study outcomes. To mitigate this issue, imputation methods are effective in reducing nonresponse bias and in producing complete datasets for subsequent secondary analysis. The effectiveness of an imputation method, however, hinges on the assumptions of the underlying imputation model. While machine learning techniques such as regression trees, random forests, XGBoost, and deep learning are robust against model misspecification, they may require careful tuning to perform well in specific settings, and the imputed values they generate can sometimes be unnatural, falling outside the range of the observed data. To address these challenges, we propose a novel predictive mean matching (PMM) imputation procedure that builds on these popular machine learning methods. PMM strikes a balance between robustness and the generation of appropriate imputed values. In this paper, we present the proposed PMM approach and assess its performance against other established methods through Monte Carlo simulation studies.
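The abstract rests on the core PMM idea: fit a model on the respondents, compute predicted means for both respondents and nonrespondents, and replace each missing value with the observed value of a donor whose predicted mean is close. The sketch below is a rough illustration of that general idea only, not the authors' procedure or a survey-weighted implementation; the function name pmm_impute, the use of scikit-learn's RandomForestRegressor, and the k-nearest-donor rule are assumptions made for illustration.

# Minimal sketch of a generic PMM step with a machine-learning regressor
# (illustrative only; not the procedure proposed in the paper).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def pmm_impute(X, y, k=5, random_state=0):
    """Impute missing entries of y by predictive mean matching using predictors X."""
    rng = np.random.default_rng(random_state)
    observed = ~np.isnan(y)

    # Fit the imputation model on respondents only.
    model = RandomForestRegressor(n_estimators=200, random_state=random_state)
    model.fit(X[observed], y[observed])

    # Predicted means for donors (respondents) and recipients (nonrespondents).
    pred_obs = model.predict(X[observed])
    pred_mis = model.predict(X[~observed])

    y_imputed = y.copy()
    donor_values = y[observed]
    for i, p in zip(np.flatnonzero(~observed), pred_mis):
        # Match on predicted mean: pick one of the k closest donors at random
        # and donate that donor's observed value.
        nearest = np.argsort(np.abs(pred_obs - p))[:k]
        y_imputed[i] = donor_values[rng.choice(nearest)]
    return y_imputed

Because each imputed value is donated from an observed response rather than taken directly from the model prediction, imputations stay within the support of the observed data, which is the property the abstract emphasizes.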