Predictive Mean Matching Imputation Procedure Based on Machine Learning Models for Complex Survey Data
Volume 22, Issue 3 (2024): Special issue: The Government Advances in Statistical Programming (GASP) 2023 conference, pp. 456–468
Pub. online: 10 July 2024
Type: Statistical Data Science
Open Access
Received: 22 November 2023
Accepted: 16 April 2024
Published: 10 July 2024
Abstract
Missing data is a common occurrence across fields including social science, education, economics, and biomedical research. Disregarding missing data in statistical analyses can bias study outcomes. To mitigate this issue, imputation methods are effective in reducing nonresponse bias and in producing complete datasets for subsequent secondary analysis. The effectiveness of an imputation method, however, hinges on the assumptions of the underlying imputation model. While machine learning techniques such as regression trees, random forests, XGBoost, and deep learning are robust against model misspecification, they may require careful tuning to perform well in specific settings, and the imputed values they generate can sometimes be unnatural, falling outside the range of the observed data. To address these challenges, we propose a novel predictive mean matching (PMM) imputation procedure that builds on these popular machine learning methods. PMM strikes a balance between robustness and the generation of appropriate imputed values. In this paper, we present the proposed PMM approach and assess its performance against other established methods through Monte Carlo simulation studies.
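The abstract rests on the core PMM idea: fit a model on the respondents, compute predicted means for both respondents and nonrespondents, and replace each missing value with the observed value of a donor whose predicted mean is close. The sketch below is a rough illustration of that general idea only, not the authors' procedure or a survey-weighted implementation; the function name pmm_impute, the use of scikit-learn's RandomForestRegressor, and the k-nearest-donor rule are assumptions made for illustration.

# Minimal sketch of a generic PMM step with a machine-learning regressor
# (illustrative only; not the procedure proposed in the paper).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def pmm_impute(X, y, k=5, random_state=0):
    """Impute missing entries of y by predictive mean matching using predictors X."""
    rng = np.random.default_rng(random_state)
    observed = ~np.isnan(y)

    # Fit the imputation model on respondents only.
    model = RandomForestRegressor(n_estimators=200, random_state=random_state)
    model.fit(X[observed], y[observed])

    # Predicted means for donors (respondents) and recipients (nonrespondents).
    pred_obs = model.predict(X[observed])
    pred_mis = model.predict(X[~observed])

    y_imputed = y.copy()
    donor_values = y[observed]
    for i, p in zip(np.flatnonzero(~observed), pred_mis):
        # Match on predicted mean: pick one of the k closest donors at random
        # and donate that donor's observed value.
        nearest = np.argsort(np.abs(pred_obs - p))[:k]
        y_imputed[i] = donor_values[rng.choice(nearest)]
    return y_imputed

Because each imputed value is donated from an observed response rather than taken directly from the model prediction, imputations stay within the support of the observed data, which is the property the abstract emphasizes.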