Journal of Data Science

Comparison of Methods for Imputing Social Network Data
Volume 21, Issue 3 (2023): Special Issue: Advances in Network Data Science, pp. 599–618
Ziqian Xu, Jiarui Hai, Yutong Yang, et al. (4 authors)

https://doi.org/10.6339/22-JDS1045
Pub. online: 20 April 2022      Type: Data Science Reviews      Open Access

Received: 21 December 2021
Accepted: 30 March 2022
Published: 20 April 2022

Abstract

Social network data often contain missing values because of the sensitive nature of the information collected and the dependency among the network actors. As a response, network imputation methods including simple ones constructed from network structural characteristics and more complicated model-based ones have been developed. Although past studies have explored the influence of missing data on social networks and the effectiveness of imputation procedures in many missing data conditions, the current study aims to evaluate a more extensive set of eight network imputation techniques (i.e., null-tie, Reconstruction, Preferential Attachment, Constrained Random Dot Product Graph, Multiple Imputation by Bayesian Exponential Random Graph Models or BERGMs, k-Nearest Neighbors, Random Forest, and Multiple Imputation by Chained Equations) under more practical conditions through comprehensive simulation. A factorial design for missing data conditions is adopted with factors including missing data types, missing data mechanisms, and missing data proportions, which are applied to generated social networks with varying numbers of actors based on 4 different sets of coefficients in ERGMs. Results show that the effectiveness of imputation methods differs by missing data types, missing data mechanisms, the evaluation criteria used, and the complexity of the social networks. More complex methods such as the BERGMs have consistently good performances in recovering missing edges that should have been present. While simpler methods like Reconstruction work better in recovering network statistics when the missing proportion of present edges is low, the BERGMs work better when more present edges are missing. The BERGMs also work well in recovering ERGM coefficients when the networks are complex and the missing data type is actor non-response. 
In conclusion, researchers analyzing social networks with incomplete data should identify the network structures of interest and the potential missing data types before selecting appropriate imputation methods.
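
As a rough illustration of the setting described in the abstract, the sketch below applies two of the simpler techniques it names, null-tie imputation and Reconstruction, to a small directed network with actor non-response under a completely-at-random mechanism. It is a minimal sketch and not the authors' supplementary code: the function names, the `MISSING` marker, and the toy network are all hypothetical. Reconstruction is implemented as it is commonly described for directed networks (a non-respondent's missing outgoing tie is replaced by the observed tie in the opposite direction), and the fall-back to 0 for ties between two non-respondents is a simplifying assumption.

```python
import random

MISSING = None  # hypothetical marker for an unobserved tie


def remove_actors(adj, proportion, rng):
    """Simulate actor non-response: a random subset of actors
    (missing completely at random) reports none of its outgoing ties."""
    n = len(adj)
    k = round(proportion * n)
    nonrespondents = set(rng.sample(range(n), k))
    observed = [
        [MISSING if (i in nonrespondents and i != j) else adj[i][j]
         for j in range(n)]
        for i in range(n)
    ]
    return observed, nonrespondents


def impute_null_tie(observed):
    """Null-tie imputation: treat every missing tie as absent (0)."""
    return [[0 if v is MISSING else v for v in row] for row in observed]


def impute_reconstruction(observed):
    """Reconstruction: fill a missing tie i -> j with the observed tie j -> i.
    Assumption: if j -> i is also missing, fall back to 0."""
    n = len(observed)
    imputed = [row[:] for row in observed]
    for i in range(n):
        for j in range(n):
            if observed[i][j] is MISSING:
                mirrored = observed[j][i]
                imputed[i][j] = mirrored if mirrored is not MISSING else 0
    return imputed


def present_edge_recovery(true_adj, observed, imputed):
    """Share of truly present but unobserved edges that the imputation restores."""
    hits = total = 0
    n = len(true_adj)
    for i in range(n):
        for j in range(n):
            if observed[i][j] is MISSING and true_adj[i][j] == 1:
                total += 1
                hits += imputed[i][j] == 1
    return hits / total if total else float("nan")


if __name__ == "__main__":
    # Toy directed network with 4 actors (illustrative only).
    true_adj = [
        [0, 1, 1, 0],
        [1, 0, 0, 1],
        [0, 1, 0, 0],
        [1, 0, 1, 0],
    ]
    observed, dropped = remove_actors(true_adj, 0.25, random.Random(2023))
    for name, impute in [("null-tie", impute_null_tie),
                         ("reconstruction", impute_reconstruction)]:
        score = present_edge_recovery(true_adj, observed, impute(observed))
        print(name, "non-respondents:", sorted(dropped), "recovery:", score)
```

By construction, null-tie imputation never restores a present edge, whereas Reconstruction restores those that happen to be reciprocated. This is one way to see why the choice of evaluation criterion and the network structure matter when comparing methods.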

Supplementary material

  • supplement.pdf: Supplementary analyses, tables, and figures mentioned in the paper.
  • code: Code used in this study. This folder contains a README.txt file that explains how the code can be used.

Copyright
2023 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.
Open access article under the CC BY license.

Keywords
Bayesian ERGM; ERGM; missing data; multiple imputation

Funding
The research was supported by a grant from the US Department of Education (R305D210023).

Journal of data science

  • Online ISSN: 1683-8602
  • Print ISSN: 1680-743X
