Scalable Community Extraction of Text Networks for Automated Grouping in Medical Databases

Komolafe, Tomilayo; Fong, Allan; Sengupta, Srijan

doi:10.6339/22-JDS1038

Journal of Data Science

Volume 21, Issue 3 (2023): Special Issue: Advances in Network Data Science, pp. 470–489

Pub. online: 19 April 2022

Type: Statistical Data Science

Open Access

Received: 4 November 2021

Accepted: 13 February 2022

Published: 19 April 2022

Abstract

Networks are ubiquitous in today’s world. Community structure is a well-known feature of many empirical networks, and many statistical methods have been developed for community detection. In this paper, we consider the problem of community extraction in text networks, which is highly relevant to medical error and patient safety databases. We adapt a well-known community extraction method to develop a scalable algorithm for extracting groups of similar documents from large text databases. Applying our method to a real-world patient safety reporting system shows that the groups generated by community extraction are much more accurate than manual tagging by frontline workers.
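
As a rough illustration of the idea in the abstract, the Python sketch below builds a small text network from toy reports and pulls out one group of similar documents. It is not the authors' R implementation, which is in the supplementary material. The whitespace tokenizer, the tf-idf weighting, the 0.1 similarity threshold, the extraction score (dense links inside the group, few links to the rest of the network), the greedy search, and every function name are assumptions made for illustration only.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    return [
        {t: c * math.log(n / df[t]) for t, c in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def text_network(docs, threshold=0.1):
    """Link two documents when their tf-idf cosine similarity exceeds threshold."""
    vecs = tfidf_vectors(docs)
    n = len(docs)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(vecs[i], vecs[j]) > threshold:
                adj[i][j] = adj[j][i] = 1
    return adj


def extraction_score(adj, members):
    """Assumed score: within-group density minus density of links leaving the group."""
    n, k = len(adj), len(members)
    if k == 0 or k == n:
        return float("-inf")
    inside = sum(adj[i][j] for i in members for j in members)
    outside = sum(adj[i][j] for i in members for j in range(n) if j not in members)
    return k * (n - k) * (inside / k**2 - outside / (k * (n - k)))


def extract_community(adj, seed):
    """Greedy local search: toggle single nodes in or out while the score improves."""
    members = {seed}
    best = extraction_score(adj, members)
    improved = True
    while improved:
        improved = False
        for node in range(len(adj)):
            candidate = members ^ {node}  # add the node if absent, drop it if present
            score = extraction_score(adj, candidate)
            if score > best:
                members, best, improved = candidate, score, True
    return members


# Toy reports (invented for this example, not taken from the paper's data).
reports = [
    "patient fell bed",
    "patient fell bathroom",
    "patient fell bed alarm",
    "wrong dose pharmacy",
    "wrong dose nurse",
    "label missing sample",
]
adj = text_network(reports)
print(sorted(extract_community(adj, seed=0)))  # [0, 1, 2]
```

Unlike a partition-based clustering, the search returns a single tightly connected group and leaves the remaining reports unassigned. A greedy search of this kind can stall in a local optimum, so a practical version would likely restart from several seeds or use a stronger heuristic.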

Supplementary material

The online supplementary material includes R code for implementing the proposed method.

© 2023 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.

Open access article under the CC BY license.

Keywords

community structure; community extraction; natural language processing; patient safety

Funding

We acknowledge the support from the NIH R01 grant 1R01LM013309 from the National Library of Medicine.
