Journal of Data Science


Scalable Community Extraction of Text Networks for Automated Grouping in Medical Databases
Volume 21, Issue 3 (2023): Special Issue: Advances in Network Data Science, pp. 470–489
Tomilayo Komolafe, Allan Fong, Srijan Sengupta

https://doi.org/10.6339/22-JDS1038
Pub. online: 19 April 2022      Type: Statistical Data Science      Open Access

Received
4 November 2021
Accepted
13 February 2022
Published
19 April 2022

Abstract

Networks are ubiquitous in today’s world. Community structure is a well-known feature of many empirical networks, and many statistical methods have been developed for community detection. In this paper, we consider the problem of community extraction in text networks, which is highly relevant to medical error and patient safety databases. We adapt a well-known community extraction method to develop a scalable algorithm for extracting groups of similar documents in large text databases. Applying our method to a real-world patient safety report system demonstrates that the groups generated by community extraction are much more accurate than manual tagging by frontline workers.

Supplementary material

The online supplementary material includes R code for implementing the proposed method.



Copyright
2023 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.
Open access article under the CC BY license.

Keywords
community structure; community extraction; natural language processing; patient safety

Funding
We acknowledge the support from the NIH R01 grant 1R01LM013309 from the National Library of Medicine.

Journal of data science

  • Online ISSN: 1683-8602
  • Print ISSN: 1680-743X
