
Journal of Data Science


Evaluation of Text Cluster Naming with Generative Large Language Models
Volume 22, Issue 3 (2024): Special issue: The Government Advances in Statistical Programming (GASP) 2023 conference, pp. 376–392
Alexander J. Preiss, Caren A. Arbeit, Anthony Berghammer, et al. (9 authors)

https://doi.org/10.6339/24-JDS1149
Pub. online: 26 August 2024 · Type: Data Science In Action · Open Access

Received: 30 November 2023
Accepted: 20 July 2024
Published: 26 August 2024

Abstract

Text clustering can streamline many labor-intensive tasks, but it creates a new challenge: efficiently labeling and interpreting the clusters. Generative large language models (LLMs) are a promising option for automating the process of naming text clusters, which could significantly streamline workflows, especially in domains with large datasets and esoteric language. In this study, we assessed the ability of GPT-3.5-turbo to generate names for clusters of texts and compared these to human-generated cluster names. We clustered two benchmark datasets, each from a specialized domain: research abstracts and clinical patient notes. We generated names for each cluster using four prompting strategies (different ways of including information about the cluster in the prompt used to elicit LLM responses). For both datasets, the best prompting strategy beat the manual approach across all quality domains. However, name quality varied by prompting strategy and dataset. We conclude that practitioners should consider automated cluster naming to avoid bottlenecks, or when the scale of the effort is large enough to take advantage of the cost savings offered by automation, as detailed in our supplemental blueprint for using LLM cluster naming. To get the best performance, however, it is vital to run a small pilot comparing a variety of prompting strategies and identify which one performs best on each project's unique data.
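
To make the automated naming step concrete, below is a minimal sketch of how a single cluster might be named with one chat-completion call. It assumes the OpenAI Python client (openai >= 1.0) with an OPENAI_API_KEY set in the environment; the prompt wording and the cluster information included (top keywords plus a few sample documents) are illustrative assumptions, not the paper's four prompting strategies.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def name_cluster(keywords, sample_docs, model="gpt-3.5-turbo"):
        """Ask the model for a short, descriptive name for one text cluster."""
        prompt = (
            "You are naming clusters of documents.\n"
            "Top keywords: " + ", ".join(keywords) + "\n"
            "Example documents:\n"
            + "\n".join("- " + doc[:300] for doc in sample_docs)
            + "\nReply with a concise cluster name of at most eight words."
        )
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # reduce randomness so names are reproducible
        )
        return response.choices[0].message.content.strip()

    # Hypothetical usage with a clinical-notes cluster:
    # name_cluster(["anemia", "hemoglobin", "transfusion"],
    #              ["Patient presented with fatigue and a hemoglobin of 6.2 g/dL ..."])

In practice, each prompting strategy corresponds to a different choice of which cluster information (for example, keywords or sample documents) is placed in the prompt, and the call would be repeated for every cluster.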

Supplementary material

Appendices A–D.



Copyright
2024 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.
Open access article under the CC BY license.

Keywords
cluster profiling, large language model, natural language processing, text clustering, topic modeling, unsupervised learning

Funding
This work was funded internally by an RTI International research and development funding mechanism.
