Journal of Data Science (JDS)

ISSN 1683-8602, 1680-743X

School of Statistics, Renmin University of China

Article ID: JDS1149

DOI: 10.6339/24-JDS1149

Data Science in Action

Evaluation of Text Cluster Naming with Generative Large Language Models

Alexander J. Preiss 1,∗ (apreiss@rti.org), https://orcid.org/0000-0002-7966-527X

Caren A. Arbeit, https://orcid.org/0000-0001-6819-7637

Anthony Berghammer, https://orcid.org/0000-0002-0011-066X

John Bollenbacher, https://orcid.org/0000-0003-4209-7595

John V. McCarthy, https://orcid.org/0000-0003-0765-7884

Madeline G. Brom, https://orcid.org/0009-0009-1955-5309

Mike Enger, https://orcid.org/0000-0002-9530-9814

Nicholas Rios Villacorta, https://orcid.org/0009-0006-4191-2673

Shaquavia Straughn, https://orcid.org/0000-0001-5881-8102

1 RTI International, Durham, NC, 27709, United States

∗Corresponding author. Email: apreiss@rti.org.

Published: 26 August 2024

Volume 22, Issue 3, pp. 376–392

Supplementary Material

Appendices A–D.

Received 30 November 2023; accepted 20 July 2024.

2024 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.

Open access article under the CC BY license.

Text clustering can streamline many labor-intensive tasks, but it creates a new challenge: labeling and interpreting the clusters efficiently. Generative large language models (LLMs) are a promising way to automate the naming of text clusters, which could significantly streamline workflows, especially in domains with large datasets and esoteric language. In this study, we assessed the ability of GPT-3.5-turbo to generate names for clusters of texts and compared these names to human-generated ones. We clustered two benchmark datasets, each from a specialized domain: research abstracts and clinical patient notes. We generated names for each cluster using four prompting strategies, that is, four ways of including information about the cluster in the prompt sent to the LLM. For both datasets, the best prompting strategy outperformed the manual approach across all quality domains, although name quality varied by prompting strategy and dataset. We conclude that practitioners should consider automated cluster naming to avoid bottlenecks, or when the effort is large enough to benefit from the cost savings of automation, as detailed in our supplemental blueprint for LLM cluster naming. To get the best performance, however, it is vital to try a variety of prompting strategies and run a small test to identify which one performs best on each project’s unique data.
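
To make the idea of a prompting strategy concrete, the following is a minimal sketch of one possible way to place cluster information in a prompt: a few sampled texts plus the most frequent keywords. It is an illustration under stated assumptions, not the strategies evaluated in the study. The function names, the tiny stopword list, the prompt wording, and the example texts are all hypothetical, and no LLM is called; the resulting string would be sent to whichever model a project uses.

```python
import random
import re
from collections import Counter

# Hypothetical stopword list, kept tiny for illustration only.
STOPWORDS = frozenset(
    {"the", "a", "an", "of", "and", "in", "with", "for", "to", "was", "on"}
)


def top_keywords(texts, k=5):
    """Return the k most frequent non-stopword tokens across the cluster."""
    tokens = []
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        tokens.extend(w for w in words if w not in STOPWORDS)
    return [word for word, _ in Counter(tokens).most_common(k)]


def build_naming_prompt(texts, n_samples=2, n_keywords=5, seed=0):
    """Assemble a naming prompt from sampled texts and frequent keywords."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    samples = rng.sample(texts, min(n_samples, len(texts)))
    keywords = top_keywords(texts, n_keywords)
    lines = [
        "Suggest a short, descriptive name for this cluster of texts.",
        "Keywords: " + ", ".join(keywords),
        "Example texts:",
    ]
    lines += [f"- {s}" for s in samples]
    lines.append("Cluster name:")
    return "\n".join(lines)


# Made-up example cluster (not taken from the study's datasets).
cluster = [
    "Patient presented with chest pain and shortness of breath.",
    "Chest pain radiating to the left arm; ECG ordered.",
    "Acute chest pain with elevated troponin levels.",
]
prompt = build_naming_prompt(cluster)
print(prompt)
```

Varying `n_samples` and `n_keywords`, or dropping one of the two sections, gives alternative prompt variants that can be compared in a small test on a project's own data.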

Keywords: cluster profiling; large language model; natural language processing; text clustering; topic modeling; unsupervised learning

This work was funded internally by an RTI International research and development funding mechanism.
