<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">JDS</journal-id>
<journal-title-group><journal-title>Journal of Data Science</journal-title></journal-title-group>
<issn pub-type="epub">1683-8602</issn><issn pub-type="ppub">1680-743X</issn><issn-l>1680-743X</issn-l>
<publisher>
<publisher-name>School of Statistics, Renmin University of China</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">JDS1149</article-id>
<article-id pub-id-type="doi">10.6339/24-JDS1149</article-id>
<article-categories><subj-group subj-group-type="heading">
<subject>Data Science in Action</subject></subj-group></article-categories>
<title-group>
<article-title>Evaluation of Text Cluster Naming with Generative Large Language Models</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-7966-527X</contrib-id>
<name><surname>Preiss</surname><given-names>Alexander J.</given-names></name><email xlink:href="mailto:apreiss@rti.org">apreiss@rti.org</email><xref ref-type="aff" rid="j_jds1149_aff_001">1</xref><xref ref-type="corresp" rid="cor1">∗</xref>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0001-6819-7637</contrib-id>
<name><surname>Arbeit</surname><given-names>Caren A.</given-names></name><xref ref-type="aff" rid="j_jds1149_aff_001">1</xref>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-0011-066X</contrib-id>
<name><surname>Berghammer</surname><given-names>Anthony</given-names></name><xref ref-type="aff" rid="j_jds1149_aff_001">1</xref>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0003-4209-7595</contrib-id>
<name><surname>Bollenbacher</surname><given-names>John</given-names></name><xref ref-type="aff" rid="j_jds1149_aff_001">1</xref>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0003-0765-7884</contrib-id>
<name><surname>McCarthy</surname><given-names>John V.</given-names></name><xref ref-type="aff" rid="j_jds1149_aff_001">1</xref>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0009-0009-1955-5309</contrib-id>
<name><surname>Brom</surname><given-names>Madeline G.</given-names></name><xref ref-type="aff" rid="j_jds1149_aff_001">1</xref>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-9530-9814</contrib-id>
<name><surname>Enger</surname><given-names>Mike</given-names></name><xref ref-type="aff" rid="j_jds1149_aff_001">1</xref>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0009-0006-4191-2673</contrib-id>
<name><surname>Rios Villacorta</surname><given-names>Nicholas</given-names></name><xref ref-type="aff" rid="j_jds1149_aff_001">1</xref>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0001-5881-8102</contrib-id>
<name><surname>Straughn</surname><given-names>Shaquavia</given-names></name><xref ref-type="aff" rid="j_jds1149_aff_001">1</xref>
</contrib>
<aff id="j_jds1149_aff_001"><label>1</label><institution>RTI International</institution>, Durham, NC, 27709, <country>United States</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>∗</label>Corresponding author. Email: <ext-link ext-link-type="uri" xlink:href="mailto:apreiss@rti.org">apreiss@rti.org</ext-link>.</corresp>
</author-notes>
<pub-date pub-type="ppub"><year>2024</year></pub-date><pub-date pub-type="epub"><day>26</day><month>8</month><year>2024</year></pub-date><volume>22</volume><issue>3</issue><fpage>376</fpage><lpage>392</lpage><supplementary-material id="S1" content-type="archive" xlink:href="jds1149_s001.zip" mimetype="application" mime-subtype="x-zip-compressed">
<caption>
<title>Supplementary Material</title>
<p>Appendices A–D.</p>
</caption>
</supplementary-material><history><date date-type="received"><day>30</day><month>11</month><year>2023</year></date><date date-type="accepted"><day>20</day><month>7</month><year>2024</year></date></history>
<permissions><copyright-statement>2024 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.</copyright-statement><copyright-year>2024</copyright-year>
<license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>Open access article under the <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">CC BY</ext-link> license.</license-p></license></permissions>
<abstract>
<p>Text clustering can streamline many labor-intensive tasks, but it creates a new challenge: efficiently labeling and interpreting the clusters. Generative large language models (LLMs) are a promising option for automating the naming of text clusters, which could substantially speed up workflows, especially in domains with large datasets and esoteric language. In this study, we assessed the ability of GPT-3.5-turbo to generate names for clusters of texts and compared these names to human-generated ones. We clustered two benchmark datasets, each from a specialized domain: research abstracts and clinical patient notes. We generated names for each cluster using four prompting strategies (different ways of including information about the cluster in the prompt sent to the LLM). For both datasets, the best prompting strategy beat the manual approach across all quality domains. However, name quality varied by prompting strategy and dataset. We conclude that practitioners should consider automated cluster naming to avoid bottlenecks, or when the effort is large enough to benefit from the cost savings of automation, as detailed in our supplemental blueprint for using LLM cluster naming. To get the best performance, however, it is vital to run a small pilot comparing several prompting strategies to identify which one performs best on each project’s unique data.</p>
</abstract>
<kwd-group>
<label>Keywords</label>
<kwd>cluster profiling</kwd>
<kwd>large language model</kwd>
<kwd>natural language processing</kwd>
<kwd>text clustering</kwd>
<kwd>topic modeling</kwd>
<kwd>unsupervised learning</kwd>
</kwd-group>
<funding-group><funding-statement>This work was funded internally by an RTI International research and development funding mechanism.</funding-statement></funding-group>
</article-meta>
</front>
<back>
<ref-list id="j_jds1149_reflist_001">
<title>References</title>
<ref id="j_jds1149_ref_001">
<mixed-citation publication-type="other"> <string-name><surname>BERTopic</surname></string-name> (<year>2023</year>a). The algorithm. Accessed 2023.</mixed-citation>
</ref>
<ref id="j_jds1149_ref_002">
<mixed-citation publication-type="other"> <string-name><surname>BERTopic</surname></string-name> (<year>2023</year>b). c-TF-IDF. Accessed 2023.</mixed-citation>
</ref>
<ref id="j_jds1149_ref_003">
<mixed-citation publication-type="other"> <string-name><surname>Bowman</surname> <given-names>SR</given-names></string-name>, <string-name><surname>Dahl</surname> <given-names>GE</given-names></string-name> (<year>2021</year>). What will it take to fix benchmarking in natural language understanding? arXiv preprint: <uri>https://arxiv.org/abs/2104.02145</uri>.</mixed-citation>
</ref>
<ref id="j_jds1149_ref_004">
<mixed-citation publication-type="chapter"> <string-name><surname>Carbonell</surname> <given-names>J</given-names></string-name>, <string-name><surname>Goldstein</surname> <given-names>J</given-names></string-name> (<year>1998</year>). <chapter-title>The use of MMR, diversity-based reranking for reordering documents and producing summaries</chapter-title>. In: <source><italic>Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</italic></source>, <fpage>335</fpage>–<lpage>336</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1149_ref_005">
<mixed-citation publication-type="other"> <string-name><surname>Dang</surname> <given-names>HT</given-names></string-name> (<year>2005</year>). Overview of DUC 2005. <italic>Technical report</italic>, National Institute of Standards and Technology (NIST).</mixed-citation>
</ref>
<ref id="j_jds1149_ref_006">
<mixed-citation publication-type="journal"> <string-name><surname>Fabbri</surname> <given-names>AR</given-names></string-name>, <string-name><surname>Kryściński</surname> <given-names>W</given-names></string-name>, <string-name><surname>McCann</surname> <given-names>B</given-names></string-name>, <string-name><surname>Xiong</surname> <given-names>C</given-names></string-name>, <string-name><surname>Socher</surname> <given-names>R</given-names></string-name>, <string-name><surname>Radev</surname> <given-names>D</given-names></string-name> (<year>2021</year>). <article-title>SummEval: Re-evaluating summarization evaluation</article-title>. <source><italic>Transactions of the Association for Computational Linguistics</italic></source>, <volume>9</volume>: <fpage>391</fpage>–<lpage>409</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1162/tacl_a_00373" xlink:type="simple">https://doi.org/10.1162/tacl_a_00373</ext-link></mixed-citation>
</ref>
<ref id="j_jds1149_ref_007">
<mixed-citation publication-type="journal"> <string-name><surname>Giray</surname> <given-names>L</given-names></string-name> (<year>2023</year>). <article-title>Prompt engineering with ChatGPT: A guide for academic writers</article-title>. <source><italic>Annals of Biomedical Engineering</italic></source>, <volume>51</volume>: <fpage>2629</fpage>–<lpage>2633</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1007/s10439-023-03272-4" xlink:type="simple">https://doi.org/10.1007/s10439-023-03272-4</ext-link></mixed-citation>
</ref>
<ref id="j_jds1149_ref_008">
<mixed-citation publication-type="other"> <string-name><surname>Hdbscan</surname></string-name> (<year>2016</year>). The hdbscan clustering library. Accessed 2023.</mixed-citation>
</ref>
<ref id="j_jds1149_ref_009">
<mixed-citation publication-type="journal"> <string-name><surname>Hosna</surname> <given-names>A</given-names></string-name>, <string-name><surname>Merry</surname> <given-names>E</given-names></string-name>, <string-name><surname>Gyalmo</surname> <given-names>J</given-names></string-name>, <string-name><surname>Alom</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Aung</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Abdul</surname> <given-names>M</given-names></string-name> (<year>2022</year>). <article-title>Transfer learning: A friendly introduction</article-title>. <source><italic>Journal of Big Data</italic></source>, <volume>9</volume>: <fpage>102</fpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1186/s40537-022-00652-w" xlink:type="simple">https://doi.org/10.1186/s40537-022-00652-w</ext-link></mixed-citation>
</ref>
<ref id="j_jds1149_ref_010">
<mixed-citation publication-type="other"> <string-name><surname>Kamalloo</surname> <given-names>E</given-names></string-name>, <string-name><surname>Dziri</surname> <given-names>N</given-names></string-name>, <string-name><surname>Clarke</surname> <given-names>CLA</given-names></string-name>, <string-name><surname>Rafiei</surname> <given-names>D</given-names></string-name> (<year>2023</year>). Evaluating open-domain question answering in the era of large language models. arXiv preprint: <uri>https://arxiv.org/abs/2305.06984</uri>.</mixed-citation>
</ref>
<ref id="j_jds1149_ref_011">
<mixed-citation publication-type="journal"> <string-name><surname>Kaur</surname> <given-names>J</given-names></string-name>, <string-name><surname>Buttar</surname> <given-names>PK</given-names></string-name> (<year>2018</year>). <article-title>A systematic review on stopword removal algorithms</article-title>. <source><italic>International Journal of Future Revolution in Computer Science &amp; Communication Engineering</italic></source>, <volume>4</volume>(<issue>4</issue>): <fpage>207</fpage>–<lpage>210</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1149_ref_012">
<mixed-citation publication-type="other"> <string-name><surname>KeyBERT</surname></string-name> (<year>2022</year>). About the project. Accessed 2023.</mixed-citation>
</ref>
<ref id="j_jds1149_ref_013">
<mixed-citation publication-type="other"> <string-name><surname>Kryściński</surname> <given-names>W</given-names></string-name>, <string-name><surname>McCann</surname> <given-names>B</given-names></string-name>, <string-name><surname>Xiong</surname> <given-names>C</given-names></string-name>, <string-name><surname>Socher</surname> <given-names>R</given-names></string-name> (<year>2020</year>). Evaluating the factual consistency of abstractive text summarization. arXiv preprint: <uri>https://arxiv.org/abs/1910.12840</uri>.</mixed-citation>
</ref>
<ref id="j_jds1149_ref_014">
<mixed-citation publication-type="other"> <string-name><surname>Ma</surname> <given-names>C</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>WE</given-names></string-name>, <string-name><surname>Guo</surname> <given-names>M</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Sheng</surname> <given-names>QZ</given-names></string-name> (<year>2021</year>). Multi-document summarization via deep learning techniques: A survey. arXiv preprint: <uri>https://arxiv.org/abs/2011.04843</uri>.</mixed-citation>
</ref>
<ref id="j_jds1149_ref_015">
<mixed-citation publication-type="other"> <string-name><surname>Ramos</surname> <given-names>J</given-names></string-name> (<year>2003</year>). Using TF-IDF to determine word relevance in document queries. <italic>Technical report</italic>.</mixed-citation>
</ref>
<ref id="j_jds1149_ref_016">
<mixed-citation publication-type="other"> <string-name><surname>Reimers</surname> <given-names>N</given-names></string-name>, <string-name><surname>Gurevych</surname> <given-names>I</given-names></string-name> (<year>2019</year>). Sentence-BERT: Sentence embeddings using siamese BERT-networks. arXiv preprint: <uri>https://arxiv.org/abs/1908.10084</uri>.</mixed-citation>
</ref>
<ref id="j_jds1149_ref_017">
<mixed-citation publication-type="chapter"> <string-name><surname>Rose</surname> <given-names>S</given-names></string-name>, <string-name><surname>Engel</surname> <given-names>D</given-names></string-name>, <string-name><surname>Cramer</surname> <given-names>N</given-names></string-name>, <string-name><surname>Cowley</surname> <given-names>W</given-names></string-name> (<year>2010</year>). <chapter-title>Automatic keyword extraction from individual documents</chapter-title>. In: <source><italic>Text Mining: Applications and Theory</italic></source> (<string-name><given-names>MW</given-names> <surname>Berry</surname></string-name>, <string-name><given-names>J</given-names> <surname>Kogan</surname></string-name>, eds.). <publisher-name>John Wiley &amp; Sons, Ltd.</publisher-name></mixed-citation>
</ref>
<ref id="j_jds1149_ref_018">
<mixed-citation publication-type="other"> <string-name><surname>UMAP</surname></string-name> (<year>2018</year>). UMAP: Uniform manifold approximation and projection for dimension reduction. Accessed 2023.</mixed-citation>
</ref>
<ref id="j_jds1149_ref_019">
<mixed-citation publication-type="other"> <string-name><surname>Xiao</surname> <given-names>W</given-names></string-name>, <string-name><surname>Beltagy</surname> <given-names>I</given-names></string-name>, <string-name><surname>Carenini</surname> <given-names>G</given-names></string-name>, <string-name><surname>Cohan</surname> <given-names>A</given-names></string-name> (<year>2022</year>). PRIMERA: Pyramid-based masked sentence pre-training for multi-document summarization. arXiv preprint: <uri>https://arxiv.org/abs/2110.08499</uri>.</mixed-citation>
</ref>
<ref id="j_jds1149_ref_020">
<mixed-citation publication-type="other"> <string-name><surname>Zhang</surname> <given-names>T</given-names></string-name>, <string-name><surname>Ladhak</surname> <given-names>F</given-names></string-name>, <string-name><surname>Durmus</surname> <given-names>E</given-names></string-name>, <string-name><surname>Liang</surname> <given-names>P</given-names></string-name>, <string-name><surname>McKeown</surname> <given-names>K</given-names></string-name>, <string-name><surname>Hashimoto</surname> <given-names>TB</given-names></string-name> (<year>2023</year>a). Benchmarking large language models for news summarization. arXiv preprint: <uri>https://arxiv.org/abs/2301.13848</uri>.</mixed-citation>
</ref>
<ref id="j_jds1149_ref_021">
<mixed-citation publication-type="other"> <string-name><surname>Zhang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Cui</surname> <given-names>L</given-names></string-name>, <string-name><surname>Cai</surname> <given-names>D</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>L</given-names></string-name>, <string-name><surname>Fu</surname> <given-names>T</given-names></string-name>, et al. (<year>2023</year>b). Siren’s song in the AI ocean: A survey on hallucination in large language models. arXiv preprint: <uri>https://arxiv.org/abs/2309.01219</uri>.</mixed-citation>
</ref>
<ref id="j_jds1149_ref_022">
<mixed-citation publication-type="other"> <string-name><surname>Zhao</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Jin</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>F</given-names></string-name>, <string-name><surname>Peng</surname> <given-names>T</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>S</given-names></string-name> (<year>2023</year>). PMC-patients: A large-scale dataset of patient summaries and relations for benchmarking retrieval-based clinical decision support systems. arXiv preprint: <uri>https://arxiv.org/abs/2202.13876</uri>.</mixed-citation>
</ref>
</ref-list>
</back>
</article>
