Pub. online: 26 Aug 2024
Type: Data Science In Action
Open Access
Journal: Journal of Data Science
Volume 22, Issue 3 (2024): Special issue: The Government Advances in Statistical Programming (GASP) 2023 conference, pp. 376–392
Abstract
Text clustering can streamline many labor-intensive tasks, but it creates a new challenge: efficiently labeling and interpreting the clusters. Generative large language models (LLMs) are a promising option to automate the process of naming text clusters, which could significantly streamline workflows, especially in domains with large datasets and esoteric language. In this study, we assessed the ability of GPT-3.5-turbo to generate names for clusters of texts and compared these to human-generated text cluster names. We clustered two benchmark datasets, each from a specialized domain: research abstracts and clinical patient notes. We generated names for each cluster using four prompting strategies (different ways of including information about the cluster in the prompt used to get LLM responses). For both datasets, the best prompting strategy beat the manual approach across all quality domains. However, name quality varied by prompting strategy and dataset. We conclude that practitioners should consider trying automated cluster naming to avoid bottlenecks or when the scale of the effort is enough to take advantage of the cost savings offered by automation, as detailed in our supplemental blueprint for using LLM cluster naming. However, to get the best performance, it is vital to test a variety of prompting strategies and perform a small test to identify which one performs best on each project’s unique data.
In this paper, we consider a functional varying coefficient model in the presence of a time-invariant covariate for sparse longitudinal data contaminated with measurement error. We propose a regularization method to estimate the slope function based on a reproducing kernel Hilbert space approach. The procedure is easy to implement. Our simulation results show that it performs well, especially as either the sampling frequency or the sample size increases. We illustrate the method with an analysis of a longitudinal CD4+ count dataset from an HIV study.
Abstract: The change point problem has been studied extensively since the 1950s because of its broad applications in fields such as finance and biology. As a special case of the multiple change point problem, the epidemic change point problem has received considerable attention, especially in medical studies. In this paper, a nonparametric method based on the empirical likelihood is proposed to detect epidemic changes in the mean at unknown change points. Under some mild conditions, the asymptotic null distribution of the empirical likelihood ratio test statistic is proved to be the extreme value distribution. The consistency of the test is also proved. Simulations indicate that the test performs comparably to other available tests while imposing fewer constraints on the data distribution. The method is applied to the Stanford heart transplant data and detects the change points successfully.
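To make the notion of an epidemic change concrete, the sketch below simulates a series whose mean shifts over an interval and then returns to its baseline, and locates the interval with a simple CUSUM-style scan. This is not the empirical likelihood ratio test described in the abstract; it is a swapped-in, much simpler statistic used only for illustration. The function name `epidemic_scan`, the simulated data, and all parameter values are assumptions and do not come from the paper.

```python
import numpy as np


def epidemic_scan(x):
    """CUSUM-style scan for an epidemic change in the mean (illustrative only).

    Returns (stat, k1, k2), where x[k1:k2] is the estimated epidemic segment
    and stat is the maximal |S_k2 - S_k1|, scaled by the sample standard
    deviation times sqrt(n). S_k is the cumulative sum of the centered series.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    # S_0, ..., S_n with S_0 = 0; S_n is (numerically) 0 after centering.
    s = np.concatenate(([0.0], np.cumsum(x - x.mean())))
    # diff[i, j] = |S_j - S_i|; O(n^2) memory, fine for short series.
    diff = np.abs(s[None, :] - s[:, None])
    rows, cols = np.triu_indices(n + 1, k=1)  # all pairs with i < j
    best = np.argmax(diff[rows, cols])
    k1, k2 = int(rows[best]), int(cols[best])
    stat = diff[k1, k2] / (x.std(ddof=1) * np.sqrt(n))
    return stat, k1, k2


# Hypothetical example: baseline mean 0, shifted mean 2 on indices 80..119.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 200)
x[80:120] += 2.0
stat, k1, k2 = epidemic_scan(x)
print(f"statistic={stat:.2f}, estimated epidemic segment: [{k1}, {k2})")
```

The scan only gives point estimates of the two change points. Deciding whether the change is significant requires a null distribution for the statistic, which is the part the empirical likelihood approach in the abstract addresses.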