Pub. online: 7 Aug 2024. Type: Data Science In Action. Open Access.
Journal: Journal of Data Science
Volume 22, Issue 3 (2024): Special issue: The Government Advances in Statistical Programming (GASP) 2023 conference, pp. 409–422
Abstract
The North American Product Classification System (NAPCS) was first introduced in the 2017 Economic Census and provides greater detail on the range of products and services offered by businesses than what was previously available with just an industry code. In the 2022 Economic Census, NAPCS consisted of 7,234 codes and respondents often found that they were unable to identify correct NAPCS codes for their business, leaving instead written descriptions of their products and services. Over one million of these needed to be reviewed by Census analysts in the 2017 Economic Census. The Smart Instrument NAPCS Classification Tool (SINCT) offers respondents a low latency search engine to find appropriate NAPCS codes based on a written description of their products and services. SINCT uses a neural network document embedding model (doc2vec) to embed respondent searches in a numerical space and then identifies NAPCS codes that are close to the search text. This paper shows one way in which machine learning can improve the survey respondent experience and reduce the amount of expensive manual processing that is necessary after data collection. We also show how relatively simple tools can achieve an estimated 72% top-ten accuracy with thousands of possible classes, limited training data, and strict latency requirements.
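The retrieval step described in this abstract (embed a written description, then return the closest codes) and the top-ten accuracy metric can be sketched as follows. This is a minimal illustration under stated assumptions, not the SINCT implementation: a bag-of-words count vector stands in for the doc2vec embedding, cosine similarity is assumed as the closeness measure, and the codes and descriptions are invented placeholders rather than real NAPCS entries.

```python
import math
from collections import Counter

# Hypothetical codes and descriptions; these are NOT real NAPCS entries.
CODES = {
    "A100": "retail sales of fresh bread and bakery products",
    "B200": "automotive repair and maintenance services",
    "C300": "custom software development services",
}

def embed(text):
    """Toy stand-in for a doc2vec embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Embed every code description once, ahead of any search.
CODE_VECTORS = {code: embed(desc) for code, desc in CODES.items()}

def search(query, k=10):
    """Return up to k codes ranked by similarity to the query text."""
    q = embed(query)
    ranked = sorted(CODE_VECTORS, key=lambda c: cosine(q, CODE_VECTORS[c]), reverse=True)
    return ranked[:k]

def top_k_accuracy(labelled_queries, k=10):
    """Share of (query, true_code) pairs whose true code is among the top k results."""
    hits = sum(1 for query, true_code in labelled_queries if true_code in search(query, k))
    return hits / len(labelled_queries)
```

Precomputing the code vectors keeps each search to one embedding plus a ranking pass, which is the kind of design a strict latency requirement tends to favour.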
Pub. online: 23 Jul 2024. Type: Data Science In Action. Open Access.
Journal: Journal of Data Science
Volume 22, Issue 3 (2024): Special issue: The Government Advances in Statistical Programming (GASP) 2023 conference, pp. 393–408
Abstract
The coronavirus disease 2019 (COVID-19) pandemic presented unique challenges to the U.S. healthcare system, particularly for nonprofit U.S. hospitals that are obligated to provide community benefits in exchange for federal tax exemptions. We sought to examine how hospitals initiated, modified, or disbanded community benefits programming in response to the COVID-19 pandemic. We used the free-response text in Part IV of Internal Revenue Service (IRS) Form 990 Schedule H (F990H) to assess health equity and disparities. We combined traditional key term frequency and Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) clustering approaches with a novel Generative Pre-trained Transformer (GPT) 3.5 summarization approach. Our research reveals shifts in community benefits programming. We observed an increase in COVID-related terms starting in the 2019 tax year, indicating a pivot in community focus and efforts toward pandemic-related activities such as telehealth services and COVID-19 testing and prevention. The clustering analysis identified themes related to COVID-19 and community benefits. Generative Artificial Intelligence (GenAI) summarization with GPT3.5 contextualized these changes, revealing examples of healthcare system adaptations and program cancellations. However, GPT3.5 also encountered some accuracy and validation challenges. This multifaceted text analysis underscores the adaptability of hospitals in maintaining community health support during crises and suggests the potential of advanced AI tools in evaluating large-scale qualitative data for policy and public health research.
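The key term frequency step mentioned in this abstract can be illustrated with a short sketch that counts selected terms per tax year. The term list, the prefix-matching rule, and the example records below are assumptions made for illustration; the study's actual term list and preprocessing are not given here.

```python
import re
from collections import Counter

# Hypothetical key terms; the study's actual list is not stated in the abstract.
KEY_TERMS = ("covid", "telehealth", "testing")

def term_counts_by_year(records, terms=KEY_TERMS):
    """Count key-term occurrences per tax year.

    records: iterable of (tax_year, free_text) pairs.
    Returns a dict mapping tax_year to a Counter of term counts.
    """
    counts = {}
    for year, text in records:
        year_counts = counts.setdefault(year, Counter())
        # Lowercase and split on non-alphanumerics, so "COVID-19" yields "covid" and "19".
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            for term in terms:
                if token.startswith(term):  # assumed prefix match
                    year_counts[term] += 1
    return counts

# Invented example text, not taken from any Schedule H filing.
records = [
    (2018, "Community health fairs and screenings."),
    (2019, "Expanded telehealth and COVID-19 testing; COVID vaccination clinics."),
]
counts = term_counts_by_year(records)
```

Comparing the per-year counters is then enough to see whether a term such as "covid" rises from one tax year to the next.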
Pub. online: 4 Aug 2022. Type: Research Article. Open Access.
Journal: Journal of Data Science
Volume 18, Issue 5 (2020): Special Issue S1 in Chinese (with abstract in English), pp. 875–888
Abstract
In the wake of the COVID-19 outbreak, the public turned to Sina Weibo as a major platform for following the course of the pandemic. Sentiment analysis and topic mining of Sina Weibo comment data on major public events are important for understanding how public opinion evolves during major epidemic outbreaks. Based on a psychological classification of Chinese text into emotion categories, we use open-source tools to build naive Bayes models that classify Weibo comments. Comment topics are visualized with word co-occurrence networks and mined with a latent Dirichlet allocation (LDA) model. The results show that psychological sentiment classification combined with the naive Bayes model can reflect the evolution of public sentiment during the epidemic, and that the LDA model and word co-occurrence network can effectively mine the topics of public concern.
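The naive Bayes classification step can be sketched in a few lines. This is a minimal multinomial naive Bayes with add-one smoothing, written from scratch for illustration; it is not the open-source tooling the authors used. The emotion labels and the English tokens are invented placeholders, and real Weibo comments would first need Chinese word segmentation.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # number of documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for w in tokens:
                if w in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented, pre-tokenized training comments with hypothetical emotion labels.
train_docs = [
    ["stay", "strong", "wuhan"],
    ["thank", "you", "doctors"],
    ["so", "worried", "and", "scared"],
    ["afraid", "of", "infection"],
]
train_labels = ["positive", "positive", "fear", "fear"]
model = NaiveBayes().fit(train_docs, train_labels)
```

Working in log space avoids numerical underflow when comments contain many tokens, and the smoothing keeps a single unseen word from zeroing out a class.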
Abstract: Many nations’ defence departments use capability-based planning to guide their investment and divestment decisions. This planning process involves a variety of data that in its raw form is difficult for decision-makers to use. In this paper we describe how dimensionality reduction and partition clustering are used in the Canadian Armed Forces to create visualizations that convey how important military capabilities are in planning scenarios and how much capacity the planned force structure has to provide the capabilities. Together, these visualizations give decision-makers an overview of which capabilities may require investment or may be candidates for divestment.
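As a rough illustration of partition clustering, the sketch below runs plain k-means on a handful of two-dimensional points. The abstract does not say which partitioning algorithm or which features were used, so k-means, the deterministic seeding, and the (importance, capacity) scores are all assumptions made for this example.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means on tuples of floats; the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        assignment = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        # Move each centroid to the mean of its members; leave it alone if the cluster is empty.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids

# Invented (importance, capacity) scores for four hypothetical capabilities.
scores = [(0.9, 0.2), (0.1, 0.8), (0.8, 0.3), (0.2, 0.9)]
assignment, centroids = kmeans(scores, k=2)
```

In this toy data, one cluster groups the capabilities with high importance and low capacity, and the other groups those with low importance and high capacity.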
Pub. online: 19 Apr 2022. Type: Statistical Data Science. Open Access.
Journal: Journal of Data Science
Volume 21, Issue 3 (2023): Special Issue: Advances in Network Data Science, pp. 470–489
Abstract
Networks are ubiquitous in today’s world. Community structure is a well-known feature of many empirical networks, and many statistical methods have been developed for community detection. In this paper, we consider the problem of community extraction in text networks, which is highly relevant to medical error and patient safety databases. We adapt a well-known community extraction method to develop a scalable algorithm for extracting groups of similar documents in large text databases. Applying our method to a real-world patient safety report system demonstrates that the groups generated by community extraction are much more accurate than manual tagging by frontline workers.