Pub. online: 23 Jul 2024 | Type: Data Science In Action | Open Access
Journal: Journal of Data Science
Volume 22, Issue 3 (2024): Special issue: The Government Advances in Statistical Programming (GASP) 2023 conference, pp. 393–408
Abstract
The coronavirus disease 2019 (COVID-19) pandemic presented unique challenges to the U.S. healthcare system, particularly for nonprofit U.S. hospitals that are obligated to provide community benefits in exchange for federal tax exemptions. We sought to examine how hospitals initiated, modified, or disbanded community benefits programming in response to the COVID-19 pandemic. We used the free-response text in Part IV of Internal Revenue Service (IRS) Form 990 Schedule H (F990H) to assess health equity and disparities. We combined traditional key term frequency and Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) clustering approaches with a novel Generative Pre-trained Transformer (GPT) 3.5 summarization approach. Our research reveals shifts in community benefits programming. We observed an increase in COVID-related terms starting in the 2019 tax year, indicating a pivot in community focus and efforts toward pandemic-related activities such as telehealth services and COVID-19 testing and prevention. The clustering analysis identified themes related to COVID-19 and community benefits. Generative Artificial Intelligence (GenAI) summarization with GPT3.5 contextualized these changes, revealing examples of healthcare system adaptations and program cancellations. However, GPT3.5 also encountered some accuracy and validation challenges. This multifaceted text analysis underscores the adaptability of hospitals in maintaining community health support during crises and suggests the potential of advanced AI tools in evaluating large-scale qualitative data for policy and public health research.
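The key term frequency step mentioned in this abstract can be illustrated with a minimal sketch. It assumes a hypothetical term list and invented narratives standing in for Schedule H Part IV free text; the abstract does not state which terms or preprocessing the authors used, so `COVID_TERMS`, `share_mentioning`, and `records` are illustrative names only.

```python
import re
from collections import defaultdict

# Hypothetical key terms; the abstract does not list the terms actually used.
COVID_TERMS = ("covid", "coronavirus", "pandemic", "telehealth")


def share_mentioning(records, terms=COVID_TERMS):
    """Return {tax_year: fraction of narratives containing any key term}."""
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, text in records:
        totals[year] += 1
        if pattern.search(text):
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Invented (tax_year, narrative) pairs, for illustration only.
records = [
    (2018, "Expanded free clinic hours and health fairs."),
    (2018, "Provided financial assistance to uninsured patients."),
    (2019, "Launched telehealth visits in response to COVID-19."),
    (2019, "Continued community health needs assessment."),
    (2020, "Offered COVID-19 testing and vaccination outreach."),
    (2020, "Cancelled in-person classes during the pandemic."),
]

shares = share_mentioning(records)
for year, share in shares.items():
    print(year, f"{share:.0%}")
```

Reporting the share of filings that mention a term, rather than raw counts, keeps years with different numbers of filings comparable.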
Pub. online: 4 Aug 2022 | Type: Research Article | Open Access
Journal: Journal of Data Science
Volume 18, Issue 3 (2020): Special issue: Data Science in Action in Response to the Outbreak of COVID-19, pp. 526–535
Abstract
COVID-19 is a disease caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) that was reported to spread in people in December 2019. Understanding epidemiological features of COVID-19 is important for the ongoing global efforts to contain the virus. As a complement to the available work, in this article we analyze the Kaggle novel coronavirus dataset of 3397 patients dated from January 22, 2020 to March 29, 2020. We employ semiparametric and nonparametric survival models as well as text mining and data visualization techniques to examine the clinical manifestations and epidemiological features of COVID-19. Our analysis shows that: (i) the median incubation time is about 5 days and older people tend to have a longer incubation period; (ii) the median time for infected people to recover is about 20 days, and the recovery time is significantly associated with age but not gender; (iii) the fatality rate is higher for older infected patients than for younger patients.
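As an illustration of how a median time can be read off a nonparametric survival model, here is a minimal sketch of the Kaplan-Meier product-limit estimator. The abstract does not name the estimator used, so Kaplan-Meier is an assumption, and the durations below are invented rather than taken from the Kaggle dataset.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct event time (product-limit estimator)."""
    events = Counter(t for t, o in zip(durations, observed) if o)
    exits = Counter(durations)  # events plus censored cases leaving the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


def median_time(curve):
    """Smallest time at which estimated survival is 0.5 or below (None if never)."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Invented recovery times in days; False marks a right-censored patient
# (still not recovered when follow-up ended).
durations = [12, 15, 18, 20, 20, 22, 25, 30]
observed = [True, True, False, True, True, True, False, True]

curve = kaplan_meier(durations, observed)
print(median_time(curve))
```

Censored patients never trigger a drop in the curve, but they do shrink the risk set at later times, which is why they cannot simply be discarded.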
Law and legal studies have become an exciting new field for data science applications, and technological advancement in turn has profound implications for legal practice. For example, the legal industry has accumulated a rich body of high-quality texts, images and other digitised formats, which are ready to be further processed and analysed by data scientists. On the other hand, the increasing popularity of data science poses a genuine challenge to legal practitioners, regulators and even the general public, and has motivated a long-lasting debate in academia focusing on issues such as privacy protection and algorithmic discrimination. This paper collects 1236 journal articles involving both law and data science from the Web of Science platform to understand the patterns and trends of this interdisciplinary research field in terms of English-language journal publications. We find a clear trend of increasing publication volume over time and a strong presence of high-impact law and political science journals. We then use Latent Dirichlet Allocation (LDA) as a topic modelling method to classify the abstracts into four topics based on the coherence measure. The four topics identified confirm that both challenges and opportunities have been investigated in this interdisciplinary field and help offer directions for future research.
Pub. online: 22 Feb 2021 | Type: Data Science In Action
Journal: Journal of Data Science
Volume 19, Issue 2 (2021): Special issue: Continued Data Science Contributions to COVID-19 Pandemic, pp. 334–347
Abstract
Coronavirus and the COVID-19 pandemic have substantially altered the ways in which people learn, interact, and discover information. In the absence of everyday in-person interaction, how do people self-educate while living in isolation during such times? More specifically, do communities emerge in Google search trends related to coronavirus? Using a suite of network and community detection algorithms, we scrape and mine all Google search trends in America related to an initial search for “coronavirus,” from the first Google search on the term (January 16, 2020) through August 11, 2020. Results indicate a near-constant shift in the structure of how people educate themselves on coronavirus. Queries in the earliest days focus on “Wuhan” and “China”, then shift to “stimulus checks” at the height of the virus in the U.S., and finally shift to queries related to local surges of new cases in later days. A few communities emerge surrounding terms more overtly related to coronavirus (e.g., “cases”, “symptoms”, etc.). Yet, given the shift in related Google queries and the broader information environment, clear community structure for the full search space does not emerge.
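One common way to quantify whether "clear community structure" exists in a network is Newman's modularity score, sketched below. The abstract does not say which algorithms or scores the authors used, so this choice is an assumption, and the small query graph and its partition are invented for illustration.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    degree = defaultdict(int)
    internal = defaultdict(int)  # edges with both endpoints in the same community
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    degree_sum = defaultdict(int)  # total degree per community
    for node, k in degree.items():
        degree_sum[community[node]] += k
    # Q = sum over communities of (internal edge share) - (expected share)
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Invented graph of related search queries: two triangles joined by one edge.
edges = [
    ("wuhan", "china"), ("china", "outbreak"), ("wuhan", "outbreak"),
    ("cases", "symptoms"), ("symptoms", "testing"), ("cases", "testing"),
    ("outbreak", "cases"),
]
two_groups = {"wuhan": 0, "china": 0, "outbreak": 0,
              "cases": 1, "symptoms": 1, "testing": 1}
one_group = {node: 0 for node in two_groups}

q_split = modularity(edges, two_groups)
q_single = modularity(edges, one_group)
print(round(q_split, 3), round(q_single, 3))
```

A score near zero means the partition captures no more within-group links than a random graph with the same degrees would, which is one way to express the absence of clear community structure.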