Financial news headlines are a rich source of information on financial activity, offering a wealth of text that can provide insight into human behavior. One key analysis that can be conducted on such text is sentiment analysis. Despite extensive research over the years, sentiment analysis still faces challenges, particularly in handling the internet slang, abbreviations, and emoticons common on many websites that cover financial news, including Bloomberg, Yahoo Finance, and the Financial Times. This paper compares the performance of two sentiment analyzers, VADER and TextBlob, on financial news headlines from two countries: the USA (a developed economy) and Nepal (a developing economy). The collected headlines were manually classified into three categories (positive, negative, and neutral) from a financial perspective, then cleaned and processed through both sentiment analyzers. Performance was evaluated using accuracy, sensitivity, specificity, and neutral specificity. Experimental results reveal that VADER outperforms TextBlob on both datasets, and that both models perform better on headlines from the USA than from Nepal. These findings are further validated through statistical tests.
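As an illustration of the evaluation the abstract describes, here is a minimal sketch of the three-class metrics. The exact definitions used by the authors are not given in the abstract, so the mapping below (sensitivity = recall on the positive class, specificity = recall on the negative class, neutral specificity = recall on the neutral class) is an assumption; the toy labels are hypothetical.

```python
def evaluate(y_true, y_pred):
    """Accuracy plus per-class recall for a 3-class sentiment task.

    Assumed (hypothetical) definitions: sensitivity = recall on the
    positive class, specificity = recall on the negative class, and
    neutral specificity = recall on the neutral class.
    """
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    def recall(label):
        relevant = [(t, p) for t, p in pairs if t == label]
        return sum(t == p for t, p in relevant) / len(relevant)

    return {
        "accuracy": accuracy,
        "sensitivity": recall("positive"),
        "specificity": recall("negative"),
        "neutral_specificity": recall("neutral"),
    }

# Hypothetical manual labels vs. analyzer output for six headlines
truth = ["positive", "negative", "neutral", "positive", "negative", "neutral"]
pred  = ["positive", "negative", "positive", "positive", "neutral", "neutral"]
metrics = evaluate(truth, pred)
```

The same function can score VADER and TextBlob predictions side by side on both country datasets.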
Mediation analysis plays an important role in many research fields, yet it is very challenging to perform estimation and hypothesis testing for high-dimensional mediation effects. We develop a user-friendly $\mathsf{R}$ package HIMA for high-dimensional mediation analysis with varying mediator and outcome specifications. The HIMA package is a comprehensive tool that accommodates various types of high-dimensional mediation models. This paper offers an overview of the functions within HIMA and demonstrates the practical utility of HIMA through simulated datasets. The HIMA package is publicly available from the Comprehensive $\mathsf{R}$ Archive Network at https://CRAN.R-project.org/package=HIMA.
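To make the quantity being estimated concrete, here is a sketch of the classical single-mediator product-of-coefficients estimate that high-dimensional methods like those in HIMA generalize. This is not the HIMA package or its penalized estimation procedure, just the textbook building block, on simulated data with hypothetical coefficients.

```python
import random

random.seed(42)
n = 500
# Toy single-mediator model X -> M -> Y with a direct X -> Y path
a_true, b_true, c_true = 0.5, 0.7, 0.3
X = [random.gauss(0, 1) for _ in range(n)]
M = [a_true * x + random.gauss(0, 0.3) for x in X]
Y = [b_true * m + c_true * x + random.gauss(0, 0.1) for m, x in zip(M, X)]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# OLS slope of M on X (no intercept; variables are mean-zero by design)
a_hat = dot(X, M) / dot(X, X)

# Two-predictor OLS of Y on (M, X) via the normal equations (Cramer's rule)
Smm, Sxx, Smx = dot(M, M), dot(X, X), dot(M, X)
Smy, Sxy = dot(M, Y), dot(X, Y)
det = Smm * Sxx - Smx ** 2
b_hat = (Smy * Sxx - Smx * Sxy) / det   # mediator effect on outcome
c_hat = (Smm * Sxy - Smx * Smy) / det   # direct effect of exposure

# Mediation (indirect) effect: product-of-coefficients estimate
indirect = a_hat * b_hat
```

With many candidate mediators, estimating and testing such indirect effects requires the penalization and multiplicity corrections that HIMA provides.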
We propose a Bayesian Negative Binomial-Bernoulli model to jointly analyze the patterns behind field goal attempts and the factors influencing shot success. We apply nonnegative CANDECOMP/PARAFAC tensor decomposition to study shot patterns and use logistic regression to predict successful shots. To maintain the conditional conjugacy of the model, we employ a double Pólya-Gamma data augmentation scheme and devise an efficient variational inference algorithm for estimation. The model is applied to shot chart data from the National Basketball Association, focusing on the regular seasons from 2015–16 to 2022–23. We consistently identify three latent features in shot patterns across all seasons and verify a popular claim from recent years about the increasing importance of three-point shots. Additionally, we find that the home court advantage in field goal accuracy disappears in the 2020–21 regular season, which was the only full season under strict COVID-19 crowd control, aside from the short bubble period in 2019–20. This finding contributes to the literature on the influence of crowd effects on home advantage in basketball games.
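The shot-success component of the model above is, at its core, a logistic regression. The following sketch fits a plain logistic model by gradient ascent on simulated shots; it is not the paper's Pólya-Gamma variational scheme, and the covariates and coefficients (scaled shot distance, a home-court indicator) are hypothetical.

```python
import math
import random

random.seed(7)
# Toy shot data: success depends on scaled distance and a home indicator
beta_true = [1.0, -2.4, 0.3]   # intercept, distance in [0,1], home
rows, labels = [], []
for _ in range(500):
    d = random.uniform(0, 1)          # shot distance, rescaled to [0, 1]
    home = random.choice([0, 1])
    x = [1.0, d, home]
    p = 1 / (1 + math.exp(-sum(b * xi for b, xi in zip(beta_true, x))))
    rows.append(x)
    labels.append(1 if random.random() < p else 0)

# Gradient ascent on the logistic log-likelihood (averaged gradient)
beta = [0.0, 0.0, 0.0]
lr = 1.0
for _ in range(1200):
    grad = [0.0, 0.0, 0.0]
    for x, y in zip(rows, labels):
        p = 1 / (1 + math.exp(-sum(b * xi for b, xi in zip(beta, x))))
        for j in range(3):
            grad[j] += (y - p) * x[j]
    beta = [b + lr * g / len(rows) for b, g in zip(beta, grad)]
```

The paper's contribution is to embed this success model in a joint tensor-decomposition framework with conjugacy-preserving data augmentation; the sketch only shows the regression piece.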
Detecting illicit transactions in Anti-Money Laundering (AML) systems remains a significant challenge due to class imbalances and the complexity of financial networks. This study introduces the Multiple Aggregations for Graph Isomorphism Network with Custom Edges (MAGIC) convolution, an enhancement of the Graph Isomorphism Network (GIN) designed to improve the detection of illicit transactions in AML systems. MAGIC integrates edge convolution (GINE Conv) and multiple learnable aggregations, allowing for varied embedding sizes and increased generalization capabilities. Experiments were conducted using synthetic datasets, which simulate real-world transactions, following the experimental setup of previous studies to ensure comparability. MAGIC, when combined with XGBoost as a link predictor, outperformed existing models in 16 out of 24 metrics, with notable improvements in F1 scores and precision. In the most imbalanced dataset, MAGIC achieved an F1 score of 82.6% and a precision of 90.4% for the illicit class. While MAGIC demonstrated high precision, its recall was lower or comparable to the other models, indicating potential areas for future enhancement. Overall, MAGIC presents a robust approach to AML detection, particularly in scenarios where precision and overall quality are critical. Future research should focus on optimizing the model’s recall, potentially by incorporating additional regularization techniques or advanced sampling methods. Additionally, exploring the integration of foundation models like GraphAny could further enhance the model’s applicability in diverse AML environments.
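The "multiple learnable aggregations" idea can be sketched in a few lines: each node combines its own features with sum-, mean-, and max-aggregated neighbor features before a learned update. This toy version omits the edge features and MLPs of a real GINE-style layer, and the graph and features below are hypothetical.

```python
# Toy undirected graph (adjacency lists) with 2-d node features
adj = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
feat = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [2.0, 2.0], 3: [1.0, 1.0]}

def aggregate(node):
    """Concatenate self features with sum/mean/max neighbor aggregations."""
    neigh = [feat[v] for v in adj[node]]
    dim = len(feat[node])
    s = [sum(f[d] for f in neigh) for d in range(dim)]
    m = [s[d] / len(neigh) for d in range(dim)]
    mx = [max(f[d] for f in neigh) for d in range(dim)]
    # In a MAGIC-style layer, a learnable MLP with per-aggregation weights
    # would map this concatenation back to the embedding size.
    return feat[node] + s + m + mx

emb = {v: aggregate(v) for v in adj}
```

Offering the downstream link predictor (here, XGBoost in the paper) several aggregation views at once is what allows the varied embedding sizes and extra generalization the abstract describes.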
A challenge that data scientists face is building an analytic product that is useful and trustworthy for a given audience. Previous work defined a set of principles for describing data analyses that can be used both to create an analysis and to characterize the variation between analyses. Here, we introduce the concept of the alignment of a data analysis, a relationship between the data analyst and an audience. We define an aligned data analysis as one in which the principles of the analyst match those of the audience for whom the analysis is developed. In this paper, we propose a model for evaluating the alignment of a data analysis and describe some of its properties. More generally, we argue that this framework provides a language for characterizing alignment and can serve as a guide for practicing data scientists in building better data products.
Pub. online: 11 Jun 2025 | Type: Statistical Data Science | Open Access
Journal: Journal of Data Science
Volume 23, Issue 3 (2025): Special Issue: 2024 WNAR/IMS/Graybill Annual Meeting, pp. 542–559
Abstract
After the onset of the COVID-19 pandemic, scientific interest in coronaviruses endemic in animal populations has increased dramatically. However, investigating the prevalence of disease in animal populations across the landscape, which requires finding and capturing animals, can be difficult. Spatial random sampling over a grid can be extremely inefficient because animals can be hard to locate, and the total number of samples may be small. Alternatively, preferential sampling, which uses existing knowledge to inform sample locations, can guarantee larger numbers of samples, but estimates derived from this sampling scheme may be biased if there is a relationship between higher-probability sampling locations and disease prevalence. Sample specimens are also commonly grouped and tested in pools, which adds a further challenge when combined with preferential sampling. Here we present a Bayesian method for estimating disease prevalence from pooled presence-absence data under preferential sampling, motivated by the study of factors related to coronavirus infection among Mexican free-tailed bats (Tadarida brasiliensis) in California. We demonstrate the efficacy of our approach in a simulation study: a naive model that does not account for preferential sampling returns biased parameter estimates, whereas our model returns unbiased results regardless of the degree of preferential sampling. We then apply our model framework to data from California to estimate factors related to coronavirus prevalence. After accounting for the impacts of preferential sampling, our model suggests small differences in prevalence between male and female bats.
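The link between pooled presence-absence outcomes and individual-level prevalence rests on the identity P(pool positive) = 1 − (1 − p)^k for a pool of k specimens. The sketch below computes the maximum-likelihood prevalence from hypothetical pool data by grid search; it ignores the preferential-sampling correction and covariates of the full Bayesian model.

```python
import math

# Hypothetical pools of specimens: (pool_size, pool_tested_positive)
pools = [(5, 1), (5, 0), (10, 1), (8, 0), (6, 1), (12, 1), (4, 0), (5, 0)]

def log_lik(p):
    """Log-likelihood of prevalence p under independent pooled tests:
    a pool of size k is positive with probability 1 - (1 - p)^k."""
    ll = 0.0
    for k, positive in pools:
        if positive:
            ll += math.log(1 - (1 - p) ** k)
        else:
            ll += k * math.log(1 - p)
    return ll

# Maximum-likelihood estimate of prevalence by grid search over (0, 1)
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=log_lik)
```

In the paper's hierarchical model this likelihood is embedded in a Bayesian framework that additionally models where samples were preferentially taken.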
Pub. online: 11 Jun 2025 | Type: Statistical Data Science | Open Access
Journal: Journal of Data Science
Volume 23, Issue 3 (2025): Special Issue: 2024 WNAR/IMS/Graybill Annual Meeting, pp. 499–520
Abstract
The rapidly expanding field of metabolomics is an invaluable resource for understanding the associations between metabolites and various diseases. However, the high dimensionality, missing values, and measurement errors associated with metabolomics data present challenges for developing reliable and reproducible disease association studies, creating a compelling need for statistical analyses that are robust to these complexities. In this paper, we construct algorithms that perform variable selection for noisy data and control the False Discovery Rate when selecting mutual metabolomic predictors for multiple disease outcomes. We illustrate the versatility and performance of this procedure in a variety of scenarios involving missing data and measurement errors. As a specific application of this novel methodology, we target two of the most prevalent cancers among US women: breast cancer and colorectal cancer. Applying our method to the Women's Health Initiative data, we successfully identify metabolites associated with either or both of these cancers, demonstrating the practical utility of our method for identifying consistent risk factors and understanding shared mechanisms between diseases.
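For readers unfamiliar with False Discovery Rate control, the classic baseline is the Benjamini-Hochberg step-up procedure sketched below. The abstract's method is more elaborate (it handles noise, missingness, and multiple outcomes jointly), so this is only the standard building block, applied to hypothetical p-values.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Sorts the p-values, finds the largest rank k with
    p_(k) <= q * k / m, and returns the indices of the k discoveries.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= q * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])

# Hypothetical p-values for ten candidate metabolite-disease associations
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.5, 0.9]
hits = benjamini_hochberg(pvals, q=0.05)
```

Note that the step-up rule can admit a p-value whose own threshold fails, as long as some larger rank passes, which is why the loop keeps the largest qualifying rank rather than stopping early.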
We propose to explore high-dimensional data with categorical outcomes by generalizing the penalized orthogonal-components regression method (POCRE), a supervised dimension reduction method originally proposed for high-dimensional linear regression. The generalized POCRE, gPOCRE, sequentially builds orthogonal components by selecting predictors that maximally explain the variation in the response variables. gPOCRE therefore simultaneously selects significant predictors and reduces dimension by constructing linear components from the selected predictors for a high-dimensional generalized linear model. For multiple categorical outcomes, gPOCRE can also construct common components shared by all outcomes to improve the power to select variables shared across outcomes. Both simulation studies and a real data analysis illustrate the performance of gPOCRE.
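The sequential-orthogonal-components idea can be illustrated in miniature: form a component as a response-weighted combination of predictors, then deflate the predictors (project out the component) so the next component is orthogonal to it. This sketch mimics only the flavor of POCRE's supervised components, not its penalized estimation or the GLM extension; the data and weighting are hypothetical.

```python
import random

random.seed(1)
n, p = 200, 5
# Toy data: response driven by the first two of five predictors
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [row[0] + 0.5 * row[1] + random.gauss(0, 0.3) for row in X]

def cov(u, v):
    # Uncentered covariance; the simulated data are approximately mean-zero
    return sum(ui * vi for ui, vi in zip(u, v)) / len(u)

def component(Xc, yc):
    """Linear component weighting each predictor by its covariance with y."""
    cols = list(zip(*Xc))
    w = [cov(c, yc) for c in cols]
    return [sum(wj * xj for wj, xj in zip(w, row)) for row in Xc]

t1 = component(X, y)

# Deflate: remove each predictor column's projection onto t1, so the
# next component is orthogonal to t1 by construction
t1_norm = sum(v * v for v in t1)
deflated_cols = []
for c in zip(*X):
    proj = sum(ci * ti for ci, ti in zip(c, t1)) / t1_norm
    deflated_cols.append([ci - proj * ti for ci, ti in zip(c, t1)])
Xd = [list(row) for row in zip(*deflated_cols)]
t2 = component(Xd, y)

orth = sum(a * b for a, b in zip(t1, t2))  # inner product, ~0 by design
```

In gPOCRE, the weights are additionally penalized so that only a sparse set of predictors enters each component, which is what performs the variable selection.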