
Journal of Data Science

Search results 892


Rethinking Attention Weights as Bidirectional Coefficients

Yuxiang Huang Hanfang Yang Xingrui Wang

https://doi.org/10.6339/24-JDS1134

Pub. online: 14 Nov 2024 Type: Statistical Data Science

Open Access

Journal: Journal of Data Science

Abstract

The attention mechanism has become an almost ubiquitous model architecture in deep learning. One of its distinctive features is that it computes a non-negative probability distribution to re-weight input representations. This work reconsiders attention weights as bidirectional coefficients rather than probabilistic measures, for potential benefits in interpretability and representational capacity. After analyzing the iteration process of attention scores through backward gradient propagation, we propose a novel activation function, TanhMax, which possesses several favorable properties that satisfy the requirements of bidirectional attention. We conduct a battery of experiments on both text and image datasets to validate our analyses and the advantages of the proposed method. The results show that bidirectional attention is effective in revealing the semantics of input units, presenting more interpretable explanations, and increasing the expressive power of attention-based models.

Bibliographical Connections for Semiparametric Analysis in Case-Control Studies on Gene-Environment Interactions

Tianying Wang Jianxuan Liu Aijing Wu

https://doi.org/10.6339/24-JDS1155

Pub. online: 16 Oct 2024 Type: Data Science Reviews

Open Access

Journal: Journal of Data Science Volume 23, Issue 3 (2025): Special Issue: 2024 WNAR/IMS/Graybill Annual Meeting, pp. 454–469

Abstract

Analyzing the gene-environment interaction (GEI) is crucial for understanding the etiology of many complex traits. Among various types of study designs, case-control studies are popular for analyzing gene-environment interactions due to their efficiency in collecting covariate information. Extensive literature explores efficient estimation under various assumptions made about the relationship between genetic and environmental variables. In this paper, we comprehensively review the methods based on or related to the retrospective likelihood, including the methods based on the hypothetical population concept, which has been largely overlooked in GEI research in the past decade. Furthermore, we establish the methodological connection between these two groups of methods by deriving a new estimator from both the retrospective likelihood and the hypothetical population perspectives. The validity of the derivation is demonstrated through numerical studies.

Is Augmentation Effective in Improving Prediction in Imbalanced Datasets?

Gabriel O. Assunção Rafael Izbicki Marcos O. Prates

https://doi.org/10.6339/24-JDS1154

Pub. online: 15 Oct 2024 Type: Statistical Data Science

Open Access

Journal: Journal of Data Science

Abstract

Imbalanced datasets present a significant challenge for machine learning models, often leading to biased predictions. To address this issue, data augmentation techniques are widely used to generate new samples for the minority class. However, in this paper, we challenge the common assumption that data augmentation is necessary to improve predictions on imbalanced datasets. Instead, we argue that adjusting the classifier cutoffs without data augmentation can produce similar results to oversampling techniques. Our study provides theoretical and empirical evidence to support this claim. Our findings contribute to a better understanding of the strengths and limitations of different approaches to dealing with imbalanced data, and help researchers and practitioners make informed decisions about which methods to use for a given task.
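The cutoff adjustment described in the abstract above can be illustrated with a minimal Python sketch. It does not reproduce the authors' procedure: the synthetic scores, the balanced-accuracy criterion, and the helper names `balanced_accuracy` and `best_cutoff` are all assumptions made for illustration. The idea shown is only that, on an imbalanced sample, the decision cutoff can be tuned directly instead of resampling the training data.

```python
import random


def balanced_accuracy(y_true, y_score, cutoff):
    """Mean of sensitivity and specificity when predicting 1 for score >= cutoff."""
    tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= cutoff)
    tn = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s < cutoff)
    pos = sum(y_true)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)


def best_cutoff(y_true, y_score, grid=None):
    """Pick the cutoff on a grid that maximizes balanced accuracy (hypothetical criterion)."""
    grid = grid or [i / 100 for i in range(1, 100)]
    return max(grid, key=lambda c: balanced_accuracy(y_true, y_score, c))


# Synthetic, assumed data: roughly 5% positives, whose scores sit well below 0.5,
# as often happens when a classifier is trained on an imbalanced sample.
rng = random.Random(0)
y_true = [1 if rng.random() < 0.05 else 0 for _ in range(2000)]
y_score = [
    min(1.0, max(0.0, rng.gauss(0.30 if y == 1 else 0.05, 0.08)))
    for y in y_true
]

cutoff = best_cutoff(y_true, y_score)
print("default cutoff 0.5:", balanced_accuracy(y_true, y_score, 0.5))
print("tuned cutoff", cutoff, ":", balanced_accuracy(y_true, y_score, cutoff))
```

In practice the cutoff would be chosen on held-out data and with whatever loss matters for the task; balanced accuracy is used here only because it is simple to compute.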

Efficient UCB-Based Assignment Algorithm Under Unknown Utility with Application in Mentor-Mentee Matching

Yuyang Shi Yajun Mei

https://doi.org/10.6339/24-JDS1151

Pub. online: 24 Sep 2024 Type: Statistical Data Science

Open Access

Journal: Journal of Data Science

Abstract

The assignment problem, crucial in various real-world applications, involves optimizing the allocation of agents to tasks for maximum utility. While it has been well studied in the optimization literature when the underlying utilities between all agent-task pairs are known, research is sparse when the utilities are unknown and need to be learned from data on the fly. This paper addresses this gap, motivated by mentor-mentee matching programs at many U.S. universities. We develop an efficient sequential assignment algorithm that aims to nearly maximize the overall utility simultaneously over different time periods. Our proposed algorithm uses stochastic bandit feedback to adaptively estimate the unknown utilities through linear regression models, integrating the Upper Confidence Bound (UCB) algorithm from the multi-armed bandit problem with the Hungarian algorithm from the assignment problem. We provide theoretical bounds for both the estimation error and the total regret of our algorithm. Numerical studies are also conducted to demonstrate its practical effectiveness.
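The combination of optimistic utility estimates with an assignment solver, as outlined in the abstract above, can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm: utilities are estimated with per-pair sample means rather than linear regression models, the Hungarian algorithm is replaced by a brute-force search over permutations (practical only for very small problems), and the names `best_assignment`, `ucb_assignment`, and `UNEXPLORED_BONUS` are hypothetical.

```python
import itertools
import math
import random

# Large finite score so that every agent-task pair is tried at least once (assumption).
UNEXPLORED_BONUS = 1e6


def best_assignment(score):
    """Brute-force the one-to-one assignment with the highest total score (small n only)."""
    n = len(score)
    return max(
        itertools.permutations(range(n)),
        key=lambda perm: sum(score[i][perm[i]] for i in range(n)),
    )


def ucb_assignment(true_utility, rounds=300, noise=0.1, seed=0):
    """Sequentially assign agents to tasks using UCB scores on unknown utilities."""
    rng = random.Random(seed)
    n = len(true_utility)
    counts = [[0] * n for _ in range(n)]
    means = [[0.0] * n for _ in range(n)]
    for t in range(1, rounds + 1):
        # Optimistic score: sample mean plus an exploration bonus that shrinks with data.
        ucb = [
            [
                UNEXPLORED_BONUS
                if counts[i][j] == 0
                else means[i][j] + math.sqrt(2.0 * math.log(t) / counts[i][j])
                for j in range(n)
            ]
            for i in range(n)
        ]
        perm = best_assignment(ucb)
        # Bandit feedback: only the utilities of the chosen pairs are observed, with noise.
        for i, j in enumerate(perm):
            reward = true_utility[i][j] + rng.gauss(0.0, noise)
            counts[i][j] += 1
            means[i][j] += (reward - means[i][j]) / counts[i][j]
    return best_assignment(means), counts


# Assumed toy utilities: agent i is best matched with task i.
utility = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.2],
    [0.1, 0.3, 0.7],
]
learned, counts = ucb_assignment(utility)
print("learned assignment:", learned)
print("optimal assignment:", best_assignment(utility))
```

For realistic problem sizes the brute-force step would be swapped for a polynomial-time assignment solver, and a shared regression model would let pairs borrow strength from one another, which is the setting the abstract describes.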

Variable Importance Measures for Multivariate Random Forests

Sharmistha Sikdar Giles Hooker Vrinda Kadiyali

https://doi.org/10.6339/24-JDS1152

Pub. online: 18 Sep 2024 Type: Statistical Data Science

Open Access

Journal: Journal of Data Science Volume 23, Issue 1 (2025), pp. 243–263

Abstract

Multivariate random forests (or MVRFs) are an extension of tree-based ensembles to examine multivariate responses. MVRF can be particularly helpful where some of the responses exhibit sparse (e.g., zero-inflated) distributions, making borrowing strength from correlated features attractive. Tree-based algorithms select features using variable importance measures (VIMs) that score each covariate based on the strength of dependence of the model on that variable. In this paper, we develop and propose new VIMs for MVRFs. Specifically, we focus on the variable’s ability to achieve split improvement, i.e., the difference in the responses between the left and right nodes obtained after splitting the parent node, for a multivariate response. Our proposed VIMs are an improvement over the default naïve VIM in existing software and allow us to investigate the strength of dependence both globally and on a per-response basis. Our simulation studies show that our proposed VIM recovers the true predictors better than naïve measures. We demonstrate usage of the VIMs for variable selection in two empirical applications; the first is on Amazon Marketplace data to predict Buy Box prices of multiple brands in a category, and the second is on ecology data to predict co-occurrence of multiple, rare bird species. A feature of both data sets is that some outcomes are sparse — exhibiting a substantial proportion of zeros or fixed values. In both cases, the proposed VIMs when used for variable screening give superior predictive accuracy over naïve measures.

Data Science Principles for Interpretable and Explainable AI

Kris Sankaran

https://doi.org/10.6339/24-JDS1150

Pub. online: 18 Sep 2024 Type: Data Science Reviews

Open Access

Journal: Journal of Data Science

Abstract

Society’s capacity for algorithmic problem-solving has never been greater. Artificial Intelligence is now applied across more domains than ever, a consequence of powerful abstractions, abundant data, and accessible software. As capabilities have expanded, so have risks, with models often deployed without fully understanding their potential impacts. Interpretable and interactive machine learning aims to make complex models more transparent and controllable, enhancing user agency. This review synthesizes key principles from the growing literature in this field. We first introduce precise vocabulary for discussing interpretability, like the distinction between glass box and explainable models. We then explore connections to classical statistical and design principles, like parsimony and the gulfs of interaction. Basic explainability techniques – including learned embeddings, integrated gradients, and concept bottlenecks – are illustrated with a simple case study. We also review criteria for objectively evaluating interpretability approaches. Throughout, we underscore the importance of considering audience goals when designing interactive data-driven systems. Finally, we outline open challenges and discuss the potential role of data science in addressing them. Code to reproduce all examples can be found at https://go.wisc.edu/3k1ewe.

Evaluation of Text Cluster Naming with Generative Large Language Models

Alexander J. Preiss Caren A. Arbeit Anthony Berghammer All authors (9)

https://doi.org/10.6339/24-JDS1149

Pub. online: 26 Aug 2024 Type: Data Science In Action

Open Access

Journal: Journal of Data Science Volume 22, Issue 3 (2024): Special issue: The Government Advances in Statistical Programming (GASP) 2023 conference, pp. 376–392

Abstract

Text clustering can streamline many labor-intensive tasks, but it creates a new challenge: efficiently labeling and interpreting the clusters. Generative large language models (LLMs) are a promising option to automate the process of naming text clusters, which could significantly streamline workflows, especially in domains with large datasets and esoteric language. In this study, we assessed the ability of GPT-3.5-turbo to generate names for clusters of texts and compared these to human-generated text cluster names. We clustered two benchmark datasets, each from a specialized domain: research abstracts and clinical patient notes. We generated names for each cluster using four prompting strategies (different ways of including information about the cluster in the prompt used to get LLM responses). For both datasets, the best prompting strategy beat the manual approach across all quality domains. However, name quality varied by prompting strategy and dataset. We conclude that practitioners should consider trying automated cluster naming to avoid bottlenecks or when the scale of the effort is enough to take advantage of the cost savings offered by automation, as detailed in our supplemental blueprint for using LLM cluster naming. However, to get the best performance, it is vital to test a variety of prompting strategies and perform a small test to identify which one performs best on each project’s unique data.

Introduction to the GASP Special Issue

Lisa M. Frehill Peter B. Meyer

https://doi.org/10.6339/24-JDS223EDI

Pub. online: 26 Aug 2024 Type: Editorial

Open Access

Journal: Journal of Data Science Volume 22, Issue 3 (2024): Special issue: The Government Advances in Statistical Programming (GASP) 2023 conference, pp. 353–355

Evaluating a Method for Georeferencing Agricultural Fields

Robert L. Emmet Kevin Hunt Rachael Jennings All authors (5)

https://doi.org/10.6339/24-JDS1146

Pub. online: 9 Aug 2024 Type: Data Science In Action

Open Access

Journal: Journal of Data Science Volume 22, Issue 3 (2024): Special issue: The Government Advances in Statistical Programming (GASP) 2023 conference, pp. 423–435

Abstract

The US Department of Agriculture (USDA) National Agricultural Statistics Service (NASS) has begun a modernization effort to supplement survey data with non-survey data to improve estimation of agricultural quantities. As part of this effort, NASS has begun georeferencing farms on its list frame by linking geospatial data on agricultural fields with farm records on the list frame. Although many farms can be linked to geospatial data acquired by the Farm Service Agency (FSA), this linkage is not possible for farmers who do not participate in FSA programs, which may include members of some underrepresented groups in US agriculture. Thus, NASS has developed a georeferencing process for non-FSA farms, combining automatic and manual field identification, county assessor parcel data, record linkage, and classification surveys. This process serves the dual purpose of linking farms already on the list frame to geospatial data sources and identifying new farms to add to NASS’s list frame. This report evaluates the output of the non-FSA georeferencing process for 11 states, with a focus on farms added to the list frame via georeferencing. Substantial percentages (>25% for each category) of the new farms added via georeferencing were urban or suburban farms, were small, had livestock, or were in counties with Amish settlements. The georeferencing process shows promise for adding farms from these groups, which have historically been less well covered in NASS surveys.

A Generalized Class of Exponentiated Modified Weibull Distribution With Applications

Shusen Pu Broderick O. Oluyede Yuqi Qiu All authors (4)

https://doi.org/10.6339/JDS.201610_14(4).0002

Pub. online: 8 Aug 2024 Type: Research Article

Journal: Journal of Data Science Volume 14, Issue 4 (2016), pp. 585–614

Abstract

In this paper, a new class of five-parameter gamma-exponentiated or generalized modified Weibull (GEMW) distributions is proposed and studied. It includes the exponential, Rayleigh, Weibull, modified Weibull, exponentiated Weibull, exponentiated exponential, exponentiated modified Weibull, exponentiated modified exponential, gamma-exponentiated exponential, gamma-exponentiated Rayleigh, gamma-modified Weibull, gamma-modified exponential, gamma-Weibull, gamma-Rayleigh and gamma-exponential distributions as special cases. Mathematical properties of this new class of distributions, including moments, mean deviations, Bonferroni and Lorenz curves, the distribution of order statistics and Rényi entropy, are presented. The maximum likelihood estimation technique is used to estimate the model parameters, and applications to real data sets are presented to illustrate the usefulness of this new class of distributions and its sub-models.


Journal of data science

Online ISSN: 1683-8602
Print ISSN: 1680-743X


Contact us

JDS@ruc.edu.cn
No. 59 Zhongguancun Street, Haidian District Beijing, 100872, P.R. China