Pub. online: 2 Feb 2023 | Type: Statistical Data Science | Open Access
Journal: Journal of Data Science
Volume 21, Issue 2 (2023): Special Issue: Symposium Data Science and Statistics 2022, pp. 391–411
Abstract
Traditional methods for evaluating a potential treatment have focused on the average treatment effect. However, there exist situations where individuals can experience significantly heterogeneous responses to a treatment. In these situations, one needs to account for the differences among individuals when estimating the treatment effect. Li et al. (2022) proposed a method based on a random forest of interaction trees (RFIT) for a binary or categorical treatment variable, incorporating the propensity score in the construction of the random forest. Motivated by the need to evaluate the effect of tutoring sessions at a Math and Stat Learning Center (MSLC), we extend their approach to an ordinal treatment variable. Our approach improves upon RFIT for multiple treatments by incorporating the ordered structure of the treatment variable into the tree-growing process. To illustrate the effectiveness of our proposed method, we conduct simulation studies whose results show that our method achieves a lower mean squared error and a higher rate of optimal treatment classification, and is able to identify the most important variables that impact the treatment effect. We then apply the proposed method to estimate how the number of visits to the MSLC affects an individual student’s probability of passing an introductory statistics course. Our results show that every student is recommended to visit the MSLC at least once, and some can drastically improve their chance of passing the course by going the optimal number of times suggested by our analysis.
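As a rough illustration of estimating individual-level responses across ordered treatment levels, the sketch below fits a single random forest on covariates plus the treatment level and compares predictions across levels. This is a generic S-learner-style sketch on simulated data, not the RFIT procedure with propensity scores developed in the paper; all variable names and effect sizes are hypothetical.

```python
# Illustrative sketch only: a simple forest-based S-learner, not the RFIT
# procedure of Li et al. (2022). Data, covariates, and treatment levels
# are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))                      # student covariates
t = rng.integers(0, 4, size=n)                   # ordinal treatment: 0-3 visits
# outcome depends on covariates and (heterogeneously) on treatment level
y = 0.5 * X[:, 0] + 0.3 * t * (X[:, 1] > 0) + rng.normal(scale=0.5, size=n)

# fit one forest on covariates plus the (ordered) treatment level
model = RandomForestRegressor(n_estimators=500, min_samples_leaf=20, random_state=0)
model.fit(np.column_stack([X, t]), y)

# individual-level effect curves: predicted outcome at each treatment level
levels = np.arange(4)
preds = np.column_stack(
    [model.predict(np.column_stack([X, np.full(n, k)])) for k in levels]
)
optimal_level = levels[preds.argmax(axis=1)]     # per-student recommended dose
```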
Pub. online: 26 Jan 2023 | Type: Data Science In Action | Open Access
Journal: Journal of Data Science
Volume 21, Issue 2 (2023): Special Issue: Symposium Data Science and Statistics 2022, pp. 225–238
Abstract
A text-based, bag-of-words model was developed to identify drone company websites for multiple European countries in different languages. A collection of Spanish drone and non-drone websites was used for initial model development. Various classification methods were compared. Supervised logistic regression (L2-norm) performed best, with an accuracy of 87% on the unseen test set. The accuracy of this model improved to 88% when it was trained on texts in which all Spanish words were translated into English. Retraining the model on texts from which all typically Spanish words, such as names of cities and regions, and words indicative of specific periods in time, such as the months of the year and days of the week, were removed did not affect the overall performance of the model and made it more generally applicable. Applying the cleaned, completely English-word-based model to a collection of Irish and Italian drone and non-drone websites revealed, after manual inspection, that it was able to detect drone websites in those countries with accuracies of 82% and 86%, respectively. The classification of Italian texts required the creation of a translation list in which all 1560 English word-based features in the model were translated to their Italian analogs. Because the model had a very high recall (93%, 100%, and 97% on Spanish, Irish, and Italian drone websites, respectively), it was particularly well suited to selecting potential drone websites in large collections of websites.
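For readers unfamiliar with this kind of pipeline, a minimal sketch of bag-of-words features followed by L2-regularized logistic regression (using scikit-learn) is shown below. The toy texts and labels are hypothetical placeholders, not the authors' website corpus, and the hyperparameters are defaults rather than the tuned values from the paper.

```python
# Minimal sketch of a bag-of-words + L2-regularized logistic regression
# classifier; toy texts and labels are invented placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score

texts = [
    "professional drone mapping and aerial photography services",
    "buy quadcopters, spare propellers and flight controllers",
    "family-run hotel in the countryside with free breakfast",
    "second-hand cars, financing available, visit our showroom",
] * 25                      # repeat toy examples to mimic a larger corpus
labels = [1, 1, 0, 0] * 25  # 1 = drone website, 0 = non-drone website

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# bag-of-words features followed by L2-penalized logistic regression
clf = make_pipeline(
    CountVectorizer(lowercase=True, min_df=1),
    LogisticRegression(penalty="l2", C=1.0, max_iter=1000),
)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
print("recall on drone sites:", recall_score(y_test, pred, pos_label=1))
```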
Pub. online: 25 Jan 2023 | Type: Statistical Data Science | Open Access
Journal: Journal of Data Science
Volume 21, Issue 2 (2023): Special Issue: Symposium Data Science and Statistics 2022, pp. 368–390
Abstract
The potential weight of accumulated snow on the roof of a structure has long been an important consideration in structural design. However, the historical approach of modeling the weight of snow on structures is ill-suited for structures with surfaces and geometry where snow is expected to slide off, such as standalone solar panels. This paper proposes a “storm-level” adaptation of previous structure-related snow studies that is designed to estimate short-term, rather than season-long, accumulations of snow water equivalent, or snow load. One key development of this paper is a climate-driven random forest model that imputes missing snow water equivalent values at stations that measure only snow depth, in order to produce continuous snow load records. Additionally, the paper compares six different approaches to extreme value estimation of short-term snow accumulations. The results of this study indicate that, when considering the 50-year mean recurrence interval (MRI) for short-term snow accumulations across different weather station types, the traditional block maxima approach, the mean-adjusted quantile method with a gamma distribution, and the Bayesian peak-over-threshold approach tend most often to provide MRI estimates near the median of all six approaches considered in this study. Further, this paper also shows, via bootstrap simulation, that peak-over-threshold extreme value estimation using automatic threshold selection tends to have higher variance than the other approaches considered. The results suggest that there is no one-size-fits-all option for extreme value estimation of short-term snow accumulations, but they highlight the potential value of integrating multiple extreme value estimation approaches.
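As a pointer to what one of the six approaches looks like in practice, the sketch below fits a generalized extreme value distribution to simulated block (annual) maxima and reads off a 50-year return level. The simulated data, units, and distribution parameters are illustrative assumptions, not values from the paper.

```python
# Sketch of a block-maxima GEV fit and a 50-year return level; the data
# and parameters are hypothetical, not the paper's snow load records.
import numpy as np
from scipy import stats

# hypothetical annual maxima of storm-level snow water equivalent (kPa)
annual_maxima = stats.gumbel_r.rvs(loc=1.2, scale=0.4, size=60, random_state=1)

# fit a generalized extreme value distribution to the block (annual) maxima
shape, loc, scale = stats.genextreme.fit(annual_maxima)

# 50-year mean recurrence interval: the 1 - 1/50 quantile of the fitted GEV
mri_50 = stats.genextreme.ppf(1 - 1 / 50, shape, loc=loc, scale=scale)
print(f"estimated 50-year snow load: {mri_50:.2f} kPa")
```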
Bayesian methods provide direct uncertainty quantification in functional data analysis applications without reliance on bootstrap techniques. A major tool in functional data applications is functional principal component analysis, which decomposes the data around a common mean function and identifies leading directions of variation. Bayesian functional principal components analysis (BFPCA) provides uncertainty quantification for the estimated functional model components via the posterior samples obtained. We propose central posterior envelopes (CPEs) for BFPCA based on functional depth as a descriptive visualization tool that summarizes variation in the posterior samples of the estimated functional model components, contributing to uncertainty quantification in BFPCA. The proposed BFPCA relies on a latent factor model and targets model parameters within a hierarchical modeling framework using modified multiplicative gamma process shrinkage priors on the variance components. Functional depth provides a center-outward ordering of a sample of functions. We utilize modified band depth and modified volume depth, for ordering samples of functions and surfaces respectively, to arrive at CPEs of the mean and eigenfunctions within the BFPCA framework. The proposed CPEs are showcased in extensive simulations. Finally, they are applied to the analysis of a sample of power spectral densities from resting-state electroencephalography, where they lead to novel insights on diagnostic group differences, across age, between children diagnosed with autism spectrum disorder and their typically developing peers.
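The ordering step can be illustrated with a small sketch: given posterior draws of a mean function, modified band depth (with bands formed by pairs of curves) ranks the draws from center outward, and the pointwise envelope of the deepest half gives a 50% central envelope. The draws below are simulated stand-ins, not output of the BFPCA model described above.

```python
# Illustrative sketch: order simulated posterior draws of a mean function by
# modified band depth and take the pointwise envelope of the deepest 50%.
import numpy as np

rng = np.random.default_rng(2)
S, T = 100, 50                                   # posterior draws x grid points
grid = np.linspace(0, 1, T)
curves = np.sin(2 * np.pi * grid) + rng.normal(scale=0.2, size=(S, T))

def modified_band_depth(curves):
    """Average proportion of the domain each curve spends inside the band
    spanned by every pair of curves in the sample."""
    S, _ = curves.shape
    depth = np.zeros(S)
    n_pairs = S * (S - 1) / 2
    for j in range(S):
        for k in range(j + 1, S):
            lo = np.minimum(curves[j], curves[k])
            hi = np.maximum(curves[j], curves[k])
            depth += np.mean((curves >= lo) & (curves <= hi), axis=1)
    return depth / n_pairs

depth = modified_band_depth(curves)
central = curves[np.argsort(depth)[::-1][: S // 2]]   # deepest 50% of draws
envelope_lo, envelope_hi = central.min(axis=0), central.max(axis=0)
```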
Pub. online: 12 Jan 2023 | Type: Computing In Data Science | Open Access
Journal: Journal of Data Science
Volume 21, Issue 2 (2023): Special Issue: Symposium Data Science and Statistics 2022, pp. 281–294
Abstract
How do statistical regression results compare to intuitive, visually fitted results? Fitting lines by eye through a set of points has been explored since the 20th century. Common methods of fitting trends by eye involve maneuvering a string, black thread, or ruler until the fit is suitable, then drawing the line through the set of points. In 2015, the New York Times introduced an interactive feature, called ‘You Draw It’, in which readers are asked to input their own assumptions about various metrics and compare how those assumptions relate to reality. This research implements ‘You Draw It’, adapted from the New York Times, as a way to measure the patterns we see in data. In this paper, we describe the adaptation of an old tool for graphical testing and evaluation, eye-fitting, for use in modern web applications suitable for testing statistical graphics. We present an empirical evaluation of this testing method for linear regression and briefly discuss an extension of this method to non-linear applications.
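A hypothetical sketch of the kind of comparison involved, an eye-fitted line set against an ordinary least squares fit and summarized by their vertical discrepancy, is given below. The data, the drawn line, and the discrepancy measure are invented for illustration and are not the study's stimuli or evaluation metric.

```python
# Hypothetical comparison of a "You Draw It"-style drawn line to an OLS fit.
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 30)
y = 1.5 * x + 2 + rng.normal(scale=2.0, size=x.size)

# ordinary least squares fit
slope, intercept = np.polyfit(x, y, deg=1)
ols_line = slope * x + intercept

# a participant's hand-drawn line, recorded on the same x grid (invented)
drawn_line = 1.3 * x + 3.0

# one simple discrepancy summary: mean squared vertical difference
discrepancy = np.mean((drawn_line - ols_line) ** 2)
print(f"OLS slope {slope:.2f}, drawn slope 1.30, mean squared diff {discrepancy:.2f}")
```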
This paper introduces flowthrough centrality, a node centrality measure determined from the hierarchical maximum concurrent flow problem (HMCFP). Reflecting the extent to which a node acts as a hub within a network, this centrality is defined as the ratio of the flow passing through the node to the total flow capacity of the node. Flowthrough centrality is compared to the commonly used closeness, betweenness, and flow betweenness centralities, as well as to stable betweenness centrality, to assess the stability (i.e., accuracy) of the centralities when knowledge of the network topology is incomplete or in transition. Perturbations alter the flow-based flowthrough centrality values of nodes less than they alter centrality values based upon geodesics. The flowthrough centrality measure overcomes the problem of overstating or understating the roles that significant actors play in social networks. Flowthrough centrality is canonical in that it is determined from a natural, realized flow universally applicable to all networks.
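As a rough illustration of the ratio underlying the measure, the sketch below routes a single-commodity maximum flow through a small hypothetical network and divides the flow entering each intermediate node by that node's incoming capacity. The paper instead derives the flows from the hierarchical maximum concurrent flow problem, which this sketch does not implement; the graph and capacities are invented.

```python
# Crude single-commodity illustration of "flow through a node divided by
# the node's flow capacity"; not the HMCFP formulation used in the paper.
import networkx as nx

G = nx.DiGraph()
edges = [("s", "a", 4), ("s", "b", 3), ("a", "c", 3),
         ("b", "c", 2), ("a", "t", 2), ("c", "t", 5)]
G.add_weighted_edges_from(edges, weight="capacity")

flow_value, flow_dict = nx.maximum_flow(G, "s", "t")

for v in ["a", "b", "c"]:
    inflow = sum(flow_dict[u].get(v, 0) for u in G.predecessors(v))
    capacity = sum(G[u][v]["capacity"] for u in G.predecessors(v))
    print(f"node {v}: flow through = {inflow}, capacity = {capacity}, "
          f"ratio = {inflow / capacity:.2f}")
```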
Pub. online: 21 Dec 2022 | Type: Data Science In Action | Open Access
Journal: Journal of Data Science
Volume 21, Issue 2 (2023): Special Issue: Symposium Data Science and Statistics 2022, pp. 239–254
Abstract
The 2020 Census County Assessment Tool was developed to assist decennial census data users in identifying deviations between expected census counts and the released counts across population and housing indicators. The tool also offers contextual data for each county on factors that could have contributed to census collection issues, such as self-response rates and COVID-19 infection rates. The tool compiles this information into a downloadable report and points users to additional local data sources relevant to the data collection process, as well as to experts who can provide further assistance.
Pub. online: 15 Dec 2022 | Type: Data Science In Action | Open Access
Journal: Journal of Data Science
Volume 21, Issue 2 (2023): Special Issue: Symposium Data Science and Statistics 2022, pp. 193–204
Abstract
Many small and rural places are shrinking. Interactive dashboards are among the most common use cases for data visualization and provide context for exploratory data tools. In our paper, we use Iowa data to explore how dashboards are used in small and rural areas to empower novice analysts to make data-driven decisions. Our framework suggests a number of research directions to better support shrinking small and rural places through interactive dashboard design, implementation, and use for the everyday analyst.
We assessed the impact of the coronavirus disease 2019 (COVID-19) pandemic on the statistical analysis of time-to-event outcomes in late-phase oncology trials. Using a simulated case study that mimics a Phase III trial ongoing during the pandemic, we evaluated the impact of COVID-19-related deaths, time off-treatment, and missed clinical visits due to the pandemic on overall survival and/or progression-free survival in terms of test size (also referred to as Type 1 error rate or alpha level), power, and hazard ratio (HR) estimates. We found that COVID-19-related deaths would impact both size and power and lead to biased HR estimates; the impact would be more severe if there were an imbalance in COVID-19-related deaths between the study arms. Approaches that censor COVID-19-related deaths may mitigate the impact on power and HR estimation, especially if the study data cut-off is extended to recover censoring-related event loss. The impact of COVID-19-related time off-treatment would be modest for power and moderate for size and HR estimation. Different rules for censoring cancer progression times result in a slight difference in power for the analysis of progression-free survival. The simulations provided valuable information for determining whether clinical-trial modifications should be required for ongoing trials during the COVID-19 pandemic.
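A toy simulation along these lines can show how treating pandemic-related deaths as events versus censoring them shifts the estimated hazard ratio. The event rates, sample size, and censoring rule below are arbitrary illustrations, not the paper's simulation settings, and the sketch assumes the lifelines package is available.

```python
# Toy two-arm survival simulation with extra pandemic-related deaths,
# comparing a Cox HR with those deaths counted as events vs. censored.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(4)
n = 500
arm = np.repeat([0, 1], n)                           # 0 = control, 1 = experimental
rate = np.where(arm == 1, 0.7 * 0.05, 0.05)          # true disease HR = 0.7
time_disease = rng.exponential(1 / rate)
time_covid = rng.exponential(1 / 0.01, size=2 * n)   # COVID-19-related deaths

def fit_hr(censor_covid):
    covid_first = time_covid < time_disease
    time = np.minimum(time_disease, time_covid)
    event = np.where(covid_first, 0 if censor_covid else 1, 1)
    df = pd.DataFrame({"time": time, "event": event, "arm": arm})
    cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
    return np.exp(cph.params_["arm"])

print("HR counting COVID deaths as events:", round(fit_hr(False), 3))
print("HR censoring COVID deaths:         ", round(fit_hr(True), 3))
```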
Pub. online: 5 Dec 2022 | Type: Statistical Data Science | Open Access
Journal: Journal of Data Science
Volume 21, Issue 3 (2023): Special Issue: Advances in Network Data Science, pp. 490–507
Abstract
International trade research plays an important role in informing trade policy and in shedding light on wider economic issues. With recent advances in information technology, economic agencies distribute an enormous amount of internationally comparable trading data, providing a gold mine for empirical analysis of international trade. International trading data can be viewed as a dynamic transport network because it emphasizes the amount of goods moving across network edges. Most literature on dynamic network analysis concentrates on parametric modeling of the connectivity network, focusing on link formation or deformation rather than on the transport moving across the network. We take a different, non-parametric perspective from the pervasive node-and-edge-level modeling: the dynamic transport network is modeled as a time series of relational matrices, and variants of the matrix factor model of Wang et al. (2019) are applied to provide a specific interpretation for the dynamic transport network. Under the model, the observed surface network is assumed to be driven by a latent dynamic transport network of lower dimension. Our method is able to unveil the latent dynamic structure and achieves the goal of dimension reduction. We applied the proposed method to a dataset of monthly trading volumes among 24 countries (and regions) from 1982 to 2015. Our findings shed light on trading hubs, centrality, trends, and patterns of international trade, and reveal change points that match changes in trading policies. The dataset also provides fertile ground for future research on international trade.
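To convey the flavor of a matrix factor model of the form X_t = R F_t C' + E_t, the sketch below recovers row and column loading spaces from simulated trade matrices using a plain averaged second-moment eigen-decomposition. This is a simplified stand-in, not the estimator of Wang et al. (2019), and the data are simulated rather than the actual trade volumes.

```python
# Simplified sketch of estimating loading spaces for a matrix factor model
# X_t = R F_t C' + E_t from simulated p x p trade matrices.
import numpy as np

rng = np.random.default_rng(5)
T, p, k = 400, 24, 3                      # months, countries, latent hubs
R = rng.normal(size=(p, k))               # row (exporter) loadings
C = rng.normal(size=(p, k))               # column (importer) loadings
X = np.stack([R @ rng.normal(size=(k, k)) @ C.T
              + rng.normal(scale=0.5, size=(p, p)) for _ in range(T)])

# average second moments across time and take the leading eigenvectors
M_row = np.mean([Xt @ Xt.T for Xt in X], axis=0)
M_col = np.mean([Xt.T @ Xt for Xt in X], axis=0)
R_hat = np.linalg.eigh(M_row)[1][:, -k:]  # estimated row loading space
C_hat = np.linalg.eigh(M_col)[1][:, -k:]  # estimated column loading space

# latent k x k transport factors per month
F_hat = np.einsum("pi,tpq,qj->tij", R_hat, X, C_hat)
```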