Physician performance is critical to caring for patients admitted to the intensive care unit (ICU), who are in life-threatening situations and require a high level of medical care and intervention. Evaluating physicians is crucial for ensuring a high standard of medical care and fostering continuous performance improvement. The non-randomized nature of ICU data often results in imbalance in patient covariates across physician groups, making direct comparisons of patient survival probabilities across physicians misleading. In this article, we use the propensity weighting method to address confounding, achieve covariate balance, and assess physician effects. Because the models involved may be misspecified, we compare the performance of propensity weighting methods based on parametric models with that of methods based on super learning. When the generalized propensity score or the quality function is not correctly specified within the parametric propensity weighting framework, super learning-based propensity weighting methods yield more efficient estimators. We demonstrate that propensity weighting offers an effective way to assess physician performance, a topic of considerable interest to hospital administrators.
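For context, a standard Hájek-type inverse-probability-weighted estimator of the mean outcome under physician $j$ conveys the core idea; the authors' exact estimator, generalized propensity score model, and quality function may differ from this sketch:

$$\hat{e}_j(x) = \widehat{\Pr}(A_i = j \mid X_i = x), \qquad \hat{\mu}_j \;=\; \frac{\sum_{i=1}^{n} \mathbb{1}(A_i = j)\, Y_i \,/\, \hat{e}_j(X_i)}{\sum_{i=1}^{n} \mathbb{1}(A_i = j) \,/\, \hat{e}_j(X_i)},$$

where $A_i$ denotes the treating physician, $Y_i$ the patient outcome (e.g., survival), and $X_i$ the baseline covariates. Weighting each patient by $1/\hat{e}_j(X_i)$ balances covariates across physician groups, and $\hat{e}_j$ can be estimated either by a parametric model (e.g., multinomial logistic regression) or by a super learner.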
Image registration techniques are used to map two images of the same scene or the same image objects onto one another. Several image registration techniques are available in the literature for registering rigid-body as well as non-rigid-body transformations. A very important image transformation is zooming in or out, also called scaling. Apart from a number of feature-based approaches, very few research articles address this particular problem. This paper proposes a method to register two images of the same image object where one is a zoomed-in version of the other. In the proposed intensity-based method, we consider a circular neighborhood around a pixel of the zoomed-in image and search for the pixel in the reference image whose circular neighborhood is most similar to it with respect to various similarity measures. We perform this procedure for all pixels in the zoomed-in image. On images with few features, our proposed method works better than state-of-the-art feature-based methods. We provide several numerical examples as well as a mathematical justification supporting our claim that the method performs reasonably well in many situations.
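As a minimal illustration of the matching step described above (not the authors' implementation; the neighborhood radius, the normalized cross-correlation measure, and the exhaustive search are assumptions made here for concreteness):

```python
import numpy as np

def circular_mask(radius):
    """Boolean mask selecting pixels within `radius` of the patch center."""
    yy, xx = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    return xx**2 + yy**2 <= radius**2

def best_match(reference, zoomed, pixel, radius=5):
    """Find the reference pixel whose circular neighborhood is most similar
    (by normalized cross-correlation) to the neighborhood of `pixel` in the
    zoomed-in image. Assumes `pixel` has a full neighborhood in `zoomed`;
    reference border pixels without a full neighborhood are skipped."""
    mask = circular_mask(radius)
    r, c = pixel
    target = zoomed[r - radius:r + radius + 1, c - radius:c + radius + 1][mask]
    target = (target - target.mean()) / (target.std() + 1e-12)

    best_score, best_pixel = -np.inf, None
    H, W = reference.shape
    for i in range(radius, H - radius):
        for j in range(radius, W - radius):
            cand = reference[i - radius:i + radius + 1,
                             j - radius:j + radius + 1][mask]
            cand = (cand - cand.mean()) / (cand.std() + 1e-12)
            score = np.mean(target * cand)   # normalized cross-correlation
            if score > best_score:
                best_score, best_pixel = score, (i, j)
    return best_pixel, best_score
```

Applying this matching to every pixel of the zoomed-in image yields the pixel-wise correspondence the abstract describes; in practice the search would be restricted to a local window for speed, and other similarity measures could replace the correlation.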
Boosting is a popular algorithm in supervised machine learning, with wide applications in regression and classification problems. It combines weak learners, such as regression trees, to obtain accurate predictions. However, in the presence of outliers, traditional boosting may yield inferior results because the algorithm optimizes a convex loss function. Recent literature has proposed boosting algorithms that optimize robust nonconvex loss functions. Nevertheless, these approaches lack a weighted estimation scheme that indicates the outlier status of each observation. This article introduces the iteratively reweighted boosting (IRBoost) algorithm, which combines robust loss optimization and weighted estimation. It can be conveniently constructed with existing software. The output includes observation weights, which serve as valuable diagnostics of outlier status. For practitioners interested in boosting, the new method can be interpreted as a way to tune robust observation weights. IRBoost is implemented in the R package irboost and is demonstrated on publicly available data in generalized linear models, classification, and survival data analysis.
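The sketch below conveys the reweighting idea in a generic form, alternating a weighted gradient-boosting fit with a Huber-type weight update; it is not the irboost package's implementation, which optimizes robust nonconvex losses directly, and the function name, weight function, and tuning constants are hypothetical choices for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def irboost_sketch(X, y, n_outer=5, delta=1.345, **boost_args):
    """Illustrative iteratively reweighted boosting for regression:
    alternate between fitting a weighted booster and updating observation
    weights with a Huber-type weight function of the scaled residuals."""
    w = np.ones(len(y))
    for _ in range(n_outer):
        model = GradientBoostingRegressor(**boost_args)
        model.fit(X, y, sample_weight=w)
        resid = y - model.predict(X)
        scale = np.median(np.abs(resid)) / 0.6745 + 1e-12   # robust (MAD-based) scale
        u = np.abs(resid) / scale
        w = np.where(u <= delta, 1.0, delta / np.maximum(u, 1e-12))  # Huber weights
    return model, w
```

Observations whose final weight is well below one would be flagged as potential outliers, mirroring the diagnostic role of the weights returned by IRBoost.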
Pub. online: 4 Jun 2024 | Type: Statistical Data Science | Open Access
Journal: Journal of Data Science
Volume 22, Issue 2 (2024): Special Issue: 2023 Symposium on Data Science and Statistics (SDSS): “Inquire, Investigate, Implement, Innovate”, pp. 239–258
Abstract
The programming overhead required to implement machine learning workflows creates a barrier for many discipline-specific researchers with limited programming experience. The stressor package provides an R interface to Python's PyCaret package, which automatically tunes and trains 14–18 machine learning (ML) models for use in accuracy comparisons. In addition to providing an R interface to PyCaret, stressor contains functions that facilitate synthetic data generation, as well as variants of cross-validation that allow for easy benchmarking of machine learning models' ability to extrapolate or to compete with simpler models on simpler data forms. We show the utility of stressor on two agricultural datasets, one using classification models to predict crop suitability and another using regression models to predict crop yields. Full ML benchmarking workflows can be completed in only a few lines of code with relatively small computational cost. The results, and more importantly the workflow, provide a template for how applied researchers can quickly generate accuracy comparisons of many machine learning models with very little programming.
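For readers unfamiliar with the underlying workflow, the PyCaret steps that stressor wraps look roughly like the Python sketch below; stressor's own R function names are not shown here, and the file and column names are hypothetical placeholders.

```python
# Underlying PyCaret regression workflow (Python shown; stressor exposes the
# equivalent from R). The dataset "crop_yields.csv" and the target column
# "yield_kg_ha" are hypothetical placeholders.
import pandas as pd
from pycaret.regression import setup, compare_models

df = pd.read_csv("crop_yields.csv")
setup(data=df, target="yield_kg_ha", session_id=123)
best = compare_models()   # trains, cross-validates, and ranks the candidate models
print(best)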
Journal: Journal of Data Science
Volume 22, Issue 2 (2024): Special Issue: 2023 Symposium on Data Science and Statistics (SDSS): “Inquire, Investigate, Implement, Innovate”, pp. 173–175
Pub. online: 24 May 2024 | Type: Statistical Data Science | Open Access
Journal: Journal of Data Science
Volume 22, Issue 2 (2024): Special Issue: 2023 Symposium on Data Science and Statistics (SDSS): “Inquire, Investigate, Implement, Innovate”, pp. 221–238
Abstract
One measurement modality for rainfall is a fixed-location rain gauge. However, extreme rainfall, flooding, and other climate extremes often occur at larger spatial scales and affect more than one location in a community. For example, in 2017 Hurricane Harvey impacted all of Houston and the surrounding region, causing widespread flooding. Flood risk modeling requires an understanding of rainfall for hydrologic regions, which may contain one or more rain gauges. Further, policy changes to address the risks and damages of natural hazards such as severe flooding are usually made at the community/neighborhood level or a higher geospatial scale. Therefore, spatial-temporal methods that convert results from one spatial scale to another are especially useful in applications involving evolving environmental extremes. We develop a point-to-area random effects (PARE) modeling strategy for understanding spatial-temporal extreme values at the areal level when the core information is time series at point locations distributed over the region.
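One plausible illustrative form of such a point-to-area structure, stated here only as an assumption and not necessarily the authors' specification, is

$$Y_{g,t} = \mu_{a(g),t} + b_g + \varepsilon_{g,t}, \qquad b_g \sim N(0, \tau^2), \quad \varepsilon_{g,t} \sim N(0, \sigma^2),$$

where $Y_{g,t}$ is the rainfall extreme recorded at gauge $g$ in period $t$, $a(g)$ is the hydrologic region containing gauge $g$, $\mu_{a(g),t}$ is the areal-level quantity of interest, and the gauge-level random effect $b_g$ carries the point-to-area information.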
Pub. online: 24 May 2024 | Type: Computing in Data Science | Open Access
Journal: Journal of Data Science
Volume 22, Issue 2 (2024): Special Issue: 2023 Symposium on Data Science and Statistics (SDSS): “Inquire, Investigate, Implement, Innovate”, pp. 208–220
Abstract
With the growing scale of big datasets, fitting novel statistical models to larger-than-memory datasets becomes correspondingly challenging. This paper outlines the design and use of an API for large-scale modelling, demonstrated by the proof-of-concept platform largescaler, which was built specifically for developing statistical models for big datasets.
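The abstract does not document largescaler's API, so the sketch below illustrates the larger-than-memory idea generically: sufficient statistics for least squares are accumulated chunk by chunk so the full dataset never resides in memory. The file and column names are hypothetical.

```python
# Generic out-of-core least squares: accumulate X'X and X'y over chunks read
# from disk. This is an illustration of the larger-than-memory idea only,
# not largescaler's API; "big_dataset.csv", "x1", "x2", "y" are placeholders.
import numpy as np
import pandas as pd

XtX, Xty = None, None
for chunk in pd.read_csv("big_dataset.csv", chunksize=100_000):
    X = np.column_stack([np.ones(len(chunk)), chunk[["x1", "x2"]].to_numpy()])
    y = chunk["y"].to_numpy()
    XtX = X.T @ X if XtX is None else XtX + X.T @ X
    Xty = X.T @ y if Xty is None else Xty + X.T @ y

beta_hat = np.linalg.solve(XtX, Xty)   # OLS coefficients from accumulated statistics
```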
Pub. online: 24 May 2024 | Type: Data Science in Action | Open Access
Journal: Journal of Data Science
Volume 22, Issue 2 (2024): Special Issue: 2023 Symposium on Data Science and Statistics (SDSS): “Inquire, Investigate, Implement, Innovate”, pp. 176–190
Abstract
Graphical design principles typically recommend minimizing the dimensionality of a visualization: for instance, using only two dimensions for bar charts rather than providing a 3D rendering, because the extra complexity may result in a decrease in accuracy. This advice has been oft repeated, but the underlying experimental evidence focuses on fixed 2D projections of 3D charts. In this paper, we describe an experiment that attempts to establish whether the decrease in accuracy extends to 3D virtual renderings and 3D-printed charts. We replicate the grouped bar chart comparisons in the 1984 Cleveland & McGill study, assessing the accuracy of numerical estimates made using different types of 3D and 2D renderings.
Pub. online: 22 May 2024 | Type: Statistical Data Science | Open Access
Journal: Journal of Data Science
Volume 22, Issue 2 (2024): Special Issue: 2023 Symposium on Data Science and Statistics (SDSS): “Inquire, Investigate, Implement, Innovate”, pp. 259–279
Abstract
Predictive modeling often ignores interaction effects among predictors in high-dimensional data because of analytical and computational challenges. Research on interaction selection has been galvanized by methodological and computational advances. In this study, we investigate the performance of two types of predictive algorithms that can perform interaction selection. Specifically, we compare the predictive performance and interaction selection accuracy of penalty-based and tree-based predictive algorithms. The penalty-based algorithms included in our comparative study are the regularization path algorithm under the marginality principle (RAMP), the least absolute shrinkage and selection operator (LASSO), the smoothly clipped absolute deviation (SCAD), and the minimax concave penalty (MCP). The tree-based algorithms considered are random forest (RF) and iterative random forest (iRF). We evaluate the effectiveness of these algorithms under various regression and classification models with varying structures and dimensions. We assess predictive performance using mean squared error for regression and accuracy, sensitivity, specificity, balanced accuracy, and the F1 score for classification, and we use interaction coverage to judge each algorithm's efficacy for interaction selection. Our findings reveal that the effectiveness of the selected algorithms varies with the number of predictors (data dimension) and the structure of the data-generating model, i.e., linear or nonlinear, hierarchical or non-hierarchical. At least one scenario favored each of the algorithms included in this study; however, from the general pattern, we are able to recommend one or more specific algorithms for particular scenarios. Our analysis helps clarify each algorithm's strengths and limitations, offering guidance to researchers and data analysts in choosing an appropriate algorithm for their predictive modeling task based on their data structure.
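As a concrete, hedged example of penalty-based interaction selection in the spirit of this comparison (using a plain LASSO on an expanded design matrix rather than RAMP, and therefore without enforcing the marginality principle):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 2 * X[:, 0] - X[:, 1] + 3 * X[:, 0] * X[:, 1] + rng.normal(size=200)  # true interaction

poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_int = poly.fit_transform(X)                  # main effects plus pairwise interactions
fit = LassoCV(cv=5).fit(X_int, y)

names = poly.get_feature_names_out([f"x{j}" for j in range(10)])
selected = [n for n, b in zip(names, fit.coef_) if abs(b) > 1e-6]
print(selected)                                # typically includes the term "x0 x1"
```

Nonzero coefficients on product terms indicate selected interactions; RAMP, SCAD, MCP, and the tree-based methods differ in how they penalize or discover such terms, which is what the interaction coverage metric summarizes.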
Pub. online: 2 May 2024 | Type: Education in Data Science | Open Access
Journal: Journal of Data Science
Volume 22, Issue 2 (2024): Special Issue: 2023 Symposium on Data Science and Statistics (SDSS): “Inquire, Investigate, Implement, Innovate”, pp. 314–332
Abstract
We investigate how the use of bullet comparison algorithms and demonstrative evidence may affect juror perceptions of the reliability and credibility of expert witnesses, as well as jurors' understanding of the presented evidence. The use of statistical methods in forensic science is motivated by the lack of scientific validity and the error rate issues present in many forensic analysis methods. We explore what our study says about how this type of forensic evidence is perceived in the courtroom, where individuals unfamiliar with advanced statistical methods are asked to evaluate results in order to assess guilt. In our initial study, we found that individuals overwhelmingly provided high Likert-scale ratings of reliability, credibility, and scientificity regardless of experimental condition. This discovery of scale compression, where responses are limited to a few values on a larger scale despite experimental manipulation, limits statistical modeling but provides opportunities for new experimental manipulations that may improve future studies in this area.