Home
Search

Journal of Data Science

Submit your article Information

Journal home
To appear
Current issue
All issues
More
Journal home To appear Current issue All issues

Keywords: machine learning

Detailed search

Title

Author

Types

Abstract

Keywords

Published

Pages

Volumes

Issues

DOI

Affiliation

Search results 13

Order by:

Select: All None Download:

Impact of Data Perturbation for Statistical Disclosure Control on the Predictive Performance of Machine Learning Techniques

Thomas Johnson III Sayed A. Mostafa

https://doi.org/10.6339/25-JDS1186

Pub. online: 23 Apr 2025 Type: Statistical Data Science

Open Access

Journal: Journal of Data Science Volume 23, Issue 2 (2025): Special Issue: the 2024 Symposium on Data Science and Statistics (SDSS), pp. 312–331

Abstract

The rapid accumulation and release of data have fueled research across various fields. While numerous methods exist for data collection and storage, data distribution presents challenges, as some datasets are restricted, and certain subsets may compromise privacy if released unaltered. Statistical disclosure control (SDC) aims to maximize data utility while minimizing the disclosure risk, i.e., the risk of individual identification. A key SDC method is data perturbation, with General Additive Data Perturbation (GADP) and Copula General Additive Data Perturbation (CGADP) being two prominent approaches. Both leverage multivariate normal distributions to generate synthetic data while preserving statistical properties of the original dataset. Given the increasing use of machine learning for data modeling, this study compares the performance of various machine learning models on GADP- and CGADP-perturbed data. Using Monte Carlo simulations with three data-generating models and a real dataset, we evaluate the predictive performance and robustness of ten machine learning techniques under data perturbation. Our findings provide insights into the machine learning techniques that perform robustly on GADP- and CGADP-perturbed datasets, extending previous research that primarily focused on simple statistics such as means, variances, and correlations.

BEACON: A Tool for Industry Self-Classification in the Economic Census

Brian Dumbacher

Daniel Whitehead Jiseok Jeong All authors (4)

https://doi.org/10.6339/25-JDS1180

Pub. online: 17 Apr 2025 Type: Data Science In Action

Open Access

Journal: Journal of Data Science Volume 23, Issue 2 (2025): Special Issue: the 2024 Symposium on Data Science and Statistics (SDSS), pp. 429–448

Abstract

Business Establishment Automated Classification of NAICS (BEACON) is a text classification tool that helps respondents to the U.S. Census Bureau’s economic surveys self-classify their business activity in real time. The tool is based on rich training data, natural language processing, machine learning, and information retrieval. It is implemented using Python and an application programming interface. This paper describes BEACON’s methodology and successful application to the 2022 Economic Census, during which the tool was used over half a million times. BEACON has demonstrated that it recognizes a large vocabulary, quickly returns relevant results to respondents, and reduces clerical work associated with industry code assignment.

Physician Effects in Critical Care: A Causal Inference Approach Through Propensity Weighting with Parametric and Super Learning Methods

Yuan Bian

Yu Shi Hui Guo All authors (5)

https://doi.org/10.6339/24-JDS1143

Pub. online: 2 Jul 2024 Type: Data Science In Action

Open Access

Journal: Journal of Data Science Volume 23, Issue 1 (2025), pp. 130–148

Abstract

Physician performance is critical to caring for patients admitted to the intensive care unit (ICU), who are in life-threatening situations and require high level medical care and interventions. Evaluating physicians is crucial for ensuring a high standard of medical care and fostering continuous performance improvement. The non-randomized nature of ICU data often results in imbalance in patient covariates across physician groups, making direct comparisons of the patients’ survival probabilities for each physician misleading. In this article, we utilize the propensity weighting method to address confounding, achieve covariates balance, and assess physician effects. Due to possible model misspecification, we compare the performance of the propensity weighting methods using both parametric models and super learning methods. When the generalized propensity or the quality function is not correctly specified within the parametric propensity weighting framework, super learning-based propensity weighting methods yield more efficient estimators. We demonstrate that utilizing propensity weighting offers an effective way to assess physician performance, a topic of considerable interest to hospital administrators.

Unified Robust Boosting

Zhu Wang

https://doi.org/10.6339/24-JDS1138

Pub. online: 28 Jun 2024 Type: Computing In Data Science

Open Access

Journal: Journal of Data Science Volume 23, Issue 1 (2025), pp. 90–108

Abstract

Boosting is a popular algorithm in supervised machine learning with wide applications in regression and classification problems. It combines weak learners, such as regression trees, to obtain accurate predictions. However, in the presence of outliers, traditional boosting may yield inferior results since the algorithm optimizes a convex loss function. Recent literature has proposed boosting algorithms that optimize robust nonconvex loss functions. Nevertheless, there is a lack of weighted estimation to indicate the outlier status of observations. This article introduces the iteratively reweighted boosting (IRBoost) algorithm, which combines robust loss optimization and weighted estimation. It can be conveniently constructed with existing software. The output includes weights as valuable diagnostics for the outlier status of observations. For practitioners interested in the boosting algorithm, the new method can be interpreted as a way to tune robust observation weights. IRBoost is implemented in the R package irboost and is demonstrated using publicly available data in generalized linear models, classification, and survival data analysis.

Producing Fast and Convenient Machine Learning Benchmarks in R with the stressor Package

Sam Haycock Brennan Bean Emily Burchfield

https://doi.org/10.6339/24-JDS1123

Pub. online: 4 Jun 2024 Type: Statistical Data Science

Open Access

Journal: Journal of Data Science Volume 22, Issue 2 (2024): Special Issue: 2023 Symposium on Data Science and Statistics (SDSS): “Inquire, Investigate, Implement, Innovate”, pp. 239–258

Abstract

The programming overhead required to implement machine learning workflows creates a barrier for many discipline-specific researchers with limited programming experience. The stressor package provides an R interface to Python’s PyCaret package, which automatically tunes and trains 14-18 machine learning (ML) models for use in accuracy comparisons. In addition to providing an R interface to PyCaret, stressor also contains functions that facilitate synthetic data generation and variants of cross-validation that allow for easy benchmarking of the ability of machine-learning models to extrapolate or compete with simpler models on simpler data forms. We show the utility of stressor on two agricultural datasets, one using classification models to predict crop suitability and another using regression models to predict crop yields. Full ML benchmarking workflows can be completed in only a few lines of code with relatively small computational cost. The results, and more importantly the workflow, provide a template for how applied researchers can quickly generate accuracy comparisons of many machine learning models with very little programming.

Tuning Support Vector Machines and Boosted Trees Using Optimization Algorithms

Jill F. Lundell

https://doi.org/10.6339/23-JDS1106

Pub. online: 5 Jul 2023 Type: Computing In Data Science

Open Access

Journal: Journal of Data Science Volume 22, Issue 4 (2024), pp. 575–590

Abstract

Statistical learning methods have been growing in popularity in recent years. Many of these procedures have parameters that must be tuned for models to perform well. Research has been extensive in neural networks, but not for many other learning methods. We looked at the behavior of tuning parameters for support vector machines, gradient boosting machines, and adaboost in both a classification and regression setting. We used grid search to identify ranges of tuning parameters where good models can be found across many different datasets. We then explored different optimization algorithms to select a model across the tuning parameter space. Models selected by the optimization algorithm were compared to the best models obtained through grid search to select well performing algorithms. This information was used to create an R package, EZtune, that automatically tunes support vector machines and boosted trees.

Binary Classification of Malignant Mesothelioma: A Comparative Study

Ted Si Yuan Cheng Xiyue Liao

https://doi.org/10.6339/23-JDS1090

Pub. online: 14 Feb 2023 Type: Data Science In Action

Open Access

Journal: Journal of Data Science Volume 21, Issue 2 (2023): Special Issue: Symposium Data Science and Statistics 2022, pp. 205–224

Abstract

Malignant mesotheliomas are aggressive cancers that occur in the thin layer of tissue that covers most commonly the linings of the chest or abdomen. Though the cancer itself is rare and deadly, early diagnosis will help with treatment and improve outcomes. Mesothelioma is usually diagnosed in the later stages. Symptoms are similar to other, more common conditions. As such, predicting and diagnosing mesothelioma early is essential to starting early treatment for a cancer that is often diagnosed too late. The goal of this comprehensive empirical comparison is to determine the best-performing model based on recall (sensitivity). We particularly wish to avoid false negatives, as it is costly to diagnose a patient as healthy when they actually have cancer. Model training will be conducted based on k-fold cross validation. Random forest is chosen as the optimal model. According to this model, age and duration of asbestos exposure are ranked as the most important features affecting diagnosis of mesothelioma.

Random Forest of Interaction Trees for Estimating Individualized Treatment Regimes with Ordered Treatment Levels in Observational Studies

Justin Thorp Richard A. Levine

Luo Li All authors (4)

https://doi.org/10.6339/23-JDS1084

Pub. online: 2 Feb 2023 Type: Statistical Data Science

Open Access

Journal: Journal of Data Science Volume 21, Issue 2 (2023): Special Issue: Symposium Data Science and Statistics 2022, pp. 391–411

Abstract

Traditional methods for evaluating a potential treatment have focused on the average treatment effect. However, there exist situations where individuals can experience significantly heterogeneous responses to a treatment. In these situations, one needs to account for the differences among individuals when estimating the treatment effect. Li et al. (2022) proposed a method based on random forest of interaction trees (RFIT) for a binary or categorical treatment variable, while incorporating the propensity score in the construction of random forest. Motivated by the need to evaluate the effect of tutoring sessions at a Math and Stat Learning Center (MSLC), we extend their approach to an ordinal treatment variable. Our approach improves upon RFIT for multiple treatments by incorporating the ordered structure of the treatment variable into the tree growing process. To illustrate the effectiveness of our proposed method, we conduct simulation studies where the results show that our proposed method has a lower mean squared error and higher optimal treatment classification, and is able to identify the most important variables that impact the treatment effect. We then apply the proposed method to estimate how the number of visits to the MSLC impacts an individual student’s probability of passing an introductory statistics course. Our results show that every student is recommended to go to the MSLC at least once and some can drastically improve their chance of passing the course by going the optimal number of times suggested by our analysis.

Efficacy of Data Fusion Using Convolved Multi-Output Gaussian Processes

Shrihari Vasudevan Arman Melkumyan Steven Scheding

https://doi.org/10.6339/JDS.201504_13(2).0007

Pub. online: 4 Aug 2022 Type: Research Article

Open Access

Journal: Journal of Data Science Volume 13, Issue 2 (2015), pp. 341–368

Abstract

Abstract: This paper evaluates the efficacy of a machine learning approach to data fusion using convolved multi-output Gaussian processes in the context of geological resource modeling. It empirically demonstrates that information integration across multiple information sources leads to superior estimates of all the quantities being modeled, compared to modeling them individually. Convolved multi-output Gaussian processes provide a powerful approach for simultaneous modeling of multiple quantities of interest while taking correlations between these quantities into consideration. Experiments are performed on large scale data taken from a mining context.

Machine Learning Algorithms To Predict The Childhood Anemia In Bangladesh

Jahidur Rahman Khan Srizan Chowdhury Humayera Islam All authors (4)

https://doi.org/10.6339/JDS.201901_17(1).0009

Pub. online: 4 Aug 2022 Type: Research Article

Open Access

Journal: Journal of Data Science Volume 17, Issue 1 (2019), pp. 195–218

Abstract

Anemia, especially among children, is a serious public health problem in Bangladesh. Apart from understanding the factors associated with anemia, it may be of interest to know the likelihood of anemia given the factors. Prediction of disease status is a key to community and health service policy making as well as forecasting for resource planning. We considered machine learning (ML) algorithms to predict the anemia status among children (under five years) using common risk factors as features. Data were extracted from a nationally representative cross-sectional survey- Bangladesh Demographic and Health Survey (BDHS) conducted in 2011. In this study, a sample of 2013 children were selected for whom data on all selected variables was available. We used several ML algorithms such as linear discriminant analysis (LDA), classification and regression trees (CART), k-nearest neighbors (k-NN), support vector machines (SVM), random forest (RF) and logistic regression (LR) to predict the childhood anemia status. A systematic evaluation of the algorithms was performed in terms of accuracy, sensitivity, specificity, and area under the curve (AUC). We found that the RF algorithm achieved the best classification accuracy of 68.53% with a sensitivity of 70.73%, specificity of 66.41% and AUC of 0.6857. On the other hand, the classical LR algorithm reached a classification accuracy of 62.75% with a sensitivity of 63.41%, specificity of 62.11% and AUC of 0.6276. Among all considered algorithms, the k-NN gave the least accuracy. We conclude that ML methods can be considered in addition to the classical regression techniques when the prediction of anemia is the primary focus.

Detailed search

Search results 13

Export citation

Copy and paste formatted citation

Download citation in file

Authors