Home
Search

Journal of Data Science

Submit your article Information

Journal home
To appear
Current issue
All issues
More
Journal home To appear Current issue All issues

Keywords: classification

Detailed search

Title

Author

Types

Abstract

Keywords

Published

Pages

Volumes

Issues

DOI

Affiliation

Search results 11

Order by:

Select: All None Download:

VAROC: Value Added Receiver Operating Characteristics Curve

Danielle Brister Yunro Chung

https://doi.org/10.6339/26-JDS1218

Pub. online: 30 Jan 2026 Type: Statistical Data Science

Open Access

Journal: Journal of Data Science

Abstract

The receiver operating characteristics (ROC) curve has been widely used to evaluate the discrimination performance of biomarkers, but it has been criticized for overlooking their underlying distributions. In this paper, we propose a continuous version of the ROC curve that can assess not only the discrimination performance of biomarkers but also their continuity performance. Our method summarizes the biomarker values as conditional tail expectations at varying thresholds and compare them with true positive and false positive rates. The proposed method is particularly useful for an early phase of biomarker study that enrolls heterogeneous disease populations. Analysis of data from an ovarian cancer biomarker study illustrates the practical utility of the proposed method over the standard ROC curve analysis. The proposed methods are implemented in the R package varoc.

Predicting Stunted Growth in Two Year Old Bangladeshi Children via the Super Learner

Heather L. Cook Jennie Z. Ma Daniel M. Keenan All authors (8)

https://doi.org/10.6339/25-JDS1197

Pub. online: 4 Aug 2025 Type: Data Science In Action

Open Access

Journal: Journal of Data Science

Abstract

Stunted growth in children is a worldwide issue which may cause long term problems for individuals stunted as early as two years of age. However, predicting stunted growth with accuracy is quite complex, but machine learning poses a distinct advantage in this regard. While several techniques are available for predictive modeling, the Super Learner stands out as an ensemble method that integrates multiple algorithms into a single predictive model with enhanced performance. In this study, the Super Learner model, comprising generalized linear model, bagged trees, random forests, conditional random forest, stochastic gradient boosting, Bayesian additive regression trees, neural networks, and model averaged neural networks, achieved high performance with high area under the receiver operating characteristic curve, Brier Score, and the minimum of precision and recall values. However, after analyzing the results from cross validation, the final model selected was the Bayesian additive regression trees. Within the final model, the height-for-age z-score at one year, income, expenditure, anti-lipopolysaccharide antibody at week 6 and at week 18, plasma retinol binding protein at week 6, plasma soluble cluster designation 14 at week 18, fecal Reg 1B at week 12, vitamin D at week 18, mother’s weight and height at enrollment, fecal calprotectin at week 12, fecal myeloperoxidase at week 12, number of days of diarrhea through the first year of life, and the number of days of exclusive breastfeeding through the first year of life emerged as the top important variables for predicting stunted growth at two years of age.

A Statistician’s Selective Review of Neural Network Modeling: Algorithms and Applications

Chunming Zhang

Zhengjun Zhang Xinrui Zhong All authors (5)

https://doi.org/10.6339/25-JDS1167

Pub. online: 20 Jan 2025 Type: Data Science Reviews

Open Access

Journal: Journal of Data Science Volume 23, Issue 4 (2025): Special Issue: Statistical Frontiers of Data Science, pp. 676–694

Abstract

Deep neural networks have a wide range of applications in data science. This paper reviews neural network modeling algorithms and their applications in both supervised and unsupervised learning. Key examples include: (i) binary classification and (ii) nonparametric regression function estimation, both implemented with feedforward neural networks ($\mathrm{FNN}$); (iii) sequential data prediction using long short-term memory ($\mathrm{LSTM}$) networks; and (iv) image classification using convolutional neural networks ($\mathrm{CNN}$). All implementations are provided in $\mathrm{MATLAB}$, making these methods accessible to statisticians and data scientists to support learning and practical application.

A Two-Stage Classification for Dealing with Unseen Clusters in the Testing Data

Jung Wun Lee

Ofer Harel

https://doi.org/10.6339/24-JDS1140

Pub. online: 2 Jul 2024 Type: Statistical Data Science

Open Access

Journal: Journal of Data Science Volume 23, Issue 1 (2025), pp. 188–207

Abstract

Classification is an important statistical tool that has increased its importance since the emergence of the data science revolution. However, a training data set that does not capture all underlying population subgroups (or clusters) will result in biased estimates or misclassification. In this paper, we introduce a statistical and computational solution to a possible bias in classification when implemented on estimated population clusters. An unseen-cluster problem denotes the case in which the training data does not contain all underlying clusters in the population. Such a scenario may occur due to various reasons, such as sampling errors, selection bias, or emerging and disappearing population clusters. Once an unseen-cluster problem occurs, a testing observation will be misclassified because a classification rule based on the sample cannot capture a cluster not observed in the training data (sample). To overcome such issues, we suggest a two-stage classification method to ameliorate the unseen-cluster problem in classification. We suggest a test to identify the unseen-cluster problem and demonstrate the performance of the two-stage tailored classifier using simulations and a public data example.

A Copula-Based Supervised Learning Classification for Continuous and Discrete Data

Yuhui Chen

https://doi.org/10.6339/JDS.201610_14(4).0010

Pub. online: 4 Aug 2022 Type: Research Article

Open Access

Journal: Journal of Data Science Volume 14, Issue 4 (2016), pp. 769–790

Abstract

Abstract: Despite the unreasonable feature independence assumption, the naive Bayes classifier provides a simple way but competes well with more sophisticated classifiers under zero-one loss function for assigning an observation to a class given the features observed. However, it has been proved that the naive Bayes works poorly in estimation and in classification for some cases when the features are correlated. To extend, researchers had developed many approaches to free of this primary but rarely satisfied assumption in the real world for the naive Bayes. In this paper, we propose a new classifier which is also free of the independence assumption by evaluating the dependence of features through pair copulas constructed via a graphical model called D-Vine tree. This tree structure helps to decompose the multivariate dependence into many bivariate dependencies and thus makes it possible to easily and efficiently evaluate the dependence of features even for data with high dimension and large sample size. We further extend the proposed method for features with discrete-valued entries. Experimental studies show that the proposed method performs well for both continuous and discrete cases.

Application of Skew-normal in Classification of Satellite Image

Mohammad Reza Zadkarami Mahdi Rowhani

https://doi.org/10.6339/JDS.2010.08(4).624

Pub. online: 4 Aug 2022 Type: Research Article

Open Access

Journal: Journal of Data Science Volume 8, Issue 4 (2010), pp. 597–606

Abstract

Abstract: The aim of this paper is to investigate the flexibility of the skewnormal distribution to classify the pixels of a remotely sensed satellite image. In the most of remote sensing packages, for example ENVI and ERDAS, it is assumed that populations are distributed as a multivariate normal. Then linear discriminant function (LDF) or quadratic discriminant function (QDF) is used to classify the pixels, when the covariance matrix of populations are assumed equal or unequal, respectively. However, the data was obtained from the satellite or airplane images suffer from non-normality. In this case, skew-normal discriminant function (SDF) is one of techniques to obtain more accurate image. In this study, we compare the SDF with LDF and QDF using simulation for different scenarios. The results show that ignoring the skewness of the data increases the misclassification probability and consequently we get wrong image. An application is provided to identify the effect of wrong assumptions on the image accuracy.

Topic Model Kernel Classification with Probabilistically Reduced Features

Vu Nguyen Dinh Phung Svetha Venkatesh

https://doi.org/10.6339/JDS.201504_13(2).0006

Pub. online: 4 Aug 2022 Type: Research Article

Open Access

Journal: Journal of Data Science Volume 13, Issue 2 (2015), pp. 323–340

Abstract

Probabilistic topic models have become a standard in modern machine learning to deal with a wide range of applications. Representing data by dimensional reduction of mixture proportion extracted from topic models is not only richer in semantics interpretation, but could also be informative for classification tasks. In this paper, we describe the Topic Model Kernel (TMK), a topicbased kernel for Support Vector Machine classification on data being processed by probabilistic topic models. The applicability of our proposed kernel is demonstrated in several classification tasks with real world datasets. TMK outperforms existing kernels on the distributional features and give comparative results on nonprobabilistic data types.

On Classifying At Risk Latent Zeros Using Zero Inflated Models

Dwivedi Dwivedi MB Rao Sada Nand Dwivedi All authors (5)

https://doi.org/10.6339/JDS.201404_12(2).0006

Pub. online: 4 Aug 2022 Type: Research Article

Open Access

Journal: Journal of Data Science Volume 12, Issue 2 (2014), pp. 307–323

Abstract

Abstract: Count data often have excess zeros in many clinical studies. These zeros usually represent “disease-free state”. Although disease (event) free at the time, some of them might be at a high risk of having the putative outcome while others may be at low or no such risk. We postulate these zeros as a one of the two types, either as ‘low risk’ or as ‘high risk’ zeros for the disease process in question. Low risk zeros can arise due to the absence of risk factors for disease initiation/progression and/or due to very early stage of the disease. High risk zeros can arise due to the presence of significant risk factors for disease initiation/ progression or could be, in rare situations, due to misclassification, more specific diagnostic tests, or below the level of detection. We use zero inflated models which allows us to assume that zeros arise from one of the two separate latent processes-one giving low-risk zeros and the other high-risk zeros and subsequently propose a strategy to identify and classify them as such. To illustrate, we use data on the number of involved nodes in breast cancer patients. Of the 1152 patients studied, 38.8% were node- negative (zeros). The model predicted that about a third (11.4%) of negative nodes are “high risk” and the remaining (27.4%) are at “low risk” of nodal positivity. Posterior probability based classification was more appropriate compared to other methods. Our approach indicates that some node negative patients may be re-assessed for their diagnosis about nodal positivity and/or for future clinical management of their disease. The approach developed here is applicable to any scenario where the disease or outcome can be characterized by count-data.

Determination of the Effective Economic and/or Demographic Indicators in Classification of European Union Member and Candidate Countries Using Partial Least Squares Discriminant Analysis

Esra Polat

https://doi.org/10.6339/JDS.201801_16(1).0005

Pub. online: 4 Aug 2022 Type: Research Article

Open Access

Journal: Journal of Data Science Volume 16, Issue 1 (2018), pp. 79–92

Abstract

Partial Least Squares Discriminant Analysis (PLSDA) is a statistical method for classification and consists of a classical Partial Least Squares Regression in which the dependent variable is a categorical one expressing the class membership of each observation. The aim of this study is both analyzing the performance of PLSDA method in classifying 28 European Union (EU) member countries and 7 candidate countries (Albania, Montenegro, Serbia, Macedonia FYR, Turkey moreover including potential candidates Bosnia and Herzegovina and Kosova) correctly to their pre-defined classes (candidate or member) and determining the economic and/or demographic indicators, which are effective in classifying, by using the data set obtained from database of the World Bank.

Stability and Structure of CART and SPAN Search Generated Data Partitions for the Analysis of Low Birth Weight

Roger J. Marshall Panagiota Kitsantas

https://doi.org/10.6339/JDS.2012.10(1).1014

Pub. online: 4 Aug 2022 Type: Research Article

Open Access

Journal: Journal of Data Science Volume 10, Issue 1 (2012), pp. 61–73

Abstract

Abstract: Searching for data structure and decision rules using classification and regression tree (CART) methodology is now well established. An alternative procedure, search partition analysis (SPAN), is less well known. Both provide classifiers based on Boolean structures; in CART these are generated by a hierarchical series of local sub-searches and in SPAN by a global search. One issue with CART is its perceived instability, another the awkward nature of the Boolean structures generated by a hierarchical tree. Instability arises because the final tree structure is sensitive to early splits. SPAN, as a global search, seems more likely to render stable partitions. To examine these issues in the context of identifying mothers at risk of giving birth to low birth weight babies, we have taken a very large sample, divided it at random into ten non-overlapping sub-samples and performed SPAN and CART analyses on each sub-sample. The stability of the SPAN and CART models is described and, in addition, the structure of the Boolean representation of classifiers is examined. It is found that SPAN partitions have more intrinsic stability and less prone to Boolean structural irregularities.

Detailed search

Search results 11

Export citation

Copy and paste formatted citation

Download citation in file

Authors