Pub. online:23 Apr 2025Type:Statistical Data ScienceOpen Access
Journal:Journal of Data Science
Volume 23, Issue 2 (2025): Special Issue: the 2024 Symposium on Data Science and Statistics (SDSS), pp. 353–369
Abstract
Studying migration patterns driven by extreme environmental events is crucial for building a sustainable society and stable economy. Motivated by a real dataset about human migrations, this paper develops a transformed varying coefficient model for origin and destination (OD) regression to elucidate the complex associations of migration patterns with spatio-temporal dependencies and socioeconomic factors. Existing studies often overlook the dynamic effects of these factors in OD regression. Furthermore, with the increasing ease of collecting OD data, the scale of current OD regression data is typically large, necessitating the development of methods for efficiently fitting large-scale migration data. We address the challenge by proposing a new Bayesian interpretation for the proposed OD models, leveraging sufficient statistics for efficient big data computation. Our method, inspired by migration studies, promises broad applicability across various fields, contributing to refined statistical analysis techniques. Extensive numerical studies are provided, and insights from real data analysis are shared.
Abstract: Despite the unreasonable feature independence assumption, the naive Bayes classifier provides a simple way but competes well with more sophisticated classifiers under zero-one loss function for assigning an observation to a class given the features observed. However, it has been proved that the naive Bayes works poorly in estimation and in classification for some cases when the features are correlated. To extend, researchers had developed many approaches to free of this primary but rarely satisfied assumption in the real world for the naive Bayes. In this paper, we propose a new classifier which is also free of the independence assumption by evaluating the dependence of features through pair copulas constructed via a graphical model called D-Vine tree. This tree structure helps to decompose the multivariate dependence into many bivariate dependencies and thus makes it possible to easily and efficiently evaluate the dependence of features even for data with high dimension and large sample size. We further extend the proposed method for features with discrete-valued entries. Experimental studies show that the proposed method performs well for both continuous and discrete cases.
Abstract: The aim of this paper is to investigate the flexibility of the skewnormal distribution to classify the pixels of a remotely sensed satellite image. In the most of remote sensing packages, for example ENVI and ERDAS, it is assumed that populations are distributed as a multivariate normal. Then linear discriminant function (LDF) or quadratic discriminant function (QDF) is used to classify the pixels, when the covariance matrix of populations are assumed equal or unequal, respectively. However, the data was obtained from the satellite or airplane images suffer from non-normality. In this case, skew-normal discriminant function (SDF) is one of techniques to obtain more accurate image. In this study, we compare the SDF with LDF and QDF using simulation for different scenarios. The results show that ignoring the skewness of the data increases the misclassification probability and consequently we get wrong image. An application is provided to identify the effect of wrong assumptions on the image accuracy.
Abstract: Count data often have excess zeros in many clinical studies. These zeros usually represent “disease-free state”. Although disease (event) free at the time, some of them might be at a high risk of having the putative outcome while others may be at low or no such risk. We postulate these zeros as a one of the two types, either as ‘low risk’ or as ‘high risk’ zeros for the disease process in question. Low risk zeros can arise due to the absence of risk factors for disease initiation/progression and/or due to very early stage of the disease. High risk zeros can arise due to the presence of significant risk factors for disease initiation/ progression or could be, in rare situations, due to misclassification, more specific diagnostic tests, or below the level of detection. We use zero inflated models which allows us to assume that zeros arise from one of the two separate latent processes-one giving low-risk zeros and the other high-risk zeros and subsequently propose a strategy to identify and classify them as such. To illustrate, we use data on the number of involved nodes in breast cancer patients. Of the 1152 patients studied, 38.8% were node- negative (zeros). The model predicted that about a third (11.4%) of negative nodes are “high risk” and the remaining (27.4%) are at “low risk” of nodal positivity. Posterior probability based classification was more appropriate compared to other methods. Our approach indicates that some node negative patients may be re-assessed for their diagnosis about nodal positivity and/or for future clinical management of their disease. The approach developed here is applicable to any scenario where the disease or outcome can be characterized by count-data.
Partial Least Squares Discriminant Analysis (PLSDA) is a statistical method for classification and consists of a classical Partial Least Squares Regression in which the dependent variable is a categorical one expressing the class membership of each observation. The aim of this study is both analyzing the performance of PLSDA method in classifying 28 European Union (EU) member countries and 7 candidate countries (Albania, Montenegro, Serbia, Macedonia FYR, Turkey moreover including potential candidates Bosnia and Herzegovina and Kosova) correctly to their pre-defined classes (candidate or member) and determining the economic and/or demographic indicators, which are effective in classifying, by using the data set obtained from database of the World Bank.