Abstract: Despite its unrealistic assumption of feature independence, the naive Bayes classifier offers a simple way to assign an observation to a class given the observed features, and it competes well with more sophisticated classifiers under zero-one loss. However, it has been shown that naive Bayes performs poorly in both estimation and classification in some cases when the features are correlated. Researchers have therefore developed many approaches that free naive Bayes from this fundamental assumption, which is rarely satisfied in the real world. In this paper, we propose a new classifier that is also free of the independence assumption: it evaluates the dependence among features through pair copulas constructed via a graphical model called a D-vine tree. This tree structure decomposes the multivariate dependence into many bivariate dependencies, which makes it possible to evaluate the dependence among features easily and efficiently, even for high-dimensional data with large sample sizes. We further extend the proposed method to features with discrete-valued entries. Experimental studies show that the proposed method performs well in both the continuous and the discrete case.
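The idea of correcting naive Bayes with a copula can be illustrated in the simplest possible setting. The following is a minimal sketch, not the authors' method: it assumes only two features, normal marginals, and a Gaussian pair copula (the abstract does not state the copula family), so the D-vine reduces to a single pair copula. All function names and parameters are hypothetical. The class-conditional density is the naive Bayes product of marginals multiplied by a copula density, and setting `rho = 0` recovers plain naive Bayes.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the copula transform


def gaussian_copula_density(u, v, rho):
    """Density of a bivariate Gaussian copula at (u, v), with 0 < u, v < 1.

    Assumption: the Gaussian family is chosen only for illustration.
    """
    a = STD_NORMAL.inv_cdf(u)
    b = STD_NORMAL.inv_cdf(v)
    det = 1.0 - rho * rho
    exponent = -(rho * rho * (a * a + b * b) - 2.0 * rho * a * b) / (2.0 * det)
    return math.exp(exponent) / math.sqrt(det)


def class_conditional_density(x1, x2, marg1, marg2, rho):
    """f(x1, x2 | class) = f1(x1) * f2(x2) * c(F1(x1), F2(x2))."""
    independent_part = marg1.pdf(x1) * marg2.pdf(x2)  # the naive Bayes term
    dependence_part = gaussian_copula_density(marg1.cdf(x1), marg2.cdf(x2), rho)
    return independent_part * dependence_part


def classify(x1, x2, classes):
    """classes maps label -> (prior, marg1, marg2, rho); returns the MAP label."""
    def posterior_score(label):
        prior, marg1, marg2, rho = classes[label]
        return prior * class_conditional_density(x1, x2, marg1, marg2, rho)
    return max(classes, key=posterior_score)


# Two classes with identical marginals that differ only in dependence.
# Naive Bayes cannot separate them; the copula term can.
classes = {
    "pos": (0.5, NormalDist(0, 1), NormalDist(0, 1), 0.8),
    "neg": (0.5, NormalDist(0, 1), NormalDist(0, 1), -0.8),
}
label_same_sign = classify(1.0, 1.0, classes)
label_opposite_sign = classify(1.0, -1.0, classes)
```

In higher dimensions a D-vine would chain many such pair copulas, some of them conditional; that construction is not shown here.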
Abstract: The aim of this paper is to investigate the flexibility of the skew-normal distribution for classifying the pixels of a remotely sensed satellite image. Most remote sensing packages, for example ENVI and ERDAS, assume that the populations follow a multivariate normal distribution. The linear discriminant function (LDF) or the quadratic discriminant function (QDF) is then used to classify the pixels when the covariance matrices of the populations are assumed equal or unequal, respectively. However, data obtained from satellite or airplane images suffer from non-normality. In this case, the skew-normal discriminant function (SDF) is one technique for obtaining a more accurate image. In this study, we compare the SDF with the LDF and QDF using simulations under different scenarios. The results show that ignoring the skewness of the data increases the misclassification probability and consequently produces an inaccurate image. An application is provided to illustrate the effect of wrong assumptions on image accuracy.
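To make the effect of ignored skewness concrete, here is a minimal univariate sketch of a skew-normal discriminant rule. It is an illustration under stated assumptions, not the paper's multivariate procedure: a single band value per pixel, known parameters, and hypothetical names (`skew_normal_pdf`, `classify_pixel`). With shape parameter `alpha = 0` the density is the ordinary normal, so the same rule then coincides with a normal-theory discriminant.

```python
import math


def skew_normal_pdf(x, xi, omega, alpha):
    """Univariate skew-normal density with location xi, scale omega, shape alpha."""
    z = (x - xi) / omega
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal pdf
    Phi = 0.5 * (1.0 + math.erf(alpha * z / math.sqrt(2.0)))  # standard normal cdf at alpha*z
    return 2.0 / omega * phi * Phi


def classify_pixel(x, classes):
    """classes maps label -> (prior, xi, omega, alpha); returns the label
    with the largest prior-weighted density (the Bayes rule)."""
    def score(label):
        prior, xi, omega, alpha = classes[label]
        return prior * skew_normal_pdf(x, xi, omega, alpha)
    return max(classes, key=score)


# Hypothetical parameters: class "a" is strongly right-skewed in truth.
assuming_normal = {"a": (0.5, 0.0, 1.0, 0.0), "b": (0.5, 2.0, 1.0, 0.0)}
with_skewness = {"a": (0.5, 0.0, 1.0, 5.0), "b": (0.5, 2.0, 1.0, 0.0)}

label_if_normal = classify_pixel(1.2, assuming_normal)
label_if_skewed = classify_pixel(1.2, with_skewness)
```

For the pixel value 1.2, the two parameter sets give different labels, which is the kind of misclassification the abstract attributes to ignoring skewness.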
Abstract: In this paper, a tree-structured method is proposed to extend the Classification and Regression Trees (CART) algorithm to multivariate survival data, assuming a proportional hazards structure over the whole tree. The method works on the marginal survivor distributions and uses a sandwich estimator of variance to account for the association between survival times. The Wald test statistic is used as the splitting rule, and the survival trees are grown by maximizing between-node separation. The proposed method is intended to classify patients into subgroups with distinctly different prognoses. Unlike conventional tree-growing algorithms, which work on a subset of the data at every partition, the proposed method uses the whole data set and searches for the globally optimal split at each partition. The method is applied to a prostate cancer data set, and its performance is also evaluated in several simulation studies.
Abstract: Count data in many clinical studies often have excess zeros. These zeros usually represent a “disease-free state”. Although all such patients are disease (event) free at the time, some of them might be at high risk of having the putative outcome, while others may be at low or no such risk. We postulate that each zero is one of two types, either a ‘low risk’ or a ‘high risk’ zero for the disease process in question. Low risk zeros can arise from the absence of risk factors for disease initiation/progression and/or from a very early stage of the disease. High risk zeros can arise from the presence of significant risk factors for disease initiation/progression or, in rare situations, from misclassification, more specific diagnostic tests, or being below the level of detection. We use zero-inflated models, which allow us to assume that zeros arise from one of two separate latent processes, one giving low-risk zeros and the other high-risk zeros, and subsequently propose a strategy to identify and classify them as such. To illustrate, we use data on the number of involved nodes in breast cancer patients. Of the 1152 patients studied, 38.8% were node-negative (zeros). The model predicted that about a third of the node-negative patients (11.4% of all patients) are at “high risk” and the remainder (27.4% of all patients) are at “low risk” of nodal positivity. Posterior probability based classification was more appropriate than other methods. Our approach indicates that some node-negative patients may be re-assessed for their diagnosis of nodal positivity and/or for future clinical management of their disease. The approach developed here is applicable to any scenario where the disease or outcome can be characterized by count data.
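The posterior probability used to classify a zero can be sketched for the simplest zero-inflated model. This is an illustration only: it assumes a zero-inflated Poisson with a single mixing weight `pi` and rate `lam`, with no covariates, whereas the abstract does not say which count distribution or covariates were used. The component names are neutral ("always-zero" and "count process") because the abstract does not state which latent process corresponds to low-risk and which to high-risk zeros.

```python
import math


def zip_pmf(y, pi, lam):
    """Zero-inflated Poisson pmf: with probability pi the count is always zero,
    otherwise it follows Poisson(lam)."""
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    return (pi if y == 0 else 0.0) + (1.0 - pi) * poisson


def posterior_always_zero(pi, lam):
    """P(always-zero component | y = 0), by Bayes' rule."""
    return pi / zip_pmf(0, pi, lam)


def classify_zero(pi, lam, threshold=0.5):
    """Assign an observed zero to a latent component (hypothetical 0.5 cut-off)."""
    if posterior_always_zero(pi, lam) >= threshold:
        return "always-zero"
    return "count-process"


# Hypothetical parameter values, not estimates from the breast cancer data.
posterior = posterior_always_zero(0.3, 2.0)
component = classify_zero(0.3, 2.0)
```

In a regression setting `pi` and `lam` would vary by patient through covariates, so each zero would receive its own posterior probability.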
Partial Least Squares Discriminant Analysis (PLSDA) is a statistical classification method consisting of a classical Partial Least Squares regression in which the dependent variable is categorical and expresses the class membership of each observation. The aim of this study is twofold: to analyze the performance of PLSDA in correctly classifying the 28 European Union (EU) member countries and 7 candidate countries (Albania, Montenegro, Serbia, Macedonia FYR, and Turkey, together with the potential candidates Bosnia and Herzegovina and Kosova) into their pre-defined classes (candidate or member), and to determine which economic and/or demographic indicators are effective in the classification, using a data set obtained from the World Bank database.
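The description of PLSDA as a PLS regression on a class-membership variable can be sketched with a single latent component. This is a simplified illustration under stated assumptions, not the study's analysis: two classes coded 0/1, one component, no scaling of the indicators, a 0.5 cut-off on the fitted value, and made-up numbers in place of World Bank indicators. The function name `fit_plsda_one_component` is hypothetical.

```python
import math


def fit_plsda_one_component(X, y):
    """Fit a one-component PLS regression of a 0/1 class code on X and
    return a function that predicts the class of a new row."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centered predictors
    yc = [v - y_mean for v in y]                                # centered class code

    # Weight vector: direction in X with maximal covariance with y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]

    # Latent scores and the regression of y on those scores.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    q = sum(t[i] * yc[i] for i in range(n)) / sum(v * v for v in t)

    def predict(row):
        score = sum((row[j] - x_mean[j]) * w[j] for j in range(p))
        fitted = y_mean + q * score
        return 1 if fitted >= 0.5 else 0

    return predict, w


# Made-up data: two indicators, class 0 = candidate, class 1 = member.
X = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.2], [5.0, 8.0], [6.0, 9.0], [5.5, 8.5]]
y = [0, 0, 0, 1, 1, 1]
predict, weights = fit_plsda_one_component(X, y)
```

The absolute size of each entry in `weights` gives a rough indication of how much the corresponding indicator contributes to the separation, which is the spirit of identifying the indicators that are effective in classifying.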