Classification is an important statistical tool, and its importance has only increased since the emergence of the data science revolution. However, a training data set that does not capture all underlying population subgroups (or clusters) will result in biased estimates or misclassification. In this paper, we introduce a statistical and computational solution to a possible bias that arises when classification is implemented on estimated population clusters. The unseen-cluster problem denotes the case in which the training data do not contain all underlying clusters in the population. Such a scenario may occur for various reasons, such as sampling errors, selection bias, or emerging and disappearing population clusters. Once an unseen-cluster problem occurs, a test observation will be misclassified, because a classification rule built from the training sample cannot capture a cluster that was never observed. To overcome such issues, we suggest a two-stage classification method that ameliorates the unseen-cluster problem. We also suggest a test to identify the unseen-cluster problem and demonstrate the performance of the two-stage tailored classifier using simulations and a public data example.
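As a rough illustration of the general two-stage idea (not the authors' specific method or test), the sketch below first flags test observations that lie far from every cluster seen in the training data and only then applies an ordinary classifier to the rest. The clustering step, the distance threshold, and the classifier are all assumptions made here purely for illustration.

```python
# Hypothetical two-stage sketch for the unseen-cluster problem:
# stage 1 flags test points far from every training-data cluster,
# stage 2 classifies the remaining points with a standard classifier.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def two_stage_fit(X_train, y_train, n_clusters=3, quantile=0.99):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X_train)
    # distance of each training point to its nearest cluster centre
    d_train = np.min(km.transform(X_train), axis=1)
    threshold = np.quantile(d_train, quantile)      # illustrative cut-off
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return km, threshold, clf

def two_stage_predict(X_test, km, threshold, clf):
    d_test = np.min(km.transform(X_test), axis=1)
    unseen = d_test > threshold                     # stage 1: possible unseen cluster
    labels = clf.predict(X_test).astype(object)     # stage 2: ordinary classification
    labels[unseen] = "possible unseen cluster"
    return labels
```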
Abstract: Despite its unrealistic assumption of feature independence, the naive Bayes classifier is simple yet competes well with more sophisticated classifiers under the zero-one loss function when assigning an observation to a class given the observed features. However, it has been shown that naive Bayes can perform poorly in both estimation and classification when the features are correlated. Accordingly, researchers have developed many approaches that free the naive Bayes of this basic but rarely satisfied assumption. In this paper, we propose a new classifier that is also free of the independence assumption; it models the dependence among features through pair copulas constructed via a graphical model called a D-vine tree. This tree structure decomposes the multivariate dependence into many bivariate dependencies, making it possible to evaluate the dependence of features easily and efficiently even for data with high dimension and large sample size. We further extend the proposed method to features with discrete-valued entries. Experimental studies show that the proposed method performs well for both continuous and discrete cases.
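For orientation, the standard D-vine pair-copula decomposition that such a classifier would evaluate for each class-conditional density can be written as follows (standard vine-copula notation, not taken from the paper itself):

\[
f(x_1,\dots,x_d) \;=\; \prod_{k=1}^{d} f_k(x_k)\;
\prod_{j=1}^{d-1}\prod_{i=1}^{d-j}
c_{i,\,i+j \mid i+1,\dots,i+j-1}\!\Bigl(F(x_i \mid x_{i+1},\dots,x_{i+j-1}),\;
F(x_{i+j} \mid x_{i+1},\dots,x_{i+j-1})\Bigr),
\]

where the \(f_k\) are the marginal densities and each \(c_{\cdot,\cdot\mid\cdot}\) is a bivariate pair-copula density. The naive Bayes assumption corresponds to setting every pair copula to the independence copula (\(c \equiv 1\)), which leaves only the product of marginals.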
Abstract: Count data often have excess zeros in many clinical studies. These zeros usually represent a “disease-free” state. Although disease (event) free at the time, some of these patients may be at high risk of developing the putative outcome while others may be at low or no such risk. We postulate that these zeros are of one of two types, either ‘low risk’ or ‘high risk’ zeros with respect to the disease process in question. Low-risk zeros can arise from the absence of risk factors for disease initiation/progression and/or from a very early stage of the disease. High-risk zeros can arise from the presence of significant risk factors for disease initiation/progression or, in rare situations, from misclassification, more specific diagnostic tests, or counts below the level of detection. We use zero-inflated models, which allow us to assume that zeros arise from one of two separate latent processes, one giving low-risk zeros and the other high-risk zeros, and subsequently propose a strategy to identify and classify them as such. To illustrate, we use data on the number of involved nodes in breast cancer patients. Of the 1152 patients studied, 38.8% were node-negative (zeros). The model predicted that about a third of the negative nodes (11.4% of all patients) are at “high risk” and the remaining (27.4%) are at “low risk” of nodal positivity. Posterior-probability-based classification was more appropriate than the other methods considered. Our approach indicates that some node-negative patients may be re-assessed regarding nodal positivity and/or for the future clinical management of their disease. The approach developed here is applicable to any scenario in which the disease or outcome can be characterized by count data.
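A minimal sketch of the usual zero-inflated Poisson formulation, and of the posterior probability used to classify an observed zero, may help; the Poisson count component and the labelling of the point-mass zeros as one of the two risk groups are assumptions made here for illustration, since the abstract does not state the exact model:

\[
P(Y=0) \;=\; \pi + (1-\pi)\,e^{-\lambda}, \qquad
P(Y=k) \;=\; (1-\pi)\,\frac{e^{-\lambda}\lambda^{k}}{k!}, \quad k \ge 1,
\]
\[
P(\text{point-mass zero} \mid Y=0) \;=\; \frac{\pi}{\pi + (1-\pi)\,e^{-\lambda}},
\]

so that an observed zero can be assigned to one latent process or the other according to whether this posterior probability exceeds a chosen cut-off such as 0.5.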
Partial Least Squares Discriminant Analysis (PLSDA) is a statistical method for classification that consists of a classical Partial Least Squares Regression in which the dependent variable is a categorical one expressing the class membership of each observation. The aim of this study is twofold: to analyze the performance of the PLSDA method in correctly classifying 28 European Union (EU) member countries and 7 candidate countries (Albania, Montenegro, Serbia, Macedonia FYR, and Turkey, together with the potential candidates Bosnia and Herzegovina and Kosova) into their pre-defined classes (candidate or member), and to determine which economic and/or demographic indicators are effective in this classification, using a data set obtained from the World Bank database.
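As a sketch of how PLSDA can be run in practice (using scikit-learn rather than whatever software this study used), the class label is dummy coded and fed to an ordinary PLS regression, and the predicted scores are then thresholded to assign each country to “member” or “candidate”. The variable names, number of components, and the 0.5 cut-off below are illustrative assumptions.

```python
# Minimal PLSDA sketch: PLS regression on a dummy-coded class label,
# followed by thresholding of the predicted score.  X holds the economic
# and demographic indicators; y is the known membership label.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def plsda_fit_predict(X_train, y_train, X_test, n_components=2, cutoff=0.5):
    # y_train: 1 = EU member, 0 = candidate (illustrative coding)
    pls = PLSRegression(n_components=n_components)
    pls.fit(X_train, y_train.astype(float))
    scores = pls.predict(X_test).ravel()            # continuous PLS predictions
    return np.where(scores >= cutoff, "member", "candidate")
```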
Abstract: Searching for data structure and decision rules using classification and regression tree (CART) methodology is now well established. An alternative procedure, search partition analysis (SPAN), is less well known. Both provide classifiers based on Boolean structures; in CART these are generated by a hierarchical series of local sub-searches, and in SPAN by a global search. One issue with CART is its perceived instability; another is the awkward nature of the Boolean structures generated by a hierarchical tree. Instability arises because the final tree structure is sensitive to early splits. SPAN, as a global search, seems more likely to yield stable partitions. To examine these issues in the context of identifying mothers at risk of giving birth to low-birth-weight babies, we took a very large sample, divided it at random into ten non-overlapping sub-samples, and performed SPAN and CART analyses on each sub-sample. The stability of the SPAN and CART models is described and, in addition, the structure of the Boolean representation of the classifiers is examined. It is found that SPAN partitions have more intrinsic stability and are less prone to Boolean structural irregularities.
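SPAN has no widely available open-source implementation, but the sub-sampling protocol described above is easy to reproduce for the CART side of the comparison. The sketch below fits a CART model to each of ten non-overlapping random sub-samples and records the variable chosen at the root split as a crude stability indicator; that particular stability summary is an assumption made here for illustration, not the authors' measure.

```python
# Sketch of the ten-fold sub-sampling protocol: split the data at random into
# ten non-overlapping sub-samples, fit a CART model to each, and record which
# variable is chosen at the root split as a simple indicator of stability.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def root_split_stability(X, y, n_subsamples=10, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    root_features = []
    for chunk in np.array_split(idx, n_subsamples):
        tree = DecisionTreeClassifier(max_depth=4, random_state=0)
        tree.fit(X[chunk], y[chunk])
        root_features.append(tree.tree_.feature[0])  # feature index at the root node
    return root_features  # identical entries across sub-samples suggest a stable first split
```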