Classification is an important statistical tool that has grown in importance since the emergence of the data science revolution. However, a training data set that does not capture all underlying population subgroups (or clusters) will result in biased estimates or misclassification. In this paper, we introduce a statistical and computational solution to a possible bias in classification when implemented on estimated population clusters. An unseen-cluster problem denotes the case in which the training data do not contain all underlying clusters in the population. Such a scenario may occur for various reasons, such as sampling errors, selection bias, or emerging and disappearing population clusters. Once an unseen-cluster problem occurs, a test observation from an unobserved cluster will be misclassified, because a classification rule based on the sample cannot capture a cluster that is absent from the training data. To overcome this issue, we propose a two-stage classification method that ameliorates the unseen-cluster problem. We also propose a test to identify the unseen-cluster problem and demonstrate the performance of the two-stage tailored classifier using simulations and a public data example.
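To make the two-stage idea concrete, the following is a minimal sketch and not the authors' procedure. It assumes a nearest-centroid rule and a simple distance threshold as a stand-in for the proposed test. The names `fit`, `classify`, and `slack` are hypothetical. Stage one flags a test observation that lies far from every cluster seen in training; stage two applies the ordinary classification rule only to observations that are not flagged.

```python
import math
from statistics import mean


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    return tuple(mean(c) for c in zip(*points))


def fit(train):
    """train maps a cluster label to its training points.

    For each observed cluster, store its centroid and the largest
    centroid-to-point distance seen in training (its 'radius').
    """
    model = {}
    for label, pts in train.items():
        c = centroid(pts)
        radius = max(math.dist(p, c) for p in pts)
        model[label] = (c, radius)
    return model


def classify(model, x, slack=1.5):
    """Two-stage rule (illustrative assumption, not the paper's test).

    Stage 1: find the nearest observed cluster and check whether x lies
             within slack * radius of its centroid.
    Stage 2: if so, return that cluster's label; otherwise report that x
             may belong to a cluster absent from the training data.
    """
    label, (c, radius) = min(
        model.items(), key=lambda kv: math.dist(x, kv[1][0])
    )
    if math.dist(x, c) > slack * radius:
        return "unseen"
    return label
```

In this sketch the threshold `slack` trades off false alarms against missed unseen clusters; a formal test, such as the one proposed in the paper, would replace this ad hoc cutoff.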
Abstract: The application of linear mixed models or generalized linear mixed models to large databases in which the level 2 units (hospitals) have a wide variety of characteristics is a problem frequently encountered in studies of medical quality. Accurate estimation of model parameters and standard errors requires accounting for the grouping of outcomes within hospitals. Including the hospitals as a random effect in the model is a common method of doing so. However, in a large, diverse population, the required assumptions are not satisfied, which can lead to inconsistent and biased parameter estimates. One solution is to use cluster analysis, with clustering variables distinct from the model covariates, to group the hospitals into smaller, more homogeneous groups. The analysis can then be carried out within these groups. We illustrate this analysis using an example of a study of hemoglobin A1c control among diabetic patients in a national database of United States Department of Veterans’ Affairs (VA) hospitals.
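The grouping step described above can be sketched as follows. This is an illustration under stated assumptions, not the study's actual procedure: it uses plain k-means (the abstract does not say which clustering method was used), and the function names, hospital identifiers, and record fields are hypothetical. The clustering variables are hospital-level features kept separate from the model covariates, and the mixed model would then be fitted separately to each returned group with a statistics package of choice.

```python
import math
from statistics import mean


def kmeans(points, k, iters=50):
    """Plain Lloyd's algorithm; initial centers are the first k points.

    Features should be put on a common scale beforehand, since the
    Euclidean distance is otherwise dominated by large-valued variables.
    """
    centers = [tuple(p) for p in points[:k]]
    assign = []
    for _ in range(iters):
        # Assign each point to its nearest current center.
        assign = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Recompute each center as the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                new_centers.append(tuple(mean(c) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return assign


def group_hospitals(hospital_features, k):
    """Map each hospital id to a cluster label from its clustering variables."""
    ids = sorted(hospital_features)
    labels = kmeans([hospital_features[h] for h in ids], k)
    return dict(zip(ids, labels))


def split_patients(patients, hospital_group):
    """Partition patient records by the cluster of their hospital."""
    groups = {}
    for rec in patients:
        groups.setdefault(hospital_group[rec["hospital"]], []).append(rec)
    return groups
```

Each list returned by `split_patients` is the data for one within-group analysis, with hospitals still entering as a random effect inside that more homogeneous group.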
Abstract: A new set of methods is developed to perform cluster analysis of functions, motivated by a data set consisting of hydraulic gradients at several locations distributed across a wetland complex. The methods build on previous work on clustering of functions, such as Tarpey and Kinateder (2003) and Hitchcock et al. (2007), but explore functions generated from an additive model decomposition (Wood, 2006) of the original time series. Our decomposition targets two aspects of the series, using an adaptive smoother for the trend and a circular spline for the diurnal variation in the series. Different measures for comparing locations are discussed, including a method for efficiently clustering time series of different lengths using a functional data approach. The complicated nature of these wetlands is highlighted by the shifting group memberships, which depend on the scale of variation and the year of the study considered.