Classification is an important statistical tool that has increased its importance since the emergence of the data science revolution. However, a training data set that does not capture all underlying population subgroups (or clusters) will result in biased estimates or misclassification. In this paper, we introduce a statistical and computational solution to a possible bias in classification when implemented on estimated population clusters. An unseen-cluster problem denotes the case in which the training data does not contain all underlying clusters in the population. Such a scenario may occur due to various reasons, such as sampling errors, selection bias, or emerging and disappearing population clusters. Once an unseen-cluster problem occurs, a testing observation will be misclassified because a classification rule based on the sample cannot capture a cluster not observed in the training data (sample). To overcome such issues, we suggest a two-stage classification method to ameliorate the unseen-cluster problem in classification. We suggest a test to identify the unseen-cluster problem and demonstrate the performance of the two-stage tailored classifier using simulations and a public data example.
Bayesian methods provide direct uncertainty quantification in functional data analysis applications without reliance on bootstrap techniques. A major tool in functional data applications is the functional principal component analysis which decomposes the data around a common mean function and identifies leading directions of variation. Bayesian functional principal components analysis (BFPCA) provides uncertainty quantification on the estimated functional model components via the posterior samples obtained. We propose central posterior envelopes (CPEs) for BFPCA based on functional depth as a descriptive visualization tool to summarize variation in the posterior samples of the estimated functional model components, contributing to uncertainty quantification in BFPCA. The proposed BFPCA relies on a latent factor model and targets model parameters within a hierarchical modeling framework using modified multiplicative gamma process shrinkage priors on the variance components. Functional depth provides a center-outward order to a sample of functions. We utilize modified band depth and modified volume depth for ordering of a sample of functions and surfaces, respectively, to derive at CPEs of the mean and eigenfunctions within the BFPCA framework. The proposed CPEs are showcased in extensive simulations. Finally, the proposed CPEs are applied to the analysis of a sample of power spectral densities from resting state electroencephalography where they lead to novel insights on diagnostic group differences among children diagnosed with autism spectrum disorder and their typically developing peers across age.
Abstract: The application of linear mixed models or generalized linear mixed models to large databases in which the level 2 units (hospitals) have a wide variety of characteristics is a problem frequently encountered in studies of medical quality. Accurate estimation of model parameters and standard errors requires accounting for the grouping of outcomes within hospitals. Including the hospitals as random effect in the model is a common method of doing so. However in a large, diverse population, the required assump tions are not satisfied, which can lead to inconsistent and biased parameter estimates. One solution is to use cluster analysis with clustering variables distinct from the model covariates to group the hospitals into smaller, more homogeneous groups. The analysis can then be carried out within these groups. We illustrate this analysis using an example of a study of hemoglobin A1c control among diabetic patients in a national database of United States Department of Veterans’ Affairs (VA) hospitals.
Abstract: This paper describes and compares three clustering techniques: traditional clustering methods, Kohonen maps and latent class models. The paper also proposes some novel measures of the quality of a clustering. To the best of our knowledge, this is the first contribution in the literature to compare these three techniques in a context where the classes are not known in advance.
Abstract: State lotteries employ sales projections to determine appropri ate advertised jackpot levels for some of their games. This paper focuses on prediction of sales for the Lotto Texas game of the Texas Lottery. A novel prediction method is developed in this setting that utilizes functional data analysis concepts in conjunction with a Bayesian paradigm to produce predictions and associated precision assessments.