A Two-Stage Classification for Dealing with Unseen Clusters in the Testing Data
Pub. online: 2 July 2024
Type: Statistical Data Science
Open Access
Received: 21 January 2024
Accepted: 28 April 2024
Published: 2 July 2024
Abstract
Classification is an important statistical tool whose importance has grown with the rise of data science. However, a training data set that does not capture all underlying population subgroups (or clusters) will result in biased estimates or misclassification. In this paper, we introduce a statistical and computational solution to this potential bias in classification when it is applied to estimated population clusters. An unseen-cluster problem denotes the case in which the training data do not contain all underlying clusters in the population. Such a scenario may arise for various reasons, such as sampling error, selection bias, or emerging and disappearing population clusters. When an unseen-cluster problem occurs, a testing observation from an unseen cluster will be misclassified, because a classification rule based on the sample cannot capture a cluster not observed in the training data (sample). To address this issue, we propose a two-stage classification method that ameliorates the unseen-cluster problem. We also propose a test to identify the unseen-cluster problem and demonstrate the performance of the two-stage tailored classifier using simulations and a public data example.
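To make the two-stage idea concrete, the following is a minimal illustrative sketch and not the procedure proposed in the paper. It assumes that each training cluster can be summarized by per-feature means and variances, that "far from every seen cluster" is judged by a variance-scaled squared distance, and that a user-chosen cutoff separates seen from unseen. The function names, the `UNSEEN` label, and the cutoff are all hypothetical choices made for illustration.

```python
import math
from statistics import mean, pvariance

UNSEEN = "unseen"  # hypothetical label for points flagged in stage 1


def fit_clusters(X, y):
    """Estimate per-feature means and variances for each labeled training cluster."""
    params = {}
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        cols = list(zip(*rows))
        params[label] = ([mean(c) for c in cols], [pvariance(c) for c in cols])
    return params


def scaled_sq_distance(x, means, variances):
    """Squared distance to a cluster centre, each feature scaled by its variance."""
    return sum((xi - m) ** 2 / v for xi, m, v in zip(x, means, variances))


def two_stage_predict(x, params, threshold):
    """Flag x as unseen if it is far from all training clusters, else classify it."""
    dists = {lab: scaled_sq_distance(x, m, v) for lab, (m, v) in params.items()}
    nearest = min(dists, key=dists.get)
    # Stage 1 (assumed rule): too far from every seen cluster -> unseen.
    if dists[nearest] > threshold:
        return UNSEEN
    # Stage 2: ordinary nearest-cluster classification among seen clusters.
    return nearest


# Toy training data with two seen clusters, "a" and "b".
X_train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0),
           (5.0, 5.0), (6.0, 5.0), (5.0, 6.0), (6.0, 6.0)]
y_train = ["a", "a", "a", "a", "b", "b", "b", "b"]
params = fit_clusters(X_train, y_train)

# Assumed cutoff: the 0.99 quantile of a chi-square distribution with 2 degrees
# of freedom, which has the closed form -2 * ln(1 - 0.99).
threshold = -2.0 * math.log(0.01)

for point in [(0.4, 0.6), (5.2, 5.9), (0.5, 10.0)]:
    print(point, two_stage_predict(point, params, threshold))
```

In this toy setting the third point lies far from both training clusters, so a single-stage nearest-cluster rule would force it into one of them, whereas the first stage flags it instead. The chi-square cutoff is only a convenient default under a Gaussian assumption with independent features; any real application would need a detection rule justified for its own data.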
Supplementary material
• Supplementary document: Provides the proofs of Theorems 1, 2, and 3, and additional numerical study results.
• Software: R code for the proposed methods and algorithms.