A Two-Stage Classification for Dealing with Unseen Clusters in the Testing Data

Lee, Jung Wun; Harel, Ofer

doi:10.6339/24-JDS1140

Journal of Data Science

Volume 23, Issue 1 (2025), pp. 188–207

Pub. online: 2 July 2024
Type: Statistical Data Science
Open Access

Received: 21 January 2024
Accepted: 28 April 2024
Published: 2 July 2024

Abstract

Classification is an important statistical tool whose importance has grown since the emergence of the data science revolution. However, a training data set that does not capture all underlying population subgroups (or clusters) will result in biased estimates or misclassification. In this paper, we introduce a statistical and computational solution to the bias that can arise when classification is implemented on estimated population clusters. An unseen-cluster problem denotes the case in which the training data does not contain all underlying clusters in the population. Such a scenario may occur for various reasons, such as sampling errors, selection bias, or emerging and disappearing population clusters. When an unseen-cluster problem occurs, a testing observation from an unseen cluster will be misclassified, because a classification rule based on the sample cannot capture a cluster that is not observed in the training data (sample). To ameliorate the unseen-cluster problem, we suggest a two-stage classification method. We also suggest a test to identify the unseen-cluster problem and demonstrate the performance of the two-stage tailored classifier using simulations and a public data example.
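
To make the unseen-cluster problem concrete, here is a minimal Python sketch. It is not the authors' procedure, whose details are not given in the abstract; it only illustrates the general idea of screening a test observation before classifying it. Everything in it is an assumption made for illustration: one-dimensional data, the labels "A" and "B", the hypothetical helper names fit_clusters and two_stage_classify, and a simple three-standard-deviation cutoff standing in for a formal test.

```python
from statistics import mean, stdev


def fit_clusters(train):
    """Estimate a (mean, standard deviation) pair for each labelled cluster."""
    return {label: (mean(xs), stdev(xs)) for label, xs in train.items()}


def two_stage_classify(x, clusters, z_cut=3.0):
    """Illustrative two-stage rule (an assumption, not the paper's method).

    Stage 1: flag x as "unseen" if it lies more than z_cut standard
             deviations from every cluster seen in training.
    Stage 2: otherwise assign x to the closest seen cluster.
    """
    z = {label: abs(x - m) / s for label, (m, s) in clusters.items()}
    nearest = min(z, key=z.get)
    if z[nearest] > z_cut:
        return "unseen"
    return nearest


# Training data contains only two of the population's clusters.
train = {"A": [-1.0, -0.5, 0.0, 0.5, 1.0], "B": [4.0, 4.5, 5.0, 5.5, 6.0]}
clusters = fit_clusters(train)

# A one-stage nearest-mean rule is forced to pick a seen cluster for x = 20.
naive = min(clusters, key=lambda k: abs(20.0 - clusters[k][0]))

# The two-stage rule flags the same observation instead of misclassifying it.
labels = [two_stage_classify(x, clusters) for x in (0.2, 5.3, 20.0)]
print(naive, labels)
```

In this toy setting the one-stage rule assigns the observation at 20.0 to cluster "B" even though it is far from both training clusters, whereas the screening stage sets it aside as belonging to a possibly unseen cluster.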

Supplementary material

• Supplementary document: provides the proofs of Theorems 1, 2, and 3, and additional numerical study results.
• Software: R code for the proposed methods and algorithms.

© 2025 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.

Open access article under the CC BY license.

Keywords

classification, cluster analysis, open set recognition, outlier detection

Funding

This work was partially supported by the National Science Foundation under grant DMS-2015320.
