Journal of Data Science

A Two-Stage Classification for Dealing with Unseen Clusters in the Testing Data
Volume 23, Issue 1 (2025), pp. 188–207
Jung Wun Lee, Ofer Harel

https://doi.org/10.6339/24-JDS1140
Pub. online: 2 July 2024      Type: Statistical Data Science      Open Access

Received: 21 January 2024
Accepted: 28 April 2024
Published: 2 July 2024

Abstract

Classification is a statistical tool whose importance has grown since the emergence of the data science revolution. However, a training data set that does not capture all underlying population subgroups (or clusters) will result in biased estimates or misclassification. In this paper, we introduce a statistical and computational solution to a possible bias in classification when implemented on estimated population clusters. An unseen-cluster problem denotes the case in which the training data does not contain all underlying clusters in the population. Such a scenario may occur for various reasons, such as sampling errors, selection bias, or emerging and disappearing population clusters. When an unseen-cluster problem occurs, a testing observation from an unseen cluster will be misclassified, because a classification rule based on the training sample cannot capture a cluster that the sample does not contain. To address this, we propose a two-stage classification method. We also propose a test to identify the unseen-cluster problem and demonstrate the performance of the two-stage tailored classifier using simulations and a public data example.
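
The abstract does not spell out the two stages, so the following is only a minimal illustrative sketch of the general idea, not the authors' method. It assumes Gaussian clusters with a shared covariance matrix, one cluster per class label, and a chi-square cutoff on the squared Mahalanobis distance. All names (`fit_clusters`, `two_stage_predict`, `UNSEEN`) and the simulated data are hypothetical.

```python
import numpy as np

UNSEEN = -1  # hypothetical label for "belongs to no training cluster"

def fit_clusters(X, y):
    """Estimate per-class means and the inverse of a pooled covariance matrix."""
    classes = np.unique(y)
    means = {int(c): X[y == c].mean(axis=0) for c in classes}
    centered = np.vstack([X[y == c] - means[int(c)] for c in classes])
    cov = centered.T @ centered / (len(X) - len(classes))
    return classes, means, np.linalg.inv(cov)

def two_stage_predict(x, classes, means, cov_inv, threshold):
    """Stage 1 flags points far from every training cluster; stage 2 classifies the rest."""
    diffs = [x - means[int(c)] for c in classes]
    dists = np.array([d @ cov_inv @ d for d in diffs])  # squared Mahalanobis distances
    if dists.min() > threshold:
        return UNSEEN  # stage 1: no observed cluster explains this point
    return int(classes[int(dists.argmin())])  # stage 2: nearest observed cluster

# Simulated training data with two observed clusters (assumed for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(5.0, 1.0, size=(100, 2))])
y = np.repeat([0, 1], 100)
classes, means, cov_inv = fit_clusters(X, y)

# 0.99 quantile of a chi-square distribution with 2 degrees of freedom,
# which has the closed form -2 * ln(1 - 0.99).
threshold = -2.0 * np.log(0.01)

for point in ([0.2, -0.1], [5.1, 4.8], [20.0, -20.0]):
    print(point, two_stage_predict(np.array(point), classes, means, cov_inv, threshold))
```

A single-stage nearest-cluster rule would have to assign the third point to one of the two observed clusters. Screening first lets the classifier decline to label it, which is the behavior the unseen-cluster setting calls for.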

Supplementary material

  • Supplementary document: proofs of Theorems 1, 2, and 3, and additional numerical study results.
  • Software: R code for the proposed methods and algorithms.



Copyright
2025 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.
Open access article under the CC BY license.

Keywords
classification, cluster analysis, open set recognition, outlier detection

Funding
This work was partially supported by the National Science Foundation under grant DMS-2015320.


Journal of Data Science

  • Online ISSN: 1683-8602
  • Print ISSN: 1680-743X
