Journal of Data Science (JDS)

ISSN: 1683-8602, 1680-743X

Publisher: School of Statistics, Renmin University of China

Article ID: JDS1140

DOI: 10.6339/24-JDS1140

Section: Statistical Data Science

A Two-Stage Classification for Dealing with Unseen Clusters in the Testing Data

Jung Wun Lee (1), jwlee@hsph.harvard.edu, https://orcid.org/0000-0002-6683-1822

Ofer Harel (2, ∗), ofer.harel@uconn.edu, https://orcid.org/0000-0002-1054-3055

1 Department of Biostatistics, Harvard University, Boston, MA, 02115, USA

2 Department of Statistics, University of Connecticut, Storrs, CT, 06269, USA

∗Corresponding author. Email: jwlee@hsph.harvard.edu or ofer.harel@uconn.edu.

Published 2025, Volume 23, Issue 1, pp. 188–207. Published online 2 July 2024.

Supplementary Material

• Supplementary document: The supplementary document provides the proofs of Theorems 1, 2, and 3, as well as additional numerical study results.

• Software: R code for the proposed methods and algorithms.

Received 21 January 2024; accepted 28 April 2024.

© 2025 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China. Open access article under the CC BY license.

Classification is an important statistical tool whose importance has grown since the emergence of the data science revolution. However, a training data set that does not capture all underlying population subgroups (or clusters) will result in biased estimates or misclassification. In this paper, we introduce a statistical and computational solution to the bias that can arise when classification is implemented on estimated population clusters. An unseen-cluster problem denotes the case in which the training data do not contain all underlying clusters in the population. Such a scenario may occur for various reasons, such as sampling errors, selection bias, or emerging and disappearing population clusters. When an unseen-cluster problem occurs, a testing observation from such a cluster will be misclassified, because a classification rule based on the sample cannot capture a cluster that was not observed in the training data (sample). To ameliorate the unseen-cluster problem, we propose a two-stage classification method. We also propose a test to identify the unseen-cluster problem and demonstrate the performance of the two-stage tailored classifier using simulations and a public data example.
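The abstract does not spell out the proposed test or classifier, so the following Python sketch is only an illustration of the general two-stage idea and should not be read as the authors' method. It assumes roughly Gaussian clusters with diagonal covariance, uses an empirical 99% quantile of training distances as the cutoff, and introduces hypothetical names (`fit_clusters`, `fit_threshold`, `two_stage_predict`, `UNSEEN`) that do not come from the paper.

```python
import random
import statistics

UNSEEN = "unseen"  # hypothetical label for points assigned to no training cluster


def fit_clusters(X, y):
    """Per-cluster feature means and standard deviations (diagonal covariance)."""
    params = {}
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        cols = list(zip(*rows))
        params[label] = ([statistics.fmean(c) for c in cols],
                         [statistics.stdev(c) for c in cols])
    return params


def sq_distance(x, mean, sd):
    """Squared standardized distance from x to one cluster centre."""
    return sum(((xi - m) / s) ** 2 for xi, m, s in zip(x, mean, sd))


def fit_threshold(X, y, params, level=0.99):
    """Empirical `level` quantile of training distances to the own cluster."""
    own = sorted(sq_distance(x, *params[lab]) for x, lab in zip(X, y))
    return own[min(len(own) - 1, int(level * len(own)))]


def two_stage_predict(x, params, threshold):
    # Stage 1: is x far from every cluster seen in training?
    dists = {label: sq_distance(x, *p) for label, p in params.items()}
    nearest = min(dists, key=dists.get)
    if dists[nearest] > threshold:
        return UNSEEN
    # Stage 2: ordinary classification among the seen clusters.
    return nearest


# Toy data (an assumption for illustration): two training clusters, A and B.
rng = random.Random(0)
X_train = ([[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
           + [[rng.gauss(5, 1), rng.gauss(5, 1)] for _ in range(200)])
y_train = ["A"] * 200 + ["B"] * 200

params = fit_clusters(X_train, y_train)
threshold = fit_threshold(X_train, y_train, params)

# The third test point lies far from both training clusters.
preds = [two_stage_predict(x, params, threshold)
         for x in ([0.0, 0.0], [5.0, 5.0], [-10.0, 10.0])]
print(preds)
```

In this toy setting a single-stage nearest-cluster rule would force the third point into A or B, whereas the screening stage flags it first. The distance measure and cutoff are placeholders; any outlier or novelty score could take their place.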

Keywords: classification; cluster analysis; open set recognition; outlier detection

This work was partially supported by the National Science Foundation under grant DMS-2015320.
