Hybrid Density- and Partition-Based Clustering Algorithm for Data With Mixed-Type Variables

Wang, Shu; Yabes, Jonathan G.; Chang, Chung-Chou H.

doi:10.6339/21-JDS996

Journal of Data Science

Volume 19, Issue 1 (2021), pp. 15–36

Pub. online: 28 January 2021

Type: Statistical Data Science

Open Access

Received: 1 September 2020

Accepted: 1 October 2020

Published: 28 January 2021

Abstract

Clustering is an essential technique for discovering patterns in data. Many clustering algorithms have been developed to tackle the ever-increasing quantity and complexity of data, yet algorithms that can cluster data with mixed-type variables (continuous and categorical) remain limited despite the abundance of mixed-type data. Of the existing clustering methods for mixed data types, some posit unverifiable distributional assumptions and others rest on unbalanced contributions of the different variable types. To address these issues, we propose a two-step hybrid density- and partition-based (HyDaP) algorithm to detect clusters after variable selection. The first step uses both density-based and partition-based algorithms to identify the data structure formed by the continuous variables and to determine which variables (both continuous and categorical) are important for clustering. The second step uses a partition-based algorithm together with our proposed novel dissimilarity measure to obtain clustering results. Simulations across various scenarios were conducted to compare the HyDaP algorithm with other commonly used methods. Our HyDaP algorithm was applied to identify sepsis phenotypes and yielded important results.
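To make the second step more concrete, the following is a minimal Python sketch of partition-based clustering on a mixed-type dissimilarity matrix. It is not the HyDaP implementation: the abstract does not define the proposed dissimilarity measure, so a Gower-style coefficient (range-scaled absolute difference for continuous variables, simple matching for categorical ones) is swapped in as a stand-in, and a basic k-medoids loop plays the role of the partition-based algorithm. The density-based step and the variable selection are omitted. All function names, the initialization rule, and the toy data are hypothetical.

```python
from itertools import combinations


def gower_dissimilarity(a, b, ranges, is_categorical):
    """Gower-style dissimilarity between two mixed-type records (stand-in measure)."""
    total = 0.0
    for x, y, rng, cat in zip(a, b, ranges, is_categorical):
        if cat:
            total += 0.0 if x == y else 1.0  # simple matching for categories
        else:
            total += abs(x - y) / rng if rng > 0 else 0.0  # range-scaled difference
    return total / len(a)


def dissimilarity_matrix(rows, is_categorical):
    """Pairwise dissimilarities; ranges are computed per continuous column."""
    ranges = []
    for j, cat in enumerate(is_categorical):
        if cat:
            ranges.append(None)  # unused for categorical columns
        else:
            col = [r[j] for r in rows]
            ranges.append(max(col) - min(col))
    n = len(rows)
    d = [[0.0] * n for _ in range(n)]
    for i, k in combinations(range(n), 2):
        d[i][k] = d[k][i] = gower_dissimilarity(rows[i], rows[k], ranges, is_categorical)
    return d


def k_medoids(d, k, max_iter=100):
    """Basic alternating k-medoids on a precomputed dissimilarity matrix."""
    n = len(d)
    # Assumed initialization: start at point 0, then add the farthest points.
    medoids = [0]
    while len(medoids) < k:
        medoids.append(max(range(n), key=lambda i: min(d[i][m] for m in medoids)))

    def assign(meds):
        # Each point joins the cluster of its nearest medoid.
        return [min(range(k), key=lambda c: d[i][meds[c]]) for i in range(n)]

    for _ in range(max_iter):
        labels = assign(medoids)
        updated = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a cluster empties
                updated.append(medoids[c])
                continue
            # New medoid: the member with the smallest total dissimilarity.
            updated.append(min(members, key=lambda m: sum(d[m][j] for j in members)))
        if updated == medoids:
            break
        medoids = updated
    return assign(medoids), medoids


# Hypothetical toy data: two continuous variables and one categorical variable.
rows = [
    (1.0, 0.9, "a"), (1.2, 1.1, "a"), (0.9, 1.0, "a"),
    (5.0, 4.8, "b"), (5.2, 5.1, "b"), (4.9, 5.0, "b"),
]
is_categorical = [False, False, True]

d = dissimilarity_matrix(rows, is_categorical)
labels, medoids = k_medoids(d, k=2)
print(labels, medoids)
```

On this toy data the first three records and the last three records fall into separate clusters. Because every variable contributes equally to a Gower-style coefficient, this stand-in does not address the balance between variable types that the abstract raises as a limitation of existing methods.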

Supplementary material

The R code and a brief tutorial on implementing HyDaP are available on GitHub: https://github.com/gmailw1264648156/HyDaP.

© 2021 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.

Open access article under the CC BY license.

Keywords

mixed data; variable selection
