Hybrid Density- and Partition-Based Clustering Algorithm for Data With Mixed-Type Variables

Shu Wang (1,2), Jonathan G. Yabes (3,4), Chung-Chou H. Chang (3,4,∗)

1 Department of Biostatistics, College of Public Health and Health Professions, University of Florida
2 University of Florida Health Cancer Center
3 Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh
4 Department of Medicine, School of Medicine, University of Pittsburgh

∗Corresponding author. Email: changj@pitt.edu.

Journal of Data Science (JDS), 19(1): 15–36, 2021. Section: Statistical Data Science.
ISSN 1683-8602, 1680-743X. DOI: 10.6339/21-JDS996.
Publisher: School of Statistics, Renmin University of China.
Published 28 January 2021.

Supplementary Material

The R code and a brief tutorial on implementing the HyDaP algorithm are available on GitHub: https://github.com/gmailw1264648156/HyDaP.

Received September 2020; accepted October 2020.

© 2021 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China. Open access article under the CC BY license.

Clustering is an essential technique for discovering patterns in data. Many clustering algorithms have been developed to handle the ever-increasing quantity and complexity of data, yet algorithms that can cluster data with mixed variable types (continuous and categorical) remain limited despite the abundance of mixed-type data. Among the existing clustering methods for mixed data, some rely on unverifiable distributional assumptions or on unbalanced contributions from the different variable types. To address these issues, we propose a two-step hybrid density- and partition-based (HyDaP) algorithm that detects clusters after variable selection. The first step uses both density-based and partition-based algorithms to identify the data structure formed by the continuous variables and to determine which variables (both continuous and categorical) are important for clustering. The second step uses a partition-based algorithm together with our proposed dissimilarity measure to obtain the clustering results. Simulations across various scenarios were conducted to compare the HyDaP algorithm with other commonly used methods. We also applied the HyDaP algorithm to identify sepsis phenotypes, where it yielded important results.
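To make the second step more concrete, the following is a minimal Python sketch of partition-based clustering on a dissimilarity that combines continuous and categorical variables. It is not the HyDaP algorithm and not the dissimilarity measure proposed in the paper, which the abstract does not specify. As a stand-in it assumes a Gower-style dissimilarity (range-scaled absolute differences for continuous variables, simple mismatch for categorical ones) and a basic alternating k-medoids procedure. The names `mixed_dissimilarity`, `dissimilarity_matrix`, and `k_medoids`, and the toy data, are hypothetical. The authors' own R implementation is in the GitHub repository cited in the Supplementary Material.

```python
def mixed_dissimilarity(a, b, ranges, is_categorical):
    """Gower-style stand-in (an assumption, not the paper's measure):
    average of per-variable distances, each scaled to [0, 1]."""
    total = 0.0
    for x, y, r, cat in zip(a, b, ranges, is_categorical):
        if cat:
            total += 0.0 if x == y else 1.0   # simple mismatch for categories
        elif r:
            total += abs(x - y) / r           # range-scaled absolute difference
    return total / len(a)


def dissimilarity_matrix(rows, is_categorical):
    """Pairwise dissimilarities; ranges are computed for continuous columns only."""
    ranges = []
    for j, cat in enumerate(is_categorical):
        if cat:
            ranges.append(None)
        else:
            col = [row[j] for row in rows]
            ranges.append(max(col) - min(col))
    n = len(rows)
    return [[mixed_dissimilarity(rows[i], rows[j], ranges, is_categorical)
             for j in range(n)] for i in range(n)]


def k_medoids(dist, k, max_iter=100):
    """Basic alternating k-medoids on a precomputed dissimilarity matrix."""
    n = len(dist)
    # Farthest-point initialization (an illustrative, deterministic choice).
    medoids = [0]
    while len(medoids) < k:
        medoids.append(max(range(n), key=lambda i: min(dist[i][m] for m in medoids)))

    def assign(current):
        # Label each point with the index of its nearest medoid.
        return [min(range(len(current)), key=lambda c: dist[i][current[c]])
                for i in range(n)]

    for _ in range(max_iter):
        labels = assign(medoids)
        updated = []
        for c, old in enumerate(medoids):
            members = [i for i in range(n) if labels[i] == c]
            # New medoid: the member with the smallest total dissimilarity to its cluster.
            best = min(members, key=lambda m: sum(dist[m][j] for j in members)) if members else old
            updated.append(best)
        if updated == medoids:
            break
        medoids = updated
    return assign(medoids), medoids


# Toy data: two continuous variables and one categorical variable per row.
rows = [
    (1.0, 2.0, "a"), (1.2, 1.8, "a"), (0.9, 2.1, "a"),
    (8.0, 9.0, "b"), (8.3, 9.2, "b"), (7.9, 8.8, "b"),
]
is_categorical = [False, False, True]
dist = dissimilarity_matrix(rows, is_categorical)
labels, medoids = k_medoids(dist, k=2)
print(labels, medoids)
```

In this stand-in, every variable contributes a value between 0 and 1, so each variable carries equal weight regardless of type. How contributions should be balanced between continuous and categorical variables is exactly the issue the abstract raises, and the paper's own measure may weight them differently.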

Keywords: mixed data; variable selection
