Journal of Data Science

Hybrid Density- and Partition-Based Clustering Algorithm for Data With Mixed-Type Variables
Volume 19, Issue 1 (2021), pp. 15–36
Shu Wang, Jonathan G. Yabes, Chung-Chou H. Chang

https://doi.org/10.6339/21-JDS996
Pub. online: 28 January 2021
Type: Statistical Data Science

Received: 1 September 2020
Accepted: 1 October 2020
Published: 28 January 2021

Abstract

Clustering is an essential technique for discovering patterns in data. Many clustering algorithms have been developed to handle the ever-increasing quantity and complexity of data, yet algorithms that can cluster data with mixed-type variables (continuous and categorical) remain limited despite the abundance of such data. Of the existing clustering methods for mixed data types, some rely on unverifiable distributional assumptions and others allow the different variable types to contribute unequally to the result. To address these issues, we propose a two-step hybrid density- and partition-based (HyDaP) algorithm that detects clusters after variable selection. The first step uses both density-based and partition-based algorithms to identify the data structure formed by the continuous variables and to determine which variables (continuous and categorical) are important for clustering. The second step uses a partition-based algorithm together with our proposed novel dissimilarity measure to obtain the clustering results. Simulations across various scenarios were conducted to compare the HyDaP algorithm with other commonly used methods. The HyDaP algorithm was also applied to identify sepsis phenotypes, where it yielded important results.
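
As a rough illustration of the general idea behind the second step (a partition-based algorithm run on a dissimilarity defined for mixed-type data), the Python sketch below pairs a Gower-style dissimilarity with a basic k-medoids loop. It is not the HyDaP dissimilarity measure, it omits the variable-selection step entirely, and it is unrelated to the authors' R implementation. The function names `mixed_dissimilarity`, `pairwise_matrix`, and `k_medoids`, as well as the toy data, are hypothetical and chosen only for this example.

```python
def mixed_dissimilarity(a, b, cont_idx, cat_idx, ranges):
    """Gower-style dissimilarity: mean of per-variable terms, each in [0, 1]."""
    total = 0.0
    for j in cont_idx:
        # Continuous variable: absolute difference scaled by the observed range.
        if ranges[j] > 0:
            total += abs(a[j] - b[j]) / ranges[j]
    for j in cat_idx:
        # Categorical variable: simple matching (0 if equal, 1 otherwise).
        total += 0.0 if a[j] == b[j] else 1.0
    return total / (len(cont_idx) + len(cat_idx))


def pairwise_matrix(rows, cont_idx, cat_idx):
    """Symmetric matrix of mixed dissimilarities between all pairs of rows."""
    ranges = {j: max(r[j] for r in rows) - min(r[j] for r in rows)
              for j in cont_idx}
    n = len(rows)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(i + 1, n):
            d[i][k] = d[k][i] = mixed_dissimilarity(
                rows[i], rows[k], cont_idx, cat_idx, ranges)
    return d


def k_medoids(d, k, max_iter=100):
    """Basic alternating k-medoids on a precomputed dissimilarity matrix."""
    n = len(d)

    def assign(medoids):
        # Label each point with the position of its nearest medoid.
        return [min(range(k), key=lambda c: d[i][medoids[c]])
                for i in range(n)]

    # Deterministic start: the most central point, then farthest-first picks.
    medoids = [min(range(n), key=lambda i: sum(d[i]))]
    while len(medoids) < k:
        medoids.append(max(range(n),
                           key=lambda i: min(d[i][m] for m in medoids)))

    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                # Keep the old medoid if a cluster ends up empty.
                new_medoids.append(medoids[c])
                continue
            # New medoid: the member with the smallest total dissimilarity
            # to the other members of its cluster.
            new_medoids.append(
                min(members, key=lambda i: sum(d[i][m] for m in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assign(medoids), medoids


# Toy data (hypothetical): two continuous variables and one categorical.
rows = [
    (1.0, 10.0, "a"), (1.2, 11.0, "a"), (0.9, 9.5, "a"),
    (5.0, 30.0, "b"), (5.3, 31.0, "b"), (4.8, 29.0, "b"),
]
d = pairwise_matrix(rows, cont_idx=[0, 1], cat_idx=[2])
labels, medoids = k_medoids(d, k=2)
print(labels, medoids)
```

In this sketch every variable contributes a term between 0 and 1, which is one simple way to keep continuous and categorical variables on a comparable scale. How to balance those contributions is precisely the issue the abstract raises, so the equal weighting used here should be read as a placeholder rather than a recommendation.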

Supplementary material

The R code and a brief tutorial on implementing the HyDaP algorithm are available on GitHub: https://github.com/gmailw1264648156/HyDaP.

Copyright
© 2021 The Author(s).
This is a free to read article.

Keywords
mixed data; variable selection

Journal of data science

  • Online ISSN: 1683-8602
  • Print ISSN: 1680-743X
