Hybrid Density- and Partition-Based Clustering Algorithm for Data With Mixed-Type Variables
Volume 19, Issue 1 (2021), pp. 15–36
Pub. online: 28 January 2021
Type: Statistical Data Science
Received
1 September 2020
Accepted
1 October 2020
Published
28 January 2021
Abstract
Clustering is an essential technique for discovering patterns in data. Many clustering algorithms have been developed to tackle the ever-increasing quantity and complexity of data, yet algorithms that can cluster data with mixed-type variables (continuous and categorical) remain limited despite the abundance of such data. Of the existing clustering methods for mixed data types, some rely on unverifiable distributional assumptions, and others let the different variable types contribute unequally to the result. To address these issues, we propose a two-step hybrid density- and partition-based (HyDaP) algorithm that detects clusters after variable selection. The first step combines density-based and partition-based algorithms to identify the data structure formed by the continuous variables and to determine which variables (both continuous and categorical) are important for clustering. The second step combines a partition-based algorithm with our proposed novel dissimilarity measure to obtain the clustering results. Simulations across various scenarios were conducted to compare the HyDaP algorithm with other commonly used methods. The HyDaP algorithm was also applied to identify sepsis phenotypes and yielded important results.
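To make the second step more concrete, the following is a minimal sketch of partition-based clustering on a mixed-type dissimilarity matrix. It is not the HyDaP implementation: the abstract does not specify the proposed dissimilarity measure, so a Gower-style dissimilarity is swapped in as a stand-in, and the partitioning is a basic alternating k-medoids procedure. The function names `gower_dissimilarity` and `k_medoids`, the toy data, and all parameters are illustrative assumptions.

```python
import numpy as np


def gower_dissimilarity(X_cont, X_cat):
    """Gower-style dissimilarity for mixed data.

    Assumption: a stand-in for illustration, NOT the HyDaP measure.
    """
    X_cont = np.asarray(X_cont, dtype=float)
    X_cat = np.asarray(X_cat)
    # Range-normalise each continuous variable so it contributes in [0, 1].
    ranges = X_cont.max(axis=0) - X_cont.min(axis=0)
    ranges[ranges == 0] = 1.0
    d_cont = np.abs(X_cont[:, None, :] - X_cont[None, :, :]) / ranges
    # Simple mismatch (0 or 1) for each categorical variable.
    d_cat = (X_cat[:, None, :] != X_cat[None, :, :]).astype(float)
    n_vars = X_cont.shape[1] + X_cat.shape[1]
    return (d_cont.sum(axis=2) + d_cat.sum(axis=2)) / n_vars


def k_medoids(D, k, n_iter=100, seed=0):
    """Basic alternating k-medoids on a precomputed dissimilarity matrix."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(D.shape[0], size=k, replace=False)
    for _ in range(n_iter):
        # Assign every observation to its nearest medoid.
        labels = np.argmin(D[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if members.size == 0:
                continue  # keep the old medoid for an empty cluster
            # New medoid: the member with the smallest total
            # dissimilarity to the other members of its cluster.
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    return labels, medoids


# Toy data (hypothetical): two well-separated groups,
# two continuous variables and one categorical variable.
X_cont = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                   [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])
X_cat = np.array([["a"], ["a"], ["a"], ["b"], ["b"], ["b"]])

D = gower_dissimilarity(X_cont, X_cat)
labels, medoids = k_medoids(D, k=2)
print(labels, medoids)
```

In this sketch every variable receives equal weight, which is exactly the kind of design choice the abstract flags as a potential source of unbalanced contributions. The variable selection in the first step and the proposed measure itself are not reproduced here.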
Supplementary material
The R code and a brief tutorial on implementing HyDaP are available on GitHub: https://github.com/gmailw1264648156/HyDaP.