<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">JDS</journal-id>
<journal-title-group><journal-title>Journal of Data Science</journal-title></journal-title-group>
<issn pub-type="epub">1683-8602</issn>
<issn pub-type="ppub">1680-743X</issn>
<issn-l>1680-743X</issn-l>
<publisher>
<publisher-name>School of Statistics, Renmin University of China</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">JDS996</article-id>
<article-id pub-id-type="doi">10.6339/21-JDS996</article-id>
<article-categories><subj-group subj-group-type="heading">
<subject>Statistical Data Science</subject></subj-group></article-categories>
<title-group>
<article-title>Hybrid Density- and Partition-Based Clustering Algorithm for Data With Mixed-Type Variables</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Wang</surname><given-names>Shu</given-names></name><xref ref-type="aff" rid="j_jds996_aff_001">1</xref><xref ref-type="aff" rid="j_jds996_aff_002">2</xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Yabes</surname><given-names>Jonathan G.</given-names></name><xref ref-type="aff" rid="j_jds996_aff_003">3</xref><xref ref-type="aff" rid="j_jds996_aff_004">4</xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Chang</surname><given-names>Chung-Chou H.</given-names></name><email xlink:href="mailto:changj@pitt.edu">changj@pitt.edu</email><xref ref-type="aff" rid="j_jds996_aff_003">3</xref><xref ref-type="aff" rid="j_jds996_aff_004">4</xref><xref ref-type="corresp" rid="cor1">∗</xref>
</contrib>
<aff id="j_jds996_aff_001"><label>1</label>Department of Biostatistics, College of Public Health and Health Professions, <institution>University of Florida</institution></aff>
<aff id="j_jds996_aff_002"><label>2</label><institution>University of Florida Health Cancer Center</institution></aff>
<aff id="j_jds996_aff_003"><label>3</label>Department of Biostatistics, Graduate School of Public Health, <institution>University of Pittsburgh</institution></aff>
<aff id="j_jds996_aff_004"><label>4</label>Department of Medicine, School of Medicine, <institution>University of Pittsburgh</institution></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>∗</label>Corresponding author. Email: <ext-link ext-link-type="uri" xlink:href="mailto:changj@pitt.edu">changj@pitt.edu</ext-link>.</corresp>
</author-notes>
<pub-date pub-type="ppub"><year>2021</year></pub-date><pub-date pub-type="epub"><day>28</day><month>1</month><year>2021</year></pub-date>
<volume>19</volume><issue>1</issue><fpage>15</fpage><lpage>36</lpage>
<supplementary-material id="S1" content-type="archive" xlink:href="jds996_s001.zip" mimetype="application" mime-subtype="x-zip-compressed">
<caption>
<title>Supplementary Material</title>
<p>The R code and a brief tutorial on implementing the HyDaP algorithm are available on GitHub: <uri>https://github.com/gmailw1264648156/HyDaP</uri>.</p>
</caption>
</supplementary-material>
<history>
<date date-type="received"><month>9</month><year>2020</year></date>
<date date-type="accepted"><month>10</month><year>2020</year></date>
</history>
<permissions><copyright-statement>2021 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.</copyright-statement><copyright-year>2021</copyright-year>
<license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>Open access article under the <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">CC BY</ext-link> license.</license-p></license></permissions>
<abstract>
<p>Clustering is an essential technique for discovering patterns in data. Many clustering algorithms have been developed to handle the ever-increasing quantity and complexity of data, yet few can cluster data with mixed-type variables (continuous and categorical), despite the abundance of such data. Of the existing clustering methods for mixed-type data, some posit unverifiable distributional assumptions and others rely on unbalanced contributions from the different variable types. To address these issues, we propose a two-step hybrid density- and partition-based (HyDaP) algorithm that detects clusters after variable selection. The first step uses both density-based and partition-based algorithms to identify the data structure formed by the continuous variables and to determine which variables (both continuous and categorical) are important for clustering. The second step uses a partition-based algorithm together with our proposed dissimilarity measure to obtain the clustering results. Simulations across various scenarios were conducted to compare the HyDaP algorithm with other commonly used methods. The HyDaP algorithm was also applied to identify sepsis phenotypes, where it yielded important results.</p>
</abstract>
<kwd-group>
<label>Keywords</label>
<kwd>mixed data</kwd>
<kwd>variable selection</kwd>
</kwd-group>
</article-meta>
</front>
<back>
<ref-list id="j_jds996_reflist_001">
<title>References</title>
<ref id="j_jds996_ref_001">
<mixed-citation publication-type="journal"> <string-name><surname>Angus</surname> <given-names>DC</given-names></string-name>, <string-name><surname>Van der Poll</surname> <given-names>T</given-names></string-name> (<year>2013</year>). <article-title>Severe sepsis and septic shock</article-title>. <source>New England Journal of Medicine</source>, <volume>369</volume>: <fpage>840</fpage>–<lpage>851</lpage>.</mixed-citation>
</ref>
<ref id="j_jds996_ref_002">
<mixed-citation publication-type="chapter"> <string-name><surname>Ankerst</surname> <given-names>M</given-names></string-name>, <string-name><surname>Breunig</surname> <given-names>MM</given-names></string-name>, <string-name><surname>Kriegel</surname> <given-names>HP</given-names></string-name>, <string-name><surname>Sander</surname> <given-names>J</given-names></string-name> (<year>1999</year>). <chapter-title>OPTICS: Ordering points to identify the clustering structure</chapter-title>. In: <source>ACM Sigmod Record</source>, volume <volume>28</volume>, <fpage>49</fpage>–<lpage>60</lpage>. <publisher-name>ACM</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds996_ref_003">
<mixed-citation publication-type="chapter"> <string-name><surname>Ester</surname> <given-names>M</given-names></string-name>, <string-name><surname>Kriegel</surname> <given-names>HP</given-names></string-name>, <string-name><surname>Sander</surname> <given-names>J</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>X</given-names></string-name> (<year>1996</year>). <chapter-title>A density-based algorithm for discovering clusters in large spatial databases with noise</chapter-title>. In: <source>KDD</source>, volume <volume>96</volume>, <fpage>226</fpage>–<lpage>231</lpage>.</mixed-citation>
</ref>
<ref id="j_jds996_ref_004">
<mixed-citation publication-type="journal"> <string-name><surname>Gower</surname> <given-names>JC</given-names></string-name> (<year>1971</year>). <article-title>A general coefficient of similarity and some of its properties</article-title>. <source>Biometrics</source>, <fpage>857</fpage>–<lpage>871</lpage>.</mixed-citation>
</ref>
<ref id="j_jds996_ref_005">
<mixed-citation publication-type="book"> <string-name><surname>Han</surname> <given-names>J</given-names></string-name>, <string-name><surname>Pei</surname> <given-names>J</given-names></string-name>, <string-name><surname>Kamber</surname> <given-names>M</given-names></string-name> (<year>2011</year>). <source>Data Mining: Concepts and Techniques</source>. <publisher-name>Elsevier</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds996_ref_006">
<mixed-citation publication-type="chapter"> <string-name><surname>Haripriya</surname> <given-names>H</given-names></string-name>, <string-name><surname>Amrutha</surname> <given-names>S</given-names></string-name>, <string-name><surname>Veena</surname> <given-names>R</given-names></string-name>, <string-name><surname>Nedungadi</surname> <given-names>P</given-names></string-name> (<year>2015</year>). <chapter-title>Integrating apriori with paired K-Means for cluster fixed mixed data</chapter-title>. In: <source>Proceedings of the Third International Symposium on Women in Computing and Informatics</source>, <fpage>10</fpage>–<lpage>16</lpage>. <publisher-name>ACM</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds996_ref_007">
<mixed-citation publication-type="journal"> <string-name><surname>Hennig</surname> <given-names>C</given-names></string-name>, <string-name><surname>Liao</surname> <given-names>TF</given-names></string-name> (<year>2013</year>). <article-title>How to find an appropriate clustering for mixed-type variables with application to socio-economic stratification</article-title>. <source>Journal of the Royal Statistical Society: Series C (Applied Statistics)</source>, <volume>62</volume>(<issue>3</issue>): <fpage>309</fpage>–<lpage>369</lpage>.</mixed-citation>
</ref>
<ref id="j_jds996_ref_008">
<mixed-citation publication-type="journal"> <string-name><surname>Huang</surname> <given-names>Z</given-names></string-name> (<year>1998</year>). <article-title>Extensions to the K-Means algorithm for clustering large data sets with categorical values</article-title>. <source>Data Mining and Knowledge Discovery</source>, <volume>2</volume>(<issue>3</issue>): <fpage>283</fpage>–<lpage>304</lpage>.</mixed-citation>
</ref>
<ref id="j_jds996_ref_009">
<mixed-citation publication-type="journal"> <string-name><surname>Hubert</surname> <given-names>L</given-names></string-name>, <string-name><surname>Arabie</surname> <given-names>P</given-names></string-name> (<year>1985</year>). <article-title>Comparing partitions</article-title>. <source>Journal of Classification</source>, <volume>2</volume>(<issue>1</issue>): <fpage>193</fpage>–<lpage>218</lpage>.</mixed-citation>
</ref>
<ref id="j_jds996_ref_010">
<mixed-citation publication-type="journal"> <string-name><surname>Jensen</surname> <given-names>PB</given-names></string-name>, <string-name><surname>Jensen</surname> <given-names>LJ</given-names></string-name>, <string-name><surname>Brunak</surname> <given-names>S</given-names></string-name> (<year>2012</year>). <article-title>Mining electronic health records: Towards better research applications and clinical care</article-title>. <source>Nature Reviews Genetics</source>, <volume>13</volume>(<issue>6</issue>): <fpage>395</fpage>–<lpage>405</lpage>.</mixed-citation>
</ref>
<ref id="j_jds996_ref_011">
<mixed-citation publication-type="book"> <string-name><surname>Kaufman</surname> <given-names>L</given-names></string-name>, <string-name><surname>Rousseeuw</surname> <given-names>PJ</given-names></string-name> (<year>2009</year>). <source>Finding Groups in Data: An Introduction to Cluster Analysis</source>, volume <volume>344</volume>. <publisher-name>John Wiley &amp; Sons</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds996_ref_012">
<mixed-citation publication-type="journal"> <string-name><surname>Liu</surname> <given-names>V</given-names></string-name>, <string-name><surname>Escobar</surname> <given-names>GJ</given-names></string-name>, <string-name><surname>Greene</surname> <given-names>JD</given-names></string-name>, <string-name><surname>Soule</surname> <given-names>J</given-names></string-name>, <string-name><surname>Whippy</surname> <given-names>A</given-names></string-name>, <string-name><surname>Angus</surname> <given-names>DC</given-names></string-name>, <etal>et  al.</etal> (<year>2014</year>). <article-title>Hospital deaths in patients with sepsis from 2 independent cohorts</article-title>. <source>Journal of the American Medical Association</source>, <volume>312</volume>(<issue>1</issue>): <fpage>90</fpage>–<lpage>92</lpage>.</mixed-citation>
</ref>
<ref id="j_jds996_ref_013">
<mixed-citation publication-type="chapter"> <string-name><surname>MacQueen</surname> <given-names>J</given-names></string-name> (<year>1967</year>). <chapter-title>Some methods for classification and analysis of multivariate observations</chapter-title>. In: <source>Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability</source>, volume <volume>1</volume>, <fpage>281</fpage>–<lpage>297</lpage>. <publisher-loc>Oakland, CA, USA</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_jds996_ref_014">
<mixed-citation publication-type="book"> <string-name><surname>McCutcheon</surname> <given-names>AL</given-names></string-name> (<year>1987</year>). <source>Latent Class Analysis</source>, volume <volume>64</volume>. <publisher-name>Sage</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds996_ref_015">
<mixed-citation publication-type="journal"> <string-name><surname>Monti</surname> <given-names>S</given-names></string-name>, <string-name><surname>Tamayo</surname> <given-names>P</given-names></string-name>, <string-name><surname>Mesirov</surname> <given-names>J</given-names></string-name>, <string-name><surname>Golub</surname> <given-names>T</given-names></string-name> (<year>2003</year>). <article-title>Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data</article-title>. <source>Machine Learning</source>, <volume>52</volume>(<issue>1–2</issue>): <fpage>91</fpage>–<lpage>118</lpage>.</mixed-citation>
</ref>
<ref id="j_jds996_ref_016">
<mixed-citation publication-type="journal"> <string-name><surname>Moustaki</surname> <given-names>I</given-names></string-name> (<year>1996</year>). <article-title>A latent trait and a latent class model for mixed observed variables</article-title>. <source>British Journal of Mathematical and Statistical Psychology</source>, <volume>49</volume>(<issue>2</issue>): <fpage>313</fpage>–<lpage>334</lpage>.</mixed-citation>
</ref>
<ref id="j_jds996_ref_017">
<mixed-citation publication-type="book"> <string-name><surname>Pagès</surname> <given-names>J</given-names></string-name> (<year>2014</year>). <source>Multiple Factor Analysis by Example Using R</source>. <publisher-name>CRC Press</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds996_ref_018">
<mixed-citation publication-type="journal"> <string-name><surname>Rand</surname> <given-names>WM</given-names></string-name> (<year>1971</year>). <article-title>Objective criteria for the evaluation of clustering methods</article-title>. <source>Journal of the American Statistical Association</source>, <volume>66</volume>(<issue>336</issue>): <fpage>846</fpage>–<lpage>850</lpage>.</mixed-citation>
</ref>
<ref id="j_jds996_ref_019">
<mixed-citation publication-type="journal"> <string-name><surname>Reddy</surname> <given-names>MJ</given-names></string-name>, <string-name><surname>Kavitha</surname> <given-names>B</given-names></string-name> (<year>2012</year>). <article-title>Clustering the mixed numerical and categorical dataset using similarity weight and filter method</article-title>. <source>International Journal of Database Theory and Application</source>, <volume>5</volume>(<issue>1</issue>): <fpage>121</fpage>–<lpage>134</lpage>.</mixed-citation>
</ref>
<ref id="j_jds996_ref_020">
<mixed-citation publication-type="journal"> <string-name><surname>Scicluna</surname> <given-names>BP</given-names></string-name>, <string-name><surname>Van Vught</surname> <given-names>LA</given-names></string-name>, <string-name><surname>Zwinderman</surname> <given-names>AH</given-names></string-name>, <string-name><surname>Wiewel</surname> <given-names>MA</given-names></string-name>, <string-name><surname>Davenport</surname> <given-names>EE</given-names></string-name>, <string-name><surname>Burnham</surname> <given-names>KL</given-names></string-name>, <etal>et  al.</etal> (<year>2017</year>). <article-title>Classification of patients with sepsis according to blood genomic endotype: A prospective cohort study</article-title>. <source>The Lancet Respiratory Medicine</source>, <volume>5</volume>(<issue>10</issue>): <fpage>816</fpage>–<lpage>826</lpage>.</mixed-citation>
</ref>
<ref id="j_jds996_ref_021">
<mixed-citation publication-type="journal"> <string-name><surname>Seymour</surname> <given-names>CW</given-names></string-name>, <string-name><surname>Kennedy</surname> <given-names>JN</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Chang</surname> <given-names>CCH</given-names></string-name>, <string-name><surname>Elliott</surname> <given-names>CF</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>Z</given-names></string-name>, <etal>et  al.</etal> (<year>2019</year>). <article-title>Derivation, validation, and potential treatment implications of novel clinical phenotypes for sepsis</article-title>. <source>Journal of the American Medical Association</source>, <volume>321</volume>(<issue>20</issue>): <fpage>2003</fpage>–<lpage>2017</lpage>.</mixed-citation>
</ref>
<ref id="j_jds996_ref_022">
<mixed-citation publication-type="journal"> <string-name><surname>Seymour</surname> <given-names>CW</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>VX</given-names></string-name>, <string-name><surname>Iwashyna</surname> <given-names>TJ</given-names></string-name>, <string-name><surname>Brunkhorst</surname> <given-names>FM</given-names></string-name>, <string-name><surname>Rea</surname> <given-names>TD</given-names></string-name>, <string-name><surname>Scherag</surname> <given-names>A</given-names></string-name>, <etal>et  al.</etal> (<year>2016</year>). <article-title>Assessment of clinical criteria for sepsis: For the third international consensus definitions for sepsis and septic shock (sepsis-3)</article-title>. <source>Journal of the American Medical Association</source>, <volume>315</volume>(<issue>8</issue>): <fpage>762</fpage>–<lpage>774</lpage>.</mixed-citation>
</ref>
<ref id="j_jds996_ref_023">
<mixed-citation publication-type="journal"> <string-name><surname>Shirkhorshidi</surname> <given-names>AS</given-names></string-name>, <string-name><surname>Aghabozorgi</surname> <given-names>S</given-names></string-name>, <string-name><surname>Wah</surname> <given-names>TY</given-names></string-name> (<year>2015</year>). <article-title>A comparison study on similarity and dissimilarity measures in clustering continuous data</article-title>. <source>PloS One</source>, <volume>10</volume>(<issue>12</issue>): <fpage>e0144059</fpage>.</mixed-citation>
</ref>
<ref id="j_jds996_ref_024">
<mixed-citation publication-type="journal"> <string-name><surname>Ward Jr</surname> <given-names>JH</given-names></string-name> (<year>1963</year>). <article-title>Hierarchical grouping to optimize an objective function</article-title>. <source>Journal of the American Statistical Association</source>, <volume>58</volume>(<issue>301</issue>): <fpage>236</fpage>–<lpage>244</lpage>.</mixed-citation>
</ref>
<ref id="j_jds996_ref_025">
<mixed-citation publication-type="journal"> <string-name><surname>Wilkerson</surname> <given-names>MD</given-names></string-name>, <string-name><surname>Hayes</surname> <given-names>DN</given-names></string-name> (<year>2010</year>). <article-title>ConsensusClusterPlus: A class discovery tool with confidence assessments and item tracking</article-title>. <source>Bioinformatics</source>, <volume>26</volume>(<issue>12</issue>): <fpage>1572</fpage>–<lpage>1573</lpage>.</mixed-citation>
</ref>
<ref id="j_jds996_ref_026">
<mixed-citation publication-type="journal"> <string-name><surname>Witten</surname> <given-names>DM</given-names></string-name>, <string-name><surname>Tibshirani</surname> <given-names>R</given-names></string-name> (<year>2010</year>). <article-title>A framework for feature selection in clustering</article-title>. <source>Journal of the American Statistical Association</source>, <volume>105</volume>(<issue>490</issue>): <fpage>713</fpage>–<lpage>726</lpage>.</mixed-citation>
</ref>
<ref id="j_jds996_ref_027">
<mixed-citation publication-type="journal"> <string-name><surname>Xu</surname> <given-names>R</given-names></string-name>, <string-name><surname>Wunsch</surname> <given-names>D</given-names></string-name> (<year>2005</year>). <article-title>Survey of clustering algorithms</article-title>. <source>IEEE Transactions on Neural Networks</source>, <volume>16</volume>(<issue>3</issue>): <fpage>645</fpage>–<lpage>678</lpage>.</mixed-citation>
</ref>
</ref-list>
</back>
</article>
