<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">JDS</journal-id>
<journal-title-group><journal-title>Journal of Data Science</journal-title></journal-title-group>
<issn pub-type="epub">1683-8602</issn><issn pub-type="ppub">1680-743X</issn><issn-l>1680-743X</issn-l>
<publisher>
<publisher-name>School of Statistics, Renmin University of China</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">JDS1096</article-id>
<article-id pub-id-type="doi">10.6339/23-JDS1096</article-id>
<article-categories><subj-group subj-group-type="heading">
<subject>Computing in Data Science</subject></subj-group></article-categories>
<title-group>
<article-title>Assessment of Projection Pursuit Index for Classifying High Dimension Low Sample Size Data in R</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Wu</surname><given-names>Zhaoxing</given-names></name><xref ref-type="aff" rid="j_jds1096_aff_001">1</xref>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-3153-2662</contrib-id>
<name><surname>Zhang</surname><given-names>Chunming</given-names></name><email xlink:href="mailto:cmzhang@stat.wisc.edu">cmzhang@stat.wisc.edu</email><xref ref-type="aff" rid="j_jds1096_aff_001">1</xref><xref ref-type="corresp" rid="cor1">∗</xref>
</contrib>
<aff id="j_jds1096_aff_001"><label>1</label><institution>University of Wisconsin-Madison</institution>, Department of Statistics, <country>U.S.A.</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>∗</label>Corresponding author. Email: <ext-link ext-link-type="uri" xlink:href="mailto:cmzhang@stat.wisc.edu">cmzhang@stat.wisc.edu</ext-link>.</corresp>
</author-notes>
<pub-date pub-type="ppub"><year>2023</year></pub-date><pub-date pub-type="epub"><day>2</day><month>3</month><year>2023</year></pub-date><volume>21</volume><issue>2</issue><fpage>310</fpage><lpage>332</lpage><supplementary-material id="S1" content-type="archive" xlink:href="jds1096_s001.zip" mimetype="application" mime-subtype="x-zip-compressed">
<caption>
<title>Supplementary Material</title>
<p>All of our code is openly available in the following GitHub repository: <uri>https://github.com/zwu363/projection-pursuit-index</uri>.</p>
</caption>
</supplementary-material><history><date date-type="received"><day>7</day><month>11</month><year>2022</year></date><date date-type="accepted"><day>27</day><month>2</month><year>2023</year></date></history>
<permissions><copyright-statement>2023 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.</copyright-statement><copyright-year>2023</copyright-year>
<license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>Open access article under the <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">CC BY</ext-link> license.</license-p></license></permissions>
<abstract>
<p>Analyzing “large <italic>p</italic> small <italic>n</italic>” data is becoming increasingly important in a wide range of application fields. As a projection pursuit index, the Penalized Discriminant Analysis (<inline-formula id="j_jds1096_ineq_001"><alternatives><mml:math>
<mml:mi mathvariant="normal">PDA</mml:mi></mml:math><tex-math><![CDATA[$\mathrm{PDA}$]]></tex-math></alternatives></inline-formula>) index, built upon the Linear Discriminant Analysis (<inline-formula id="j_jds1096_ineq_002"><alternatives><mml:math>
<mml:mi mathvariant="normal">LDA</mml:mi></mml:math><tex-math><![CDATA[$\mathrm{LDA}$]]></tex-math></alternatives></inline-formula>) index, was devised by <xref ref-type="bibr" rid="j_jds1096_ref_009">Lee and Cook</xref> (<xref ref-type="bibr" rid="j_jds1096_ref_009">2010</xref>) to classify high-dimensional data, with promising results. Yet little is known about how its performance compares with that of the popular Support Vector Machine (<inline-formula id="j_jds1096_ineq_003"><alternatives><mml:math>
<mml:mi mathvariant="normal">SVM</mml:mi></mml:math><tex-math><![CDATA[$\mathrm{SVM}$]]></tex-math></alternatives></inline-formula>). This paper conducts extensive numerical studies to compare the performance of the <inline-formula id="j_jds1096_ineq_004"><alternatives><mml:math>
<mml:mi mathvariant="normal">PDA</mml:mi></mml:math><tex-math><![CDATA[$\mathrm{PDA}$]]></tex-math></alternatives></inline-formula> index with the <inline-formula id="j_jds1096_ineq_005"><alternatives><mml:math>
<mml:mi mathvariant="normal">LDA</mml:mi></mml:math><tex-math><![CDATA[$\mathrm{LDA}$]]></tex-math></alternatives></inline-formula> index and <inline-formula id="j_jds1096_ineq_006"><alternatives><mml:math>
<mml:mi mathvariant="normal">SVM</mml:mi></mml:math><tex-math><![CDATA[$\mathrm{SVM}$]]></tex-math></alternatives></inline-formula>, demonstrating that the <inline-formula id="j_jds1096_ineq_007"><alternatives><mml:math>
<mml:mi mathvariant="normal">PDA</mml:mi></mml:math><tex-math><![CDATA[$\mathrm{PDA}$]]></tex-math></alternatives></inline-formula> index is robust to outliers and able to handle high-dimensional datasets with extremely small sample sizes, few important variables, and multiple classes. Analyses of several motivating real-world datasets reveal the practical advantages and limitations of individual methods, suggesting that the <inline-formula id="j_jds1096_ineq_008"><alternatives><mml:math>
<mml:mi mathvariant="normal">PDA</mml:mi></mml:math><tex-math><![CDATA[$\mathrm{PDA}$]]></tex-math></alternatives></inline-formula> index provides a useful alternative tool for classifying complex high-dimensional data. These new insights, along with the hands-on implementation of the <inline-formula id="j_jds1096_ineq_009"><alternatives><mml:math>
<mml:mi mathvariant="normal">PDA</mml:mi></mml:math><tex-math><![CDATA[$\mathrm{PDA}$]]></tex-math></alternatives></inline-formula> index functions in the <sans-serif>R</sans-serif> package <italic>classPP</italic>, help statisticians and data scientists make effective use of both sets of classification tools.</p>
</abstract>
<kwd-group>
<label>Keywords</label>
<kwd>large <italic>p</italic> small <italic>n</italic></kwd>
<kwd>linear discriminant analysis</kwd>
<kwd>penalized discriminant analysis</kwd>
<kwd>supervised classification</kwd>
<kwd>SVM</kwd>
</kwd-group>
<funding-group><award-group><funding-source xlink:href="https://doi.org/10.13039/100000001">U.S. National Science Foundation</funding-source><award-id>DMS-2013486</award-id><award-id>DMS-1712418</award-id></award-group><award-group><funding-source xlink:href="https://doi.org/10.13039/100012787">University of Wisconsin-Madison Office of the Vice Chancellor for Research and Graduate Education</funding-source></award-group><award-group><funding-source xlink:href="https://doi.org/10.13039/100001395">Wisconsin Alumni Research Foundation</funding-source></award-group><funding-statement>C. Zhang’s work was partially supported by U.S. National Science Foundation grants DMS-2013486 and DMS-1712418, and by support provided by the University of Wisconsin-Madison Office of the Vice Chancellor for Research and Graduate Education with funding from the Wisconsin Alumni Research Foundation.</funding-statement></funding-group>
</article-meta>
</front>
<back>
<ref-list id="j_jds1096_reflist_001">
<title>References</title>
<ref id="j_jds1096_ref_001">
<mixed-citation publication-type="journal"> <string-name><surname>Burczynski</surname> <given-names>ME</given-names></string-name>, <string-name><surname>Peterson</surname> <given-names>RL</given-names></string-name>, <string-name><surname>Twine</surname> <given-names>NC</given-names></string-name>, <string-name><surname>Zuberek</surname> <given-names>KA</given-names></string-name>, <string-name><surname>Brodeur</surname> <given-names>BJ</given-names></string-name>, <string-name><surname>Casciotti</surname> <given-names>L</given-names></string-name>, <etal>et al.</etal> (<year>2006</year>). <article-title>Molecular classification of Crohn’s disease and ulcerative colitis patients using transcriptional profiles in peripheral blood mononuclear cells</article-title>. <source><italic>The Journal of Molecular Diagnostics</italic></source>, <volume>8</volume>(<issue>1</issue>): <fpage>51</fpage>–<lpage>61</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.2353/jmoldx.2006.050079" xlink:type="simple">https://doi.org/10.2353/jmoldx.2006.050079</ext-link></mixed-citation>
</ref>
<ref id="j_jds1096_ref_002">
<mixed-citation publication-type="journal"> <string-name><surname>Cortes</surname> <given-names>C</given-names></string-name>, <string-name><surname>Vapnik</surname> <given-names>V</given-names></string-name> (<year>1995</year>). <article-title>Support-vector networks</article-title>. <source><italic>Machine Learning</italic></source>, <volume>20</volume>: <fpage>273</fpage>–<lpage>297</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1096_ref_003">
<mixed-citation publication-type="journal"> <string-name><surname>Friedman</surname> <given-names>J</given-names></string-name>, <string-name><surname>Tukey</surname> <given-names>J</given-names></string-name> (<year>1974</year>). <article-title>A projection pursuit algorithm for exploratory data analysis</article-title>. <source><italic>IEEE Transactions on Computers</italic></source>, <volume>C-23</volume>(<issue>9</issue>): <fpage>881</fpage>–<lpage>890</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1109/T-C.1974.224051" xlink:type="simple">https://doi.org/10.1109/T-C.1974.224051</ext-link></mixed-citation>
</ref>
<ref id="j_jds1096_ref_004">
<mixed-citation publication-type="chapter"> <string-name><surname>Gaudette</surname> <given-names>L</given-names></string-name>, <string-name><surname>Japkowicz</surname> <given-names>N</given-names></string-name> (<year>2009</year>). <chapter-title>Evaluation methods for ordinal classification</chapter-title>. In: <source><italic>Advances in Artificial Intelligence</italic></source> (<string-name><given-names>Y</given-names> <surname>Gao</surname></string-name>, <string-name><given-names>N</given-names> <surname>Japkowicz</surname></string-name>, eds.), <fpage>207</fpage>–<lpage>210</lpage>. <publisher-name>Springer Berlin Heidelberg</publisher-name>, <publisher-loc>Berlin, Heidelberg</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_jds1096_ref_005">
<mixed-citation publication-type="journal"> <string-name><surname>Golub</surname> <given-names>TR</given-names></string-name>, <string-name><surname>Slonim</surname> <given-names>DK</given-names></string-name>, <string-name><surname>Tamayo</surname> <given-names>P</given-names></string-name>, <string-name><surname>Huard</surname> <given-names>C</given-names></string-name>, <string-name><surname>Gaasenbeek</surname> <given-names>M</given-names></string-name>, <string-name><surname>Mesirov</surname> <given-names>JP</given-names></string-name>, <etal>et al.</etal> (<year>1999</year>). <article-title>Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring</article-title>. <source><italic>Science</italic></source>, <volume>286</volume>(<issue>5439</issue>): <fpage>531</fpage>–<lpage>537</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1126/science.286.5439.531" xlink:type="simple">https://doi.org/10.1126/science.286.5439.531</ext-link></mixed-citation>
</ref>
<ref id="j_jds1096_ref_006">
<mixed-citation publication-type="journal"> <string-name><surname>Gordon</surname> <given-names>GJ</given-names></string-name>, <string-name><surname>Jensen</surname> <given-names>RV</given-names></string-name>, <string-name><surname>Hsiao</surname> <given-names>LL</given-names></string-name>, <string-name><surname>Gullans</surname> <given-names>SR</given-names></string-name>, <string-name><surname>Blumenstock</surname> <given-names>JE</given-names></string-name>, <string-name><surname>Ramaswamy</surname> <given-names>S</given-names></string-name>, <etal>et al.</etal> (<year>2002</year>). <article-title>Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma</article-title>. <source><italic>Cancer Research</italic></source>, <volume>62</volume>(<issue>17</issue>): <fpage>4963</fpage>–<lpage>4967</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1096_ref_007">
<mixed-citation publication-type="journal"> <string-name><surname>Hastie</surname> <given-names>TJ</given-names></string-name>, <string-name><surname>Tibshirani</surname> <given-names>R</given-names></string-name>, <string-name><surname>Buja</surname> <given-names>A</given-names></string-name> (<year>1994</year>). <article-title>Flexible discriminant analysis by optimal scoring</article-title>. <source><italic>Journal of the American Statistical Association</italic></source>, <volume>89</volume>: <fpage>1255</fpage>–<lpage>1270</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1080/01621459.1994.10476866" xlink:type="simple">https://doi.org/10.1080/01621459.1994.10476866</ext-link></mixed-citation>
</ref>
<ref id="j_jds1096_ref_008">
<mixed-citation publication-type="chapter"> <string-name><surname>Kruskal</surname> <given-names>JB</given-names></string-name> (<year>1969</year>). <chapter-title>Toward a practical method which helps uncover the structure of a set of multivariate observations by finding the linear transformation which optimizes a new “index of condensation”</chapter-title>. In: <source><italic>Statistical Computation</italic></source> (<string-name><given-names>RC</given-names> <surname>Milton</surname></string-name>, <string-name><given-names>JA</given-names> <surname>Nelder</surname></string-name>, eds.), <fpage>427</fpage>–<lpage>440</lpage>. <publisher-name>Academic Press</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1096_ref_009">
<mixed-citation publication-type="journal"> <string-name><surname>Lee</surname> <given-names>EK</given-names></string-name>, <string-name><surname>Cook</surname> <given-names>D</given-names></string-name> (<year>2010</year>). <article-title>A projection pursuit index for large p small n data</article-title>. <source><italic>Statistics and Computing</italic></source>, <volume>20</volume>(<issue>3</issue>): <fpage>381</fpage>–<lpage>392</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1007/s11222-009-9131-1" xlink:type="simple">https://doi.org/10.1007/s11222-009-9131-1</ext-link></mixed-citation>
</ref>
<ref id="j_jds1096_ref_010">
<mixed-citation publication-type="journal"> <string-name><surname>Lee</surname> <given-names>EK</given-names></string-name>, <string-name><surname>Cook</surname> <given-names>D</given-names></string-name>, <string-name><surname>Klinke</surname> <given-names>S</given-names></string-name>, <string-name><surname>Lumley</surname> <given-names>T</given-names></string-name> (<year>2005</year>). <article-title>Projection pursuit for exploratory supervised classification</article-title>. <source><italic>Journal of Computational and Graphical Statistics</italic></source>, <volume>14</volume>(<issue>4</issue>): <fpage>831</fpage>–<lpage>846</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1198/106186005X77702" xlink:type="simple">https://doi.org/10.1198/106186005X77702</ext-link></mixed-citation>
</ref>
<ref id="j_jds1096_ref_011">
<mixed-citation publication-type="journal"> <string-name><surname>Marron</surname> <given-names>JS</given-names></string-name> (<year>2015</year>). <article-title>Distance weighted discrimination</article-title>. <source><italic>Wiley Interdisciplinary Reviews: Computational Statistics</italic></source>, <volume>7</volume>: <fpage>109</fpage>–<lpage>114</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1002/wics.1345" xlink:type="simple">https://doi.org/10.1002/wics.1345</ext-link></mixed-citation>
</ref>
<ref id="j_jds1096_ref_012">
<mixed-citation publication-type="journal"> <string-name><surname>Nakayama</surname> <given-names>R</given-names></string-name>, <string-name><surname>Nemoto</surname> <given-names>T</given-names></string-name>, <string-name><surname>Takahashi</surname> <given-names>H</given-names></string-name>, <string-name><surname>Ohta</surname> <given-names>T</given-names></string-name>, <string-name><surname>Kawai</surname> <given-names>A</given-names></string-name>, <string-name><surname>Seki</surname> <given-names>K</given-names></string-name>, <etal>et al.</etal> (<year>2007</year>). <article-title>Gene expression analysis of soft tissue sarcomas: Characterization and reclassification of malignant fibrous histiocytoma</article-title>. <source><italic>Modern Pathology</italic></source>, <volume>20</volume>(<issue>7</issue>): <fpage>749</fpage>–<lpage>759</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1038/modpathol.3800794" xlink:type="simple">https://doi.org/10.1038/modpathol.3800794</ext-link></mixed-citation>
</ref>
<ref id="j_jds1096_ref_013">
<mixed-citation publication-type="journal"> <string-name><surname>Pomeroy</surname> <given-names>SL</given-names></string-name>, <string-name><surname>Tamayo</surname> <given-names>P</given-names></string-name>, <string-name><surname>Gaasenbeek</surname> <given-names>M</given-names></string-name>, <string-name><surname>Sturla</surname> <given-names>LM</given-names></string-name>, <string-name><surname>Angelo</surname> <given-names>M</given-names></string-name>, <string-name><surname>McLaughlin</surname> <given-names>ME</given-names></string-name>, <etal>et al.</etal> (<year>2002</year>). <article-title>Prediction of central nervous system embryonal tumour outcome based on gene expression</article-title>. <source><italic>Nature</italic></source>, <volume>415</volume>(<issue>6870</issue>): <fpage>436</fpage>–<lpage>442</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1038/415436a" xlink:type="simple">https://doi.org/10.1038/415436a</ext-link></mixed-citation>
</ref>
<ref id="j_jds1096_ref_014">
<mixed-citation publication-type="journal"> <string-name><surname>Singh</surname> <given-names>D</given-names></string-name>, <string-name><surname>Febbo</surname> <given-names>PG</given-names></string-name>, <string-name><surname>Ross</surname> <given-names>K</given-names></string-name>, <string-name><surname>Jackson</surname> <given-names>DG</given-names></string-name>, <string-name><surname>Manola</surname> <given-names>J</given-names></string-name>, <string-name><surname>Ladd</surname> <given-names>C</given-names></string-name>, <etal>et al.</etal> (<year>2002</year>). <article-title>Gene expression correlates of clinical prostate cancer behavior</article-title>. <source><italic>Cancer Cell</italic></source>, <volume>1</volume>(<issue>2</issue>): <fpage>203</fpage>–<lpage>209</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/S1535-6108(02)00030-2" xlink:type="simple">https://doi.org/10.1016/S1535-6108(02)00030-2</ext-link></mixed-citation>
</ref>
<ref id="j_jds1096_ref_015">
<mixed-citation publication-type="journal"> <string-name><surname>Sørlie</surname> <given-names>T</given-names></string-name>, <string-name><surname>Perou</surname> <given-names>CM</given-names></string-name>, <string-name><surname>Tibshirani</surname> <given-names>R</given-names></string-name>, <string-name><surname>Aas</surname> <given-names>T</given-names></string-name>, <string-name><surname>Geisler</surname> <given-names>S</given-names></string-name>, <string-name><surname>Johnsen</surname> <given-names>H</given-names></string-name>, <etal>et al.</etal> (<year>2001</year>). <article-title>Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications</article-title>. <source><italic>Proceedings of the National Academy of Sciences of the United States of America</italic></source>, <volume>98</volume>: <fpage>10869</fpage>–<lpage>10874</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1073/pnas.191367098" xlink:type="simple">https://doi.org/10.1073/pnas.191367098</ext-link></mixed-citation>
</ref>
<ref id="j_jds1096_ref_016">
<mixed-citation publication-type="journal"> <string-name><surname>Yeoh</surname> <given-names>EJ</given-names></string-name>, <string-name><surname>Ross</surname> <given-names>ME</given-names></string-name>, <string-name><surname>Shurtleff</surname> <given-names>SA</given-names></string-name>, <string-name><surname>Williams</surname> <given-names>WK</given-names></string-name>, <string-name><surname>Patel</surname> <given-names>D</given-names></string-name>, <string-name><surname>Mahfouz</surname> <given-names>R</given-names></string-name>, <etal>et al.</etal> (<year>2002</year>). <article-title>Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling</article-title>. <source><italic>Cancer Cell</italic></source>, <volume>1</volume>(<issue>2</issue>): <fpage>133</fpage>–<lpage>143</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/S1535-6108(02)00032-6" xlink:type="simple">https://doi.org/10.1016/S1535-6108(02)00032-6</ext-link></mixed-citation>
</ref>
<ref id="j_jds1096_ref_017">
<mixed-citation publication-type="journal"> <string-name><surname>Zhang</surname> <given-names>C</given-names></string-name>, <string-name><surname>Ye</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>X</given-names></string-name> (<year>2022</year>). <article-title>A computational perspective on projection pursuit in high dimensions: feasible or infeasible feature extraction</article-title>. <source><italic>International Statistical Review</italic></source>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1111/insr.12517" xlink:type="simple">https://doi.org/10.1111/insr.12517</ext-link></mixed-citation>
</ref>
</ref-list>
</back>
</article>
