Assessment of Projection Pursuit Index for Classifying High Dimension Low Sample Size Data in R
Volume 21, Issue 2 (2023): Special Issue: Symposium Data Science and Statistics 2022, pp. 310–332
Pub. online: 2 March 2023
Type: Computing In Data Science
Open Access
Received
7 November 2022
Accepted
27 February 2023
Published
2 March 2023
Abstract
Analyzing “large p small n” data is becoming increasingly important across a wide range of application fields. The Penalized Discriminant Analysis ($\mathrm{PDA}$) index, a projection pursuit index built upon the Linear Discriminant Analysis ($\mathrm{LDA}$) index, was devised by Lee and Cook (2010) to classify high-dimensional data, with promising results. Yet little is known about its performance relative to the popular Support Vector Machine ($\mathrm{SVM}$). This paper conducts extensive numerical studies comparing the $\mathrm{PDA}$ index with the $\mathrm{LDA}$ index and $\mathrm{SVM}$, demonstrating that the $\mathrm{PDA}$ index is robust to outliers and can handle high-dimensional datasets with extremely small sample sizes, few important variables, and multiple classes. Analyses of several motivating real-world datasets reveal the practical advantages and limitations of each method, suggesting that the $\mathrm{PDA}$ index is a useful alternative tool for classifying complex high-dimensional data. These new insights, along with the hands-on implementation of the $\mathrm{PDA}$ index functions in the R package classPP, help statisticians and data scientists make effective use of both sets of classification tools.
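To make the relationship between the two indices concrete, the following is a minimal sketch for a one-dimensional projection. It assumes the LDA index takes the form 1 - a'Wa / a'(W+B)a (Lee et al., 2005) and that the PDA index adds a ridge-type penalty n*lambda*I to standardized within-class and total sums of squares (Lee and Cook, 2010). The function names and toy data are hypothetical and are not the classPP API; the package itself is written in R.

```python
# Sketch only: 1-D LDA and PDA projection pursuit indices under the
# assumed forms stated above. Names and data are illustrative.

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def _mean(values):
    return sum(values) / len(values)

def standardize(X):
    """Center each column and scale it to unit sample standard deviation."""
    n, p = len(X), len(X[0])
    cols = [[row[j] for row in X] for j in range(p)]
    means = [_mean(c) for c in cols]
    sds = [(sum((v - m) ** 2 for v in c) / (n - 1)) ** 0.5
           for c, m in zip(cols, means)]
    return [[(row[j] - means[j]) / sds[j] for j in range(p)] for row in X]

def projected_sums_of_squares(X, labels, a):
    """Return (within, between) sums of squares of the projection X a.

    For a 1-D projection these equal a'Wa and a'Ba respectively.
    """
    z = [_dot(row, a) for row in X]
    grand = _mean(z)
    within = between = 0.0
    for g in set(labels):
        zg = [zi for zi, lab in zip(z, labels) if lab == g]
        mg = _mean(zg)
        within += sum((zi - mg) ** 2 for zi in zg)
        between += len(zg) * (mg - grand) ** 2
    return within, between

def lda_index(X, labels, a):
    """1 - a'Wa / a'(W+B)a: close to 1 when classes separate along a."""
    w, b = projected_sums_of_squares(X, labels, a)
    return 1.0 - w / (w + b)

def pda_index(X, labels, a, lam):
    """Penalized version on standardized data; lam in [0, 1]."""
    w, b = projected_sums_of_squares(standardize(X), labels, a)
    ridge = len(X) * lam * _dot(a, a)  # a'(n * lam * I)a
    return 1.0 - ((1 - lam) * w + ridge) / ((1 - lam) * (w + b) + ridge)

# Toy data: two well-separated classes in two variables.
X = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.2], [4.0, 5.0], [4.2, 4.8], [3.9, 5.3]]
labels = ["a", "a", "a", "b", "b", "b"]
a = [1.0, 0.0]  # project onto the first variable
```

Under these assumptions, setting `lam = 0` recovers the LDA index on standardized data, while `lam = 1` drives the index to zero; intermediate values shrink the index toward zero, which is what stabilizes it when p greatly exceeds n.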
Supplementary material
All of our code is open source and available in the GitHub repository https://github.com/zwu363/projection-pursuit-index
References
Burczynski ME, Peterson RL, Twine NC, Zuberek KA, Brodeur BJ, Casciotti L, et al. (2006). Molecular classification of Crohn’s disease and ulcerative colitis patients using transcriptional profiles in peripheral blood mononuclear cells. The Journal of Molecular Diagnostics, 8(1): 51–61. https://doi.org/10.2353/jmoldx.2006.050079
Friedman J, Tukey J (1974). A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers, C-23(9): 881–890. https://doi.org/10.1109/T-C.1974.224051
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, et al. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286(5439): 531–537. https://doi.org/10.1126/science.286.5439.531
Hastie TJ, Tibshirani R, Buja A (1994). Flexible discriminant analysis by optimal scoring. Journal of the American Statistical Association, 89: 1255–1270. https://doi.org/10.1080/01621459.1994.10476866
Lee EK, Cook D (2010). A projection pursuit index for large p small n data. Statistics and Computing, 20(3): 381–392. https://doi.org/10.1007/s11222-009-9131-1
Lee EK, Cook D, Klinke S, Lumley T (2005). Projection pursuit for exploratory supervised classification. Journal of Computational and Graphical Statistics, 14(4): 831–846. https://doi.org/10.1198/106186005X77702
Marron JS (2015). Distance weighted discrimination. Wiley Interdisciplinary Reviews: Computational Statistics, 7: 109–114. https://doi.org/10.1002/wics.1345
Nakayama R, Nemoto T, Takahashi H, Ohta T, Kawai A, Seki K, et al. (2007). Gene expression analysis of soft tissue sarcomas: Characterization and reclassification of malignant fibrous histiocytoma. Modern Pathology, 20(7): 749–759. https://doi.org/10.1038/448749b
Pomeroy SL, Tamayo P, Gaasenbeek M, Sturla LM, Angelo M, McLaughlin ME, et al. (2002). Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415(6870): 436–442. https://doi.org/10.1038/415436a
Singh D, Febbo PG, Ross K, Jackson DG, Manola J, Ladd C, et al. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1(2): 203–209. https://doi.org/10.1016/S1535-6108(02)00030-2
Sørlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, et al. (2001). Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proceedings of the National Academy of Sciences of the United States of America, 98: 10869–10874. https://doi.org/10.1073/pnas.191367098
Yeoh EJ, Ross ME, Shurtleff SA, Williams WK, Patel D, Mahfouz R, et al. (2002). Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell, 1(2): 133–143. https://doi.org/10.1016/S1535-6108(02)00032-6
Zhang C, Ye J, Wang X (2022). A computational perspective on projection pursuit in high dimensions: Feasible or infeasible feature extraction. International Statistical Review. https://doi.org/10.1111/insr.12517