Journal of Data Science


Assessment of Projection Pursuit Index for Classifying High Dimension Low Sample Size Data in R
Zhaoxing Wu   Chunming Zhang  

https://doi.org/10.6339/23-JDS1096
Pub. online: 2 March 2023      Type: Computing In Data Science      Open Access

Received: 7 November 2022
Accepted: 27 February 2023
Published: 2 March 2023

Abstract

Analyzing “large p small n” data is becoming increasingly important in a wide range of application fields. As a projection pursuit index, the Penalized Discriminant Analysis ($\mathrm{PDA}$) index, built upon the Linear Discriminant Analysis ($\mathrm{LDA}$) index, was devised by Lee and Cook (2010) to classify high-dimensional data, with promising results. Yet little information is available about its performance relative to the popular Support Vector Machine ($\mathrm{SVM}$). This paper conducts extensive numerical studies to compare the performance of the $\mathrm{PDA}$ index with that of the $\mathrm{LDA}$ index and $\mathrm{SVM}$, demonstrating that the $\mathrm{PDA}$ index is robust to outliers and able to handle high-dimensional datasets with extremely small sample sizes, few important variables, and multiple classes. Analyses of several motivating real-world datasets reveal the practical advantages and limitations of the individual methods, suggesting that the $\mathrm{PDA}$ index provides a useful alternative tool for classifying complex high-dimensional data. These new insights, along with the hands-on implementation of the $\mathrm{PDA}$ index functions in the R package classPP, help statisticians and data scientists make effective use of both sets of classification tools.
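To make the relationship between the two indices concrete, the following is a minimal Python sketch, not the classPP implementation. It assumes the index form commonly attributed to Lee and Cook (2010), $1 - |A^{T}\{(1-\lambda)W + n\lambda I_p\}A| / |A^{T}\{(1-\lambda)(B+W) + n\lambda I_p\}A|$, where $B$ and $W$ are the between-class and within-class sums of squares, $A$ is a $p \times k$ projection matrix with orthonormal columns, and the data are assumed to be standardized beforehand. The function names `scatter_matrices` and `pda_index` are hypothetical and chosen only for illustration. Setting $\lambda = 0$ gives the $\mathrm{LDA}$ index.

```python
import numpy as np


def scatter_matrices(X, y):
    """Return the between-class (B) and within-class (W) sums of squares."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    p = X.shape[1]
    grand_mean = X.mean(axis=0)
    B = np.zeros((p, p))
    W = np.zeros((p, p))
    for label in np.unique(y):
        Xg = X[y == label]
        mean_g = Xg.mean(axis=0)
        d = (mean_g - grand_mean)[:, None]
        B += Xg.shape[0] * (d @ d.T)   # class size times squared mean offset
        C = Xg - mean_g
        W += C.T @ C                   # spread of points around their class mean
    return B, W


def pda_index(X, y, A, lam=0.0):
    """Assumed PDA index form; lam = 0 reduces to the LDA index.

    A is a (p, k) projection matrix, assumed to have orthonormal columns.
    """
    X = np.asarray(X, dtype=float)
    A = np.asarray(A, dtype=float)
    n, p = X.shape
    B, W = scatter_matrices(X, y)
    ridge = n * lam * np.eye(p)        # penalty term n * lambda * I_p
    num = np.linalg.det(A.T @ ((1.0 - lam) * W + ridge) @ A)
    den = np.linalg.det(A.T @ ((1.0 - lam) * (B + W) + ridge) @ A)
    return 1.0 - num / den


# Toy data: the two classes differ only in the first coordinate.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
y = np.array([0, 0, 1, 1])
a_separating = np.array([[1.0], [0.0]])
a_noise = np.array([[0.0], [1.0]])

print(pda_index(X, y, a_separating))           # LDA index along the separating axis
print(pda_index(X, y, a_noise))                # LDA index along the uninformative axis
print(pda_index(X, y, a_separating, lam=0.5))  # penalized version
```

Under these assumptions, the penalty adds the same ridge term to the numerator and the denominator, which pulls the index away from the trivial value of 1 that an unpenalized index can attain when the sample size is small relative to the dimension.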

Supplementary material

All of our code is open source and available in the following GitHub repository: https://github.com/zwu363/projection-pursuit-index.

References

 
Burczynski ME, Peterson RL, Twine NC, Zuberek KA, Brodeur BJ, Casciotti L, et al. (2006). Molecular classification of Crohn’s disease and ulcerative colitis patients using transcriptional profiles in peripheral blood mononuclear cells. The Journal of Molecular Diagnostics, 8(1): 51–61. https://doi.org/10.2353/jmoldx.2006.050079
 
Cortes C, Vapnik V (1995). Support-vector networks. Machine Learning, 20: 273–297.
 
Friedman J, Tukey J (1974). A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers, C-23(9): 881–890. https://doi.org/10.1109/T-C.1974.224051
 
Gaudette L, Japkowicz N (2009). Evaluation methods for ordinal classification. In: Advances in Artificial Intelligence (Y Gao, N Japkowicz, eds.), 207–210. Springer Berlin Heidelberg, Berlin, Heidelberg.
 
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, et al. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286(5439): 531–537. https://doi.org/10.1126/science.286.5439.531
 
Gordon GJ, Jensen RV, Hsiao LL, Gullans SR, Blumenstock JE, Ramaswamy S, et al. (2002). Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Research, 62(17): 4963–4967.
 
Hastie TJ, Tibshirani R, Buja A (1994). Flexible discriminant analysis by optimal scoring. Journal of the American Statistical Association, 89: 1255–1270. https://doi.org/10.1080/01621459.1994.10476866
 
Kruskal JB (1969). Toward a practical method which helps uncover the structure of a set of multivariate observations by finding the linear transformation which optimizes a new “index of condensation”. In: Statistical Computation (RC Milton, JA Nelder, eds.), 427–440. Academic Press.
 
Lee EK, Cook D (2010). A projection pursuit index for large p small n data. Statistics and Computing, 20(3): 381–392. https://doi.org/10.1007/s11222-009-9131-1
 
Lee EK, Cook D, Klinke S, Lumley T (2005). Projection pursuit for exploratory supervised classification. Journal of Computational and Graphical Statistics, 14(4): 831–846. https://doi.org/10.1198/106186005X77702
 
Marron JS (2015). Distance weighted discrimination. Wiley Interdisciplinary Reviews: Computational Statistics, 7: 109–114. https://doi.org/10.1002/wics.1345
 
Nakayama R, Nemoto T, Takahashi H, Ohta T, Kawai A, Seki K, et al. (2007). Gene expression analysis of soft tissue sarcomas: Characterization and reclassification of malignant fibrous histiocytoma. Nature, 20(7): 749–759. https://doi.org/10.1038/448749b
 
Pomeroy SL, Tamayo P, Gaasenbeek M, Sturla LM, Angelo M, McLaughlin ME, et al. (2002). Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415(6870): 436–442. https://doi.org/10.1038/415436a
 
Singh D, Febbo PG, Ross K, Jackson DG, Manola J, Ladd C, et al. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1(2): 203–209. https://doi.org/10.1016/S1535-6108(02)00030-2
 
Sørlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, et al. (2001). Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proceedings of the National Academy of Sciences of the United States of America, 98: 10869–10874. https://doi.org/10.1073/pnas.191367098
 
Yeoh EJ, Ross ME, Shurtleff SA, Williams WK, Patel D, Mahfouz R, et al. (2002). Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell, 1(2): 133–143. https://doi.org/10.1016/S1535-6108(02)00032-6
 
Zhang C, Ye J, Wang X (2022). A computational perspective on projection pursuit in high dimensions: feasible or infeasible feature extraction. International Statistical Review. https://doi.org/10.1111/insr.12517


Copyright
2023 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.
Open access article under the CC BY license.

Keywords
large p small n; linear discriminant analysis; penalized discriminant analysis; supervised classification; SVM

Funding
C. Zhang’s work was partially supported by U.S. National Science Foundation grants DMS-2013486 and DMS-1712418, and by the University of Wisconsin-Madison Office of the Vice Chancellor for Research and Graduate Education with funding from the Wisconsin Alumni Research Foundation.

Journal of data science

  • Online ISSN: 1683-8602
  • Print ISSN: 1680-743X
