
Journal of Data Science


Exploring Massive Risk Factors of Categorical Outcomes via Supervised Dimension Reduction
Yan Li, Kangni Alemdjrodo, Yanzhu Lin, et al. (5 authors)

https://doi.org/10.6339/25-JDS1188
Pub. online: 27 May 2025 · Type: Statistical Data Science · Open Access

Received: 29 September 2024
Accepted: 2 May 2025
Published: 27 May 2025

Abstract

We propose to explore high-dimensional data with categorical outcomes by generalizing the penalized orthogonal-components regression method (POCRE), a supervised dimension reduction method originally developed for high-dimensional linear regression. The generalized POCRE (gPOCRE) sequentially builds orthogonal components by selecting predictors that maximally explain the variation of the response variables. It therefore simultaneously selects significant predictors and reduces dimension, constructing linear components from the selected predictors for a high-dimensional generalized linear model. For multiple categorical outcomes, gPOCRE can also construct common components shared by all outcomes, which improves the power to detect variables relevant to more than one outcome. Both simulation studies and real data analyses illustrate the performance of gPOCRE.
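As a reading aid, the following is a schematic Python sketch of the general idea of sequential, penalized orthogonal components for a single binary outcome. It is not the authors' algorithm (their MATLAB implementation is available via the supplementary material below); the soft-thresholding rule, function name, and parameters are illustrative assumptions.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def sparse_orthogonal_components(X, y, n_components=3, frac=0.5):
        # Sequentially build sparse, mutually orthogonal components:
        # (1) take the direction most correlated with the current residual,
        # (2) soft-threshold it so only strong predictors survive
        #     (the penalized-selection step), (3) form the component,
        # (4) deflate X so later components are orthogonal to this one.
        X = X - X.mean(axis=0)                    # center predictors
        r = y - y.mean()                          # working residual (y coded 0/1)
        Xd = X.copy()                             # deflated design matrix
        weights, scores = [], []
        for _ in range(n_components):
            w = Xd.T @ r                          # correlation-like direction
            cut = frac * np.abs(w).max()          # relative threshold level
            w = np.sign(w) * np.maximum(np.abs(w) - cut, 0.0)
            if not np.any(w):                     # nothing survives shrinkage
                break
            w /= np.linalg.norm(w)
            t = Xd @ w                            # component score vector
            Xd -= np.outer(t, t @ Xd) / (t @ t)   # project t out of Xd
            r = r - t * (t @ r) / (t @ t)         # remove explained variation
            weights.append(w)
            scores.append(t)
        return np.column_stack(scores), np.column_stack(weights)

    # Toy usage: 200 samples, 1000 predictors, only the first 5 informative.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 1000))
    y = (X[:, :5].sum(axis=1) + rng.standard_normal(200) > 0).astype(int)
    T, W = sparse_orthogonal_components(X, y, n_components=2, frac=0.5)
    clf = LogisticRegression().fit(T, y)          # GLM on low-dimensional scores
    print("selected predictors:", np.flatnonzero(np.abs(W).sum(axis=1)))

Fitting the logistic regression on a handful of component scores, rather than on all predictors, is what makes this workable when the number of predictors far exceeds the sample size; the deflation step guarantees the components are mutually orthogonal.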

Supplementary material

The MATLAB code for gPOCRE is available on the journal’s website. The ISOLET data by Fanty and Cole (1990) can be downloaded from https://www.openml.org/search?type=data&sort=version&status=any&order=asc&exact_name=isolet&id=41966, and the breast cancer data can be found in the R package mixOmics (https://mixomics.org/).
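The ISOLET link above encodes the OpenML data id 41966, so the data can also be fetched programmatically. A minimal sketch, assuming scikit-learn is installed (a convenience only, not part of the supplementary code):

    from sklearn.datasets import fetch_openml

    # Fetch ISOLET by the data id embedded in the URL above (id=41966).
    isolet = fetch_openml(data_id=41966, as_frame=False)
    X, y = isolet.data, isolet.target   # features and spoken-letter labels
    print(X.shape)                      # ISOLET has 617 features per utterance

The breast cancer data is bundled with the mixOmics R package, so no separate download is required there.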

References

 
Boulesteix AL, Strimmer K (2006). Partial least squares: A versatile tool for the analysis of high-dimensional genomic data. Briefings in Bioinformatics, 8: 32–44. https://doi.org/10.1093/bib/bbl016
 
Chun H, Keleş S (2010). Sparse partial least squares regression for simultaneous dimension reduction and variable selection. Journal of the Royal Statistical Society Series B: Statistical Methodology, 72(1): 3–25. https://doi.org/10.1111/j.1467-9868.2009.00723.x
 
Chung D, Keleş S (2010). Sparse partial least squares classification for high dimensional data. Statistical Applications in Genetics and Molecular Biology, 9(1): Article 17.
 
De Jong S (1993). SIMPLS: An alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems, 18: 251–263. https://doi.org/10.1016/0169-7439(93)85002-X
 
Fan J, Li R (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96: 1348–1360. https://doi.org/10.1198/016214501753382273
 
Fan J, Samworth R, Wu Y (2009). Ultrahigh dimensional feature selection: Beyond the linear model. Journal of Machine Learning Research, 10: 2013–2038.
 
Fanty M, Cole R (1990). Spoken letter recognition. Proceedings of the International Conference on Neural Information Processing Systems, 4: 220–226.
 
Fisher RA (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2): 179–188. https://doi.org/10.1111/j.1469-1809.1936.tb02137.x
 
Freeman C, Kulić D, Basir O (2013). Feature-selected tree-based classification. IEEE Transactions on Cybernetics, 43(6): 1990–2004. https://doi.org/10.1109/TSMCB.2012.2237394
 
Friedman J, Hastie T, Tibshirani R (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1): 1–22. https://doi.org/10.18637/jss.v033.i01
 
Höskuldsson A (1988). PLS regression methods. Journal of Chemometrics, 2: 211–228. https://doi.org/10.1002/cem.1180020306
 
Höskuldsson A (1992). The H-principle in modelling with applications to chemometrics. Chemometrics and Intelligent Laboratory Systems, 14: 139–153. https://doi.org/10.1016/0169-7439(92)80099-P
 
Hutter C, Zenklusen JC (2018). The Cancer Genome Atlas: Creating lasting value beyond its data. Cell, 173(2): 283–285. https://doi.org/10.1016/j.cell.2018.03.042
 
Johnstone IM, Silverman BW (2004). Needles and straw in haystacks: Empirical Bayes estimates of possibly sparse sequences. The Annals of Statistics, 32(4): 1594–1649. https://doi.org/10.1214/009053604000000030
 
Lê Cao KA, Rossouw D, Robert-Granié C, Besse P (2008). A sparse PLS for variable selection when integrating omics data. Statistical Applications in Genetics and Molecular Biology, 7(1): Article 35.
 
Lin Y, Zhang M, Zhang D (2015). Generalized orthogonal components regression for high dimensional generalized linear models. Computational Statistics & Data Analysis, 88: 119–127. https://doi.org/10.1016/j.csda.2015.02.006
 
Loh WY (2011). Classification and regression trees. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1): 14–23.
 
Massy WF (1965). Principal components regression in exploratory statistical research. Journal of the American Statistical Association, 60(309): 234–256. https://doi.org/10.1080/01621459.1965.10480787
 
McLachlan GJ (2005). Discriminant Analysis and Statistical Pattern Recognition. John Wiley & Sons.
 
Nguyen DV, Rocke DM (2002a). Classification of Acute Leukemia Based on DNA Microarray Gene Expressions Using Partial Least Squares. Springer.
 
Nguyen DV, Rocke DM (2002b). Tumor classification by partial least squares using microarray gene expression data. Bioinformatics, 18: 39–50. https://doi.org/10.1093/bioinformatics/18.1.39
 
Shen J, Gao S (2008). A solution to separation and multicollinearity in multiple logistic regression. Journal of Data Science, 6(4): 515. https://doi.org/10.6339/JDS.2008.06(4).395
 
Tam V, Patel N, Turcotte M, Bossé Y, Paré G, Meyre D (2019). Benefits and limitations of genome-wide association studies. Nature Reviews. Genetics, 20(8): 467–484. https://doi.org/10.1038/s41576-019-0127-1
 
Tibshirani R (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B, 58: 267–288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x
 
Van de Geer SA (2008). High-dimensional generalized linear models and the lasso. The Annals of Statistics, 36(2): 614–645.
 
Velliangiri S, Alagumuthukrishnan S, et al. (2019). A review of dimensionality reduction techniques for efficient computation. Procedia Computer Science, 165: 104–111. https://doi.org/10.1016/j.procs.2020.01.079
 
Wold H (1966). Estimation of principal components and related models by iterative least squares. In PR Krishnaiah (Ed.), Multivariate Analysis, 391–420. New York: Academic Press.
 
Wold H (1975). Soft modelling by latent variables: The non-linear iterative partial least squares (NIPALS) approach. Journal of Applied Probability, 12(S1): 117–142. https://doi.org/10.1017/S0021900200047604
 
Xie J, Lin Y, Yan X, Tang N (2020). Category-adaptive variable screening for ultra-high dimensional heterogeneous categorical data. Journal of the American Statistical Association, 115(530): 747–760. https://doi.org/10.1080/01621459.2019.1573734
 
Zhang C (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38: 894–942.
 
Zhang D, Lin Y, Zhang M (2009). Penalized orthogonal-components regression for large p small n data. Electronic Journal of Statistics, 3: 781–796.


Copyright
2025 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.
Open access article under the CC BY license.

Keywords
gPOCRE, latent model, logistic regression, multinomial regression, orthogonal components

Funding
This research was partially supported by NSF CAREER award IIS-0844945, NIH grants R01GM131491, R01GM131491-02S1, R01GM131491-02S2, R01AG080917, and R01AG080917-02S1, NCI grant P30CA062203, and UCI Anti-Cancer Challenge funds from the UC Irvine Comprehensive Cancer Center. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the Chao Family Comprehensive Cancer Center.

