Exploring Massive Risk Factors of Categorical Outcomes via Supervised Dimension Reduction
Pub. online: 27 May 2025
Type: Statistical Data Science
Open Access
Received: 29 September 2024
Accepted: 2 May 2025
Published: 27 May 2025
Abstract
We propose to explore high-dimensional data with categorical outcomes by generalizing the penalized orthogonal-components regression method (POCRE), a supervised dimension reduction method originally proposed for high-dimensional linear regression. The generalized POCRE, i.e., gPOCRE, sequentially builds orthogonal components by selecting predictors that maximally explain the variation of the response variables. gPOCRE therefore simultaneously selects significant predictors and reduces the dimension by constructing linear components of the selected predictors for a high-dimensional generalized linear model. For multiple categorical outcomes, gPOCRE can also construct common components shared by all outcomes to improve the power to select variables shared by multiple outcomes. Both simulation studies and real data analyses illustrate the performance of gPOCRE.
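To make the sequential construction concrete, below is a minimal Python sketch for a binary outcome. It is a simplified stand-in for gPOCRE, not the authors' MATLAB implementation: the names (`gpocre_sketch`, `soft_threshold`) are hypothetical, soft-thresholding replaces gPOCRE's actual penalization, and a centered 0/1 outcome replaces the generalized-linear working response.

```python
import numpy as np

def soft_threshold(v, lam):
    """Soft-thresholding: a simple stand-in for gPOCRE's penalized loadings."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def gpocre_sketch(X, y, n_components=3, lam=0.1):
    """Sequentially build sparse, mutually orthogonal components for a binary
    0/1 outcome; an illustrative simplification of the gPOCRE scheme."""
    Xr = X - X.mean(axis=0)          # centered predictors, deflated each round
    yc = y - y.mean()                # centered outcome as a working response
    scores, loadings = [], []
    for _ in range(n_components):
        w = Xr.T @ yc                # direction most associated with the outcome
        w = soft_threshold(w, lam * np.max(np.abs(w)))  # sparsify the loadings
        if not np.any(w):            # nothing left to select
            break
        w /= np.linalg.norm(w)
        t = Xr @ w                   # new component
        scores.append(t)
        loadings.append(w)
        # Deflate: project the component out of X so later ones are orthogonal.
        Xr = Xr - np.outer(t, t @ Xr) / (t @ t)
    return np.column_stack(scores), np.column_stack(loadings)

# Toy usage: 5 of 500 predictors drive a binary outcome.
rng = np.random.default_rng(0)
n, p = 100, 500
X = rng.standard_normal((n, p))
logits = X[:, :5] @ np.ones(5)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(float)
T, W = gpocre_sketch(X, y, n_components=2)
print(T.shape, (np.abs(W) > 0).sum(axis=0))  # scores and per-component sparsity
```

The resulting component scores can then serve as predictors in an ordinary low-dimensional logistic regression, which is the sense in which supervised dimension reduction pairs with variable selection here.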
Supplementary material
The MATLAB code for gPOCRE is available on the journal's website. The ISOLET data by Fanty and Cole (1990) can be downloaded from https://www.openml.org/search?type=data&sort=version&status=any&order=asc&exact_name=isolet&id=41966, and the breast cancer data can be found in the R package mixOmics (https://mixomics.org/).
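For convenience, here is a minimal Python sketch for retrieving the ISOLET data programmatically via scikit-learn's fetch_openml; the OpenML id 41966 is taken from the URL above, and whether that entry matches the exact version analyzed in the paper is an assumption to verify.

```python
from sklearn.datasets import fetch_openml

# Fetch ISOLET from OpenML; data_id=41966 is taken from the URL above
# (assumption: this entry matches the version analyzed in the paper).
isolet = fetch_openml(data_id=41966, as_frame=True)
X, y = isolet.data, isolet.target
print(X.shape, y.nunique())  # predictor matrix and number of letter classes
```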