Exploring Massive Risk Factors of Categorical Outcomes via Supervised Dimension Reduction
Pub. online: 27 May 2025
Type: Statistical Data Science
Open Access
Received: 29 September 2024
Accepted: 2 May 2025
Published: 27 May 2025
Abstract
We propose to explore high-dimensional data with categorical outcomes by generalizing the penalized orthogonal-components regression method (POCRE), a supervised dimension reduction method originally proposed for high-dimensional linear regression. The generalized POCRE, i.e., gPOCRE, sequentially builds orthogonal components by selecting predictors that maximally explain the variation of the response variables. gPOCRE therefore simultaneously selects significant predictors and reduces the dimension by constructing linear components of the selected predictors for a high-dimensional generalized linear model. For multiple categorical outcomes, gPOCRE can also construct common components shared by all outcomes to improve the power to select variables shared by multiple outcomes. Both simulation studies and real data analyses illustrate the performance of gPOCRE.
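To make the sequential construction concrete, below is a minimal Python sketch for a binary outcome. It is a simplified stand-in for gPOCRE, not the authors' MATLAB implementation: the names (`gpocre_sketch`, `soft_threshold`) are hypothetical, soft-thresholding replaces gPOCRE's actual penalization, and a centered 0/1 outcome replaces the generalized-linear working response.

```python
import numpy as np

def soft_threshold(v, lam):
    """Soft-thresholding: a simple stand-in for gPOCRE's penalized loadings."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def gpocre_sketch(X, y, n_components=3, lam=0.1):
    """Sequentially build sparse, mutually orthogonal components for a binary
    0/1 outcome; an illustrative simplification of the gPOCRE scheme."""
    Xr = X - X.mean(axis=0)          # centered predictors, deflated each round
    yc = y - y.mean()                # centered outcome as a working response
    scores, loadings = [], []
    for _ in range(n_components):
        w = Xr.T @ yc                # direction most associated with the outcome
        w = soft_threshold(w, lam * np.max(np.abs(w)))  # sparsify the loadings
        if not np.any(w):            # nothing left to select
            break
        w /= np.linalg.norm(w)
        t = Xr @ w                   # new component
        scores.append(t)
        loadings.append(w)
        # Deflate: project the component out of X so later ones are orthogonal.
        Xr = Xr - np.outer(t, t @ Xr) / (t @ t)
    return np.column_stack(scores), np.column_stack(loadings)

# Toy usage: 5 of 500 predictors drive a binary outcome.
rng = np.random.default_rng(0)
n, p = 100, 500
X = rng.standard_normal((n, p))
logits = X[:, :5] @ np.ones(5)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(float)
T, W = gpocre_sketch(X, y, n_components=2)
print(T.shape, (np.abs(W) > 0).sum(axis=0))  # scores and per-component sparsity
```

The resulting component scores can then serve as predictors in an ordinary low-dimensional logistic regression, which is the sense in which supervised dimension reduction pairs with variable selection here.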
Supplementary material
The MATLAB code for gPOCRE is available on the journal's website. The ISOLET data by Fanty and Cole (1990) can be downloaded from https://www.openml.org/search?type=data&sort=version&status=any&order=asc&exact_name=isolet&id=41966, and the breast cancer data can be found in the R package mixOmics (https://mixomics.org/).
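For convenience, here is a minimal Python sketch for retrieving the ISOLET data programmatically via scikit-learn's fetch_openml; the OpenML id 41966 is taken from the URL above, and whether that entry matches the exact version analyzed in the paper is an assumption to verify.

```python
from sklearn.datasets import fetch_openml

# Fetch ISOLET from OpenML; data_id=41966 is taken from the URL above
# (assumption: this entry matches the version analyzed in the paper).
isolet = fetch_openml(data_id=41966, as_frame=True)
X, y = isolet.data, isolet.target
print(X.shape, y.nunique())  # predictor matrix and number of letter classes
```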