
JDS

Journal of Data Science

ISSN: 1683-8602, 1680-743X

School of Statistics, Renmin University of China

Article ID: JDS1188

DOI: 10.6339/25-JDS1188

Statistical Data Science

Exploring Massive Risk Factors of Categorical Outcomes via Supervised Dimension Reduction

Yan¹, Kangni Alemdjrodo², Yanzhu Lin³, Min Zhang¹, Dabao Zhang¹,∗

¹Department of Epidemiology and Biostatistics, University of California, Irvine, CA 92617, United States

²Department of Mathematics and Statistics, Georgia State University, Atlanta, GA 30303, United States

³Eli Lilly and Company, Indianapolis, IN 46285, United States

∗Corresponding author. Email: dabao.zhang@uci.edu.

Publication date: 27 May 2025

Volume 23, Issue 4, pages 607–623

Supplementary Material

The MATLAB code for gPOCRE is available on the journal’s website. The ISOLET data by Fanty and Cole (1990) can be downloaded from https://www.openml.org/search?type=data&sort=version&status=any&order=asc&exact_name=isolet&id=41966, and the breast cancer data can be found in the R package mixOmics (https://mixomics.org/).

Article history dates: 29 September 2024; 2 May 2025

© 2025 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.

Open access article under the CC BY license.

We propose to explore high-dimensional data with categorical outcomes by generalizing penalized orthogonal-components regression (POCRE), a supervised dimension reduction method originally proposed for high-dimensional linear regression. The generalized method, gPOCRE, sequentially builds orthogonal components by selecting the predictors that maximally explain the variation of the response variables. In this way, gPOCRE simultaneously selects significant predictors and reduces dimension by constructing linear components of the selected predictors for a high-dimensional generalized linear model. For multiple categorical outcomes, gPOCRE can also construct common components shared by all outcomes, which improves the power to select variables shared across outcomes. Both simulation studies and real data analyses are carried out to illustrate the performance of gPOCRE.
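To make the phrase "sequentially builds up orthogonal components by selecting predictors" more concrete, the following Python sketch shows one simple way such a construction can work. It is not the gPOCRE algorithm: it assumes a single binary outcome treated as a centered numeric response, and it substitutes plain soft-thresholding of the predictor-response covariance (in the style of sparse partial least squares) for the penalized selection step. The function names and the `keep` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np

def soft_threshold(v, lam):
    """Shrink each entry of v toward zero by lam; entries within lam of zero become 0."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def sparse_orthogonal_components(X, y, n_components=2, keep=5):
    """Illustrative stand-in (NOT gPOCRE): build sparse, mutually orthogonal components.

    Each component is a linear combination of at most `keep` predictors,
    chosen by soft-thresholding the covariance between the (deflated)
    predictors and the current residual of the response.
    """
    n, p = X.shape
    keep = min(keep, p - 1)            # need at least one entry below the threshold
    Xd = X - X.mean(axis=0)            # centered predictors, deflated in place of X
    r = y - y.mean()                   # centered response residual
    components, loadings = [], []
    for _ in range(n_components):
        c = Xd.T @ r                               # covariance-like scores
        lam = np.sort(np.abs(c))[-(keep + 1)]      # (keep+1)-th largest magnitude
        w = soft_threshold(c, lam)                 # at most `keep` nonzero loadings
        if not np.any(w):
            break
        w /= np.linalg.norm(w)
        t = Xd @ w                                 # new component
        # Deflate predictors and response so later components are orthogonal to t.
        Xd = Xd - np.outer(t, Xd.T @ t) / (t @ t)
        r = r - t * (t @ r) / (t @ t)
        components.append(t)
        loadings.append(w)
    if not components:
        return np.empty((n, 0)), np.empty((p, 0))
    return np.column_stack(components), np.column_stack(loadings)

# Simulated example: only the first three of 50 predictors affect the outcome.
rng = np.random.default_rng(0)
n, p = 100, 50
X = rng.standard_normal((n, p))
eta = 2.0 * (X[:, 0] + X[:, 1] - X[:, 2])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-eta))).astype(float)

T, W = sparse_orthogonal_components(X, y, n_components=2, keep=5)
print("nonzero loadings per component:", np.count_nonzero(W, axis=0))
print("inner product of components:", T[:, 0] @ T[:, 1])
```

In a sketch of this kind, the columns of `T` could then serve as a small set of covariates in an ordinary logistic regression. The method described in the abstract instead works within the generalized linear model itself and extends to multiple categorical outcomes.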

Keywords: gPOCRE; latent model; logistic regression; multinomial regression; orthogonal components

This research was partially supported by NSF CAREER award IIS-0844945, NIH grants R01GM131491, R01GM131491-02S1, R01GM131491-02S2, R01AG080917, and R01AG080917-02S1, NCI grant P30CA062203, and UCI Anti-Cancer Challenge funds from the UC Irvine Comprehensive Cancer Center. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the Chao Family Comprehensive Cancer Center.

References

Boulesteix, Strimmer (2006). Partial least squares: A versatile tool for the analysis of high-dimensional genomic data. Briefings in Bioinformatics, 8: 32–44. https://doi.org/10.1093/bib/bbl016

Chun, Keleş (2010). Sparse partial least squares regression for simultaneous dimension reduction and variable selection. Journal of the Royal Statistical Society Series B: Statistical Methodology, 72(1): 3–25. https://doi.org/10.1111/j.1467-9868.2009.00723.x

Chung, Keleş (2010). Sparse partial least squares classification for high dimensional data. Statistical Applications in Genetics and Molecular Biology, 9: Article 17.

De Jong (1993). SIMPLS: An alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems, 18: 251–263. https://doi.org/10.1016/0169-7439(93)85002-X

Fan, Li (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96: 1348–1360. https://doi.org/10.1198/016214501753382273

Fan, Samworth, Wu (2009). Ultrahigh dimensional feature selection: Beyond the linear model. Journal of Machine Learning Research, 10: 2013–2038.

Fanty, Cole (1990). Spoken letter recognition. Proceedings of the International Conference on Neural Information Processing Systems, 4: 220–226.

Fisher (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2): 179–188. https://doi.org/10.1111/j.1469-1809.1936.tb02137.x

Freeman, Kulić, Basir (2013). Feature-selected tree-based classification. IEEE Transactions on Cybernetics, 43(6): 1990–2004. https://doi.org/10.1109/TSMCB.2012.2237394

Friedman, Hastie, Tibshirani (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1): 1. https://doi.org/10.18637/jss.v033.i01

Hoskuldsson (1988). PLS regression methods. Journal of Chemometrics, 2: 211–228. https://doi.org/10.1002/cem.1180020306

Hoskuldsson (1992). The h-principle in modelling with applications to chemometrics. Chemometrics and Intelligent Laboratory Systems, 14: 139–153. https://doi.org/10.1016/0169-7439(92)80099-P

Hutter, Zenklusen (2018). The Cancer Genome Atlas: Creating lasting value beyond its data. Cell, 173(2): 283–285. https://doi.org/10.1016/j.cell.2018.03.042

Johnstone, Silverman (2004). Needles and straw in haystacks: Empirical Bayes estimates of possibly sparse sequences. The Annals of Statistics, 32(4): 1594–1649. https://doi.org/10.1214/009053604000000030

Lê Cao, Rossouw, Robert-Granié, Besse (2008). A sparse PLS for variable selection when integrating omics data. Statistical Applications in Genetics and Molecular Biology, 7(1): Article 35.

Lin, Zhang (2015). Generalized orthogonal components regression for high dimensional generalized linear models. Computational Statistics & Data Analysis, 88: 119–127. https://doi.org/10.1016/j.csda.2015.02.006

Loh (2011). Classification and regression trees. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1): 14–23.

Massy (1965). Principal components regression in exploratory statistical research. Journal of the American Statistical Association, 60(309): 234–256. https://doi.org/10.1080/01621459.1965.10480787

McLachlan (2005). Discriminant Analysis and Statistical Pattern Recognition. John Wiley & Sons.

Nguyen, Rocke (2002a). Classification of Acute Leukemia Based on DNA Microarray Gene Expressions Using Partial Least Squares. Springer.

Nguyen, Rocke (2002b). Tumor classification by partial least squares using microarray gene expression data. Bioinformatics, 18: 39–50. https://doi.org/10.1093/bioinformatics/18.1.39

Shen, Gao (2008). A solution to separation and multicollinearity in multiple logistic regression. Journal of Data Science, 6(4): 515. https://doi.org/10.6339/JDS.2008.06(4).395

Tam, Patel, Turcotte, Bossé, Paré, Meyre (2019). Benefits and limitations of genome-wide association studies. Nature Reviews Genetics, 20(8): 467–484. https://doi.org/10.1038/s41576-019-0127-1

Tibshirani (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B, 58: 267–288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x

Van de Geer (2008). High-dimensional generalized linear models and the lasso. The Annals of Statistics, 36(2): 614–645.

Velliangiri, Alagumuthukrishnan, et al. (2019). A review of dimensionality reduction techniques for efficient computation. Procedia Computer Science, 165: 104–111. https://doi.org/10.1016/j.procs.2020.01.079

Wold (1966). Estimation of principal components and related models by iterative least squares. In Krishnaiah (Ed.), Multivariate Analysis, 391–420. New York: Academic Press.

Wold (1975). Soft modelling by latent variables: The non-linear iterative partial least squares (NIPALS) approach. Journal of Applied Probability, 12(S1): 117–142. https://doi.org/10.1017/S0021900200047604

Xie, Lin, Yan, Tang (2020). Category-adaptive variable screening for ultra-high dimensional heterogeneous categorical data. Journal of the American Statistical Association, 115(530): 747–760. https://doi.org/10.1080/01621459.2019.1573734

Zhang (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38: 894–942.

Zhang, Lin, Zhang (2009). Penalized orthogonal-components regression for large p small n data. Electronic Journal of Statistics, 3: 781–796.