Interaction Selection and Prediction Performance in High-Dimensional Data: A Comparative Study of Statistical and Tree-Based Methods
Volume 22, Issue 2 (2024): Special Issue: 2023 Symposium on Data Science and Statistics (SDSS): “Inquire, Investigate, Implement, Innovate”, pp. 259–279
Pub. online: 22 May 2024
Type: Statistical Data Science
Open Access
Received: 1 August 2023
Accepted: 26 March 2024
Published: 22 May 2024
Abstract
Predictive modeling often ignores interaction effects among predictors in high-dimensional data because of analytical and computational challenges. Methodological and computational advances have galvanized research on interaction selection. In this study, we investigate the performance of two types of predictive algorithms that can perform interaction selection. Specifically, we compare the predictive performance and interaction selection accuracy of penalty-based and tree-based predictive algorithms. The penalty-based algorithms included in our comparative study are the regularization path algorithm under the marginality principle (RAMP), the least absolute shrinkage and selection operator (LASSO), the smoothly clipped absolute deviation (SCAD), and the minimax concave penalty (MCP). The tree-based algorithms considered are random forest (RF) and iterative random forest (iRF). We evaluate the effectiveness of these algorithms under various regression and classification models with varying structures and dimensions. We assess predictive performance using the mean squared error for regression and accuracy, sensitivity, specificity, balanced accuracy, and F1 score for classification, and we use interaction coverage to judge each algorithm's efficacy for interaction selection. Our findings reveal that the effectiveness of the selected algorithms varies with the number of predictors (data dimension) and the structure of the data-generating model, i.e., linear or nonlinear, hierarchical or non-hierarchical. Each algorithm included in this study was favored in at least one scenario; nevertheless, the general patterns allow us to recommend specific algorithms for particular scenarios. Our analysis clarifies each algorithm's strengths and limitations, offering guidance to researchers and data analysts in choosing an appropriate algorithm for their predictive modeling task based on their data structure.
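As a minimal sketch of how such a comparison can be set up (not the authors' implementation; Python with scikit-learn and a hypothetical data-generating model with a single true pairwise interaction are assumed), the penalty-based route expands the predictors to all pairwise products before shrinkage, the tree-based route fits a forest to the raw predictors, and interaction coverage is the fraction of true interaction terms recovered:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LassoCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.standard_normal((n, p))
# Hierarchical linear model: main effects x0, x1 plus the interaction x0*x1.
y = X[:, 0] + X[:, 1] + 2.0 * X[:, 0] * X[:, 1] + rng.standard_normal(n)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Penalty-based route: expand to all pairwise products, then apply the LASSO.
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
Z_tr, Z_te = poly.fit_transform(X_tr), poly.transform(X_te)
names = poly.get_feature_names_out([f"x{j}" for j in range(p)])

lasso = LassoCV(cv=5, random_state=0).fit(Z_tr, y_tr)
# Product terms are named like "x0 x1"; keep the selected ones.
selected = {names[j] for j in np.flatnonzero(lasso.coef_) if " " in names[j]}

truth = {"x0 x1"}
coverage = len(truth & selected) / len(truth)  # interaction coverage
print(f"LASSO test MSE: {mean_squared_error(y_te, lasso.predict(Z_te)):.3f}")
print(f"interaction coverage: {coverage:.2f}; selected: {sorted(selected)}")

# Tree-based route: a random forest on the raw predictors can capture the
# interaction implicitly, without an explicit product-term expansion.
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_tr, y_tr)
print(f"RF test MSE: {mean_squared_error(y_te, rf.predict(X_te)):.3f}")
```

RAMP, SCAD, and MCP slot into the same pipeline through their respective implementations (e.g., the R packages RAMP and ncvreg), iRF replaces the plain random forest, and the classification metrics listed above substitute for the MSE when the response is binary.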
Supplementary material
The supplementary material includes the following: (1) README: a brief explanation of the supplementary material; (2) application datasets; (3) code files; and (4) a description of the RIT algorithm and additional simulation results.
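The RIT description itself is deferred to the supplementary material; the following is only a rough, self-contained illustration of the random-intersection idea underlying iRF's interaction discovery, simplified to intersection chains rather than trees, with a made-up function name and toy data:

```python
import numpy as np

def random_intersection_chains(active_sets, depth=4, n_chains=200, rng=None):
    """Toy sketch of the RIT idea: repeatedly intersect the active-feature
    sets of randomly drawn observations; feature combinations that survive
    many intersections are candidate high-order interactions."""
    rng = rng or np.random.default_rng(0)
    survivors = {}
    for _ in range(n_chains):
        # Start a chain from one random observation's active features.
        s = set(active_sets[rng.integers(len(active_sets))])
        for _ in range(depth - 1):
            s &= set(active_sets[rng.integers(len(active_sets))])
            if len(s) < 2:  # nothing interaction-sized left to recover
                break
        if len(s) >= 2:
            key = tuple(sorted(s))
            survivors[key] = survivors.get(key, 0) + 1
    return sorted(survivors.items(), key=lambda kv: -kv[1])

# Toy data: features 0 and 3 co-occur in every sample, plus random noise.
rng = np.random.default_rng(1)
sets = [{0, 3} | set(rng.choice(10, size=2)) for _ in range(100)]
print(random_intersection_chains(sets)[:3])  # (0, 3) should dominate
```

In iRF proper, the sets being intersected come from the decision paths of a fitted random forest, and the forest is refit iteratively with features reweighted by importance; see the supplementary material for the full description.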