Interaction Selection and Prediction Performance in High-Dimensional Data: A Comparative Study of Statistical and Tree-Based Methods
Volume 22, Issue 2 (2024): Special Issue: 2023 Symposium on Data Science and Statistics (SDSS): “Inquire, Investigate, Implement, Innovate”, pp. 259–279
Pub. online: 22 May 2024
Type: Statistical Data Science
Open Access
Received: 1 August 2023
Accepted: 26 March 2024
Published: 22 May 2024
Abstract
Predictive modeling often ignores interaction effects among predictors in high-dimensional data because of analytical and computational challenges. Methodological and computational advances have galvanized research on interaction selection. In this study, we investigate the performance of two types of predictive algorithms that can perform interaction selection. Specifically, we compare the predictive performance and interaction selection accuracy of penalty-based and tree-based predictive algorithms. The penalty-based algorithms included in our comparative study are the regularization path algorithm under the marginality principle (RAMP), the least absolute shrinkage and selection operator (LASSO), the smoothly clipped absolute deviation (SCAD), and the minimax concave penalty (MCP). The tree-based algorithms considered are random forest (RF) and iterative random forest (iRF). We evaluate the effectiveness of these algorithms under various regression and classification models with varying structures and dimensions. We assess predictive performance using the mean squared error for regression and accuracy, sensitivity, specificity, balanced accuracy, and F1 score for classification, and we use interaction coverage to judge each algorithm's efficacy for interaction selection. Our findings reveal that the effectiveness of the selected algorithms varies with the number of predictors (data dimension) and the structure of the data-generating model, i.e., linear or nonlinear, hierarchical or non-hierarchical. Each algorithm included in this study was favored in at least one scenario; nevertheless, the general patterns allow us to recommend specific algorithms for particular scenarios. Our analysis clarifies each algorithm's strengths and limitations, offering guidance to researchers and data analysts in choosing an appropriate algorithm for their predictive modeling task based on their data structure.
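As a minimal sketch of how such a comparison can be set up (not the authors' implementation; Python with scikit-learn and a hypothetical data-generating model with a single true pairwise interaction are assumed), the penalty-based route expands the predictors to all pairwise products before shrinkage, the tree-based route fits a forest to the raw predictors, and interaction coverage is the fraction of true interaction terms recovered:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LassoCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.standard_normal((n, p))
# Hierarchical linear model: main effects x0, x1 plus the interaction x0*x1.
y = X[:, 0] + X[:, 1] + 2.0 * X[:, 0] * X[:, 1] + rng.standard_normal(n)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Penalty-based route: expand to all pairwise products, then apply the LASSO.
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
Z_tr, Z_te = poly.fit_transform(X_tr), poly.transform(X_te)
names = poly.get_feature_names_out([f"x{j}" for j in range(p)])

lasso = LassoCV(cv=5, random_state=0).fit(Z_tr, y_tr)
# Product terms are named like "x0 x1"; keep the selected ones.
selected = {names[j] for j in np.flatnonzero(lasso.coef_) if " " in names[j]}

truth = {"x0 x1"}
coverage = len(truth & selected) / len(truth)  # interaction coverage
print(f"LASSO test MSE: {mean_squared_error(y_te, lasso.predict(Z_te)):.3f}")
print(f"interaction coverage: {coverage:.2f}; selected: {sorted(selected)}")

# Tree-based route: a random forest on the raw predictors can capture the
# interaction implicitly, without an explicit product-term expansion.
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_tr, y_tr)
print(f"RF test MSE: {mean_squared_error(y_te, rf.predict(X_te)):.3f}")
```

RAMP, SCAD, and MCP slot into the same pipeline through their respective implementations (e.g., the R packages RAMP and ncvreg), iRF replaces the plain random forest, and the classification metrics listed above substitute for the MSE when the response is binary.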
Supplementary material
The supplementary material includes the following: (1) README: a brief explanation of the supplementary material; (2) application datasets; (3) code files; and (4) a description of the RIT algorithm and additional simulation results.
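The RIT description itself is deferred to the supplementary material; the following is only a rough, self-contained illustration of the random-intersection idea underlying iRF's interaction discovery, simplified to intersection chains rather than trees, with a made-up function name and toy data:

```python
import numpy as np

def random_intersection_chains(active_sets, depth=4, n_chains=200, rng=None):
    """Toy sketch of the RIT idea: repeatedly intersect the active-feature
    sets of randomly drawn observations; feature combinations that survive
    many intersections are candidate high-order interactions."""
    rng = rng or np.random.default_rng(0)
    survivors = {}
    for _ in range(n_chains):
        # Start a chain from one random observation's active features.
        s = set(active_sets[rng.integers(len(active_sets))])
        for _ in range(depth - 1):
            s &= set(active_sets[rng.integers(len(active_sets))])
            if len(s) < 2:  # nothing interaction-sized left to recover
                break
        if len(s) >= 2:
            key = tuple(sorted(s))
            survivors[key] = survivors.get(key, 0) + 1
    return sorted(survivors.items(), key=lambda kv: -kv[1])

# Toy data: features 0 and 3 co-occur in every sample, plus random noise.
rng = np.random.default_rng(1)
sets = [{0, 3} | set(rng.choice(10, size=2)) for _ in range(100)]
print(random_intersection_chains(sets)[:3])  # (0, 3) should dominate
```

In iRF proper, the sets being intersected come from the decision paths of a fitted random forest, and the forest is refit iteratively with features reweighted by importance; see the supplementary material for the full description.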