Journal of Data Science

Interaction Selection and Prediction Performance in High-Dimensional Data: A Comparative Study of Statistical and Tree-Based Methods
Volume 22, Issue 2 (2024): Special Issue: 2023 Symposium on Data Science and Statistics (SDSS): “Inquire, Investigate, Implement, Innovate”, pp. 259–279
Chinedu J. Nzekwe, Seongtae Kim, Sayed A. Mostafa

https://doi.org/10.6339/24-JDS1127
Pub. online: 22 May 2024 · Type: Statistical Data Science · Open Access

Received: 1 August 2023
Accepted: 26 March 2024
Published: 22 May 2024

Abstract

Predictive modeling often ignores interaction effects among predictors in high-dimensional data because of analytical and computational challenges. Methodological and computational advances have since galvanized research in interaction selection. In this study, we investigate the performance of two types of predictive algorithms that can perform interaction selection. Specifically, we compare the predictive performance and interaction selection accuracy of penalty-based and tree-based predictive algorithms. The penalty-based algorithms included in our comparative study are the regularization path algorithm under the marginality principle (RAMP), the least absolute shrinkage and selection operator (LASSO), the smoothly clipped absolute deviation (SCAD), and the minimax concave penalty (MCP). The tree-based algorithms considered are random forest (RF) and iterative random forest (iRF). We evaluate the effectiveness of these algorithms under various regression and classification models with varying structures and dimensions. We assess predictive performance using the mean squared error for regression, and accuracy, sensitivity, specificity, balanced accuracy, and F1 score for classification. We use interaction coverage to judge each algorithm's efficacy for interaction selection. Our findings reveal that the effectiveness of the selected algorithms varies with the number of predictors (data dimension) and the structure of the data-generating model, i.e., linear or nonlinear, hierarchical or non-hierarchical. At least one scenario favored each of the algorithms included in this study; however, the general patterns allow us to recommend specific algorithms for specific scenarios. Our analysis clarifies each algorithm's strengths and limitations, offering guidance to researchers and data analysts in choosing an appropriate algorithm for their predictive modeling task based on their data structure.
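To make the comparison concrete, the following R sketch fits several of the compared algorithms to a small simulated regression with one true pairwise interaction. The simulation design here is hypothetical and chosen only for illustration; it is not the paper's simulation setup. The package interfaces follow the CRAN references cited below (glmnet, ncvreg, RAMP, randomForest).

    # Minimal sketch (not the paper's design): two main effects and one
    # true interaction x1*x2 among p = 50 predictors, n = 200 observations.
    library(glmnet)        # LASSO (Friedman, Tibshirani, Hastie, 2010)
    library(ncvreg)        # SCAD and MCP (Breheny, Huang, 2011)
    library(RAMP)          # RAMP (Feng, Hao, Zhang, 2020)
    library(randomForest)  # RF (Liaw, Wiener, 2002)

    set.seed(1)
    n <- 200; p <- 50
    X <- matrix(rnorm(n * p), n, p)
    y <- 2 * X[, 1] + 2 * X[, 2] + 3 * X[, 1] * X[, 2] + rnorm(n)

    # LASSO, SCAD, and MCP select from an explicitly expanded design that
    # holds all main effects and pairwise products; RAMP builds its own
    # hierarchy-respecting path from the raw predictors.
    XX <- model.matrix(~ .^2 - 1, data = as.data.frame(X))
    fit_lasso <- cv.glmnet(XX, y, alpha = 1)
    fit_scad  <- cv.ncvreg(XX, y, penalty = "SCAD")
    fit_mcp   <- cv.ncvreg(XX, y, penalty = "MCP")
    fit_ramp  <- RAMP(X, y, family = "gaussian", penalty = "LASSO")

    # RF captures interactions implicitly through successive splits;
    # iRF (Basu, Kumbier, 2018) exposes a similar x/y interface.
    fit_rf <- randomForest(X, y)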

Supplementary material

The supplementary material includes the following: (1) README: a brief explanation of the supplementary material; (2) application datasets; (3) code files; and (4) the description of the RIT algorithm and additional simulation results.
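Interaction coverage, the selection criterion named in the abstract, can be read as the fraction of true interaction pairs that an algorithm recovers. Below is a minimal R sketch of that computation; the function name and the "x1:x2" pair encoding are ours for illustration, not the paper's.

    # Hypothetical helper (not from the paper): the share of true
    # interaction pairs that appear among the selected pairs. Pairs are
    # matched order-free, so "x2:x1" and "x1:x2" count as the same pair.
    interaction_coverage <- function(selected, truth) {
      canon <- function(pairs) {
        vapply(strsplit(pairs, ":", fixed = TRUE),
               function(p) paste(sort(p), collapse = ":"), character(1))
      }
      length(intersect(canon(selected), canon(truth))) / length(truth)
    }

    # Example: one of the two true pairs is recovered, so coverage is 0.5.
    interaction_coverage(selected = c("x2:x1", "x3:x7"),
                         truth    = c("x1:x2", "x4:x5"))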

References

 
Antoniou A, Pharoah P, Narod S, Risch H, Eyfjörd J, Hopper J, et al. (2003). Average risks of breast and ovarian cancer associated with BRCA1 or BRCA2 mutations detected in case series unselected for family history: A combined analysis of 22 studies. American Journal of Human Genetics, 72: 1117–1130. https://doi.org/10.1086/375033
 
Basu S, Kumbier K (2018). iRF: Iterative Random Forests. R package version 3.0.0.
 
Basu S, Kumbier K, Brown JB, Yu B (2018). Iterative random forests to discover predictive and stable high-order interactions. Proceedings of the National Academy of Sciences, 115(8): 1943–1948. https://doi.org/10.1073/pnas.1711236115
 
Bien J, Taylor J, Tibshirani R (2013). A lasso for hierarchical interactions. The Annals of Statistics, 41(3): 1111–1141.
 
Breheny P, Huang J (2011). Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Annals of Applied Statistics, 5(1): 232–253.
 
Breheny P, Huang J (2015). Group descent algorithms for nonconvex penalized linear and logistic regression models with grouped predictors. Statistics and Computing, 25(2): 173–187. https://doi.org/10.1007/s11222-013-9424-2
 
Breiman L (1996). Bagging predictors. Machine Learning, 24(2): 123–140.
 
Breiman L (2001). Random forests. Machine Learning, 45(1): 5–32. https://doi.org/10.1023/A:1010933404324
 
Breiman L, Friedman JH, Olshen RA, Stone CJ (1984). Classification and Regression Trees. Wadsworth, Belmont, CA.
 
Chipman H, Hamada M, Wu CF (1997). A Bayesian variable-selection approach for analyzing designed experiments with complex aliasing. Technometrics, 39(4): 372–381. https://doi.org/10.1080/00401706.1997.10485156
 
Choi N, Li W, Zhu J (2010). Variable selection with the strong heredity constraint and its oracle property. Journal of the American Statistical Association, 105: 354–364. https://doi.org/10.1198/jasa.2010.tm08281
 
Cordell D, Drangert JO, White S (2009). The story of phosphorus: Global food security and food for thought. Global Environmental Change, 19: 292–305. https://doi.org/10.1016/j.gloenvcha.2008.10.009
 
Deng CX, Brodie SG (2000). Roles of BRCA1 and its interacting proteins. BioEssays, 22(8): 728–737. https://doi.org/10.1002/1521-1878(200008)22:8<728::AID-BIES6>3.0.CO;2-B
 
Dong Y, Wu Y (2022). Nonparametric interaction selection. Statistica Sinica, 32: 1563–1582.
 
Donoho D (2000). High-dimensional data analysis: The curses and blessings of dimensionality. AMS Math Challenges Lecture, 1(2000): 1–32.
 
Evans JD (2006). Beepath: An ordered quantitative-PCR array for exploring honey bee immunity and disease. Journal of Invertebrate Pathology, 93(2): 135–139. https://doi.org/10.1016/j.jip.2006.04.004
 
Fan J, Li R (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456): 1348–1360. https://doi.org/10.1198/016214501753382273
 
Fan J, Li R (2006). Statistical challenges with high dimensionality: Feature selection in knowledge discovery. In: Sanz-Solé M, Soria J, Varona JL, Verdera J (eds), Proceedings of the International Congress of Mathematicians, Madrid, 3: 595–622.
 
Fan J, Lv J (2010). A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20: 101–148.
 
Feng Y, Hao N, Helen Zhang H (2020). RAMP: Regularized Generalized Linear Models with Interaction Effects. R package version 2.0.2.
 
Friedman J, Tibshirani R, Hastie T (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1): 1–22. https://doi.org/10.18637/jss.v033.i01
 
Hao N, Feng Y, Zhang HH (2018). Model selection for high-dimensional quadratic regression via regularization. Journal of the American Statistical Association, 113(522): 615–625. https://doi.org/10.1080/01621459.2016.1264956
 
Hao N, Zhang HH (2014). Interaction screening for ultrahigh-dimensional data. Journal of the American Statistical Association, 109(507): 1285–1301. https://doi.org/10.1080/01621459.2014.881741
 
Hao N, Zhang HH (2017). A note on high-dimensional linear regression with interactions. American Statistician, 71(4): 291–297. https://doi.org/10.1080/00031305.2016.1264311
 
Hastie T, Tibshirani R (1990). Exploring the nature of covariate effects in the proportional hazards model. Biometrics, 46(4): 1005–1016. https://doi.org/10.2307/2532444
 
Hastie T, Tibshirani R, Friedman J, Franklin J (2004). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. The Mathematical Intelligencer, 27: 83–85.
 
Jain R, Xu W (2021). HDSI: High dimensional selection with interactions algorithm on feature selection and testing. PLoS ONE, 16(2): e0246159. https://doi.org/10.1371/journal.pone.0246159
 
Kong Y, Li D, Fan Y, Lv J (2017). Interaction pursuit in high-dimensional multi-response regression via distance correlation. arXiv preprint.
 
Kooperberg C, Leblanc M (2008). Increasing the power of identifying gene-gene interactions in genome-wide association studies. Genetic Epidemiology, 32: 255–263. https://doi.org/10.1002/gepi.20300
 
Kotsiantis S, Kanellopoulos D (2012). Combining bagging, boosting, and random subspace ensembles for regression problems. International Journal of Innovative Computing, Information and Control: 3953–3961.
 
Kuchenbaecker K, Hopper J, Barnes D, Phillips KA, Mooij T, Roos-Blom MJ, et al. (2017). Risks of breast, ovarian, and contralateral breast cancer for BRCA1 and BRCA2 mutation carriers. JAMA, 317: 2402. https://doi.org/10.1001/jama.2017.7112
 
Liaw A, Wiener M (2002). Classification and regression by randomForest. R News, 2(3): 18–22.
 
Manolio TA, Collins FS (2007). Genes, environment, health, and disease. Human Heredity, 63(2): 63–66. https://doi.org/10.1159/000099178
 
McCullagh P (2002). What is a statistical model? The Annals of Statistics, 30(5): 1225–1267. https://doi.org/10.1214/aos/1035844977
 
Meinshausen N (2010). Node harvest. Annals of Applied Statistics, 4(4): 2049–2072. https://doi.org/10.1214/10-AOAS367
 
Nelder JA (1977). A reformulation of linear models. Journal of the Royal Statistical Society, Series A, 140(1): 48–77. https://doi.org/10.2307/2344517
 
Shah RD, Meinshausen N (2014). Random intersection trees. Journal of Machine Learning Research, 15(1): 629–654.
 
Tibshirani R (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, Methodological, 58(1): 267–288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x
 
Ho TK (1998). The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8): 832–844. https://doi.org/10.1109/34.709601
 
Van der Laan MJ, Polley EC, Hubbard AE (2007). Super learner. Statistical Applications in Genetics and Molecular Biology, 6(1): Article 25.
 
Wolberg W, Mangasarian O, Street N, Street W (1995). Breast Cancer Wisconsin (Diagnostic). UCI Machine Learning Repository. https://doi.org/10.24432/C5DW2B
 
Yuan M, Joseph VR, Zou H (2009). Structured variable selection and estimation. Annals of Applied Statistics, 3(4): 1738–1757. https://doi.org/10.1214/09-AOAS254
 
Zhang CH (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2): 894–942. https://doi.org/10.1214/09-AOS729
 
Zhao P, Rocha G, Yu B (2009). The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, 37(6A): 3468–3497. https://doi.org/10.1214/07-AOS584
 
Zou H (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476): 1418–1429. https://doi.org/10.1198/016214506000000735


Copyright
2024 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.
Open access article under the CC BY license.

Keywords
interaction selection, iRF, LASSO, predictive modeling, RAMP, RF

