Journal of Data Science

Variable Importance Scores
Volume 19, Issue 4 (2021), pp. 569–592
Wei-Yin Loh and Peigen Zhou

https://doi.org/10.6339/21-JDS1023
Type: Statistical Data Science

Received: 6 July 2021
Accepted: 26 August 2021
Published online: 16 September 2021

Abstract

There are many methods of scoring the importance of variables in the prediction of a response, but not much is known about their accuracy. This paper partially fills the gap by introducing a new method based on the GUIDE algorithm and comparing it with 11 existing methods. For data without missing values, eight methods are shown to give biased scores that are too high or too low, depending on the type of variable (ordinal, binary, or nominal) and on whether it depends on other variables, even when all the variables are independent of the response. Among the remaining four methods, only GUIDE continues to give unbiased scores when there are missing data values. It does this with a self-calibrating bias-correction step that is applicable to data with and without missing values. GUIDE also provides threshold scores for differentiating important from unimportant variables with 95 and 99 percent confidence. Correlations of the scores with the predictive power of the methods are studied in three real data sets. For many methods, correlations with marginal predictive power are much higher than correlations with conditional predictive power.
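
As a concrete illustration of the null-case experiments described above, the following sketch scores predictors that are independent of the response by construction. It is a minimal Python/scikit-learn illustration, not the paper's GUIDE implementation: the random forest stands in for a generic impurity-based scorer, the sample size and variable types are arbitrary choices, and the closing permutation-null threshold is only a rough analogue of the self-calibrating bias correction and 95/99 percent confidence thresholds that GUIDE provides.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500

# Mixed variable types, none of which carries information about y.
X = np.column_stack([
    rng.normal(size=n),           # ordinal (continuous) variable
    rng.integers(0, 2, size=n),   # binary variable
    rng.integers(0, 20, size=n),  # nominal variable with 20 levels, label-encoded
])
y = rng.normal(size=n)            # response independent of every predictor

forest = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)
print("Impurity-based importance scores:", forest.feature_importances_)
# The many-level nominal variable typically receives the largest score even
# though no variable is predictive -- the type-dependent bias at issue.

# Generic permutation-null thresholds (an analogue of, not a substitute for,
# GUIDE's confidence thresholds): permute y to destroy any association,
# refit, and take the 95th percentile of the null scores per variable.
null_scores = np.empty((100, X.shape[1]))
for b in range(100):
    f = RandomForestRegressor(n_estimators=100, random_state=b)
    null_scores[b] = f.fit(X, rng.permutation(y)).feature_importances_
print("95% null thresholds:", np.percentile(null_scores, 95, axis=0))

A variable would be declared important only if its observed score clears its null threshold; using the 99th percentile instead gives the stricter of the two confidence levels mentioned in the abstract.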

Supplementary material

Data files and simulation programs used in the article may be found in a supplementary file.



Copyright
© 2021 The Author(s)
This is a free-to-read article.

Keywords
bias correction, classification and regression tree, missing values, prediction

