Journal of Data Science

Variable Importance Scores
Volume 19, Issue 4 (2021), pp. 569–592
Wei-Yin Loh and Peigen Zhou

https://doi.org/10.6339/21-JDS1023
Type: Statistical Data Science

Received: 6 July 2021
Accepted: 26 August 2021
Published online: 16 September 2021

Abstract

There are many methods of scoring the importance of variables in the prediction of a response, but not much is known about their accuracy. This paper partially fills the gap by introducing a new method based on the GUIDE algorithm and comparing it with 11 existing methods. For data without missing values, eight methods are shown to give biased scores that are too high or too low, depending on the type of variable (ordinal, binary, or nominal) and on whether it depends on other variables, even when all the variables are independent of the response. Among the remaining four methods, only GUIDE continues to give unbiased scores when there are missing data values. It does this with a self-calibrating bias-correction step that is applicable to data with and without missing values. GUIDE also provides threshold scores for differentiating important from unimportant variables with 95 and 99 percent confidence. Correlations of the scores with the predictive power of the methods are studied in three real data sets. For many methods, correlations with marginal predictive power are much higher than correlations with conditional predictive power.
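
As a concrete illustration of the null-case experiments described above, the following sketch scores predictors that are independent of the response by construction. It is a minimal Python/scikit-learn illustration, not the paper's GUIDE implementation: the random forest stands in for a generic impurity-based scorer, the sample size and variable types are arbitrary choices, and the closing permutation-null threshold is only a rough analogue of the self-calibrating bias correction and 95/99 percent confidence thresholds that GUIDE provides.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500

# Mixed variable types, none of which carries information about y.
X = np.column_stack([
    rng.normal(size=n),           # ordinal (continuous) variable
    rng.integers(0, 2, size=n),   # binary variable
    rng.integers(0, 20, size=n),  # nominal variable with 20 levels, label-encoded
])
y = rng.normal(size=n)            # response independent of every predictor

forest = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)
print("Impurity-based importance scores:", forest.feature_importances_)
# The many-level nominal variable typically receives the largest score even
# though no variable is predictive -- the type-dependent bias at issue.

# Generic permutation-null thresholds (an analogue of, not a substitute for,
# GUIDE's confidence thresholds): permute y to destroy any association,
# refit, and take the 95th percentile of the null scores per variable.
null_scores = np.empty((100, X.shape[1]))
for b in range(100):
    f = RandomForestRegressor(n_estimators=100, random_state=b)
    null_scores[b] = f.fit(X, rng.permutation(y)).feature_importances_
print("95% null thresholds:", np.percentile(null_scores, 95, axis=0))

A variable would be declared important only if its observed score clears its null threshold; using the 99th percentile instead gives the stricter of the two confidence levels mentioned in the abstract.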

Supplementary material

Data files and simulation programs used in the article may be found in a supplementary file.



Copyright
© 2021 The Author(s)
This is a free-to-read article.

Keywords
bias correction, classification and regression tree, missing values, prediction

