Journal of Data Science (JDS)

ISSN: 1683-8602, 1680-743X

Publisher: School of Statistics, Renmin University of China

Article ID: JDS1023

DOI: 10.6339/21-JDS1023

Section: Statistical Data Science

Variable Importance Scores

Wei-Yin Loh1,∗ (ORCID: https://orcid.org/0000-0001-6983-2495) and Peigen Zhou1

1Department of Statistics, University of Wisconsin, 1300 University Avenue, Madison, WI 53706, USA

∗Corresponding author. Email: loh@stat.wisc.edu.

Published: 16 September 2021

Volume 19, Issue 4, pages 569–592

Supplementary Material

Data files and simulation programs used in the article may be found in a supplementary file.

Received 6 July 2021; accepted 26 August 2021.

© 2021 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China. Open access article under the CC BY license.

There are many methods of scoring the importance of variables in predicting a response, but not much is known about their accuracy. This paper partially fills the gap by introducing a new method based on the GUIDE algorithm and comparing it with 11 existing methods. For data without missing values, eight methods are shown to give biased scores that are too high or too low, depending on the type of variables (ordinal, binary or nominal) and whether or not they are dependent on other variables, even when all of them are independent of the response. Among the remaining four methods, only GUIDE continues to give unbiased scores if there are missing data values. It does this with a self-calibrating bias-correction step that is applicable to data with and without missing values. GUIDE also provides threshold scores for differentiating important from unimportant variables with 95 and 99 percent confidence. Correlations of the scores with the predictive power of the methods are studied in three real data sets. For many methods, correlations with marginal predictive power are much higher than with conditional predictive power.

Keywords: bias correction; classification and regression tree; missing values; prediction
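The kind of bias described in the abstract can be illustrated with a small simulation. The sketch below is not the GUIDE procedure, whose details are not given here. It assumes a generic greedy best-split score (reduction in sum of squared errors) and a generic permutation-based calibration, and the names best_split_score and calibrated_scores are hypothetical. Under a response that is independent of both predictors, the raw score of an ordinal variable with many distinct values tends to exceed that of a binary variable; dividing each score by its mean under permuted responses puts the two on a common scale, and an upper quantile of the permuted maxima gives one possible threshold.

```python
import random
import statistics


def best_split_score(x, y):
    """Largest reduction in sum of squared errors over binary splits x <= c."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    total = sum(v for _, v in pairs)
    best = 0.0
    left_sum = 0.0
    for i in range(1, n):
        left_sum += pairs[i - 1][1]
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no split point between tied x values
        right_sum = total - left_sum
        # SSE reduction equals the between-group sum of squares
        gain = left_sum ** 2 / i + right_sum ** 2 / (n - i) - total ** 2 / n
        best = max(best, gain)
    return best


def calibrated_scores(columns, y, n_perm=200, seed=0):
    """Rescale raw scores by their mean under permuted responses.

    Assumes every column has at least two distinct values, so that the
    null mean is positive. Returns raw scores, null means, adjusted
    scores, and an empirical 95% threshold for the adjusted scores.
    """
    rng = random.Random(seed)
    raw = {name: best_split_score(x, y) for name, x in columns.items()}
    null = {name: [] for name in columns}
    for _ in range(n_perm):
        y_perm = list(y)
        rng.shuffle(y_perm)  # breaks any link between y and the predictors
        for name, x in columns.items():
            null[name].append(best_split_score(x, y_perm))
    null_mean = {name: statistics.fmean(v) for name, v in null.items()}
    adjusted = {name: raw[name] / null_mean[name] for name in columns}
    # Per-permutation maximum of the rescaled null scores over all variables
    null_max = [
        max(null[name][b] / null_mean[name] for name in columns)
        for b in range(n_perm)
    ]
    threshold = statistics.quantiles(null_max, n=20)[-1]  # 95th percentile
    return raw, null_mean, adjusted, threshold


# Simulated data: the response is independent of both predictors.
data_rng = random.Random(1)
n_obs = 100
response = [data_rng.gauss(0, 1) for _ in range(n_obs)]
predictors = {
    "ordinal": [data_rng.gauss(0, 1) for _ in range(n_obs)],
    "binary": [data_rng.choice([0, 1]) for _ in range(n_obs)],
}
raw, null_mean, adjusted, threshold = calibrated_scores(predictors, response)
for name in predictors:
    print(name, "raw:", round(raw[name], 2),
          "null mean:", round(null_mean[name], 2),
          "adjusted:", round(adjusted[name], 2))
print("95% threshold for adjusted scores:", round(threshold, 2))
```

In this sketch the null mean of the ordinal variable is larger than that of the binary variable because the ordinal score is a maximum over many candidate splits. An adjusted score above the threshold would flag a variable as important under these assumptions.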
