Variable Importance Scores
Volume 19, Issue 4 (2021), pp. 569–592
Pub. online: 16 September 2021
Type: Statistical Data Science
Received: 6 July 2021
Accepted: 26 August 2021
Published: 16 September 2021
Abstract
There are many methods of scoring the importance of variables in prediction of a response but not much is known about their accuracy. This paper partially fills the gap by introducing a new method based on the GUIDE algorithm and comparing it with 11 existing methods. For data without missing values, eight methods are shown to give biased scores that are too high or too low, depending on the type of variables (ordinal, binary or nominal) and whether or not they are dependent on other variables, even when all of them are independent of the response. Among the remaining four methods, only GUIDE continues to give unbiased scores if there are missing data values. It does this with a self-calibrating bias-correction step that is applicable to data with and without missing values. GUIDE also provides threshold scores for differentiating important from unimportant variables with 95 and 99 percent confidence. Correlations of the scores to the predictive power of the methods are studied in three real data sets. For many methods, correlations with marginal predictive power are much higher than with conditional predictive power.
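The kind of bias described in the abstract can be reproduced in a small simulation. The sketch below is an illustration only and is not the GUIDE procedure: the abstract does not describe the bias-correction step, so a generic permutation calibration is swapped in here. The score (reduction in squared error from the best single split), the function names, and all parameter values are assumptions made for this example. Both predictors are generated independently of the response, yet the raw score of the 8-level nominal predictor tends to be larger than that of the binary one; dividing by the mean score under permuted responses puts both on a scale where a value near 1 suggests no association.

```python
import random


def _avg(values):
    return sum(values) / len(values)


def best_split_gain(x, y):
    """Largest reduction in the sum of squared errors from one binary split on x.

    Levels of x are ordered by their mean response and every cut point in
    that ordering is tried (a hypothetical raw importance score).
    """
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(xi, []).append(yi)
    levels = sorted(groups.values(), key=_avg)
    n, total = len(y), sum(y)
    best = 0.0
    left_n, left_sum = 0, 0.0
    for level in levels[:-1]:  # the last level must stay on the right side
        left_n += len(level)
        left_sum += sum(level)
        right_n, right_sum = n - left_n, total - left_sum
        gain = left_sum**2 / left_n + right_sum**2 / right_n - total**2 / n
        best = max(best, gain)
    return best


def mean_null_gain(n_levels, n_obs=100, n_reps=300, seed=0):
    """Average raw score when x (n_levels categories) is independent of y."""
    rng = random.Random(seed)
    gains = []
    for _ in range(n_reps):
        x = [rng.randrange(n_levels) for _ in range(n_obs)]
        y = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        gains.append(best_split_gain(x, y))
    return _avg(gains)


def calibrated_score(x, y, n_perm=200, seed=0):
    """Raw score divided by its mean under random permutations of y.

    Assumed calibration for illustration; not GUIDE's bias correction.
    """
    rng = random.Random(seed)
    observed = best_split_gain(x, y)
    y_perm = list(y)
    null_gains = []
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        null_gains.append(best_split_gain(x, y_perm))
    null_mean = _avg(null_gains)
    return observed / null_mean if null_mean > 0 else 0.0


raw_binary = mean_null_gain(n_levels=2)
raw_nominal = mean_null_gain(n_levels=8)
print(f"mean raw score, binary predictor:  {raw_binary:.2f}")
print(f"mean raw score, 8-level predictor: {raw_nominal:.2f}")
```

Calling `calibrated_score(x, y)` on a single predictor returns the ratio of its observed score to the permutation average. Permutation is used here because it requires no distributional assumptions, at the cost of recomputing the score `n_perm` times.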
Supplementary material
Data files and simulation programs used in the article may be found in a supplementary file.