Variable Importance Measures for Multivariate Random Forests
Pub. online: 18 September 2024
Type: Statistical Data Science
Open Access
† This work is part of the first author's dissertation research, with the second and third authors as dissertation advisors.
Received: 31 October 2023
Accepted: 24 August 2024
Published: 18 September 2024
Abstract
Multivariate random forests (MVRFs) are an extension of tree-based ensembles to examine multivariate responses. MVRFs can be particularly helpful where some of the responses exhibit sparse (e.g., zero-inflated) distributions, making borrowing strength from correlated features attractive. Tree-based algorithms select features using variable importance measures (VIMs) that score each covariate based on the strength of dependence of the model on that variable. In this paper, we develop and propose new VIMs for MVRFs. Specifically, we focus on a variable's ability to achieve split improvement, i.e., the difference in the responses between the left and right nodes obtained after splitting the parent node, for a multivariate response. Our proposed VIMs improve on the default naïve VIM in existing software and allow us to investigate the strength of dependence both globally and on a per-response basis. Our simulation studies show that our proposed VIMs recover the true predictors better than naïve measures. We demonstrate the use of the VIMs for variable selection in two empirical applications: the first uses Amazon Marketplace data to predict Buy Box prices of multiple brands in a category, and the second uses ecology data to predict co-occurrence of multiple, rare bird species. A feature of both data sets is that some outcomes are sparse, exhibiting a substantial proportion of zeros or fixed values. In both cases, the proposed VIMs, when used for variable screening, give superior predictive accuracy over naïve measures.
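The split improvement described above can be sketched in code. The following is a minimal illustration, not the paper's exact definition: it measures, for one candidate split, the reduction in within-node sum of squares summed across all response columns, and a per-response variant that keeps one value per outcome (the basis for per-response importances). The function names and the equal weighting of responses are assumptions for illustration.

```python
import numpy as np

def split_improvement(Y_parent, left_mask):
    """Total split improvement (SI) for a multivariate response:
    reduction in within-node sum of squares, summed over responses,
    when the parent node is split into left/right children.
    (Illustrative sketch; the paper's SI definition may differ.)"""
    Y_left = Y_parent[left_mask]
    Y_right = Y_parent[~left_mask]

    def node_sse(Y):
        # Within-node sum of squared deviations, totaled over responses.
        if len(Y) == 0:
            return 0.0
        return float(((Y - Y.mean(axis=0)) ** 2).sum())

    return node_sse(Y_parent) - node_sse(Y_left) - node_sse(Y_right)

def split_improvement_per_response(Y_parent, left_mask):
    """Same quantity, but kept separately for each response column,
    enabling per-response importance as described in the abstract."""
    Y_left, Y_right = Y_parent[left_mask], Y_parent[~left_mask]

    def sse(Y):
        if len(Y) == 0:
            return 0.0
        return ((Y - Y.mean(axis=0)) ** 2).sum(axis=0)

    return sse(Y_parent) - sse(Y_left) - sse(Y_right)
```

A variable's importance in a forest would then aggregate these improvements over all splits made on that variable, either totaled (global VIM) or column-wise (per-response VIM).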
Supplementary material
In our Online Supplement, we include pseudocode for the MVRF ensemble build using the sub-bagging procedure, the proposed SI-based VIMs with significant splits, and the proposed RFE strategy of our iterative variable selection method. We also include the variable choices in the simulation design, as well as box plots and confidence intervals for the top features selected by our proposed VIMs in the Amazon application on the Luggage category.
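The recursive feature elimination (RFE) strategy mentioned above follows the general scheme of Guyon et al. (2002): repeatedly score the surviving features and discard the weakest fraction. A minimal sketch, with an externally supplied scoring function standing in for a model-based VIM (the name `score_features` and the drop fraction are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def rfe_rank(X, y, score_features, drop_frac=0.2):
    """Recursive feature elimination, sketched: score the surviving
    features, drop the weakest `drop_frac` fraction, and repeat.
    `score_features(X, y, idx)` returns one importance score per
    feature index in `idx`; in the paper's method this role is
    played by the proposed SI-based VIMs."""
    surviving = list(range(X.shape[1]))
    elimination_order = []
    while surviving:
        scores = score_features(X, y, surviving)
        order = np.argsort(scores)  # positions of weakest first
        n_drop = max(1, int(drop_frac * len(surviving)))
        dropped = set(int(j) for j in order[:n_drop])
        for j in sorted(dropped):
            elimination_order.append(surviving[j])
        surviving = [f for k, f in enumerate(surviving) if k not in dropped]
    # Features eliminated last are the most important.
    return elimination_order[::-1]
```

In a screening workflow, one would keep the top-ranked features and refit the MVRF on that reduced set.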