Variable Importance Measures for Multivariate Random Forests
Pub. online: 18 September 2024
Type: Statistical Data Science
Open Access
† This work is part of the first author's dissertation research, with the second and third authors as dissertation advisors.
Received: 31 October 2023
Accepted: 24 August 2024
Published: 18 September 2024
Abstract
Multivariate random forests (MVRFs) are an extension of tree-based ensembles to examine multivariate responses. MVRFs can be particularly helpful where some of the responses exhibit sparse (e.g., zero-inflated) distributions, making borrowing strength from correlated features attractive. Tree-based algorithms select features using variable importance measures (VIMs) that score each covariate based on the strength of dependence of the model on that variable. In this paper, we develop and propose new VIMs for MVRFs. Specifically, we focus on a variable's ability to achieve split improvement, i.e., the difference in the responses between the left and right nodes obtained after splitting the parent node, for a multivariate response. Our proposed VIMs improve on the default naïve VIM in existing software and allow us to investigate the strength of dependence both globally and on a per-response basis. Our simulation studies show that our proposed VIMs recover the true predictors better than naïve measures. We demonstrate the use of the VIMs for variable selection in two empirical applications: the first uses Amazon Marketplace data to predict Buy Box prices of multiple brands in a category, and the second uses ecology data to predict co-occurrence of multiple, rare bird species. A feature of both data sets is that some outcomes are sparse, exhibiting a substantial proportion of zeros or fixed values. In both cases, the proposed VIMs, when used for variable screening, give superior predictive accuracy over naïve measures.
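The split improvement described above can be sketched in code. The following is a minimal illustration, not the paper's exact definition: it measures, for one candidate split, the reduction in within-node sum of squares summed across all response columns, and a per-response variant that keeps one value per outcome (the basis for per-response importances). The function names and the equal weighting of responses are assumptions for illustration.

```python
import numpy as np

def split_improvement(Y_parent, left_mask):
    """Total split improvement (SI) for a multivariate response:
    reduction in within-node sum of squares, summed over responses,
    when the parent node is split into left/right children.
    (Illustrative sketch; the paper's SI definition may differ.)"""
    Y_left = Y_parent[left_mask]
    Y_right = Y_parent[~left_mask]

    def node_sse(Y):
        # Within-node sum of squared deviations, totaled over responses.
        if len(Y) == 0:
            return 0.0
        return float(((Y - Y.mean(axis=0)) ** 2).sum())

    return node_sse(Y_parent) - node_sse(Y_left) - node_sse(Y_right)

def split_improvement_per_response(Y_parent, left_mask):
    """Same quantity, but kept separately for each response column,
    enabling per-response importance as described in the abstract."""
    Y_left, Y_right = Y_parent[left_mask], Y_parent[~left_mask]

    def sse(Y):
        if len(Y) == 0:
            return 0.0
        return ((Y - Y.mean(axis=0)) ** 2).sum(axis=0)

    return sse(Y_parent) - sse(Y_left) - sse(Y_right)
```

A variable's importance in a forest would then aggregate these improvements over all splits made on that variable, either totaled (global VIM) or column-wise (per-response VIM).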
Supplementary material
In our Online Supplement, we include pseudocode for the MVRF ensemble build using the sub-bagging procedure, the proposed SI-based VIMs with significant splits, and the proposed RFE strategy of our iterative variable selection method. We also include the variable choices in the simulation design, as well as box plots and confidence intervals for the top features selected by our proposed VIMs in the Amazon application on the Luggage category.
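The recursive feature elimination (RFE) strategy mentioned above follows the general scheme of Guyon et al. (2002): repeatedly score the surviving features and discard the weakest fraction. A minimal sketch, with an externally supplied scoring function standing in for a model-based VIM (the name `score_features` and the drop fraction are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def rfe_rank(X, y, score_features, drop_frac=0.2):
    """Recursive feature elimination, sketched: score the surviving
    features, drop the weakest `drop_frac` fraction, and repeat.
    `score_features(X, y, idx)` returns one importance score per
    feature index in `idx`; in the paper's method this role is
    played by the proposed SI-based VIMs."""
    surviving = list(range(X.shape[1]))
    elimination_order = []
    while surviving:
        scores = score_features(X, y, surviving)
        order = np.argsort(scores)  # positions of weakest first
        n_drop = max(1, int(drop_frac * len(surviving)))
        dropped = set(int(j) for j in order[:n_drop])
        for j in sorted(dropped):
            elimination_order.append(surviving[j])
        surviving = [f for k, f in enumerate(surviving) if k not in dropped]
    # Features eliminated last are the most important.
    return elimination_order[::-1]
```

In a screening workflow, one would keep the top-ranked features and refit the MVRF on that reduced set.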