Journal of Data Science

Predictive Comparison Between Random Machines and Random Forests
Volume 19, Issue 4 (2021), pp. 593–614
Mateus Maia, Arthur R. Azevedo, Anderson Ara

https://doi.org/10.6339/21-JDS1025
Pub. online: 28 October 2021
Type: Statistical Data Science

Received
5 August 2021
Accepted
19 September 2021
Published
28 October 2021

Abstract

Ensemble techniques have been gaining strength among machine learning models for supervised tasks due to their strong predictive capacity compared with traditional approaches. The random forest is considered one of the off-the-shelf algorithms due to its flexibility and robust performance in both regression and classification tasks. In this paper, the random machines method is applied to simulated and benchmark data sets and compared with consolidated random forest models. The results on simulated data show that random machines achieve better predictive performance than random forests in most of the investigated data sets. Three real-data applications demonstrate that random machines can be used to solve real-world problems with competitive performance.
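To make the comparison concrete, the sketch below is a minimal R illustration, not the authors' experimental design: it bags radial-kernel SVMs by bootstrap resampling and majority vote, then contrasts their held-out accuracy with a random forest on a simulated two-class problem. The random machines method additionally draws kernel functions at random and weights models by their estimated performance, which this plain-bagging sketch omits; the data-generating process and all settings here are invented for illustration.

library(e1071)         # svm()
library(randomForest)  # randomForest()

set.seed(42)

# Simulated two-class problem (invented for illustration).
n <- 500; p <- 5
X <- as.data.frame(matrix(rnorm(n * p), n, p))
y <- factor(ifelse(X[[1]] + X[[2]]^2 + rnorm(n, sd = 0.5) > 1, "A", "B"))
train_id <- sample(n, 0.7 * n)
train <- cbind(X[train_id, ], y = y[train_id])
test  <- cbind(X[-train_id, ], y = y[-train_id])

# Bagged SVMs: fit B models on bootstrap samples, aggregate by majority vote.
B <- 25
votes <- sapply(seq_len(B), function(b) {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  fit  <- svm(y ~ ., data = boot, kernel = "radial")
  as.character(predict(fit, test))
})
bag_pred <- factor(apply(votes, 1, function(v) names(which.max(table(v)))),
                   levels = levels(y))

# Random forest baseline with default hyperparameters.
rf <- randomForest(y ~ ., data = train)
rf_pred <- predict(rf, test)

# Held-out accuracy of the two ensembles.
c(bagged_svm = mean(bag_pred == test$y),
  random_forest = mean(rf_pred == test$y))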

Supplementary material

 Supplementary Material A
The RM method is also implemented in the R language and can be used through the rmachines package, available and documented on GitHub at https://github.com/MateusMaiaDS/rmachines (a hedged installation and usage sketch follows Supplementary Material B below). For an overall description of how to reproduce the results from this article, see the README at https://mateusmaiads.github.io/rmachines_and_randomforest/.
 Supplementary Material B
Presents a descriptive analysis of the three real-world applications shown in Section 5, along with additional results on the comparison of RM and RF.
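Building on Supplementary Material A, the snippet below shows how the package could be installed from the GitHub repository cited there. The fitting call is an assumption about the interface: the function name random_machines() and its arguments are illustrative only, so consult the repository README for the actual usage.

# Install the package from the GitHub repository cited above.
# install.packages("remotes")
remotes::install_github("MateusMaiaDS/rmachines")
library(rmachines)

# Hypothetical fitting call (function name and arguments are assumptions,
# not the confirmed package API); 'train' and 'test' reuse the simulated
# split from the sketch after the abstract.
fit <- random_machines(y ~ ., train = train, test = test)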



Copyright
© 2021 The Author(s)
This is a free-to-read article.

Keywords
bagging, ensemble, support vector machines

Funding
The authors gratefully acknowledge the financial support of the Brazilian research funding agency CAPES (Federal Agency for the Support and Improvement of Higher Education). M.M.’s work was supported by a Science Foundation Ireland Career Development Award, Grant 17/CDA/4695.

Metrics

Since February 2021: 1796 article views, 705 PDF downloads.

Journal of Data Science

  • Online ISSN: 1683-8602
  • Print ISSN: 1680-743X
