JDS

Journal of Data Science

ISSN 1683-8602, 1680-743X

School of Statistics, Renmin University of China

JDS1025

10.6339/21-JDS1025

Statistical Data Science

Predictive Comparison Between Random Machines and Random Forests

Mateus Maia1, Arthur R. Azevedo2, and Anderson Ara3,∗

1Department of Math & Statistics, Maynooth University, Maynooth, Ireland
2Department of Statistics, Federal University of Bahia, Salvador-BA, Brazil
3Department of Statistics, Federal University of Paraná, Curitiba-PR, Brazil

∗Corresponding author. Email: alsouzara@gmail.com.

Journal of Data Science, 19(4): 593–614, 2021

28 October 2021

Supplementary Material A

The random machines (RM) method is also implemented in R and can be used through the rmachines package, available and documented on GitHub at https://github.com/MateusMaiaDS/rmachines. For an overall description of how to reproduce the results of this article, see the README at https://mateusmaiads.github.io/rmachines_and_randomforest/.
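As a minimal sketch, the package could be installed from the GitHub repository named above with the remotes package. This assumes the repository is laid out as a standard installable R package; the installation route shown here is an assumption and is not taken from the article, so the repository documentation remains the authoritative source.

```r
# Install the 'remotes' helper if it is not already available.
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}

# Install rmachines from the GitHub repository cited in the text.
# Assumption: the repository root is a standard R package.
remotes::install_github("MateusMaiaDS/rmachines")

# Load the package; its functions are documented in the repository.
library(rmachines)
```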

Supplementary Material B

Presents a descriptive analysis of the three real-world applications shown in Section 5, together with additional results on the comparison of random machines (RM) and random forests (RF).

5 August 2021; 19 September 2021

This is a free to read article.

Ensemble techniques have been gaining strength among machine learning models for supervised tasks because of their strong predictive capacity compared with some traditional approaches. The random forest is considered one of the off-the-shelf algorithms owing to its flexibility and robust performance in both regression and classification tasks. In this paper, the random machines method is applied to simulated and benchmark data sets and compared with the well-established random forest models. The results from simulated models show that the random machines method has better predictive performance than random forest in most of the investigated data sets. Three real-data applications demonstrate that random machines may be used to solve real-world problems with competitive payoff.

Keywords: bagging; ensemble; support vector machines

The authors gratefully acknowledge the financial support of the Brazilian research funding agency CAPES (Federal Agency for the Support and Improvement of Higher Education). M.M.’s work was supported by a Science Foundation Ireland Career Development Award, Grant 17/CDA/4695.
