Journal of Data Science (School of Statistics, Renmin University of China), ISSN 1680-743X. DOI: 10.6339/21-JDS1025. Statistical Data Science.

Predictive Comparison Between Random Machines and Random Forests

Mateus Maia (Department of Math & Statistics, Maynooth University, Maynooth, Ireland), Arthur R. Azevedo (Department of Statistics, Federal University of Bahia, Salvador-BA, Brazil), Anderson Ara (Department of Statistics, Federal University of Paraná, Curitiba-PR, Brazil). Corresponding author. Email: alsouzara@gmail.com.

Supplementary Material A

The RM was also implemented in the R language and can be used through the rmachines package, available and documented on GitHub at https://github.com/MateusMaiaDS/rmachines. For an overall description of how to reproduce the results from this article, see the README at https://mateusmaiads.github.io/rmachines_and_randomforest/.
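As a quick illustration of the package workflow, the sketch below fits a random machines classifier on a toy train/test split. This is a minimal sketch only: the function name random_machines() and its arguments are assumptions about the package interface, not confirmed API — check the repository documentation before use.

```r
# Minimal usage sketch for the rmachines package.
# The interface below is assumed; see the repository README at
# https://github.com/MateusMaiaDS/rmachines for the authoritative docs.
# Installation (development version from GitHub):
#   install.packages("remotes")
#   remotes::install_github("MateusMaiaDS/rmachines")
library(rmachines)

# Toy binary classification problem built from the iris data
set.seed(42)
data <- iris[iris$Species != "setosa", ]
data$Species <- droplevels(data$Species)
idx   <- sample(seq_len(nrow(data)), size = 0.7 * nrow(data))
train <- data[idx, ]
test  <- data[-idx, ]

# Fit a random machines ensemble and predict on the held-out set
# (function and argument names are assumptions, not confirmed API)
fit  <- random_machines(Species ~ ., train = train, test = test)
pred <- predict(fit)
mean(pred == test$Species)  # hold-out accuracy
```

The formula interface mirrors standard R modeling conventions; in practice one would tune the kernel-weighting and bootstrap settings described in the paper rather than rely on defaults.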

Supplementary Material B

Presents a descriptive analysis of the three real-world applications described in Section 5, along with additional results on the comparison between RM and RF.

Ensemble techniques have been gaining strength among machine learning models for supervised tasks due to their strong predictive capacity compared with some traditional approaches. The random forest is considered one of the off-the-shelf algorithms, owing to its flexibility and robust performance in both regression and classification tasks. In this paper, the random machines method is applied to simulated and benchmark data sets and compared with the consolidated random forest models. The results from the simulated models show that the random machines method has better predictive performance than random forest in most of the investigated data sets. Three real-data applications demonstrate that random machines can be used to solve real-world problems with competitive performance.

Keywords: bagging; ensemble; support vector machines

The authors gratefully acknowledge the financial support of the Brazilian research funding agency CAPES (Federal Agency for the Support and Improvement of Higher Education). M.M.'s work was supported by a Science Foundation Ireland Career Development Award, Grant 17/CDA/4695.