Predictive Comparison Between Random Machines and Random Forests

Ensemble techniques have been gaining strength among machine learning models for supervised tasks due to their great predictive capacity when compared with some traditional approaches. The random forest is considered one of the off-the-shelf algorithms due to its flexibility and robust performance in both regression and classification tasks. In this paper, the random machines method is applied to simulated and benchmarking data sets in order to be compared with the consolidated random forest models. The results from the simulated models show that the random machines method has a better predictive performance than the random forest in most of the investigated data sets. Three real data situations demonstrate that the random machines may be used to solve real-world problems with a competitive payoff.


Introduction
Ensemble methods are machine learning algorithms that combine multiple models in order to build a stronger one (Dietterich, 2000). In general, the strategy for combining the models can be defined by two types of ensembles: i) the bagging approach (Breiman, 1996), based on independent bootstrap models aggregated by the majority vote (classification tasks) or by the average of the models' predictions (regression tasks), and ii) the boosting approach (Freund and Schapire, 1997), which generates sequentially aggregated models using different weights based on their previous errors.
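As a minimal illustration of the bagging idea described above (not the paper's implementation), the sketch below fits B weak "decision stump" classifiers on bootstrap resamples of a one-dimensional data set and aggregates them by majority vote; all names and data are illustrative.

```python
import random
from collections import Counter

def bagging_predict(fit_base, xs, ys, B=25, seed=0):
    """Toy bagging: fit B base models on bootstrap resamples, aggregate by majority vote."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]      # bootstrap: n indices drawn with replacement
        models.append(fit_base([xs[i] for i in idx], [ys[i] for i in idx]))
    def predict(x):
        votes = Counter(m(x) for m in models)           # each base model casts one vote
        return votes.most_common(1)[0][0]               # majority vote (classification)
    return predict

def fit_stump(xs, ys):
    """A deliberately weak, unstable base learner: a 1-D threshold rule ('decision stump')."""
    t = sum(xs) / len(xs)                               # split at the bootstrap-sample mean
    maj = lambda labs, fb: Counter(labs).most_common(1)[0][0] if labs else fb
    above = maj([y for x, y in zip(xs, ys) if x >= t], 1)
    below = maj([y for x, y in zip(xs, ys) if x < t], -1)
    return lambda x: above if x >= t else below
```

For a regression task, the majority vote would simply be replaced by the average of the B predictions.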
The bagging procedure is widely used in many real-world applications, and several studies show the effectiveness of this approach (Liang et al., 2011; Syarif et al., 2012; Zareapoor et al., 2015; Bhavan et al., 2019). When Breiman (1996) first introduced the bagging procedure, he emphasized that the success of this method relies on the strength and the instability of the single models that compose the bagging algorithm. The strength of a model can be understood as its predictive capacity. The instability concept is characterized by the idea that if small changes in bootstrap replications of a sample of observations produce large changes in the bootstrap models, then the procedure can be considered unstable.
Although both of these characteristics are important for obtaining robust results with the bagging procedure, it is not simple to optimize them simultaneously. Ho (1998) presents this trade-off between the strength and the instability of models: generally, strong models, i.e., models with high accuracy, are more stable, which implies a greater correlation between models of this type, and vice versa. Considering this aspect, Ho (1998) proposed the use of a random subspace method for constructing decision forests, where a random selection of features is made to split each node. Her work showed that it was possible to reduce the correlation of tree models without reducing their accuracy.
Later, this main idea of using random subspaces was formalized by Breiman (2001), who presented the random forest (RF) method. Breiman (2001) showed that the RF procedure creates base models that are strong (i.e., have high predictive power) and uncorrelated, resulting in a robust and consistent ensemble model (Scornet et al., 2015). Its flexibility is also shown by articles that present the use of random forests for handling missing data (Tang and Ishwaran, 2017) and its robustness for estimating class probabilities (Sage et al., 2020). The effectiveness of RF is also demonstrated in the literature with diverse real-world applications (Pal, 2005; Bosch et al., 2007; Statnikov et al., 2008; Futoma et al., 2015; Rodriguez-Galiano et al., 2015; Ouedraogo et al., 2019).

Support vector machines (SVM) are very efficient and popular tools for classification and regression with several perks. SVMs are rooted in statistical learning theory (Vapnik, 1999), and the method has a globally optimal solution obtainable by solving a convex optimization problem, while problems of local minima disrupt other common contemporary approaches, such as neural networks. An SVM also handles high-dimensional data, since it accounts for the non-linearity inherent in the data through the incorporation of kernel functions (Moguerza and Muñoz, 2006; Shivaswamy et al., 2007; Land and Schaffer, 2020; Kim and Kim, 2020). Despite its great efficiency, the choice of the kernel function is crucial in SVM applications, and the method can also be outperformed by ensemble models such as the random forest (Fernández-Delgado et al., 2014; Huo et al., 2016).
Inspired by this concept of random subspaces from random forests, the random machines (RM) ensemble approach was designed. The RM method is a new ensemble procedure that uses the SVM as a base learner and applies an innovative random sampling of kernel functions to add instability and benefit the bagging structure. Ara et al. (2021) demonstrated that this algorithm successfully reduces the correlation between base learners while maintaining their strength, resulting in a better predictive performance than the traditional SVM and ensembles of SVMs.
In this paper, the RM was compared with the RF to show that this recent ensemble approach is competitive and can even achieve better predictive performance than a robust and consolidated method such as the RF in classification and regression tasks. In Section 2, an overview of the methodology of each algorithm is presented. In Sections 3 and 4, both methods are applied and compared over simulated and benchmarking data sets, respectively. Section 5 reports the application of the RM to successfully solve three real-world problems. Section 6 closes the paper with the final comments.

Random Forests
The RF predictor is composed of multiple tree models $f_i(x)$, $i = 1, \ldots, B$, where each $f_i$ is a tree estimated on a random subset of size $m$, $m < p$, of the $p$-dimensional predictor vector $x \in \mathbb{R}^p$ for an outcome variable $y$. Each tree is built on a bootstrap sample, and $B$ is the total number of trees. Breiman (2002) refers to $m$ as mtry and suggests that this value should be equal to $p/3$ for regression tasks and $\sqrt{p}$ for classification tasks. Other parameters are nodesize, the minimum number of observations inside a terminal node, and the number of trees $B$, also named ntree, that compose the model.
The final prediction of the random forest is given by the collection of trees and changes depending on the prediction task. For the regression context, the final prediction for a new observation $x^*$ is given by the average of the tree predictions,
$$\hat{f}(x^*) = \frac{1}{B} \sum_{i=1}^{B} f_i(x^*).$$
For the classification context, it is given by the majority vote among the trees,
$$\hat{f}(x^*) = \operatorname*{arg\,max}_{k} \sum_{i=1}^{B} \mathbb{1}\{f_i(x^*) = k\}.$$
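The two aggregation rules above, together with Breiman's suggested mtry defaults, can be sketched as follows; this is an illustration of the formulas, not the randomForest implementation.

```python
import math
from collections import Counter

def rf_regression(tree_preds):
    """Regression forest: average the B tree predictions."""
    return sum(tree_preds) / len(tree_preds)

def rf_classification(tree_preds):
    """Classification forest: majority vote among the B tree predictions."""
    return Counter(tree_preds).most_common(1)[0][0]

def default_mtry(p, task):
    """Breiman's suggested mtry: p/3 for regression, sqrt(p) for classification."""
    return max(1, p // 3) if task == "regression" else max(1, round(math.sqrt(p)))
```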

Random Machines
The RM method (Ara et al., 2021) uses SVMs (Cortes and Vapnik, 1995; Drucker et al., 1997) as base learners in the bagging procedure, with a random sample of kernel functions to build them. The methodology of this ensemble procedure differs for regression and classification tasks. For a classification task, given the observations $\{(x_i, y_i)\}_{i=1}^{n}$, where $n$ is the sample size, the SVM (Cortes and Vapnik, 1995) calculates an optimal hyperplane that separates the observations' classes. Its coordinates $w$ are given by the minimization in Equation (1),
$$\min_{w,\, b,\, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i, \qquad (1)$$
subject to the constraints $y_i(w \cdot x_i + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$, $\forall i = 1, \ldots, n$, where $C > 0$ is a regularization parameter. The solution, using the Lagrangian dual optimization for the soft-margin problem (Fletcher, 2013), is given by
$$w = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad (2)$$
where the $\alpha_i$ are the Lagrange multipliers. The prediction for a new observation $x^*$ is given by
$$\hat{y}(x^*) = \operatorname{sgn}(w \cdot x^* + b),$$
where sgn(.) is the sign function.
To deal with non-linearity in support vector models, the kernel trick is used to transform the data from the input space into a high-dimensional space where the observations are linearly separable. This transformation is made through the kernel functions $K(x, y) = \varphi(x) \cdot \varphi(y)$. The most common kernel functions in SVM applications were used in this paper and are presented in Table 1, where $\gamma \in \mathbb{R}^{+}$, $d \in \mathbb{N}$.
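The four kernels considered in the paper (linear, polynomial, Gaussian, Laplacian) can be sketched directly from their usual definitions; the offset and scaling of the polynomial kernel below follow one common parametrization and may differ from the exact form in Table 1.

```python
import math

def k_linear(x, y):
    """Linear kernel: plain dot product."""
    return sum(a * b for a, b in zip(x, y))

def k_polynomial(x, y, gamma=1.0, d=2, c0=1.0):
    """Polynomial kernel of degree d (c0 offset is a common, assumed parametrization)."""
    return (gamma * k_linear(x, y) + c0) ** d

def k_gaussian(x, y, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def k_laplacian(x, y, gamma=1.0):
    """Laplacian kernel: exp(-gamma * L1 distance)."""
    return math.exp(-gamma * sum(abs(a - b) for a, b in zip(x, y)))
```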
The use of different kernel functions is one of the main ideas supporting the efficiency of the RM and differentiating it from traditional ensemble approaches. Through random sampling, it is possible to have a broader representation of the data, since each kernel visits different feature spaces during the bagging procedure. Also, the proportions of visits to these feature spaces are defined by weights based on the predictive capacity of every single kernel. Finally, a model averaging is performed using weights based on the out-of-bag accuracy. Therefore, the random selection of feature spaces increases the diversity of the base learners without decreasing their accuracy. The idea of increasing diversity while maintaining accuracy in bagging was also demonstrated in works that use kNN classifiers as base models (Gul et al., 2018). In the following, we explain the entire process used by the RM in detail. The classification RM algorithm starts by generating support vector models $h_r(x)$, $r = 1, \ldots, R$, where $R$ is the number of different kernel functions, over a training set. Afterwards, each model is validated over a test set $\{(x_i, y_i)\}_{i=1}^{T}$, and an accuracy vector $ACC \in \mathbb{R}^R$ is obtained, where each coordinate refers to the predictive performance of the support vector model with the respective kernel function. For instance, in this paper we consider the four kernel functions presented in Table 1 ($R = 4$), so the vector would be $ACC = (ACC_{Lin.}, ACC_{Pol.}, ACC_{Gau.}, ACC_{Lap.})$.
Subsequently, a vector of probabilities $\lambda \in \mathbb{R}^R$ is calculated using Equation (3) in order to weight the random selection of the kernel functions used in the bootstrap SVM base learners. Each term of $\lambda$ is given by
$$\lambda_r = \frac{\log\left(\frac{ACC_r}{1 - ACC_r}\right)}{\sum_{j=1}^{R} \log\left(\frac{ACC_j}{1 - ACC_j}\right)}. \qquad (3)$$
In order to model the base learners that compose the RM, $B$ bootstrap samples are generated, and $B$ support vector models $g_b(x)$ are estimated on these samples, each using a kernel function sampled with probability $\lambda_r$. The probability $\lambda_r$ is higher when the kernel function used in $h_r(x)$ correctly predicted observations from the test set. Therefore, the kernel functions with higher accuracy will appear more often when the random kernel selection for each bootstrap model is made. If a kernel function applied in $h_r(x)$ does no better than a random choice, then $ACC_r$ will be close to 0.5 in the binary case, and the probability of selecting that kernel function is near zero.
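The kernel-sampling step can be sketched in Python. The log-odds weighting below is an assumption chosen to match the description in the text (a kernel at chance level, ACC = 0.5, receives probability near zero); the rmachines package may use a different parametrization, and the kernel names and accuracy values are illustrative.

```python
import math
import random

def kernel_probs(acc):
    """Map per-kernel test accuracies to sampling probabilities via normalized log-odds.
    A kernel no better than chance (ACC <= 0.5) gets weight 0."""
    w = [max(math.log(a / (1 - a)), 0.0) for a in acc]   # clip kernels at or below chance
    s = sum(w)
    return [x / s for x in w] if s > 0 else [1 / len(acc)] * len(acc)

# Illustrative accuracies for the R = 4 kernels of Table 1.
acc = {"lin": 0.55, "pol": 0.70, "gau": 0.90, "lap": 0.50}
probs = dict(zip(acc, kernel_probs(list(acc.values()))))

# Sample one kernel per bootstrap model, B = 100.
rng = random.Random(1)
kernels = rng.choices(list(probs), weights=list(probs.values()), k=100)
```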
Using the out-of-bag samples as a test set, the predictive performance of each classifier $g_b(x)$, $b = 1, \ldots, B$, is evaluated, generating a new accuracy vector $(ACC_1, \ldots, ACC_B) \in \mathbb{R}^B$. A weight $w_b$ is then calculated for each model prediction using Equation (4),
$$w_b = \frac{\log\left(\frac{ACC_b}{1 - ACC_b}\right)}{\sum_{j=1}^{B} \log\left(\frac{ACC_j}{1 - ACC_j}\right)}. \qquad (4)$$
The final classification is given by Equation (5),
$$G(x) = \operatorname{sgn}\left(\sum_{b=1}^{B} w_b\, g_b(x)\right). \qquad (5)$$
Considering the multi-class case, where $K$ is the number of classes, the final decision model is given by
$$G(x) = \operatorname*{arg\,max}_{k \in \{1, \ldots, K\}} \sum_{b=1}^{B} w_b\, \mathbb{1}\{g_b(x) = k\}.$$
In regression tasks, the target variable is no longer categorical but continuous, so the RM approach needs some modifications to the general procedure. The support vector regression (SVR) method (Drucker et al., 1997) is used as the base learner, and the evaluation measure is no longer the accuracy but the root mean squared error (RMSE). The probability vector for sampling a kernel function becomes $\lambda = (\lambda_1, \ldots, \lambda_R)$ and is now given by Equation (6),
$$\lambda_r = \frac{e^{-\beta \delta_r}}{\sum_{j=1}^{R} e^{-\beta \delta_j}}, \qquad (6)$$
$\forall r = 1, \ldots, R$. Here $\delta = (\delta_1, \ldots, \delta_R)$ represents the standardized (i.e., divided by its standard deviation) RMSE of the support vector regression models $h_r(x)$ over the test set, and $\beta$ is a coefficient that tunes the penalty on the generalization error of each model. The probability $\lambda_r$ is higher when the kernel function used in $h_r(x)$ has a lower generalization error, measured by the RMSE over the test set. Consequently, the kernels with lower $\delta_r$ will appear more frequently when the random kernel selection for each bootstrap model is performed.
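For the regression case, the text describes an exponential weighting of the standardized RMSE tuned by β. A minimal sketch of that scheme, under the assumption that Equation (6) has the softmax-like form described (lower δ, higher probability), is:

```python
import math

def regression_kernel_probs(rmse, beta=2.0):
    """Kernels with lower standardized RMSE (delta) get higher sampling probability.
    Assumes an exponential (softmax-like) form for Equation (6)."""
    m = sum(rmse) / len(rmse)
    sd = (sum((r - m) ** 2 for r in rmse) / len(rmse)) ** 0.5
    delta = [r / sd for r in rmse]              # standardize by the standard deviation, as in the text
    w = [math.exp(-beta * d) for d in delta]    # beta tunes how hard bad kernels are penalized
    s = sum(w)
    return [x / s for x in w]
```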
Algorithm 1: Random machines (sketch)
1. Train $h_r(x)$, one support vector model per kernel function $r = 1, \ldots, R$, and compute the sampling probabilities $\lambda_r$ from their test-set performance;
2. For $b = 1, \ldots, B$: draw a bootstrap sample; fit the model $g_b(x)$ by sampling a kernel function with probability $\lambda_r$; assign the weight $w_b$ using the out-of-bag samples $OOB_b$;
3. Calculate the final ensemble prediction $G(x)$.
Both approaches are summarized in pseudo-code for classification and regression tasks in Algorithm 1. The random selection of kernel functions enables visiting multiple kernel spaces, improving the representation of the algorithm's learning and making the RM a different ensemble method when compared with traditional SVM ensemble approaches.

Artificial Data Application
To compare RM and RF concerning their predictive capacity, both methods were applied to different simulated scenarios. The validation technique was a repeated holdout, with thirty repetitions and a 70%-30% training-test split ratio. This validation setting was selected to measure the generalization capacity for predicting new observations consistently (Larsen and Goutte, 1999). Also, in order to get the best out of each algorithm, a grid search was performed to select the best hyperparameters. For the RF method, the grid search covered:
• mtry: the number of variables randomly sampled as candidates at each split;
• nodesize ∈ {…; 10; 25}: the minimum number of observations in terminal nodes;
• ntree ∈ {100; 500; 1000}: the number of trees that compose the random forest.
The choice of these hyperparameters is justified by Probst et al. (2019), who evaluated them as the most influential parameters in the RF algorithm. Concerning the RM grid search, the hyperparameters ranged over C = {0.1; 0.5; 1; 5} (the cost parameter), γGau. = {0.1; 0.5; 1; 5} (the γ parameter for the Gaussian kernel presented in Table 1), and γLap. = {0.1; 0.5; 1; 5}. The other parameters, the polynomial degree d = 2, the SVR parameter ε = 0.1, and β = 2, were kept at their defaults, since those values yielded reasonably good results for most data sets, making a grid search over them unnecessary.
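The tuning protocol above (grid search scored by repeated holdout) can be sketched generically; the toy base learner and its C parameter below are purely illustrative, not the RF or RM models.

```python
import itertools
import random

def repeated_holdout_score(fit, params, data, labels, reps=30, split=0.7, seed=0):
    """Average test accuracy of `fit(**params)` over `reps` random 70-30 holdout splits."""
    rng = random.Random(seed)
    n = len(data)
    scores = []
    for _ in range(reps):
        idx = list(range(n))
        rng.shuffle(idx)
        cut = int(split * n)
        tr, te = idx[:cut], idx[cut:]
        model = fit([data[i] for i in tr], [labels[i] for i in tr], **params)
        scores.append(sum(model(data[i]) == labels[i] for i in te) / len(te))
    return sum(scores) / reps

def grid_search(fit, grid, data, labels):
    """Exhaustive search over all hyperparameter combinations; returns (best_params, best_score)."""
    best = None
    for combo in itertools.product(*grid.values()):
        params = dict(zip(grid, combo))
        score = repeated_holdout_score(fit, params, data, labels)
        if best is None or score > best[1]:
            best = (params, score)
    return best

def fit_sign(xs, ys, C=1.0):
    """Toy classifier: predicts the sign of C * x (C stands in for a real hyperparameter)."""
    return lambda x: 1 if C * x >= 0 else -1
```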

Classification Task
In the classification context, three scenarios were generated with the objective of exercising different data behaviors. Simulation 1 concerns a binary classification problem where $y \in \{1, -1\}$ and each class is sampled from a different multivariate normal distribution. The Class 1 observations are sampled from a distribution with mean vector $\mu_1 = 0_p$, with $0_p$ the $p$-dimensional zero vector, and covariance matrix $\Sigma_1 = 4 I_p$. The Class $-1$ observations are sampled from a multivariate normal with mean vector $\mu_{-1} = 4 \times 1_p$, with $1_p$ the $p$-dimensional vector of ones, and covariance matrix $\Sigma_{-1} = I_p$. The Simulation 1 configuration presents a setting where the two groups are easily linearly separable. Simulation 2 follows the same pattern as Simulation 1, but the parameters of the multivariate normal distribution of each class differ: the Class 1 instances are sampled from a distribution with mean vector $\mu_1 = 0_p$ and covariance matrix $\Sigma_1 = 4 I_p$, and the Class $-1$ observations from a multivariate normal with mean vector $\mu_{-1} = 2 \times 1_p$ and covariance matrix $\Sigma_{-1} = I_p$. In this second scenario, the two classes are no longer easily separable by a hyperplane as they were in the first one. The Simulation 3 data set explores non-linear behavior in a binary classification task through a circle uniformly distributed in the middle of a $p$-dimensional cube.
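Simulation 1 is straightforward to reproduce; the sketch below draws the two classes from the stated normals with diagonal covariances (using per-coordinate draws, which is equivalent for $\sigma^2 I_p$ covariance matrices). The function name and seed are illustrative.

```python
import random

def simulate_scenario1(n, p, ratio=0.5, seed=42):
    """Class 1 ~ N(0_p, 4 I_p); Class -1 ~ N(4 * 1_p, I_p). `ratio` is the Class 1 share."""
    rng = random.Random(seed)
    n1 = int(ratio * n)
    X, y = [], []
    for i in range(n):
        if i < n1:
            X.append([rng.gauss(0.0, 2.0) for _ in range(p)])  # sd = sqrt(4) = 2
            y.append(1)
        else:
            X.append([rng.gauss(4.0, 1.0) for _ in range(p)])
            y.append(-1)
    return X, y
```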
All scenarios varied the following parameters: the number of observations n = {100, 500, 1000}, the dimension p = {2, 10, 50}, and the ratio r = {0.1, 0.5} of observations in each class. The evaluation considered the following metrics:
• Accuracy (ACC): the ratio of correctly classified observations to the total observations in the sample. From a standard binary confusion matrix, we have the quantities of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
• Matthews correlation coefficient (MCC): a balanced measure computed from the same confusion matrix quantities, which remains informative when the classes are unbalanced.
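Both metrics can be computed directly from the confusion matrix counts; the MCC formula below is the standard one.

```python
import math

def acc_mcc(y_true, y_pred):
    """Accuracy and Matthews correlation coefficient for binary labels in {1, -1}."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == -1 and p == -1 for t, p in zip(y_true, y_pred))
    fp = sum(t == -1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == -1 for t, p in zip(y_true, y_pred))
    acc = (tp + tn) / len(y_true)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0   # convention: MCC = 0 on degenerate matrices
    return acc, mcc
```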
The results for Simulation 1 are summarized in Table 2, where the ACC and MCC for the best hyperparameters (i.e., those yielding the maximum ACC and MCC) are reported. From the outcome, it is possible to notice that, given the simplicity of the classification task, both methods performed well, with perfect predictions in most cases. However, the RM did not perform as well as the RF when the classes are unbalanced with a small sample size, as can be seen from the RM's MCC values in the r = 0.1 and n = 100 scenarios, which are generally lower than those of the RF.
The outcome for Simulation 2 is reported in Table 3. In these artificial data sets, the two classes are no longer easily linearly separable, and this characteristic is reflected in lower values of ACC and MCC compared with the first scenario. Despite that, in most cases, the RM outperforms the random forest approach. The RM appears less accurate, according to the MCC measure, only in cases with a large imbalance between the classes (r = 0.1) and a small sample size (n = 100).
The predictive capacity of each algorithm can also be compared through the number of times a method won, i.e., achieved a higher or equal ACC or MCC, in the thirty holdout repetitions. The outcome of the simulation experiments is summarized graphically in Figure 1. It is remarkable that the RM outperformed the RF in the majority of the presented scenarios. Nonetheless, both methods slightly underfit small sample sizes in the non-linear classification problem of Simulation 3. Also, considering Simulation 2, the calculated MCC values for the RF models are higher in the particular cases where the data simultaneously have class imbalance and a small sample size. However, it is worth remembering that the ACC measure, which was used during the RM training process, does not exhibit this same behavior.
Tables 2, 3, and 4 also show the sensitivity of the RM with respect to the number of bootstrap samples B. Across all simulation scenarios, it can be observed that, on average, the error tends to be smaller as the number of base learners increases. However, the difference between 50 and 100 bootstrap samples seems small, showing the consistency of the algorithm.

Regression Task
The artificial data generation for regression tasks considered five different scenarios to evaluate which algorithm would perform better. Simulations 1-3 are toy examples (Scornet, 2016); Simulation 4 (Van der Laan et al., 2007) and Simulation 5 (Roy and Larocque, 2012) are simulation scenarios already tested and used in the literature. All covariates $X = (X_1, \ldots, X_p)$ from Simulations 1-4 follow a uniform distribution on $[-1, 1]^p$. In Simulation 5, each predictor follows an independent standard normal distribution. To assess how each model is affected by the sample size, the values n = {30, 100, 500, 1000} were chosen. Moreover, the RMSE was the measure selected to analyze the performance of the RF and RM.
The equations of each simulation scenario are described below:
• model 1: $p = 2$, $Y = X_1^2 + e^{-X_2^2} + N(0, 0.25)$
• model 2: $p = 8$, $Y = X_1 X_2 + X_3^2 - X_4 X_7 + X_5 X_8 - X_6^2 + N(0, 0.5)$
• model 3: $p = 4$, $Y = -\sin(X_1) + X_2^2 + X_3 - e^{-X_4^2} + N(0, 0.5)$
• model 4: $p = 6$, $Y = X_1^2 + X_2^2 X_3 e^{-|X_4|} + X_6 - X_5 + N(0, 0.5)$
• model 5: $p = 6$, $Y = X_1 + 0.707 X_2^2 + 2 \cdot \mathbb{1}_{X_3 > 0} + 0.873 \log(|X_1|)|X_3| + 0.894 X_2 X_4 + 2 \cdot \mathbb{1}_{X_5 > 0} + 0.464 e^{X_6} + N(0, 1)$

The averages of the RMSE are presented in Table 5. The results achieved in all scenarios give evidence that the regression RM outperformed the RF, reinforcing the idea that the novel ensemble approach is competitive. It is interesting to notice that, in most cases, the difference between the RMSE values of the two methods is smaller when the sample size is small. This behaviour may be interpreted as the regression RM benefiting even more from larger sample sizes. Figure 2 emphasizes the superiority of the regression RM in those cases, showing the proportion of the number of times that the RF had greater values of RMSE.
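As a concrete example, model 1 above can be generated in a few lines, together with the RMSE used for evaluation. The interpretation of N(0, 0.25) as a variance (so sd = 0.5) is an assumption; the function names are illustrative.

```python
import math
import random

def simulate_model1(n, seed=0):
    """Model 1: p = 2, Y = X1^2 + exp(-X2^2) + N(0, 0.25), with X ~ U[-1, 1]^2.
    Assumes N(0, 0.25) denotes variance 0.25, i.e. standard deviation 0.5."""
    rng = random.Random(seed)
    X, Y = [], []
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        X.append((x1, x2))
        Y.append(x1 ** 2 + math.exp(-x2 ** 2) + rng.gauss(0.0, 0.5))
    return X, Y

def rmse(y_true, y_pred):
    """Root mean squared error, the evaluation measure used for the regression tasks."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))
```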

Benchmarking Applications
Simulations are interesting for studying the performance of both methods under controlled situations. However, analysis over real data sets is a valuable and essential contribution. Therefore, the comparison was also applied to real-world and benchmarking data. All data sets were retrieved from the UCI Machine Learning Repository (Dua and Graff, 2017), comprising fifteen classification tasks and fifteen regression tasks, totaling thirty different data sets. All of them were chosen in order to diversify the sample size, dimensionality, and application domain. The validation technique, hyperparameter tuning, and evaluation metrics used in the benchmarking applications were the same as in Section 3.

Classification Cases
The description of the classification data sets is given in Table 6, with the number of observations (n), ranging from 90 to 1371, the number of predictors (p), ranging from 2 to 166, and the class proportion, i.e., the ratio between the numbers of observations in each class. All of them are binary classification cases. The average ACC and MCC values are presented in Table 7. Across the benchmarking scenarios considered, the RM performed better than the random forest algorithm, achieving higher mean values of ACC and MCC in 12 of the 15 data sets. Studying the three cases (german, vehicle, thoraric) where the random forest obtained better MCC values, it can be noticed that all of them present class imbalance. Highlighting the thoraric data set, it presents the largest difference in the MCC mean value among those three and also has the smallest class ratio. The RM's efficiency can also be observed graphically in Figure 3, which shows the proportion of times each method won in the thirty holdout repetitions.

Regression Cases
The characterization of the regression benchmarks is featured in Table 8, which presents the number of observations (n), ranging from 23 to 4177, and the number of predictors, ranging from 1 to 60. The mean value of the predicted variable is also provided.
The Root Mean Squared Error values obtained for the regression RM and RF are in Table 9. The results reveal that the regression RM produced a lower generalization error (i.e., lower RMSE) in the majority of the data sets tested. Comparing both columns, it can be noticed that the regression RM column is, on average, 8.5% smaller than the random forest one. Also, in absolute terms, the regression RM won in 12 of the 15 regression scenarios, losing only in friedman#3, pyrim, and triazine. Investigating more deeply, the high p/n ratio of the last two data sets might be the reason the RF performed better, since it can take more advantage of high dimensionality and small sample sizes than the regression RM. Despite these three cases, the superiority of the regression RM is also depicted in Figure 4.

Three Real-World Applications
In this section, the RM approach was used to solve three different novel real-world problems: predicting the defaulting status of companies, classifying people's gender from their definition of love, and predicting the rate of use of a Brazilian social assistance programme by municipality. The results were also compared with the Linear and Gaussian SVM and the RF. The descriptive analysis of these applications is shown in Supplementary Material B.

Default Status from Business Companies
The dataset is composed of 66 observations, where each instance represents a particular company. The outcome is a binary variable $y_i$: $y_i = 1$ indicates that the corporation is up to date with its salary payments, while $y_i = -1$ indicates a company in default. There are seven continuous covariates describing each company. Two of them are the current liquidity ratio (CLR) and the dry liquidity ratio (DLR), respectively. The other five are Kanitz indexes (Callado, 2003), which can indicate the possibility of business bankruptcy. The proportion between the number of instances in categories $y_i = 1$ and $y_i = -1$ is 27/39. To validate the performance of the RM, RF, and SVM, a 100-times repeated holdout validation was used with a 70-30% training-test split ratio.

[Figure 4: Proportion of the number of times that a method won, considering RMSE, in 30 holdout repetitions, calculated for all regression benchmarks.]
Hyperparameter tuning was also applied to all algorithms to achieve the best results for each of them, using a grid search over the hyperparameter combinations of each method. To evaluate the predictive capacity, ACC and MCC were calculated over the test set. The result is given in Figure 5.
The means of the average accuracy values for SVM.Lin, SVM.Gau, RF, and RM are 80%, 85%, 90%, and 90%, respectively, while the average MCC values are 65.51%, 66.66%, 77.87%, and 77.20%, respectively. Interpreting these results, we can infer that the RM surpassed the support vector models and obtained a performance equivalent to the RF.
It is important to emphasize that the RM performed competitively with the robust RF approach in this application, resulting in high values of ACC and MCC. The optimal hyperparameter configurations were: mtry = 6, nodesize = 5, and ntree = 1000 for the RF; C = 5,

Gender Prediction by the Love Interpretation
This database consists of a collection of statements about associations and feelings regarding what love is, gathered from 581 people of different genders and age groups. The data were gathered in order to study and explore the concept of love based on a Brazilian sample (Td, 2017). The transcripts of the responses were analyzed by psychologists and psychiatrists, who created 14 different categories indicating specific types of love perception. Besides that, a score is associated with each category to quantify how much of that feeling is present in the respective answer. For this classification task, the outcome is the biological gender of each person, defined as a binary target $y_i \in \{\text{Male}, \text{Female}\}$.
To specify the model, the 14 love categories and age were selected as independent variables, and gender was defined as the dependent variable $y_i$. The RM, random forest, and support vector models were then applied in order to build a model capable of predicting gender. To evaluate performance, a 100-times repeated holdout validation was used with a 70-30% training-test split ratio.
Hyperparameter tuning was also performed following the same grid search configuration presented in Section 5.1, except that the range of the mtry parameter was changed to mtry = {1, 2, 4, 8}.
The performance results are summarized in Figure 6. From the violin-boxplots of the average ACC values, it can be noticed that the performance of all models, considering accuracy, is almost the same. However, since the classes are unbalanced, evaluating the predictive capacity through the MCC is more meaningful. When only the MCC is observed, it is clear that the random machines performed slightly better than the other models. The medians of the one hundred MCC mean values for SVM.Lin, SVM.Gau, RF, and RM are 0.162, 0.145, 0.163, and 0.186, respectively.

Forecasting the Rate of Use of a Brazilian Social Programme
Government public administration aims to support the population through assistance programs that promote the reduction of poverty and inequality. In this sense, the Brazilian government runs a social program for direct income distribution. The database for this application contains a collection of Brazilian cities and their rate of use of this benefit. This rate of use ($y_i$) is defined as the number of people who receive the assistance divided by the total population of the city.
It is important to emphasize that models able to predict the rate of use of a social program like Bolsa Família can guide the government in better managing resources and provide better support in directing public policies. The data were retrieved from the Brazilian governmental site called the Transparency Portal and bring information about 5564 municipalities and their socioeconomic indexes.
Setting Y as the target variable and the other variables as predictors, regression models were fitted using the regression RM, the RF, and support vector regression with the Linear and Gaussian kernel functions. Their performance was evaluated using the Root Mean Squared Error, calculated through a validation scheme of 100 repeated holdouts with a 70-30% training-test split ratio. Hyperparameter tuning followed the same grid search configuration presented in Section 5.1. The ε = 0.1 parameter of the SVR models was kept at its default.
The average RMSE values obtained by each algorithm are summarized in Figure 7. From the results, it is clear that the regression RM outperforms all the other models, since it presents the lowest generalization error among them. The medians of the average values for SVM.Lin, SVM.Gau, RF, and RM are 0.0156, 0.0150, 0.0150, and 0.0147, respectively. The behavior of the RM shows that it can be a robust ensemble model as competitive as the RF. Another way to compare and emphasize the superiority of the RM for this regression task is by counting the number of times the proposed algorithm produced a lower RMSE in a holdout repetition: over all repetitions, this happened 100 times against SVM.Lin, 99 times against SVM.Gau, and 93 times against the RF. The optimal hyperparameter configurations were: mtry = 6, nodesize = 25, and ntree = 1000 for the random forest; C = 0.1, γGau. = 5, γLap. = 0.5 for the RM; C = 1, γ = 0.1 for the SVM with Linear kernel; and C = 1, γ = 0.1 for the Gaussian kernel.

Final Comments
This paper proposes an empirical comparison between the recent ensemble learning approach called RM and the consolidated tree ensemble method RF. Both models were evaluated in classification and regression tasks over several simulated data sets and benchmark data. The results obtained show that the new RM procedure is strongly competitive and produces better performance in the majority of the presented cases. Despite its good overall performance, the computational cost of this approach is still larger than that of the RF.
For both simulated and real databases, the RM generally has a better overall performance than the RF when the classes are balanced or the samples are not small. This behavior could be explained by the transformation of the feature space provided by the kernel functions. The use of the kernel trick in support vector models leads to different non-linear decision boundaries that may give a better representation of the learning problem than the linear split rules of tree models. Moreover, the random sampling of kernel functions can increase the diversity necessary in ensemble approaches more than the random subspace sampling of predictors in the RF. Another key aspect that can explain this phenomenon is the base learners that compose each ensemble approach, since SVMs are generally stronger, i.e., have a better predictive capacity, than simple tree models (Huang et al., 2003).
Additionally, the RM was considered in thirty benchmark data sets and in the three novel real-world applications explored in this paper. In the applications, the RM produced better results in two of them and similar results in one, reinforcing the predictive capacity of the RM. On the other hand, the RM did not show a high predictive capacity, via the MCC, in some situations with imbalanced classes and small sample sizes. Traditionally, SVMs lose predictive capacity in these cases (Wu and Chang, 2003). Some authors have reported modifications of traditional SVM models to deal with this problem and obtain better results (Wang and Japkowicz, 2010; Batuwita and Palade, 2013).
RM and RF are both ensemble learning methods, with different procedures based on bagging modelling. A principal difference between them is the base learner: RM uses support vector models, while random forest uses decision trees. Random machines consider different kernel functions to map the complete feature space; random forests consider different feature subspaces. In terms of computational complexity, the SVM has a time complexity of O(n^3) and decision trees have a time complexity of O(n × p^2) (Al-Rajab et al., 2017). For this reason, the RM is computationally more complex than the RF, especially in situations with large sample sizes. Our experience shows that a learning time of 60 seconds for the random forest is equivalent to roughly 400 seconds for the RM.
For future work, new procedures to accelerate the learning time of the RM may be considered. It may also be interesting to use these adaptations jointly within its workflow to obtain better results in more simulation scenarios, with different kernels and different weighting functions, as well as an exhaustive investigation of the computational costs.

Supplementary Material A
The RM was also implemented in the R language and can be used through the rmachines package, available and documented on GitHub at https://github.com/MateusMaiaDS/rmachines. For an overall description of how to reproduce the results in this article, access the README at https://mateusmaiads.github.io/rmachines_and_randomforest/.

Supplementary Material B
This supplement presents a descriptive analysis of the three real-world applications displayed in Section 5 and additional results on the comparison of RM and RF.