Support Vector Machines Classification on Class Imbalanced Data: A Case Study with Real Medical Data

Abstract: Support vector machines (SVMs) are among the most popular and powerful classification methods. However, their performance can be limited on highly imbalanced datasets. A classifier trained on an imbalanced dataset can produce a model biased towards the majority class, resulting in a high misclassification rate for the minority class. For many applications, especially medical diagnosis, it is highly important to distinguish accurately between false negative and false positive results. The purpose of this study is to evaluate the performance of a classifier while keeping the correct balance between sensitivity and specificity, in order to enable successful trauma outcome prediction. We compare the standard (or classic) SVM (C SVM) with resampling methods and a cost-sensitive method called Two Cost SVM (TC SVM), which are widely accepted strategies for imbalanced datasets. The results are discussed in terms of sensitivity analysis and receiver operating characteristic (ROC) curves.


Introduction and motivation
Support vector machines (SVMs), a powerful machine learning technique, were introduced by Vapnik (Vapnik (1995), Cortes and Vapnik (1995), Burges (1998), Cristianini and Shawe-Taylor (2000), Scholkopf and Smola (2001)) and have been successfully applied to various real-world problems, ranging from image retrieval (Tong and Chang (2001)) and handwriting recognition (Cortes (1995)) to face detection (Osuna et al. (1997)) and speaker identification (Schmidt (1996)). SVMs have become popular among machine learning researchers and statisticians because of their theoretical and practical advantages, which account for their strong performance in binary classification. However, standard SVMs, despite their effectiveness on balanced datasets, can prove inappropriate when faced with imbalanced data. Class imbalance is recognized as a crucial problem in the machine learning community (Chawla et al. (2004)). In these cases, classifiers tend to be overwhelmed by the majority class and to ignore the minority examples, since they assume equal misclassification costs. The resulting models are therefore often biased towards the majority class and perform poorly on the minority class. Furthermore, classifiers are typically designed to maximize overall accuracy, which is not an appropriate evaluation measure for imbalanced data. Consequently, in order to handle imbalanced data we should both consider improved algorithms and choose other performance metrics, such as the Geometric mean and AUC, instead of accuracy. In addition, for many applications, especially medical diagnosis where normal cases are the majority, the correct balance between sensitivity and specificity matters most, since we have to distinguish accurately between false negative and false positive results.
Numerous recent works have addressed the problem of imbalanced data. These techniques can be sorted into two categories: preprocessing methods, which oversample the minority instances or undersample the majority instances, and algorithmic methods, which include cost-sensitive learning (Batuwita and Palade (2013)). In our comparative study we use a cost-sensitive learning technique proposed by Veropoulos et al. (1999), called "TC SVM" because it uses two costs for the two classes. In addition, we apply two re-sampling methods, namely random over-sampling and random under-sampling. Finally, we present a combination of the widely used Synthetic Minority Oversampling Technique (SMOTE), proposed by Chawla et al. (2002), with random undersampling; the results are presented in the last section.
Parpoula et al. (2013) have already analysed a large-dimensional Trauma dataset; however, their study focuses on the comparison of several high-powered data mining techniques. The motivation of the present study, applied to the same medical dataset, is not only to enable successful trauma outcome prediction by improving the quality of the prediction model, but also to evaluate the performance of a classifier faced with imbalanced data while keeping the correct balance between sensitivity and specificity. To this end, we compare the performance of the standard SVM with the TC SVM, random over-/under-sampling and a combination of SMOTE with undersampling, and discuss the results in terms of sensitivity analysis. Our comparative study on a real medical dataset shows the effectiveness of the considered approaches.
The rest of this paper is organized as follows. In Section 2, we present the theoretical background of the considered SVM classifiers. In Section 3, we present the SVM analysis and carry out a comparative study of the considered methods in terms of accuracy, Geometric mean and the Area Under the ROC Curve (AUC). We also describe the performance criteria used to evaluate the employed methods. Finally, in Section 4, we summarize the results of our study and highlight some concluding remarks. Note that we use the terms classic SVM and standard SVM interchangeably for the soft margin SVM. Likewise, we use Gaussian, radial and RBF kernel to mean exactly the same kernel.

Theoretical background
In this section we briefly summarize the basic concepts of the considered methods by providing a short but necessary theoretical background. First, we discuss the soft margin SVM problem and then the modifications that lead to the TC SVM method. Subsequently, we discuss the main ideas of the pre-processing methods applied in our analysis. The last subsection presents the metrics examined in our work.

Introduction to Support Vector Machines
The SVM algorithm aims to find the optimal separating hyperplane, which effectively separates the data points into the labelled classes. Consider a binary classification problem. The data points are mapped into a high-dimensional feature space (a Hilbert space) by means of a kernel function $K$, which computes dot products between the mapped data points. For input points $x_i \in \mathbb{R}^d$ with class labels $y_i \in \{-1, +1\}$ ($i = 1, \dots, n$), the decision function in the feature space can be written as
$$f(x) = \operatorname{sign}\left( \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b \right), \quad (1)$$
where $b$ is the model bias. Note that only those points which lie closest to the hyperplane have $\alpha_i > 0$; these points constitute the support vectors. The necessary parameters are obtained from the primal optimization problem. The soft margin optimization problem (Cortes and Vapnik (1995)) can be formulated as
$$\min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i \left( w \cdot \Phi(x_i) + b \right) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; i = 1, \dots, n, \quad (2)$$
where $w$ is the weight vector normal to the hyperplane, $\Phi$ is the feature map induced by the kernel $K$, and $\xi_i$ are the slack variables that account for misclassified examples; consequently, the term $\sum_{i=1}^{n} \xi_i$ can be considered a measure of the total amount of misclassification of the model (i.e. the training errors). The trade-off between maximization of the margin and minimization of the error is controlled by the cost parameter $C$. The Lagrangian (dual) optimization problem of (2), used for finding the parameter $b$ and the coefficients $\alpha_i$, has the following formulation:
$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{subject to} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \;\; 0 \le \alpha_i \le C, \quad (3)$$
whose solution satisfies the Karush-Kuhn-Tucker (KKT) conditions. Note that SVMs handle moderately imbalanced data more effectively than other classifiers, because an SVM takes into account only those instances that are close to the boundary, i.e. the support vectors, when building its model (for more details see Akbani et al. (2004)). More specifically, Akbani et al. (2004) have argued that, due to the constraint $\sum_{i=1}^{n} \alpha_i y_i = 0$, when the positive support vectors are fewer than the negative ones, their coefficients $\alpha_i$ must be larger in magnitude than those of the negative support vectors. Since the $\alpha_i$ act as weights in the final classifier, the positive support vectors receive a higher weight than the negative ones, which counterbalances, to some extent, the effect of support vector imbalance.
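To make the role of the coefficients concrete, the following minimal Python sketch evaluates a kernel decision function of the form "sum over support vectors of alpha_i * y_i * K(x_i, x), plus b". It is an illustration only: the support vectors, coefficients and bias are hypothetical toy values, and the Gaussian kernel is parameterized by an assumed `gamma` rather than by the notation used elsewhere in this paper.

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel exp(-gamma * ||x - z||^2); gamma is an assumed value."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def decision_value(x, support_vectors, labels, alphas, b, kernel=rbf_kernel):
    """Return sum_i alpha_i * y_i * K(x_i, x) + b over the support vectors."""
    total = sum(a * y * kernel(sv, x)
                for sv, y, a in zip(support_vectors, labels, alphas))
    return total + b

def predict(x, support_vectors, labels, alphas, b, kernel=rbf_kernel):
    """Classify x as +1 or -1 from the sign of the decision value."""
    value = decision_value(x, support_vectors, labels, alphas, b, kernel)
    return 1 if value >= 0 else -1

# Hypothetical toy model: one positive and two negative support vectors.
# The alphas satisfy sum_i alpha_i * y_i = 0, so the single positive
# support vector carries a larger coefficient than each negative one.
SVS = [(1.0, 1.0), (-1.0, -1.0), (-1.0, 0.0)]
LABELS = [1, -1, -1]
ALPHAS = [1.0, 0.5, 0.5]
BIAS = 0.0

print(predict((1.0, 1.0), SVS, LABELS, ALPHAS, BIAS))
print(predict((-1.0, -1.0), SVS, LABELS, ALPHAS, BIAS))
```

The toy coefficients mirror the argument attributed to Akbani et al. (2004): with fewer positive support vectors, the equality constraint forces each of them to carry more weight.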

Cost sensitivity SVM (TC SVM) for imbalanced data
As we can conclude from equation (2), the cost C assigned to the positive and to the negative class is exactly the same. However, with imbalanced data, as already mentioned, equal costs can lead to a model biased towards the majority class and consequently to suboptimal results. Veropoulos et al. (1999) proposed a cost-sensitive method (Two-cost method) to deal with this problem in SVMs. They generalize the soft margin approach so that the formulation of the Lagrangian contains two misclassification costs, one for the examples of each class. More specifically, the optimization problem with two error terms is reformulated as follows:
$$\min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C^{+} \sum_{i : y_i = +1} \xi_i + C^{-} \sum_{i : y_i = -1} \xi_i \quad \text{subject to} \quad y_i \left( w \cdot \Phi(x_i) + b \right) \ge 1 - \xi_i, \;\; \xi_i \ge 0,$$
where $\Phi$ is the feature map induced by the kernel, and $C^{+}$ and $C^{-}$ are the misclassification costs of the positive and the negative class respectively. The corresponding dual optimization problem can be solved in just the same way as the standard SVM optimization problem. Good results can be obtained, as indicated in Akbani et al. (2004), by setting the ratio $C^{-}/C^{+}$ equal to the minority-to-majority class ratio, so that errors on the minority (positive) class are penalized more heavily.
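For reference, the dual problem that usually accompanies a two-cost soft margin formulation is sketched below. It is written under the assumption that the standard Lagrangian derivation carries over unchanged, with $C^{+}$ and $C^{-}$ denoting the costs of the positive and the negative class; it is not copied from Veropoulos et al. (1999).

```latex
\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    \alpha_i \alpha_j \, y_i y_j \, K(x_i, x_j)
\quad \text{subject to} \quad
\sum_{i=1}^{n} \alpha_i y_i = 0, \qquad
0 \le \alpha_i \le C^{+} \;\; (y_i = +1), \qquad
0 \le \alpha_i \le C^{-} \;\; (y_i = -1).
```

Under this form the objective is the same as in the standard SVM; the two costs enter only as class-specific upper bounds on the coefficients, which is why the same solver can be used.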

Sampling methods
Data preprocessing methods can be used to balance the datasets before training SVM models. At the data level, the classes are balanced by resampling the original dataset, either by over-sampling the minority class or by under-sampling the majority class, until a balanced ratio between the two classes is reached. Apart from random over-/under-sampling, there are synthetic generation methods such as SMOTE (Chawla et al. (2002)) and ROSE (Menardi and Torelli (2013)). Resampling methods have been used to train SVM models with imbalanced data in many different fields (see for example Akbani et al. (2004), Yuan et al. (2006), Batuwita and Palade (2009)). However, such methods have significant disadvantages: under-sampling may throw out useful information contained in the data, while over-sampling increases the computational burden, since it increases the size of the data.

Random Sampling SVM
Random over-sampling is the simplest method for increasing the number of minority class examples. It randomly replicates existing minority instances so as to balance the class distribution. Random over-sampling adds no new information, but it increases the weight of the minority examples by replication. However, it commonly leads to over-fitting: even though accuracy on the training set is high, classification performance on the test set is likely to be worse. Chawla et al. (2002) proposed the Synthetic Minority Over-sampling Technique (SMOTE) in order to avoid the over-fitting problem of random over-sampling. SMOTE generates synthetic data based on feature space similarities between minority instances, using information from the k-nearest neighbours of each minority instance. More precisely, the method finds the k-nearest neighbours of each minority example, randomly selects one of them, multiplies the corresponding feature vector difference by a random number between 0 and 1, and adds the result to the original example so as to produce a new minority example in its neighbourhood. It should be mentioned that SMOTE not only avoids over-fitting, but also causes the decision boundaries for the minority class to move towards the majority class.
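The interpolation step of SMOTE described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation used in this study: the function name `smote`, its parameters and the brute-force neighbour search are hypothetical, and every minority example receives the same number of synthetic neighbours.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def smote(minority, per_example, k=5, seed=0):
    """Create per_example synthetic points for each minority example.

    Each synthetic point lies on the segment between a minority example
    and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for base in minority:
        # Brute-force search for the k nearest minority neighbours of base.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: squared_distance(base, p))[:k]
        for _ in range(per_example):
            neighbour = rng.choice(neighbours)
            gap = rng.random()  # random number in [0, 1)
            synthetic.append(
                tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
            )
    return synthetic

# Toy minority class on the corners of the unit square (hypothetical data).
toy_minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(toy_minority, per_example=2, k=2)
print(len(new_points))
```

Because each synthetic point is a convex combination of two existing minority points, it always falls inside the region already occupied by the minority class, which is what distinguishes SMOTE from plain replication.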
Random under-sampling, in contrast to over-sampling, randomly removes majority instances while keeping all minority class examples. The training process becomes faster, since many majority examples are discarded. However, the main disadvantage of random under-sampling is that potentially useful data are lost. There are also heuristic under-sampling methods that try to remove only superfluous instances, whose removal does not affect the classification accuracy on the training set (Hart (1968)).
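Both random resampling schemes reduce to sampling with or without replacement. The sketch below illustrates them with the Python standard library; the function names and the `target_size` parameter are hypothetical and chosen only for this example.

```python
import random

def random_oversample(minority, target_size, seed=0):
    """Replicate randomly chosen minority examples until target_size is reached."""
    rng = random.Random(seed)
    n_extra = max(0, target_size - len(minority))
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return list(minority) + extra

def random_undersample(majority, target_size, seed=0):
    """Keep a random subset of target_size majority examples (no replacement)."""
    rng = random.Random(seed)
    return rng.sample(majority, target_size)

# Toy data: 5 minority and 50 majority record identifiers (hypothetical).
toy_minority = list(range(5))
toy_majority = list(range(100, 150))

oversampled = random_oversample(toy_minority, target_size=50)
undersampled = random_undersample(toy_majority, target_size=5)
print(len(oversampled), len(undersampled))
```

Over-sampling keeps every original record and only adds copies, so no new information enters the training set; under-sampling discards records, which is where the potential loss of useful data comes from.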

Undersampling and SMOTE Combination
SMOTE (Chawla et al. (2002)), as already mentioned, is a well-known algorithm for tackling the class imbalance problem in many learning algorithms. The general idea of this method is to generate new minority class examples artificially, using the nearest neighbours of the existing ones. In the present modification, we simultaneously under-sample the majority class examples, leading to a more balanced dataset and avoiding over-fitting.
In conclusion, it should be noted that approaches at the data level (i.e. rebalancing the data distribution) raise two important problems for an SVM classifier. First, over-sampling methods significantly increase the dataset size, leading to longer computation times and to over-fitting. Second, the optimal class distribution ratio has to be determined empirically, by grid search procedures.

Metrics for evaluating model performance
Traditionally, the performance of a binary classifier is assessed using metrics derived from the confusion matrix (Table 1). More precisely, given a classifier and a record, there are four possible outcomes: True Positives (TP), where positive records are correctly predicted as positive; False Negatives (FN), where positive records are incorrectly identified as negative; False Positives (FP), where negative records are classified as positive; and True Negatives (TN), where negative records are correctly identified as negative. A two-by-two confusion matrix represents these outcomes, and the measures that follow are computed from it.
Accuracy, defined as
$$\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}, \quad (4)$$
is the most common measure used to quantify the performance of a classifier. Despite its usefulness on balanced datasets with the standard SVM, overall accuracy is an inappropriate metric for imbalanced data. For instance, a classifier that predicts all samples as negative has high accuracy (4) but cannot detect the rare positive samples.
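The point about the all-negative classifier can be checked with a short calculation. The sketch below uses the class counts reported for the Trauma dataset in the application section of this paper (446 positive and 8416 negative records); the function name is hypothetical.

```python
def accuracy(tp, fn, fp, tn):
    """Overall accuracy from the four cells of the confusion matrix."""
    return (tp + tn) / (tp + fn + fp + tn)

# Class counts reported for the Trauma dataset in this paper.
N_POSITIVE = 446
N_NEGATIVE = 8416

# A trivial classifier that labels every record as negative:
# all positives become false negatives, all negatives are true negatives.
all_negative_accuracy = accuracy(tp=0, fn=N_POSITIVE, fp=0, tn=N_NEGATIVE)
print(f"accuracy = {all_negative_accuracy:.4f}")  # prints accuracy = 0.9497
```

An accuracy of about 95% is obtained without detecting a single positive case, which is why accuracy alone is not used to rank the methods in this study.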
Consequently, the performance of such systems, with a view to an optimally balanced classification ability, is described more effectively in terms of sensitivity (also called true positive rate or positive class accuracy) and specificity (also called true negative rate or negative class accuracy):
$$\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}.$$
More precisely, sensitivity measures the proportion of actual positives that are correctly identified as such, i.e. the percentage of people who have the disease and are correctly identified as having it. Specificity measures the proportion of actual negatives that are correctly identified, i.e. the percentage of people who do not have the disease and are correctly identified as healthy. A Type I error occurs when the null hypothesis is true but is rejected. In medical diagnosis, an example of a Type I error is a test indicating that a patient has a disease when in fact the patient does not. A typical example of a Type II error is a failure to detect the disease in a patient who really has it. It should be noted that a test with high sensitivity has a low Type II error rate and a test with high specificity has a low Type I error rate. Based on these two measures, Kubat and Matwin (1997) proposed the Geometric mean of sensitivity and specificity:
$$\text{G-mean} = \sqrt{\text{Sensitivity} \times \text{Specificity}}.$$
Moreover, Receiver Operating Characteristic (ROC) curves are another way, besides confusion matrices, to examine the performance of a classifier in a more intuitive and robust manner. A ROC curve (Pepe (2000)) is used to evaluate the performance of a system with a dichotomous outcome; it represents graphically the trade-off between sensitivity and specificity as the classification threshold varies. The Area Under the Curve (AUC) summarizes the balance between sensitivity and specificity over all classification thresholds. For more details we refer to Swets and Pickett (1982). Consequently, in order to handle imbalanced data we should consider other measures, such as the Geometric mean and AUC.
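The measures discussed above can be computed directly from confusion matrix counts and classifier scores. The following Python sketch is illustrative only; the function names are hypothetical, the example counts and scores are made up, and the AUC is computed through its rank interpretation (the probability that a randomly chosen positive receives a higher score than a randomly chosen negative) rather than by tracing the ROC curve.

```python
import math

def sensitivity(tp, fn):
    """True positive rate; 0.0 if there are no positive records."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def specificity(tn, fp):
    """True negative rate; 0.0 if there are no negative records."""
    return tn / (tn + fp) if (tn + fp) else 0.0

def g_mean(tp, fn, fp, tn):
    """Geometric mean of sensitivity and specificity."""
    return math.sqrt(sensitivity(tp, fn) * specificity(tn, fp))

def auc(scores_pos, scores_neg):
    """AUC as the fraction of positive-negative pairs ranked correctly (ties count 1/2)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical confusion matrix: sensitivity 0.8, specificity 0.9.
example_gmean = g_mean(tp=80, fn=20, fp=10, tn=90)

# The all-negative classifier on 446 positives and 8416 negatives.
trivial_gmean = g_mean(tp=0, fn=446, fp=0, tn=8416)

# Hypothetical decision scores for three positives and three negatives.
example_auc = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
print(example_gmean, trivial_gmean, example_auc)
```

Unlike accuracy, the Geometric mean of the all-negative classifier is zero, because a single vanishing class-wise rate drives the product to zero.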

Application - Comparative results
In this section we compare the performance of the standard SVM, the Two-cost SVM, random sampling (random over-sampling and random under-sampling), a combination of SMOTE and random under-sampling, as well as a recently proposed method called ROSE, on a large-dimensional Trauma dataset consisting of n = 8862 patients and 41 factors that include demographic, transport and intrahospital data. The main aim is to provide an unbiased estimate of each model's discrimination. To this end, the performance criteria are calculated on a portion of the original dataset that is not used in the model building process, called the test set. A classifier should achieve high values of accuracy, sensitivity, specificity, AUROC and Geometric mean, and a model's generalization performance is often estimated by holdout validation. In our study the dataset is split randomly into a training set containing 75% of the cases (6647) and a test set containing 25% of the cases (2215), in order to evaluate the performance of the classifiers on new data. Our medical dataset is highly imbalanced, since it consists of 446 positive instances (minority class) and 8416 negative instances (majority class). This makes it imperative to use both pre-processing methods that balance the dataset and cost-sensitive learning methods that assign different weights to the two classes. In addition, the use of measures more robust than accuracy, such as the Geometric mean and AUC, will provide more reliable conclusions. Our motivation for this study comes from medical decision support, which made the choice of a medical dataset imperative. For each patient, the target attribute y is binary and denotes the outcome (death or survival). Specifically, y takes the values -1 and 1, where -1 represents survival and 1 represents death. According to medical advice, all the prognostic factors should be treated equally during the statistical analysis, and there is no factor that should always be retained in the model. The names of these factors are listed in the Appendix. The analysis, including all steps of data pre-processing and model development, was carried out in R, and the algorithms were implemented using the packages 'e1071' and 'DMwR'.

Standard SVM
For a standard SVM classifier we should determine not only the kernel function but also the regularization parameter C, the value of gamma in the case of a Gaussian (RBF) kernel, and the degree in the case of a polynomial kernel. Model selection for support vector machines is vital and influences the overall performance of the classifier, which makes SVMs quite sensitive to the choice of these parameters.
Applying 10-fold cross-validation, we obtain the cost value with the best performance in terms of error rate, which is equal to 2. Figure 1 illustrates the changes in classification error for different values of the cost parameter for a standard linear SVM. Besides the cost parameter, the intrinsic kernel parameters of the SVM classifier greatly affect its performance. For a Gaussian (RBF) kernel, apart from the regularization parameter C, the value of gamma should be selected from several candidates. The gamma value should normally lie between 1/k (= 0.0244) and 6/k (= 0.14634), where k is the data dimension (41 in our study). Performing a grid search, we chose the value that resulted in the best performance. Figure 2 displays how the error changes with the gamma parameter. The optimal value of gamma (= 0.03125), marked by the red line in the figure, gives the smallest error rate. Table 2 reports some of the gamma values examined. The results in Table 3 are obtained using C = 2 for the linear kernel, C = 1 for the sigmoid, polynomial and Gaussian kernels, and gamma = 0.03125 for the Gaussian kernel. If the kernel type is polynomial or sigmoid, the bias parameter sets the offset in the kernel function, and its default value of 0 is suitable in most cases. The degree parameter applies only to the polynomial kernel and is set equal to 3.
Table 3 shows the performance of SVM using different kernels. The SVMs with a linear and with a Gaussian kernel have the highest classification accuracy, sensitivity, specificity, AUC and Geometric mean. The Gaussian kernel reaches values of 0.9848, 0.77922, 0.99821, 0.8866 and 0.8796 for accuracy, sensitivity, specificity, AUC and Geometric mean respectively. Very similar results were obtained for the linear kernel. The second best results in terms of accuracy were obtained with the sigmoid kernel. However, the sigmoid kernel has the worst performance on the more robust metrics, AUC and Geometric mean. It should be mentioned that there is over-fitting, especially in the case of non-linear kernels, according to the above measures. In addition, when conducting SVM classification without selective sampling, we observed that the G-mean values are consistently low.
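The admissible range for gamma quoted above follows directly from the data dimension. The short sketch below recomputes the bounds and filters a candidate grid; the power-of-two grid is an assumption made for this illustration (the candidate values actually examined are those of Table 2), although it does contain the selected value 0.03125.

```python
# Data dimension reported in this study.
k = 41

# Bounds of the recommended gamma range, 1/k and 6/k.
gamma_low = 1 / k
gamma_high = 6 / k

# Hypothetical candidate grid: powers of two from 2^-10 to 2^0.
candidates = [2.0 ** exponent for exponent in range(-10, 1)]

# Keep only the candidates that fall inside the recommended range.
in_range = [g for g in candidates if gamma_low <= g <= gamma_high]

print(round(gamma_low, 4), round(gamma_high, 5))
print(in_range)
```

In a full analysis each retained candidate would be scored by cross-validated error, as described in the text, and the value with the smallest error would be kept.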

Two-cost SVM
To apply the Two-cost SVM one has to determine two costs, as follows from the theory in the previous section. The misclassification costs play a crucial role in the construction of a cost-sensitive learning model and in achieving the expected classification results. We search for the optimal parameters using evaluation functions such as the Geometric mean and AUC. The ratio between the minority and the majority class for the Trauma dataset is 0.05299. More specifically, we set the cost of the minority class equal to 1 and performed a search over values of the majority class cost ranging from 0.01 to 2.0. The best results in terms of the Geometric mean were obtained for the majority cost values 0.04, 0.0529 (= ratio) and 0.06, as can be seen from Figure 3. The value 0.06 gives the best performance; however, the other two values gave almost the same results. We finally chose the inverse ratio between the two class sizes, i.e. we set the cost ratio equal to the minority-to-majority class ratio ($C^{-} = C^{+} \times 0.05299$).
Figure 4 illustrates, in separate graphs, the accuracy, sensitivity and specificity as the cost of the majority class changes. The dashed grey line shows the value of each measure for the standard SVM. Figure 5 shows the same comparisons in a single graph. The vertical grey line indicates the majority class cost when it is set equal to the ratio of the two classes. As we can conclude from Figure 6, sensitivity increased steadily as the number of support vectors increased. In contrast, accuracy and specificity attained higher values for fewer support vectors. Note that increasing the majority cost also results in fewer support vectors.
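The class ratio and the resulting pair of costs can be reproduced with a short calculation. The sketch below is illustrative; the function name and the dictionary layout are hypothetical, and the class counts are those reported for the full Trauma dataset.

```python
def two_cost_weights(n_minority, n_majority, minority_cost=1.0):
    """Set the majority cost to the minority cost times the class ratio."""
    ratio = n_minority / n_majority
    return {"minority": minority_cost, "majority": minority_cost * ratio}

# Class counts reported for the Trauma dataset (positives are the minority).
weights = two_cost_weights(n_minority=446, n_majority=8416)
print(round(weights["majority"], 5))  # prints 0.05299
```

With the minority cost fixed at 1, an error on a positive (minority) record is therefore penalized roughly 19 times more heavily than an error on a negative record. In a package that accepts per-class weights, such as the `class.weights` argument of `svm` in 'e1071', these two values would presumably be passed as the weights of the two classes.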

Some Comparisons among C and TC SVM
Some comparisons between the C SVM and the TC SVM are included in order to show the importance of the applied methodology. We first present the performance for the linear case; the other kernels follow, after the best parameters have been chosen. Table 4 shows the results, where the standard SVM attains higher accuracy on both the training and the test set. Comparing the standard and the TC SVM, the former has higher specificity, which means that it recognizes more actual negatives; in other words, the C SVM yields a lower Type I error rate. This measure alone does not tell us how well the classifier recognizes positive cases, so it is necessary to take into consideration both the sensitivity and the specificity of the classifiers. When the two algorithms are evaluated on sensitivity, the TC SVM has a clear advantage, attaining the higher value, which means that its Type II error rate is lower than that of the C SVM (classic or standard SVM).
Figure 8 displays the ROC curves derived from the two considered methods. The further the curve lies above the reference line, the more accurate the test. The AUROC is 0.9198 for the linear C SVM and higher, 0.9507, for the TC SVM. The cost-sensitive method outperforms the standard SVM not only in terms of AUC but also in terms of the Geometric mean.
Table 5 describes the performance of the standard SVM and the TC SVM for the nonlinear case. The best Geometric mean was obtained for the TC SVM with a Gaussian (radial basis) kernel. Comparable results are obtained with a sigmoid kernel for all the considered metrics, with a Geometric mean of 95.20% for the TC method. It should be noted that the cost-sensitive learning method reduces the over-fitting problem. Very similar results were obtained when considering AUC instead of the Geometric mean. With the TC SVM, the Gaussian kernel clearly has the highest Geometric mean and AUC among the non-linear kernels, whereas the polynomial kernel has the lowest. The difference between the Gaussian and the sigmoid kernel is so small that both achieve good results for all measures. Furthermore, the cost-sensitive SVM performs well in the linear case. According to the AUC measure, the polynomial kernel has the worst results. Figure 7 displays a comparison with respect to the Geometric mean, confirming the above conclusions. Comparing the TC with the C SVM for a Gaussian kernel, it can be inferred that the former outperforms the latter in terms of Geometric mean and AUC. As far as the polynomial kernel is concerned, the difference between the two methods is considerably larger than for the other kernels. In the ROC curves of Figure 8 for the Gaussian kernel, the TC method performs better on average than with the other kernels, though the difference from the sigmoid kernel is not statistically significant. As we can infer from Figure 8, the polynomial kernel shows the worst performance.
Figure 9: ROC curves derived from all kernels using both methods
Figure 9 illustrates the performance of the two methods on the Trauma dataset for all the examined kernels. In addition, it ranks the candidate models according to the AUC criterion and helps the experimenter choose the best approach for a given analysis. The highest AUC was obtained for the TC method with a Gaussian kernel (AUC = 0.9524), followed by the TC method with a sigmoid kernel (AUC = 0.9520) and with a linear kernel (AUC = 0.9507), the former slightly outperforming the latter. The standard linear SVM (AUC = 0.9198) and the standard SVM with a Gaussian kernel (AUC = 0.8876) come next. The polynomial kernel gave the lowest AUROC, equal to 0.748 and 0.8488 for the standard and the TC SVM respectively. In Figure 9, the labels Linear, Radial, Poly and Sigmoid denote the linear, Gaussian, polynomial and sigmoid kernels respectively.

Random Over-sampling (SVM-RO)
Learning with over-sampled training sets was repeated 20 times for each size of the enlarged training set. We then chose the enlarged training set that produced the maximum G-mean value on the original training set.

Random Under-sampling (SVM-RU)
We also conducted random under-sampling of the majority instances. As with over-sampling, learning with under-sampled training sets was repeated 20 times for each size of the reduced training set. We then chose the reduced training set that produced the maximum G-mean value on the original training set.

SMOTE-SVM and undersampling combination
As far as the SMOTE algorithm is concerned, the number of nearest neighbours K was set to 5. Learning was performed using 20 independent synthetically enhanced datasets, and the best synthetic sample size was identified as the one maximizing the Geometric mean. For example, an increment of 300% is selected if the maximum average G-mean on the original training set occurs when 300% new synthetic instances are added to the training dataset. While gradually increasing the minority instances and simultaneously reducing the majority class examples, we recorded, for each combination, the Geometric mean values on the original training sets. With SVM-SMOTE, the number of synthetic instances needed to achieve the desired class balance is unknown and must be determined empirically. The minority class was over-sampled at 50%, 100%, 200%, 300%, 400%, 500% and the majority class was under-sampled at 10%, 15%, 25%, 50%, 75%, 100%, 125%, 150%, 175%, 200%, 300%, 400%, 500%, 600%, 700%, 800%, 1000%, 2000%, as presented in Table 6. We chose these rates following Chawla et al. (2002). Because of its performance, a Gaussian (radial basis) kernel function $K(x, x') = \exp(-\|x - x'\|^2 / \sigma^2)$ is used for SVM classification; recall that it shows the best performance among all the kernels (linear and nonlinear) on our dataset. With SVM-SMOTE, the maximum Geometric mean was found for an increment of 500% combined with a 200% under-sampling ratio, reaching a value of 97.18652%. Table 7 shows a search among different under-sampling ratios for 50%, 100%, 200%, 300%, 400%, 500% SMOTE-SVM respectively.
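To see what a given pair of over- and under-sampling percentages implies for the class sizes, the following sketch computes the resulting counts. It assumes the convention documented for the `SMOTE` function of the 'DMwR' package, where the over-sampling percentage gives the number of synthetic cases per minority case and the under-sampling percentage gives the number of majority cases kept per synthetic case; whether the study followed exactly this convention is an assumption, the function name is hypothetical, and the minority count of 100 is a made-up example value. Over-sampling percentages below 100 are not handled.

```python
def smote_under_sizes(n_minority, perc_over, perc_under):
    """Class sizes after SMOTE plus under-sampling (assumed DMwR-style convention)."""
    # Synthetic minority cases: perc_over / 100 new cases per original case.
    n_synthetic = int(n_minority * perc_over / 100)
    # Majority cases kept: perc_under / 100 cases per synthetic case.
    n_majority_kept = int(n_synthetic * perc_under / 100)
    return {
        "minority": n_minority + n_synthetic,
        "majority": n_majority_kept,
    }

# Hypothetical training set with 100 minority cases, using the best
# combination reported in the text (500% SMOTE, 200% under-sampling).
sizes = smote_under_sizes(n_minority=100, perc_over=500, perc_under=200)
print(sizes)
```

Under this convention the best combination does not produce perfectly balanced classes: the majority class remains somewhat larger than the enlarged minority class.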

Experimental Results and Discussion
The Geometric mean values for the original training set using the 5 different methods are shown in the below table.Comparison of differences for all pairs of methods illustrated that SMOTE-SVM and oversampling have the best performance in test set.However using random-oversampling there is a problem of overfitting of data something that it is more likely to happen using nonlinear kernels.For this reason, SMOTE SVM seems to have the best performance.SMOTE SVM slightly outperformed SVM-RU and the biased TC method provides a very competitive solution to other existing standard methods, in optimization of Geometric mean and AUC for combating imbalanced classification problems.These results confirm the advantages of the considered approach, showing the promising perspective and new understanding of cost sensitive learning.On the other hand, sampling methods seem to outperform the C SVM. Especially a combination of SMOTE SVM with undersampling has revealed the best performance considering not only the Geometric mean and computational time, but also the overfitting problems that have been created using the other methods.
This paper presents a comparative analysis of different SVM strategies on real medical data.Evaluating the reliability of classifier algorithms is essential to ensure data quality.We used the Geometric mean measure and the Area Under the Roc Curve, both obtained by sensitivity and specificity, for the comparison of algorithms in order to provide useful results.Note that these two metrics gave almost similar results.In this way we make some comparisons only in terms of Geometric mean.It is obvious that the effort of health care to prevent patients' death is a huge problem that arises, forcing researchers to be more careful in their research.Sensitivity and specificity measure the prognostic model's ability to recognize the patients of a certain group (survivors or non-survivors).The value of this comparative study is the ability to calculate Type I and Type II error rates, giving lower Type II error with the cost sensitive and data preprocessing methods and as a consequence higher sensitivity compared to C SVM.This issue is of high importance for medical diagnosis due to the fact that the presented methodology gives us the ability to recognize the patients which are going to die and they are provided by an appropriate treatment.In this way, many deaths would be avoided.This method may assist as guidelines for improving the quality of treatment and therefore survivability of a patient through optimal trauma management.Although, Parpoula et al. (2013) have already dealt with the analysis of the Trauma dataset; their study focuses on the comparison of several data mining techniques including standard SVM.Our motivation for conducting this study is different because what we want to achieve is the balance between sensitivity and specificity enable the success trauma outcome prediction.The effectiveness of the considered approach is obvious.
We hope this work will convince experimenters to use not only standard SVM techniques but also reformulations of SVMs for the extraction of useful patterns when they deal with imbalanced medical datasets. Support Vector Machines are a powerful predictive tool, and the use of SVM classifiers as an alternative method for supporting medical knowledge discovery is one of the most promising topics for further research.

Figure 1: Performance of SVM with a linear kernel for different values of the cost parameter. The red line shows the cost with the best performance in terms of error rate.

Figure 2: Performance of SVM with a Gaussian (RBF) kernel for different values of the gamma parameter.

For our Trauma dataset the minority class consists of positive instances and the majority class consists of negative instances. The two cost parameters are the minority cost ($C^{+}$), which refers to the positive instances, and the majority cost ($C^{-}$), which refers to the negative instances. We can reduce the effects of class imbalance by assigning a higher misclassification cost to the minority class examples than to the majority class examples. Veropoulos et al. (1999) and Akbani et al. (2004) suggested the inverse ratio between the two class sizes as a good choice that improves the performance of the TC SVM method. After performing a search among different values for the two costs, we confirm this result.
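The inverse-ratio rule can be sketched as follows. This is a minimal illustration that assumes the majority cost is kept at a base value and the minority cost is scaled by the ratio of the class sizes. The function name, the `base_cost` parameter and the class counts are hypothetical.

```python
def inverse_ratio_costs(n_minority, n_majority, base_cost=1.0):
    """Return (minority_cost, majority_cost) for a two-cost SVM.

    The ratio minority_cost / majority_cost equals n_majority / n_minority,
    i.e. the inverse of the ratio between the two class sizes.
    """
    if n_minority <= 0 or n_majority <= 0:
        raise ValueError("both classes must contain at least one instance")
    majority_cost = base_cost
    minority_cost = base_cost * n_majority / n_minority
    return minority_cost, majority_cost


# Hypothetical class sizes: 100 positive (minority), 900 negative (majority)
c_plus, c_minus = inverse_ratio_costs(n_minority=100, n_majority=900)
print(c_plus, c_minus)  # 9.0 1.0
```

Many SVM implementations accept per-class costs of this kind. In scikit-learn, for example, a pair like this could be passed through the `class_weight` argument of `SVC`. Which software and settings were used in this study is not assumed here.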

Figure 3: Geometric mean (y-axis) as the cost of the majority class (x-axis) changes.

Figure 4: Performance in terms of the three measures (Accuracy, Sensitivity, Specificity) as the cost of the majority class (x-axis) changes (solid red line: Two-cost SVM; dashed line: Classic SVM).

Figure 5: Comparison of accuracy, sensitivity and specificity as the cost of the majority class (x-axis) changes (solid black line: Accuracy; dashed red line: Sensitivity; dashed green line: Specificity). The vertical line indicates the cost of the majority class when it is set equal to the ratio of the two classes.

Figure 7: ROC curves comparison for the linear case.

Figure 8: ROC curves comparison for the non-linear case on the test set. Red curves represent TC SVM and black curves Classic SVM.

Figure 8 displays the ROC curves derived from all SVMs with the three non-linear kernels. In Figure 8, the TC method with the Gaussian kernel performs better on average than with the other kernels, though the difference from the sigmoid kernel is not statistically significant. As we can infer from Figure 8, the polynomial kernel shows the worst performance.

Figure 9: ROC curves derived from all kernels using both methods.

Figure 10: Geometric mean values of the training dataset as the number of synthetic minority instances added by SMOTE increases. It shows the percentage of minority-class instances correctly classified in the original training sets as instances are added by SMOTE, for 4 different under-sampling ratios. The highest value was obtained with the combination of 50-SMOTE SVM and 200% under-sampling.

Table 2: Model selection (some selected values) for the gamma parameter in SVM with a Gaussian (RBF) kernel.

Table 3: Comparison of standard SVM performance for different kernels on the Trauma dataset.

Table 4: Performance comparison for standard SVM and TC SVM with a linear kernel.

Table 5: Performance comparison for the two different SVM techniques with different kernels (nonlinear case). In the table, C is an abbreviation for classic (standard) SVM and TC for TC SVM.

Table 6: Grid search for different combinations of SMOTE SVM and random undersampling.

Table 7: Comparison of % minority correct for different undersampling ratios as the oversampling rate changes.

Table 8: Geometric mean of the training sets obtained from 4 different methods.