Abstract: The problem of variable selection is fundamental to statistical modelling in diverse fields of science. In this paper, we study in particular the problem of selecting important variables in regression problems when observations and labels of a real-world dataset are available. First, we examine the performance of several existing statistical methods on a large real trauma dataset consisting of 7000 observations and 70 factors, which include demographic, transport and intrahospital data. The statistical methods employed in this work are nonconcave penalized likelihood methods (SCAD, LASSO, and Hard), generalized linear logistic regression, and best subset variable selection (with AIC and BIC), used to detect possible risk factors of death. Supersaturated designs (SSDs) are a large class of factorial designs which can be used for screening out the important factors from a large set of potentially active variables. This paper presents a new variable selection approach, inspired by supersaturated designs, for a given dataset of observations. The merits and effectiveness of this approach for identifying important variables in observational studies are evaluated by considering several two-level supersaturated designs and a variety of different statistical models with respect to the combinations of factors and the number of observations. The derived results are encouraging, since the alternative approach using supersaturated designs provided specific information that is logical and consistent with medical experience, and may also serve as guidelines for trauma management.
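As a hedged illustration of the penalized-likelihood selection step the abstract names, the sketch below fits an L1-penalized (LASSO-type) logistic regression to synthetic data standing in for the trauma dataset; the data, penalty strength, and variable names are assumptions for illustration only, not the paper's actual analysis (SCAD and Hard penalties are not shown).

```python
# Illustrative sketch: LASSO-type variable selection via L1-penalized
# logistic regression (scikit-learn). Synthetic data stands in for the
# trauma dataset; only 3 of 20 factors are truly active.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [1.5, -2.0, 1.0]            # the truly active factors
prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
y = (rng.uniform(size=n) < prob).astype(int)   # binary outcome (e.g. death)

# The L1 penalty shrinks coefficients of inactive factors toward exactly
# zero, so the nonzero coefficients give the selected variables.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
selected = np.flatnonzero(model.coef_[0] != 0)
print("selected factors:", selected)
```

In a real analysis the penalty strength `C` would be chosen by cross-validation, and SCAD or Hard thresholding could replace the L1 penalty to reduce the bias LASSO induces on large coefficients.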
Abstract: Support vector machines (SVMs) constitute one of the most popular and powerful classification methods. However, SVMs can be limited in their performance on highly imbalanced datasets. A classifier trained on an imbalanced dataset can produce a model biased towards the majority class, resulting in a high misclassification rate for the minority class. For many applications, especially medical diagnosis, it is highly important to accurately distinguish false negative from false positive results. The purpose of this study is to evaluate the performance of a classifier while keeping the correct balance between sensitivity and specificity, in order to enable successful trauma outcome prediction. We compare the standard (or classic) SVM (C SVM) with resampling methods and with a cost-sensitive method, called Two Cost SVM (TC SVM), which constitute widely accepted strategies for imbalanced datasets, and the derived results are discussed in terms of sensitivity analysis and receiver operating characteristic (ROC) curves.
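As a hedged sketch of the cost-sensitive idea the abstract describes, the example below fits a standard SVM and an SVM with a higher misclassification cost on the minority class (one common realization of a "two cost" scheme, here via scikit-learn's `class_weight`); the data, kernel, and cost ratio are illustrative assumptions, not the study's actual setup.

```python
# Illustrative sketch: cost-sensitive SVM on imbalanced data, approximating
# a two-cost scheme by weighting minority-class errors more heavily.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import recall_score

rng = np.random.default_rng(1)
# Imbalanced synthetic data: 95% majority (class 0), 5% minority (class 1).
n_maj, n_min = 950, 50
X = np.vstack([rng.normal(0.0, 1.0, (n_maj, 2)),
               rng.normal(1.5, 1.0, (n_min, 2))])
y = np.array([0] * n_maj + [1] * n_min)

plain = SVC(kernel="rbf").fit(X, y)     # standard C-SVM: one cost for all errors
costed = SVC(kernel="rbf",              # higher cost on minority-class errors
             class_weight={0: 1, 1: 10}).fit(X, y)

# Sensitivity = recall on the minority class; the cost-sensitive fit trades
# some specificity for higher sensitivity.
print("sensitivity, plain :", recall_score(y, plain.predict(X)))
print("sensitivity, costed:", recall_score(y, costed.predict(X)))
```

In practice the cost ratio would be tuned (e.g. against ROC curves on held-out data) rather than fixed at 10, and sensitivity would be measured on a test set rather than the training data as in this toy example.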