Statistical learning methods have been growing in popularity in recent years. Many of these procedures have parameters that must be tuned for models to perform well. Research on tuning has been extensive for neural networks, but not for many other learning methods. We looked at the behavior of tuning parameters for support vector machines, gradient boosting machines, and AdaBoost in both classification and regression settings. We used grid search to identify ranges of tuning parameters where good models can be found across many different datasets. We then explored different optimization algorithms for selecting a model across the tuning parameter space. Models selected by each optimization algorithm were compared to the best models obtained through grid search in order to identify well-performing algorithms. This information was used to create an R package, EZtune, that automatically tunes support vector machines and boosted trees.
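The grid search step described above can be illustrated with a minimal sketch. It is not the EZtune implementation: the error surface `toy_validation_error`, the parameter names `cost` and `gamma`, and the grid values are all assumptions made for illustration. In practice the objective would be a cross-validated error computed by fitting a model at each grid point.

```python
import itertools
import math


def toy_validation_error(cost, gamma):
    """Hypothetical stand-in for a cross-validated error surface.

    The minimum is placed at cost=10, gamma=0.1 (on a log10 scale)
    purely for illustration; a real objective would fit a model.
    """
    return (math.log10(cost) - 1.0) ** 2 + (math.log10(gamma) + 1.0) ** 2


def grid_search(objective, grid):
    """Evaluate the objective at every grid combination.

    Returns (best_params, best_value) for the smallest objective value.
    """
    names = sorted(grid)
    best_params, best_value = None, math.inf
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        value = objective(**params)
        if value < best_value:
            best_params, best_value = params, value
    return best_params, best_value


# Assumed grid; tuning parameters are often searched on a log scale.
GRID = {
    "cost": [0.1, 1.0, 10.0, 100.0],
    "gamma": [0.001, 0.01, 0.1, 1.0],
}

best_params, best_value = grid_search(toy_validation_error, GRID)
print(best_params, best_value)
```

The number of evaluations grows multiplicatively with each added parameter, which is one reason to compare grid search against cheaper optimization algorithms.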
Abstract: Support vector machines (SVMs) are among the most popular and powerful classification methods. However, SVMs can perform poorly on highly imbalanced datasets. A classifier trained on an imbalanced dataset can be biased towards the majority class, resulting in a high misclassification rate for the minority class. For many applications, especially medical diagnosis, it is highly important to accurately distinguish false negative from false positive results. The purpose of this study is to evaluate the performance of a classifier while keeping an appropriate balance between sensitivity and specificity, in order to support trauma outcome prediction. We compare the standard (or classic) SVM (C SVM) with resampling methods and a cost sensitive method called Two Cost SVM (TC SVM), all of which are widely accepted strategies for imbalanced datasets. The results are discussed in terms of sensitivity analysis and receiver operating characteristic (ROC) curves.
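One common way to make a linear SVM cost sensitive is to weight the hinge loss differently for the two classes. The sketch below shows that general idea only; it is not taken from the study, and the TC SVM it describes may be formulated differently. The function name `two_cost_objective`, the cost parameters `c_pos` and `c_neg`, and the toy data are assumptions.

```python
def two_cost_objective(w, b, X, y, c_pos, c_neg):
    """Soft-margin SVM objective with separate costs per class.

    Labels in y are +1 (positive, e.g. minority class) or -1.
    Errors on positives are weighted by c_pos, on negatives by c_neg.
    """
    margin_penalty = 0.5 * sum(wi * wi for wi in w)
    weighted_loss = 0.0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) + b
        hinge = max(0.0, 1.0 - yi * score)
        weighted_loss += (c_pos if yi == 1 else c_neg) * hinge
    return margin_penalty + weighted_loss


# Made-up one-dimensional data, for illustration only.
X = [[2.0], [0.5], [-2.0], [0.5]]
y = [1, 1, -1, -1]
w, b = [1.0], 0.0

equal_cost = two_cost_objective(w, b, X, y, c_pos=1.0, c_neg=1.0)
minority_weighted = two_cost_objective(w, b, X, y, c_pos=10.0, c_neg=1.0)
print(equal_cost, minority_weighted)
```

Raising `c_pos` relative to `c_neg` makes margin violations on the positive class more expensive, which pushes the fitted boundary towards higher sensitivity at the expense of specificity.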
Abstract: Cancer is a complex disease where various types of molecular aberrations drive the development and progression of malignancies. Among the diverse molecular aberrations, inherited and somatic mutations on DNA sequences are considered as major drivers for oncogenesis. The complexity of somatic alterations is revealed from large-scale investigations of cancer genomes and robust methods for inferring the function of genes. In this review, we will describe sequence mutations of several cancer-related genes and discuss their functional implications in cancer. In addition, we will introduce the on-line resources for accessing and analyzing sequence mutations in cancer. We will also provide an overview of the statistical and computational approaches and future prospects to conduct comprehensive analyses of the somatic alterations in cancer genomes.
Anemia, especially among children, is a serious public health problem in Bangladesh. Apart from understanding the factors associated with anemia, it may be of interest to know the likelihood of anemia given those factors. Prediction of disease status is key to community and health service policy making as well as to forecasting for resource planning. We considered machine learning (ML) algorithms to predict anemia status among children (under five years) using common risk factors as features. Data were extracted from a nationally representative cross-sectional survey, the Bangladesh Demographic and Health Survey (BDHS) conducted in 2011. In this study, we selected a sample of 2013 children for whom data on all selected variables were available. We used several ML algorithms, namely linear discriminant analysis (LDA), classification and regression trees (CART), k-nearest neighbors (k-NN), support vector machines (SVM), random forest (RF) and logistic regression (LR), to predict childhood anemia status. A systematic evaluation of the algorithms was performed in terms of accuracy, sensitivity, specificity, and area under the curve (AUC). We found that the RF algorithm achieved the best classification accuracy of 68.53% with a sensitivity of 70.73%, specificity of 66.41% and AUC of 0.6857. On the other hand, the classical LR algorithm reached a classification accuracy of 62.75% with a sensitivity of 63.41%, specificity of 62.11% and AUC of 0.6276. Among all considered algorithms, k-NN gave the lowest accuracy. We conclude that ML methods can be considered in addition to the classical regression techniques when the prediction of anemia is the primary focus.
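The evaluation metrics named above follow directly from the confusion matrix of a binary classifier. The sketch below computes accuracy, sensitivity, and specificity; the helper names and the label vectors are made up for illustration and are unrelated to the BDHS data.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, sensitivity (true positive rate) and specificity."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical labels: 1 = anemic, 0 = not anemic.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Reporting sensitivity and specificity alongside accuracy shows whether a classifier's errors fall mainly on the positive or the negative class, which accuracy alone hides.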
Technological advances in software development have taken care of many technical details, which made life easier for data analysts but also allowed nonexperts in statistics and computer science to analyze data. As a result, medical research suffers from statistical errors that could otherwise be prevented, such as errors in choosing a hypothesis test and in checking model assumptions. Our objective is to create an automated data analysis software package that can help practitioners run non-subjective, fast, accurate and easily interpretable analyses. We used machine learning to predict the normality of a distribution as an alternative to normality tests and graphical methods, in order to avoid their downsides. We implemented methods for detecting outliers, imputing missing values, and choosing a threshold for cutting numerical variables to correct for non-linearity before running a linear regression. We showed that data analysis can be automated. Our normality prediction algorithm outperformed the Shapiro-Wilk test in small samples, with a Matthews correlation coefficient of 0.5 vs. 0.16. The biggest drawback was that we did not find alternatives to the statistical tests of linear regression assumptions, which are problematic in large datasets. We also applied our work to a dataset about smoking in teenagers. Because of the open-source nature of our work, these algorithms can be used in future research and projects.
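The Matthews correlation coefficient used for the comparison above is defined from the four confusion-matrix counts. The sketch below implements the standard formula; the function name `matthews_corrcoef_from_counts` and the example counts are assumptions and do not reproduce the reported values of 0.5 and 0.16.

```python
import math


def matthews_corrcoef_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    MCC = (tp*tn - fp*fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn)).
    Returns 0.0 when the denominator is zero, a common convention.
    """
    denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Made-up counts, for illustration only.
mcc_example = matthews_corrcoef_from_counts(tp=3, tn=4, fp=2, fn=1)
mcc_perfect = matthews_corrcoef_from_counts(tp=5, tn=5, fp=0, fn=0)
print(mcc_example, mcc_perfect)
```

The coefficient ranges from -1 to 1, with 0 corresponding to chance-level prediction, so it remains informative when the two classes (here, normal and non-normal samples) are unevenly represented.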