Abstract: The development and application of computational data mining techniques in financial fraud detection and business failure prediction has become a popular cross-disciplinary research area in recent times involving financial economists, forensic accountants and computational modellers. Some of the computational techniques popularly used in the context of financial fraud detection and business failure prediction can also be effectively applied in the detection of fraudulent insurance claims and therefore, can be of immense practical value to the insurance industry. We provide a comparative analysis of prediction performance of a battery of data mining techniques using real-life automotive insurance fraud data. While the data we have used in our paper is US-based, the computational techniques we have tested can be adapted and generally applied to detect similar insurance frauds in other countries as well where an organized automotive insurance industry exists.
Abstract: Information fusion has become a powerful tool for challenging applications such as biological prediction problems. In this paper, we apply a new information-theoretical fusion technique to HIV-1 protease cleavage site prediction, which is a problem that has been in the focus of much interest and investigation of the machine learning community recently. It poses a difficult classification task due to its high dimensional feature space and a relatively small set of available training patterns. We also apply a new set of biophysical features to this problem and present experiments with neural networks, support vector machines, and decision trees. Application of our feature set results in high recognition rates and concise decision trees, producing manageable rule sets that can guide future experiments. In particular, we found a combination of neural networks and support vector machines to be beneficial for this problem.
Abstract: To identify the stand attributes that best explain the variability in wood density, Pinus radiata plantations located in the Chilean coastal sector were studied and modeled. The study area corresponded to stands located in sedimentary soil between the zones of Constituci on and Cobquecura. Within each sampling sector, individual tree variables were recorded and the most relevant stand parameters were estimated. Fifty trees were sampled in each sector, obtaining from each one six wood discs from different stem heights. Each disc was weighed in green and then dried to anhydrous weight, and its basic density was calculated. The profile identification to classify basic density according to stand characteristics was performed through regression trees, a technique based in the use of predictor variables to partition the database using recursive algorithms in regions with similar responses. The objective of the regression tree method is to obtain highly homogenous groups (branches), which are identified using pruning techniques that successively eliminate the branches that least contribute to the classification of the variable of interest. The results found that the stand attributes that contributed significantly to basic density classification were the basal area, the number of trees per hectare, and the mean height.
Abstract: Exploratory data analysis has become more important as large rich data sets become available, with many explanatory variables representing competing theoretical constructs. The restrictive assumptions of linearity and additivity of effects as in regression are no longer necessary to save degrees of freedom. Where there is a clear criterion (dependent) variable or classification, sequential binary segmentation (tree) programs are being used. We explain why, using the current enhanced version (SEARCH) of the original Automatic Interaction Detector program as an illustration. Even the simple example uncovers an interaction that might well have been missed with the usual multivariate regression. We then suggest some promising uses and provide one simple example.