For statistical classification problems where the total sample size is slightly greater than the feature dimension, regularized statistical discriminant rules may reduce classification error rates. We review ten dispersion-matrix regularization approaches, four for the pooled sample covariance matrix, four for the inverse pooled sample covariance matrix, and two for a diagonal covariance matrix, for use in Anderson’s (1951) linear discriminant function (LDF). We compare these regularized classifiers against the traditional LDF for a variety of parameter configurations, and use the estimated expected error rate (EER) to assess performance. We also apply the regularized LDFs to a well-known real-data example on colon cancer. We found that no regularized classifier uniformly outperformed the others. However, we found that the more contemporary classifiers (e.g., Thomaz and Gillies, 2005; Tong et al., 2012; and Xu et al., 2009) tended to outperform the older classifiers, and that certain simple methods (e.g., Pang et al., 2009; Thomaz and Gillies, 2005; and Tong et al., 2012) performed very well, calling into question the need for involved cross-validation in estimating regularization parameters. Nonetheless, an older regularized classifier proposed by Smidt and McDonald (1976) yielded consistently low misclassification rates across all scenarios, regardless of the shape of the true covariance matrix. Finally, our simulations showed that regularized classifiers that relied primarily on asymptotic approximations with respect to the training sample size rarely outperformed the traditional LDF, and are thus not recommended. We discuss our results as they pertain to the effect of high dimension, and offer general guidelines for choosing a regularization method for poorly-posed problems.
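To make the regularization idea concrete, the following sketch shrinks the pooled sample covariance toward a scaled identity before inverting it in the LDF. The fixed weight `lam`, the identity shrinkage target, and the simulated data are illustrative choices, not any of the ten published rules reviewed above.

```python
import numpy as np

def shrinkage_ldf(X1, X2, lam=0.5):
    """LDF with a ridge-type regularized pooled covariance (illustrative)."""
    n1, n2, p = len(X1), len(X2), X1.shape[1]
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S = ((n1 - 1) * np.cov(X1, rowvar=False) +
         (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    # Shrink toward a scaled identity so the matrix stays well conditioned
    # when the sample size is only slightly greater than the dimension.
    S_reg = (1 - lam) * S + lam * (np.trace(S) / p) * np.eye(p)
    w = np.linalg.solve(S_reg, m1 - m2)
    c = w @ (m1 + m2) / 2
    return lambda x: np.where(x @ w >= c, 1, 2)

# poorly-posed setting: sample size only slightly larger than dimension
rng = np.random.default_rng(0)
X1 = rng.normal(1.0, 1.0, size=(30, 25))
X2 = rng.normal(0.0, 1.0, size=(30, 25))
clf = shrinkage_ldf(X1, X2)
print((clf(X1) == 1).mean(), (clf(X2) == 2).mean())
```

With `lam = 0`, the rule reduces to the traditional LDF, which is unstable here because the pooled covariance is nearly singular.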
PM2.5 is a major air pollutant that is strongly associated with serious cardiopulmonary diseases such as asthma and cancers of the lung, trachea, and bronchus. As of 2014, a World Health Organization (WHO) air quality model confirmed that 92% of the world’s population lived in areas where air quality levels exceeded the WHO limit (10 µg/m3). This indicates that PM2.5 remains one of the most serious worldwide problems, and that monitoring PM2.5 concentrations is extremely necessary. In this paper, we propose a simple and flexible spatial-temporal Gaussian mixture model to analyze annual average PM2.5 concentrations. Because of the bimodal distribution of PM2.5 concentrations, we adopt a two-component Gaussian mixture model with county-year-level spatial-temporal random effects. A Markov chain Monte Carlo (MCMC) algorithm is used to estimate the model parameters.
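A two-component Gaussian mixture of the kind described here can be sketched with a plain EM fit; the paper uses MCMC and includes county-year spatial-temporal random effects, both of which this univariate toy omits, and the synthetic data are not real PM2.5 measurements.

```python
import numpy as np

def em_gauss_mix2(x, iters=200):
    """EM for a two-component univariate Gaussian mixture (toy version)."""
    mu = np.quantile(x, [0.25, 0.75])   # crude initial component means
    sd = np.full(2, x.std())
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each component for each observation
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / sd
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted mixing proportions, means, standard deviations
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sd = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return pi, mu, sd

# bimodal synthetic "concentrations" with two well-separated modes
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(8, 2, 400), rng.normal(20, 4, 600)])
pi, mu, sd = em_gauss_mix2(x)
print(np.round(np.sort(mu), 1))
```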
Technological advances in software development have handled many technical details, making life easier for data analysts but also allowing nonexperts in statistics and computer science to analyze data. As a result, medical research suffers from statistical errors that could otherwise be prevented, such as errors in choosing a hypothesis test and in checking model assumptions. Our objective is to create an automated data analysis software package that can help practitioners run objective, fast, accurate, and easily interpretable analyses. We used machine learning to predict the normality of a distribution as an alternative to normality tests and graphical methods, avoiding their downsides. We implemented methods for detecting outliers, imputing missing values, and choosing a threshold for discretizing numerical variables to correct for non-linearity before running a linear regression. We showed that data analysis can be automated. Our normality prediction algorithm outperformed the Shapiro-Wilk test in small samples, with a Matthews correlation coefficient of 0.5 vs. 0.16. The biggest drawback was that we did not find alternatives to statistical tests for checking linear regression assumptions, which are problematic in large datasets. We also applied our work to a dataset on smoking in teenagers. Because of the open-source nature of our work, these algorithms can be used in future research and projects.
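The normality-prediction idea can be sketched as a classifier trained on moment features of small samples. The features (absolute skewness and excess kurtosis), the logistic learner, and the exponential alternative below are our own illustrative assumptions, not the package's actual design.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(s):
    """Sample skewness and excess kurtosis (illustrative feature choice)."""
    z = (s - s.mean()) / s.std()
    return np.array([abs((z ** 3).mean()), abs((z ** 4).mean() - 3.0)])

def make_data(m, n=20):
    """m small samples each from a normal (label 0) and a skewed (label 1) law."""
    X, y = [], []
    for _ in range(m):
        X.append(features(rng.normal(size=n))); y.append(0)       # normal
        X.append(features(rng.exponential(size=n))); y.append(1)  # non-normal
    return np.array(X), np.array(y)

# tiny logistic regression trained by gradient descent
Xtr, ytr = make_data(500)
w, b = np.zeros(2), 0.0
for _ in range(2000):
    p = 1 / (1 + np.exp(-(Xtr @ w + b)))
    w -= 0.1 * Xtr.T @ (p - ytr) / len(ytr)
    b -= 0.1 * (p - ytr).mean()

# holdout accuracy of "is this small sample non-normal?"
Xte, yte = make_data(500)
acc = (((Xte @ w + b) > 0) == yte).mean()
print(round(acc, 2))
```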
Although credit score models have been widely applied, one of their important variables, the Merchant Category Code (MCC), is sometimes misused. MCC misuse may cause errors in credit scoring systems. The present study develops and deploys an MCC misuse detection system built on ensemble models, gives insights into the development process, and compares different machine learning methods. XGBoost exhibited the best performance, with an overall error of 0.1095, sensitivity of 0.7777, specificity of 0.9672, F_1 score of 0.8518, AUC of 0.9095, and PRAUC of 0.9090. MCC misuse by merchants can thus be predicted with satisfactory accuracy using our ensemble-based detection system. The paper not only shows that MCC misuse cannot be overlooked but also helps researchers and practitioners apply ensemble machine learning based detection systems to similar problems.
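The reported error, sensitivity, specificity, and F_1 score all derive from a binary confusion matrix; a minimal sketch of that evaluation step follows. The toy label vectors are made up, and the XGBoost model itself is not reproduced here.

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix summaries for a binary misuse detector."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    sens = tp / (tp + fn)            # recall on the misuse class
    spec = tn / (tn + fp)
    prec = tp / (tp + fp)
    return {"error": (fp + fn) / len(y_true),
            "sensitivity": sens,
            "specificity": spec,
            "F1": 2 * prec * sens / (prec + sens)}

m = binary_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1])
print(m)
```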
Brand Cluster is proposed against the background of evolving consumption modes and concepts, as well as the brand preferences of different categories of consumers. Supported by inter-urban, inter-category, and inter-brand big data, and after deep learning and in-depth analysis of the consumption relations among different brands, Brand Cluster was developed to reflect the characteristics of diverse consumers. We try to understand the inner features of 18 clusters of brands and what these clusters look like in different cities, which underlies brand owners’ practice of city site selection. Brand Cluster is believed to reveal the relationships between “allies” of brands from a whole new angle and at scale. In addition, the make-up of brand clusters in different cities indicates whether a new city is appropriate for brand owners to expand into.
Most research on housing price modeling utilizes linear regression models. These studies mostly describe the actual contribution of each factor linearly, in both magnitude and sign. The goal of this paper is to identify the non-linear patterns for three major types of real estate through model building that includes 49 housing factors. The dataset comprises 33,027 transactions in Taipei City from July 2013 to the end of 2016. The non-linear patterns appear as combinations of sequences of uptrends and downtrends derived from Generalized Additive Models (GAMs).
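The uptrend/downtrend sequences can be illustrated by smoothing a factor's partial effect and reading off the sign of the fitted slope. Here a polynomial least-squares smooth stands in for the paper's GAM splines, and the data are synthetic with known turning points.

```python
import numpy as np

def trend_pattern(x, y, deg=5):
    """Fit a polynomial smooth and return its uptrend/downtrend sequence."""
    coef = np.polyfit(x, y, deg)
    grid = np.linspace(x.min(), x.max(), 200)
    slope = np.polyval(np.polyder(coef), grid)
    pattern = []
    for s in np.sign(slope):
        lab = "up" if s > 0 else "down"
        if not pattern or pattern[-1] != lab:   # collapse repeated labels
            pattern.append(lab)
    return pattern

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 300)
y = x**3 - 15 * x**2 + 63 * x + rng.normal(0, 0.5, 300)  # turns at x = 3 and 7
pattern = trend_pattern(x, y)
print(pattern)
```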
As a robust data analysis technique, quantile regression has attracted extensive interest. In this study, the weighted quantile regression (WQR) technique is developed based on the sparsity function. We first consider the linear regression model and show that the relative efficiency of WQR compared with least squares (LS) and composite quantile regression (CQR) is greater than 70% regardless of the error distribution. To make the proposed method practically more useful, we consider two nontrivial extensions. The first concerns a nonparametric model: a local WQR estimate is introduced to explore the nonlinear data structure and is shown to be much more efficient than other estimates under various non-normal error distributions. The second extension concerns a multivariate problem where variable selection is needed along with regularization. We couple WQR with penalization and show that, under mild conditions, the penalized WQR enjoys the oracle property. WQR has an intuitive formulation and can be easily implemented. Simulation is conducted to examine its finite sample performance and to compare it against alternatives. Analysis of a mammal dataset is also conducted. The numerical studies are consistent with the theoretical findings and indicate the usefulness of WQR.
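The check-loss formulation behind WQR can be sketched with a grid search for a single slope. The quantile levels, weights, and data below are illustrative stand-ins (the paper derives its weights from the sparsity function, which is not reproduced here).

```python
import numpy as np

def check_loss(u, tau):
    """Quantile check loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (u < 0))

def wqr_slope(x, y, taus, weights, grid):
    """Weighted quantile regression slope for y = b*x + error, by grid
    search; each quantile level keeps its own intercept, CQR-style."""
    best_b, best_loss = None, np.inf
    for b in grid:
        r = y - b * x
        loss = sum(w * check_loss(r - np.quantile(r, t), t).sum()
                   for t, w in zip(taus, weights))
        if loss < best_loss:
            best_b, best_loss = b, loss
    return best_b

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 500)
y = 1.5 * x + rng.standard_t(3, 500)                  # heavy-tailed errors
taus, weights = [0.25, 0.5, 0.75], [0.25, 0.5, 0.25]  # illustrative weights
b_hat = wqr_slope(x, y, taus, weights, np.linspace(0, 3, 301))
print(round(b_hat, 2))
```

Under heavy-tailed errors such as these, the check-loss fit is noticeably more stable than least squares, which is the robustness motivation above.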
Owing to its suitability for modeling data with a high degree of positive skewness, a typical characteristic of claim amounts, the Weibull distribution is considered a versatile model for loss modeling in general insurance. In this paper, the Weibull distribution is fitted to a set of insurance claim data, and the probability of ultimate ruin under Weibull-distributed claims is computed using two methods, namely the fast Fourier transform (FFT) and the four-moment gamma De Vylder approximation. The values obtained from the two methods are found to be consistent. For the same model, the first two moments of the time to ruin, the deficit at the time of ruin, and the surplus just prior to ruin are computed numerically. The moments exhibit behavior consistent with what is expected in practical scenarios. When the surplus process is subject to interest earnings and tax payments, the probability of ultimate ruin is higher than in their absence.
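The FFT route to the ultimate ruin probability can be sketched via the Pollaczek-Khinchine compound-geometric representation. The grid step, truncation length, and parameter values below are our own choices, and the sketch is sanity-checked against the exponential special case (Weibull shape 1), which has a closed form.

```python
import numpy as np
from math import gamma, exp

def ruin_prob_fft(k, lam, theta, u, h=0.001, m=2**19):
    """Ultimate ruin probability for Weibull(k, lam) claims and safety
    loading theta, via Pollaczek-Khinchine and the FFT (illustrative)."""
    x = np.arange(m) * h
    surv = np.exp(-(x / lam) ** k)      # Weibull survival function
    mu = lam * gamma(1 + 1 / k)         # mean claim size
    fe = surv / mu * h                  # discretized equilibrium density
    q = 1 / (1 + theta)
    # density of the maximal aggregate loss: a compound geometric sum,
    # inverted from its transform (1 - q) / (1 - q * fft(fe))
    g = np.real(np.fft.ifft((1 - q) / (1 - q * np.fft.fft(fe))))
    return 1 - np.cumsum(g)[int(u / h)]

# shape k = 1 reduces to exponential claims, where psi(u) is known exactly
theta, lam, u = 0.2, 1.0, 5.0
psi_fft = ruin_prob_fft(1.0, lam, theta, u)
psi_exact = exp(-theta * u / ((1 + theta) * lam)) / (1 + theta)
print(round(psi_fft, 4), round(psi_exact, 4))
```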
Anemia, especially among children, is a serious public health problem in Bangladesh. Apart from understanding the factors associated with anemia, it may be of interest to know the likelihood of anemia given the factors. Prediction of disease status is a key to community and health service policy making, as well as to forecasting for resource planning. We considered machine learning (ML) algorithms to predict the anemia status of children (under five years) using common risk factors as features. Data were extracted from a nationally representative cross-sectional survey, the Bangladesh Demographic and Health Survey (BDHS), conducted in 2011. In this study, a sample of 2013 children for whom data on all selected variables were available was selected. We used several ML algorithms, such as linear discriminant analysis (LDA), classification and regression trees (CART), k-nearest neighbors (k-NN), support vector machines (SVM), random forest (RF), and logistic regression (LR), to predict childhood anemia status. A systematic evaluation of the algorithms was performed in terms of accuracy, sensitivity, specificity, and area under the curve (AUC). We found that the RF algorithm achieved the best classification accuracy of 68.53%, with a sensitivity of 70.73%, specificity of 66.41%, and AUC of 0.6857. On the other hand, the classical LR algorithm reached a classification accuracy of 62.75%, with a sensitivity of 63.41%, specificity of 62.11%, and AUC of 0.6276. Among all considered algorithms, k-NN gave the lowest accuracy. We conclude that ML methods can be considered in addition to the classical regression techniques when the prediction of anemia is the primary focus.
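A minimal version of one of the compared algorithms, k-NN, together with the accuracy/sensitivity/specificity evaluation described above; the synthetic features and labels are stand-ins for the BDHS variables, not the survey data.

```python
import numpy as np

def knn_predict(Xtr, ytr, Xte, k=5):
    """Plain k-nearest-neighbour majority vote (Euclidean distance)."""
    d = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(axis=-1)
    idx = np.argsort(d, axis=1)[:, :k]
    return (ytr[idx].mean(axis=1) > 0.5).astype(int)

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))                       # synthetic risk factors
y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.5, 400) > 0).astype(int)
Xtr, ytr, Xte, yte = X[:300], y[:300], X[300:], y[300:]

pred = knn_predict(Xtr, ytr, Xte)
acc = (pred == yte).mean()          # overall accuracy
sens = pred[yte == 1].mean()        # sensitivity: recall on positives
spec = 1 - pred[yte == 0].mean()    # specificity: recall on negatives
print(round(acc, 2), round(sens, 2), round(spec, 2))
```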
Compositional data are positive multivariate data constrained to lie within the simplex. Regression analysis of such data has been studied and many regression models have been proposed, but most of them do not allow for zero values. Further, the case of compositional data on the predictor side has attracted little research interest, and, surprisingly, the case of both the response and the predictor variables being compositional has not been widely studied. This paper suggests a solution for that last problem. Principal components regression using the α-transformation and the Kullback-Leibler divergence are the key elements of the proposed approach. An advantage of this approach is that zero values are allowed on both the response and the predictor side. Simulation studies and examples with real data illustrate the performance of our algorithm.
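The α-transformation used here (shown without the final Helmert rotation) is a power transform that tends to the centred log-ratio as α → 0 while remaining finite at zero components for α > 0; the composition below is a made-up example.

```python
import numpy as np

def alpha_transform(x, a):
    """alpha-transformation of a composition (Helmert rotation omitted)."""
    D = x.shape[-1]
    z = x ** a / (x ** a).sum(axis=-1, keepdims=True)  # power-transformed
    return (D * z - 1) / a

x = np.array([0.2, 0.5, 0.3])
print(np.round(alpha_transform(x, 0.5), 3))
```

With a zero component, say x = (0, 0.5, 0.5), the transform stays finite for any a > 0, which is the zero-value advantage noted above; the log-ratio transforms would diverge there.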