Abstract: Clustering is an important task in a wide variety of application domains, especially in management and social science research. In this paper, an iterative clustering procedure based on multivariate outlier detection is proposed using the Mahalanobis distance. First, the Mahalanobis distance is calculated for the entire sample, and an upper control limit (UCL) is fixed using the T²-statistic. Observations above the UCL are treated as outliers and grouped into an outlier cluster, and the same procedure is repeated for the remaining inliers until the variance-covariance matrix of the variables in the last cluster becomes singular. At each iteration, a multivariate test of means is used to check the discrimination between the outlier cluster and the inliers. Multivariate control charts are also used to visualize the iterations and the outlier clustering process graphically. Finally, multivariate tests of means help to establish cluster discrimination and validity. The procedure is applied to cluster 275 customers of a famous two-wheeler in India based on 19 different attributes of the two-wheeler and its company. The results confirm that there exist 5 and 7 outlier clusters of customers in the entire sample at the 5% and 1% significance levels, respectively.
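The iterative peeling idea described in this abstract can be sketched in a few lines. The following is a minimal illustration only, not the authors' implementation. It assumes bivariate data, so that the covariance matrix can be inverted in closed form, and it replaces the T²-based UCL with the large-sample chi-square approximation, which for two variables has the closed form -2 ln(alpha). The function names and the singularity tolerance `det_tol` are hypothetical.

```python
import math

def mean_cov_2d(points):
    """Sample mean and covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis_sq(point, mean, cov):
    """Squared Mahalanobis distance using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det

def peel_outlier_clusters(points, alpha=0.05, det_tol=1e-10, max_iter=50):
    """Repeatedly split off the points whose distance exceeds the UCL.

    Assumption: the UCL is the chi-square(2) upper-alpha quantile,
    -2 * ln(alpha), used here as a stand-in for a T^2-based limit.
    """
    ucl = -2.0 * math.log(alpha)
    clusters = []
    remaining = list(points)
    for _ in range(max_iter):
        if len(remaining) < 3:
            break
        mean, cov = mean_cov_2d(remaining)
        det = cov[0] * cov[2] - cov[1] ** 2
        if det <= det_tol:  # covariance matrix is (numerically) singular
            break
        d2 = [mahalanobis_sq(p, mean, cov) for p in remaining]
        outliers = [p for p, d in zip(remaining, d2) if d > ucl]
        if not outliers:
            break
        clusters.append(outliers)
        remaining = [p for p, d in zip(remaining, d2) if d <= ucl]
    return clusters, remaining
```

Calling `peel_outlier_clusters(points, alpha=0.01)` returns the list of outlier clusters in the order they were peeled off, together with the final inliers. A smaller `alpha` raises the UCL, so each pass removes fewer points.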
Abstract: Mixtures of Weibull distributions have wide application in modeling heterogeneous data sets. Parameter estimation is one of the most important problems related to mixtures of Weibull distributions. In this paper, we propose an L-moment estimation method for a mixture of two Weibull distributions. The proposed method is compared with the maximum likelihood estimation (MLE) method in terms of bias, mean absolute error, mean total error, and completion time of the algorithm (time) through a simulation study. Applications to real data sets are also given to show the flexibility and potential of the proposed estimation method. The comparison shows that the proposed method performs better than the MLE method.
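For readers unfamiliar with L-moments, the sketch below computes the first three sample L-moments from probability-weighted moments and compares them with the first two theoretical L-moments of a single Weibull distribution. It is background illustration only: it does not implement the mixture estimator proposed in the paper, and the function names are hypothetical.

```python
import math

def sample_l_moments(data):
    """First three sample L-moments via unbiased probability-weighted moments."""
    x = sorted(data)
    n = len(x)
    if n < 3:
        raise ValueError("need at least three observations")
    # b_r weights the i-th order statistic (0-based) by i(i-1).../((n-1)(n-2)...)
    b0 = sum(x) / n
    b1 = sum(i * x[i] for i in range(n)) / (n * (n - 1))
    b2 = sum(i * (i - 1) * x[i] for i in range(n)) / (n * (n - 1) * (n - 2))
    l1 = b0
    l2 = 2.0 * b1 - b0
    l3 = 6.0 * b2 - 6.0 * b1 + b0
    return l1, l2, l3

def weibull_l_moments(scale, shape):
    """Theoretical L-location and L-scale of a single Weibull(scale, shape)."""
    l1 = scale * math.gamma(1.0 + 1.0 / shape)
    l2 = l1 * (1.0 - 2.0 ** (-1.0 / shape))
    return l1, l2
```

An L-moment estimator matches sample quantities such as those returned by `sample_l_moments` to their theoretical counterparts and solves for the parameters. For a two-component mixture, more L-moments and a numerical solver are needed than are shown here.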
Abstract: Scientific interest often centers on characterizing the effect of one or more variables on an outcome. While data mining approaches such as random forests are flexible alternatives to conventional parametric models, they suffer from a lack of interpretability because variable effects are not quantified in a substantively meaningful way. In this paper we describe a method for quantifying variable effects using partial dependence, which produces an estimate that can be interpreted as the effect on the response of a one-unit change in the predictor, averaged over the effects of all other variables. Most importantly, the approach avoids the problems of model misspecification and the implementation challenges in high-dimensional settings encountered with other approaches (e.g., multiple linear regression). We propose and evaluate through simulation a method for constructing a point estimate of this effect size. We also propose and evaluate interval estimates based on a non-parametric bootstrap. The method is illustrated on data used for the prediction of the age of abalone.
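Partial dependence itself is simple to compute for any fitted model. The sketch below fixes one predictor at a chosen value in every row, averages the model's predictions, and then summarizes the resulting curve by its least-squares slope over a grid. Using the slope as the "one-unit" effect is an assumption made for illustration and may differ from the point estimate the paper constructs. The model is any Python callable, and all names are hypothetical.

```python
def partial_dependence(model, rows, j, value):
    """Average prediction when predictor j is set to `value` in every row."""
    total = 0.0
    for row in rows:
        modified = list(row)
        modified[j] = value
        total += model(modified)
    return total / len(rows)

def partial_dependence_slope(model, rows, j, grid):
    """Least-squares slope of the partial dependence curve over `grid`.

    Assumption: a single linear slope is an adequate summary of the
    curve; this is an illustrative choice, not the paper's estimator.
    """
    pd_values = [partial_dependence(model, rows, j, v) for v in grid]
    g_bar = sum(grid) / len(grid)
    p_bar = sum(pd_values) / len(pd_values)
    sxy = sum((g - g_bar) * (p - p_bar) for g, p in zip(grid, pd_values))
    sxx = sum((g - g_bar) ** 2 for g in grid)
    return sxy / sxx
```

An interval estimate could be obtained by resampling `rows` with replacement, refitting the model on each resample, and recomputing the slope each time.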
Abstract: The spread of airborne plant diseases from a propagule source is classically assessed by fitting a gradient curve to aggregated data from field experiments. However, aggregating data reduces the information available about the processes involved in disease spread. To overcome this problem, individual count data can be collected, as was done for the short-distance spread of wheat brown rust. For such data, however, the gradient curve is a limited model because heterogeneity of hosts is ignored and, consequently, overdispersion occurs. We therefore propose a parametric frailty model in which the frailties represent the propensities of hosts to be infected. The model is used to assess dispersal of propagules and heterogeneity of hosts.
Abstract: We extend propensity score methodology to incorporate survey weights from complex survey data and compare the use of multiple linear regression and propensity score analysis to estimate treatment effects in observational data from a complex survey. For illustration, we use these two methods to estimate the effect of gender on information technology (IT) salaries. In our analysis, both methods agree on the size and statistical significance of the overall gender salary gaps in the United States in four different IT occupations after controlling for educational and job-related covariates. Each method, however, has its own advantages which are discussed. We also show that it is important to incorporate the survey design in both linear regression and propensity score analysis. Ignoring the survey weights affects the estimates of population-level effects substantially in our analysis.
Abstract: Data collection for landslide susceptibility modelling is often an almost prohibitive activity. For this reason, landslides have for some time been described and modelled on the basis of spatially distributed values of landslide-related attributes. This paper presents a landslide susceptibility analysis of the Selangor area, Malaysia, using an artificial neural network model with the aid of remote sensing data and geographic information system (GIS) tools. To meet the objectives, landslide locations were identified in the study area from the interpretation of aerial photographs, supported by extensive field surveys. The landslide inventory was then grouped into two categories: (1) training data and (2) testing data. Further, topographical and geological data and satellite images were collected, processed, and assembled into a spatial database using GIS tools and image processing techniques. Nine landslide occurrence attributes were selected and analyzed using an artificial neural network model to generate the landslide susceptibility maps. Landslide location data (training data) were used for training the neural network, and five training sites were selected randomly in this case. The ensemble of five training sites was used to investigate the model reliability, including the role of the thematic variables used to construct the model and the model sensitivity to changes in the selection of the training sites. By studying the variation of the neural network’s susceptibility estimate, the error associated with the model is determined. The results of the neural network analysis are shown on five sets of landslide susceptibility maps. The susceptibility maps were then validated using the receiver operating characteristic (ROC) method as a measure for model verification. Landslide location data that were not used during the training of the neural network were used for the verification of the maps.
The results of the analysis were verified using the landslide location data and compared across the five cases. Qualitatively, the model appears to give reasonable results, with observed accuracies of 87%, 83%, 85%, 86% and 82% for the five training sites, respectively.
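ROC-based validation of a susceptibility map is often summarized by the area under the ROC curve. Whether the accuracies quoted above are areas under the curve is an assumption here, not something the abstract states. The sketch computes that area directly as the proportion of (landslide, non-landslide) cell pairs in which the landslide cell receives the higher susceptibility score, counting ties as one half. The function name and the 0/1 label convention are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for an observed landslide cell, 0 otherwise.
    scores: model susceptibility values for the same cells.
    """
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

The pairwise loop is quadratic in the number of cells, which is fine for a small validation sample. A rank-based formula would be preferable for full raster maps.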
Abstract: Incorporating an additional parameter is a common way of generalizing or extending an existing probability distribution in statistical distribution theory and practice. In most cases, such extensions provide a better fit to real-life situations than the existing distributions. In this article, we propose and study a two-parameter probability distribution, called the quasi xgamma distribution, as an extension or generalization of the xgamma distribution (Sen et al. 2016) for modeling lifetime data. Important distributional properties, along with survival characteristics and distributions of order statistics, are studied in detail. The method of maximum likelihood and the method of moments are proposed and described for parameter estimation. A data generation algorithm is proposed, supported by a Monte Carlo simulation study that describes the mean square errors of the estimates for different sample sizes. A bladder cancer survival data set is used to illustrate the application and suitability of the proposed distribution as a potential survival model.
Abstract: The use of multiple regression analysis (MRA) has been on the rise over the last few decades, in part due to the realization that analysis of variance (ANOVA) statistics can be advantageously computed using MRA. Given the limitations of ANOVA strategies, it is argued that MRA is the better analysis; however, in order to carry out ANOVA within MRA, the researcher must employ coding structures, which can be confusing to understand. The present paper attempts to simplify this discussion by providing a description of the most popular coding structures, with emphasis on their strengths, limitations, and uses. A visual analysis of each of these strategies is also included, along with all necessary steps to create the contrasts. Finally, a decision tree is presented that researchers can use to determine which coding structure to utilize in their current research project.
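Two widely used coding structures, dummy coding and effect coding, can be generated as follows. The abstract does not list which structures the paper covers, so these two are chosen only as familiar examples. The sketch also derives the regression coefficients implied by a set of group means in a one-way design, which shows how the intercept's meaning changes between the two codings. Function names and the default reference level (the last one) are hypothetical.

```python
def code_matrix(k, scheme="dummy", reference=None):
    """Rows are the k group levels; columns are the k-1 coded predictors."""
    ref = k - 1 if reference is None else reference
    non_ref = [g for g in range(k) if g != ref]
    rows = []
    for g in range(k):
        if g == ref and scheme == "effect":
            rows.append([-1] * (k - 1))  # reference level is coded -1 throughout
        else:
            rows.append([1 if g == lev else 0 for lev in non_ref])
    return rows

def implied_coefficients(means, scheme="dummy", reference=None):
    """Intercept and slopes of the saturated one-way model under each coding."""
    k = len(means)
    ref = k - 1 if reference is None else reference
    non_ref = [g for g in range(k) if g != ref]
    if scheme == "dummy":
        intercept = means[ref]            # reference group mean
    else:
        intercept = sum(means) / k        # unweighted mean of group means
    slopes = [means[lev] - intercept for lev in non_ref]
    return intercept, slopes

def fitted_means(means, scheme="dummy", reference=None):
    """Reconstruct each group mean from the codes and coefficients."""
    intercept, slopes = implied_coefficients(means, scheme, reference)
    codes = code_matrix(len(means), scheme, reference)
    return [intercept + sum(b * c for b, c in zip(slopes, row)) for row in codes]
```

Under dummy coding each slope is a difference from the reference group. Under effect coding each slope is a deviation from the unweighted grand mean, so the choice of coding changes the hypothesis each coefficient tests without changing the fitted group means.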
Abstract: Graphical procedures can be useful for illustrating and evaluating the process of inverse regression. We first review some simple and well-known graphical approaches for univariate linear and nonlinear models. We then propose a new graphical tool applicable to situations where the response is bivariate and repeated measures data are available. The proposed method is illustrated with an example of the age determination of tern chicks using measurements on body weight and wing length.
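As a reminder of what inverse regression means in the simplest univariate linear case, the sketch below fits a straight line of the response on the predictor and then inverts it to estimate the predictor value for a newly observed response. It illustrates the classical calibration estimator only and is unrelated to the bivariate graphical tool proposed in the paper. The function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares intercept and slope of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def inverse_predict(y_new, intercept, slope):
    """Classical inverse (calibration) estimate of x for an observed y."""
    if slope == 0:
        raise ValueError("slope is zero; the fitted line cannot be inverted")
    return (y_new - intercept) / slope
```

In an age-determination setting, `xs` would hold known ages and `ys` a body measurement. A new measurement is then mapped back to an estimated age with `inverse_predict`.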