Abstract: We consider a fully Bayesian treatment of radial basis function regression and propose a solution to the instability of basis selection. When bases are selected solely according to the magnitude of their posterior inclusion probabilities, many bases in the same neighborhood often end up being selected, leading to redundancy and ultimately inaccuracy in the representation. In this paper, we propose a straightforward solution to the problem based on post-processing the sample path yielded by the model space search technique. Specifically, we perform an a posteriori model-based clustering of the sample path via a mixture of Gaussians, and then select the points closest to the means of the Gaussians. Our solution is found to be more stable and yields better performance on simulated and real tasks.
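As an illustration of this post-processing step, the following Python sketch fits a Gaussian mixture to the sampled basis-function centers and keeps, for each component, the sampled center closest to the fitted mean. The array name `sampled_centers`, the toy data, and the component count are placeholders, not the authors' settings.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_stable_centers(sampled_centers, n_components):
    """Cluster the sample path of basis centers with a Gaussian mixture
    and keep one representative per component: the sampled point that
    lies closest to the fitted component mean."""
    gmm = GaussianMixture(n_components=n_components, random_state=0)
    gmm.fit(sampled_centers)
    selected = []
    for mean in gmm.means_:
        dist = np.linalg.norm(sampled_centers - mean, axis=1)
        selected.append(sampled_centers[np.argmin(dist)])
    return np.array(selected)

# Placeholder sample path: three redundant clusters of 1-D centers
rng = np.random.default_rng(0)
sampled_centers = np.concatenate(
    [rng.normal(m, 0.2, size=(150, 1)) for m in (0.0, 3.0, 6.0)])
centers = select_stable_centers(sampled_centers, n_components=3)
```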
Abstract: This paper presents an investigation of treatment effect estimation using different matching methods through Monte Carlo simulation. The study proposes a new method, largest caliper matching, that is computationally efficient and convenient in implementation, and compares its performance with five other popular matching methods. The bias, empirical standard deviation, and mean square error of the estimates in the simulation are examined under different treatment prevalences and different distributions of covariates. It is shown that largest caliper matching improves estimation of the population treatment effect in a wide range of settings compared to the other methods. It reduces the bias when the data exhibit selection on observables and treatment imbalance. Findings about the relative performance of the different matching methods are also provided to help practitioners determine which method should be used in a given situation. The methods are applied to data from the Study to Understand Prognoses and Preferences for Outcomes and Risks of Treatments (SUPPORT), and important demographic and socioeconomic factors that may affect the clinical outcome are also reported.
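The proposed largest caliper matching method is new, so its details are not reproduced here; as background, this is a minimal Python sketch of conventional greedy 1:1 caliper matching on the propensity score. The simulated scores and the caliper of 0.05 are chosen purely for illustration.

```python
import numpy as np

def caliper_match(ps_treated, ps_control, caliper):
    """Greedy 1:1 nearest-neighbour matching on the propensity score;
    a match is accepted only if the score distance is within the caliper."""
    available = list(range(len(ps_control)))
    pairs = []
    for i, p in enumerate(ps_treated):
        if not available:
            break
        j = min(available, key=lambda k: abs(ps_control[k] - p))
        if abs(ps_control[j] - p) <= caliper:
            pairs.append((i, j))
            available.remove(j)
    return pairs  # (treated index, control index) matched pairs

# Simulated propensity scores for treated and control units
rng = np.random.default_rng(1)
ps_t = rng.beta(3, 2, size=50)
ps_c = rng.beta(2, 3, size=200)
matches = caliper_match(ps_t, ps_c, caliper=0.05)
```

The treatment effect estimate is then the mean outcome difference over the matched pairs; unmatched treated units are discarded, which is the usual bias-variance trade-off caliper matching makes.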
Abstract: In this paper, a tree-structured method is proposed to extend the Classification and Regression Trees (CART) algorithm to multivariate survival data, assuming a proportional hazards structure throughout the tree. The method works on the marginal survivor distributions and uses a sandwich estimator of variance to account for the association between survival times. The Wald test statistic is used as the splitting rule, and the survival trees are grown by maximizing between-node separation. The proposed method aims to classify patients into subgroups with distinctively different prognoses. Unlike conventional tree-growing algorithms, which work on a subset of the data at every partition, the proposed method uses the whole data set and searches for the globally optimal split at each partition. The method is applied to prostate cancer data, and its performance is also evaluated by several simulation studies.
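As a sketch of the splitting rule described above (not the authors' implementation), the following Python function uses the lifelines package to fit a marginal Cox model with a robust sandwich variance (via `cluster_col`) and searches all cutpoints of one covariate for the largest Wald statistic of the resulting binary node indicator. All column names are assumptions.

```python
import numpy as np
from lifelines import CoxPHFitter

def best_wald_split(df, covariate, duration_col="time", event_col="event",
                    cluster_col="id"):
    """Search every threshold of one covariate for the split whose binary
    node indicator has the largest robust Wald statistic under a marginal
    Cox model with a sandwich variance estimator."""
    best_cut, best_wald = None, -np.inf
    for cut in np.unique(df[covariate])[:-1]:
        d = df[[duration_col, event_col, cluster_col]].copy()
        d["node"] = (df[covariate] > cut).astype(int)
        cph = CoxPHFitter()
        # cluster_col requests the sandwich estimator, accounting for
        # association between survival times within a cluster
        cph.fit(d, duration_col=duration_col, event_col=event_col,
                cluster_col=cluster_col)
        wald = cph.summary.loc["node", "z"] ** 2
        if wald > best_wald:
            best_cut, best_wald = cut, wald
    return best_cut, best_wald
```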
Abstract: We have developed a tool for model space exploration and variable selection in linear regression models based on a simple spike and slab model (Dey, 2012). The selected model is the one with the minimum final prediction error (FPE) among all candidate models. This is implemented via the R package modelSampler. However, model selection based on the FPE criterion is questionable, as FPE can be sensitive to perturbations of the data. The R package can be used for empirical assessment of the stability of the FPE criterion. Stable model selection is accomplished by a bootstrap wrapper that calls the primary function of the package repeatedly on bootstrapped data. The heart of the method is the notion of model averaging, used both for stable variable selection and for studying the behavior of variables over the entire model space, a concept invaluable in high-dimensional situations.
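modelSampler itself is an R package; purely to illustrate the bootstrap-wrapper idea, the Python sketch below records how often each variable is selected across bootstrap resamples. The callable `select_fn` is a hypothetical stand-in for the package's primary sampler; the resulting selection frequencies are what underpin stable, model-averaged variable selection.

```python
import numpy as np

def bootstrap_selection_frequency(X, y, select_fn, n_boot=200, seed=0):
    """Call a model-selection routine on bootstrap resamples and return
    the fraction of resamples in which each variable is selected.
    `select_fn(X, y)` must return a boolean mask over the columns of X."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        counts += select_fn(X[idx], y[idx])
    return counts / n_boot  # stably selected variables have frequency near 1
```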
Abstract: This paper evaluates the efficacy of a machine learning approach to data fusion using convolved multi-output Gaussian processes in the context of geological resource modeling. It empirically demonstrates that integrating information across multiple sources leads to superior estimates of all the quantities being modeled compared to modeling each quantity individually. Convolved multi-output Gaussian processes provide a powerful approach for simultaneously modeling multiple quantities of interest while taking the correlations between them into consideration. Experiments are performed on large-scale data taken from a mining context.
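To make the construction concrete, here is a minimal numpy sketch of a convolved two-output GP under a simplifying assumption: each output is the convolution of a shared white-noise process with a Gaussian smoothing kernel, which gives closed-form auto- and cross-covariances. The kernel widths, scales, and toy data are all illustrative, not the paper's model.

```python
import numpy as np

def cross_cov(x1, x2, s1, sig1, s2, sig2):
    """Auto-/cross-covariance of two outputs formed by convolving a shared
    white-noise process with Gaussian smoothing kernels (scale s, width sig);
    the convolution of the two kernels is again Gaussian, so the result is
    a squared exponential with combined width."""
    amp = s1 * s2 * np.sqrt(2 * np.pi) * sig1 * sig2 / np.sqrt(sig1**2 + sig2**2)
    d = x1[:, None] - x2[None, :]
    return amp * np.exp(-d**2 / (2 * (sig1**2 + sig2**2)))

rng = np.random.default_rng(2)
x_a, x_b = rng.uniform(0, 5, 30), rng.uniform(0, 5, 20)     # inputs per output
y = np.concatenate([np.sin(x_a), 0.8 * np.sin(x_b) + 0.1])  # toy observations
params = [(1.0, 0.3), (0.8, 0.5)]  # assumed (scale, width) per output
xs = (x_a, x_b)

# Joint covariance over both outputs; cross-blocks couple the outputs
K = np.block([[cross_cov(xi, xj, *pi, *pj) for xj, pj in zip(xs, params)]
              for xi, pi in zip(xs, params)]) + 1e-4 * np.eye(len(y))
alpha = np.linalg.solve(K, y)

# Posterior mean of output 0 at new inputs: borrows data from BOTH outputs
x_star = np.linspace(0, 5, 100)
k_star = np.hstack([cross_cov(x_star, xj, *params[0], *pj)
                    for xj, pj in zip(xs, params)])
mean_0 = k_star @ alpha
```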
Abstract: Methods for testing the equality of two means are of critical importance in many areas of applied statistics. In the microarray context, such tests must often be applied to small samples containing no more than a dozen elements, in which case their power is inevitably low. We suggest an augmentation of the classical t-test by introducing a new test statistic, which we call “bio-weight.” We show by simulation that in the practically important case of small sample sizes, the test based on this statistic is substantially more powerful than the classical t-test. The power computations are accompanied by ROC and FDR analyses of the simulated microarray data.
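The form of the bio-weight statistic is given in the paper; the power simulation itself, however, is straightforward. Below is a minimal Python sketch that estimates the Monte Carlo power of the classical two-sample t-test at n = 6 per group; a bio-weight-style statistic would plug into the same loop once its p-value (e.g. by permutation) is available. The sample size, effect size, and simulation settings are illustrative.

```python
import numpy as np
from scipy import stats

def mc_power(stat_test, n=6, delta=1.5, sims=5000, alpha=0.05, seed=0):
    """Monte Carlo power of a two-sample test for normal samples of size n
    per group whose means differ by delta; stat_test returns a p-value."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(sims):
        x = rng.normal(0.0, 1.0, n)
        y = rng.normal(delta, 1.0, n)
        if stat_test(x, y) < alpha:
            rejections += 1
    return rejections / sims

# Classical t-test as the baseline competitor
t_pvalue = lambda x, y: stats.ttest_ind(x, y).pvalue
print(mc_power(t_pvalue))  # power of the t-test at n = 6 per group
```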
Abstract: This article considers hypothesis testing using the Bayes factor in the context of categorical data models represented in two-dimensional contingency tables. The study considers the multinomial model for a general I × J table. Other data characteristics, such as low or polarized cell counts and the size of the table, are also considered. The objective is to investigate the sensitivity of the Bayes factor to these features so as to understand the performance of non-informative priors. Consistency is studied on different types of data using Dirichlet priors with eight different choices of their parameters for the multinomial model, followed by a bootstrap simulation. The study emphasizes reasonable choices of parameter values that represent the underlying physical phenomena, even when these are partially vague in nature.
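As one concrete instance of such a Bayes factor (a standard conjugate Dirichlet-multinomial construction, not necessarily any of the paper's eight prior choices), the Python sketch below computes the log Bayes factor of row-column independence against the saturated multinomial model for an I × J table. The hyperparameter values and example table are illustrative.

```python
import numpy as np
from scipy.special import gammaln

def log_dirichlet_norm(alpha):
    """log of the Dirichlet normalising constant B(alpha)."""
    return gammaln(alpha).sum() - gammaln(alpha.sum())

def log_bf_independence(n, a_cell=1.0, a_row=1.0, a_col=1.0):
    """log Bayes factor of row-column independence versus the saturated
    multinomial model for an I x J table n, under conjugate Dirichlet
    priors; the multinomial coefficient cancels in the ratio."""
    n = np.asarray(n, float)
    a_full = np.full(n.shape, a_cell)
    a_r = np.full(n.shape[0], a_row)
    a_c = np.full(n.shape[1], a_col)
    log_m_full = log_dirichlet_norm(a_full + n) - log_dirichlet_norm(a_full)
    log_m_indep = (log_dirichlet_norm(a_r + n.sum(axis=1))
                   - log_dirichlet_norm(a_r)
                   + log_dirichlet_norm(a_c + n.sum(axis=0))
                   - log_dirichlet_norm(a_c))
    return log_m_indep - log_m_full

table = np.array([[12, 3], [5, 10]])
print(log_bf_independence(table))  # negative => evidence against independence
```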
Abstract: The likelihood of developing cancer during one’s lifetime is approximately one in two for men and one in three for women in the United States. Cancer is the second-leading cause of death and accounts for one in every four deaths. Evidence-based policy planning and decision making by cancer researchers and public health administrators are best accomplished with up-to-date age-adjusted, site-specific cancer death rates. Because of the 3-year lag in reporting, forecasting methodology is employed here to estimate the current year’s rates based on complete observed death data up through three years prior to the current year. The authors expand the State Space Model (SSM) statistical methodology currently in use by the American Cancer Society (ACS) to predict age-adjusted cancer death rates for the current year. These predictions are compared with those from ACS's previous PROC FORECAST method, and the results suggest that the expanded SSM performs well.
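The ACS's expanded SSM is not reproduced here; as a generic illustration of state space forecasting across a 3-year reporting gap, the sketch below fits a local linear trend model with statsmodels to a synthetic rate series and forecasts three steps ahead. The numbers are invented for illustration and are not real cancer death rates.

```python
import numpy as np
from statsmodels.tsa.statespace.structural import UnobservedComponents

# Synthetic annual age-adjusted rate series (illustrative numbers only)
rates = np.array([205.1, 203.4, 201.0, 198.8, 196.1, 193.9,
                  191.5, 189.2, 186.6, 184.3, 181.8, 179.5])

# Local linear trend state space model: both level and slope evolve
model = UnobservedComponents(rates, level="local linear trend")
fit = model.fit(disp=False)

# Forecast the 3-year reporting gap to estimate the current year's rate
print(fit.forecast(steps=3))
```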
Abstract: Datasets are sometimes encountered that consist of a two-way table of 0’s and 1’s. For example, such a table might show which patients are impaired on which of a battery of tests, or which compounds are successful at inactivating which of several micro-organisms. The present paper describes a method of analysing such tables that reveals and specifies two (or more) systems or modes of action, if indeed they are needed to explain the data. The approach is an extension of what, in the context of cognitive impairments, is termed double dissociation. In order to be simple enough to be practicable, the approach is deterministic rather than probabilistic.
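To fix ideas, the following Python sketch searches a 0/1 table for the basic double-dissociation pattern that the paper's method extends: a pair of rows and a pair of columns in which one row is impaired on exactly the first column and the other on exactly the second. The toy table is illustrative, and this deterministic enumeration is only the starting point of the paper's approach.

```python
import numpy as np
from itertools import combinations

def double_dissociations(table):
    """Find row pairs (e.g. patients) and column pairs (tests) forming a
    double dissociation: one row is 1 on the first column and 0 on the
    second, while the other row shows the reverse pattern."""
    table = np.asarray(table)
    hits = []
    for r1, r2 in combinations(range(table.shape[0]), 2):
        for c1, c2 in combinations(range(table.shape[1]), 2):
            sub = table[np.ix_([r1, r2], [c1, c2])]
            if (sub == np.eye(2)).all() or (sub == 1 - np.eye(2)).all():
                hits.append(((r1, r2), (c1, c2)))
    return hits

# Toy table: rows = patients, columns = tests (1 = impaired)
t = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0]])
print(double_dissociations(t))
```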