Abstract: We have developed a tool for model space exploration and variable selection in linear regression models based on a simple spike and slab model (Dey, 2012). The selected model is the one with the minimum final prediction error (FPE) among all candidate models. This is implemented via the R package modelSampler. However, model selection based on the FPE criterion is questionable, as FPE can be sensitive to perturbations in the data. The R package can be used for empirical assessment of the stability of the FPE criterion. Stable model selection is accomplished by a bootstrap wrapper that calls the primary function of the package repeatedly on the bootstrapped data. The heart of the method is model averaging, used both for stable variable selection and for studying the behavior of variables over the entire model space, a concept invaluable in high-dimensional situations.
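A minimal sketch of the bootstrap-stability idea described above, written in Python rather than the modelSampler R package, with an illustrative form of FPE and made-up data dimensions; it refits minimum-FPE subset selection on resampled data and tabulates how often each variable is selected.

```python
# Sketch only (not the modelSampler implementation): bootstrap assessment of the
# stability of minimum-FPE subset selection in linear regression.
from itertools import combinations
import numpy as np

def fpe(X, y):
    """Akaike's final prediction error for an OLS fit (one common form)."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return (rss / n) * (n + p) / (n - p)

def best_fpe_subset(X, y):
    """Exhaustive search over predictor subsets; returns the minimum-FPE subset."""
    p = X.shape[1]
    best, best_val = (), np.inf
    for k in range(1, p + 1):
        for subset in combinations(range(p), k):
            val = fpe(X[:, subset], y)
            if val < best_val:
                best, best_val = subset, val
    return best

def bootstrap_inclusion(X, y, n_boot=200, seed=0):
    """Rerun selection on bootstrap resamples and tabulate variable
    inclusion frequencies, a simple measure of selection stability."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        for j in best_fpe_subset(X[idx], y[idx]):
            counts[j] += 1
    return counts / n_boot
```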
Abstract: This paper evaluates the efficacy of a machine learning approach to data fusion using convolved multi-output Gaussian processes in the context of geological resource modeling. It empirically demonstrates that integrating information across multiple sources leads to superior estimates of all the quantities being modeled, compared with modeling them individually. Convolved multi-output Gaussian processes provide a powerful approach for simultaneous modeling of multiple quantities of interest while taking correlations between these quantities into consideration. Experiments are performed on large-scale data taken from a mining context.
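A toy one-dimensional sketch of the process-convolution construction behind convolved multi-output Gaussian processes: two outputs share a common white-noise latent process, each smoothed by its own Gaussian kernel, so observations of one output inform predictions of the other. The data, lengthscales, and amplitudes below are illustrative assumptions, not the paper's mining data or model.

```python
import numpy as np

def cross_cov(x1, x2, v1, v2, s1, s2):
    # Closed form: convolving two Gaussian smoothing kernels (variances v1, v2)
    # applied to a shared white-noise process yields a Gaussian cross-covariance.
    d = x1[:, None] - x2[None, :]
    v = v1 + v2
    return s1 * s2 * np.exp(-0.5 * d ** 2 / v) / np.sqrt(2.0 * np.pi * v)

# Toy data: output 1 densely observed, output 2 sparsely observed.
rng = np.random.default_rng(1)
x1 = np.linspace(0.0, 10.0, 40); y1 = np.sin(x1) + 0.1 * rng.normal(size=40)
x2 = np.array([1.0, 5.0, 9.0]);  y2 = np.sin(x2) + 0.1 * rng.normal(size=3)
v = (0.5, 0.8); s = (1.0, 1.0); noise = 0.01

# Joint covariance over all observations of both outputs.
K = np.block([
    [cross_cov(x1, x1, v[0], v[0], s[0], s[0]), cross_cov(x1, x2, v[0], v[1], s[0], s[1])],
    [cross_cov(x2, x1, v[1], v[0], s[1], s[0]), cross_cov(x2, x2, v[1], v[1], s[1], s[1])],
]) + noise * np.eye(43)

# Predict output 2 on a grid, borrowing strength from the densely sampled output 1.
xs = np.linspace(0.0, 10.0, 100)
Ks = np.hstack([cross_cov(xs, x1, v[1], v[0], s[1], s[0]),
                cross_cov(xs, x2, v[1], v[1], s[1], s[1])])
mean_2 = Ks @ np.linalg.solve(K, np.concatenate([y1, y2]))
```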
Abstract: Methods for testing the equality of two means are of critical importance in many areas of applied statistics. In the microarray context, it is often necessary to apply this kind of testing to small samples containing no more than a dozen elements, in which case the power of such tests is inevitably low. We suggest an augmentation of the classical t-test by introducing a new test statistic which we call “bio-weight.” We show by simulation that in practically important cases of small sample size, the test based on this statistic is substantially more powerful than the classical t-test. The power computations are accompanied by ROC and FDR analysis of the simulated microarray data.
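A small sketch of the kind of Monte Carlo power computation described above, for the classical two-sample t-test at small sample size; the "bio-weight" statistic itself is not specified in the abstract, so only the baseline it is compared against is shown, with illustrative sample size and effect size.

```python
# Monte Carlo power estimate for the classical two-sample t-test at small n,
# the baseline against which an alternative statistic would be compared.
import numpy as np
from scipy import stats

def power_t_test(n=6, delta=1.5, sigma=1.0, alpha=0.05, n_sim=20000, seed=0):
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sim):
        x = rng.normal(0.0, sigma, n)      # control group
        y = rng.normal(delta, sigma, n)    # treatment group with shifted mean
        _, p = stats.ttest_ind(x, y)
        rejections += (p < alpha)
    return rejections / n_sim

print(power_t_test())  # estimated power of the t-test for n = 6 per group
```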
Abstract: This article considers hypothesis testing using the Bayes factor in the context of categorical data models represented in two-dimensional contingency tables. The study considers the multinomial model for a general I × J table. Other data characteristics, such as low and polarized cell counts and the size of the table, are also considered. The objective is to investigate the sensitivity of the Bayes factor to these features in order to understand the performance of non-informative priors. Consistency is studied across different types of data using Dirichlet priors with eight different parameter choices for the multinomial model, followed by a bootstrap simulation. The study emphasizes reasonable choices of parameter values that represent the underlying physical phenomena, even though these are partially vague in nature.
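A minimal sketch of one Bayes factor computation of the kind studied above: the saturated multinomial model for an I × J table against row/column independence, under symmetric Dirichlet priors. The concentration value a stands in for one of the several prior choices that would be compared; the table entries are illustrative.

```python
# Dirichlet-multinomial marginal likelihoods; the multinomial coefficients
# cancel in the Bayes factor, so they are omitted.
import numpy as np
from scipy.special import gammaln

def log_dirichlet_multinomial(counts, a):
    """log of integral of prod p_k^{n_k} under a symmetric Dirichlet(a) prior."""
    counts = np.asarray(counts, dtype=float)
    K = counts.size
    return (gammaln(K * a) - K * gammaln(a)
            + np.sum(gammaln(counts + a)) - gammaln(counts.sum() + K * a))

def log_bf_saturated_vs_independence(table, a=1.0):
    table = np.asarray(table, dtype=float)
    log_m1 = log_dirichlet_multinomial(table.ravel(), a)          # saturated model
    log_m0 = (log_dirichlet_multinomial(table.sum(axis=1), a)     # row margins
              + log_dirichlet_multinomial(table.sum(axis=0), a))  # column margins
    return log_m1 - log_m0

print(log_bf_saturated_vs_independence([[20, 5], [6, 18]], a=1.0))
```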
Abstract: The likelihood of developing cancer during one’s lifetime is approximately one in two for men and one in three for women in the United States. Cancer is the second-leading cause of death and accounts for one in every four deaths. Evidence-based policy planning and decision making by cancer researchers and public health administrators are best accomplished with up-to-date age-adjusted site-specific cancer death rates. Because of the 3-year lag in reporting, forecasting methodology is employed here to estimate the current year’s rates based on complete observed death data up through three years prior to the current year. The authors expand the State Space Model (SSM) statistical methodology currently in use by the American Cancer Society (ACS) to predict age-adjusted cancer death rates for the current year. These predictions are compared with those from the previous Proc Forecast ACS method, and the results suggest that the expanded SSM performs well.
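A generic structural time-series sketch of the forecasting task described above, not the authors' expanded SSM: a local linear trend model is fit to an annual death-rate series and projected across the 3-year reporting gap. The rates below are made-up illustrative values.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical annual age-adjusted death rates (per 100,000), most recent year last.
rates = np.array([210.5, 208.1, 205.9, 203.0, 200.2, 197.8, 195.1,
                  192.6, 190.3, 188.0, 185.4, 183.1, 180.9, 178.6])

# Local linear trend state space model fit by maximum likelihood.
model = sm.tsa.UnobservedComponents(rates, level='local linear trend')
fit = model.fit(disp=False)
print(fit.forecast(steps=3))  # predicted rates for the 3 unreported years
```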
Abstract: Datasets are sometimes encountered that consist of a two-way table of 0’s and 1’s. For example, this might show which patients are impaired on which of a battery of tests, or which compounds are successful at inactivating which of several micro-organisms. The present paper describes a method of analysing such tables that reveals and specifies two (or more) systems or modes of action, if indeed they are needed to explain the data. The approach is an extension of what, in the context of cognitive impairments, is termed double dissociation. In order to be simple enough to be practicable, the approach is deterministic rather than probabilistic.
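A brute-force sketch of the basic double-dissociation pattern the above approach extends: patient i is impaired on test a but not b, while patient j shows the reverse, suggesting two separable systems. This is only an illustration of the pattern on a toy table, not the paper's full decomposition method.

```python
from itertools import combinations
import numpy as np

def double_dissociations(M):
    """Return all (row pair, column pair) showing the crossed 1/0 pattern."""
    M = np.asarray(M)
    found = []
    for i, j in combinations(range(M.shape[0]), 2):
        for a, b in combinations(range(M.shape[1]), 2):
            if M[i, a] == 1 and M[i, b] == 0 and M[j, a] == 0 and M[j, b] == 1:
                found.append(((i, j), (a, b)))
    return found

table = [[1, 0, 1],    # rows: patients (or compounds)
         [0, 1, 1],    # columns: tests (or micro-organisms)
         [1, 1, 0]]
print(double_dissociations(table))
```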
Abstract: Longitudinal data analysis has been widely developed over the past three decades. Longitudinal data are common in many fields such as public health, medicine, and the biological and social sciences. Longitudinal data have a special nature, as each individual may be observed over a long period of time; hence, missing values are common in longitudinal data. The presence of missing values leads to biased results and complicates the analysis. Missing values follow two patterns: intermittent and dropout. The missing data mechanisms are missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). The appropriate analysis relies heavily on the assumed mechanism and pattern. Parametric fractional imputation is developed to handle longitudinal data with an intermittent missing pattern. The maximum likelihood estimates are obtained, and the jackknife method is used to obtain the standard errors of the parameter estimates. Finally, a simulation study is conducted to validate the proposed approach, and the approach is applied to real data.
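A generic sketch of the jackknife step mentioned above: a delete-one-subject jackknife for the standard error of a parameter estimate from longitudinal data. The estimator here is a placeholder (mean of per-subject means) standing in for the maximum likelihood estimate from the fractional imputation fit, and the data are illustrative.

```python
import numpy as np

def estimate(groups):
    """Placeholder estimator: mean of per-subject means."""
    return np.mean([np.mean(g) for g in groups])

def jackknife_se(groups):
    """Delete one subject at a time and apply the standard jackknife variance formula."""
    n = len(groups)
    theta_full = estimate(groups)
    theta_i = np.array([estimate(groups[:i] + groups[i + 1:]) for i in range(n)])
    se = np.sqrt((n - 1) / n * np.sum((theta_i - theta_i.mean()) ** 2))
    return theta_full, se

# Each inner list holds the repeated measurements for one subject.
subjects = [[1.2, 1.5, 1.4], [0.9, 1.1], [1.8, 1.6, 1.7, 1.9], [1.0, 1.3]]
print(jackknife_se(subjects))
```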
Abstract: This paper considers the estimation of lifetime distributions from missing-censoring data. Using a simple empirical approach rather than the maximum likelihood argument, we obtain parametric estimates of the lifetime distribution under the assumption that the failure time follows an exponential or gamma distribution. We also derive nonparametric estimates for both continuous and discrete failure distributions under the assumption that the censoring distribution is known. The loss of efficiency due to missing-censoring is shown to be generally small if the data model is specified correctly. The identifiability of the lifetime distribution with missing-censoring data is also addressed.
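For orientation, a sketch of the complete-data baseline against which the efficiency loss above would be judged: the maximum likelihood estimate of an exponential failure rate from right-censored data with fully observed censoring indicators. The paper's estimators address the harder case where some indicators are missing; the durations below are illustrative.

```python
import numpy as np

def exponential_mle(times, events):
    """times: observed durations; events: 1 if failure observed, 0 if censored.
    For the exponential model, hat(lambda) = number of failures / total time at risk."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events)
    return events.sum() / times.sum()

t = [2.3, 5.1, 1.7, 4.4, 3.0, 6.2]
d = [1, 0, 1, 1, 0, 1]
print(exponential_mle(t, d))
```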
Abstract: This paper proposes a parametric method for estimating animal abundance using data from independent observer line transect surveys. This method allows for measurement errors in distance and size, and for detection rates of less than 100% on the transect line. Based on data from southern bluefin tuna surveys and data from a minke whale survey, simulation studies were conducted, and the results show that 1) the proposed estimates agree well with the true values, 2) the effect of small measurement errors in distance can still be large if measurements of size are biased, and 3) incorrectly assuming 100% detection rates on the transect line will greatly underestimate the animal abundance.
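A sketch of the conventional single-observer line transect estimator with a half-normal detection function and certain detection on the line (g(0) = 1), which is exactly the assumption the paper relaxes; it is included only to fix ideas. Distances and survey effort are illustrative.

```python
import numpy as np

def density_half_normal(perp_distances, total_line_length):
    """Estimated animal density from perpendicular sighting distances, assuming
    a half-normal detection function and g(0) = 1."""
    x = np.asarray(perp_distances, dtype=float)
    sigma2 = np.mean(x ** 2)                    # MLE of the half-normal scale
    esw = np.sqrt(np.pi * sigma2 / 2.0)         # effective strip half-width
    return len(x) / (2.0 * total_line_length * esw)

distances = [0.1, 0.4, 0.2, 0.7, 0.05, 0.3, 0.6, 0.15]       # km from the line
print(density_half_normal(distances, total_line_length=50.0))  # animals per km^2
```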