Abstract: The problem of detecting differential gene expression with microarray data has led to further innovative approaches to controlling false positives in multiple testing. The false discovery rate (FDR) has been widely used as a measure of error in this multiple testing context. Direct estimation of FDR was recently proposed by Storey (2002, Journal of the Royal Statistical Society, Series B 64, 479-498) as a substantially more powerful alternative to the traditional sequential FDR-controlling procedure pioneered by Benjamini and Hochberg (1995, Journal of the Royal Statistical Society, Series B 57, 289-300). Direct estimation of FDR requires fixing a rejection region of interest and then conservatively estimating the associated FDR. On the other hand, a sequential FDR procedure requires fixing an FDR control level and then estimating the rejection region. Thus, the sequential and direct approaches to FDR control appear very different. In this paper, we introduce a unified computational framework for sequential FDR methods and propose a class of more powerful sequential FDR algorithms that link the direct and sequential approaches. Under the proposed unified computational framework, both approaches simply approximate the least conservative (optimal) sequential FDR procedure. We illustrate the FDR algorithms and concepts with numerical studies (simulations) and with two real exploratory DNA microarray studies, one on the detection of molecular signatures in BRCA-mutation breast cancer patients and another on the detection of genetic signatures during colon cancer initiation and progression in the rat.
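A minimal sketch of the two approaches contrasted above: the sequential (Benjamini-Hochberg) procedure, which fixes an FDR level and finds the rejection region, and Storey's direct estimate, which fixes a rejection region and estimates its FDR. Function names and the simulated p-values are illustrative, not taken from the paper.

```python
import numpy as np

def bh_threshold(pvals, alpha=0.05):
    """Sequential (Benjamini-Hochberg) procedure: fix the FDR level alpha,
    then find the largest p-value threshold whose implied FDR stays below it."""
    p = np.sort(np.asarray(pvals))
    m = p.size
    below = np.nonzero(p <= (np.arange(1, m + 1) / m) * alpha)[0]
    if below.size == 0:
        return 0.0                      # no rejections
    return p[below.max()]               # threshold defining the rejection region

def storey_fdr(pvals, threshold, lam=0.5):
    """Direct (Storey) approach: fix the rejection region [0, threshold],
    then conservatively estimate its FDR using pi0 estimated from p-values > lam."""
    p = np.asarray(pvals)
    m = p.size
    pi0 = np.mean(p > lam) / (1.0 - lam)           # conservative null-proportion estimate
    n_rej = max(np.sum(p <= threshold), 1)
    return min(pi0 * m * threshold / n_rej, 1.0)    # estimated FDR of the fixed region

# toy example: 900 null p-values plus 100 "signal" p-values concentrated near zero
rng = np.random.default_rng(0)
pvals = np.concatenate([rng.uniform(size=900), rng.beta(0.1, 5.0, size=100)])
print("BH threshold at FDR 0.05:", bh_threshold(pvals, 0.05))
print("Estimated FDR of region p <= 0.01:", storey_fdr(pvals, 0.01))
```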
In this article, we considered the analysis of data with a non-normally distributed response variable. In particular, we extended an existing Area Under the Curve (AUC) regression model that handles only two discrete covariates to a general AUC regression model that can be used to analyze data with an unrestricted number of discrete covariates. Compared with other similar methods, which require iterative algorithms and bootstrap procedures, our method involved only closed-form formulae for parameter estimation. We also discussed the issue of model identifiability. Our model has broad applicability in clinical trials due to the ease of interpretation of its parameters. We applied our model to analyze a clinical trial evaluating the effects of educational brochures for preventing Fetal Alcohol Spectrum Disorders (FASD). Finally, for a variety of simulation scenarios, our method produced parameter estimates with small biases and confidence intervals with nominal coverage probabilities.
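A minimal sketch of the closed-form AUC (Mann-Whitney) estimate that AUC regression of this kind builds on; stratifying by a single discrete covariate below is only an illustration, not the paper's full regression model.

```python
import numpy as np

def auc_mann_whitney(y_treated, y_control):
    """Estimate P(Y_treated > Y_control) by the Mann-Whitney statistic,
    counting ties as 1/2."""
    yt = np.asarray(y_treated)[:, None]
    yc = np.asarray(y_control)[None, :]
    return np.mean((yt > yc) + 0.5 * (yt == yc))

# toy stratified use: one AUC per level of a discrete covariate
rng = np.random.default_rng(1)
for level in (0, 1):
    treated = rng.gamma(2.0 + level, size=50)   # skewed, non-normal responses
    control = rng.gamma(2.0, size=50)
    print(f"covariate level {level}: AUC = {auc_mann_whitney(treated, control):.3f}")
```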
In this work, we study the odd Lindley Burr XII model initially introduced by Silva et al. [29]. This model has the advantage of being capable of modeling various shapes of aging and failure criteria. Some of its statistical structural properties, including ordinary and incomplete moments, the quantile and generating functions, and order statistics, are derived. The odd Lindley Burr XII density can be expressed as a simple linear mixture of Burr XII densities. Useful characterizations are presented. The maximum likelihood method is used to estimate the model parameters, and simulation results assessing the performance of the maximum likelihood estimators are discussed. We demonstrate empirically the importance and flexibility of the new model in modeling various types of data. Bayesian estimation is performed by obtaining the posterior marginal distributions as well as by Markov chain Monte Carlo (MCMC) simulation, using the Metropolis-Hastings algorithm in each step of the Gibbs sampler. The trace plots and estimated conditional posterior distributions are also presented.
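A minimal sketch of the maximum likelihood step only; scipy's plain Burr XII distribution (burr12) is used as a stand-in because the odd Lindley Burr XII density itself is not reproduced in this abstract.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = stats.burr12.rvs(c=2.0, d=3.0, size=500, random_state=rng)  # synthetic failure times

# MLE of the shape and scale parameters (location fixed at 0)
c_hat, d_hat, loc_hat, scale_hat = stats.burr12.fit(data, floc=0)
print(f"c = {c_hat:.2f}, d = {d_hat:.2f}, scale = {scale_hat:.2f}")

# log-likelihood at the fitted parameters, useful for comparing candidate models
loglik = np.sum(stats.burr12.logpdf(data, c_hat, d_hat, loc=loc_hat, scale=scale_hat))
print(f"log-likelihood = {loglik:.1f}")
```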
Abstract: The paper proposes the use of Kohonen’s Self-Organizing Map (SOM) and supervised neural networks to find clusters in samples of gamma-ray bursts (GRBs) using the measurements given in the BATSE GRB catalog. The extent of separation between the clusters obtained by SOM was examined by a cross-validation procedure using supervised neural networks for classification. A method is proposed for variable selection to reduce the “curse of dimensionality”. Six variables were chosen for the cluster analysis. Additionally, principal components were computed using all the original variables, and the 6 components that accounted for a high percentage of the variance were chosen for the SOM analysis. All these methods indicate 4 or 5 clusters. Further analysis based on the average profiles of the GRBs indicated a possible reduction in the number of clusters.
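A minimal sketch of the SOM-plus-PCA workflow described above, using the third-party MiniSom package and scikit-learn; the burst measurements are replaced by synthetic data, and the 10x10 grid and training settings are assumptions rather than the paper's configuration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from minisom import MiniSom

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 9))                      # stand-in for the BATSE burst measurements

X_std = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=6).fit_transform(X_std)    # keep 6 components, as in the abstract

som = MiniSom(10, 10, input_len=6, sigma=1.5, learning_rate=0.5, random_seed=0)
som.train_random(X_pca, num_iteration=5000)

# map each burst to its best-matching unit; dense groups of units suggest clusters
bmus = np.array([som.winner(x) for x in X_pca])
units, counts = np.unique(bmus, axis=0, return_counts=True)
print("occupied SOM units:", len(units))
```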
Abstract: The Polya tree, by embedding parametric families as a special case, provides a natural framework for testing the goodness of fit of a parametric null against nonparametric alternatives. For this purpose, we present a new Polya tree construction for a random probability measure, which is designed to yield an easy multiple χ² test for goodness of fit. Examples of data analyses are provided in simulation studies to highlight the performance of the proposed methods.
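A minimal sketch of the "multiple χ² test" idea: transform the data by the parametric null CDF and test uniformity over dyadic partitions of increasing depth, mirroring a Polya tree's binary splits. The choice of levels and the plain χ² statistic are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
data = rng.standard_t(df=5, size=400)       # true law is heavier-tailed than the null

u = stats.norm.cdf(data)                    # probability integral transform under a N(0,1) null
for level in (1, 2, 3, 4):
    bins = 2 ** level                       # dyadic partition at this depth
    observed, _ = np.histogram(u, bins=bins, range=(0.0, 1.0))
    expected = np.full(bins, len(u) / bins)
    chi2, pval = stats.chisquare(observed, expected)
    print(f"level {level}: chi2 = {chi2:.1f}, p = {pval:.3f}")
```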
Probabilistic topic models have become a standard in modern machine learning for dealing with a wide range of applications. Representing data by the reduced-dimensional mixture proportions extracted from topic models is not only richer in semantic interpretation, but can also be informative for classification tasks. In this paper, we describe the Topic Model Kernel (TMK), a topic-based kernel for Support Vector Machine classification of data processed by probabilistic topic models. The applicability of the proposed kernel is demonstrated in several classification tasks on real-world datasets. TMK outperforms existing kernels on distributional features and gives comparable results on non-probabilistic data types.
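A minimal sketch of a kernel on topic proportions plugged into an SVM; the Jensen-Shannon-based form below is a common choice for distributional features and is an assumption here, not necessarily the exact TMK definition from the paper.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.svm import SVC

def js_kernel(P, Q, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * JS(P_i, Q_j)) over rows of topic proportions."""
    K = np.empty((len(P), len(Q)))
    for i, p in enumerate(P):
        for j, q in enumerate(Q):
            K[i, j] = np.exp(-gamma * jensenshannon(p, q) ** 2)  # squared JS distance = divergence
    return K

# toy topic proportions from two classes (each row sums to one)
rng = np.random.default_rng(5)
X = np.vstack([rng.dirichlet([5, 1, 1], 40), rng.dirichlet([1, 1, 5], 40)])
y = np.array([0] * 40 + [1] * 40)

clf = SVC(kernel="precomputed").fit(js_kernel(X, X), y)
print("training accuracy:", clf.score(js_kernel(X, X), y))
```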
Abstract: Motivation: A formidable challenge in the analysis of microarray data is the identification of those genes that exhibit differential expression. The objectives of this research were to examine the utility of simple ANOVA, one-sided t-tests, natural log transformation, and a generalized experimentwise error rate methodology for the analysis of such experiments. As a test case, we analyzed an Affymetrix GeneChip microarray experiment designed to test for the effect of a CHD3 chromatin remodeling factor, PICKLE, and an inhibitor of the plant hormone gibberellin (GA) on the expression of 8256 Arabidopsis thaliana genes. Results: GFWER(k) is defined as the probability of rejecting k or more true null hypotheses at a given p level. Computing probabilities by GFWER(k) was shown to be simple to apply and, depending on the value of k, can greatly increase power. A k value as small as 2 or 3 was concluded to be adequate for large or small experiments, respectively. A one-sided t-test along with GFWER(2) = 0.05 identified 43 genes as exhibiting PICKLE-dependent expression. Expression of all 43 genes was re-examined by qRT-PCR, of which 36 (83.7%) were confirmed to exhibit PICKLE-dependent expression.
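A minimal sketch of controlling the generalized error rate GFWER(k), the probability of making k or more false rejections, via the generalized Bonferroni rule that rejects whenever p ≤ kα/m. This is a standard k-FWER procedure and may differ from the paper's own computation of GFWER(k) probabilities; the simulated p-values are illustrative.

```python
import numpy as np

def gfwer_k_rejections(pvals, k=2, alpha=0.05):
    """Indices of hypotheses rejected under the generalized Bonferroni rule p <= k * alpha / m."""
    p = np.asarray(pvals)
    m = p.size
    return np.nonzero(p <= k * alpha / m)[0]

# toy example with m = 8256 genes, mostly null
rng = np.random.default_rng(6)
pvals = np.concatenate([rng.uniform(size=8200), rng.beta(0.05, 4.0, size=56)])
rejected = gfwer_k_rejections(pvals, k=2, alpha=0.05)
print(f"genes declared differentially expressed: {rejected.size}")
```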
Abstract: A basic assumption of the general linear regression model is that there is no correlation (or no multicollinearity) between the explanatory variables. When this assumption is not satisfied, the least squares estimators have large variances, become unstable, and may have the wrong sign. Therefore, we resort to biased regression methods, which stabilize the parameter estimates. Ridge regression (RR) and principal component regression (PCR) are two of the most popular biased regression methods that can be used in the presence of multicollinearity. The r-k class estimator, which combines the RR estimator and the PCR estimator into a single estimator, gives better estimates of the regression coefficients than either the RR or the PCR estimator. This paper explores the multiple regression of the total fertility rate (TFR) on other socio-economic and demographic variables using the r-k class estimator, with data taken from the National Family Health Survey-III (NFHS-III) covering 29 states of India. The analysis shows that use of contraceptive devices has the greatest impact on the fertility rate, followed by maternal care, use of improved water, female age at marriage, and spacing between births.
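A minimal sketch of the r-k class estimator, which combines ridge shrinkage (k > 0) with principal component regression (keeping r components). Variable names and the synthetic collinear predictors are illustrative assumptions, not the NFHS-III data; standardizing the predictors beforehand is usually advisable.

```python
import numpy as np

def rk_class_estimator(X, y, r, k):
    """beta_hat(r, k) = T_r (Lambda_r + k I)^{-1} T_r' X'y, where T_r holds the
    first r eigenvectors of X'X and Lambda_r the corresponding eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)
    order = np.argsort(eigvals)[::-1]               # largest eigenvalues first
    Tr = eigvecs[:, order[:r]]
    lam_r = eigvals[order[:r]]
    return Tr @ ((Tr.T @ X.T @ y) / (lam_r + k))    # diagonal inverse applied elementwise

# toy example with strongly collinear predictors
rng = np.random.default_rng(7)
z = rng.normal(size=(200, 1))
X = np.hstack([z + 0.01 * rng.normal(size=(200, 3)), rng.normal(size=(200, 2))])
y = X @ np.array([1.0, 1.0, 1.0, 0.5, -0.5]) + rng.normal(size=200)
print(rk_class_estimator(X, y, r=3, k=0.1))
```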
Abstract: When comparing the performance of health care providers, it is important to rule out the effect of factors that have an unwanted influence on the performance indicator (e.g., mortality). In register-based studies, randomization is out of the question. We develop a risk adjustment model for hip fracture mortality in Finland using logistic regression. The model is used to study the impact of the length of the register follow-up period on adjusting the performance indicator for a set of comorbidities: congestive heart failure, cancer, and diabetes. We also introduce an implementation of the minimum description length (MDL) principle for model selection in logistic regression, based on the normalized maximum likelihood (NML) technique. Because the computational burden of the usual NML criterion is too heavy, a technique based on the idea of sequentially normalized maximum likelihood (sNML) is introduced. The sNML criterion can be evaluated efficiently even for large models with large amounts of data. The results given by sNML are then compared to the corresponding results given by the traditional AIC and BIC model selection criteria. All three comorbidities clearly have an effect on hip fracture mortality. The results indicate that for congestive heart failure all available medical history should be used, while for cancer it is enough to use only records from the half year before the fracture. For diabetes the choice of time period is not as clear, but using records from the three years before the fracture seems to be a reasonable choice.
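A minimal sketch of comparing candidate logistic risk-adjustment models by AIC and BIC, the baseline criteria against which the sNML results are compared; the covariates and simulated outcomes are illustrative, and the sNML criterion itself is not implemented here.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 2000
age = rng.normal(80, 8, n)
chf = rng.binomial(1, 0.2, n)          # congestive heart failure flag
cancer = rng.binomial(1, 0.1, n)
logit = -6.0 + 0.05 * age + 0.8 * chf + 0.5 * cancer
death = rng.binomial(1, 1 / (1 + np.exp(-logit)))   # simulated mortality indicator

candidates = {
    "age only": np.column_stack([age]),
    "age + CHF": np.column_stack([age, chf]),
    "age + CHF + cancer": np.column_stack([age, chf, cancer]),
}
for name, X in candidates.items():
    fit = sm.Logit(death, sm.add_constant(X)).fit(disp=0)
    print(f"{name:20s}  AIC = {fit.aic:.1f}  BIC = {fit.bic:.1f}")
```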