Longitudinal data analysis has developed widely over the past three decades. Longitudinal data are common in many fields such as public health, medicine, and the biological and social sciences. Because each individual may be observed over a long period of time, missing values are common in longitudinal data, and their presence biases results and complicates the analysis. Missing values follow two patterns, intermittent and dropout, and arise under three mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). The appropriate analysis relies heavily on the assumed mechanism and pattern. Parametric fractional imputation is developed here to handle longitudinal data with an intermittent missing pattern. Maximum likelihood estimates are obtained, and the jackknife method is used to compute the standard errors of the parameter estimates. Finally, a simulation study is conducted to validate the proposed approach, which is also applied to a real data set.
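To make the mechanics concrete, the following is a minimal sketch of parametric fractional imputation for a toy normal regression with responses missing at random given a covariate. It is not the authors' longitudinal implementation; the model, the proposal distribution, and all settings are illustrative assumptions. Each missing value is replaced by M draws from an initial fit, and the draws are then reweighted (rather than redrawn) at every EM-style iteration.

```python
# Toy parametric fractional imputation (PFI): illustrative only, not the
# paper's longitudinal method. Model y = b0 + b1*x + N(0, sigma^2), with y
# missing at random given x.
import numpy as np

rng = np.random.default_rng(0)
n, M = 200, 50                              # sample size, imputations per missing y
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
miss = rng.random(n) < 1 / (1 + np.exp(-x)) # MAR: missingness depends on x only
obs = ~miss

def wls(x, y, w):
    """Weighted least squares for y = b0 + b1*x, with weighted residual variance."""
    X = np.column_stack([np.ones_like(x), x])
    Xw = X * w[:, None]
    beta = np.linalg.solve(X.T @ Xw, X.T @ (w * y))
    sigma2 = np.sum(w * (y - X @ beta) ** 2) / np.sum(w)
    return beta, sigma2

# Step 0: initial fit from complete cases; it doubles as the proposal h(.).
beta, sigma2 = wls(x[obs], y[obs], np.ones(obs.sum()))
mu0 = beta[0] + beta[1] * x[miss]
y_imp = mu0[:, None] + rng.normal(scale=np.sqrt(sigma2), size=(miss.sum(), M))
h_log = -0.5 * ((y_imp - mu0[:, None]) ** 2 / sigma2 + np.log(sigma2))

x_all = np.concatenate([x[obs], np.repeat(x[miss], M)])
y_all = np.concatenate([y[obs], y_imp.ravel()])

# EM-like iteration: reweight the fixed imputed values, then refit.
for _ in range(30):
    mu = beta[0] + beta[1] * x[miss]
    f_log = -0.5 * ((y_imp - mu[:, None]) ** 2 / sigma2 + np.log(sigma2))
    w = np.exp(f_log - h_log)
    w /= w.sum(axis=1, keepdims=True)       # fractional weights sum to 1 per unit
    w_all = np.concatenate([np.ones(obs.sum()), w.ravel()])
    beta, sigma2 = wls(x_all, y_all, w_all)

print("PFI estimates (beta, sigma^2):", beta, sigma2)
```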
Abstract: This paper considers the estimation of a lifetime distribution from missing-censoring data. Using a simple empirical approach rather than the maximum likelihood argument, we obtain parametric estimates of the lifetime distribution under the assumption that the failure time follows an exponential or gamma distribution. We also derive nonparametric estimates for both continuous and discrete failure distributions under the assumption that the censoring distribution is known. The loss of efficiency due to missing-censoring is shown to be generally small when the data model is specified correctly. The identifiability of the lifetime distribution under missing-censoring data is also addressed.
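As a point of reference for the exponential case, here is a minimal sketch of the standard censored-data MLE when the censoring indicators are fully observed; the paper's missing-censoring adjustments are not reproduced, and the simulation settings are illustrative assumptions.

```python
# Baseline: exponential lifetime MLE under right censoring with fully
# observed censoring indicators (the complete-indicator case the paper's
# missing-censoring estimators are compared against).
import numpy as np

rng = np.random.default_rng(1)
n, lam_true = 500, 0.5
t_fail = rng.exponential(scale=1 / lam_true, size=n)  # latent failure times
t_cens = rng.exponential(scale=2.0, size=n)           # censoring times
t = np.minimum(t_fail, t_cens)                        # observed time
delta = (t_fail <= t_cens).astype(float)              # 1 = failure, 0 = censored

# With delta observed, the log-likelihood sum(delta)*log(lam) - lam*sum(t)
# gives lambda_hat = (number of failures) / (total time at risk).
lam_hat = delta.sum() / t.sum()
se = lam_hat / np.sqrt(delta.sum())                   # from the observed information
print(f"lambda_hat = {lam_hat:.3f} (true {lam_true}), se = {se:.3f}")
```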
Abstract: This paper proposes a parametric method for estimating animal abundance using data from independent-observer line transect surveys. The method allows measurement errors in distance and size, and detection rates of less than 100% on the transect line. Simulation studies based on data from southern bluefin tuna surveys and from a minke whale survey show that 1) the proposed estimates agree well with the true values, 2) the effect of even small measurement errors in distance can be large if the size measurements are biased, and 3) incorrectly assuming a 100% detection rate on the transect line greatly underestimates animal abundance.
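For orientation, the sketch below shows the textbook single-observer line-transect density estimator with a half-normal detection function, assuming certain detection on the line and error-free distances. It is the baseline that the paper's double-observer, measurement-error method generalizes; all numbers are illustrative.

```python
# Conventional line-transect density estimate with a half-normal detection
# function g(x) = exp(-x^2 / (2 sigma^2)) and g(0) = 1 (certain detection
# on the line). Illustrative baseline only.
import numpy as np

rng = np.random.default_rng(2)
L = 100.0                                    # total transect length (km)
x = np.abs(rng.normal(scale=0.3, size=120))  # perpendicular distances (km)

# With no truncation, the MLE of sigma^2 from perpendicular distances is
# simply the mean of x^2.
sigma2_hat = np.mean(x**2)
esw = np.sqrt(np.pi * sigma2_hat / 2)        # effective strip half-width = integral of g
D_hat = len(x) / (2 * L * esw)               # animals per unit area
print(f"ESW = {esw:.3f} km, density = {D_hat:.2f} per km^2")
```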
Abstract: The problem of detecting differential gene expression with microarray data has led to further innovative approaches to controlling false positives in multiple testing. The false discovery rate (FDR) has been widely used as a measure of error in this multiple testing context. Direct estimation of FDR was recently proposed by Storey (2002, Journal of the Royal Statistical Society, Series B 64, 479-498) as a substantially more powerful alternative to the traditional sequential FDR controlling procedure pioneered by Benjamini and Hochberg (1995, Journal of the Royal Statistical Society, Series B 57, 289-300). Direct estimation of FDR requires fixing a rejection region of interest and then conservatively estimating the associated FDR. The sequential FDR procedure, on the other hand, requires fixing an FDR control level and then estimating the rejection region. Thus, the sequential and direct approaches to FDR control appear very different. In this paper, we introduce a unified computational framework for sequential FDR methods and propose a class of more powerful sequential FDR algorithms that link the direct and sequential approaches. Under the proposed unified computational framework, both approaches simply approximate the least conservative (optimal) sequential FDR procedure. We illustrate the FDR algorithms and concepts with numerical studies (simulations) and with two real exploratory DNA microarray studies, one on the detection of molecular signatures in BRCA-mutation breast cancer patients and another on the detection of genetic signatures during colon cancer initiation and progression in the rat.
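The contrast between the two approaches can be made concrete in a few lines: Benjamini-Hochberg fixes the FDR level and solves for the rejection region, while Storey's direct method fixes the rejection region [0, t] and conservatively estimates its FDR. The sketch below implements both in their standard textbook forms; the toy p-value mixture is illustrative.

```python
# Benjamini-Hochberg (sequential) vs. Storey (direct) FDR, textbook forms.
import numpy as np

def bh_reject(p, alpha):
    """BH: reject the k smallest p-values, where k is the largest index
    with p_(k) <= k * alpha / m."""
    m = len(p)
    order = np.argsort(p)
    passed = np.nonzero(p[order] <= alpha * np.arange(1, m + 1) / m)[0]
    reject = np.zeros(m, dtype=bool)
    if passed.size:
        reject[order[: passed[-1] + 1]] = True
    return reject

def storey_fdr(p, t, lam=0.5):
    """Storey: estimate pi0 from p-values above lam, then estimate the FDR
    of the fixed rejection region {p <= t}."""
    m = len(p)
    pi0 = np.sum(p > lam) / ((1 - lam) * m)
    return pi0 * m * t / max(1, np.sum(p <= t))

# Toy mixture: 90% null (uniform) p-values, 10% non-null (concentrated near 0).
rng = np.random.default_rng(3)
p = np.concatenate([rng.uniform(size=900), rng.beta(0.1, 10, size=100)])
print("BH rejections at alpha = 0.05:", bh_reject(p, 0.05).sum())
print("Storey FDR estimate for t = 0.01:", round(storey_fdr(p, 0.01), 4))
```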
In this article, we consider the analysis of data with a non-normally distributed response variable. In particular, we extend an existing Area Under the Curve (AUC) regression model that handles only two discrete covariates to a general AUC regression model that can analyze data with an unrestricted number of discrete covariates. Compared with similar methods that require iterative algorithms and bootstrap procedures, our method involves only closed-form formulae for parameter estimation. We also discuss the issue of model identifiability. The model has broad applicability in clinical trials because its parameters are easy to interpret. We apply it to a clinical trial evaluating the effects of educational brochures for preventing Fetal Alcohol Spectrum Disorders (FASD). Finally, across a variety of simulation scenarios, our method produces parameter estimates with small biases and confidence intervals with nominal coverage probabilities.
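The quantity at the heart of AUC regression is the empirical AUC, the Mann-Whitney estimate of P(Y1 > Y0) with ties counted as one half. A minimal sketch, with illustrative data:

```python
# Empirical AUC (Mann-Whitney statistic): the building block that AUC
# regression models as a function of covariates. Data here are illustrative.
import numpy as np

def empirical_auc(y0, y1):
    """Proportion of concordant (y1, y0) pairs, ties counted as 1/2."""
    diff = y1[:, None] - y0[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

rng = np.random.default_rng(4)
y0 = rng.exponential(scale=1.0, size=80)   # skewed, non-normal responses
y1 = rng.exponential(scale=1.6, size=70)
print(f"empirical AUC = {empirical_auc(y0, y1):.3f}")
```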
In this work, we study the odd Lindley Burr XII model initially introduced by Silva et al. [29]. The model has the advantage of being capable of capturing various shapes of aging and failure criteria. Some of its structural properties, including ordinary and incomplete moments, the quantile and generating functions, and order statistics, are derived. The odd Lindley Burr XII density can be expressed as a simple linear mixture of Burr XII densities. Useful characterizations are presented. The maximum likelihood method is used to estimate the model parameters, and simulation results assessing the performance of the maximum likelihood estimators are discussed. We demonstrate empirically the importance and flexibility of the new model for various types of data. Bayesian estimation is performed by obtaining the posterior marginal distributions and by Markov chain Monte Carlo (MCMC) simulation using the Metropolis-Hastings algorithm within each step of the Gibbs sampler. Trace plots and estimated conditional posterior distributions are also presented.
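Since the odd Lindley Burr XII density is a linear mixture of Burr XII densities, the sketch below illustrates maximum likelihood fitting of the plain Burr XII building block with scipy; fitting the extended model itself would require coding its full log-likelihood, which is not attempted here.

```python
# MLE for the plain Burr XII distribution (the mixture component of the odd
# Lindley Burr XII model), using scipy's built-in burr12 family.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
data = stats.burr12.rvs(c=2.0, d=3.0, size=500, random_state=rng)

# Fit the shape parameters c and d by MLE, pinning loc=0 and scale=1 so the
# fit matches the standard two-parameter Burr XII form.
c_hat, d_hat, loc, scale = stats.burr12.fit(data, floc=0, fscale=1)
print(f"c_hat = {c_hat:.2f}, d_hat = {d_hat:.2f}")
```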
Abstract: The paper proposes the use of Kohonen’s Self-Organizing Map (SOM) and supervised neural networks to find clusters in samples of gamma-ray bursts (GRBs) using the measurements in the BATSE GRB catalog. The extent of separation between the clusters obtained by SOM was examined by a cross-validation procedure using supervised neural networks for classification. A variable selection method is proposed to reduce the “curse of dimensionality”; six variables were chosen for the cluster analysis. Additionally, principal components were computed from all the original variables, and the six components that accounted for a high percentage of the variance were chosen for the SOM analysis. All these methods indicate 4 or 5 clusters. Further analysis based on the average profiles of the GRBs indicated a possible reduction in the number of clusters.
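To make the SOM step concrete, here is a compact from-scratch map in numpy; the grid size, decay schedules, and toy data are illustrative stand-ins rather than the BATSE analysis settings.

```python
# Minimal Self-Organizing Map: find the best-matching unit for a random
# sample, then pull nearby codebook vectors toward it. Illustrative settings.
import numpy as np

def train_som(data, rows=10, cols=10, iters=5000, lr0=0.5, sigma0=3.0, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.normal(size=(rows, cols, data.shape[1]))      # codebook vectors
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1)  # unit coordinates
    for t in range(iters):
        frac = t / iters
        lr = lr0 * (1 - frac)                             # linear decay
        sigma = sigma0 * (1 - frac) + 0.5                 # keep sigma > 0
        x = data[rng.integers(len(data))]
        d2 = np.sum((w - x) ** 2, axis=-1)
        bmu = np.unravel_index(np.argmin(d2), d2.shape)   # best-matching unit
        gdist2 = np.sum((grid - np.array(bmu)) ** 2, axis=-1)
        h = np.exp(-gdist2 / (2 * sigma**2))              # neighborhood kernel
        w += lr * h[..., None] * (x - w)
    return w

# Toy usage: three separated Gaussian clusters in 6 dimensions.
rng = np.random.default_rng(7)
data = np.concatenate([rng.normal(m, 0.3, size=(100, 6)) for m in (-2, 0, 2)])
codebook = train_som(data)
```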
Abstract: The Polya tree, by embedding parametric families as a special case, provides a natural framework for testing the goodness of fit of a parametric null against nonparametric alternatives. For this purpose, we present a new construction of the Polya tree random probability measure that is designed to yield an easy multiple χ² test of goodness of fit. Examples of data analyses and simulation studies are provided to highlight the performance of the proposed methods.
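The flavor of a multiple χ² test can be sketched without the Polya tree machinery: transform the data by the null CDF and run one χ² uniformity test per dyadic partition level. The construction below is only a stand-in under that reading; the level cap and Bonferroni combination are illustrative choices, not the paper's procedure.

```python
# Multiple chi-square goodness-of-fit on nested dyadic partitions: a simple
# stand-in for the Polya-tree-based test, with illustrative choices.
import numpy as np
from scipy import stats

def multiple_chi2_gof(x, null_cdf, max_level=4):
    """Chi-square tests of H0: x ~ F0 on dyadic partitions of [0, 1]."""
    u = null_cdf(x)                          # under H0, u is Uniform(0, 1)
    pvals = []
    for m in range(1, max_level + 1):
        k = 2 ** m                           # level-m partition has 2^m bins
        counts = np.histogram(u, bins=k, range=(0.0, 1.0))[0]
        pvals.append(stats.chisquare(counts).pvalue)
    return pvals, min(1.0, max_level * min(pvals))   # Bonferroni combination

rng = np.random.default_rng(8)
x = rng.standard_t(df=3, size=400)           # heavier-tailed than the null
pvals, p_comb = multiple_chi2_gof(x, stats.norm.cdf)
print("per-level p-values:", np.round(pvals, 4), "combined:", round(p_comb, 4))
```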
Probabilistic topic models have become a standard tool in modern machine learning across a wide range of applications. Representing data by the reduced-dimensional mixture proportions extracted from topic models is not only richer in semantic interpretation but can also be informative for classification tasks. In this paper, we describe the Topic Model Kernel (TMK), a topic-based kernel for Support Vector Machine classification of data processed by probabilistic topic models. The applicability of the proposed kernel is demonstrated in several classification tasks on real-world datasets. TMK outperforms existing kernels on distributional features and gives comparable results on non-probabilistic data types.
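A sketch of the general construction follows, assuming a Gaussian-style kernel on the Jensen-Shannon divergence between topic proportions; the exact functional form of TMK is an assumption here, and the synthetic Dirichlet "topic proportions" are stand-ins for real topic-model output.

```python
# Topic-proportion kernel for SVM classification. The kernel form
# exp(-JS / (2 sigma^2)) is an assumed reading of TMK, not a confirmed
# reproduction; data and labels are synthetic stand-ins.
import numpy as np
from sklearn.svm import SVC

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between rows of topic proportions."""
    p, q = p + eps, q + eps
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b), axis=-1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def tmk_gram(A, B, sigma=1.0):
    d = js_divergence(A[:, None, :], B[None, :, :])
    return np.exp(-d / (2 * sigma**2))

# Synthetic "topic proportions": Dirichlet draws with class-specific centers.
rng = np.random.default_rng(9)
X0 = rng.dirichlet([8, 1, 1], size=100)      # class 0 leans on topic 0
X1 = rng.dirichlet([1, 8, 1], size=100)      # class 1 leans on topic 1
X = np.vstack([X0, X1]); y = np.repeat([0, 1], 100)

clf = SVC(kernel="precomputed").fit(tmk_gram(X, X), y)
X_test = np.vstack([rng.dirichlet([8, 1, 1], 20), rng.dirichlet([1, 8, 1], 20)])
acc = clf.score(tmk_gram(X_test, X), np.repeat([0, 1], 20))
print(f"test accuracy = {acc:.2f}")
```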