Abstract: Two methods for clustering data and choosing a mixture model are proposed. First, we derive a new classification algorithm based on the classification likelihood. Then, the likelihood conditional on these clusters is written as the product of the likelihoods of the individual clusters, and AIC-type and BIC-type approximations are applied. The resulting criteria turn out to be the sum of the AIC or BIC of each cluster plus an entropy term. The performance of our methods is evaluated by Monte Carlo simulation and on a real data set, showing in particular that the iterative estimation algorithm generally converges quickly, so the computational load is rather low.
Abstract: Multiple binary outcomes that measure the presence or absence of medical conditions occur frequently in public health survey research. The multiple, possibly correlated, binary outcomes may constitute a syndrome or a group of related diseases. It is often of scientific interest to model the interrelationships not only between outcomes and risk factors, but also between different outcomes. Practical methods for handling multiple outcomes from surveys with complex designs are lacking. We propose a multivariate approach based on the generalized estimating equation (GEE) methodology to simultaneously conduct survey logistic regressions for each binary outcome in a single analysis. The approach has the following attractive features: 1) it enables modeling the complete information from multiple outcomes in a single analysis; 2) it permits testing the correlations between multiple binary outcomes; 3) it allows the outcome-specific effect to be distinguished from the overall risk factor effect; and 4) it measures differences in the association between risk factors and the multiple outcomes. The proposed method is applied to a study of risk factors for heart attack and stroke using the 2009 U.S. nationwide Behavioral Risk Factor Surveillance System (BRFSS) data.
Abstract: The present paper deals with maximum likelihood and Bayes estimation of the shape and scale parameters of the Poisson-exponential distribution for complete samples. Bayes estimators under symmetric and asymmetric loss functions are obtained using the Markov chain Monte Carlo (MCMC) technique. The performance of the proposed Bayes estimators is studied and compared with that of the maximum likelihood estimators in terms of their risks, on the basis of a Monte Carlo study of simulated samples. The methodology is also illustrated on a real data set.
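As an illustration of how Bayes estimators are typically read off MCMC output, the sketch below computes point estimates from a list of posterior draws. The abstract does not name its loss functions, so squared error (symmetric) and LINEX (asymmetric) are assumptions chosen for illustration. The function names and the gamma-distributed stand-in draws are hypothetical and are not taken from the paper.

```python
import random
from math import exp, log
from statistics import fmean


def bayes_squared_error(draws):
    """Bayes estimate under squared error loss: the posterior mean."""
    return fmean(draws)


def bayes_linex(draws, a):
    """Bayes estimate under LINEX loss L(d) = exp(a*d) - a*d - 1,
    where d = estimate - theta. The minimizer is
    -(1/a) * log E[exp(-a * theta)], approximated here by a sample mean."""
    if a == 0:
        raise ValueError("the LINEX shape parameter a must be non-zero")
    return -log(fmean(exp(-a * t) for t in draws)) / a


# Stand-in for MCMC output: draws from an assumed gamma posterior.
random.seed(0)
draws = [random.gammavariate(4.0, 0.5) for _ in range(5000)]

est_se = bayes_squared_error(draws)
est_over = bayes_linex(draws, a=1.0)    # a > 0 penalizes overestimation more
est_under = bayes_linex(draws, a=-1.0)  # a < 0 penalizes underestimation more
```

With a > 0 the LINEX estimate falls below the posterior mean, and with a < 0 it lies above it, which is the practical effect of an asymmetric loss.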
Abstract: An extension of the truncated Poisson distribution with two parameters, for a group comprising two types of population, is derived and named the Bounded Poisson (BP) distribution. The parameters are estimated by the method of moments. To check its suitability and applicability, the model is applied to a real data set on human fertility derived from the third round of the National Family Health Survey, conducted in 2005-06 in Uttar Pradesh, India. The proposed model provides a good fit to the data under consideration.
Abstract: Sample size and power calculations are often based on a two-group comparison. However, in some instances the group membership cannot be ascertained until after the sample has been collected. In this situation, the respective sizes of each group may not be the same as those prespecified due to binomial variability, which results in a difference in power from that expected. Here we suggest that investigators calculate an “expected power” taking into account the binomial variability of the group membership, and adjust the sample size accordingly when planning such studies. We explore different scenarios where such an adjustment may or may not be necessary for both continuous and binary responses. In general, the number of additional subjects required depends only slightly on the values of the (standardized) difference in the two group means or proportions, but more importantly on the respective sizes of the group membership. We present tables with adjusted sample sizes for a variety of scenarios that can be readily used by investigators at the study design stage. The proposed approach is motivated by a genetic study of cerebral malaria and a sleep apnea study.
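The "expected power" idea can be sketched as follows: average the conditional power over the binomial distribution of the first group's size. This is a minimal illustration for a continuous response and is not the paper's own computation. It assumes a two-sided z-test with known unit variance, and all function and parameter names are hypothetical.

```python
from math import comb, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def power_two_means(n1, n2, delta, alpha=0.05):
    """Approximate power of a two-sided z-test for a standardized
    mean difference delta with group sizes n1 and n2."""
    if n1 == 0 or n2 == 0:
        return 0.0  # no comparison is possible with an empty group
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = delta * sqrt(n1 * n2 / (n1 + n2))
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


def expected_power(n_total, p, delta, alpha=0.05):
    """Average the conditional power over N1 ~ Binomial(n_total, p)."""
    total = 0.0
    for n1 in range(n_total + 1):
        weight = comb(n_total, n1) * p**n1 * (1 - p) ** (n_total - n1)
        total += weight * power_two_means(n1, n_total - n1, delta, alpha)
    return total


def adjusted_total_n(p, delta, target=0.80, alpha=0.05, n_max=1000):
    """Smallest total sample size whose expected power reaches the target.
    n_max stays at or below 1000 so that comb() still converts to a float."""
    for n_total in range(2, n_max + 1):
        if expected_power(n_total, p, delta, alpha) >= target:
            return n_total
    raise ValueError("target power not reached within n_max subjects")


fixed = power_two_means(64, 64, delta=0.5)        # sizes fixed by design
random_split = expected_power(128, 0.5, delta=0.5)  # sizes random, p = 0.5
n_needed = adjusted_total_n(p=0.5, delta=0.5)
```

Because a balanced split maximizes power for a fixed total, the expected power under a random split is never higher than the power computed with the group sizes fixed at their expected values. The shortfall grows as the membership probability moves away from one half.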
Abstract: In voting rights cases, judges often infer unobservable individual vote choices from election data aggregated at the precinct level. That is, one must solve an ill-posed inverse problem to obtain the critical information used in these cases. The ill-posed nature of the problem means that traditional frequentist and Bayesian approaches cannot be employed without first imposing a range of assumptions. In order to mitigate the problems resulting from incorporating potentially inaccurate information in these cases, we propose the use of information theoretic methods as a basis for recovering an estimate of the unobservable individual vote choices. We illustrate the empirical non-parametric likelihood methods with some election data.
Abstract: In this paper, we consider the analysis of follow-up data where each event time is either right censored, observed, left censored or left truncated. In the case of left censoring, the covariates measured at baseline are considered as missing. The work is motivated by data from the MORGAM Project, which explores the association between cardiovascular diseases and their classic and genetic risk factors. We propose a nonparametric multiple imputation (NPMI) approach where the left censored event times and the missing covariates are imputed in a hot-deck manner. The left truncation due to deaths prior to baseline is compensated for by the Lexis diagram imputation introduced in this paper. After imputation, the standard estimation methods for right censored survival data can be directly applied. The performance of the proposed imputation approach is studied with simulated and real-world data. The results suggest that NPMI is a flexible and reliable approach to the analysis of left and right censored data.
Abstract: This paper describes how to explore gene expression data using a combination of graphical and numerical methods. We start from the general methodology for multivariate data visualization, describing heatmaps, parallel coordinate plots and scatterplots. We propose new methods for gene expression data analysis using direct manipulation graphics. With linked scatterplots and parallel coordinate plots, we explore gene expression data differently from many common practices. To check replicates in relation to treatments, we introduce a new type of plot called a “replicate line” plot. A worked example focuses on an experimental study containing two two-level factors, genotype and cofactor presence, with two replicates.
Abstract: In this paper we fit a predictive model for the average annual rainfall of Bangladesh through a geostatistical approach. From a geostatistical point of view, we study the spatial dependence pattern of average annual rainfall data (measured in mm) collected from 246 stations in Bangladesh. We employ kriging, that is, spatial interpolation, for the rainfall data. The data reveal a linear trend, so we remove the trend by fitting a linear model and use the trend-free data for further calculations. Four theoretical semivariogram models (Exponential, Spherical, Gaussian and Matern) are used to explain the spatial variation in the average annual rainfall. These models are chosen according to the pattern of the empirical semivariogram. The prediction performance of ordinary kriging with these four fitted models is then compared through k-fold cross-validation, and ordinary kriging is found to perform best when the spatial dependency in the average annual rainfall of Bangladesh is modeled through the Gaussian semivariogram model.
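For readers unfamiliar with the theoretical semivariogram models named above, the sketch below writes out three of them in one common parametrization (nugget, partial sill, range). Parametrizations differ between software packages, so the exact forms here are an assumption and may not match those used in the paper. The Matern model is omitted because it needs a modified Bessel function, which the Python standard library does not provide. All function and parameter names are hypothetical.

```python
from math import exp


def exponential_sv(h, nugget, psill, rng):
    """Exponential model: approaches the sill nugget + psill asymptotically."""
    if h == 0:
        return 0.0  # a semivariogram is zero at zero lag by definition
    return nugget + psill * (1.0 - exp(-h / rng))


def spherical_sv(h, nugget, psill, rng):
    """Spherical model: reaches the sill exactly at lag h = rng."""
    if h == 0:
        return 0.0
    if h >= rng:
        return nugget + psill
    r = h / rng
    return nugget + psill * (1.5 * r - 0.5 * r**3)


def gaussian_sv(h, nugget, psill, rng):
    """Gaussian model: parabolic near the origin, i.e. a very smooth field."""
    if h == 0:
        return 0.0
    return nugget + psill * (1.0 - exp(-((h / rng) ** 2)))


# Evaluate each model on a few lags with illustrative parameter values.
lags = [0.0, 0.5, 1.0, 2.5, 5.0, 10.0]
curves = {
    model.__name__: [model(h, nugget=0.1, psill=1.0, rng=5.0) for h in lags]
    for model in (exponential_sv, spherical_sv, gaussian_sv)
}
```

In practice the three parameters would be fitted to the empirical semivariogram of the detrended data, and the fitted models compared by cross-validated kriging error.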