Abstract: For a binary regression model with observed responses (Y), specified predictor vectors (X), an assumed model parameter vector (β), and a case probability function (Pr(Y = 1|X, β)), we propose a simple screening method to test goodness-of-fit when the number of observations (n) is large and the Xs are continuous variables. Given any threshold τ ∈ [0, 1], we consider classifying each subject with predictor X into Y* = 1 or 0 (a deterministic binary variable, as opposed to the observed random binary variable Y) according to whether the calculated case probability (Pr(Y = 1|X, β)) under the hypothesized true model is ≥ τ or < τ. For each τ, we check the difference between the expected marginal classification error rate (false positives [Y* = 1, Y = 0] or false negatives [Y* = 0, Y = 1]) under the hypothesized true model and the observed marginal error rate, which is directly observable under this classification rule. The screening profile is created by plotting the τ-specific marginal error rates (expected and observed) versus τ ∈ [0, 1]. Inconsistency indicates lack of fit and consistency indicates good model fit. We note that the variation of the difference between the expected and observed marginal classification error rates is of constant order (O(n^{-1/2})) and free of τ. This small, homogeneous variation at each τ potentially detects flexible model discrepancies with high power. A simulation study shows that this profile approach, named CERC (classification-error-rate calibration), is useful for checking a wrong parameter value, an incorrect subset of predictor vector components, and link function misspecification. We also provide some theoretical results as well as numerical examples to show that the ROC (receiver operating characteristic) curve is not suitable for binary model goodness-of-fit testing.
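As a concrete illustration of the CERC profile described above, the following minimal sketch simulates logistic data and plots the observed versus expected marginal false-positive rate across τ. The simulated data, the logistic link, and the exact definitions of the two rates are assumptions made here for illustration; this is not the authors' implementation.

```python
# Minimal sketch of a CERC-style screening profile (illustrative assumptions only).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 2))
beta_true = np.array([1.0, -0.5])
p_true = 1.0 / (1.0 + np.exp(-(X @ beta_true)))      # true case probabilities
Y = rng.binomial(1, p_true)                           # observed binary responses

# Case probabilities under the hypothesized model (here taken to be the true one).
beta_hyp = beta_true
p_hyp = 1.0 / (1.0 + np.exp(-(X @ beta_hyp)))

taus = np.linspace(0.0, 1.0, 101)
obs_fp, exp_fp = [], []
for tau in taus:
    y_star = (p_hyp >= tau).astype(int)               # deterministic classification Y*
    obs_fp.append(np.mean((y_star == 1) & (Y == 0)))  # observed marginal false-positive rate
    exp_fp.append(np.mean(y_star * (1.0 - p_hyp)))    # expected rate under the hypothesized model

plt.plot(taus, obs_fp, label="observed")
plt.plot(taus, exp_fp, label="expected under model")
plt.xlabel("threshold tau")
plt.ylabel("marginal false-positive rate")
plt.legend()
plt.show()
```

Under a correctly specified model the two curves should track each other across the whole range of τ; systematic separation would signal lack of fit.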
Proteins play a key role in facilitating the infectiousness of the 2019 novel coronavirus. A specific spike protein enables this virus to bind to human cells, and a thorough understanding of its 3-dimensional structure is therefore critical for developing effective therapeutic interventions. However, its structure may continue to evolve over time as a result of mutations. In this paper, we use a data science perspective to study the potential structural impacts due to ongoing mutations in its amino acid sequence. To do so, we identify a key segment of the protein and apply a sequential Monte Carlo sampling method to detect possible changes to the space of low-energy conformations for different amino acid sequences. Such computational approaches can further our understanding of this protein structure and complement laboratory efforts.
Abstract: In the medical literature, researchers have suggested various statistical procedures to estimate the parameters in claim count or frequency models. In recent years, the Poisson regression model in particular has been widely used. However, it is also recognized that count or frequency data in medical practice often display overdispersion, i.e., a situation where the variance of the response variable exceeds the mean. Inappropriate imposition of the Poisson assumption may underestimate the standard errors and overstate the significance of the regression parameters, consequently giving misleading inference about them. This article suggests the Negative Binomial (NB) and Conway-Maxwell-Poisson (COM-Poisson) regression models as alternatives for handling overdispersion. All of the aforementioned regression models are applied to simulated data and to a dataset of hospitalization counts for people with schizophrenia, and the results are compared.
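To illustrate the kind of comparison this abstract describes, here is a minimal sketch contrasting Poisson and Negative Binomial fits on simulated overdispersed counts with statsmodels. The simulated data, the fixed dispersion value, and the package choice are assumptions of this sketch (COM-Poisson fitting is omitted); it is not the article's actual analysis.

```python
# Sketch: Poisson vs. Negative Binomial regression on overdispersed counts.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
mu = np.exp(0.5 + 0.8 * x)
# Gamma-Poisson mixture yields Negative Binomial counts (variance > mean).
y = rng.poisson(mu * rng.gamma(shape=2.0, scale=0.5, size=n))
X = sm.add_constant(x)

pois = sm.GLM(y, X, family=sm.families.Poisson()).fit()
nb = sm.GLM(y, X, family=sm.families.NegativeBinomial(alpha=0.5)).fit()  # alpha fixed for simplicity

# Under overdispersion, the Poisson fit tends to understate the standard errors.
print("Poisson SEs:", pois.bse)
print("NB SEs:     ", nb.bse)
```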
Abstract: In this paper we propose a new bivariate long-term distribution based on the Farlie-Gumbel-Morgenstern copula model. The proposed model allows for the presence of censored data and covariates in the cure parameter. For inferential purposes, a Bayesian approach via Markov chain Monte Carlo (MCMC) is considered. Further, some discussion of model selection criteria is given. In order to examine outlying and influential observations, we develop Bayesian case-deletion influence diagnostics based on the Kullback-Leibler divergence. The newly developed procedures are illustrated on artificial and real HIV data.
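For reference, the Farlie-Gumbel-Morgenstern copula on which the proposed bivariate model is based has the following standard form; the notation here is the generic one and may differ from the paper's.

```latex
% Standard Farlie-Gumbel-Morgenstern copula (generic notation).
C_\theta(u, v) = u\,v\,\bigl[1 + \theta(1 - u)(1 - v)\bigr],
\qquad u, v \in [0, 1], \quad \theta \in [-1, 1].
```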
Abstract: This paper evaluates and compares the heterogeneous balance-variation order pair of any two decision-making trial and evaluation laboratory (DEMATEL) theories, in which one has a larger balance and a smaller variation while the other has a smaller balance and a larger variation. To this end, the first author proposes an integrated validity index for evaluating any DEMATEL theory by combining Liu's balanced coefficient and Liu's variation coefficient. Applying this new validity index, three kinds of DEMATEL with the same direct relational matrix are compared: the traditional, the shrinkage, and the balance DEMATEL. Furthermore, a simple validity experiment is conducted. The results show that the balance DEMATEL has the best performance and that the shrinkage DEMATEL's performance is better than that of the traditional DEMATEL.
In this article, various mathematical and statistical properties of the Burr type XII distribution (such as quantiles, moments, the moment generating function, the hazard rate, conditional moments, the mean residual lifetime, the mean past lifetime, mean deviations about the mean and the median, stochastic ordering, the stress-strength parameter, various entropies, Bonferroni and Lorenz curves, and order statistics) are derived. We discuss some exact expressions and recurrence relations for the single and product moments of upper record values. Further, using the relations for single moments, we tabulate the means and variances of upper record values from samples of sizes up to 10 for various values of α and β. Finally, a characterization of this distribution based on conditional moments of record values and a recurrence relation for kth record values is presented.
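For orientation, one common two-parameter form of the Burr type XII distribution is shown below; the article's exact parameterization in terms of α and β may differ.

```latex
% One common parameterization of the Burr type XII distribution.
F(x) = 1 - \bigl(1 + x^{\alpha}\bigr)^{-\beta}, \qquad
f(x) = \alpha\beta\, x^{\alpha - 1}\bigl(1 + x^{\alpha}\bigr)^{-(\beta + 1)},
\qquad x > 0,\ \alpha > 0,\ \beta > 0.
```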
The normal distribution is the most popular model in applications to real data. We propose a new extension of this distribution, called the Kummer beta normal distribution, which presents greater flexibility for modeling scenarios involving skewed data. The new probability density function can be represented as a linear combination of exponentiated normal pdfs. We also derive analytical expressions for some mathematical quantities: ordinary and incomplete moments, mean deviations, and order statistics. The estimation of parameters is approached by the method of maximum likelihood and by Bayesian analysis. Likelihood ratio statistics and formal goodness-of-fit tests are used to compare the proposed distribution with some of its sub-models and non-nested models. A real data set is used to illustrate the importance of the proposed model.
Abstract: The Affymetrix high-density oligonucleotide microarray makes it possible to simultaneously measure, and thus compare, the expression profiles of hundreds of thousands of genes in living cells. Genes differentially expressed under different conditions are very important to both basic and medical research. However, before detecting these differentially expressed genes from a vast number of candidates, it is necessary to normalize the microarray data because of the significant variation caused by non-biological factors. During the last few years, normalization methods based on probe-level or probeset-level intensities have been proposed in the literature. These methods were motivated by different purposes. In this paper, we propose a multivariate normalization method, based on partial least squares regression, that aims to equalize the central tendency and to reduce and equalize the variation of the probe-level intensities in any probeset across the replicated arrays. By so doing, we hope that one can estimate the gene expression indexes more precisely.
In the recent statistical literature, the difference between explanatory and predictive statistical models has been emphasized. One of the tenets of this dichotomy is that variable selection methods should be applied only to predictive models. In this paper, we compare the effectiveness of the acquisition strategies implemented by Google and Yahoo for the management of innovations. We argue that this is a predictive situation and thus apply lasso variable selection to a Cox regression model in order to compare the Google and Yahoo results. We show that the predictive approach yields different results than an explanatory approach and thus refutes the conventional wisdom that Google was always superior to Yahoo during the period under consideration.
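As a generic illustration of lasso variable selection in a Cox regression (not the Google/Yahoo acquisition data, which are not reproduced here), one might fit an L1-penalized Cox model as in the sketch below; the lifelines package and its bundled Rossi recidivism dataset are assumptions of this example.

```python
# Generic sketch of lasso (L1) variable selection in a Cox regression,
# using the lifelines package and its bundled Rossi recidivism data.
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi

df = load_rossi()                                   # durations in 'week', events in 'arrest'
cph = CoxPHFitter(penalizer=0.1, l1_ratio=1.0)      # pure L1 (lasso) penalty
cph.fit(df, duration_col="week", event_col="arrest")
cph.print_summary()                                 # coefficients shrunk to zero are dropped from the predictive model
```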
Abstract: An epidemiological cohort study that adopts a two-phase design raises a serious issue of how to treat a fairly large amount of missing values that are either Missing At Random (MAR) due to the study design or potentially Missing Not At Random (MNAR) due to non-response and loss to follow-up. Cognitive impairment (CI) is an evolving concept that needs epidemiological characterization for its maturity. In this work, we attempt to estimate the incidence rate of CI by accounting for the aforementioned missing-data process. We consider baseline and first follow-up data on 2191 African-Americans enrolled in a prospective epidemiological study of dementia that adopted a two-phase sampling design. We develop a multiple imputation procedure in the mixture model framework that can be easily implemented in SAS. A sensitivity analysis is carried out to assess the dependence of the estimates on specific model assumptions. It is shown that African-Americans aged 65-75 have a much higher incidence rate of CI than younger or older elderly people. In conclusion, multiple imputation provides a practical and general framework for the estimation of epidemiological characteristics in two-phase sampling studies.