Abstract: Design-based regression regards the survey response as a constant waiting to be observed. Bechtel (2007) replaced this constant with the sum of a fixed true value and a random measurement error. The present paper relaxes the assumption that the expected error is zero within a survey respondent. It also allows measurement errors in predictor variables as well as in the response variable. Reasonable assumptions about these errors over respondents, along with coefficient alpha in psychological test theory, enable the regression of true responses on true predictors. This resolves two major issues in survey regression, i.e. errors in variables and item non-response. The usefulness of this resolution is demonstrated with three large datasets collected by the European Social Survey in 2002, 2004 and 2006. The paper concludes with implications of true-value regression for survey theory and practice and for surveying large world populations.
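For readers unfamiliar with how a reliability coefficient such as alpha can enter an errors-in-variables correction, the classical attenuation adjustment for a single error-prone predictor is sketched below; it is only an illustrative special case, not necessarily the estimator developed in the paper.

```latex
% Illustrative sketch only: classical correction for attenuation when an
% observed predictor X measures a true score xi with reliability alpha_X,
% and the observed response Y measures eta with reliability alpha_Y.
\[
\hat{\beta}_{\text{true}} \;=\; \frac{\hat{\beta}_{\text{obs}}}{\hat{\alpha}_X},
\qquad
\hat{\rho}(\xi,\eta) \;=\; \frac{\hat{\rho}(X,Y)}{\sqrt{\hat{\alpha}_X\,\hat{\alpha}_Y}},
\]
% where beta_obs is the slope from regressing observed Y on observed X, and
% alpha_X, alpha_Y are coefficient-alpha reliability estimates.
```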
Abstract: Objectives: Exploratory Factor Analysis (EFA) is a very popular statistical technique for identifying potential latent structure underlying a set of observed indicator variables. EFA is used widely in the social sciences, business and finance, machine learning, and the health sciences, among others. Research has found that standard methods of estimating EFA model parameters do not work well when the sample size is relatively small (e.g., less than 50) and/or when the number of observed variables approaches the sample size in value. The purpose of the current study was to investigate and compare some alternative approaches to fitting EFA in the case of small samples and high-dimensional data. Results of both a small simulation study and an application of the methods to an intelligence test revealed that several alternative approaches designed to reduce the dimensionality of the observed variable covariance matrix worked very well in terms of recovering the population factor structure with EFA. Implications of these results for practice are discussed.
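To make "reducing the dimensionality of the observed variable covariance matrix" concrete, the sketch below pairs a shrinkage (Ledoit-Wolf) covariance estimate with a simple principal-axis style factoring step; this is one plausible regularized alternative, written here as an assumption, and is not claimed to be any of the specific methods compared in the study.

```python
import numpy as np
from sklearn.covariance import LedoitWolf

def efa_loadings_from_shrunk_cov(X, n_factors):
    """Illustrative EFA via eigendecomposition of a shrinkage correlation matrix.

    X: (n, p) data matrix with n possibly close to (or smaller than) p.
    Returns a (p, n_factors) loading matrix. This is a sketch of one
    regularized alternative, not a specific published estimator.
    """
    # Shrinkage covariance stays well conditioned even when n is small.
    cov = LedoitWolf().fit(X).covariance_
    d = np.sqrt(np.diag(cov))
    corr = cov / np.outer(d, d)                    # work on the correlation scale

    # Principal-axis style factoring: top eigenpairs of the correlation matrix.
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1][:n_factors]
    return eigvecs[:, order] * np.sqrt(eigvals[order])

# Example: 30 observations on 20 indicators, 2 assumed factors.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 20))
print(efa_loadings_from_shrunk_cov(X, n_factors=2).shape)  # (20, 2)
```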
Abstract: In compositional data, an observation is a vector with non-negative components which sum to a constant, typically 1. Data of this type arise in many areas, such as geology, archaeology, biology, economics and political science, among others. The goal of this paper is to extend the taxicab metric and a newly suggested metric for compositional data by employing a power transformation. Both metrics are to be used in the k-nearest neighbours algorithm regardless of the presence of zeros. Examples with real data are exhibited.
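A minimal sketch of how a power transformation can be combined with the taxicab (L1) distance inside the k-nearest neighbours algorithm is given below; the transformation, the value of the power parameter, and the toy labels are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def power_transform(X, alpha=0.5):
    """Power-transform compositions row-wise: x_i^alpha / sum_j x_j^alpha.

    Because only non-negative powers are involved, rows containing zeros are
    handled without imputation (illustrative choice).
    """
    Z = np.power(X, alpha)
    return Z / Z.sum(axis=1, keepdims=True)

# Hypothetical compositional data: rows are non-negative and sum to 1.
rng = np.random.default_rng(1)
X = rng.dirichlet(alpha=[2.0, 1.0, 0.5], size=200)
y = (X[:, 0] > 0.4).astype(int)          # toy class labels for illustration

# The taxicab metric corresponds to 'manhattan' in scikit-learn.
knn = KNeighborsClassifier(n_neighbors=5, metric="manhattan")
knn.fit(power_transform(X[:150]), y[:150])
print(knn.score(power_transform(X[150:]), y[150:]))
```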
Abstract: We present power calculations for zero-inflated Poisson (ZIP) and zero-inflated negative-binomial (ZINB) models. We detail direct computations for a ZIP model based on a two-sample Wald test using the expected information matrix. We also demonstrate how Lyles, Lin, and Williamson’s method (2006) of power approximation for categorical and count outcomes can be extended to both zero-inflated models. This method can be used for power calculations based on the Wald test (via the observed information matrix) and the likelihood ratio test, and can accommodate both categorical and continuous covariates. All the power calculations can be conducted when covariates are used in the modeling of both the count data and the “excess zero” data, or in either part separately. We present simulations to detail the performance of the power calculations. Analysis of a malaria study is used for illustration.
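As a quick check on such calculations, power for a two-group ZIP comparison can also be approximated by simulation. The sketch below uses statsmodels' ZeroInflatedPoisson with hypothetical effect sizes; it only approximates, by Monte Carlo, the analytic Wald-test power described in the abstract.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedPoisson

def simulate_zip(group, beta0, beta1, p_zero, rng):
    """Draw ZIP counts: structural zeros with probability p_zero, else Poisson."""
    mu = np.exp(beta0 + beta1 * group)
    y = rng.poisson(mu)
    y[rng.uniform(size=len(group)) < p_zero] = 0
    return y

def power_two_sample_zip(n_per_group=100, beta1=0.4, n_sims=200, alpha=0.05, seed=0):
    """Monte Carlo power of the Wald test for the group effect in the count part."""
    rng = np.random.default_rng(seed)
    group = np.repeat([0.0, 1.0], n_per_group)
    X = sm.add_constant(group)
    rejections, fitted = 0, 0
    for _ in range(n_sims):
        y = simulate_zip(group, beta0=0.5, beta1=beta1, p_zero=0.3, rng=rng)
        try:
            res = ZeroInflatedPoisson(y, X, exog_infl=np.ones((len(y), 1))).fit(disp=0)
        except Exception:
            continue                       # skip non-converged replicates
        fitted += 1
        pvals = np.asarray(res.pvalues)    # inflation parameters come first,
        if pvals[-1] < alpha:              # so the last entry is the group slope
            rejections += 1
    return rejections / max(fitted, 1)

print(power_two_sample_zip())  # illustrative Monte Carlo power estimate
```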
Abstract: In dementia screening tests, item selection for shortening an existing screening test can be achieved using multiple logistic regression. However, maximum likelihood estimates for such logistic regression models often suffer from serious bias, or even non-existence, because of separation and multicollinearity problems resulting from a large number of highly correlated items. Firth (1993, Biometrika, 80(1), 27–38) proposed a penalized likelihood estimator for generalized linear models that was shown to reduce both the bias and the non-existence problems. Ridge regression has been used in logistic regression to stabilize the estimates in cases of multicollinearity. However, neither method resolves the problem addressed by the other. In this paper, we propose a double penalized maximum likelihood estimator that combines Firth's penalized likelihood equation with a ridge parameter. We present a simulation study evaluating the empirical performance of the double penalized likelihood estimator in small to moderate sample sizes. We demonstrate the proposed approach using current screening data from a community-based dementia study.
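One direct (if not numerically optimal) way to obtain such an estimator is to maximize the doubly penalized log-likelihood itself. The sketch below writes the objective l(β) + ½ log|I(β)| − ½λ‖β‖² for a logistic model and hands it to a generic optimizer; it follows the construction described in the abstract but is not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def double_penalized_logistic(X, y, ridge_lambda=0.1):
    """Illustrative Firth-plus-ridge estimator for logistic regression:
    maximize  l(beta) + 0.5*log|I(beta)| - 0.5*lambda*||beta||^2,
    where I(beta) = X' W X is the Fisher information."""
    n, p = X.shape

    def neg_objective(beta):
        eta = np.clip(X @ beta, -30, 30)          # numerical stability
        prob = 1.0 / (1.0 + np.exp(-eta))
        loglik = np.sum(y * eta - np.log1p(np.exp(eta)))
        W = prob * (1.0 - prob)
        info = X.T @ (W[:, None] * X)
        sign, logdet = np.linalg.slogdet(info)
        firth = 0.5 * logdet if sign > 0 else -np.inf
        ridge = 0.5 * ridge_lambda * np.sum(beta ** 2)
        return -(loglik + firth - ridge)

    res = minimize(neg_objective, x0=np.zeros(p), method="Nelder-Mead",
                   options={"maxiter": 20000, "xatol": 1e-8, "fatol": 1e-8})
    return res.x

# Example with complete separation: y is perfectly predicted by the covariate,
# so ordinary maximum likelihood estimates do not exist, but these stay finite.
rng = np.random.default_rng(2)
x = rng.normal(size=50)
X = np.column_stack([np.ones(50), x])
y = (x > 0).astype(float)
print(double_penalized_logistic(X, y))
```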
Abstract: The concept of ranked set sampling (RSS) is applicable whenever ranking on a set of sampling units can be done easily by a judgment method or based on an auxiliary variable. In this work, we consider a study variable Y correlated with an auxiliary variable X that is used to rank the sampling units. Further, (X, Y) is assumed to have a Morgenstern-type bivariate generalized uniform distribution. We obtain an unbiased estimator of a scale parameter associated with the study variable Y based on different RSS schemes and censored RSS. An efficiency comparison of these estimators is also performed and the results are presented numerically.
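As a reminder of the sampling scheme itself, a minimal ranked set sample can be drawn as in the sketch below: units are ranked within each set using only the auxiliary variable X, and a single Y is measured per set. The toy population and the plain RSS mean shown here are illustrative placeholders, not the Morgenstern-type model or the scale estimators analysed in the paper.

```python
import numpy as np

def ranked_set_sample(x_pop, y_pop, set_size, n_cycles, rng):
    """Draw a ranked set sample: in each cycle, for rank r = 1..set_size,
    select a simple random set of `set_size` units, rank them on the
    auxiliary variable X, and measure Y only on the unit of rank r."""
    ys = []
    for _ in range(n_cycles):
        for r in range(set_size):
            idx = rng.choice(len(x_pop), size=set_size, replace=False)
            order = idx[np.argsort(x_pop[idx])]
            ys.append(y_pop[order[r]])          # judge by X, measure Y
    return np.array(ys)

# Toy positively correlated (X, Y) population, for illustration only.
rng = np.random.default_rng(3)
x_pop = rng.uniform(size=10_000)
y_pop = 2.0 * x_pop + rng.uniform(size=10_000)
rss_y = ranked_set_sample(x_pop, y_pop, set_size=4, n_cycles=25, rng=rng)
print(rss_y.mean())   # RSS-based estimate of E[Y]
```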
Abstract: This paper provides an introduction to a multivariate non-parametric hazard model for the occurrence of earthquakes, since the hazard function defines the statistical distribution of inter-event times. The method is applied to Turkish seismicity, as a significant portion of Turkey is subject to frequent earthquakes, and it presents several advantages compared to other, more traditional approaches. Destructive earthquakes from 1903 to 2009 between the latitudes of 39°N and 42°N and the longitudes of 26°E and 45°E are used. The paper demonstrates how seismicity and tectonic/physical parameters can potentially influence the spatio-temporal variability of earthquakes.
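Because the hazard of inter-event times is the central object, a simple nonparametric starting point is the Nelson-Aalen cumulative hazard computed from observed inter-event times. The sketch below, with made-up times and no censoring, is univariate and much simpler than the multivariate model applied in the paper.

```python
import numpy as np

def nelson_aalen(times):
    """Nelson-Aalen cumulative hazard for fully observed (uncensored) durations.
    Returns the sorted event times and the cumulative hazard at those times."""
    t = np.sort(np.asarray(times, dtype=float))
    at_risk = len(t) - np.arange(len(t))        # units still at risk at each event
    return t, np.cumsum(1.0 / at_risk)

# Hypothetical inter-event times (in years) between destructive earthquakes.
inter_event_years = [1.2, 0.8, 3.5, 2.1, 0.4, 5.0, 1.7]
t, H = nelson_aalen(inter_event_years)
print(np.column_stack([t, H]))
```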
Abstract: For a binary regression model with observed responses (Ys), specified predictor vectors (Xs), an assumed model parameter vector (β), and a case probability function (Pr(Y = 1|X, β)), we propose a simple screening method for testing goodness of fit when the number of observations (n) is large and the Xs are continuous variables. Given any threshold τ ∈ [0, 1], we classify each subject with predictor X into Y* = 1 or 0 (a deterministic binary variable, distinct from the observed random binary variable Y) according to whether the calculated case probability Pr(Y = 1|X, β) under the hypothesized true model is ≥ τ or < τ. For each τ, we compare the expected marginal classification error rate (false positives [Y* = 1, Y = 0] or false negatives [Y* = 0, Y = 1]) under the hypothesized true model with the observed marginal error rate, which is directly observable under this classification rule. The screening profile is created by plotting the τ-specific marginal error rates (expected and observed) versus τ ∈ [0, 1]. Inconsistency indicates lack of fit, and consistency indicates good model fit. We note that the variation of the difference between the expected and observed marginal classification error rates is of order O(n^{-1/2}) and free of τ. This small, homogeneous variation at each τ potentially detects flexible model discrepancies with high power. A simulation study shows that this profile approach, named CERC (classification-error-rate-calibration), is useful for checking wrong parameter values, incorrect predictor vector component subsets, and link function misspecification. We also provide theoretical results as well as numerical examples to show that the ROC (receiver operating characteristic) curve is not suitable for binary model goodness-of-fit testing.
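The profile itself is straightforward to compute once fitted case probabilities are available. A minimal sketch, assuming a vector of fitted probabilities p and observed responses y (names and the toy misspecification are illustrative), is given below.

```python
import numpy as np

def cerc_profile(p, y, taus=None):
    """Classification-error-rate-calibration (CERC) profile.

    For each threshold tau, classify Y* = 1 when the fitted case probability
    p >= tau, then compare the model-expected marginal false-positive and
    false-negative rates with the directly observed ones."""
    p = np.asarray(p, dtype=float)
    y = np.asarray(y, dtype=float)
    taus = np.linspace(0.0, 1.0, 101) if taus is None else np.asarray(taus)
    rows = []
    for tau in taus:
        pos = p >= tau                              # region where Y* = 1
        rows.append((tau,
                     np.mean(pos * (1.0 - p)),      # expected P(Y*=1, Y=0)
                     np.mean(pos * (1.0 - y)),      # observed false-positive rate
                     np.mean(~pos * p),             # expected P(Y*=0, Y=1)
                     np.mean(~pos * y)))            # observed false-negative rate
    return np.array(rows)

# Illustration: probabilities from a misspecified model, outcomes from the truth.
rng = np.random.default_rng(4)
x = rng.normal(size=5000)
p_true = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))         # data-generating model
p_hyp = 1 / (1 + np.exp(-(0.5 + 0.5 * x)))          # hypothesized (wrong) slope
y = rng.binomial(1, p_true)
profile = cerc_profile(p_hyp, y)
# Large gaps between expected and observed columns across tau indicate lack of fit.
print(np.abs(profile[:, 1] - profile[:, 2]).max())
```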
Abstract: Proteins play a key role in facilitating the infectiousness of the 2019 novel coronavirus. A specific spike protein enables this virus to bind to human cells, and a thorough understanding of its 3-dimensional structure is therefore critical for developing effective therapeutic interventions. However, its structure may continue to evolve over time as a result of mutations. In this paper, we use a data science perspective to study the potential structural impacts due to ongoing mutations in its amino acid sequence. To do so, we identify a key segment of the protein and apply a sequential Monte Carlo sampling method to detect possible changes to the space of low-energy conformations for different amino acid sequences. Such computational approaches can further our understanding of this protein structure and complement laboratory efforts.
Abstract: In the medical literature, researchers have suggested various statistical procedures to estimate the parameters in claim count or frequency models. In recent years, the Poisson regression model in particular has been widely used. However, it is also recognized that count or frequency data in medical practice often display overdispersion, i.e., a situation where the variance of the response variable exceeds the mean. Inappropriate imposition of the Poisson model may underestimate the standard errors and overstate the significance of the regression parameters, consequently giving misleading inference about them. This article suggests the Negative Binomial (NB) and Conway-Maxwell-Poisson (COM-Poisson) regression models as alternatives for handling overdispersion. All of the regression models are applied to simulated data and to a dataset on the number of hospitalizations of people with schizophrenia, and the results are compared.
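A minimal comparison of the Poisson and negative binomial fits can be carried out with statsmodels, as sketched below on simulated overdispersed counts; the covariate setup is hypothetical, and the COM-Poisson model is omitted because it needs a specialized implementation outside statsmodels.

```python
import numpy as np
import statsmodels.api as sm

# Simulate overdispersed counts: Poisson with gamma-distributed heterogeneity
# (which is exactly negative binomial); illustrative data only.
rng = np.random.default_rng(5)
n = 500
x = rng.normal(size=n)
mu = np.exp(0.3 + 0.6 * x)
y = rng.poisson(mu * rng.gamma(shape=2.0, scale=0.5, size=n))

X = sm.add_constant(x)
poisson_fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
negbin_fit = sm.NegativeBinomial(y, X).fit(disp=0)

# A Pearson chi-square / df ratio well above 1 flags overdispersion under the
# Poisson fit; the NB standard error for the slope is typically larger,
# illustrating how the Poisson model can overstate significance.
print("Poisson dispersion:", poisson_fit.pearson_chi2 / poisson_fit.df_resid)
print("Poisson SE for x:  ", poisson_fit.bse[1])
print("NegBin  SE for x:  ", negbin_fit.bse[1])
```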