Abstract: Missing data are a common problem for researchers working with surveys and other types of questionnaires. Often, respondents do not respond to one or more items, making the conduct of statistical analyses, as well as the calculation of scores, difficult. A number of methods have been developed for dealing with missing data, though most of these have focused on continuous variables. It is not clear that these imputation techniques are appropriate for the categorical items that make up surveys. However, methods of imputation specifically designed for categorical data are either limited in the number of variables they can accommodate or have not been fully compared with the continuous data approaches used with categorical variables. The goal of the current study was to compare the performance of these explicitly categorical imputation approaches with the better-established continuous method used with categorical item responses. Results of a simulation study based on real data demonstrate that the continuous-based imputation approach and a categorical method based on stochastic regression both appear to perform well, producing data whose logistic regression results match those of the complete datasets.
Abstract: We propose two classes of nonparametric point estimators of θ = P(X < Y) in the case where (X, Y) are paired, possibly dependent, absolutely continuous random variables. The proposed estimators are based on nonparametric estimators of the joint density of (X, Y) and the distribution function of Z = Y − X. We explore the use of several density and distribution function estimators and characterise the convergence of the resulting estimators of θ. We consider the use of bootstrap methods to obtain confidence intervals. The performance of these estimators is illustrated using simulated and real data. These examples show that not accounting for pairing and dependence may lead to erroneous conclusions about the relationship between X and Y.
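As a rough illustration of the idea of working with Z = Y − X, the sketch below uses the simplest distribution-function estimator, the empirical one, so that the estimate of θ is the proportion of pairs with Z > 0. It also computes a percentile bootstrap interval by resampling whole pairs, which preserves the dependence within each pair. The function names, the toy data, and the choice of the empirical estimator and percentile interval are assumptions made for illustration; the abstract does not say which estimators or bootstrap variant the authors use.

```python
import random


def estimate_theta(pairs):
    """Empirical estimate of P(X < Y): the share of pairs with Y - X > 0."""
    diffs = [y - x for x, y in pairs]
    return sum(1 for z in diffs if z > 0) / len(diffs)


def bootstrap_ci(pairs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval; whole pairs are resampled together."""
    rng = random.Random(seed)
    n = len(pairs)
    stats = sorted(
        estimate_theta([pairs[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lo_idx = max(0, round((alpha / 2) * n_boot))
    hi_idx = min(n_boot - 1, round((1 - alpha / 2) * n_boot) - 1)
    return stats[lo_idx], stats[hi_idx]


# Hypothetical paired measurements (x_i, y_i), invented for this sketch.
toy_pairs = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 4.5)]
theta_hat = estimate_theta(toy_pairs)
ci_low, ci_high = bootstrap_ci(toy_pairs)
```

Resampling pairs rather than the X and Y samples separately is the design choice that respects the pairing; treating the two samples as independent would ignore the dependence the abstract warns about.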
Abstract: In this paper, we use the generalized influence function and the generalized Cook's distance to measure the local influence of minor perturbations on the modified ridge regression estimator in the ridge-type linear regression model. Diagnostics under perturbation of the constant variance and of individual explanatory variables are obtained when multicollinearity is present among the regressors. We also propose a statistic that reveals the influential cases for Mallows’ method, which is used to choose the biasing parameter of the modified ridge regression estimator. Two real data sets are used to illustrate our methodology.
In this paper, the geometric process model is used to analyze constant-stress accelerated life testing. The generalized half-logistic lifetime distribution is considered under progressive type-II censoring. Statistical inference is developed using the maximum likelihood approach to estimate the unknown parameters and to obtain both asymptotic and bootstrap confidence intervals. In addition, predicted values of the reliability function under usual conditions are obtained, and a method for finding the optimal value of the ratio of the geometric process is presented. Finally, a simulation study illustrates the proposed procedures and evaluates the performance of the geometric process model.
Abstract: The creation of data sets using observational methods for the lag-sequential study of behavior requires selection of a recording time unit. This is an important issue, because standard methods such as momentary sampling and partial-interval sampling consistently underestimate the frequency of some behaviors. This leads to inaccurate estimation of both the unconditional and the conditional probabilities of the different behaviors, which are the basic descriptive and analytic tools of sequential analysis methodology. The purpose of this paper is to investigate the creation of data sets usable for sequential analysis. We show that such data vary depending on the time resolution and that inaccurate choices lead to biased estimates of transition probabilities.
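The underestimation described above can be seen in a small sketch. It assumes a behavior recorded second by second as a 0/1 stream, then re-records it with momentary sampling (state at the last instant of each interval) and partial-interval sampling (1 if the behavior occurred at any point in the interval). The stream, the 10-second interval, and all function names are invented for illustration and are not taken from the paper.

```python
def count_bouts(stream):
    """Number of bout onsets, i.e. 0 -> 1 transitions, in a binary stream."""
    return sum(
        1 for prev, cur in zip([0] + stream, stream) if prev == 0 and cur == 1
    )


def momentary_sample(stream, interval):
    """Keep only the state observed at the last instant of each interval."""
    return [
        stream[start + interval - 1]
        for start in range(0, len(stream) - interval + 1, interval)
    ]


def partial_interval_sample(stream, interval):
    """Score an interval 1 if the behavior occurred at any time within it."""
    return [
        int(any(stream[start:start + interval]))
        for start in range(0, len(stream) - interval + 1, interval)
    ]


# Hypothetical 60-second record: a 2-second bout starting every 6 seconds.
stream = [1, 1, 0, 0, 0, 0] * 10

true_bouts = count_bouts(stream)                                   # 10
momentary_bouts = count_bouts(momentary_sample(stream, 10))        # 2
partial_bouts = count_bouts(partial_interval_sample(stream, 10))   # 1
```

With a 10-second recording unit, both schemes report far fewer bouts than the ten that actually occurred; with a 1-second unit, both recover the full stream.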
Abstract: We develop a likelihood ratio test statistic, based on the beta-binomial distribution, for comparing a single treated group with dichotomous data to dual control groups. This statistic is useful in cases where there is overdispersion or extra-binomial variation. We apply the statistic to data from a two-year rodent carcinogenicity study with dual control groups. The test statistic we developed is similar to others that have been developed for incorporating historical control groups in rodent carcinogenicity experiments. However, for the small-sample case we considered, the large-sample theory used by the other test statistics did not apply. We determined the critical values of this statistic by enumerating its distribution. A small Monte Carlo study shows that the new test statistic controls the significance level much better than Fisher’s exact test when there is overdispersion and that it has adequate power.
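To make "extra-binomial variation" concrete, the sketch below evaluates the beta-binomial probability mass function and compares its variance with that of a binomial distribution with the same mean. It is not the authors' test statistic, only the distribution that statistic is built on. The values n = 10, a = 2, b = 6 are arbitrary illustrative choices; with them the mean response probability is p = a/(a + b) = 0.25 and the intra-class correlation is ρ = 1/(a + b + 1) = 1/9, so the beta-binomial variance n p (1 − p)[1 + (n − 1)ρ] is twice the binomial variance.

```python
import math


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def betabinom_pmf(k, n, a, b):
    """P(K = k) for a beta-binomial with n trials and shape parameters a, b."""
    log_choose = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    )
    return math.exp(log_choose + log_beta(k + a, n - k + b) - log_beta(a, b))


# Illustrative parameter values, not taken from the paper.
n, a, b = 10, 2.0, 6.0
pmf = [betabinom_pmf(k, n, a, b) for k in range(n + 1)]

mean = sum(k * p for k, p in enumerate(pmf))
variance = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))

p_mean = a / (a + b)
binomial_variance = n * p_mean * (1 - p_mean)
```

Working on the log scale with `math.lgamma` avoids overflow in the binomial coefficient and beta functions when n is large.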
Getting a machine to understand the meaning of language is an important goal in a wide variety of fields, from advertising to entertainment. In this work, we focus on YouTube comments from the top two hundred trending videos as a source of user text data. Previous sentiment analysis models have relied on hand-labelled data or predetermined lexicons. Our goal is to train a model to label comment sentiment with emoticons by training on other user-generated comments containing emoticons. Naive Bayes and Recurrent Neural Network models are both investigated and implemented in this study, and their validation accuracies are found to be 0.548 and 0.812, respectively.
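The emoticon-as-label idea can be sketched with a small multinomial Naive Bayes classifier with add-one smoothing. Everything specific here is an assumption for illustration: the two emoticons, the whitespace tokenizer, the toy comments, and the function names are invented, and the study's actual preprocessing and model settings are not described in the abstract.

```python
import math
from collections import Counter, defaultdict

EMOTICONS = {":)", ":("}  # hypothetical label set


def split_emoticon_label(comment):
    """Use the first emoticon as the label; the other tokens are features."""
    tokens = comment.lower().split()
    label = next(tok for tok in tokens if tok in EMOTICONS)
    return [tok for tok in tokens if tok not in EMOTICONS], label


def train_nb(comments):
    """Count classes and per-class words from emoticon-bearing comments."""
    class_counts, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for comment in comments:
        tokens, label = split_emoticon_label(comment)
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, text):
    """Return the emoticon with the highest smoothed log-posterior score."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in text.lower().split():
            if tok in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training comments; each already carries its own emoticon label.
model = train_nb([
    "love this song :)",
    "great video :)",
    "hate this ad :(",
    "terrible video :(",
])
```

Calling `predict_nb(model, "love this video")` then assigns an emoticon to a comment that contains none, which mirrors the labelling task described above.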
MATLAB, Python, and R have all been used successfully to teach college students the fundamentals of mathematics and statistics. In today’s data-driven environment, big data analytics is a powerful tool, especially for decision making and for the statistical use of data. MATLAB can be used to teach introductory mathematics such as calculus and statistics. Both Python and R can be used to make decisions involving big data. On the one hand, Python is well suited to teaching introductory statistics in a data-rich environment. On the other hand, while R is a little more involved, it offers many customizable programs that can support more involved decisions through prepackaged, preprogrammed statistical analyses.
Abstract: Although United States government planners and others outside government had recognized the potential risk of attacks by terrorists, the events of September 11, 2001, vividly revealed the nation’s vulnerabilities to terrorism. Similarly, the 2004 terrorist attacks in Madrid illustrated that vulnerabilities to terrorism extend beyond the United States. Those attacks were obvious destructive acts with a primary purpose of mass casualties. Let us instead consider a bioterrorist attack conducted subtly through the release of a chemical or biological agent. If such an attack occurs through the release of a specific biological agent, an awareness of the potential threat of this agent, in terms of the number of infections and deaths that could occur in a community, is of paramount importance in preparing the public health community to respond. Increased biosurveillance and novel approaches to biosurveillance are needed. This paper illustrates the use of a mixed-effects model for biosurveillance based on commuter size for regional rail lines. With a mixed-effects model, we can estimate the expected daily number of commuters for any station on a given rail system and establish an acceptability criterion around this expected size. If the actual commuter size is significantly smaller than the estimate, this could be an indicator of a possible attack. We illustrate this method through an example based on the 2001 daily totals for the Port Authority Transit Corporation (PATCO) rail system, which serves residents of southern New Jersey and the Philadelphia region in the United States. In addition, we discuss ways to put this application in a real-time setting for continuous biosurveillance.