Getting a machine to understand the meaning of language is an important goal in a wide variety of fields, from advertising to entertainment. In this work, we focus on YouTube comments from the top two hundred trending videos as a source of user text data. Previous sentiment analysis models have relied on hand-labelled data or predetermined lexicons. Our goal is to train a model to label comment sentiment with emoticons by training on other user-generated comments containing emoticons. Naive Bayes and Recurrent Neural Network models are both investigated and implemented in this study, and their validation accuracies are found to be 0.548 and 0.812, respectively.
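As a rough illustration of training on emoticon-bearing comments, the sketch below fits a multinomial Naive Bayes classifier with add-one smoothing, using the emoticon found in each comment as its label. The toy comments, the two-emoticon label set, the tokenizer, and all function names are assumptions made for this example; they are not the data or the implementation used in the study.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical toy comments; the emoticon inside each comment acts as its label.
COMMENTS = [
    "love this song :)",
    "great video, love it :)",
    "this is awful :(",
    "worst video ever :(",
]
EMOTICONS = [":)", ":("]  # assumed label set, for illustration only


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def split_label(comment):
    """Return (tokens, emoticon), or None if the comment has no known emoticon."""
    for emo in EMOTICONS:
        if emo in comment:
            return tokenize(comment.replace(emo, " ")), emo
    return None


def train(comments):
    """Count words per emoticon class; comments without an emoticon are skipped."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for comment in comments:
        parsed = split_label(comment)
        if parsed is None:
            continue
        tokens, emo = parsed
        class_counts[emo] += 1
        word_counts[emo].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab


def predict(text, model):
    """Return the emoticon with the highest smoothed log-posterior."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for emo in class_counts:
        score = math.log(class_counts[emo] / total)
        denom = sum(word_counts[emo].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[emo][tok] + 1) / denom)
        if score > best_score:
            best, best_score = emo, score
    return best


model = train(COMMENTS)
print(predict("love this video", model))
```

A real pipeline would need a far larger corpus, a richer emoticon set, and a held-out validation split to produce accuracies comparable to those reported above.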
MATLAB, Python and R have all been used successfully in teaching college students the fundamentals of mathematics and statistics. In today’s data-driven environment, the study of data through big data analytics is very powerful, especially for decision making and for the statistical use of data. MATLAB can be used to teach introductory mathematics such as calculus and statistics. Both Python and R can be used to make decisions involving big data. On the one hand, Python is well suited to teaching introductory statistics in a data-rich environment. On the other hand, while R is a little more involved, there are many customizable programs that can make somewhat involved decisions in the context of prepackaged, preprogrammed statistical analysis.
Abstract: Although United States government planners and others outside government had recognized the potential risk of attacks by terrorists, the events of September 11, 2001, vividly revealed the nation’s vulnerabilities to terrorism. Similarly, the 2004 terrorist attacks in Madrid illustrated that vulnerabilities to terrorism extend beyond the United States. Those attacks were obvious destructive acts with a primary purpose of massive casualties. Let us consider a bioterrorist attack which is conducted subtly through the release of a chemical or biological agent. If such an attack occurs through release of a specific biological agent, an awareness of the potential threat of this agent in terms of the number of infections and deaths that could occur in a community is of paramount importance in preparing the public health community to respond to this attack. An increase in biosurveillance and novel approaches to biosurveillance are needed. This paper illustrates the use of a mixed effects model for biosurveillance based on commuter size for regional rail lines. With a mixed effects model we can estimate, for any station on a given rail system, the expected daily number of commuters and establish an acceptability criterion around this expected size. If the actual commuter size is significantly smaller than the estimate, then this could be an indicator of a possible attack. We illustrate this method through an example based on the 2001 daily totals for the Port Authority Transportation Company (PATCO) rail system, which serves residents of southern New Jersey and the Philadelphia region in the United States. In addition, we discuss ways to put this application in a real-time setting for continuous biosurveillance.
In this paper, maximum likelihood and Bayesian methods of estimation are used to estimate the unknown parameters of two Weibull populations with the same shape parameter under a joint progressive Type-I (JPT-I) censoring scheme. Bayes estimates of the parameters are obtained based on squared error and LINEX loss functions under the assumption of independent gamma priors. We propose to apply the Markov Chain Monte Carlo (MCMC) technique to carry out the Bayesian estimation procedure. The approximate confidence intervals and the credible intervals for the unknown parameters are also obtained. Finally, we analyze one real data set for illustrative purposes.
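To make the two loss functions concrete, the sketch below computes Bayes estimates from a set of posterior draws such as an MCMC sampler would produce. Under squared error loss the Bayes estimate is the posterior mean; under LINEX loss with shape constant a (a nonzero) it is -(1/a) log E[exp(-a theta) | data]. The draws, the value of a, and the function names are hypothetical and chosen only for illustration; the sampler itself is not shown.

```python
import math


def bayes_squared_error(draws):
    """Bayes estimate under squared error loss: the posterior mean."""
    return sum(draws) / len(draws)


def bayes_linex(draws, a):
    """Bayes estimate under LINEX loss: -(1/a) * log(mean(exp(-a * theta)))."""
    if a == 0:
        raise ValueError("the LINEX shape constant a must be nonzero")
    mean_exp = sum(math.exp(-a * theta) for theta in draws) / len(draws)
    return -math.log(mean_exp) / a


# Hypothetical posterior draws of a single parameter (stand-ins for MCMC output).
draws = [1.0, 2.0, 3.0]

print(bayes_squared_error(draws))
print(bayes_linex(draws, a=1.0))   # a > 0 penalizes overestimation more heavily
print(bayes_linex(draws, a=-1.0))  # a < 0 penalizes underestimation more heavily
```

For a positive a the LINEX estimate falls below the posterior mean, and for a negative a it falls above it, which reflects the asymmetry of the loss.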
Abstract: The self-controlled case series (SCCS) and the matched cohort are two frequently used study designs to adjust for known and unknown confounding effects in epidemiological studies. Count data arising from these two designs may not be independent. While conditional Poisson regression models have been used to take into account the dependence of such data, these models have not been available in some standard statistical software packages (e.g., SAS). This article demonstrates 1) the relationship of the likelihood function and parameter estimation between the conditional Poisson regression models and Cox’s proportional hazard models in SCCS and matched cohort studies; 2) that it is possible to fit conditional Poisson regression models with procedures (e.g., PHREG in SAS) using Cox’s partial likelihood model. We tested both conditional Poisson likelihood and Cox’s partial likelihood models on data from studies using either SCCS or a matched cohort design. For the SCCS study, we fitted both parametric and semi-parametric models to model age effects, and described a simple way to apply the parametric and complex semi-parametric analysis to case series data.
Abstract: Meta-analytic methods for diagnostic test performance, Bayesian methods in particular, have not been well developed. The most commonly used method for meta-analysis of diagnostic test performance is the Summary Receiver Operator Characteristic (SROC) curve approach of Moses, Shapiro and Littenberg. In this paper, we provide a brief summary of the SROC method, then present a case study of a Bayesian adaptation of their SROC curve method that retains the simplicity of the original model while additionally incorporating uncertainty in the parameters, and can also easily be extended to incorporate the effect of covariates. We further derive a simple transformation which facilitates prior elicitation from clinicians. The method is applied to two datasets: an assessment of computed tomography for detecting metastases in non-small-cell lung cancer, and a novel dataset to assess the diagnostic performance of endoscopic ultrasound (EUS) in the detection of biliary obstructions relative to the current gold standard of endoscopic retrograde cholangiopancreatography (ERCP).
Abstract: In this study, the first exit time of a compound Poisson process with positive jumps and an upper horizontal boundary is considered. An explicit formula is derived for the mean first exit time associated with the compound Poisson process. Finally, an application to traffic accidents is given to illustrate the usage of the mean first exit time.
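The quantity studied here can be approximated by simulation. The sketch below estimates the mean first exit time by Monte Carlo, and it assumes exponentially distributed jump sizes purely for illustration. For that special case a standard renewal argument gives the mean (1 + b/m)/lambda, where lambda is the arrival rate, m the mean jump size and b the boundary; this is used only as a check and is not claimed to be the formula derived in the study. All parameter values and function names are hypothetical.

```python
import random


def simulate_exit_time(rate, boundary, jump_sampler, rng):
    """Simulate one path and return the first time the process exceeds the boundary."""
    t, level = 0.0, 0.0
    while level <= boundary:
        t += rng.expovariate(rate)   # exponential waiting time until the next jump
        level += jump_sampler(rng)   # positive jump size
    return t


def mean_exit_time(rate, boundary, jump_sampler, n_paths=20000, seed=0):
    """Monte Carlo estimate of the mean first exit time over n_paths paths."""
    rng = random.Random(seed)
    total = sum(
        simulate_exit_time(rate, boundary, jump_sampler, rng)
        for _ in range(n_paths)
    )
    return total / n_paths


def exp_jump(rng, mean=1.0):
    """Assumed jump distribution: exponential with the given mean."""
    return rng.expovariate(1.0 / mean)


estimate = mean_exit_time(rate=2.0, boundary=3.0, jump_sampler=exp_jump)
closed_form = (1.0 + 3.0 / 1.0) / 2.0  # exponential-jump special case only
print(estimate, closed_form)
```

Any other positive jump distribution can be passed in as jump_sampler, although the closed-form check above would then no longer apply.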
We demonstrate how to test for conditional independence of two variables with categorical data using Poisson log-linear models. The size of the conditioning set of variables can vary from 0 (simple independence) up to many variables. We also provide a function in R for performing the test. Instead of calculating all possible tables with a for loop, we perform the test using log-linear models and thus speed up the process. Time comparison simulation studies are presented.
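The test statistic can be sketched without fitting a model explicitly. For the hypothesis that X and Y are independent given Z, the log-linear model with the XZ and YZ interactions has closed-form fitted counts n(x,z) n(y,z) / n(z), so the likelihood-ratio statistic G2 can be computed directly from the margins. The Python sketch below takes that shortcut instead of calling a Poisson regression routine, and it is not the R function mentioned above; the function name and the record layout are assumptions for this example.

```python
import math
from collections import Counter


def g2_conditional_independence(records):
    """Return (G2, df) for testing X independent of Y given Z.

    records is an iterable of (x, y, z) triples, where z is a hashable
    value such as a tuple of conditioning levels; use z = () for an
    empty conditioning set (simple independence).
    """
    records = list(records)
    n_xyz = Counter(records)
    n_xz = Counter((x, z) for x, _, z in records)
    n_yz = Counter((y, z) for _, y, z in records)
    n_z = Counter(z for _, _, z in records)

    g2 = 0.0
    for (x, y, z), n in n_xyz.items():
        fitted = n_xz[(x, z)] * n_yz[(y, z)] / n_z[z]
        g2 += 2.0 * n * math.log(n / fitted)  # empty cells contribute zero

    x_levels = {x for x, _, _ in records}
    y_levels = {y for _, y, _ in records}
    # Nominal degrees of freedom, assuming every level occurs in every stratum.
    df = (len(x_levels) - 1) * (len(y_levels) - 1) * len(n_z)
    return g2, df


# Hypothetical data: X and Y coincide exactly within a single stratum.
dependent = [(0, 0, ())] * 10 + [(1, 1, ())] * 10
print(g2_conditional_independence(dependent))
```

The statistic is then compared with a chi-square distribution on df degrees of freedom; the p-value computation is left out here to keep the sketch free of external dependencies.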
Abstract: In this study, we propose a pattern matching procedure to capture similar price movements of two stocks. First, the algorithm for finding the longest common subsequence is introduced to sieve out the time periods in which the two stocks have the same integrated volatility levels and price rise/drop trends. Next, we transform the price data in the matching time periods found to the Bollinger Percent b data. The low-frequency power spectra of the transformed data are used to extract trends. Pearson’s chi-square test is used to assess the similarity of the price movement patterns in the matching periods. Simulation results show that the proposed procedure can effectively detect the co-movement periods of two price sequences. Finally, we apply the proposed procedure to empirical high-frequency transaction data from the NYSE.
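The first step can be illustrated with the textbook dynamic-programming algorithm for the longest common subsequence. In the sketch below each time period is assumed to be already encoded as a symbol combining trend and volatility level, for example "U-high" for a rise with high volatility; this encoding and the function name are hypothetical and are not taken from the study. The function returns the index pairs of the matched periods.

```python
def lcs_pairs(a, b):
    """Return index pairs (i, j) forming a longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # dp[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])

    # Walk forward through the table to recover one optimal matching.
    pairs = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


# Hypothetical encoded sequences for two stocks (trend and volatility level).
stock_a = ["U-high", "U-high", "D-low", "U-high"]
stock_b = ["U-high", "D-low", "D-low", "U-high"]
matches = lcs_pairs(stock_a, stock_b)
print(matches)
```

The matched index pairs identify the periods whose price data would then be carried forward to the later transformation and testing steps.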