Abstract: The probability of winning a game in major league baseball depends on various factors relating to team strength, including the past performance of the two teams, the batting ability of the two teams and the starting pitchers. These three factors change over time. We combine these factors by adopting contribution parameters, and include a home field advantage variable in forming a two-stage Bayesian model. A Markov chain Monte Carlo algorithm is used to carry out Bayesian inference and to simulate outcomes of future games. We apply the approach to data obtained from the 2001 regular season in major league baseball.
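The Markov chain Monte Carlo step mentioned above can be illustrated with a minimal sketch. This is not the paper's two-stage model: it samples a single home-field-advantage parameter h in a Bernoulli model P(home team wins) = 1/(1 + exp(-h)) with a N(0, 1) prior, using a random-walk Metropolis update. The true value, sample size, and proposal scale are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch: Metropolis sampling of one home-field-advantage
# parameter h; P(home win) = 1 / (1 + exp(-h)), prior h ~ N(0, 1).
rng = np.random.default_rng(0)

true_h = 0.3                                              # assumed true value
games = rng.random(500) < 1.0 / (1.0 + np.exp(-true_h))   # simulated outcomes
wins, n = int(games.sum()), games.size

def log_post(h):
    # Bernoulli log-likelihood plus standard normal log-prior
    p = 1.0 / (1.0 + np.exp(-h))
    return wins * np.log(p) + (n - wins) * np.log(1.0 - p) - 0.5 * h**2

h, samples = 0.0, []
for _ in range(5000):
    prop = h + rng.normal(scale=0.2)              # random-walk proposal
    if np.log(rng.random()) < log_post(prop) - log_post(h):
        h = prop                                  # accept; otherwise keep h
    samples.append(h)

posterior_mean = float(np.mean(samples[1000:]))   # discard burn-in
```

Draws from the posterior of h can then be pushed through the win-probability formula to simulate outcomes of future games, which is the role MCMC plays in the abstract.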
Abstract: Searching for data structure and decision rules using classification and regression tree (CART) methodology is now well established. An alternative procedure, search partition analysis (SPAN), is less well known. Both provide classifiers based on Boolean structures; in CART these are generated by a hierarchical series of local sub-searches and in SPAN by a global search. One issue with CART is its perceived instability; another is the awkward nature of the Boolean structures generated by a hierarchical tree. Instability arises because the final tree structure is sensitive to early splits. SPAN, as a global search, seems more likely to render stable partitions. To examine these issues in the context of identifying mothers at risk of giving birth to low birth weight babies, we have taken a very large sample, divided it at random into ten non-overlapping sub-samples and performed SPAN and CART analyses on each sub-sample. The stability of the SPAN and CART models is described and, in addition, the structure of the Boolean representation of classifiers is examined. It is found that SPAN partitions have more intrinsic stability and are less prone to Boolean structural irregularities.
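The instability being probed above can be demonstrated with a minimal sketch: refit a CART tree on ten random non-overlapping sub-samples and record each tree's root-split feature. The synthetic dataset and tree settings are illustrative assumptions, not the paper's low-birth-weight data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Illustrative instability check: does the root split change across
# ten disjoint random sub-samples of the same dataset?
X, y = make_classification(n_samples=5000, n_features=10, n_informative=4,
                           random_state=0)
idx = np.random.default_rng(0).permutation(len(X))
folds = np.array_split(idx, 10)                   # ten disjoint sub-samples

root_features = []
for f in folds:
    tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X[f], y[f])
    root_features.append(tree.tree_.feature[0])   # feature used at the root

# More distinct root features across folds = less stable tree structure.
n_distinct_roots = len(set(root_features))
```

Because every later split is conditioned on the root, a root that varies across sub-samples propagates into entirely different Boolean structures, which is exactly the sensitivity to early splits the abstract describes.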
Abstract: Let {(Xi , Yi), i ≥ 1} be a sequence of bivariate random variables from a continuous distribution. If {Rn, n ≥ 1} is the sequence of record values in the sequence of X's, then the Y that corresponds to the nth record is called the concomitant of the nth record, denoted by R[n]. In the FGM family, we determine the amount of information contained in R[n] and compare it with the amount of information contained in Rn. We also show that the Kullback-Leibler distance among the concomitants of record values is distribution-free. Finally, we provide some numerical results on mutual information and the Pearson correlation coefficient for measuring the amount of dependency between Rn and R[n] in the copula model of the FGM family.
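For readers unfamiliar with it, the FGM (Farlie-Gumbel-Morgenstern) family referred to above has the standard bivariate form

```latex
% Standard FGM joint distribution function and its copula
F_{X,Y}(x,y) = F_X(x)\,F_Y(y)\bigl[1 + \alpha\,\{1 - F_X(x)\}\{1 - F_Y(y)\}\bigr],
\qquad |\alpha| \le 1,
```

```latex
C(u,v) = u\,v\,\{1 + \alpha\,(1-u)(1-v)\}, \qquad u, v \in [0,1].
```

The copula form is the setting in which the abstract's dependence measures (mutual information and the Pearson correlation coefficient between Rn and R[n]) are computed, since the copula separates the dependence parameter α from the marginals.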
Law and legal studies have become an exciting new field for data science applications, while technological advancement also has profound implications for legal practice. For example, the legal industry has accumulated a rich body of high-quality texts, images and other digitised formats, which are ready to be further processed and analysed by data scientists. On the other hand, the increasing popularity of data science has been a genuine challenge to legal practitioners, regulators and even the general public, and has motivated a long-lasting debate in academia focusing on issues such as privacy protection and algorithmic discrimination. This paper collects 1236 journal articles involving both law and data science from the platform Web of Science to understand the patterns and trends of this interdisciplinary research field in terms of English journal publications. We find a clear trend of increasing publication volume over time and a strong presence of high-impact law and political science journals. We then use Latent Dirichlet Allocation (LDA), a topic modelling method, to classify the abstracts into four topics, the number of topics being chosen by the coherence measure. The four topics identified confirm that both challenges and opportunities have been investigated in this interdisciplinary field, and they help offer directions for future research.
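The LDA step described above can be sketched with standard tooling. This toy corpus and the vectoriser settings are assumptions for illustration; the actual study fit the model to 1236 Web of Science abstracts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Illustrative LDA fit on a toy corpus of law/data-science phrases.
docs = [
    "privacy protection and data regulation in legal practice",
    "algorithmic discrimination and fairness in automated decisions",
    "text mining of court judgements and legal documents",
    "machine learning models for predicting case outcomes",
] * 5  # repeat so the toy corpus has enough documents to fit

counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=4, random_state=0).fit(counts)
doc_topics = lda.transform(counts)  # each row is a per-document topic mixture
```

In practice the number of components is not fixed in advance as here: models with different topic counts are fitted and compared on a coherence measure, which is how the abstract arrives at four topics.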
A new flexible extension of the inverse Rayleigh model is proposed and studied. Some of its fundamental statistical properties are derived. The performance of the maximum likelihood method is assessed via a simulation study. The importance of the new model is shown via three applications to real data sets, in which it outperforms other important competing models.
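The simulation idea can be sketched for the baseline inverse Rayleigh model, F(x) = exp(-θ/x²) for x > 0, whose MLE has a closed form. The paper's extension adds further parameters; this illustrates only the simulate-then-estimate check.

```python
import numpy as np

# Simulation check of maximum likelihood for the baseline inverse Rayleigh
# model with cdf F(x) = exp(-theta / x^2), x > 0.
rng = np.random.default_rng(1)
theta = 2.0

# Inverse-transform sampling: U = exp(-theta/x^2)  =>  x = sqrt(-theta/ln U)
u = rng.random(10_000)
x = np.sqrt(-theta / np.log(u))

# Setting the score to zero gives the closed-form MLE:
# theta_hat = n / sum(1/x_i^2)
theta_hat = x.size / np.sum(1.0 / x**2)
```

Since 1/X² is exponential with rate θ under this model, the estimate concentrates on the true θ as the sample grows, which is the behaviour a simulation study of the MLE verifies.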
Technological advances in software development have effectively handled technical details, which has made life easier for data analysts but has also allowed nonexperts in statistics and computer science to analyze data. As a result, medical research suffers from statistical errors that could otherwise be prevented, such as errors in choosing a hypothesis test and in checking model assumptions. Our objective is to create an automated data analysis software package that can help practitioners run non-subjective, fast, accurate and easily interpretable analyses. We used machine learning to predict the normality of a distribution as an alternative to normality tests and graphical methods, avoiding their downsides. We implemented methods for detecting outliers, imputing missing values, and choosing a threshold for cutting numerical variables to correct for non-linearity before running a linear regression. We showed that data analysis can be automated. Our normality prediction algorithm outperformed the Shapiro-Wilk test in small samples, with a Matthews correlation coefficient of 0.5 vs. 0.16. The biggest drawback was that we did not find alternatives to the statistical tests of linear regression assumptions, which are problematic in large datasets. We also applied our work to a dataset on smoking in teenagers. Because of the open-source nature of our work, these algorithms can be used in future research and projects.
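The normality-prediction idea above can be sketched as a classifier trained on simple shape features of a sample, compared against the Shapiro-Wilk test. The features, classifier, training distributions, and sample size here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression

# Illustrative sketch: predict normality of small samples from moment
# features, in the spirit of the abstract's ML alternative to normality tests.
rng = np.random.default_rng(0)

def features(sample):
    # skewness and excess kurtosis as crude shape summaries
    return [stats.skew(sample), stats.kurtosis(sample)]

X, y = [], []
for _ in range(500):
    X.append(features(rng.normal(size=30)));      y.append(1)  # normal
    X.append(features(rng.exponential(size=30))); y.append(0)  # non-normal

clf = LogisticRegression().fit(X, y)

# Compare with the Shapiro-Wilk test at alpha = 0.05 on a fresh sample
test_sample = rng.exponential(size=30)
ml_says_normal = clf.predict([features(test_sample)])[0] == 1
sw_says_normal = stats.shapiro(test_sample).pvalue > 0.05
```

The appeal of this framing is that the classifier's decision rule is learned from labelled simulations rather than derived from a null distribution, so it can be tuned to the small-sample regime where tests like Shapiro-Wilk lose power.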
Abstract: It is important to estimate the transmissibility of the influenza virus during its growing phase in order to understand the propagation of the virus. Estimation procedures for transmissibility are usually based on data generated during flu seasons. The data-generating process of an influenza outbreak has many features: the data are generated not only by a biological process but also by control measures such as flu vaccination. We discuss estimation by taking these aspects of the data-generating process into account and by using a model that captures the essential characteristics of flu transmission during the growing phase of a flu season.
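One common growing-phase estimate, sketched here for context rather than as the paper's method, fits the exponential growth rate r of case counts by log-linear regression and converts it to a reproduction number via the Wallinga-Lipsitch relation R = 1 + r·Tg, valid for an exponentially distributed generation interval with mean Tg. The simulated counts and the value of Tg are assumptions.

```python
import numpy as np

# Illustrative growing-phase transmissibility estimate from weekly counts.
rng = np.random.default_rng(0)

weeks = np.arange(10)
true_r = 0.4                                     # per-week growth rate
cases = rng.poisson(50 * np.exp(true_r * weeks)) # simulated growing phase

r_hat = np.polyfit(weeks, np.log(cases), 1)[0]   # slope of log case counts
Tg = 0.5                                         # mean generation time (weeks)
R_hat = 1 + r_hat * Tg                           # Wallinga-Lipsitch conversion
```

Control measures such as vaccination act on the susceptible pool and hence on the observed growth rate, which is why the abstract stresses modelling the data-generating process rather than fitting the raw counts alone.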
Abstract: In this short note we establish some new explicit expressions for ratio and inverse moments of lower generalized order statistics from the Marshall-Olkin extended Burr type XII distribution. These explicit expressions can be used to develop relationships for the moments of ordinary order statistics, record statistics and other ordered random variables. Further, a characterization result for this distribution is obtained using the conditional moment of the lower generalized order statistics.
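As background, the Marshall-Olkin extension tilts a baseline survival function; with the Burr type XII baseline survival $\bar F(x) = (1 + x^{c})^{-k}$, one common parametrisation of the extended distribution is

```latex
\bar G(x) = \frac{\alpha\,(1 + x^{c})^{-k}}{1 - (1 - \alpha)\,(1 + x^{c})^{-k}},
\qquad x > 0, \quad \alpha, c, k > 0,
```

which reduces to the ordinary Burr XII survival function at $\alpha = 1$. The note's moment expressions are computed under a distribution of this form.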
Abstract: We study a new five-parameter model called the extended Dagum distribution. The proposed model contains as special cases the log-logistic and Burr III distributions, among others. We derive the moments, generating and quantile functions, mean deviations and Bonferroni, Lorenz and Zenga curves. We obtain the density function of the order statistics. The parameters are estimated by the method of maximum likelihood. The observed information matrix is determined. An application to real data illustrates the importance of the new model.
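For reference, the baseline three-parameter Dagum distribution that the extended model generalises has cdf and density

```latex
F(x) = \bigl(1 + \lambda x^{-\delta}\bigr)^{-\beta}, \qquad
f(x) = \beta \lambda \delta\, x^{-\delta - 1}
       \bigl(1 + \lambda x^{-\delta}\bigr)^{-\beta - 1}, \qquad x > 0,
```

with the Burr III distribution arising at $\lambda = 1$ and the log-logistic at $\beta = 1$, consistent with the special cases listed in the abstract.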
Abstract: Several statistical approaches have been proposed for circumstances under which one universal distribution is not capable of fitting the whole domain. This paper studies Bayesian detection of multiple interior epidemic/square waves in an interval domain, featured by two identical statistical distributions at both ends. We introduce a simple dimension-matching parameter proposal to implement sampling-based posterior inference for special cases where each segmented distribution on a circle has the same set of regulating parameters. Molecular biology research reveals that cancer progression may involve DNA copy number alteration at genome regions, and that connection of two biologically inactive chromosome ends results in a circle holding multiple epidemic/square waves. A slight modification of a simple novel Bayesian change point identification algorithm, random grafting-pruning Markov chain Monte Carlo (RGPMCMC), is proposed by adjusting the original change point birth/death symmetric transition probability with a differ-by-one change point number ratio. The algorithm's performance is studied through simulations connected to DNA copy number alteration detection, which promises potential application to cancer diagnosis at the genome level.
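For context, birth/death moves that change the number of change points by one are reversible-jump moves, and the generic acceptance probability they build on is Green's (1995)

```latex
\alpha = \min\left\{ 1,\;
\frac{p(y \mid \theta')\, p(\theta')\, q(\theta \mid \theta')}
     {p(y \mid \theta)\, p(\theta)\, q(\theta' \mid \theta)}
\left| \frac{\partial(\theta', u')}{\partial(\theta, u)} \right|
\right\},
```

where $\theta'$ has one more (birth) or one fewer (death) change point than $\theta$ and $u, u'$ are the dimension-matching variables. The abstract's differ-by-one change point number ratio adjusts the proposal terms $q(\cdot \mid \cdot)$ in this template; the exact ratio is specific to the RGPMCMC paper and is not reproduced here.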