Abstract: We extend propensity score methodology to incorporate survey weights from complex survey data and compare multiple linear regression with propensity score analysis for estimating treatment effects in observational data from a complex survey. For illustration, we use these two methods to estimate the effect of gender on information technology (IT) salaries. In our analysis, both methods agree on the size and statistical significance of the overall gender salary gaps in the United States in four different IT occupations after controlling for educational and job-related covariates. Each method, however, has its own advantages, which we discuss. We also show that it is important to incorporate the survey design in both linear regression and propensity score analysis. In our analysis, ignoring the survey weights substantially affects the estimates of population-level effects.
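The effect of survey weights on a regression estimate can be sketched in a few lines. The example below is a toy illustration, not the authors' procedure: the function name `weighted_ols`, the four observations, and the weights are all hypothetical, and only point estimates are computed (design-based standard errors would need the full survey design).

```python
import numpy as np

def weighted_ols(X, y, w):
    """Solve the weighted normal equations (X'WX) beta = X'Wy."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    Xw = X * w[:, None]  # each row of X scaled by its weight
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)

# Hypothetical data: column 0 is the intercept, column 1 a group indicator.
X = np.array([[1, 0], [1, 0], [1, 1], [1, 1]])
y = np.array([10.0, 20.0, 30.0, 50.0])
w = np.array([1.0, 3.0, 1.0, 1.0])  # assumed survey weights

beta_weighted = weighted_ols(X, y, w)
beta_unweighted = weighted_ols(X, y, np.ones(len(y)))
print(beta_weighted[1], beta_unweighted[1])
```

In this toy case the estimated group difference changes when the weights are ignored, because the second observation represents more population units than the others.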
Abstract: Data collection for landslide susceptibility modelling is often an almost prohibitive activity. For this reason, landslides have for some time been described and modelled on the basis of spatially distributed values of landslide-related attributes. This paper presents a landslide susceptibility analysis of the Selangor area, Malaysia, using an artificial neural network model with the aid of remote sensing data and geographic information system (GIS) tools. To meet the objectives, landslide locations were identified in the study area from interpretation of aerial photographs, supported by extensive field surveys. The landslide inventory was then grouped into two categories: (1) training data and (2) testing data. Further, topographical and geological data and satellite images were collected, processed, and assembled into a spatial database using GIS tools and image processing techniques. Nine landslide occurrence attributes were selected and analyzed using an artificial neural network model to generate the landslide susceptibility maps. Landslide location data (training data) were used for training the neural network, and five training sites were selected randomly in this case. The ensemble of five training sites was used to investigate the model's reliability, including the role of the thematic variables used to construct the model and the model's sensitivity to changes in the selection of the training sites. By studying the variation of the neural network's susceptibility estimate, the error associated with the model is determined. The results of the neural network analysis are shown on five sets of landslide susceptibility maps. The susceptibility maps were then validated using the receiver operating characteristic (ROC) method as a measure for model verification. Landslide location data that were not used during the training of the neural network were used for the verification of the maps.
The results of the analysis were verified using the landslide location data and compared across the five cases. Qualitatively, the model gives reasonable results, with observed accuracies of 87%, 83%, 85%, 86% and 82% for the five training sites, respectively.
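As a rough illustration of ROC-based validation, the area under the ROC curve can be computed as the probability that a randomly chosen landslide cell receives a higher susceptibility score than a randomly chosen non-landslide cell. The sketch below assumes this rank-based definition; the function name `roc_auc` and the four scores are hypothetical and do not reproduce the accuracies reported in the abstract.

```python
import numpy as np

def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos = scores[labels]    # scores at observed landslide locations
    neg = scores[~labels]   # scores at locations without landslides
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (pos.size * neg.size)

# Hypothetical susceptibility scores for four test cells.
test_scores = [0.9, 0.8, 0.3, 0.2]
test_labels = [1, 0, 1, 0]
auc = roc_auc(test_scores, test_labels)
print(auc)
```

The pairwise form is quadratic in the number of cells, so for a full raster a sort-based computation would be preferable.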
Incorporating an additional parameter is a very common way of generalizing or extending an existing probability distribution in statistical distribution theory and practice. In most cases, such extensions fit real-life data better than the existing distributions do. In this article, we propose and study a two-parameter probability distribution, called the quasi xgamma distribution, as an extension or generalization of the xgamma distribution (Sen et al. 2016) for modeling lifetime data. Important distributional properties, along with survival characteristics and distributions of order statistics, are studied in detail. The method of maximum likelihood and the method of moments are proposed and described for parameter estimation. A data generation algorithm is proposed, supported by a Monte Carlo simulation study describing the mean square errors of the estimates for different sample sizes. A bladder cancer survival data set is used to illustrate the application and suitability of the proposed distribution as a potential survival model.
Abstract: The use of multiple regression analysis (MRA) has been on the rise over the last few decades, in part due to the realization that analysis of variance (ANOVA) statistics can be advantageously completed using MRA. Given the limitations of ANOVA strategies, it is argued that MRA is the better analysis; however, to carry out ANOVA within MRA the researcher must employ coding structures, which can be confusing. The present paper attempts to simplify this discussion by providing a description of the most popular coding structures, with emphasis on their strengths, limitations, and uses. A visual analysis of each of these strategies is also included, along with all necessary steps to create the contrasts. Finally, a decision tree is presented that researchers can use to determine which coding structure to utilize in their current research project.
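Two widely used coding structures, dummy coding and effect coding, can be compared in a short sketch. The helper names (`dummy_code`, `effect_code`, `fit`) and the six observations are hypothetical and chosen only for illustration; the abstract does not specify which structures or data the paper uses.

```python
import numpy as np

def dummy_code(levels, reference):
    """One 0/1 column per non-reference category."""
    cats = [c for c in sorted(set(levels)) if c != reference]
    return np.array([[1.0 if lv == c else 0.0 for c in cats] for lv in levels])

def effect_code(levels, reference):
    """Like dummy coding, but reference rows are coded -1 in every column."""
    design = dummy_code(levels, reference)
    is_ref = np.array([lv == reference for lv in levels])
    design[is_ref, :] = -1.0
    return design

def fit(design, y):
    """Least-squares fit with an intercept prepended to the design columns."""
    X = np.column_stack([np.ones(len(y)), design])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Hypothetical balanced data with group means a=2, b=5, c=11.
groups = ["a", "a", "b", "b", "c", "c"]
y = np.array([1.0, 3.0, 4.0, 6.0, 10.0, 12.0])

b_dummy = fit(dummy_code(groups, "a"), y)
b_effect = fit(effect_code(groups, "a"), y)
print(b_dummy, b_effect)
```

With dummy coding the intercept is the reference-group mean and each slope is a difference from it; with effect coding the intercept is the unweighted mean of the group means and each slope is a deviation from that mean.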
Abstract: Graphical procedures can be useful for illustrating and evaluating the process of inverse regression. We first review some simple and well-known graphical approaches for univariate linear and nonlinear models. We then propose a new graphical tool applicable to situations where the response is bivariate and repeated measures data are available. The proposed method is illustrated with an example of the age determination of tern chicks using measurements on body weight and wing length.
Abstract: The probability of winning a game in major league baseball depends on various factors relating to team strength, including the past performance of the two teams, the batting ability of the two teams and the starting pitchers. These three factors change over time. We combine these factors by adopting contribution parameters, and include a home field advantage variable in forming a two-stage Bayesian model. A Markov chain Monte Carlo algorithm is used to carry out Bayesian inference and to simulate outcomes of future games. We apply the approach to data obtained from the 2001 regular season in major league baseball.
Abstract: Searching for data structure and decision rules using classification and regression tree (CART) methodology is now well established. An alternative procedure, search partition analysis (SPAN), is less well known. Both provide classifiers based on Boolean structures; in CART these are generated by a hierarchical series of local sub-searches and in SPAN by a global search. One issue with CART is its perceived instability; another is the awkward nature of the Boolean structures generated by a hierarchical tree. Instability arises because the final tree structure is sensitive to early splits. SPAN, as a global search, seems more likely to render stable partitions. To examine these issues in the context of identifying mothers at risk of giving birth to low birth weight babies, we took a very large sample, divided it at random into ten non-overlapping sub-samples and performed SPAN and CART analyses on each sub-sample. The stability of the SPAN and CART models is described and, in addition, the structure of the Boolean representation of the classifiers is examined. We find that SPAN partitions have more intrinsic stability and are less prone to Boolean structural irregularities.
Abstract: Let {(Xi, Yi), i ≥ 1} be a sequence of bivariate random variables from a continuous distribution. If {Rn, n ≥ 1} is the sequence of record values in the sequence of X’s, then the Y that corresponds to the nth record is called the concomitant of the nth record, denoted by R[n]. In the Farlie-Gumbel-Morgenstern (FGM) family, we determine the amount of information contained in R[n] and compare it with the amount of information in Rn. We also show that the Kullback-Leibler distance among the concomitants of record values is distribution-free. Finally, we provide numerical results on the mutual information and the Pearson correlation coefficient as measures of the dependence between Rn and R[n] in the copula model of the FGM family.
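To give a feel for the strength of dependence the FGM family allows, the sketch below samples from the FGM copula C(u, v) = uv[1 + α(1 − u)(1 − v)], with −1 ≤ α ≤ 1, by inverting the conditional distribution of V given U. It illustrates dependence between the two coordinates only, not between records and their concomitants; the function name `sample_fgm_copula`, the seed, and the value of α are assumptions made for this example.

```python
import numpy as np

def sample_fgm_copula(alpha, n, rng):
    """Draw n pairs (u, v) from the FGM copula by conditional inversion."""
    if not -1.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [-1, 1]")
    u = rng.uniform(size=n)
    t = rng.uniform(size=n)          # target value of C(v | u)
    a = alpha * (1.0 - 2.0 * u)
    # C(v | u) = v * (1 + a * (1 - v)) = t is a quadratic in v;
    # this is the root in [0, 1], written in a numerically stable form.
    root = np.sqrt((1.0 + a) ** 2 - 4.0 * a * t)
    v = 2.0 * t / (1.0 + a + root)
    return u, v

rng = np.random.default_rng(0)       # assumed seed, for reproducibility
alpha = 0.9
u, v = sample_fgm_copula(alpha, 200_000, rng)
pearson = np.corrcoef(u, v)[0, 1]
print(pearson, alpha / 3.0)
```

For uniform margins the Pearson correlation of the FGM copula equals α/3, so it cannot exceed 1/3 in absolute value; the sample correlation should land close to that theoretical value.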
Law and legal studies have become an exciting new field for data science applications, while technological advances also have profound implications for legal practice. For example, the legal industry has accumulated a rich body of high-quality texts, images and other digitised material, which is ready to be further processed and analysed by data scientists. On the other hand, the increasing popularity of data science poses a genuine challenge to legal practitioners, regulators and even the general public, and has motivated a long-lasting debate in academia focusing on issues such as privacy protection and algorithmic discrimination. This paper collects 1236 journal articles involving both law and data science from the Web of Science platform to understand the patterns and trends of this interdisciplinary research field in terms of English-language journal publications. We find a clear trend of increasing publication volume over time and a strong presence of high-impact law and political science journals. We then use Latent Dirichlet Allocation (LDA) as a topic modelling method to classify the abstracts into four topics based on the coherence measure. The four topics identified confirm that both challenges and opportunities have been investigated in this interdisciplinary field and help offer directions for future research.