Abstract: This paper describes a test of two alternative sets of ratio edit and imputation procedures, both using the U.S. Census Bureau’s generalized editing/imputation subsystem (“Plain Vanilla”) on 1997 Economic Census data. We compare the quality of edited and imputed data — at both the macro and micro levels — from both sets of procedures and discuss how our quantitative methods allowed us to recommend changes to current procedures.
Abstract: We propose a new method of adding two parameters to a continuous distribution that extends the idea first introduced by Lehmann (1953) and studied by Nadarajah and Kotz (2006). This method leads to a new class of exponentiated generalized distributions that can be interpreted as a double construction of Lehmann alternatives. Some special models are discussed. We derive some mathematical properties of this class including the ordinary moments, generating function, mean deviations and order statistics. Maximum likelihood estimation is investigated and four applications to real data are presented.
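As a hedged illustration of the double Lehmann construction described in this abstract (the parametrization below follows the standard exponentiated generalized class and is assumed, not quoted): for a baseline cdf $G(x)$ and shape parameters $a, b > 0$,
\[ F(x) = \big\{ 1 - [\, 1 - G(x) \,]^{a} \big\}^{b}, \]
which composes a Lehmann type II alternative, $1 - [1 - G(x)]^{a}$, with a Lehmann type I alternative, obtained by raising the result to the power $b$.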
Abstract: A new family of copulas generated by a univariate distribution function is introduced, and relations between this copula and other well-known copulas are discussed. The new copula is applied to model the dependence of two real data sets as illustrations.
Abstract: In this paper, we reconsider the two-factor stochastic mortality model introduced by Cairns, Blake and Dowd (2006) (CBD). The error terms in the CBD model are assumed to form a two-dimensional random walk. We first use the Doornik and Hansen (2008) multivariate normality test to show that the underlying normality assumption does not hold for the considered data set. Ainou (2011) proposed independent univariate normal inverse Gaussian Lévy processes to model the error terms in the CBD model. We generalize this idea by introducing a possible dependency between the two-dimensional random variables, using a bivariate generalized hyperbolic distribution. We propose four non-Gaussian, fat-tailed distributions: Student’s t, normal inverse Gaussian, hyperbolic and generalized hyperbolic distributions. Our empirical analysis shows some preference for the newly suggested models, based on Akaike’s information criterion, the Bayesian information criterion and the likelihood ratio test as our in-sample model selection criteria, as well as the mean absolute percentage error for our out-of-sample projection errors.
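For reference, a hedged sketch of the two-factor CBD structure referred to above (notation assumed rather than taken from the abstract): the model specifies
\[ \operatorname{logit} q(t, x) = \kappa_t^{(1)} + \kappa_t^{(2)} (x - \bar{x}), \]
where $q(t,x)$ is the one-year death probability at age $x$ in year $t$, and the period factors $\kappa_t = (\kappa_t^{(1)}, \kappa_t^{(2)})'$ follow a two-dimensional random walk with drift, $\kappa_{t+1} = \kappa_t + \mu + \epsilon_{t+1}$. The generalization discussed here replaces the Gaussian assumption on the innovations $\epsilon_{t+1}$ with bivariate generalized hyperbolic alternatives.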
Fixed-point algorithms are popular in statistics and data science due to their simplicity, guaranteed convergence, and applicability to high-dimensional problems. Well-known examples include the expectation-maximization (EM) algorithm, majorization-minimization (MM), and gradient-based algorithms like gradient descent (GD) and proximal gradient descent. A characteristic weakness of these algorithms is their slow convergence. We discuss several state-of-the-art techniques for accelerating their convergence. We demonstrate and evaluate these techniques in terms of their efficiency and robustness in six distinct applications. Among the acceleration schemes, SQUAREM shows robust acceleration with a mean 18-fold speedup. DAAREM and restarted-Nesterov schemes also demonstrate consistently impressive accelerations. Thus, it is possible to accelerate the original fixed-point algorithm by using one of the SQUAREM, DAAREM, or restarted-Nesterov acceleration schemes. We describe implementation details and software packages to facilitate the application of the acceleration schemes. We also discuss strategies for selecting a particular acceleration scheme for a given problem.
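To make the acceleration idea concrete, here is a minimal sketch of one SQUAREM-style step (using the SqS3 steplength of Varadhan and Roland) built on top of an arbitrary fixed-point map; it is an illustrative outline, not the exact code of the SQUAREM or daarem packages, which add safeguards such as monotonicity checks and steplength bounds.

```python
import numpy as np

def squarem_step(x0, fixed_point_map):
    """One SQUAREM-style acceleration step (SqS3 steplength); minimal sketch.

    fixed_point_map: the slow original update, e.g. one EM or MM iteration,
    mapping a parameter vector to the next iterate.
    """
    x1 = fixed_point_map(x0)
    x2 = fixed_point_map(x1)
    r = x1 - x0                       # first difference
    v = (x2 - x1) - r                 # second difference
    alpha = -np.linalg.norm(r) / max(np.linalg.norm(v), 1e-12)
    x_prop = x0 - 2.0 * alpha * r + alpha ** 2 * v   # extrapolated iterate
    # Stabilization: one extra application of the map keeps the proposal
    # inside the map's domain of attraction.
    return fixed_point_map(x_prop)
```

In practice one iterates `squarem_step` until successive iterates change by less than a tolerance, falling back to the plain update whenever the extrapolated step degrades the objective.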
Pub. online: 18 Jul 2022 | Type: Data Science In Action | Open Access
Journal: Journal of Data Science
Volume 20, Issue 3 (2022): Special Issue: Data Science Meets Social Sciences, pp. 359–379
Abstract
Our paper introduces tree-based methods, specifically classification and regression trees (CRT), to study student achievement. CRT allows data analysis to be driven by the data’s internal structure. Thus, CRT can model complex nonlinear relationships and supplement traditional hypothesis-testing approaches to provide a fuller picture of the topic being studied. Using Early Childhood Longitudinal Study-Kindergarten 2011 data as a case study, our research investigated predictors from students’ demographic backgrounds to ascertain their relationships to students’ academic performance and achievement gains in reading and math. In our study, CRT reveals complex patterns between predictors and outcomes; more specifically, the patterns illuminated by the regression trees differ across the subject areas (i.e., reading and math) and between the performance levels and achievement gains. Through the use of real-world assessment datasets, this article demonstrates the strengths, limitations, and challenges of CRT when analyzing student achievement data. When achievement data, such as the achievement gains in our case study, are not strongly linearly related to any continuous predictors, regression trees may make more accurate predictions than general linear models and produce results that are easier to interpret. Our study illustrates scenarios in which CRT on achievement data is most appropriate and beneficial.
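As a rough illustration of the kind of regression-tree fit described above (the variable and file names are hypothetical, not actual ECLS-K:2011 field names, and this is not the authors' code):

```python
# Minimal sketch: fit and print a regression tree for an achievement-gain outcome.
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

df = pd.read_csv("ecls_k_2011_subset.csv")                   # hypothetical extract
X = df[["ses", "age_at_kindergarten", "parent_education"]]   # demographic predictors
y = df["math_gain"]                                          # achievement gain outcome

tree = DecisionTreeRegressor(max_depth=4, min_samples_leaf=50)
tree.fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```

Limiting the tree depth and leaf size, as here, is one simple way to keep the fitted tree interpretable and to guard against overfitting.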
Pub. online: 8 Jul 2022 | Type: Data Science In Action | Open Access
Journal: Journal of Data Science
Volume 20, Issue 3 (2022): Special Issue: Data Science Meets Social Sciences, pp. 339–357
Abstract
Researchers need guidance on how to obtain maximum efficiency and accuracy when annotating training data for text classification applications. Further, given wide variability in the kinds of annotations researchers need to obtain, they would benefit from the ability to conduct low-cost experiments during the design phase of annotation projects. To this end, our study proposes the single-case study design as a feasible and causally-valid experimental design for determining the best procedures for a given annotation task. The key strength of the design is its ability to generate causal evidence at the individual level, identifying the impact of competing annotation techniques and interfaces for the specific annotator(s) included in an annotation project. In this paper, we demonstrate the application of the single-case study in an applied experiment and argue that future researchers should incorporate the design into the pilot stage of annotation projects so that, over time, a causally-valid body of knowledge regarding the best annotation techniques is built.
Bootstrapping is commonly used as a tool for non-parametric statistical inference to assess the quality of estimators in variable selection models. However, for a massive dataset, the computational cost of using bootstrapping in variable selection models (BootVS) can be prohibitive. In this study, we propose a novel framework using a bag of little bootstraps variable selection (BLBVS) method with a ridge hybrid procedure to assess the quality of estimators in generalized linear models with a regularized term, such as lasso and group lasso penalties. The proposed method can be easily and naturally implemented with distributed computing, and thus has significant computational advantages for massive datasets. The simulation results show that our novel BLBVS method performs excellently in both accuracy and efficiency when compared with BootVS. Real data analyses, including regression on a bike sharing dataset and classification of a lending club dataset, are presented to illustrate the computational superiority of BLBVS on large-scale datasets.
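A hedged sketch of the bag-of-little-bootstraps idea applied to lasso-based variable selection follows; it illustrates the resampling structure only and is not the authors' BLBVS implementation (which additionally uses a ridge hybrid step and distributed computing).

```python
# Minimal sketch: bag-of-little-bootstraps selection frequencies with a lasso fit.
import numpy as np
from sklearn.linear_model import Lasso

def blb_selection_frequencies(X, y, n_subsets=10, gamma=0.7, n_boot=50, alpha=0.1):
    n, p = X.shape
    b = int(n ** gamma)                        # little-bootstrap subset size
    freq = np.zeros(p)
    rng = np.random.default_rng(0)
    for _ in range(n_subsets):
        idx = rng.choice(n, size=b, replace=False)
        Xs, ys = X[idx], y[idx]
        for _ in range(n_boot):
            # Multinomial weights emulate an n-sized bootstrap using only b points.
            w = rng.multinomial(n, np.full(b, 1.0 / b)).astype(float)
            fit = Lasso(alpha=alpha).fit(Xs, ys, sample_weight=w)
            freq += (fit.coef_ != 0)
    return freq / (n_subsets * n_boot)         # per-variable selection frequency
```

Each subset's inner loop touches only b observations, which is what makes the scheme attractive for massive datasets and easy to parallelize across subsets.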
In this article I analyse motion picture editing as a point process to explore the temporal structure in the timings of cuts in motion pictures, modelling the editing in 134 Hollywood films released between 1935 and 2005 as a Hawkes process with an exponential kernel. The results show that the editing in Hollywood films can be modelled as a Hawkes process and that the conditional intensity function provides a direct description of the instantaneous cutting rate of a film, revealing the structure of a film’s editing at a range of scales. The parameters of the exponential kernel show a clear trend over time towards a more rapid editing style, with an increase in the rate of exogenous events and a small increase in the rate of endogenous events. This is consistent with the shift from a classical to an intensified continuity editing style. There are, however, few differences between genres, indicating the consistency of editing practices in Hollywood cinema over time and across different types of films.
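For concreteness, the conditional intensity of a Hawkes process with an exponential kernel takes the standard form (notation assumed, not quoted from the article)
\[ \lambda(t) = \mu + \sum_{t_i < t} \alpha\, e^{-\beta (t - t_i)}, \]
where the $t_i$ are the times of earlier cuts, $\mu$ is the exogenous (background) cutting rate, and $\alpha$ and $\beta$ control how strongly and how quickly each cut excites further cuts; the ratio $\alpha/\beta$ summarizes the endogenous share of the cutting activity.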