Pub. online: 19 Jan 2026 | Type: Computing In Data Science | Open Access
Journal: Journal of Data Science
Volume 24, Issue 2 (2026): Special Issue: The 2025 Symposium on Data Science and Statistics (SDSS 2025), pp. 436–454
Abstract
Land use land cover (LULC) change in agriculture is a critical area of concern, as it directly impacts food security, environmental health, and economic stability. One of the leading LULC data products is the U.S. Department of Agriculture’s (USDA) Cropland Data Layer (CDL). Produced annually by the USDA National Agricultural Statistics Service (NASS) using satellite imagery, the CDL provides crop-specific data with an estimated classification accuracy of 85% to 95% for major crop types across the U.S. However, several limitations inherent to the CDL, such as crop underestimation bias, pixel misclassification, and difficulty distinguishing certain vegetation types, have raised questions about the accuracy of LULC change estimates derived from this dataset. In this paper, we introduce the R package cdlsim, designed to quantify the sensitivity of CDL-derived metrics through simulations of CDL data at the patch level using NASS published accuracy statistics. We present a case study utilizing landscape metrics calculated with the popular landscapemetrics R package to demonstrate the utility of cdlsim in quantifying the sensitivity of metrics to random perturbations in the data. The case study examines a mixed agricultural and grassland landscape in South Dakota, illustrating how our package enables researchers to achieve a more nuanced representation of land-use change.
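The idea of perturbing a classified raster and watching how a metric responds can be illustrated with a minimal Python sketch. It is not the cdlsim algorithm: cdlsim works at the patch level in R, whereas this sketch relabels individual pixels. The toy raster, the class codes, and the `accuracy` values are all hypothetical stand-ins for real CDL data and NASS accuracy statistics.

```python
import random

# Hypothetical toy raster: 1 = corn, 2 = grassland (not real CDL codes).
raster = [
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [1, 2, 2, 2],
    [1, 1, 1, 2],
]

# Hypothetical probability that a mapped label is correct, per class.
accuracy = {1: 0.90, 2: 0.80}


def perturb(grid, acc, rng):
    """Return a copy of grid where each pixel keeps its label with
    probability acc[label] and otherwise switches to another class."""
    classes = sorted(acc)
    out = []
    for row in grid:
        new_row = []
        for label in row:
            if rng.random() < acc[label]:
                new_row.append(label)
            else:
                others = [c for c in classes if c != label]
                new_row.append(rng.choice(others))
        out.append(new_row)
    return out


def class_share(grid, label):
    """Fraction of pixels carrying the given label (a very simple metric)."""
    cells = [v for row in grid for v in row]
    return cells.count(label) / len(cells)


rng = random.Random(42)
shares = [class_share(perturb(raster, accuracy, rng), 1) for _ in range(1000)]
mean_share = sum(shares) / len(shares)
share_range = (min(shares), max(shares))
print(f"mapped share: {class_share(raster, 1):.3f}")
print(f"simulated mean share: {mean_share:.3f}, range: {share_range}")
```

The spread of `shares` across replicates gives a rough sense of how sensitive the metric is to label noise. A real analysis would replace `class_share` with landscape metrics and the per-pixel relabeling with a patch-level scheme.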
Abstract: Simulation studies are important statistical tools used to investigate the performance, properties and adequacy of statistical models. The simulation of right-censored time-to-event data involves the generation of two independent survival distributions, where the first distribution represents the uncensored survival times and the second distribution represents the censoring mechanism. In this brief report we discuss how to fix the percentage of censored data in advance. The described method was used to generate data from a Weibull distribution, but it can be adapted to any other lifetime distribution. We also present an R function for generating random samples under the proposed approach.
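One way to fix the censoring percentage in advance can be sketched in Python, under an assumption that is not taken from the report: the censoring time C is Weibull with the same shape k as the survival time T. Then T^k and C^k are exponential, so P(C < T) = θ^(-k) / (θ^(-k) + λ^(-k)), and solving for the censoring scale gives θ = λ((1 − p)/p)^(1/k). The function names below are hypothetical, and the report's own method may differ.

```python
import random


def censoring_scale(scale, shape, p_cens):
    """Scale of a Weibull censoring time (same shape as the survival time)
    such that P(C < T) equals p_cens. Assumes a shared shape parameter."""
    if not 0.0 < p_cens < 1.0:
        raise ValueError("p_cens must be strictly between 0 and 1")
    return scale * ((1.0 - p_cens) / p_cens) ** (1.0 / shape)


def simulate_censored_weibull(n, scale, shape, p_cens, seed=None):
    """Return n pairs (observed_time, event), with event = 1 when the
    survival time is observed and 0 when it is right censored."""
    rng = random.Random(seed)
    theta = censoring_scale(scale, shape, p_cens)
    sample = []
    for _ in range(n):
        # random.weibullvariate takes (scale, shape), in that order.
        t = rng.weibullvariate(scale, shape)
        c = rng.weibullvariate(theta, shape)
        sample.append((min(t, c), 1 if t <= c else 0))
    return sample


data = simulate_censored_weibull(20000, scale=2.0, shape=1.5, p_cens=0.3, seed=1)
observed_censoring = sum(1 - event for _, event in data) / len(data)
print(f"target censoring: 0.30, observed: {observed_censoring:.3f}")
```

With a large sample the observed censoring fraction should sit close to the target. For other censoring distributions the scale generally has to be found numerically.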
Abstract: In any sports competition, there is strong interest in knowing which team will be the champion at the end of the championship. Beyond this, the result of a match, the chance of a team qualifying for a specific tournament, the chance of being relegated, the best attack, and the best defense, among others, are also subjects of interest. In this paper we present a simple method with good predictive quality, easy implementation, and low computational effort, which allows the calculation of all the quantities of interest above. Following Lee (1997), we estimate the average goals scored by each team by assuming that the number of goals scored by a team in a match follows a univariate Poisson distribution, but we consider linear models that express the sum and the difference of goals scored in terms of five covariates: the goal average in a match, the home-team advantage, the team’s offensive power, the opponent team’s defensive power and a crisis indicator. The methodology is applied to the 2008-2009 English Premier League.
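How predicted sums and differences of goals could turn into match outcome probabilities is sketched below in Python. It rests on assumptions that go beyond the abstract: the two teams' goal counts are treated as independent Poisson variables, and the predicted sum and difference (2.7 and 0.5 here) are made-up inputs, not fitted values. The function names are hypothetical.

```python
import math


def poisson_pmf(k, mean):
    """Probability of exactly k goals under a Poisson distribution."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def means_from_sum_and_difference(total, diff):
    """Recover each team's expected goals from a predicted sum and
    difference (home minus away) of goals."""
    home = (total + diff) / 2.0
    away = (total - diff) / 2.0
    if home <= 0 or away <= 0:
        raise ValueError("expected goals must be positive")
    return home, away


def outcome_probabilities(home_mean, away_mean, max_goals=25):
    """Home win, draw and away win probabilities, assuming independent
    Poisson goal counts and truncating the score grid at max_goals."""
    home_win = draw = away_win = 0.0
    for h in range(max_goals + 1):
        p_h = poisson_pmf(h, home_mean)
        for a in range(max_goals + 1):
            p = p_h * poisson_pmf(a, away_mean)
            if h > a:
                home_win += p
            elif h == a:
                draw += p
            else:
                away_win += p
    return home_win, draw, away_win


# Made-up predictions for one match: 2.7 total goals, home side +0.5.
home_mean, away_mean = means_from_sum_and_difference(2.7, 0.5)
p_home, p_draw, p_away = outcome_probabilities(home_mean, away_mean)
print(f"home {p_home:.3f}, draw {p_draw:.3f}, away {p_away:.3f}")
```

Repeating this for every remaining fixture, or simulating scores from the same Poisson means, would give season-level quantities such as title or relegation chances.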
A new flexible extension of the inverse Rayleigh model is proposed and studied. Some of its fundamental statistical properties are derived. We assess the performance of the maximum likelihood method via a simulation study. The importance of the new model is shown via three applications to real data sets. In these applications, the new model performs much better than other important competing models.
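The abstract does not give the form of the new extension, so the following Python sketch uses only the baseline inverse Rayleigh model to show what a simulation study of maximum likelihood estimation can look like. It assumes the parameterization F(x) = exp(−θ/x²) for x > 0, samples by inverse transform, and uses the closed-form estimate θ̂ = n / Σ x_i^(−2) that follows from setting the score n/θ − Σ x_i^(−2) to zero. All names and settings are illustrative.

```python
import math
import random


def uniform_open(rng):
    """Uniform draw on (0, 1); zero is excluded so log(u) is finite."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def sample_inverse_rayleigh(n, theta, rng):
    """Inverse-transform sampling: F(x) = exp(-theta / x^2) gives
    x = sqrt(theta / -ln(u)) for u uniform on (0, 1)."""
    return [math.sqrt(theta / -math.log(uniform_open(rng))) for _ in range(n)]


def mle_theta(xs):
    """Closed-form maximum likelihood estimate of theta."""
    return len(xs) / sum(x ** -2 for x in xs)


def simulation_study(theta, n, replicates, seed=None):
    """Empirical bias and mean squared error of the MLE over replicates."""
    rng = random.Random(seed)
    estimates = [
        mle_theta(sample_inverse_rayleigh(n, theta, rng))
        for _ in range(replicates)
    ]
    bias = sum(estimates) / replicates - theta
    mse = sum((e - theta) ** 2 for e in estimates) / replicates
    return bias, mse


bias, mse = simulation_study(theta=2.0, n=50, replicates=2000, seed=7)
print(f"bias: {bias:.4f}, MSE: {mse:.4f}")
```

For an extended model without a closed-form estimator, `mle_theta` would be replaced by numerical maximization of the log-likelihood, with the same outer loop over replicates.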