The United States Department of Agriculture’s (USDA’s) National Agricultural Statistics Service (NASS) conducted a pilot study in 2024 to obtain data collected onboard farm machinery and explore their use for statistical purposes. NASS sees high value in machine-logged data (MLD) systems because they could augment, or even replace, traditional survey efforts while also reducing respondent burden and improving crop-related estimates. The pilot study addressed four topics: 1) understanding the obstacles to obtaining MLD from farmers; 2) creating geographic workflows to manage the inherently geospatial MLD; 3) developing linkages to NASS’s tabular list frame information; and 4) assessing the use of MLD to replace survey data for time-sensitive estimates. To study each topic, field-level information was gathered from the MLD systems of dozens of producers over hundreds of fields across the central United States (US) for the 2023 growing season. Results showed that 90% of the fields could be linked to a producer on the NASS list frame. Among the linked producers who were selected for a survey in 2023, the consistency between MLD and traditional survey reporting was highly variable. Comparisons showed that median MLD values were larger than historical NASS survey values, and approximately 48% of survey comparisons showed a difference of 25% or less between the two. MLD shows promise for use in official statistics; however, further analyses with additional producers’ data and enhancements to MLD collection processes are needed before MLD can supplement traditional survey methods.
Heart rate data collected from wearable devices – one type of time series data – could provide insights into activities, stress levels, and health. Yet consecutive missing segments (i.e., gaps), which commonly occur due to improper device placement or device malfunction, can distort the temporal patterns inherent in the data and undermine the validity of downstream analyses. This study proposes an iterative procedure for filling gaps in time series data that capitalizes on the denoising capability of Singular Spectrum Analysis (SSA) and eliminates SSA’s requirement of pre-specifying the window length and number of groups. Simulation results demonstrate that the performance of SSA-based gap-filling methods depends on the choice of window length, the number of groups, and the percentage of missing values. In contrast, the proposed method consistently achieves the lowest reconstruction error and gap-filling error across a variety of combinations of the factors manipulated in the simulations. The simulation findings also highlight that the commonly recommended long window length – half of the time series length – may not suit time series with varying frequencies, such as heart rate data. The initialization step of the proposed method, which uses a large window length and the first four singular values in the iterative singular value decomposition process, not only avoids convergence issues but also improves imputation accuracy in subsequent iterations. The proposed method gives researchers the flexibility to conduct gap-filling alone or in combination with denoising on time series data, which widens its range of applications.
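For orientation, the following is a minimal sketch of a generic iterative SSA gap-filling loop with a fixed window length and a fixed number of leading components, that is, the kind of pre-specified setup the abstract says the proposed method avoids. It is not the authors' procedure. The function names `ssa_reconstruct` and `ssa_fill_gaps`, the parameters `window`, `rank`, `max_iter`, and `tol`, and the mean-based initial fill are all assumptions made for illustration.

```python
import numpy as np


def ssa_reconstruct(x, window, rank):
    """Low-rank SSA reconstruction of a complete 1-D series (illustrative)."""
    n = len(x)
    k = n - window + 1
    # Trajectory (Hankel) matrix: column j holds x[j : j + window].
    traj = np.column_stack([x[j:j + window] for j in range(k)])
    u, s, vt = np.linalg.svd(traj, full_matrices=False)
    # Keep only the leading `rank` singular triples.
    low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]
    # Diagonal averaging maps the low-rank matrix back to a series.
    recon = np.zeros(n)
    counts = np.zeros(n)
    for j in range(k):
        recon[j:j + window] += low_rank[:, j]
        counts[j:j + window] += 1
    return recon / counts


def ssa_fill_gaps(x, window, rank, max_iter=200, tol=1e-8):
    """Fill NaN gaps by repeated SSA reconstruction (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    missing = np.isnan(x)
    filled = x.copy()
    if not missing.any():
        return filled
    # Assumed initial guess: the mean of the observed values.
    filled[missing] = np.nanmean(x)
    for _ in range(max_iter):
        recon = ssa_reconstruct(filled, window, rank)
        change = np.max(np.abs(recon[missing] - filled[missing]))
        # Only the gap positions are updated; observed values stay fixed.
        filled[missing] = recon[missing]
        if change < tol:
            break
    return filled
```

Because `window` and `rank` must be chosen in advance here, the quality of the fill depends on those choices, which is the sensitivity the simulations in the abstract examine.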
Pub. online: 10 Jul 2024
Type: Statistical Data Science
Open Access
Journal: Journal of Data Science
Volume 22, Issue 3 (2024): Special issue: The Government Advances in Statistical Programming (GASP) 2023 conference, pp. 456–468
Abstract
Missing data are common across many fields, including social science, education, economics, and biomedical research. Disregarding missing data in statistical analyses can bias study outcomes. To mitigate this issue, imputation methods have proven effective in reducing nonresponse bias and generating complete datasets for subsequent analysis of secondary data. The efficacy of an imputation method hinges on the assumptions of the underlying imputation model. While machine learning techniques such as regression trees, random forest, XGBoost, and deep learning have demonstrated robustness against model misspecification, achieving their best performance may require fine-tuning under specific conditions. Moreover, imputed values generated by these methods can sometimes be implausible, falling outside the normal range. To address these challenges, we propose a novel Predictive Mean Matching (PMM) imputation procedure that leverages popular machine learning-based methods. PMM strikes a balance between robustness and the generation of appropriate imputed values. In this paper, we present our PMM approach and compare its performance against other established methods through Monte Carlo simulation studies.
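The general idea of predictive mean matching can be sketched as follows. This is a minimal illustration, not the authors' procedure: an ordinary least-squares fit stands in for the machine learning predictor mentioned in the abstract, and the function name `pmm_impute` and the parameters `k` and `seed` are assumptions introduced here.

```python
import numpy as np


def pmm_impute(X, y, k=5, seed=0):
    """Impute NaNs in y by predictive mean matching (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = ~np.isnan(y)

    # Stand-in predictor (assumption): least squares with an intercept.
    # Any fitted model that returns predicted means could be swapped in.
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design[observed], y[observed], rcond=None)
    pred = design @ beta

    y_obs = y[observed]
    pred_obs = pred[observed]
    imputed = y.copy()
    for i in np.flatnonzero(~observed):
        # Donor pool: the k observed cases whose predicted means are
        # closest to the predicted mean of the incomplete case.
        donors = np.argsort(np.abs(pred_obs - pred[i]))[:k]
        # Impute the actually observed value of a randomly chosen donor.
        imputed[i] = y_obs[rng.choice(donors)]
    return imputed
```

Because every imputed value is copied from an observed donor, the result cannot fall outside the range of the observed data, which is the property the abstract contrasts with direct model-based predictions.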