The Second Competition on Spatial Statistics for Large Datasets

Abdulah, Sameh; Alamri, Faten; Nag, Pratik; Sun, Ying; Ltaief, Hatem; Keyes, David E.; Genton, Marc G.

doi:10.6339/22-JDS1076

Journal of Data Science


Volume 20, Issue 4 (2022): Special Issue: Large-Scale Spatial Data Science, pp. 439–460


https://doi.org/10.6339/22-JDS1076

Type: Statistical Data Science

Open Access

Received: 14 August 2022

Accepted: 29 October 2022

Published online: 8 November 2022

Abstract

In the last few decades, the size of spatial and spatio-temporal datasets in many research areas has increased rapidly with the development of data collection technologies. As a result, classical methods in spatial statistics face computational challenges. For example, the kriging predictor in geostatistics becomes prohibitively expensive on traditional hardware architectures for large datasets, because its large dense matrix operations demand substantial computing power and memory. Over the years, various approximation methods have been proposed to address such computational issues; however, the community lacks a holistic process to assess their approximation efficiency. To provide a fair assessment, in 2021 we organized the first competition on spatial statistics for large datasets, generated by our ExaGeoStat software, and asked participants to report the results of estimation and prediction. Thanks to its widely acknowledged success and at the request of many participants, we organized the second competition in 2022, focusing on predictions for more complex spatial and spatio-temporal processes, including univariate nonstationary spatial processes, univariate stationary space-time processes, and bivariate stationary spatial processes. In this paper, we describe the data generation procedure in detail and make the datasets publicly available for wider adoption. We then review the methods submitted by fourteen teams worldwide, analyze the competition outcomes, and assess the performance of each team.
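To make the computational burden mentioned in the abstract concrete, the following is a minimal sketch of zero-mean simple kriging with a dense Cholesky solve. It is an illustration only, not the ExaGeoStat implementation: the exponential covariance model, its parameter values, and all function names (`exp_cov`, `simple_krige`, `dense_matrix_gib`) are assumptions chosen for brevity.

```python
import math

def exp_cov(p, q, variance=1.0, length_scale=0.1):
    """Exponential covariance between two 2-D points (an assumed model)."""
    return variance * math.exp(-math.dist(p, q) / length_scale)

def cholesky(a):
    """Dense lower-triangular Cholesky factor of an SPD matrix: O(n^3) time."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = math.sqrt(s) if i == j else s / l[j][j]
    return l

def solve_spd(a, b):
    """Solve a x = b through the Cholesky factor (forward, then backward substitution)."""
    n = len(a)
    l = cholesky(a)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(l[k][i] * x[k] for k in range(i + 1, n))) / l[i][i]
    return x

def simple_krige(locs, values, target):
    """Zero-mean simple kriging prediction c0^T Sigma^{-1} z at `target`."""
    # The n x n covariance matrix is stored densely: O(n^2) memory.
    sigma = [[exp_cov(p, q) for q in locs] for p in locs]
    alpha = solve_spd(sigma, values)            # Sigma^{-1} z
    c0 = [exp_cov(target, p) for p in locs]     # covariances to the target
    return sum(c * a for c, a in zip(c0, alpha))

def dense_matrix_gib(n, bytes_per_entry=8):
    """Memory needed to hold one dense n x n matrix, in GiB."""
    return n * n * bytes_per_entry / 2**30

locs = [(0.1, 0.2), (0.8, 0.3), (0.4, 0.9), (0.6, 0.6)]
vals = [1.0, -0.5, 0.3, 2.0]
print(simple_krige(locs, vals, (0.5, 0.5)))
print(dense_matrix_gib(100_000))  # about 74.5 GiB in double precision
```

Under these assumptions, a single dense covariance matrix for a hypothetical 100,000 locations already needs roughly 74.5 GiB in double precision, and the factorization cost grows cubically with the number of locations, which is why approximation methods are attractive at this scale.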

Supplementary material


Table S1 of the Supplementary Material lists the members of all teams that participated in this competition. Tables S2 to S11 summarize the root mean squared error (RMSE) values obtained by the teams on each dataset of the sub-competitions, along with the values obtained with ExaGeoStat for reference.
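For readers unfamiliar with the metric reported in these tables, RMSE can be computed as in the short sketch below. The function name `rmse` and the sample numbers are illustrative assumptions and are not taken from the competition data.

```python
import math

def rmse(predicted, truth):
    """Root mean squared error between predictions and held-out true values."""
    if len(predicted) != len(truth) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = ((p - t) ** 2 for p, t in zip(predicted, truth))
    return math.sqrt(sum(squared_errors) / len(truth))

# Hypothetical predictions against hypothetical true values.
print(rmse([0.9, 2.1, 2.8], [1.0, 2.0, 3.0]))
```

A lower RMSE indicates predictions closer to the held-out values, so it allows different methods to be ranked on the same dataset.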


© 2022 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.

Open access article under the CC BY license.

Keywords

Gaussian process; multivariate; nonstationary; prediction; space-time; spatial

Funding

The research in this manuscript was funded by the King Abdullah University of Science and Technology (KAUST) in Thuwal, Saudi Arabia. We want to thank the Supercomputing Laboratory (KSL) at KAUST (https://www.hpc.kaust.edu.sa/) for supporting this research by providing the hardware resources, including the Shaheen-II Cray XC40 supercomputer used to generate the datasets in this competition.
