A Scalable Spatial Decorrelation Preprocessing Approach for Machine and Deep Learning
Pub. online: 9 December 2025
Type: Statistical Data Science
Open Access
Received
13 June 2025
Accepted
25 November 2025
Published
9 December 2025
Abstract
Spatial data exhibit correlation between observations collected at nearby locations. Generally, machine and deep learning methods either do not account for this correlation or do so only indirectly through correlated features. To account for spatial correlation, we propose preprocessing the data using a spatial decorrelation transform motivated by properties of the multivariate Gaussian distribution and Vecchia approximations. The preprocessed, transformed data can then be passed to a machine or deep learning tool. After model fitting on the transformed data, the output can be spatially re-correlated via the corresponding inverse transformation. We show that including this spatial adjustment results in higher predictive accuracy on simulated and real spatial datasets.
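To illustrate the idea in the abstract, the sketch below whitens a Gaussian spatial field with an exact Cholesky factor of its covariance (the Vecchia approximation in the paper replaces this dense factor with a cheap sparse analogue), fits nothing in between, and re-correlates via the inverse transform. The exponential covariance and its parameter values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def exp_cov(coords, variance=1.0, range_=0.5, nugget=1e-6):
    # Exponential covariance matrix; parameter values are illustrative only.
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return variance * np.exp(-d / range_) + nugget * np.eye(len(coords))

def whiten(y, coords, **kw):
    # z = L^{-1} y where Sigma = L L^T, so Cov(z) = I (decorrelated).
    L = cholesky(exp_cov(coords, **kw), lower=True)
    return solve_triangular(L, y, lower=True), L

def unwhiten(z, L):
    # Inverse transform: re-correlate model output, y = L z.
    return L @ z

# Demo: simulate a correlated field, decorrelate, then invert.
rng = np.random.default_rng(0)
coords = rng.uniform(size=(200, 2))
y = cholesky(exp_cov(coords), lower=True) @ rng.standard_normal(200)
z, L = whiten(y, coords)
y_back = unwhiten(z, L)
```

In the workflow described above, a machine or deep learning model would be trained on `z` (and transformed features) rather than `y`, with predictions mapped back through `unwhiten`.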
Supplementary material
This material is based upon work supported by the National Aeronautics and Space Administration under Grant/Contract/Agreement No. 10053957-01 and by the National Science Foundation under Grant No. 2053188. R and Python implementations of the proposed spatial whitening transformation are available as a zip file or at https://github.com/amillane/spatialtransform. The contents are organized as follows:
•
README.md : A brief overview of the repository structure and usage instructions.
•
R Function/
–
Functions/TransformFunctions.R : R implementation of the whitening and inverse-whitening transformations.
–
demo.R : Example code demonstrating use of the R transformation functions.
–
SimulatedData1.RData : Example simulated dataset for demonstration.
–
SimulatedData2.RData : Second example simulated dataset.
•
Python Function/
–
Functions/SpatialTransform.py : Python implementation of the whitening and inverse-whitening transformations.
–
Functions/matern.py : Matérn covariance utility functions.
–
Functions/mknnIndx.py : Nearest-neighbor index construction for Vecchia approximation.
–
demo.ipynb : Jupyter notebook illustrating how to use the Python implementation.
–
NonLinSimDataSet17.json : Example nonlinear simulated dataset used in demonstrations.
Together, these materials provide complete code and example data needed to reproduce the spatial whitening transformation and the analyses described in the manuscript.
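As a rough illustration of the Vecchia-style whitening these files implement (this is not the repository's actual API, and the exponential covariance with fixed parameters is an assumption), each observation can be standardized given its nearest previously ordered neighbors:

```python
import numpy as np

def vecchia_whiten(y, coords, m=10, variance=1.0, range_=0.5):
    # Approximate whitening via Vecchia nearest-neighbor conditioning:
    # z_i standardizes y_i given its (up to) m nearest prior neighbors.
    # Hypothetical sketch, not the code in SpatialTransform.py.
    n = len(y)
    cov = lambda A, B: variance * np.exp(
        -np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1) / range_)
    z = np.empty(n)
    z[0] = y[0] / np.sqrt(variance)
    for i in range(1, n):
        d = np.linalg.norm(coords[:i] - coords[i], axis=1)
        nb = np.argsort(d)[:m]                       # m nearest prior points
        C_nn = cov(coords[nb], coords[nb]) + 1e-10 * np.eye(len(nb))
        C_ni = cov(coords[nb], coords[i:i + 1])[:, 0]
        b = np.linalg.solve(C_nn, C_ni)              # kriging weights
        cond_var = variance - C_ni @ b               # conditional variance
        z[i] = (y[i] - b @ y[nb]) / np.sqrt(cond_var)
    return z
```

With `m = n`, full conditioning is recovered and the transform matches the exact Cholesky whitening; small `m` trades a little accuracy for near-linear cost, which is the point of the Vecchia approximation.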