Journal of Data Science

Supervised Spatial Regionalization using the Karhunen-Loève Expansion and Minimum Spanning Trees
Volume 20, Issue 4 (2022): Special Issue: Large-Scale Spatial Data Science, pp. 566–584
Ranadeep Daw, Christopher K. Wikle

https://doi.org/10.6339/22-JDS1077
Pub. online: 9 November 2022      Type: Statistical Data Science      Open Access

Received: 1 September 2022
Accepted: 30 October 2022
Published: 9 November 2022

Abstract

The article presents a methodology for supervised regionalization of data on a spatial domain. Defining a spatial process at multiple scales leads to the famous ecological fallacy problem. Here, we use the ecological fallacy as the basis for a minimization criterion to obtain the intended regions. The Karhunen-Loève Expansion of the spatial process maintains the relationship between the realizations from multiple resolutions. Specifically, we use the Karhunen-Loève Expansion to define the regionalization error so that the ecological fallacy is minimized. The contiguous regionalization is done using the minimum spanning tree formed from the spatial locations and the data. Then, regionalization becomes similar to pruning edges from the minimum spanning tree. The methodology is demonstrated using simulated and real data examples.
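The tree-pruning idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a k-nearest-neighbour connectivity graph, edge weights given by absolute differences in the data, and a simple "cut the heaviest edges" rule standing in for the paper's Karhunen-Loève-based regionalization error. The function name `mst_regions` and its parameters are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial import cKDTree


def mst_regions(coords, values, n_regions, k=4):
    """Split locations into contiguous regions by pruning a minimum spanning tree.

    Illustrative only: edges are cut by weight, not by a KLE-based criterion.
    """
    n = len(coords)
    # Connectivity graph: link each location to its k nearest neighbours.
    _, nbrs = cKDTree(coords).query(coords, k=k + 1)  # column 0 is the point itself
    rows = np.repeat(np.arange(n), k)
    cols = nbrs[:, 1:].ravel()
    # Edge weight = dissimilarity in the data; the offset keeps equal-valued
    # neighbours connected, since zero weights are read as "no edge".
    weights = np.abs(values[rows] - values[cols]) + 1e-9
    graph = csr_matrix((weights, (rows, cols)), shape=(n, n))

    mst = minimum_spanning_tree(graph)
    mst.eliminate_zeros()
    mst = mst.tocoo()

    # Removing one tree edge adds one region, so cut the (n_regions - 1) heaviest.
    keep = np.argsort(mst.data)[: len(mst.data) - (n_regions - 1)]
    pruned = csr_matrix(
        (mst.data[keep], (mst.row[keep], mst.col[keep])), shape=(n, n)
    )
    _, labels = connected_components(pruned, directed=False)
    return labels


# Hypothetical data: a 10 x 10 grid whose left and right halves differ in level.
xs, ys = np.meshgrid(np.arange(10), np.arange(10))
coords = np.column_stack([xs.ravel(), ys.ravel()]).astype(float)
values = np.where(coords[:, 0] < 5, 0.0, 10.0)
labels = mst_regions(coords, values, n_regions=2)
```

Because every region is a connected piece of the tree, and the tree only uses edges of the connectivity graph, the resulting regions are contiguous by construction. Swapping the edge-selection rule for a different error criterion leaves the rest of the sketch unchanged.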

Supplementary material

The supplementary material includes the following files: (1) README: a brief explanation of all the files in the supplementary material; (2) The synthetic dataset; (3) The real-world dataset; (4) Code files; (5) Images used in the paper; (6) A miscellaneous example of KLE computation directly from covariance matrices.
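For readers unfamiliar with item (6), a discrete Karhunen-Loève expansion can be computed from a covariance matrix by eigendecomposition. The sketch below is an independent illustration and does not reproduce the supplementary file. The exponential covariance, the 1-D transect, and the function name `kle_from_covariance` are assumptions chosen only for the example.

```python
import numpy as np


def kle_from_covariance(cov, n_terms):
    """Return the leading eigenvalues and eigenvectors of a covariance matrix."""
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_terms]  # reorder to keep the largest first
    return eigvals[order], eigvecs[:, order]


# Assumed example: exponential covariance on 50 points of a 1-D transect.
s = np.linspace(0.0, 1.0, 50)
cov = np.exp(-np.abs(s[:, None] - s[None, :]) / 0.2)

lam, phi = kle_from_covariance(cov, n_terms=50)

rng = np.random.default_rng(0)
y = rng.multivariate_normal(np.zeros(50), cov)  # one zero-mean realization

alpha = phi.T @ y                    # expansion coefficients
y_full = phi @ alpha                 # using all terms recovers y
y_trunc = phi[:, :10] @ alpha[:10]   # truncated expansion with 10 terms
```

The columns of `phi` are orthonormal, so the truncated expansion is an orthogonal projection of `y` onto the leading eigenvectors, and the coefficients in `alpha` are uncorrelated with variances given by `lam`.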


Copyright
2022 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.
Open access article under the CC BY license.

Keywords
connectivity graph; ecological fallacy; Karhunen-Loève expansion; minimum spanning tree; regionalization; spatial data

Funding
This research was partially supported by the U.S. National Science Foundation (NSF) grant SES-1853096. The computation for this work was performed on the high performance computing infrastructure provided by the Research Computing Support Services at the University of Missouri, Columbia, MO, and is supported in part by the NSF grant CNS-1429294.

Journal of Data Science

  • Online ISSN: 1683-8602
  • Print ISSN: 1680-743X
