Supervised Spatial Regionalization using the Karhunen-Loève Expansion and Minimum Spanning Trees

The article presents a methodology for supervised regionalization of data on a spatial domain. Defining a spatial process at multiple scales leads to the well-known ecological fallacy problem. Here, we use the ecological fallacy as the basis for a minimization criterion to obtain the intended regions. The Karhunen-Loève Expansion of the spatial process maintains the relationship between the realizations at multiple resolutions. Specifically, we use the Karhunen-Loève Expansion to define the regionalization error so that the ecological fallacy is minimized. Contiguous regionalization is carried out using the minimum spanning tree formed from the spatial locations and the data, so that regionalization reduces to pruning edges from the tree. The methodology is demonstrated using simulated and real data examples.


Introduction
Regionalization in the geospatial literature refers to dividing a spatial domain under study into a set of geographically contiguous partitions. Conceptually, this is similar to clustering a set of spatial locations with an additional spatial contiguity constraint that only allows spatial neighbors to lie in the same partition. However, it differs from unsupervised spatial clustering methods (e.g., DBSCAN; Ester et al., 1996), which do not consider any information from the observed random variable. In fact, regionalization is usually more similar to classification or regression methods, where the target is to partition the spatial domain based on the behavior of a spatially distributed random variable. The problem can be either discrete or continuous, depending on the measure of the domain space. For example, regionalization by joining neighboring census tracts (Spielman and Folch, 2015) is a discrete regionalization problem since the original spatial domain has a discrete measure, whereas methodologies including decision trees, random forests, and Bayesian additive regression trees (BART) partition the space of continuous features into a set of compact subsets. From one perspective, the logic behind regionalization is to create homogeneous regions such that the behavior of the process, as measured by a loss function, remains similar within every partition. In this way, regionalization can help reduce the size of large spatial datasets. Moreover, the spatial boundaries between neighboring regions are informative about the underlying distribution of the spatial process.
One can view spatial regionalization as a bias-variance trade-off problem applied to the underlying spatial process. Real-world spatial datasets are often globally non-stationary yet contain many locally stationary patterns in smaller regions. Spatial regionalization tries to separate such small areas with stable spatial structures. The implicit problem of bias-variance trade-off arises from the choice of the size or the number of the regions since smaller regions may result in more accurate predictions, but simultaneously increase the uncertainties of the parameter estimation. See Lenzi et al. (2020), Fang et al. (2022) for a more detailed discussion of these issues.
Regionalization has been studied in various branches of scientific analysis, mainly in geography and the social sciences. For example, it has been applied to problems in federal policy (Gottmann, 1980; Pearson, 2007), climatology (Giorgi, 2008; Pradhan et al., 2020), public health (Ramos et al., 2020), social science (Openshaw and Rao, 1995; George et al., 1997), and data compression (Kirkley, 2022), among others. Interested readers can see Singleton and Spielman (2014) and Duque et al. (2007) for surveys on spatial regionalization methodologies and applications. The methodologies can largely be divided into two parts: implicit methods and explicit methods. The implicit methods (see Section 2 of Duque et al., 2007) do not use the spatial contiguity constraint to create the partitions. Instead, they create clusters based on the response variable and then try to impose the spatial compactness constraint in a second stage. Although these algorithms use location information, most cannot guarantee spatial contiguity. Moreover, some use the location information as a feature (e.g., Bradley et al., 2017; Chen et al., 2021), which may require an additional weighting parameter determining the effect of the locations. Alternatively, explicit methods ensure the spatial contiguity of the regions. Examples of explicit methods are the max-p algorithm (Duque et al., 2012), minimum spanning trees (Assunção et al., 2006), and combinatorial search (Cliff and Haggett, 1970). Other categories of regionalization include heuristic methodologies (Duque, 2004) and seeded region growing (Adams and Bischof, 1994).
Most of the aforementioned spatial regionalization methodologies do not consider the problem of the ecological fallacy (Robinson, 2009), which appears naturally with the spatial change of support (COS) problem. The ecological fallacy occurs when a process inside a common spatial boundary is studied at different resolutions and different inference is obtained simply because of aggregation. This problem is similar to Simpson's paradox, which demonstrates contrasting statistical inference drawn between an individual-level and a group-level study of the same data. As an example, county-level unemployment rates may show different trends from the same data aggregated to the state level. This occurs for various reasons, such as differing population sizes of the group-level data, high variance within groups, etc. Although most regionalization methods do not account for the ecological fallacy, Bradley et al. (2017) provided a methodology to mitigate the problem by using the ecological fallacy itself as the criterion for spatial aggregation (CAGE). However, it is known that the k-means algorithm used in Bradley et al. (2017) fails to create spatially contiguous regions and only creates regions that are close in values. The authors also suggested a hierarchical agglomerative clustering (Chavent et al., 2018) and implemented it in the R package rcage (Bradley et al., 2021). However, this method suffers from the pitfalls of implicit spatial methods mentioned above. Moreover, the O(n² log n) computational complexity of the agglomerative clustering is limiting in Bayesian applications such as that in Bradley et al. (2017). Hence, we consider further improvements in terms of both the explicit spatial neighborhood methodology and the computational complexity of the regionalization process.
We extend the two-stage method of Bradley et al. (2017) for the spatial regionalization problem. In the first stage, we mitigate the ecological fallacy issue by projecting the response variable onto the space of the optimal eigenfunctions. This is done using the Karhunen-Loève Expansion (KLE) of the underlying spatial process. The KLE is an optimal feature expansion method that uses the "kernel trick" on the covariance function of the data: it employs orthonormal basis functions as the spatial features, with uncorrelated zero-mean random variables, whose variances are the eigenvalues, as the basis expansion coefficients. The optimality of the KLE can be established by noting that, among all feature expansions with a fixed number of expansion terms, the KLE has the minimum mean-squared error. See Daw et al. (2022) for a review of the KLE and an application in spatial modelling. Using the KLE as a regionalization tool transforms the problem onto the space of the covariance function. This ensures the same regionalization for all random variables that follow the same distributional assumptions as our observed data; hence, KLE-based regionalization criteria are stronger than those that use only the response variable. Differently from Bradley et al. (2017), in the second stage we perform spatial partitioning using minimum spanning trees (MSTs). We do this by considering the spatial data as a connected graph, where the data locations constitute the vertex set and the edge set contains only the location tuples that are neighbors. Based on the dissimilarity in the feature space, the edges are associated with a loss function, which ensures the existence of a unique MST. The MST then yields a connected graph with the minimum total cost. The spatial regions are created at the last stage by pruning edges from the MST using a heuristic algorithm. We present our model in a unified view of discrete and continuous regionalization and demonstrate the methodology on simulated and real-data examples.
The remainder of the manuscript is organized as follows. Section 2 describes the Karhunen-Loève Expansion and ecological fallacy regionalization criterion. Section 3 provides a description of the minimum spanning tree regionalization algorithm. We demonstrate the method using both simulated and real-world data in Section 4. Section 5 concludes with a summary and discussion of possible future directions.

Prerequisites
On a compact spatial domain S ⊆ R², consider a spatially dependent, real-valued univariate spatial process Y(·) : S → R. We assume that we have observed noisy realizations from this process at N locations, denoted by S = {s_1, . . . , s_N}. Unsupervised spatial regionalization uses only the spatial configuration of the locations to partition the domain and is not of interest in this manuscript. Rather, we also consider a corresponding observed random variable Z(·) and a latent process Y(·). The observations and the true process at the locations in S are denoted by Z = {Z_1, . . . , Z_N} and Y = {Y_1, . . . , Y_N}, respectively. Spatial prediction methods use these observed variables to estimate the dependence structure (or the parameters in a specified spatial dependence model) of the underlying latent process Y in order to generate predictions. The dynamics of the spatial process are implicitly captured in the covariance function of the latent process. Without loss of generality, we assume that Y(·) is a zero-mean process with covariance function C(·, ·) : S × S → R. In this manuscript, we make no stationarity assumption on the covariance function. We only require Z to be a continuous random variable over the real line; this assumption can be relaxed to arbitrary data types by using transformations as in generalized linear models.
The spatial regionalization problem is concerned with finding a contiguous partition π = {P_1, . . . , P_n}, with P_j ⊂ S and P_j ∩ P_k = ∅ for j ≠ k, of the spatial domain. In continuous regionalization, each P_j is a compact subset of the domain satisfying ∪_{j=1}^{n} P_j = S, whereas discrete regionalization implies that each P_j is a subset of the observed locations with ∪_{j=1}^{n} P_j = S. The regions P_j are also known as areal units, and the data (process) over the areal units are correspondingly called areal data (process). We denote the observed areal data and the unobserved areal process by Z_A = {Z_A(P_1), . . . , Z_A(P_n)} and Y_A = {Y_A(P_1), . . . , Y_A(P_n)}, respectively. In spatial statistics, the areal random variables Y_A(·) are defined by averaging the point-level random variable Y as follows (Cressie, 2015):

    Y_A(P_j) = (1 / μ(P_j)) ∫_{P_j} Y(s) μ(ds).    (2)

Here μ(·) is the Lebesgue (counting) measure in the continuous (discrete) regionalization problem. From this definition, Y_A(·) is essentially the expected value of Y(·) over the chosen partition. Therefore, supervised regionalization problems yield an ANOVA-like decomposition of the space, meaning that Y(·) is constrained to be as homogeneous as possible within each spatial partition while the between-partition discrepancies are maximized. In this way, the areal data become a close approximation of the original point-level spatial data. This also sets up the motivation behind many regionalization approaches: the discrepancy between Y(·) and Y_A(·) should be minimized to obtain an effective partition of the domain.
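As a concrete illustration of the discrete case, where μ(·) is the counting measure, the areal value Y_A(P_j) reduces to the mean of the point-level values inside each partition. The minimal sketch below (the helper name is ours, not the authors') computes these areal averages:

```python
from collections import defaultdict

def areal_average(y, labels):
    """Mean of point-level values y within each partition given by labels."""
    sums, counts = defaultdict(float), defaultdict(int)
    for value, part in zip(y, labels):
        sums[part] += value
        counts[part] += 1
    return {part: sums[part] / counts[part] for part in sums}

# Four locations, two partitions: P_0 = first three points, P_1 = last point.
print(areal_average([1.0, 2.0, 3.0, 10.0], [0, 0, 0, 1]))  # {0: 2.0, 1: 10.0}
```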

Ecological Fallacy as Regionalization Criterion
As noted by Bradley et al. (2017), minimizing the distance between Y(·) and Y_A(·) is not sufficient since it does not necessarily mitigate the ecological fallacy. Instead, creating areal units using the ecological fallacy itself as the loss function (i.e., their criterion for spatial aggregation) ensures that the regions do not suffer from this problem and that the spatial trends (and other inference summaries) in both processes remain similar. Since the covariance function captures the spatial dependence, we use it here to create the intended loss function. In particular, the covariance function suggests features from the spatial data following the "kernel trick" for the associated reproducing kernel Hilbert space (RKHS), which represents a covariance kernel C(·, ·) as inner products of feature maps:

    C(s, s') = ⟨φ(s), φ(s')⟩.

Here φ(·) is the projection of Y onto some arbitrary space of feature vectors. Although we can use any family of basis functions for the projection, it is difficult to justify an arbitrary choice of family and number of functions. Therefore, to avoid such ambiguities, and to adhere to a more data-driven approach, we use the Karhunen-Loève Expansion (KLE; Karhunen, 1946; Loève, 1955) of the covariance function, which is well known to be the optimal basis expansion method in the L² sense. The KLE of a univariate spatial process Y(·) is a consequence of Mercer's theorem, which guarantees that the covariance function C(·, ·) has eigenvalues λ_1 ≥ λ_2 ≥ · · · ≥ 0 and eigenfunctions {ψ_k(·) : S → R} satisfying the orthonormality condition ∫_S ψ_i(s) ψ_j(s) ds = δ_ij. Then, the KLE of the spatial process Y(·) is given by

    Y(s) = Σ_{k=1}^{∞} α_k ψ_k(s),    (1)

where the α_k are uncorrelated random variables with mean 0 and variance λ_k. As evident from this representation, the KLE satisfies a bi-orthogonality property: it uses both orthonormal basis functions and uncorrelated expansion coefficients.
Using the KLE, the regionalization of Y is addressed via the following two-stage representation:

    Y(s) = Σ_{k=1}^{∞} α_k ψ_k(s),    Y_A(P_j) = Σ_{k=1}^{∞} α_k ψ_{A,k}(P_j),    (3)

where ψ_{A,k}(P_j) denotes the areal average of ψ_k(·) over P_j as in equation (2). For the purpose of this manuscript, it is convenient to unify the eigenvalues and eigenfunctions into a new set of scaled KL features by multiplying the eigenfunctions ψ_k(·) by the square roots of the eigenvalues, ξ_k(·) = √λ_k ψ_k(·). We call these "SKL basis functions" or "SKL expansions" (SKLE) hereafter. Areal-level SKL basis functions are defined in the same fashion as in equation (2).
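In the discrete setting, the SKL features can be computed directly from a covariance matrix evaluated at the data locations. The sketch below is our illustration (in the paper, the covariance is estimated via the O-C procedure of Section 3); it forms ξ_k = √λ_k ψ_k from an eigendecomposition:

```python
import numpy as np

def skl_features(C, M):
    """Truncated scaled Karhunen-Loeve features from a covariance matrix.

    C : (N, N) symmetric positive semi-definite covariance matrix.
    Returns an (N, M) matrix whose k-th column is sqrt(lambda_k) * psi_k
    evaluated at the N locations, with eigenvalues in decreasing order.
    """
    lam, psi = np.linalg.eigh(C)        # eigh: ascending eigenvalues
    lam, psi = lam[::-1], psi[:, ::-1]  # reorder to descending
    lam = np.clip(lam, 0.0, None)       # guard against tiny negative values
    return psi[:, :M] * np.sqrt(lam[:M])

# Toy check: 1-D exponential covariance over 50 gridded locations.
s = np.linspace(0, 1, 50)
C = np.exp(-np.abs(s[:, None] - s[None, :]) / 0.2)
xi = skl_features(C, M=5)
print(xi.shape)  # (50, 5)
```

With M equal to the number of locations, the scaled features reproduce the covariance exactly, since xi @ xi.T = Ψ Λ Ψᵀ = C.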

Criterion for Regionalization
For regionalization, we use the following loss function Ł : π → [0, ∞], defined over the partition set π = {P_1, . . . , P_n}:

    Ł(π) = Σ_{j=1}^{n} ∫_{P_j} E[(K(Y)(s) − K(Y)_A(P_j))²] μ(ds).    (4)

To incorporate the KLE of Y(·), we define the operator K(·) in equation (4) as the projection onto the space of the SKLE:

    K(Y)(s) = Σ_{k=1}^{M} α_k ψ_k(s).    (5)

Minimizing the above loss function is the same as minimizing the ecological fallacy in the spatial regionalization π. Equation (4) and equation (3) together ensure the decomposition of the space, since the within-region discrepancies are also minimized. Propositions 2 and 4 of Bradley et al. (2017) justify this SKLE-based loss function and connect the ecological fallacy loss in the data with the same in the eigenspace. They showed that for any measurable real-valued function f over S, there is an equivalence between minimizing the spatial aggregation error (or the ecological fallacy loss) in f_A and maintaining the between-scale homogeneity of the underlying SKLEs. The more homogeneous the eigenfunctions are, the better the areal units represent the point-level data. Note that this choice of K also leads to the same regionalization for any random variable following the same distributional assumptions. This is a much stronger result than setting K = I (the identity operator) in equation (4), which is limited to the specific realization of Y. Hence, we base our regionalization algorithms on the space of the SKLE of the underlying process Y(·).

Remark 1.
In practice, we truncate the infinite sum in equation (1) at a large number M, using the first M eigenvalue-eigenfunction pairs. One can use any standard rule (e.g., the 'elbow rule' or scree plots) to select the truncation parameter, which alleviates the arbitrary choice of the number of basis functions. A higher M ensures a finer-scale representation of the process but increases the computational cost significantly.
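A simple automated alternative to eyeballing a scree plot is to keep the smallest M whose eigenvalues capture a target share of the total variance; this rule is our suggestion, not the authors' prescription:

```python
def choose_M(lam, share=0.95):
    """Smallest M whose leading eigenvalues explain at least `share` of variance."""
    lam = sorted((max(l, 0.0) for l in lam), reverse=True)  # clip, descending
    total = sum(lam)
    running = 0.0
    for m, l in enumerate(lam, start=1):
        running += l
        if running / total >= share:
            return m
    return len(lam)

print(choose_M([5.0, 3.0, 1.0, 0.5, 0.5]))  # 4 (5 + 3 + 1 + 0.5 = 95% of 10)
```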
Remark 2. Following Mercer's theorem, one can show that among all possible linear expansions of C(·, ·) with M expansion terms, the KLE minimizes the expected mean-squared error. Therefore, the KLE is the optimal feature expansion in the L² sense, which justifies its choice as the feature space.
Remark 3. To compute the KLE, one can use the eigendecomposition of the empirical covariance matrix of the data if a set of replicates is available (e.g., for spatio-temporal data). However, this is not possible when we have only one realization, more locations than replications, or when the observations are too noisy. Thus, we use a two-stage Obled-Creutin (O-C) basis calculation procedure as outlined in Bradley et al. (2017), Obled and Creutin (1986). In this approach, one starts with any choice of generating basis functions (GBF) from some commonly used family of basis functions. The corresponding O-C bases are then computed by orthonormalizing the GBFs, using the routine in Algorithm 1.
Step 3 of Algorithm 2 models the observed data using these O-C bases. In a subsequent step, the O-C bases are multiplied by a rotation matrix to obtain the Karhunen-Loève basis functions. See the routine in Algorithm 2 and Appendix B for the proof of correctness of the O-C approach.

Spatial Regionalization
In this section, we utilize the components of graph theory to express the spatial data as a connected graph and then partition it by removing the edges of the graph.

Spatial Data as Connected Graph
There is an equivalence between a set of geospatial locations and a connectivity graph (e.g., Dale, 2017; Anderson and Dragićević, 2020). Consider a set of spatial locations S = {s_1, . . . , s_N} with a pre-defined neighborhood map for every location. The equivalent connectivity graph G is defined as the tuple G = (S, V), where the set of observed locations S is called the vertex set and the edge set V is the collection of edges {ν_jk : (j, k) ∈ {1, . . . , N}²}, where an edge ν_jk exists if and only if the locations s_j and s_k are spatial neighbors. Information from Y(·) is incorporated into G by associating the edges with a set of edge weights. We define the weight function ω_jk for the edge ν_jk as

    ω_jk = ||ξ(s_j) − ξ(s_k)||_1 = Σ_{m=1}^{M} |ξ_m(s_j) − ξ_m(s_k)|.

The absolute loss proved to be a better choice than the squared-error distance due to the roughness of the SKLEs. We use the notation G_ω = (S, V; ω) to denote a connectivity graph with edge weights ω, whereas G = (S, V) denotes a graph without any edge weight information. A path from location s_j to s_k exists if there is a set of edges {ν_{j i_1}, ν_{i_1 i_2}, . . . , ν_{i_{t−1} i_t}, ν_{i_t k}} connecting the two locations. A circuit is a path whose first and last locations coincide. A graph is connected if there is a path between any two locations. Under this representation, a (discrete) spatial partition is the same as a subset of locations connected by paths. Therefore, we can partition the connectivity graph G_ω into a set of connected subgraphs to create the spatial regions such that the loss function Ł of Section 2 is as small as possible inside the subgraphs. Motivated by this, we next use the spanning tree structure to create the regions.
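A minimal sketch of this graph construction (function and parameter names are ours): for a distance-based neighborhood, connect every pair of locations within a cutoff radius and weight each edge by the absolute (L1) distance between the SKL feature vectors at its endpoints:

```python
import math

def build_edges(locs, xi, radius):
    """Edges (weight, j, k) of a distance-based connectivity graph.

    locs  : list of (x, y) coordinates.
    xi    : list of SKL feature vectors, one per location.
    radius: neighborhood cutoff distance.
    """
    edges = []
    n = len(locs)
    for j in range(n):
        for k in range(j + 1, n):
            if math.dist(locs[j], locs[k]) <= radius:
                # absolute-loss (L1) weight between feature vectors
                w = sum(abs(a - b) for a, b in zip(xi[j], xi[k]))
                edges.append((w, j, k))
    return sorted(edges)  # sorted by weight, ready for MST construction

locs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
xi = [[0.0], [1.0], [5.0]]
print(build_edges(locs, xi, radius=1.5))  # [(1.0, 0, 1), (4.0, 1, 2), (5.0, 0, 2)]
```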

Minimum Spanning Tree
A spanning tree T of a connected graph G_ω = (S, V; ω) is a connected subgraph T = (S, U), U ⊆ V, such that any two locations in S are connected by a unique path. In applications to spatial data, spanning trees create a unique traversal route between the locations under study. If S has N locations, a spanning tree contains exactly N − 1 edges. Therefore, removing edges from the spanning tree leads to a discrete regionalization, in which the locations inside a region remain connected via the unique paths of the resulting subtree. Hence, spanning trees arise intuitively in partitioning spatial regions (Assunção et al., 2006; Teixeira et al., 2019; Luo et al., 2021).
A connected graph typically has many possible spanning trees, leading to multiple traversal routes. Here we utilize the weights ω_jk to obtain a supervised and unique route connecting the locations. Using these edge-specific weights, the cost of a graph is defined as the sum of its edge weights, and the minimum spanning tree (MST) is the spanning tree with the minimum total cost. Therefore, we use the MST of the graph G_ω as the unique choice of spanning tree. Note that an MST is unique if and only if the pairwise weights are distinct real numbers. In our regionalization methodology, since the weights arise from the SKLE, they are distinct with probability 1 and hence lead to a unique MST. Now, we need to remove n − 1 edges from the MST to obtain the discrete regionalization into n regions.
We use the well-known Kruskal algorithm (Kruskal, 1956) to create the MST from the connected graph G. It is a greedy algorithm: at each step it makes the locally optimal choice, yet for the MST problem this greedy strategy is guaranteed to produce a globally minimum-cost spanning tree. The Kruskal algorithm initially sorts all the edge weights from low to high and adds the edge with the globally minimum weight to the MST. Then, in each iteration, it adds the minimum-weight edge from the remaining edge set. An edge is rejected only if its locations form a circuit with the spanning tree built up to that iteration. The algorithm stops once all locations are added to the MST. Another well-known choice for MST construction is Prim's algorithm (Prim, 1957), which uses a similar greedy routine: it starts from an arbitrary location and repeatedly adds the minimum-weight edge connecting the current tree to a new location. See the routines in Appendix C for the pseudocode associated with these two algorithms.
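The Kruskal routine described above can be sketched with a union-find structure over the `(weight, j, k)` edge list introduced earlier; despite being greedy, it returns a globally minimum-cost spanning tree:

```python
def kruskal(n_vertices, edges):
    """Minimum spanning tree of a connected graph given (weight, j, k) edges."""
    parent = list(range(n_vertices))

    def find(a):                        # union-find root with path compression
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    mst = []
    for w, j, k in sorted(edges):       # scan edges from lowest weight up
        root_j, root_k = find(j), find(k)
        if root_j != root_k:            # reject edges that would form a circuit
            parent[root_j] = root_k
            mst.append((w, j, k))
        if len(mst) == n_vertices - 1:  # tree is complete
            break
    return mst

edges = [(1.0, 0, 1), (4.0, 1, 2), (5.0, 0, 2)]
print(kruskal(3, edges))  # [(1.0, 0, 1), (4.0, 1, 2)]
```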

Regionalization
After finding the MST, optimal spatial regions are created by pruning the edges of the MST. The most famous method for partitioning an MST is the SKATER method (Assunção et al., 2006), which uses a combinatorial search in every iteration to find the optimal choice. For moderate-sized datasets, this approach is computationally expensive, and efforts to find cheaper pruning alternatives remain an active area of research (e.g., see Xu et al., 2002; Lv et al., 2018, for a review of edge removal methodologies). The common theme of such methods is to remove the most inconsistent edges. We considered the long-edge removal approach, which creates K partitions by removing the K − 1 edges with the largest weights from the MST (Kleinberg and Tardos, 2006, Section 4.7). However, our experiments showed that removing the highest-weighted edges can lead to partitions of unbalanced size. This occurs naturally since the space of the SKLE is usually quite rough. Thus, we include a region size constraint similar to the ideas in Laszlo and Mukherjee (2005). We use the hyperparameter min_partition_size to set a minimum partition size, i.e., an edge is only removable if the sizes of the two resulting subgraphs are both at least min_partition_size. We proceed from the edge with the highest weight and verify the above condition; if the condition is not satisfied, we move to the next-highest-weighted edge. The algorithm stops when the size of every partition satisfies the condition. Although it can be difficult to specify this hyperparameter in practice, we use the ideas from the CAGE algorithm in Bradley et al. (2017). For example, the American Community Survey CAGE example in Bradley et al. (2017) used 185 partitions with an average of 0.54% of the spatial locations clustered in the same partition. Based on this, we used min_partition_size = N × 0.5/100 in our experiments.
Although this performed well in our examples, an optimal choice of this parameter is the subject of future research.
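A sketch of the size-constrained long-edge pruning described above (helper names are ours): edges are visited from highest weight down, and a cut is kept only when both resulting components meet the minimum size:

```python
from collections import defaultdict, deque

def prune(n, mst_edges, K, min_size):
    """Cut up to K-1 long edges from an MST, keeping partitions >= min_size."""
    adj = defaultdict(set)
    for _, j, k in mst_edges:
        adj[j].add(k); adj[k].add(j)

    def component(start):                # BFS over the current (possibly cut) tree
        seen, queue = {start}, deque([start])
        while queue:
            v = queue.popleft()
            for u in adj[v]:
                if u not in seen:
                    seen.add(u); queue.append(u)
        return seen

    removed = 0
    for w, j, k in sorted(mst_edges, reverse=True):   # highest weight first
        if removed == K - 1:
            break
        adj[j].discard(k); adj[k].discard(j)          # tentatively cut the edge
        if len(component(j)) >= min_size and len(component(k)) >= min_size:
            removed += 1                              # keep the cut
        else:
            adj[j].add(k); adj[k].add(j)              # undo: a side is too small

    labels, next_label = {}, 0                        # label the final partitions
    for v in range(n):
        if v not in labels:
            for u in component(v):
                labels[u] = next_label
            next_label += 1
    return labels

mst = [(1.0, 0, 1), (9.0, 1, 2), (1.0, 2, 3)]  # a chain 0-1-2-3
print(prune(4, mst, K=2, min_size=2))  # {0: 0, 1: 0, 2: 1, 3: 1}
```

Note that, as in the text, fewer than K partitions may result when the size constraint blocks every remaining cut.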
In the examples presented here, we select the intended number of partitions as K. We use the Bayesian model (explained in Section 3.4) to sample a set of partitions following the above edge-removal scheme. For every such sampled partition, we compute the ecological loss function in equation (4). The optimal choice for the spatial regionalization is the one that minimizes equation (4). In this way, we choose a partition that results in the minimum amount of ecological fallacy loss while performing the spatial change of support.

Computation
In this section, we present the full computational details of our approach. Our modeling approach uses basis functions and assumes prior distributions on the model parameters. However, it is not necessary to model the data using a Bayesian basis regression; we simply need to estimate the form of the covariance function, which is possible using any common spatial modeling approach, such as Gaussian process regression (Rasmussen, 2003), fixed rank kriging (Cressie and Johannesson, 2008), Vecchia Gaussian process approaches (Vecchia, 1988), etc. An example of estimating the KLE using Gaussian process regression is provided in Appendix D. We chose a Bayesian approach here because it allows the simple specification of different regularization approaches through the prior distribution.
Our first step is to estimate the Karhunen-Loève basis functions from the data Z by considering a set of mild assumptions on the data distribution. Following Algorithm 1, we need to use a set of generating basis functions (GBF) to get the KLE. Here, we have chosen the well-known 6th-order Wendland basis functions (Wendland, 1998). Wendland bases are multi-resolutional and are often a good candidate for non-stationary data. The 6th-order Wendland basis function takes the form

    g(s; z, ζ) = (1 − d)⁶ (35d² + 18d + 3) / 3 for 0 ≤ d ≤ 1, and 0 otherwise, where d = ||s − z|| / ζ,

s is the spatial location, and z is one of a set of knots selected on a regular grid over the spatial domain S. Here, ζ is a range parameter that controls the resolution of the chosen basis function. Assuming the spatial domain to be a subset of the unit square [0, 1] × [0, 1], we used four different resolutions. Corresponding to the j-th resolution, q_j equidistant knots are placed in each dimension of S. We considered q = {5, 10, 15, 20}, which leads to 25, 100, 225, and 400 basis functions. Note that model fitting can be difficult in real data applications and one needs to check the sensitivity of the hyperparameter choices. Since this is out of the scope of our manuscript, we have used a sufficiently well-performing choice of the hyperparameters to demonstrate the methodology. Given the Wendland basis functions, Algorithm 1 is implemented to get the O-C version of the Wendland basis functions, which is equivalent to orthonormalizing the GBFs. We denote the O-C basis functions as φ_j(·) : S → R, j = 1, . . . , M. In matrix form, they are collected as Φ ∈ R^{N×M}, with the j-th row corresponding to location s_j denoted φ(s_j) ∈ R^M. Using the O-C basis, we specify our model for Z as

    Z(s) = φ(s)ᵀ β + ε(s),

where the error term ε(s) is assumed to follow an iid normal distribution with common variance σ². Since basis functions are often over-specified, regularization is accomplished by placing prior information π(β, σ²) on the model parameters.
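The Wendland basis evaluation can be sketched as follows; note that the exact polynomial form of the "6th-order" (C⁴) Wendland function used here is our assumption, as is the knot layout:

```python
import numpy as np

def wendland(s, z, zeta):
    """Evaluate a compactly supported Wendland radial basis on knots.

    s    : (N, 2) spatial locations.
    z    : (Q, 2) knot locations.
    zeta : range parameter controlling the support radius.
    Returns an (N, Q) basis matrix.
    """
    # pairwise scaled distances d = ||s - z|| / zeta, shape (N, Q)
    d = np.linalg.norm(s[:, None, :] - z[None, :, :], axis=2) / zeta
    poly = (1 - d) ** 6 * (35 * d ** 2 + 18 * d + 3) / 3
    return np.where(d <= 1.0, poly, 0.0)  # compact support: zero beyond zeta

# 5 x 5 equidistant knots on the unit square, as in the coarsest resolution.
g = np.linspace(0, 1, 5)
z = np.array([(x, y) for x in g for y in g])
s = np.random.default_rng(0).uniform(0, 1, size=(100, 2))
Phi = wendland(s, z, zeta=0.5)
print(Phi.shape)  # (100, 25)
```

Each basis function peaks at 1 at its own knot (d = 0) and decays smoothly to zero at distance ζ.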
We assume β to be normally distributed with mean 0 and covariance Σ_β. We experimented with a few choices of prior for Σ_β, such as the conjugate multivariate normal prior, the Moran's I prior (e.g., Bradley et al., 2017), the independent ridge prior (i.e., Σ_β = τ²I), and the matrix half-t prior (Huang and Wand, 2013). The results presented here use the matrix half-t prior, which has a simple, interpretable hierarchical structure and performs well over arbitrary covariance matrices. Using these choices, we specify our full model:

    Z | β, σ² ∼ N(Φβ, σ²I_N),
    β | Σ_β ∼ N(0, Σ_β),
    Σ_β ∼ matrix half-t(ν, ρ),
    σ² ∼ IG(a_σ, b_σ).

As mentioned previously, hyperparameter tuning is crucial in practical applications, and any Bayesian model-fitting diagnostic can be used to validate the choice of the models. Note that the regionalization can only perform as well as our fitted model permits. We fixed the values of the hyperparameters by experimenting with a set of pre-determined choices and checking the results on a small validation set offline. For the examples in this manuscript, we used ν = 2 and ρ = 1. We use Gibbs sampling to sample from the posterior distribution of all the parameters. For our purpose, the posterior distribution of Σ_β | Z is needed; that is, we obtain samples Σ_β^(1), . . . , Σ_β^(K) from Σ_β | Z. Steps 5-6 of Algorithm 2 then give the KLE. This is done by performing an eigendecomposition of each Σ_β^(j) as Σ_β^(j) = E_j Λ_j E_j^T. The j-th realization of the KL eigenfunctions is obtained as Ψ_j = Φ E_j, and the eigenvalues are the diagonal entries of Λ_j. Since we use the SKL basis functions, they are given by Ξ_j = Φ E_j Λ_j^{1/2}, and the edge weights are computed as ω_jk = ||ξ(s_j) − ξ(s_k)||_1. Our next goal is regionalization based on the SKLE. We address both the discrete and the continuous cases simultaneously. For discrete regionalization, we evaluate the SKLEs at the data locations S. To define the neighborhood, we use a distance-based rule, i.e., the neighborhood set of a location contains all points within a given proximity.
The continuous regionalization is approximated by evaluating the SKLEs at a set of gridded "pseudo locations" over the spatial domain S. Note that since the eigenfunctions are available over the whole domain S, it is possible to compute the SKLEs, and hence the connectivity graph G_ω, at the pseudo locations without any complicated algebra. In our examples, we used the same number of pseudo locations as original data points and sampled them over a rectangular grid covering the domain. We defined the neighborhood of a grid point as the set of axis-wise and diagonally closest grid points. Compact regions are then formed by joining the pseudo grid points that belong to the same spatial region. Together, therefore, we use the connected graph G_ω = (S, V; ω), where S is either S or the set of pseudo points, and V and ω correspond to the edges between either the observed locations or the pseudo grid points. The rest of the algorithm is simple: in the j-th iteration, we first compute the MST T_j from the j-th connectivity graph G_j = (S, V; ω_j), and then remove edges from T_j to obtain the corresponding spatial regions. We implemented our method using Matlab (MATLAB, 2018).

Simulation and Real Data Examples
We demonstrate our methodology with simulated and real-world examples. Specifically, we use our method to create areal units from point-level data and compute the regionalization error over the units.

Simulation Example: Two-Dimensional Gaussian Process
Consider simulated data from a Gaussian process over the unit square with mean function m(s) and covariance function c(s_i, s_j); here δ_ij denotes the Kronecker delta appearing in the nugget term of c. We simulate from this process at n = 2500 randomly generated locations. To apply our method, the SKLE of the underlying spatial process was first derived using the model from Section 3.4. In the second stage, we generated gridded pseudo locations of the same size as the dataset. The minimum partition size was chosen as 13 and the number of areal regions was fixed at 80. The MST is first computed from the sample and then pruned to obtain the spatial regionalization. Figure 1 shows the areal units obtained by our method. Note that our algorithm selects a regionalization that ensures the minimum ecological fallacy loss in equation (4). This implies that we can represent the 2,500 data points with 80 areal units while optimally minimizing the possibility of an ecological fallacy. The error map in the bottom row of Figure 1 also shows how the regionalization error is smoothed over the spatial regions (i.e., an estimate of the ecological fallacy criterion).

Figure 2: An example with k-means clustering. We implemented the k-means method to find 10 spatial partitions of the Gaussian process simulated data. The top-left image shows the original data; the other three panels show colors corresponding to cluster labels, numbered 1-10. The top-right image shows the k-means partitions of the sampled process as used in CAGE. In the bottom row, location information is appended to the sampled process to force the soft contiguity constraint: the bottom-left image uses the new feature Ỹ_1(s) = (Y(s); s), and the bottom-right image scales the location information twice, i.e., Ỹ_2(s) = (Y(s); 2s), to strengthen the constraint. It is clear that k-means fails to provide contiguous partitions.
We also implemented the clustering methodology used by Bradley et al. (2017) to compare with our model on these simulated data. We implemented both the kmeans and the spatial agglomerative clustering in this example, illustrating with 10 spatial partitions to facilitate interpretation. Figures 2 and 3 demonstrate that both the kmeans and the agglomerative clustering algorithms fail to maintain regional contiguity. Even appending the soft contiguity constraint does not guarantee geographical contiguity. In contrast, our MST-pruning-based regionalization method does not suffer from these issues.
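The location-appended kmeans comparison can be sketched as follows. The kmeans implementation and the toy process below are illustrative stand-ins (scikit-learn's KMeans could equally be used), not the paper's code; the scaling factor a plays the role of the soft contiguity constraint.

```python
import numpy as np

rng = np.random.default_rng(1)

def kmeans(X, k, iters=50):
    """Plain Lloyd's algorithm (illustrative stand-in for a kmeans library)."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):          # keep center if cluster is empty
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Toy stand-in for the simulated process: locations S and values Y(s).
S = rng.uniform(0.0, 1.0, size=(500, 2))
Y = np.sin(4 * S[:, 0]) + 0.1 * rng.standard_normal(500)

# Location-appended features [Y(s); a*s]; larger a strengthens the soft
# contiguity constraint but still cannot guarantee contiguous regions.
for a in (0.0, 1.0, 2.0):
    X = np.column_stack([Y, a * S])
    labels = kmeans(X, k=10)
```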

Real Data Example: Prediction of Ocean Color
We apply our methodology to partition ocean color observations in the coastal Gulf of Alaska (Leeds et al., 2014; Wikle et al., 2013). Ocean color is a proxy measure of the abundance of phytoplankton in the near-surface ocean. Areas with more phytoplankton appear greener than low-density regions. Hence, regionalization of ocean color provides information on regions of primary productivity in the ocean ecosystem, which is a primary component of the lower trophic levels of the ocean food chain. Many satellites, such as SeaWiFS, MODIS, and MERIS, collect global and local remotely sensed observations of ocean color. See Werdell and McClain (2018) for more information on satellites dedicated to oceanographic data.

Figure 3: An example with soft spatially constrained hierarchical clustering: We implemented a soft hierarchical clustering method similar to that in the R package ClustGeo on the Gaussian process simulated data. The left image in the top row shows the original data. The other 3 images show colors corresponding to cluster labels, numbered from 1-10. The top right image shows the partitions when no location information is appended to the process. The bottom row shows similar regionalizations based on the location-appended process; the hierarchical clustering is performed on these new features to get the regions. From the best regionalization (the bottom right one), it is clear that the regions become more defined due to the stronger scaling of the locations.
We consider SeaWiFS data over the coastal Gulf of Alaska for 12 May 2000. The dataset contains observations at 4,718 spatial coordinates and was analyzed previously by Daw et al. (2022) using the CAGE methodology. However, those authors used additional covariate information from an ocean model (i.e., simulated ocean color output, sea surface temperature, and sea surface height). Moreover, the kmeans-based clustering used in that paper failed to provide spatially contiguous clusters. In our case, we use no covariate information and directly model the response surface using a basis function regression in the first stage, as described above. The MST-pruning-based clustering in the second stage guarantees spatially contiguous partitions. Figure 4 shows the results of applying our approach to these data. We choose the minimum partition size as described in Section 3.4, which leads to a minimum partition size of 24. We consider 140 areal regions in the data, an order-of-magnitude reduction. Note the similarity between the aggregated (areal-level) data and the point-level data, with similar error distributions.

Discussion
In this manuscript, we developed a novel spatial regionalization methodology that seeks to minimize the ecological fallacy across spatial aggregations. Although the ecological fallacy is well studied in the geography and ecology literature, there has been little research on using it to inform the partitioning of spatial domains. Many existing spatial regionalization methodologies suffer from this very problem, since it naturally occurs when changing the resolution of the spatial support. An exception is the CAGE approach of Bradley et al. (2017), which also minimizes the ecological fallacy, yet it is not guaranteed to give spatially contiguous regions and is costly to implement. We also provide a formal definition of the regionalization error based on the ecological fallacy between the point-level and areal-level data. We focus on an explicitly spatial methodology, which ensures that the regions are spatially contiguous with probability 1.
The ecological fallacy is defined using the Karhunen-Loève Expansion (KLE) of the underlying spatial process. The KLE uses the "kernel trick" to derive an optimal set of features from the covariance kernel of the process. Therefore, using the KLE also ensures the same regionalization for any spatial process that follows the same distributional form as the data under consideration. To compute the KLE, we need to estimate the covariance function of the underlying spatial process; any spatial modeling technique can be used for this. Here we used a Bayesian basis function regression model with the O-C basis approach from Bradley et al. (2017). The O-C basis is the orthonormal version of any chosen family of generating basis functions. The KLE is computed by rotating the O-C basis functions by a matrix of eigenvectors that decorrelates the basis expansion coefficients. We use the scaled KLE, obtained by multiplying the KL eigenfunctions by the square roots of the eigenvalues, to get the spatial regions in the second stage.
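The rotate-and-scale step can be sketched in code as follows. The inputs here (an orthonormal basis matrix and a coefficient covariance) are randomly generated stand-ins for the first-stage output, not estimates from a fitted model.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical inputs: Phi holds M O-C basis functions evaluated at n
# locations (columns orthonormal w.r.t. the empirical measure), and Sigma
# is the estimated covariance of the basis expansion coefficients.
n, M = 200, 8
Q, _ = np.linalg.qr(rng.standard_normal((n, M)))
Phi = Q * np.sqrt(n)                    # (1/n) Phi^T Phi = I
A = rng.standard_normal((M, M))
Sigma = A @ A.T                         # a valid coefficient covariance

# Rotate the O-C basis by the eigenvectors of Sigma to decorrelate the
# coefficients, then scale by sqrt(eigenvalues) to get the SKLE features.
lam, E = np.linalg.eigh(Sigma)
lam, E = lam[::-1], E[:, ::-1]          # non-increasing eigenvalue order
Psi = Phi @ E                           # KL eigenfunctions at the n locations
SKLE = Psi * np.sqrt(np.clip(lam, 0.0, None))   # scaled KLE, n x M
```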
Our regionalization method uses the minimum spanning tree (MST) associated with the weighted graph of the spatial data. For this, we first represent the SKLE as a connectivity graph. Spatial locations form its vertex set, and each edge represents a pair of neighbors. The edges carry weights given by the discrepancies between the SKLE features of the locations. The MST is the unique subgraph that connects all the vertices with the minimum total cost, i.e., the sum of its edge weights. Note that the MST also only connects spatial neighbors. We then prune the MST by removing edges to obtain the spatial regions. We constrain the partitions to have a minimum size and keep dividing the MST until a stopping criterion is met. We consider various choices of the minimum partition size and choose the pruned graph with the minimum regionalization error. In this way, we use the optimal features (i.e., the SKLE) in a regionalization that ensures both minimum ecological fallacy and spatial contiguity. It should also be noted that there is a different set of MST partitioning methodologies that minimize different loss functions (e.g., Xu et al., 2002; Lv et al., 2018).
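The pruning step can be sketched as follows. For simplicity, this toy version cuts the heaviest remaining MST edge subject to the minimum-size constraint, rather than the paper's error-driven criterion; the function and example are illustrative only.

```python
import numpy as np
from collections import defaultdict

def prune_mst(n, mst_edges, n_regions, min_size):
    """Greedy pruning sketch: cut the heaviest remaining MST edges, skipping
    any cut that would leave a region smaller than min_size."""
    adj = defaultdict(set)
    for i, j, _ in mst_edges:
        adj[i].add(j); adj[j].add(i)

    def component(start, blocked=None):
        seen, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if blocked in ((u, v), (v, u)):
                    continue
                if v not in seen:
                    seen.add(v); stack.append(v)
        return seen

    cuts = 0
    for i, j, _ in sorted(mst_edges, key=lambda e: -e[2]):  # heaviest first
        if cuts == n_regions - 1:
            break
        side_i = component(i, blocked=(i, j))
        side_j = component(j, blocked=(i, j))
        if len(side_i) >= min_size and len(side_j) >= min_size:
            adj[i].discard(j); adj[j].discard(i)            # cut the edge
            cuts += 1

    labels = np.full(n, -1)                                 # region labels
    region = 0
    for v in range(n):
        if labels[v] == -1:
            labels[list(component(v))] = region
            region += 1
    return labels

# Toy example: a path MST over 12 vertices with one clearly heavy edge.
edges = [(k, k + 1, 1.0) for k in range(11)]
edges[5] = (5, 6, 10.0)                 # heaviest edge splits the path
labels = prune_mst(12, edges, n_regions=2, min_size=3)
```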
Despite the promise of our methodology, there are several areas for future research and application. For example, the method can easily extend to temporal, spatio-temporal, and other general stochastic processes. Furthermore, it will be interesting to see the behavior of the KLE-based regionalization for different types of observed variables. Global pruning of the MST is also a known difficulty due to the combinatorial complexity at every iteration. We also plan to derive an information-based hypothesis testing procedure to facilitate a more statistically sound stopping criterion.
Another important direction is application to large datasets. First, large datasets require dealing with a massive number of eigenvectors. Although a low-rank analysis is possible by considering only the first few SKLE components, it is known not to capture small-scale structures. Another source of difficulty is the presence of outliers, since both the KLE estimation and the MST are highly affected by them. Therefore, very large spatial datasets are naturally difficult to handle with this approach. Future work will attempt to make the procedure more robust and computationally friendly for such applications.

Supplementary Material
The supplementary material includes the following files: (1) README: a brief explanation of all the files in the supplementary material; (2) The synthetic dataset; (3) The real-world dataset; (4) Code files; (5) Images used in the paper; (6) A miscellaneous example of KLE computation directly from covariance matrices.

A Pseudocode for O-C Basis Calculation
We provide the pseudocode for the O-C basis functions in this section. O-C basis functions are useful when we model the data using generating basis functions (GBFs). The O-C approach simply finds the orthonormal version of the chosen GBFs. The KLE is then computed by multiplying the O-C basis by a rotation matrix that ensures a diagonal covariance among the expansion coefficients.
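A numerical sketch of this orthonormalization, assuming hypothetical Gaussian GBFs on [0, 1] and simple grid-based quadrature (the paper's Algorithm 1 may differ in details): the Gram matrix W of the GBFs is computed, and the O-C basis is Φ = G W^{-1/2}, so that Q^T W Q = I with Q = W^{-1/2}.

```python
import numpy as np

# Hypothetical GBFs: M Gaussian radial basis functions on [0, 1] (the paper
# allows any generating family); integrals are approximated on a fine grid.
M, ngrid = 6, 2000
centers = np.linspace(0.1, 0.9, M)
grid = np.linspace(0.0, 1.0, ngrid)
ds = grid[1] - grid[0]
G = np.exp(-0.5 * ((grid[:, None] - centers[None, :]) / 0.08) ** 2)

# Gram matrix W_ij = int_S phi_i(s) phi_j(s) ds, via the quadrature rule.
W = G.T @ G * ds

# Q = W^{-1/2} from the eigendecomposition of W; Phi = G Q is then the
# orthonormal (O-C) version of the generating basis.
w_vals, w_vecs = np.linalg.eigh(W)
Q = w_vecs @ np.diag(1.0 / np.sqrt(w_vals)) @ w_vecs.T
Phi = G @ Q
```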

B Proof of Correctness of Algorithm 2
First, we prove that the O-C basis functions from Algorithm 1 are orthonormal. Define F ∈ ℝ^{M×M} to be the matrix with (i, j)-th element f_ij = ∫_S φ_i(s) φ_j(s) ds. Then it is straightforward to see that F = Q^T W Q = I. Therefore, Φ is an orthonormal basis from the chosen GBF family G. Now, note that step 2 in Algorithm 2 is the same as the two-step definition Z = Y + ε; Y = Φγ. From step 3, the posterior covariance matrix of γ is Σ. Then, using the eigendecomposition Σ = E Λ E^T in step 5, we have the following: cov(Y) = Φ Σ Φ^T = Φ E Λ E^T Φ^T = Ψ Λ Ψ^T, where Ψ = Φ E. Since both Φ and E are orthonormal, Ψ is also orthonormal, and Λ is a diagonal matrix of non-increasing eigenvalues. Hence, (Ψ, Λ) is the eigendecomposition of the covariance kernel C = cov(Y). Thus, the ψ_j(·) are the Karhunen-Loève basis functions and the λ_j are the corresponding eigenvalues.

C Pseudocode for Finding the MST
In this section, we provide the pseudocode for constructing the MST. We used Kruskal's algorithm (Kruskal, 1956) in our study. Both Kruskal's and Prim's (Jarník, 1930; Prim, 1957) algorithms are greedy methods for constructing the MST from a connectivity graph. Both run with a computational cost of O(|E| log |V|), which in our case is O(N log N), since we consider O(1) neighbors for each spatial location. Kruskal's algorithm starts with the least-weighted edge and adds edges iteratively, whereas Prim's algorithm starts from a randomly chosen location and repeatedly adds the least costly edge leaving the current tree.
Algorithm 3: MST construction using Kruskal's algorithm.
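An illustrative Python version of Kruskal's algorithm with a union-find structure (a sketch, not the paper's Algorithm 3 pseudocode):

```python
def kruskal_mst(n_vertices, edges):
    """Kruskal's algorithm with union-find (path compression + union by rank).
    edges: iterable of (i, j, weight) tuples."""
    parent = list(range(n_vertices))
    rank = [0] * n_vertices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return False                    # edge would close a cycle
        if rank[ra] < rank[rb]:
            ra, rb = rb, ra
        parent[rb] = ra
        rank[ra] += rank[ra] == rank[rb]
        return True

    mst = []
    for i, j, w in sorted(edges, key=lambda e: e[2]):   # lightest edge first
        if union(i, j):
            mst.append((i, j, w))
    return mst

# Small usage example on a 4-vertex graph.
edges = [(0, 1, 1.0), (1, 2, 2.0), (0, 2, 3.0), (2, 3, 1.5)]
tree = kruskal_mst(4, edges)
```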

D General Methodology for KLE Estimation
In this section, we describe how to compute the KLE for any choice of model. Suppose that, under the assumptions of Section 2, one uses a spatial model of the following form:

Y(s) = f(s).
Here f(·) can be any common spatial model, such as a Gaussian process (Rasmussen, 2003), Fixed Rank Kriging (Cressie and Johannesson, 2008), a Vecchia process (Vecchia, 1988), etc. After fitting the model, we estimate the parameters of f, which gives us an estimate Ĉ of the covariance kernel. Then, using the routine in Algorithm 5, one can directly estimate the KLE from Ĉ.
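Algorithm 5 is not reproduced in this excerpt; a common Nyström-style discretization that estimates the KLE directly from a fitted covariance kernel on a grid might look like the following (the kernel used in the example is a hypothetical stand-in for Ĉ):

```python
import numpy as np

def kle_from_covariance(cov_fn, grid):
    """Estimate KL eigenvalues/eigenfunctions from a covariance kernel by
    eigendecomposing its quadrature-weighted Gram matrix on a grid."""
    ds = 1.0 / len(grid)                         # uniform quadrature weight
    C = cov_fn(grid[:, None], grid[None, :])     # covariance Gram matrix
    lam, vecs = np.linalg.eigh(C * ds)
    lam, vecs = lam[::-1], vecs[:, ::-1]         # non-increasing order
    psi = vecs / np.sqrt(ds)                     # eigenfunctions on the grid
    return lam, psi

# Example with a hypothetical exponential kernel on [0, 1].
grid = np.linspace(0.0, 1.0, 300)
lam, psi = kle_from_covariance(lambda s, t: np.exp(-np.abs(s - t) / 0.3), grid)
```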