Geostatistics for large datasets on Riemannian manifolds: a matrix-free approach

Large or very large spatial (and spatio-temporal) datasets have become commonplace in many environmental and climate studies. These data are often collected in non-Euclidean spaces (such as the planet Earth) and they often present nonstationary anisotropies. This paper proposes a generic approach to model Gaussian Random Fields (GRFs) on compact Riemannian manifolds that bridges the gap between existing works on nonstationary GRFs and random fields on manifolds. This approach can be applied to any smooth compact manifold, and in particular to any compact surface. By defining a Riemannian metric that accounts for the preferential directions of correlation, our approach yields an interpretation of the "local anisotropies" as resulting from "local" deformations of the domain. We provide scalable algorithms for the estimation of the parameters and for optimal prediction by kriging and simulation, able to tackle very large grids. Stationary and nonstationary illustrations are provided.


Introduction
Large or very large spatial (and spatio-temporal) datasets have become commonplace in many environmental and climate studies. Various approaches have been proposed by the statistical community to tackle the "big N" challenge in ways that properly acknowledge the spatial and spatio-temporal dependence structures usually observed in these datasets. Recently, several competitions have been organized to compare methods and algorithms in this context (Heaton et al., 2019, Huang et al., 2021), which provide an excellent overview of state-of-the-art methods for analyzing large spatial datasets. Most algorithms rely on approximation methods for Gaussian Processes, also known as Gaussian Random Fields (GRFs). In their conclusion, the organizers of the first competition note that "spatial data may exhibit anisotropy, nonstationarity, large and small range spatial dependence as well". Despite its long history, modeling nonstationary spatial datasets remains a challenge. Sampson and Guttorp (1992) proposed a space deformation approach, further developed for instance in Perrin and Senoussi (2000) and Fouedjio et al. (2015). The main difficulty with this approach is to estimate a valid, global deformation of the domain, which in practice is not guaranteed to exist. Paciorek and Schervish (2006) (see also Fouedjio et al. (2016)) introduced a class of nonstationary covariance functions based on the kernel convolution approach of Higdon et al. (1999), later generalized in Porcu et al. (2009). Our approach rather builds on the so-called "SPDE approach" introduced in the seminal work of Lindgren et al. (2011), which relates GRFs characterized by Matérn covariance functions to stationary solutions of a specific Stochastic Partial Differential Equation (SPDE). This approach can be extended to the nonstationary case by allowing spatially varying coefficients of the SPDE, see for instance Fuglstad et al. (2015a;b) among other possible references. Another challenge often faced when analyzing environmental or climate data is to work on non-Euclidean domains. In particular, methods for analyzing data on spheres have received a lot of attention, see Marinucci and Peccati (2011), Jeong et al. (2017) and Porcu et al. (2021) for recent reviews, and Rayner et al. (2020) for a recent application using the SPDE approach dealing with extremely large datasets. Methods developed for spheres usually depend on specific properties, such as expansions into spherical harmonics (Emery and Porcu, 2019, Lang and Schwab, 2015, Lantuéjoul et al., 2019) or the use of arc distances to define valid covariance models (Gneiting, 2013, Huang et al., 2011).
This paper aims at bridging the gap between existing works on nonstationary GRFs and random fields on manifolds. Specifically, we propose a generic approach to model GRFs on compact Riemannian manifolds and we provide scalable algorithms for their optimal prediction by kriging and simulation. Our approach is based on two main ingredients. First, random fields are defined through expansions in the eigenfunctions of the Laplace-Beltrami operator on the Riemannian manifold, which are, in some cases, solutions to some particular SPDE. Then, we build a finite element approximation of these GRFs. This construction allows us to perform optimal prediction, simulation (including conditional simulation) and estimation of the parameters using scalable algorithms.
For this purpose, we define a Riemannian metric that accounts for the preferential directions of correlation of the nonstationary GRF. This method yields an interpretation of the "local anisotropies" as resulting from "local" deformations of the domain, in striking contrast to both the space deformation and the kernel convolution approaches. The resulting fields can be seen as a direct generalization of the construction for nonstationary random fields proposed in Fuglstad et al. (2015a).
Our approach can be applied to any smooth compact manifold, and in particular to any compact surface or hypersurface. It shares clear similarities with the work of Borovitskiy et al. (2020) but, in contrast, is not restricted to Whittle-Matérn fields: in our approach, the GRF is characterized by its spectral density, whose inverse is only restricted to belong to the family of positive polynomials. Besides, our approach does not rely on the explicit computation of the eigenfunctions and eigenvalues of the Laplace-Beltrami operator, and can be seen as providing a theoretical motivation for the method developed in Borovitskiy et al. (2021) to deal with Gaussian processes on graphs.
The flexibility of our approach does not result in increased computational costs. Indeed, we show how prediction and conditional simulations can be performed through a so-called "matrix-free" approach. This approach, unlike classical geostatistical algorithms, does not require building and storing possibly large covariance matrices, but instead relies only on products between sparse matrices and vectors. This in turn ensures the scalability of the method, thus paving the way to efficient nonstationary geostatistics for large datasets. We illustrate our approach with 2D and 3D synthetic examples, on grids with more than 10^7 nodes in the 3D case.
The organization of the paper is the following. The GRF model is presented in Section 2, along with its finite element approximation and covariance function. Kriging and simulation algorithms are provided in Section 3. Section 4 shows how the parameters can be estimated using maximum likelihood. All methods are summarized as Algorithms. Stationary and nonstationary illustrations are then provided in Section 5. We conclude with some final words in Section 6. Proofs and technical details are deferred to the Appendix. Throughout this paper, vectors and matrices are denoted in bold fonts. The superscript T denotes the transpose operation on matrices or vectors. Diag is the operator that transforms a vector of length n into an n × n matrix whose diagonal elements are those of the vector and whose off-diagonal elements are 0. I_p is the p × p identity matrix, |A| is the determinant of the square matrix A, and ‖·‖ denotes the Euclidean norm.
Random Fields on Riemannian Manifolds

Definition and Finite Element Approximation
A generic approach to define and characterize GRFs on Riemannian manifolds has been proposed in Borovitskiy et al. (2020) and in Lang and Pereira (2021), and is now briefly recalled. Let D be a smooth compact manifold of dimension d equipped with a Riemannian metric g, and let f : R_+ → R_+ be a function such that for some β > d/2, |λ^β f(λ)| is bounded as λ → +∞. A centered GRF Z is constructed on the resulting Riemannian manifold (D, g) through the expansion

Z = Σ_{k∈N} f^{1/2}(λ_k) w_k e_k,    (1)

where f^{1/2} : R_+ → R is a function such that (f^{1/2})² = f on R_+, {w_k}_{k∈N} is a sequence of independent standard Gaussian variables, {λ_k}_{k∈N} denotes the set of eigenvalues of the Laplace-Beltrami operator −∆ on (D, g), and {e_k}_{k∈N} denotes the associated eigenfunctions. To get a feeling of what Eq. (1) represents, one can recall that on R^d the eigenfunctions of the Laplacian are all the members of the uncountable family of functions {e^{−i⟨ω,x⟩} : ω ∈ R^d}. In this case, the Whittle-Matérn random fields are solutions to the SPDE (κ² − ∆)^{α/2} Z = W, where W is the white noise process on R^d, and the associated square-root spectral density is f^{1/2}(‖ω‖²) = (κ² + ‖ω‖²)^{−α/2}. Going back to our definition of GRFs on compact manifolds, the eigenvalues and eigenfunctions of the Laplace-Beltrami operator are not known in general. A first approach, proposed by Borovitskiy et al. (2020), consists of computing them numerically by solving (approximately) the corresponding eigenvalue problems. Instead, we approximate Z using a finite element approach, as in Lindgren et al. (2011) and Lang and Pereira (2021), since this method naturally yields scalable algorithms for prediction and sampling tasks. First, the manifold D is triangulated using n nodes s_1, ..., s_n ∈ D, and a family of compactly supported approximation functions {ψ_i}_{1≤i≤n} is defined over D, where each ψ_i is the piecewise linear function equal to 1 at the node s_i and 0 at all the other nodes. Then, Z is approximated by the linear combination

Z̃ = Σ_{i=1}^n z_i ψ_i,    (2)

where for any i ∈ {1, ..., n}, z_i = Z̃(s_i) is the weight associated with the basis function ψ_i.
The weights Z = (z_1, ..., z_n)^T are chosen so that Z̃ can also be written using the same expansion as the one defining the original field Z in (1), but replacing the eigenfunctions {e_k}_{k∈N} and eigenvalues {λ_k}_{k∈N} of −∆ by those of its Galerkin approximation (see Appendix D for more details). This particular choice yields an explicit formula to compute these weights, see Proposition 2.1 below. We first introduce C and F, the mass and stiffness matrices, respectively defined by

[C]_{ij} = (ψ_i, ψ_j),  [F]_{ij} = (∇ψ_i, ∇ψ_j),  1 ≤ i, j ≤ n,    (3)

where (·, ·) denotes the L² inner product on the Riemannian manifold (see Appendix C). Note that since each basis function ψ_i (1 ≤ i ≤ n) is zero at every node of the triangulation except one, the resulting matrices C and F are sparse. Let then √C ∈ R^{n×n} be a matrix such that √C(√C)^T = C, and let S be the matrix defined by

S = (√C)^{−1} F (√C)^{−T}.    (4)

Note that since S is real, symmetric and positive semi-definite, it is diagonalizable with non-negative eigenvalues. Therefore it can be written as S = V Diag(λ_1, ..., λ_n) V^T, where λ_1, ..., λ_n denote the eigenvalues of S and V is an orthogonal matrix whose columns are eigenvectors of S.
Proposition 2.1. Let Z be the vector of weights as in Eq. (2). Then the vector Z is a centered Gaussian vector with covariance matrix Σ given by

Σ = (√C)^{−T} f(S) (√C)^{−1},    (5)

where f(S) is a so-called matrix function, defined from the eigendecomposition of S as f(S) = V Diag(f(λ_1), ..., f(λ_n)) V^T.

This result is shown in Pereira (2019) and Lang and Pereira (2021). In the latter reference, a convergence result for the approximation of Z as the mesh size of the triangulation decreases is provided, thus further justifying this approach.
The matrix square-root √C can be computed using matrix functions or through a Cholesky decomposition. In practice however, √C is replaced by the so-called mass lumping approximation, defined as the diagonal matrix with entries

[√C]_{ii} = ( Σ_{j=1}^n [C]_{ij} )^{1/2},  1 ≤ i ≤ n.    (6)

To ease the notations, but at the cost of a slight abuse of notation, the mass lumping approximation will also be denoted √C in the remainder of this text. As shown in Lindgren et al. (2011), this approximation comes with a negligible effect on the covariance of the resulting random field. It gives direct access to the inverse of the square-root matrix √C and yields

S = (√C)^{−1} F (√C)^{−1},    (7)

which ensures that the matrix S is also sparse.
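As an illustration, the lumped square-root mass matrix and the resulting sparse matrix S can be assembled in a few lines. The following is a minimal Python sketch (with hypothetical helper names), assuming the lumped √C is the diagonal matrix of square roots of the row sums of C, and that S takes the form (√C)^{−1} F (√C)^{−1}:

```python
import numpy as np
import scipy.sparse as sp

def lumped_sqrt_mass(C):
    """Mass lumping: replace C by the diagonal matrix of its row sums,
    and return the (diagonal) square root of that matrix."""
    row_sums = np.asarray(C.sum(axis=1)).ravel()
    return sp.diags(np.sqrt(row_sums))

def scaled_stiffness(C, F):
    """Sparse matrix S = sqrt(C)^{-1} F sqrt(C)^{-1}: the lumped sqrt(C)
    is diagonal, so its inverse is cheap and S keeps the sparsity of F."""
    sqrtC = lumped_sqrt_mass(C)
    inv = sp.diags(1.0 / sqrtC.diagonal())
    return inv @ F @ inv
```

Because the lumped √C is diagonal, building S costs only two diagonal scalings of the sparse stiffness matrix F.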
The covariance matrix Σ in Eq. (5) involves a matrix function, defined above through the eigendecomposition of S, which is notoriously computationally expensive. To avoid this, two particular cases can be considered. If the function f is approximated by a polynomial, the resulting matrix function becomes a matrix polynomial, which can be computed without involving eigendecompositions. This is the rationale behind the Galerkin-Chebyshev approach proposed in Lang and Pereira (2021), where the function f is replaced by its Chebyshev polynomial approximation over an interval containing the eigenvalues of S.
We propose here an alternative approach in the spirit of Lindgren et al. (2011) and Rue and Held (2005). We assume that f is the inverse of a polynomial P taking positive values over R_+, i.e. f = 1/P. The resulting precision matrix Q of the weights can then be expressed as

Q = Σ^{−1} = √C P(S) (√C)^T,    (8)

which again involves a matrix polynomial instead of a matrix function. Computing the matrix Q can then be done by summing iterates of the matrix S, resulting in a matrix whose sparsity depends on the degree of P: the higher the degree of P, the less sparse Q is. This approach, which we refer to as a "matrix-free" approach, will be adopted in the algorithms presented in Section 3 for prediction and simulation.
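The product of Q with a vector can then be evaluated with a Horner scheme, using only sparse matrix-vector products. Below is a minimal sketch, assuming Q factorizes as √C P(S) √C with a symmetric (lumped, diagonal) √C; `coeffs` lists the coefficients of P in increasing degree:

```python
def precision_matvec(v, sqrtC, S, coeffs):
    """Compute Q v = sqrtC * P(S) * sqrtC * v without ever forming Q.
    P(lambda) = coeffs[0] + coeffs[1]*lambda + ... is applied to S via
    a Horner scheme, so only sparse matrix-vector products are used."""
    x = sqrtC @ v
    # Horner: P(S) x = (c_0 I + S (c_1 I + S (... c_K I))) x
    y = coeffs[-1] * x
    for c in reversed(coeffs[:-1]):
        y = S @ y + c * x
    return sqrtC @ y
```

The cost is one product with S per coefficient of P, so a degree-K polynomial costs K sparse matrix-vector products plus two diagonal scalings.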

Stationary and Approximately Stationary Covariance Functions
Let us consider the GRF Z defined by Eq. (1) on some compact Riemannian manifold (D, g). It is straightforward to show that its covariance function C_Z can be written as

C_Z(x, x') = Σ_{k∈N} f(λ_k) e_k(x) e_k(x'),  x, x' ∈ D.    (9)

In the particular case where (D, g) is a Euclidean domain equipped with the usual metric, Solin and Särkkä (2020) show that C_Z approximates the covariance of a random field on D with radial spectral density f^{1/2} (defined in Eq. (1)). They also provide a uniform bound on the error between the actual covariance function of Z and the covariance function associated with f^{1/2}, which shows that the approximation improves as we move further away from the boundary of D. Hence, in this case, Z approximates an isotropic GRF whose covariance function C is given by the inverse Fourier transform of ω ↦ f^{1/2}(‖ω‖).
More generally, expansions similar to (9) have been used to characterize the covariance of random fields on (Riemannian) manifolds. For instance, Lang and Schwab (2015) use such expansions to describe covariance functions of random fields on the sphere (endowed with its usual metric), and show an explicit link between the regularity of the resulting field and the decay of the sequence {f^{1/2}(λ_k)}_{k∈N}. On general compact Riemannian manifolds, Borovitskiy et al. (2020) characterize their "Matérn Gaussian processes in the sense of Whittle" through covariance functions of the form (9), by taking f^{1/2} to be the spectral density of the usual Matérn covariance function (i.e. as defined for isotropic random fields on R^d). Hence, the field Z defined by (1), with covariance function given by (9), can be seen as the counterpart, on the Riemannian manifold (D, g), of the random fields with radial spectral density f^{1/2} on R^d whose covariance function is given by the inverse Fourier transform of f^{1/2}. Examples of sampled GRFs with Matérn covariance on different domains are presented in Figure 1.

Nonstationary Covariances
The general construction of random fields on Riemannian manifolds presented in the previous section can be used to define nonstationary models of GRFs, and in particular fields that exhibit local anisotropies. Such fields are defined on Euclidean domains of dimension d ∈ {2, 3} as follows: around each point of the domain, there is a preferential direction along which the range of highly correlated values is maximal, whereas it is minimal in the orthogonal direction(s). The angles defining the preferential directions are called anisotropy angles, and the sizes of the ranges are called anisotropy ranges. These anisotropy parameters can be graphically represented by an ellipse/ellipsoid whose axes' lengths and directions are respectively given by the anisotropy ranges and angles. Following the approach described in Pereira (2019), a GRF with local anisotropies on some bounded Euclidean domain D can be built by defining a GRF on a specific Riemannian manifold: anisotropy angles and ranges can be used to define a metric tensor at each point of D. In other words, at each p ∈ D, the metric is chosen so that it locally "deforms" D into a local domain where the anisotropy reduces to isotropy, thanks to the composition of a rotation and a scaling that would turn an ellipse/ellipsoid into a circle/sphere (see Figure 2). This transformation results in a metric defined from D(p), the diagonal matrix whose entries are the anisotropy ranges at p ∈ D, and R(p), the rotation matrix defined from the anisotropy angles at p. The covariance properties of the field in the original domain D equipped with this metric can then be examined by comparing the values of the field at p and at p + h, where h ∈ R^d is some infinitesimal displacement vector around p.
It is then straightforward to check that such a covariance locally reproduces the desired anisotropy properties around p (see Chilès and Delfiner (2012) for details). An example of the type of nonstationary fields that can be sampled using this method is presented in Figure 3. In conclusion, given a compact Euclidean domain D and a field of anisotropy parameters on D, nonstationary random fields with the corresponding anisotropy properties can be defined by applying the approach described in Section 2.1 to a tailored Riemannian manifold, namely D equipped with the metric defined above. Note that similar ideas could be applied to define random fields with varying covariance structure on more general surfaces, provided one can define coherent fields of anisotropy parameters on such surfaces.
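To make the construction concrete, the metric tensor at a point can be assembled from the local anisotropy parameters. The sketch below assumes the 2D convention that the local map h ↦ D(p)^{−1}R(p)^T h turns the anisotropy ellipse into a circle, so that the metric reads R(p)D(p)^{−2}R(p)^T; other orderings of the rotation and scaling are possible and this convention is an assumption, not the paper's stated formula:

```python
import numpy as np

def anisotropy_metric(angle, ranges):
    """Metric tensor at a point of a 2D domain, built from the local
    anisotropy angle and ranges. Hypothetical convention: the local map
    h -> D^{-1} R^T h maps the anisotropy ellipse to a circle, which
    corresponds to the metric g = R D^{-2} R^T."""
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s], [s, c]])              # rotation to principal axes
    Dinv2 = np.diag(1.0 / np.asarray(ranges) ** 2)  # squared inverse ranges
    return R @ Dinv2 @ R.T
```

With equal ranges the metric reduces to a multiple of the identity, recovering isotropy regardless of the angle, as expected.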

Prediction on Riemannian Manifolds
We now show how the construction presented in Section 2 provides efficient prediction algorithms in a quite general setting, which includes nonstationary covariances, non-Euclidean supports and non-Matérn covariance functions. Given some spatial domain D, we assume that we observe some real-valued variable Y at p ≥ 1 locations x_1, ..., x_p ∈ D. These observations are modeled as

Y(x_i) = Z(x_i) + τ ε_i,  i ∈ {1, ..., p},    (12)

where ε_1, ..., ε_p are independent standard Gaussian variables, τ > 0, and Z denotes some GRF on D acting as a latent variable. Hence, they can be seen as observations of the latent field Z affected by some independent centered Gaussian noise with variance τ².
We aim at making the Best Linear Unbiased Prediction (BLUP) of the variable Z at q locations x_{p+1}, ..., x_{p+q} ∈ D. Under a Gaussian assumption, recall that the BLUP is equal to the conditional expectation (Tong, 2012), i.e. the optimal prediction in an L²-sense. In the geostatistical literature, this prediction is referred to as kriging (Chilès and Delfiner, 2012). In most geostatistical approaches, the GRF is either characterized by a covariance function or by a precision matrix. Here, in contrast, the GRF is defined on a triangulation of D and is characterized by a positive polynomial P as per Eq. (8). In this section, to derive the kriging algorithm, we first suppose that this polynomial is known. We will show in the next section how τ² and the coefficients of P can be estimated from a single realization of the vector of observations Y = (Y(x_1), ..., Y(x_p))^T.
Note that the approach we present here readily generalizes to the case where covariates are added to the model. In that case, the vector of observations Y should contain the residuals obtained after removing from the data points the trend defined by the covariates.

A "Matrix Free" Kriging Algorithm
We start with some triangulation of D with n nodes s_1, ..., s_n ∈ D, and Z is approximated by its finite element approximation Z̃ associated with this triangulation, as shown in Section 2. Following the interpolation rule in Eq. (2), the values of the field Z̃ at the observed locations x_1, ..., x_p can be expressed as linear combinations of the values taken at the triangulation nodes s_1, ..., s_n. The vector of observations Y can thus be written as

Y = M_D Z + τ ε,    (13)

where Z = (Z̃(s_1), ..., Z̃(s_n))^T, ε = (ε_1, ..., ε_p)^T and M_D ∈ R^{p×n} is the so-called design matrix containing the interpolation weights, defined by

[M_D]_{ij} = ψ_j(x_i),  1 ≤ i ≤ p, 1 ≤ j ≤ n.    (14)

The next proposition provides an analytic expression for the kriging predictors Z*(x_{p+1}), ..., Z*(x_{p+q}) at the target locations.

Proposition 3.1. The conditional distribution of Z given Y is that of a Gaussian vector with mean E[Z|Y] and covariance matrix Cov[Z|Y] given by

E[Z|Y] = (τ²Q + M_D^T M_D)^{−1} M_D^T Y    (15)

and

Cov[Z|Y] = τ² (τ²Q + M_D^T M_D)^{−1},    (16)

where Σ is the covariance matrix of Z, Q = Σ^{−1} is its precision matrix, and τ² and M_D are defined in Eq. (13). Then, the vector of kriging predictors Z* = (Z*(x_{p+1}), ..., Z*(x_{p+q}))^T can be written as

Z* = M_T E[Z|Y],    (17)

where M_T is the target design matrix defined by [M_T]_{ij} = ψ_j(x_{p+i}), 1 ≤ i ≤ q, 1 ≤ j ≤ n.

We have assumed that the function f characterizing the field Z is the inverse of P, a positive polynomial on R_+. This implies in particular that the precision matrix Q in (15) can be expressed as in Eq. (8). Assuming that P is known, the kriging predictors Z* in (17) can thus be computed using Algorithm 1.
Algorithm 1: Computation of the kriging predictors Z*.
1: Solve for X the linear system

(τ²Q + M_D^T M_D) X = M_D^T Y.    (19)

2: Return Z* = M_T X.

Note that all the matrices √C, S, M_D and M_T are sparse, and that the matrix Q is also sparse when P has a low degree. A classical approach for solving the linear system in Algorithm 1 consists in first building and storing the matrix A = τ²Q + M_D^T M_D, and then using a method for solving sparse linear systems. For instance, one could use so-called sparse direct solvers. Such solvers start by factorizing A into sparse triangular factors (using LU or Cholesky decompositions) and then solve the resulting triangular systems. For large A, the Cholesky decomposition can be computed exactly, or very accurately using Tile Low Rank approximations with the ExaGeoStat software (Abdulah et al., 2018). Note however that in this case the matrix A needs to be built and stored.
When these direct approaches are not feasible due to the size of the problem, an alternative approach using iterative solvers must be used (Nocedal and Wright, 2006). Such solvers only rely on products between the matrix A and vectors, and therefore can be used to approximately solve the linear system in Algorithm 1 without effectively requiring to build and store A: only a routine performing products between A and vectors is needed. Such products in turn only require products between the sparse matrices √C, S and M_D and vectors (see Algorithm 2). This yields a so-called "matrix-free" approach to kriging, in the sense that it does not require to explicitly build and store the precision matrix Q of the field.
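A sketch of this matrix-free kriging step in Python: a conjugate gradient solver only sees the action of A = τ²Q + M_D^T M_D on vectors, through a user-supplied routine `Q_matvec` (e.g. a Horner scheme as in Algorithm 2) standing in for the product with Q, which is never formed:

```python
from scipy.sparse.linalg import LinearOperator, cg

def matrix_free_kriging(Y, M_D, M_T, Q_matvec, tau2, n):
    """Matrix-free kriging sketch: solve (tau^2 Q + M_D^T M_D) x = M_D^T Y
    with conjugate gradients, then interpolate to the targets with M_T.
    Q_matvec(v) must return Q v using only sparse products."""
    def Av(v):
        return tau2 * Q_matvec(v) + M_D.T @ (M_D @ v)

    A = LinearOperator((n, n), matvec=Av, dtype=float)
    x, info = cg(A, M_D.T @ Y)
    if info != 0:
        raise RuntimeError("conjugate gradients did not converge")
    return M_T @ x
```

As noted in the text, a preconditioner (or simply a larger noise variance τ²) improves the conditioning of A and hence the convergence speed of the conjugate gradient iterations.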

Conditional Simulations
Generating samples from the conditional distribution of Z (or rather of its approximation Z̃) given Y is now a straightforward task. Since the distribution of Z̃ is entirely specified through the distribution of its weights Z, this amounts to sampling from the conditional distribution of Z given Y, which will be denoted π_{Z|Y}. Recall from Proposition 3.1 that π_{Z|Y} is a multivariate Gaussian distribution with mean E[Z|Y] and covariance matrix Cov[Z|Y] given by Eqs. (15) and (16). Hence, sampling from the conditional distribution π_{Z|Y} is straightforward, as shown in Algorithm 3.

Algorithm 2: Product between a matrix (αD P(B) D^T + M^T M) and a vector v.
Require: Coefficients c_0, ..., c_K of the polynomial P; sparse matrices D, B and M; scalar α > 0
1: Compute x ← M^T (M v)
2: Compute w ← D^T v
3: Set y ← c_K w
4: for k = K − 1, ..., 0 do
5: Compute y ← c_k w + B y
6: end for
7: Return αDy + x
Algorithm 3: Conditional simulation of the vector Z given the observations Y .
Require: Conditional mean E[Z|Y] defined in Eq. (15).
1: Sample a centered Gaussian vector X with covariance matrix Cov[Z|Y] defined in Eq. (16) (e.g. using Algorithm 4).
2: Return E[Z|Y] + X.

The conditional mean E[Z|Y] required by Algorithm 3 is computed using the same methods as those described in Section 3 to compute the kriging predictors.
Since an explicit formula is available for Cov[Z|Y] in Eq. (16), sampling the Gaussian vector of the first step of Algorithm 3 could be done directly by finding a factor L such that Cov[Z|Y] = LL^T and then returning the product LW, where W is a vector of independent standard Gaussian variables. Possible candidates for L include the Cholesky factor of Cov[Z|Y], but also the matrix function h(Cov[Z|Y]) where h is the square-root function, and the matrix function h(Cov[Z|Y]^{−1}) where h is the inverse square-root function. In practice, and as before, both matrix functions could be approximated by matrix polynomials, and the "matrix-free" algorithms presented above could be used.
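Such polynomial approximations of matrix functions can be applied in a matrix-free fashion through the Chebyshev three-term recurrence. The sketch below approximates h(S)v for a symmetric positive semi-definite matrix with spectrum contained in [0, λ_max], using only a routine for the product Sv; the function, degree and spectral interval are user inputs, and the helper name is hypothetical:

```python
import numpy as np
from numpy.polynomial.chebyshev import Chebyshev

def chebyshev_matfunc_apply(h, S_matvec, v, lam_max, degree):
    """Approximate h(S) v for symmetric S with spectrum in [0, lam_max],
    using a degree-`degree` Chebyshev interpolant of h and only products
    S v (via the three-term recurrence for Chebyshev polynomials)."""
    # Interpolate h on [0, lam_max], expressed on the reference interval [-1, 1].
    p = Chebyshev.interpolate(lambda t: h((t + 1.0) * lam_max / 2.0),
                              degree, domain=[-1, 1])
    c = p.coef
    # A = (2 / lam_max) S - I maps the spectrum of S into [-1, 1].
    def Av(x):
        return (2.0 / lam_max) * S_matvec(x) - x
    # Recurrence: T_0 v = v, T_1 v = A v, T_k v = 2 A T_{k-1} v - T_{k-2} v.
    t_prev, t_curr = v, Av(v)
    out = c[0] * t_prev + (c[1] * t_curr if degree >= 1 else 0.0)
    for k in range(2, degree + 1):
        t_prev, t_curr = t_curr, 2.0 * Av(t_curr) - t_prev
        out = out + c[k] * t_curr
    return out
```

The cost is one product with S per degree of the interpolant, which is exactly the regime exploited by the matrix-free algorithms of this section.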
Another method to sample the Gaussian vector of the first step in Algorithm 3 stems from the fact that the vector Z − E[Z|Y] is such a vector, and that the expression of Cov[Z|Y] in Eq. (16) then suggests the procedure described in Algorithm 4. In particular, the first step of Algorithm 4 can be carried out by solving for X the linear system BX = W, where W is a vector with independent standard Gaussian entries and B is a square-root of Q (e.g. the Cholesky factor of Q or the matrix (√C)√(P(S))).
In practice, the matrix functions appearing above are once again polynomially approximated. More details on step 1 of Algorithm 4 can be found in Pereira and Desassis (2019).
Algorithm 4: Sampling a centered Gaussian vector with covariance matrix Cov[Z|Y].
Require: Matrix M_D and parameter τ defining the observations Y in Eq. (13).
1: Sample a centered Gaussian vector Z' with covariance matrix Σ (or, equivalently, precision matrix Q).
2: Sample a vector ε' with independent standard Gaussian entries.
3: Compute E' := E[Z|Y'], where Y' = M_D Z' + τ ε', by replacing Y by Y' in Eq. (15).
4: Return Z' − E'.

Estimation of the Parameters
In order to apply the kriging and simulation procedures presented above, one needs to know the polynomial P characterizing the field Z, as well as the parameter τ² defining the variance of the Gaussian noise. Usually, these parameters are not known and must be estimated from the observations Y. We now show how a maximum likelihood approach can be efficiently implemented.
Let us denote by θ the vector of parameters containing the coefficients of P and the variance τ². Following Eq. (13), the vector Y is a centered Gaussian vector with covariance matrix Σ_Y(θ) = τ² I_p + M_D Q(θ)^{−1} M_D^T, whose inverse, the precision matrix, is denoted Q_Y(θ). Let A(θ) denote the sparse matrix

A(θ) = τ² Q(θ) + M_D^T M_D.    (20)

The log-likelihood of θ then reads

log L(θ) = −(1/2) ( p log(2π) − log |Q_Y(θ)| + Y^T Q_Y(θ) Y ),    (21)

where the quadratic form can be computed as

Y^T Q_Y(θ) Y = τ^{−2} ( Y^T Y − Y^T M_D A(θ)^{−1} M_D^T Y ),    (22)

and the log-determinant decomposes as

log |Q_Y(θ)| = log |Q(θ)| − log |A(θ)| + 2(n − p) log τ.    (23)

The precise computation of the log-likelihood deserves some comments. The quadratic form (22) is computed using the methods introduced to solve the linear system, as shown in Algorithm 1 and in Eq. (19). Evaluating the log-determinants in Eq. (23) could be done using a Cholesky (or LU) factorization of the matrices Q(θ) and A(θ), and then summing the logarithms of the diagonal elements of the resulting triangular factors. However, one can also use a "matrix-free" approach to compute an approximation of these log-determinants. This approach, detailed in the rest of this section, is a generalization to matrix functions of the results in Han et al. (2015). It is based on the following result (proven in Appendix A.2).
Proposition 4.1. Let B be a diagonalizable matrix and h : R → (0, ∞). The log-determinant of the matrix function h(B) satisfies the relation

log |h(B)| = E[ W^T (log h)(B) W ],

where W is a vector whose entries are independent zero-mean unit-variance random variables, and (log h)(B) is the matrix function defined from the function log h.
The matrix function (log h)(B) can in practice be approximated by a matrix polynomial P_{log h}(B), where P_{log h} denotes a polynomial approximation of log h over an interval containing the eigenvalues of B (defined using, for instance, Chebyshev polynomial approximation). Then, we can approximate log |h(B)| as

log |h(B)| ≈ (1/M) Σ_{m=1}^M W_m^T P_{log h}(B) W_m,    (24)

where W_1, ..., W_M denote M independent samples of W (drawn for instance from a Gaussian or Rademacher distribution). Similarly to Algorithm 2, each quadratic form in Eq. (24) is computed in an iterative way, while only requiring products between the matrix B and vectors. This approach can be used to compute the log-determinant (23) by noting that both log |P(S)| and log |A(θ)| can be written in the form log |h(B)|, with B = S and h = P for the former, and B = A(θ) and h the identity map for the latter. The whole procedure to compute log |Q_Y(θ)| is summarized in Algorithm 5. Note in particular that this algorithm requires knowing intervals containing the eigenvalues of S and those of A(θ). Details on the computation of these intervals can be found in Appendix B. A discussion about the "matrix-free" approach to the computation of log |h(B)| is deferred to Section 6.
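Once the polynomial approximation of log h is fixed, the stochastic estimator in Eq. (24) reduces to averaging quadratic forms over random probe vectors. A minimal sketch with Rademacher probes, where `apply_log_h` stands for the (user-supplied) matrix-free application of P_{log h}(B) to a vector:

```python
import numpy as np

def hutchinson_logdet(apply_log_h, n, num_samples, rng=None):
    """Hutchinson estimate of log|h(B)| = E[W^T (log h)(B) W]: average
    the quadratic forms W^T apply_log_h(W) over Rademacher probe vectors,
    where apply_log_h applies a polynomial approximation of (log h)(B)."""
    rng = np.random.default_rng(rng)
    est = 0.0
    for _ in range(num_samples):
        w = rng.choice([-1.0, 1.0], size=n)  # Rademacher probe vector
        est += w @ apply_log_h(w)
    return est / num_samples
```

As noted in Section 5, fixing the probe vectors across likelihood evaluations makes the approximated objective deterministic, which helps the outer optimization algorithm.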
Algorithm 5: Computation of log |Q_Y(θ)| defined in Eq. (23)
Require: Matrices √C, S ∈ R^{n×n} and M_D ∈ R^{p×n} as defined in Eqs. (6), (7) and (14)
Require: Vector θ containing the coefficients of P and the variance parameter τ²
Require: Number of samples M
1: Compute the coefficients of a polynomial approximation P_{log P} of the function λ → log P(λ) over an interval containing the eigenvalues of S
2: Compute the coefficients of a polynomial approximation P_{log} of the function λ → log λ over an interval containing the eigenvalues of the matrix A(θ) defined in Eq. (20)
3: Set Q_1 = Q_2 := 0
4: for m = 1, ..., M do
5: Sample a vector W with independent identically distributed entries with mean 0 and variance 1
6: Compute u := P_{log P}(S) W using Algorithm 2
7: Compute u' := P_{log}(A(θ)) W using Algorithm 2
8: Update Q_1 ← Q_1 + W^T u / M and Q_2 ← Q_2 + W^T u' / M
9: end for
10: Return Q_1 + log |C| − Q_2 + 2(n − p) log τ

Since we can now evaluate the log-likelihood for any vector of parameters, we can plug Algorithm 5 into any optimization algorithm that only requires evaluations of an objective function in order to maximize it. Examples of such algorithms include the Nelder-Mead algorithm, and any gradient-descent algorithm for which the gradients are numerically approximated by finite differences (Nocedal and Wright, 2006).
Finally, note that the procedure presented in this section naturally extends to the case where covariates are added to the model. Indeed, assume now that the observations are modeled as Ỹ = Xβ + Y, where Y is defined as before, X is a matrix of covariates and β is a vector containing the associated regression coefficients. Then, the maximum likelihood estimate of β is given by β̂(θ) = (X^T Q_Y(θ) X)^{−1} X^T Q_Y(θ) Ỹ, and the estimation of the parameters can be carried out by maximizing the likelihood L, where Y is now replaced by Ỹ − X β̂(θ).

Illustration
The proposed algorithms are illustrated on very large synthetic data sets. Our aim is to show that our approach compares very well with the GMRF approximation in Lindgren et al. (2011) on very large grids, even in cases that are favorable to the GMRF approximation.

Simulation and Kriging
We first consider the case of a 3D standardized GRF with varying anisotropies in the horizontal plane and an exponential covariance, which corresponds in 3D to the choice P(λ) = (κ² + λ)². Following the remarks of Section 3.2, it is sampled by using a Chebyshev polynomial approximation of λ → 1/√P(λ) = (κ² + λ)^{−1} of degree 268 (see e.g. Pereira and Desassis, 2019, for details). We chose a mesh built from 6 tetrahedra in each cell of a regular grid, with lags suitably chosen to fit the ranges of the GRF. The resulting size of the discretized vector Z is about 1.5 × 10^7. On this extremely large grid, the simulation takes only 30 seconds on a laptop running at 1.9 GHz on 8 cores. Most of the time is spent on the matrix-vector products implied by the polynomial approximation, as shown in Algorithm 2.
A sub-sampling of this simulation is done to obtain 10^5 randomly located observations, which are used to perform kriging. Since interpolation by kriging is smoother than a simulation, a coarser mesh can be used. The size of the resulting kriging system is about 7 × 10^6. A measurement error with variance τ² = 0.01 is added to the model. Conjugate gradient without preconditioning is used to solve the system of Eq. (19) in Algorithm 1. The algorithm converges in 1098 iterations, for a computing time of around 400 seconds. Results are displayed in Figure 4. Note that in most applications, the variance τ² is greater than 1% of the variance of the field. In these cases, the system of Eq. (19) is better conditioned, leading to a faster convergence of the conjugate gradient. Kriging is done here in a nonstationary context, with a stationary mesh which is thus far from optimal in most locations. If the simulation were stationary, the mesh could be tailored to the specific model at hand: the grid could be oriented according to the anisotropy tensor, and the lag in each direction could be chosen according to the associated directional range of the model. As a result, a given accuracy would be obtained with a lower degree of the Chebyshev polynomial, and the system of Eq. (19) would be better conditioned, thus leading to faster computations. In our experience, stationary simulations and kriging are usually 10 times faster than nonstationary ones, all other settings being equal.

Estimation of the Parameters of a GMRF
In this example, the coefficients of a polynomial P characterizing an isotropic GMRF and the measurement error variance τ² are estimated by the approach described in Section 4.

Figure 4: Left: 3D simulation of a GRF with varying anisotropies. Right: Kriging estimate using 10⁵ randomly located samples from the simulation on the left.

The coefficients of the true P are given in Table 1. It does not correspond to a Matérn type. 1.5 × 10⁴ synthetic observations are sampled from the model at random locations in a square of side 40, and Gaussian noise with variance τ² = 0.01 is added (see Figure 5). To ensure that P takes strictly positive values over R+ during the estimation, the problem is re-parameterized using two arbitrary polynomials P₁ and P₂, since any polynomial that is positive over R+ can be written as P(λ) = P₁(λ)² + λP₂(λ)² (Eq. (25)). A fixed positive value ε is added to the right-hand side of Eq. (25) to ensure strict positivity of P; in this study, ε is set to 10⁻³. The components of θ are the coefficients of P₁ and P₂. The likelihood is maximized with the COBYLA algorithm (Powell, 1994), using several initial guesses and selecting the best output. To compute the log-determinant with the Hutchinson estimator (Hutchinson, 1989), the same Gaussian vectors are used across all iterations: first, a single vector is used to find an acceptable solution; then the algorithm is relaunched with 10 random vectors until convergence.
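The Hutchinson trace estimator underlying the log-determinant computation can be sketched as follows: the trace of a matrix A is estimated from matrix-vector products alone, by averaging v^T A v over random probe vectors v with E[v vᵀ] = I. In the paper it is applied to a polynomial approximation of the logarithm of the precision matrix; here we only sketch the trace estimator itself, with names of our choosing.

```python
import numpy as np

def hutchinson_trace(apply_A, probes):
    # Hutchinson estimator of tr(A): average of v^T A v over probe vectors v.
    # Reusing a fixed probe matrix across likelihood evaluations (as in the
    # text) keeps the estimator noise identical between iterations, so the
    # approximate likelihood is a smooth function of the parameters.
    return np.mean([v @ apply_A(v) for v in probes])
```

With Gaussian probes the estimator is unbiased, and its variance decreases with the number of probe vectors, which is why a single vector suffices for a first rough optimization pass.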
The true coefficients of P (resulting from those of P₁ and P₂) are given in Table 1, along with the estimated coefficients. We also give in this table the initial guess that yielded the estimated coefficients. The initial value for τ² was set to 1, and its estimate is 0.00944. At first glance, the estimation results do not seem very accurate. However, the associated covariance functions, computed by Fast Fourier Transform (FFT) and displayed in Figure 6, lead to a more positive conclusion: the estimated covariance is close to the true one, especially near the origin. Nevertheless, the likelihood seems to have several modes, and the results are sensitive to the choice of initial values. τ² is generally well estimated, but the estimated covariance of the underlying GMRF is not always as close to the true one. Despite this optimization issue, further discussed in the next section, the matrix-free approach makes it possible to approximate the likelihood and estimate the shape of the covariance, even with a moderate number of random vectors in the Hutchinson estimator.

Discussion and Conclusion
We have proposed a generic approach for defining GRFs on compact Riemannian manifolds. It combines two ingredients of modern geostatistics, as pioneered in Lindgren et al. (2011): the definition of GRFs through expansions in the eigenfunctions of the Laplace–Beltrami operator on the Riemannian manifold, and the finite element approximation of these GRFs. This approach is quite general: not limited to spheres, it can be applied to construct valid GRFs on any smooth compact manifold.

A Proofs
A.1 Proof of Proposition 3.1

Consider the vector X given by [equation omitted]. Note that X is a centered Gaussian vector with covariance matrix [equation omitted]. Eqs. (15) and (16) then follow from computing Cov[X]⁻¹ from Eq. (26) and using the fact that Cov[X] Cov[X]⁻¹ = I_{p+n}.
Finally, recall that since we are dealing with Gaussian vectors, the vector of kriging predictors Z* coincides with the conditional expectation of the vector Z_T = (Z(x_{p+1}), ..., Z(x_{p+q}))^T, given the observations Y. Hence, by linearity of the expectation and the definition of M_T, we have

Figure 1: Example of GRFs with Matérn covariance function on the sphere and on surfaces shaped like a cow, a paraboloid and a duck-shaped swim ring.

Figure 3: Example of anisotropy parameters (left) and corresponding random field simulation obtained using our method (right), on the unit square.
Cov[Z|Y] does not explicitly depend on the vectors Z and Y. Hence, by sampling new vectors Z′ and Y′ with the same distribution as Z and Y, and by computing Z′ − E[Z′|Y′], we retrieve a centered Gaussian vector with covariance matrix Cov[Z′|Y′] = Cov[Z|Y]. This method is presented in Algorithm 4. It relies on being able to sample the vector Z described above, which can be done by either: • Computing the product BW, where B is a square root of Σ (e.g. the Cholesky factor of Σ or the matrix (
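The residual trick described above can be sketched with a small dense example: a conditional draw is obtained by adding to the kriging mean the residual between an unconditional draw and its own kriging mean. This is a dense illustration of the idea behind Algorithm 4, not the paper's matrix-free implementation; the variable names are ours.

```python
import numpy as np

def conditional_simulation(Sigma, M, tau2, y, rng):
    # Conditional draw of Z given Y = M Z + noise, Var[noise] = tau2 * I.
    # Since Cov[Z|Y] does not depend on the data, Z* + (Z' - E[Z'|Y'])
    # has the correct conditional distribution.
    n, p = Sigma.shape[0], M.shape[0]
    K = Sigma @ M.T @ np.linalg.inv(M @ Sigma @ M.T + tau2 * np.eye(p))
    z_star = K @ y                                   # E[Z | Y = y]
    B = np.linalg.cholesky(Sigma)                    # square root of Sigma
    z_new = B @ rng.standard_normal(n)               # unconditional draw Z'
    y_new = M @ z_new + np.sqrt(tau2) * rng.standard_normal(p)
    return z_star + (z_new - K @ y_new)              # conditional draw
```

In the matrix-free setting, the Cholesky factorization is replaced by the Chebyshev square-root filter, and the kriging means are computed by conjugate gradient.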

Figure 5: Left: Realization of the GMRF on the square. Right: Sampling of 1.5 × 10⁴ noisy observations from the simulation on the left.

Figure 6: Covariances obtained from P by FFT (true model and estimated).
Since the vector (Z, Y) is multivariate Gaussian, it follows that the conditional distribution of Z given Y is Gaussian, with mean E[Z|Y] and covariance matrix Cov[Z|Y] (Tong, 2012, Theorem 3.3.4) given by

E[Z|Y] = ΣM_D^T (M_D Σ M_D^T + τ² I_p)⁻¹ Y,
Cov[Z|Y] = Σ − ΣM_D^T (M_D Σ M_D^T + τ² I_p)⁻¹ M_D Σ.

Then, Eqs. (
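The covariance-form expressions above are equivalent, by the Woodbury identity, to precision-form expressions of the type (Σ⁻¹ + M_D^T M_D/τ²)⁻¹ M_D^T Y/τ² and (Σ⁻¹ + M_D^T M_D/τ²)⁻¹, which is the form exploited by the matrix-free algorithms for GMRFs. A small numerical check of this equivalence (the function name and test matrices are ours):

```python
import numpy as np

def conditional_moments(Sigma, MD, tau2, y):
    # Covariance-form conditional mean and covariance of Z given Y = y,
    # as in the displayed equations above.
    p = MD.shape[0]
    S = MD @ Sigma @ MD.T + tau2 * np.eye(p)   # Cov[Y]
    K = Sigma @ MD.T @ np.linalg.inv(S)
    return K @ y, Sigma - K @ MD @ Sigma       # E[Z|Y=y], Cov[Z|Y]
```

The precision form only requires solves with Σ⁻¹ + M_D^T M_D/τ², which is sparse when Z is a GMRF, whereas the covariance form involves the dense matrix Σ.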