Supplementary Material

JDS

Journal of Data Science

ISSN 1683-8602, 1680-743X

School of Statistics, Renmin University of China

JDS1073

10.6339/22-JDS1073

Statistical Data Science

Scalable Predictions for Spatial Probit Linear Mixed Models Using Nearest Neighbor Gaussian Processes

Arkajyoti Saha 1, Abhirup Datta 2, Sudipto Banerjee 3,∗

1 Department of Statistics, University of Washington, Seattle, WA, USA
2 Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA
3 UCLA Department of Biostatistics, 650 Charles E. Young Drive South, University of California Los Angeles, CA 90095-1772, USA

∗Corresponding author. Email: sudipto@ucla.edu.

3 November 2022

Volume 20, Issue 4, pp. 533–544

Supplementary Material

This supplementary material contains a discussion of why it is infeasible to estimate p(Y) in (4) directly by Monte Carlo sampling, an evaluation of the algorithms under consideration with respect to misclassification error, and details of the code and data used in the article.

16 August 2022; 6 October 2022

2022 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.


Open access article under the CC BY license.

Spatial probit generalized linear mixed models (spGLMM) with a linear fixed effect and a spatial random effect, endowed with a Gaussian Process prior, are widely used for the analysis of binary spatial data. However, the canonical Bayesian implementation of this hierarchical mixed model can involve protracted Markov Chain Monte Carlo sampling. Alternative approaches have been proposed that circumvent this by directly representing the marginal likelihood from spGLMM in terms of multivariate normal cumulative distribution functions (cdf). We present a direct and fast rendition of this latter approach for predictions from a spatial probit linear mixed model. We show that the covariance matrix of the multivariate normal cdf characterizing the marginal distribution of binary spatial data from spGLMM is amenable to approximation using Nearest Neighbor Gaussian Processes (NNGP). This facilitates a scalable prediction algorithm for spGLMM using NNGP that involves only sparse or small matrix computations and can be deployed in an embarrassingly parallel manner. We demonstrate the accuracy and scalability of the algorithm via numerous simulation experiments and an analysis of species presence-absence data.
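The representation of the marginal likelihood as a multivariate normal cdf can be illustrated with a small sketch. It assumes the usual latent-variable form of the probit model, Y_i = 1(Z_i > 0) with Z = Xβ + w + ε, w ~ N(0, C) and ε ~ N(0, I), so that P(Y = y) = Φ_n(DXβ; 0, D(C + I)D) with D = diag(2y_i − 1). The notation, the exponential covariance, the parameter values, and the function names below are illustrative assumptions, not the article's code, and the example uses a dense covariance matrix rather than the NNGP approximation.

```python
from itertools import product

import numpy as np
from scipy.stats import multivariate_normal, norm


def exponential_cov(coords, sigma2=1.0, phi=3.0):
    """Illustrative exponential covariance sigma2 * exp(-phi * distance)."""
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return sigma2 * np.exp(-phi * dist)


def probit_marginal_prob(y, X, beta, C):
    """P(Y = y) when Y_i = 1(Z_i > 0) and Z ~ N(X beta, C + I)."""
    y = np.asarray(y, dtype=float)
    signs = 2.0 * y - 1.0                      # +1 where y_i = 1, -1 where y_i = 0
    upper = signs * (X @ beta)                 # D X beta
    cov = (C + np.eye(len(y))) * np.outer(signs, signs)   # D (C + I) D
    return multivariate_normal(mean=np.zeros(len(y)), cov=cov).cdf(upper)


# Three hypothetical locations with an intercept-only mean.
coords = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5]])
X = np.ones((3, 1))
beta = np.array([0.3])
C = exponential_cov(coords)

# The probabilities of all 2^3 binary outcomes should sum to about one.
total = sum(probit_marginal_prob(y, X, beta, C) for y in product([0, 1], repeat=3))

# Without a spatial effect (C = 0) the probability factorizes into univariate terms.
p_indep = probit_marginal_prob([1, 0, 1], X, beta, np.zeros((3, 3)))
p_product = norm.cdf(0.3) * norm.cdf(-0.3) * norm.cdf(0.3)
```

A dense n-dimensional cdf evaluation of this kind becomes expensive as n grows, which is the cost that the NNGP-based approximation described above is intended to reduce.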

Keywords: binary data, generalized linear mixed models, spatial, Gaussian processes


Abhirup Datta was partially supported by National Institute of Environmental Health Sciences (NIEHS) grant R01 ES033739 and by National Science Foundation (NSF) Division of Mathematical Sciences grant DMS-1915803. Sudipto Banerjee was partially supported by the National Science Foundation (NSF) from grants NSF/DMS 1916349 and NSF/IIS 1562303, and by the National Institute of Environmental Health Sciences (NIEHS) from grants R01ES030210 and 5R01ES027027.
