<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">JDS</journal-id>
<journal-title-group><journal-title>Journal of Data Science</journal-title></journal-title-group>
<issn pub-type="epub">1683-8602</issn><issn pub-type="ppub">1680-743X</issn><issn-l>1680-743X</issn-l>
<publisher>
<publisher-name>School of Statistics, Renmin University of China</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">JDS1073</article-id>
<article-id pub-id-type="doi">10.6339/22-JDS1073</article-id>
<article-categories><subj-group subj-group-type="heading">
<subject>Statistical Data Science</subject></subj-group></article-categories>
<title-group>
<article-title>Scalable Predictions for Spatial Probit Linear Mixed Models Using Nearest Neighbor Gaussian Processes</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Saha</surname><given-names>Arkajyoti</given-names></name><xref ref-type="aff" rid="j_jds1073_aff_001">1</xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Datta</surname><given-names>Abhirup</given-names></name><xref ref-type="aff" rid="j_jds1073_aff_002">2</xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Banerjee</surname><given-names>Sudipto</given-names></name><email xlink:href="mailto:sudipto@ucla.edu">sudipto@ucla.edu</email><xref ref-type="aff" rid="j_jds1073_aff_003">3</xref><xref ref-type="corresp" rid="cor1">∗</xref>
</contrib>
<aff id="j_jds1073_aff_001"><label>1</label>Department of Statistics, <institution>University of Washington</institution>, Seattle, WA, <country>USA</country></aff>
<aff id="j_jds1073_aff_002"><label>2</label>Department of Biostatistics, <institution>Johns Hopkins University</institution>, Baltimore, MD, <country>USA</country></aff>
<aff id="j_jds1073_aff_003"><label>3</label>UCLA Department of Biostatistics, 650 Charles E. Young Drive South, <institution>University of California Los Angeles</institution>, CA 90095-1772, <country>USA</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>∗</label>Corresponding author. Email: <ext-link ext-link-type="uri" xlink:href="mailto:sudipto@ucla.edu">sudipto@ucla.edu</ext-link>.</corresp>
</author-notes>
<pub-date pub-type="ppub"><year>2022</year></pub-date><pub-date pub-type="epub"><day>3</day><month>11</month><year>2022</year></pub-date><volume>20</volume><issue>4</issue><fpage>533</fpage><lpage>544</lpage><supplementary-material id="S1" content-type="document" xlink:href="jds1073_s001.pdf" mimetype="application" mime-subtype="pdf">
<caption>
<title>Supplementary Material</title>
<p>This supplementary material contains a discussion of why it is infeasible to use Monte Carlo sampling directly to estimate <inline-formula id="j_jds1073_ineq_001"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">Y</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$p(Y)$]]></tex-math></alternatives></inline-formula> in (4), evaluation of the algorithms under consideration with respect to misclassification error, and details of the code and data used in the article.</p>
</caption>
</supplementary-material><history><date date-type="received"><day>16</day><month>8</month><year>2022</year></date><date date-type="accepted"><day>6</day><month>10</month><year>2022</year></date></history>
<permissions><copyright-statement>2022 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.</copyright-statement><copyright-year>2022</copyright-year>
<license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>Open access article under the <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">CC BY</ext-link> license.</license-p></license></permissions>
<abstract>
<p>Spatial probit generalized linear mixed models (spGLMM) with a linear fixed effect and a spatial random effect, endowed with a Gaussian Process prior, are widely used for the analysis of binary spatial data. However, the canonical Bayesian implementation of this hierarchical mixed model can involve protracted Markov Chain Monte Carlo sampling. Alternative approaches have been proposed that circumvent this by directly representing the marginal likelihood from spGLMM in terms of multivariate normal cumulative distribution functions (cdf). We present a direct and fast rendition of this latter approach for predictions from a spatial probit linear mixed model. We show that the covariance matrix of the cdf characterizing the marginal cdf of binary spatial data from spGLMM is amenable to approximation using Nearest Neighbor Gaussian Processes (NNGP). This facilitates a scalable prediction algorithm for spGLMM using NNGP that only involves sparse or small matrix computations and can be deployed in an embarrassingly parallel manner. We demonstrate the accuracy and scalability of the algorithm via numerous simulation experiments and an analysis of species presence-absence data.</p>
</abstract>
<kwd-group>
<label>Keywords</label>
<kwd>binary data</kwd>
<kwd>generalized linear mixed models</kwd>
<kwd>spatial, Gaussian processes</kwd>
</kwd-group>
<funding-group><award-group><funding-source xlink:href="https://doi.org/10.13039/100000066">National Institute of Environmental Health Sciences</funding-source><award-id>R01 ES033739</award-id></award-group><award-group><funding-source xlink:href="https://doi.org/10.13039/100000001">National Science Foundation</funding-source><award-id>DMS-1915803</award-id></award-group><award-group><funding-source xlink:href="https://doi.org/10.13039/100000001">National Science Foundation</funding-source><award-id>NSF/DMS 1916349</award-id><award-id>NSF/IIS 1562303</award-id></award-group><award-group><funding-source xlink:href="https://doi.org/10.13039/100000066">National Institute of Environmental Health Sciences</funding-source><award-id>R01ES030210</award-id><award-id>5R01ES027027</award-id></award-group><funding-statement>Abhirup Datta was partially supported by the National Institute of Environmental Health Sciences (NIEHS) through grant R01 ES033739 and by the National Science Foundation (NSF) Division of Mathematical Sciences through grant DMS-1915803. Sudipto Banerjee was partially supported by the NSF through grants NSF/DMS 1916349 and NSF/IIS 1562303, and by the NIEHS through grants R01ES030210 and 5R01ES027027.</funding-statement></funding-group>
</article-meta>
</front>
<body/>
<back>
<ref-list id="j_jds1073_reflist_001">
<title>References</title>
<ref id="j_jds1073_ref_001">
<mixed-citation publication-type="journal"> <string-name><surname>Albert</surname> <given-names>JH</given-names></string-name>, <string-name><surname>Chib</surname> <given-names>S</given-names></string-name> (<year>1993</year>). <article-title>Bayesian analysis of binary and polychotomous response data</article-title>. <source>Journal of the American Statistical Association</source>, <volume>88</volume>(<issue>422</issue>): <fpage>669</fpage>–<lpage>679</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1073_ref_002">
<mixed-citation publication-type="book"> <string-name><surname>Azzalini</surname> <given-names>A</given-names></string-name>, <string-name><surname>Capitanio</surname> <given-names>A</given-names></string-name> (<year>2014</year>). <source>The Skew-Normal and Related Families</source>, volume <volume>3</volume>. <publisher-name>Cambridge University Press</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1073_ref_003">
<mixed-citation publication-type="journal"> <string-name><surname>Banerjee</surname> <given-names>S</given-names></string-name>, <string-name><surname>Gelfand</surname> <given-names>AE</given-names></string-name> (<year>2006</year>). <article-title>Bayesian wombling: Curvilinear gradient assessment under spatial process models</article-title>. <source>Journal of the American Statistical Association</source>, <volume>101</volume>(<issue>476</issue>): <fpage>1487</fpage>–<lpage>1501</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1073_ref_004">
<mixed-citation publication-type="journal"> <string-name><surname>Berrett</surname> <given-names>C</given-names></string-name>, <string-name><surname>Calder</surname> <given-names>CA</given-names></string-name> (<year>2016</year>). <article-title>Bayesian spatial binary classification</article-title>. <source>Spatial Statistics</source>, <volume>16</volume>: <fpage>72</fpage>–<lpage>102</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1073_ref_005">
<mixed-citation publication-type="journal"> <string-name><surname>Botev</surname> <given-names>ZI</given-names></string-name> (<year>2017</year>). <article-title>The normal law under linear restrictions: simulation and estimation via minimax tilting</article-title>. <source>Journal of the Royal Statistical Society, Series B, Statistical Methodology</source>, <volume>79</volume>(<issue>1</issue>): <fpage>125</fpage>–<lpage>148</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1073_ref_006">
<mixed-citation publication-type="other"> <string-name><surname>Botev</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Belzile</surname> <given-names>L</given-names></string-name> (2021). TruncatedNormal: Truncated Multivariate Normal and Student Distributions. R package version 2.2.2.</mixed-citation>
</ref>
<ref id="j_jds1073_ref_007">
<mixed-citation publication-type="other"> <string-name><surname>Cao</surname> <given-names>J</given-names></string-name>, <string-name><surname>Durante</surname> <given-names>D</given-names></string-name>, <string-name><surname>Genton</surname> <given-names>MG</given-names></string-name> (2022). Scalable computation of predictive probabilities in probit models with Gaussian process priors. <italic>Journal of Computational and Graphical Statistics</italic>, 1–12. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1080/10618600.2022.2036614" xlink:type="simple">https://doi.org/10.1080/10618600.2022.2036614</ext-link>.</mixed-citation>
</ref>
<ref id="j_jds1073_ref_008">
<mixed-citation publication-type="journal"> <string-name><surname>Cao</surname> <given-names>J</given-names></string-name>, <string-name><surname>Genton</surname> <given-names>MG</given-names></string-name>, <string-name><surname>Keyes</surname> <given-names>DE</given-names></string-name>, <string-name><surname>Turkiyyah</surname> <given-names>GM</given-names></string-name> (<year>2022</year>). <article-title>tlrmvnmvt: Computing high-dimensional multivariate normal and Student-t probabilities with low-rank methods in R</article-title>. <source>Journal of Statistical Software</source>, <volume>101</volume>: <fpage>1</fpage>–<lpage>25</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1073_ref_009">
<mixed-citation publication-type="other"> <string-name><surname>Cao</surname> <given-names>J</given-names></string-name>, <string-name><surname>Genton</surname> <given-names>M</given-names></string-name>, <string-name><surname>Keyes</surname> <given-names>D</given-names></string-name>, <string-name><surname>Turkiyyah</surname> <given-names>G</given-names></string-name> (2020). tlrmvnmvt: Low-Rank Methods for MVN and MVT Probabilities. R package version 1.1.0.</mixed-citation>
</ref>
<ref id="j_jds1073_ref_010">
<mixed-citation publication-type="journal"> <string-name><surname>Datta</surname> <given-names>A</given-names></string-name> (<year>2021</year>). <article-title>Nearest-neighbor sparse Cholesky matrices in spatial statistics</article-title>. <source>Wiley Interdisciplinary Reviews: Computational Statistics</source>, <volume>14</volume>(<issue>5</issue>): <fpage>e1574</fpage>.</mixed-citation>
</ref>
<ref id="j_jds1073_ref_011">
<mixed-citation publication-type="journal"> <string-name><surname>Datta</surname> <given-names>A</given-names></string-name>, <string-name><surname>Banerjee</surname> <given-names>S</given-names></string-name>, <string-name><surname>Finley</surname> <given-names>AO</given-names></string-name>, <string-name><surname>Gelfand</surname> <given-names>AE</given-names></string-name> (<year>2016</year>a). <article-title>Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets</article-title>. <source>Journal of the American Statistical Association</source>, <volume>111</volume>(<issue>514</issue>): <fpage>800</fpage>–<lpage>812</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1073_ref_012">
<mixed-citation publication-type="journal"> <string-name><surname>Datta</surname> <given-names>A</given-names></string-name>, <string-name><surname>Banerjee</surname> <given-names>S</given-names></string-name>, <string-name><surname>Finley</surname> <given-names>AO</given-names></string-name>, <string-name><surname>Gelfand</surname> <given-names>AE</given-names></string-name> (<year>2016</year>b). <article-title>On nearest-neighbor Gaussian process models for massive spatial data</article-title>. <source>Wiley Interdisciplinary Reviews: Computational Statistics</source>, <volume>8</volume>(<issue>5</issue>): <fpage>162</fpage>–<lpage>171</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1073_ref_013">
<mixed-citation publication-type="journal"> <string-name><surname>De Oliveira</surname> <given-names>V</given-names></string-name> (<year>2000</year>). <article-title>Bayesian prediction of clipped Gaussian random fields</article-title>. <source>Computational Statistics &amp; Data Analysis</source>, <volume>34</volume>(<issue>3</issue>): <fpage>299</fpage>–<lpage>314</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1073_ref_014">
<mixed-citation publication-type="journal"> <string-name><surname>De Oliveira</surname> <given-names>V</given-names></string-name>, <string-name><surname>Kedem</surname> <given-names>B</given-names></string-name>, <string-name><surname>Short</surname> <given-names>DA</given-names></string-name> (<year>1997</year>). <article-title>Bayesian prediction of transformed Gaussian random fields</article-title>. <source>Journal of the American Statistical Association</source>, <volume>92</volume>(<issue>440</issue>): <fpage>1422</fpage>–<lpage>1433</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1073_ref_015">
<mixed-citation publication-type="journal"> <string-name><surname>Diggle</surname> <given-names>PJ</given-names></string-name>, <string-name><surname>Tawn</surname> <given-names>JA</given-names></string-name>, <string-name><surname>Moyeed</surname> <given-names>RA</given-names></string-name> (<year>1998</year>). <article-title>Model-based geostatistics</article-title>. <source>Journal of the Royal Statistical Society. Series C. Applied Statistics</source>, <volume>47</volume>(<issue>3</issue>): <fpage>299</fpage>–<lpage>350</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1073_ref_016">
<mixed-citation publication-type="journal"> <string-name><surname>Finley</surname> <given-names>AO</given-names></string-name>, <string-name><surname>Banerjee</surname> <given-names>S</given-names></string-name>, <string-name><surname>McRoberts</surname> <given-names>RE</given-names></string-name> (<year>2009</year>). <article-title>Hierarchical spatial models for predicting tree species assemblages across large domains</article-title>. <source>Annals of Applied Statistics</source>, <volume>3</volume>(<issue>3</issue>): <fpage>1052</fpage>–<lpage>1079</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1073_ref_017">
<mixed-citation publication-type="journal"> <string-name><surname>Finley</surname> <given-names>AO</given-names></string-name>, <string-name><surname>Datta</surname> <given-names>A</given-names></string-name>, <string-name><surname>Cook</surname> <given-names>BD</given-names></string-name>, <string-name><surname>Morton</surname> <given-names>DC</given-names></string-name>, <string-name><surname>Andersen</surname> <given-names>HE</given-names></string-name>, <string-name><surname>Banerjee</surname> <given-names>S</given-names></string-name> (<year>2019</year>). <article-title>Efficient algorithms for Bayesian nearest neighbor Gaussian processes</article-title>. <source>Journal of Computational and Graphical Statistics</source>, <volume>28</volume>(<issue>2</issue>): <fpage>401</fpage>–<lpage>414</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1073_ref_018">
<mixed-citation publication-type="journal"> <string-name><surname>Genz</surname> <given-names>A</given-names></string-name> (<year>1992</year>). <article-title>Numerical computation of multivariate normal probabilities</article-title>. <source>Journal of Computational and Graphical Statistics</source>, <volume>1</volume>(<issue>2</issue>): <fpage>141</fpage>–<lpage>149</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1073_ref_019">
<mixed-citation publication-type="journal"> <string-name><surname>Heagerty</surname> <given-names>PJ</given-names></string-name>, <string-name><surname>Lele</surname> <given-names>SR</given-names></string-name> (<year>1998</year>). <article-title>A composite likelihood approach to binary spatial data</article-title>. <source>Journal of the American Statistical Association</source>, <volume>93</volume>(<issue>443</issue>): <fpage>1099</fpage>–<lpage>1111</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1073_ref_020">
<mixed-citation publication-type="journal"> <string-name><surname>Lee</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Nelder</surname> <given-names>JA</given-names></string-name> (<year>1996</year>). <article-title>Hierarchical generalized linear models</article-title>. <source>Journal of the Royal Statistical Society, Series B, Methodological</source>, <volume>58</volume>(<issue>4</issue>): <fpage>619</fpage>–<lpage>656</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1073_ref_021">
<mixed-citation publication-type="journal"> <string-name><surname>Saha</surname> <given-names>A</given-names></string-name>, <string-name><surname>Datta</surname> <given-names>A</given-names></string-name> (<year>2018</year>a). <article-title>BRISC: bootstrap for rapid inference on spatial covariances</article-title>. <source>Stat</source>, <volume>7</volume>(<issue>1</issue>): <fpage>e184</fpage>.</mixed-citation>
</ref>
<ref id="j_jds1073_ref_022">
<mixed-citation publication-type="other"> <string-name><surname>Saha</surname> <given-names>A</given-names></string-name>, <string-name><surname>Datta</surname> <given-names>A</given-names></string-name> (2018b). BRISC: Fast Inference for Large Spatial Datasets using BRISC. R package version 0.1.0.</mixed-citation>
</ref>
<ref id="j_jds1073_ref_023">
<mixed-citation publication-type="journal"> <string-name><surname>Vecchia</surname> <given-names>AV</given-names></string-name> (<year>1988</year>). <article-title>Estimation and model identification for continuous spatial processes</article-title>. <source>Journal of the Royal Statistical Society, Series B, Methodological</source>, <volume>50</volume>(<issue>2</issue>): <fpage>297</fpage>–<lpage>312</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1073_ref_024">
<mixed-citation publication-type="other"> <string-name><surname>Zhang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Arellano-Valle</surname> <given-names>RB</given-names></string-name>, <string-name><surname>Genton</surname> <given-names>MG</given-names></string-name>, <string-name><surname>Huser</surname> <given-names>R</given-names></string-name> (2022). Tractable Bayes of skew-elliptical link models for correlated binary data. <italic>Biometrics</italic>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1111/biom.13731" xlink:type="simple">https://doi.org/10.1111/biom.13731</ext-link>.</mixed-citation>
</ref>
</ref-list>
</back>
</article>
