Supplementary Material

JDS

Journal of Data Science

ISSN: 1683-8602, 1680-743X

School of Statistics, Renmin University of China

Article JDS1102. DOI: 10.6339/23-JDS1102

Computing in Data Science

Efficient Bayesian High-Dimensional Classification via Random Projection with Application to Gene Expression Data

Abhisek Chakraborty1,∗

1Department of Statistics, Texas A&M University, 3143 TAMU, College Station, TX 77843, USA

∗Email: cabhisek@stat.tamu.edu or zovialpapai@gmail.com.

Volume 22, Issue 1 (2024), pp. 152–172. Published online: 12 June 2023.

Supplementary Material

Software implementing the methodologies developed in the article is available at zovialpapai/Bayesian-classification-with-random-projection. The directories in the repository are described briefly below.

(1) functions: This directory contains utility functions, in two R scripts, that are used in the repeated simulations and the real data analysis conducted in the paper.
(a) BCC_Functions.R contains functions for compression matrix generation; Probit regression via the Albert & Chib and Holmes & Held data augmentation schemes; Logit regression via the Polya-Gamma data augmentation scheme; hyper-parameter tuning; and associated helper functions.
(b) Probit_HH_cpp.R contains Probit regression via the Holmes & Held data augmentation scheme, written in Rcpp.

(2) repeated simulations: This directory contains three R scripts, named BCC_sims.R, Weakleaners.R, and time_comparison.R.
(a) BCC_sims.R can be used to carry out the simulations presented in Section 3 on high-dimensional Probit regression and in Section 5 on high-dimensional Logit regression, along with hyper-parameter tuning.
(b) Weakleaners.R can be used to study the effect of the number of replications of the compression matrix (that is, the number of weak classifiers) on the accuracy of the classifiers AC, AC+, HH, and HH+. The results are presented in Section 3.
(c) time_comparison.R can be used to compare the computational time of our classifiers. The results are presented in Section 3.

(3) data: The micro-array gene expression cancer data sets used in the article are freely available on the website data.mendeley.com. Copies of the data sets are available in the data directory of our repository.

(4) real data analysis: This directory contains an R script named BCC_data.R that can be used to carry out the analysis of the micro-array gene expression cancer data sets (Leukemia, Lung Cancer, Prostate Cancer) presented in Section 4 of the paper.
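For readers unfamiliar with the Albert & Chib data augmentation scheme named above, the following is a minimal sketch in Python of that Gibbs sampler for a low-dimensional Probit model. It is not the code in BCC_Functions.R. The function names, the N(0, tau2 I) prior on the coefficients, the simulated data, and the iteration counts are all assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def sample_truncated(mean, positive, rng):
    """Draw from N(mean, 1) restricted to (0, inf) if positive, else (-inf, 0)."""
    p_neg = STD_NORMAL.cdf(-mean)                # P(latent <= 0)
    u = rng.uniform(p_neg, 1.0) if positive else rng.uniform(0.0, p_neg)
    u = min(max(u, 1e-12), 1.0 - 1e-12)          # keep inv_cdf inside (0, 1)
    draw = mean + STD_NORMAL.inv_cdf(u)
    # Guard against the clamp above pushing the draw across zero (approximate in the far tails).
    return max(draw, 1e-10) if positive else min(draw, -1e-10)

def cholesky(A):
    """Lower-triangular L with A = L L^T for a small positive-definite matrix."""
    d = len(A)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_lower(L, b):
    """Solve L x = b by forward substitution."""
    x = []
    for i in range(len(b)):
        x.append((b[i] - sum(L[i][k] * x[k] for k in range(i))) / L[i][i])
    return x

def solve_upper_t(L, b):
    """Solve L^T x = b by back substitution."""
    d = len(b)
    x = [0.0] * d
    for i in reversed(range(d)):
        x[i] = (b[i] - sum(L[k][i] * x[k] for k in range(i + 1, d))) / L[i][i]
    return x

def albert_chib_probit(X, y, n_iter=500, burn_in=100, tau2=10.0, seed=0):
    """Posterior mean of beta under y_i = 1(z_i > 0), z_i ~ N(x_i' beta, 1), beta ~ N(0, tau2 I)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    # Posterior precision Q = X'X + I / tau2 does not depend on z, so factor it once.
    Q = [[sum(X[i][a] * X[i][b] for i in range(n)) + (1.0 / tau2 if a == b else 0.0)
          for b in range(d)] for a in range(d)]
    L = cholesky(Q)
    beta = [0.0] * d
    total = [0.0] * d
    for it in range(n_iter):
        # Step 1: latent z_i | beta, y_i is a unit-variance normal truncated by the sign of y_i.
        z = [sample_truncated(sum(x * b for x, b in zip(row, beta)), y_i == 1, rng)
             for row, y_i in zip(X, y)]
        # Step 2: beta | z ~ N(Q^{-1} X'z, Q^{-1}), drawn as L^{-T} (L^{-1} X'z + eps).
        Xtz = [sum(X[i][a] * z[i] for i in range(n)) for a in range(d)]
        w = solve_lower(L, Xtz)
        beta = solve_upper_t(L, [w_a + rng.gauss(0.0, 1.0) for w_a in w])
        if it >= burn_in:
            total = [t + b for t, b in zip(total, beta)]
    return [t / (n_iter - burn_in) for t in total]

# Simulated example (hypothetical data, not from the paper).
data_rng = random.Random(1)
true_beta = [1.5, -1.0]
X = [[data_rng.gauss(0.0, 1.0) for _ in true_beta] for _ in range(200)]
y = [1 if sum(x * b for x, b in zip(row, true_beta)) + data_rng.gauss(0.0, 1.0) > 0 else 0
     for row in X]
posterior_mean = albert_chib_probit(X, y)
print(posterior_mean)
```

In this sketch the two blocks (latent variables, then coefficients) are updated in turn. The Holmes & Held variant mentioned above instead updates them jointly and is not shown here.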

Received 10 January 2023; accepted 26 April 2023.

© 2024 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China. Open access article under the CC BY license.

Inspired by the impressive successes of compressed sensing-based machine learning algorithms, we develop efficient data augmentation-based Gibbs samplers for Bayesian high-dimensional classification models by compressing the design matrix to a much lower dimension. Particular care is exercised in the choice of the projection mechanism, and an adaptive voting rule is employed to reduce sensitivity to the random projection matrix. Focusing on the high-dimensional Probit regression model, we note that the naive implementation of the data augmentation-based Gibbs sampler is not robust to the presence of collinearity in the design matrix, a setup ubiquitous in n < p problems. We demonstrate that a simple fix based on joint updates of parameters in the latent space circumvents this issue. With a computationally efficient MCMC scheme in place, we introduce an ensemble classifier by creating R (∼ 25–50) projected copies of the design matrix and subsequently running R classification models, one on each projected design matrix, in parallel. We combine the output from the R replications via an adaptive voting scheme. Our scheme is inherently parallelizable and can take advantage of modern computing environments, which are often equipped with multiple cores. The empirical success of our methodology is illustrated in elaborate simulations and gene expression data applications. We also extend our methodology to a high-dimensional logistic regression model and carry out numerical studies to showcase its efficacy.
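The ensemble idea in the abstract (project the design matrix R times, fit one classifier per projection, combine by voting) can be sketched in Python as below. This is only an illustration under stated assumptions: the Gaussian projection, the nearest-centroid weak learner (a stand-in for the Bayesian Probit sampler), and the training-accuracy-weighted vote (a stand-in for the adaptive voting rule, whose exact form is not given here) are hypothetical choices, and the replications run sequentially rather than in parallel.

```python
import math
import random

def random_projection(p, m, rng):
    """Gaussian m x p projection matrix with entries N(0, 1/m) (assumed mechanism)."""
    return [[rng.gauss(0.0, 1.0) / math.sqrt(m) for _ in range(p)] for _ in range(m)]

def project(X, Phi):
    """Compress an n x p design matrix to n x m by computing X Phi^T."""
    return [[sum(a * b for a, b in zip(row, phi)) for phi in Phi] for row in X]

def fit_centroids(Z, y):
    """Stand-in weak learner: class centroids in the compressed space."""
    centroids = {}
    for label in (0, 1):
        rows = [z for z, y_i in zip(Z, y) if y_i == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict_centroids(centroids, Z):
    """Assign each compressed row to the class with the nearer centroid."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [1 if sq_dist(z, centroids[1]) < sq_dist(z, centroids[0]) else 0 for z in Z]

def ensemble_predict(X_train, y_train, X_test, R=25, m=5, seed=0):
    """Combine R projected weak classifiers by a training-accuracy-weighted vote."""
    rng = random.Random(seed)
    p = len(X_train[0])
    votes = [0.0] * len(X_test)
    total_weight = 0.0
    for _ in range(R):
        Phi = random_projection(p, m, rng)       # a fresh projection per replication
        Z_train = project(X_train, Phi)
        centroids = fit_centroids(Z_train, y_train)
        train_pred = predict_centroids(centroids, Z_train)
        weight = sum(a == b for a, b in zip(train_pred, y_train)) / len(y_train)
        test_pred = predict_centroids(centroids, project(X_test, Phi))
        votes = [v + weight * t for v, t in zip(votes, test_pred)]
        total_weight += weight
    # Predict class 1 when the weighted share of votes for class 1 exceeds one half.
    return [1 if v > 0.5 * total_weight else 0 for v in votes]

def make_data(n, p, rng):
    """Hypothetical n < p data: the two classes differ in the mean of the first 10 features."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        shift = 1.0 if label == 1 else -1.0
        X.append([rng.gauss(shift if j < 10 else 0.0, 1.0) for j in range(p)])
        y.append(label)
    return X, y

data_rng = random.Random(42)
X_train, y_train = make_data(40, 100, data_rng)
X_test, y_test = make_data(20, 100, data_rng)
predictions = ensemble_predict(X_train, y_train, X_test)
accuracy = sum(a == b for a, b in zip(predictions, y_test)) / len(y_test)
print(accuracy)
```

Because each replication depends only on its own projection matrix, the loop body could be dispatched to separate cores, which is the sense in which the scheme is parallelizable.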

Keywords: collapsed Gibbs sampler; data augmentation; dimensionality reduction; ensemble learning; parallel processing

References

Achlioptas (2003). Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4): 671–687. Special Issue on PODS 2001. https://doi.org/10.1016/S0022-0000(03)00025-4

Adragni, Cook (2014). Sufficient dimension reduction and prediction in regression. Philosophical Transactions of the Royal Society A, 367: 1–21.

Albert, Chib (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422): 669–679. https://doi.org/10.1080/01621459.1993.10476321

Armagan, Dunson, Lee (2013). Generalized double Pareto shrinkage. Statistica Sinica, 23(1): 119–143.

Banerjee, Roy (2014). Linear Algebra and Matrix Analysis for Statistics. Chapman and Hall/CRC.

Bhadra, Datta, Polson, Willard (2017). The horseshoe+ estimator of ultra-sparse signals. Bayesian Analysis, 12(4): 1105–1131. https://doi.org/10.1214/16-BA1028

Bhattacharya, Chakraborty, Mallick (2016). Fast sampling with Gaussian scale mixture priors in high-dimensional regression. Biometrika, 103(4): 985–991. https://doi.org/10.1093/biomet/asw042

Bhattacharya, Pati, Pillai, Dunson (2015). Dirichlet–Laplace priors for optimal shrinkage. Journal of the American Statistical Association, 110(512): 1479–1490. PMID: 27019543. https://doi.org/10.1080/01621459.2014.960967

Biswas, Mackey, Meng (2022). Scalable spike-and-slab. In: Proceedings of the 39th International Conference on Machine Learning (Chaudhuri, Jegelka, Song, Szepesvari, Niu, Sabato, eds.), volume 162 of Proceedings of Machine Learning Research, 2021–2040. PMLR.

Brown, Griffin (2010). Inference with normal-gamma prior distributions in regression problems. Bayesian Analysis, 5(1): 171–188. https://doi.org/10.1214/10-BA507

Candes, Romberg, Tao (2006). Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 59(8): 1207–1223. https://doi.org/10.1002/cpa.20124

Cannings, Samworth (2017). Random-projection ensemble classification. Journal of the Royal Statistical Society Series B, 79(4): 959–1035. https://doi.org/10.1111/rssb.12228

Cao, Durante, Genton (2022). Scalable computation of predictive probabilities in probit models with Gaussian process priors. Journal of Computational and Graphical Statistics, 31(3): 709–720. https://doi.org/10.1080/10618600.2022.2036614

Carvalho, Polson, Scott (2009). Handling sparsity via the horseshoe. In: Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (van Dyk, Welling, eds.), volume 5 of Proceedings of Machine Learning Research, 73–80. PMLR, Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA.

Carvalho, Polson, Scott (2010). The horseshoe estimator for sparse signals. Biometrika, 97(2): 465–480. https://doi.org/10.1093/biomet/asq017

Chipman, George, McCulloch (2006). Bayesian ensemble learning. In: Advances in Neural Information Processing Systems (Schölkopf, Platt, Hoffman, eds.), volume 19, 1–8. MIT Press.

Chipman, George, McCulloch (1998). Bayesian CART model search. Journal of the American Statistical Association, 93(443): 935–948. https://doi.org/10.1080/01621459.1998.10473750

Clyde, Lee (2001). Bagging and the Bayesian bootstrap. In: Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics (Richardson, Jaakkola, eds.), volume R3 of Proceedings of Machine Learning Research, 57–62. PMLR. Reissued by PMLR on 31 March 2021.

Corrêa, Ludermir (2007). Dimensionality reduction of very large document collections by semantic mapping. In: Proceedings of the 6th International Workshop on Self-Organizing Maps, volume 6, 1–6.

Cox, Cox (2001). Multidimensional Scaling. Chapman and Hall/CRC.

Dasgupta (2013). Experiments with random projection. arXiv preprint: https://arxiv.org/abs/1301.3849.

Donoho (2006). Compressed sensing. IEEE Transactions on Information Theory, 52(4): 1289–1306. https://doi.org/10.1109/TIT.2006.871582

DuMouchel (2002). Data Squashing: Constructing Summary Data Sets. 579–591. Springer US, Boston, MA.

Faes, Ormerod, Wand (2011). Variational Bayesian inference for parametric and nonparametric regression with missing data. Journal of the American Statistical Association, 106(495): 959–971. https://doi.org/10.1198/jasa.2011.tm10301

George, McCulloch (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423): 881–889. https://doi.org/10.1080/01621459.1993.10476353

Girolami, Rogers (2006). Variational Bayesian multinomial probit regression with Gaussian process priors. Neural Computation, 18(8): 1790–1817. https://doi.org/10.1162/neco.2006.18.8.1790

Golub, Slonim, Tamayo, Huard, Gaasenbeek, Mesirov, et al. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286: 531–537. https://doi.org/10.1126/science.286.5439.531

Gordon, Beiser, Brandt, et al. (2002). The ocular hypertension treatment study: Baseline factors that predict the onset of primary open-angle glaucoma. Archives of Ophthalmology, 120: 714–34. https://doi.org/10.1001/archopht.120.6.714

Guhaniyogi, Dunson (2015). Bayesian compressed regression. Journal of the American Statistical Association, 110(512): 1500–1514. https://doi.org/10.1080/01621459.2014.969425

Hans (2009). Bayesian lasso regression. Biometrika, 96(4): 835–845. https://doi.org/10.1093/biomet/asp047

Held, Holmes (2006). Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Analysis, 1(1): 145–168. https://doi.org/10.1214/06-BA105

Hinton, Roweis (2002). Stochastic neighbor embedding. In: Advances in Neural Information Processing Systems (Becker, Thrun, Obermayer, eds.), volume 15. MIT Press.

Hoeting, Madigan, Raftery, Volinsky (1999). Bayesian model averaging: A tutorial. Statistical Science, 14(4): 382–401. https://doi.org/10.1214/ss/1009212519

Hotelling (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 6: 417–441. https://doi.org/10.1037/h0071325

Johnson, Lindenstrauss (1984). Extensions of Lipschitz mappings into Hilbert space. Contemporary Mathematics, 26: 189–206. https://doi.org/10.1090/conm/026/737400

Jolliffe, Cadima (2016). Principal component analysis: A review and recent developments. Philosophical Transactions of the Royal Society A, 374: 1–16.

Kim, Ghahramani (2012). Bayesian classifier combination. In: Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (Lawrence, Girolami, eds.), volume 22 of Proceedings of Machine Learning Research, 619–627. PMLR, La Palma, Canary Islands.

Lee HKH, Taddy, Gray (2010). Selection of a representative sample. Journal of Classification, 27: 41–53. https://doi.org/10.1007/s00357-010-9044-x

Li, Japkowicz, Stocki, Ungar (2010). Cascading Customized Naïve Bayes Couple. 147–160. Springer, Berlin Heidelberg, Berlin, Heidelberg.

Li, Hastie, Church (2006a). Improving random projections using marginal information. In: Conference on Learning Theory. 635–649. 2006.

Li, Hastie, Church (2006b). Very sparse random projections. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 287–296. 2006.

Loaiza-Maya, Nibbering (2022). Fast variational Bayes methods for multinomial probit models. Journal of Business & Economic Statistics. https://doi.org/10.1080/07350015.2022.2139267

Lorbert, Blei, Schapire, Ramadge (2012). A Bayesian boosting model. arXiv preprint: https://arxiv.org/abs/1209.1996.

Madigan (2004). Likelihood-based data squashing: A modeling approach to instance construction. Data Mining and Knowledge Discovery, 6: 173–190. https://doi.org/10.1023/A:1014095614948

Mika, Schölkopf, Smola, Müller, Scholz, Rätsch (1998). Kernel PCA and de-noising in feature spaces. In: Advances in Neural Information Processing Systems (Kearns, Solla, Cohn, eds.), volume 1, 8. MIT Press.

Mitchell, Beauchamp (1988). Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404): 1023–1032. https://doi.org/10.1080/01621459.1988.10478694

Mukherjee, Sen (2021). Variational inference in high-dimensional linear regression. arXiv preprint: https://arxiv.org/abs/2104.12232.

Owen (2003). Data squashing empirical likelihood. Data Mining and Knowledge Discovery, 7: 101–113. https://doi.org/10.1023/A:1021568920107

Park, Casella (2008). The Bayesian lasso. Journal of the American Statistical Association, 103(482): 681–686. https://doi.org/10.1198/016214508000000337

Piironen, Vehtari (2017). Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics, 11(2): 5018–5051. https://doi.org/10.1214/17-EJS1337SI

Polson, Scott (2011). Shrink Globally, Act Locally: Sparse Bayesian Regularization and Prediction. Oxford University Press.

Polson, Scott, Windle (2013). Bayesian inference for logistic models using Pólya–Gamma latent variables. Journal of the American Statistical Association, 108(504): 1339–1349. https://doi.org/10.1080/01621459.2013.829001

Roweis, Saul (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290: 2323–2326. https://doi.org/10.1126/science.290.5500.2323

Shin, Bhattacharya, Johnson (2015). Scalable Bayesian variable selection using nonlocal prior densities in ultrahigh-dimensional settings. Statistica Sinica, 28: 1053–1078.

Singh, Febbo, Ross, et al. (2002). Gene expression correlates of clinical prostate cancer behavior. Genome Biology, 1: 203–212.

Sra, Dhillon (2005). Generalized nonnegative matrix approximations with Bregman divergences. In: Advances in Neural Information Processing Systems (Weiss, Schölkopf, Platt, eds.), volume 18. MIT Press.

Tanner, Wong (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82(398): 528–540. https://doi.org/10.1080/01621459.1987.10478458

Tibshirani (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, Methodological, 58(1): 267–288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x

Titsias, Lawrence (2010). Bayesian Gaussian process latent variable model. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (Teh, Titterington, eds.), volume 9 of Proceedings of Machine Learning Research, 844–851. PMLR, Chia Laguna Resort, Sardinia, Italy.

van der Maaten, Hinton (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9: 1–27.

Xie, Huang (2009). SCAD-penalized regression in high-dimensional partially linear models. The Annals of Statistics, 37(2): 673–696. https://doi.org/10.1214/07-AOS580

Zhang (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2): 894–942. https://doi.org/10.1214/09-AOS729

Zou (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476): 1418–1429. https://doi.org/10.1198/016214506000000735

Zou, Hastie (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, Statistical Methodology, 67(2): 301–320. https://doi.org/10.1111/j.1467-9868.2005.00503.x