Efficient Bayesian High-Dimensional Classification via Random Projection with Application to Gene Expression Data
Volume 22, Issue 1 (2024), pp. 152–172
Pub. online: 12 June 2023
Type: Computing In Data Science
Open Access
Received: 10 January 2023
Accepted: 26 April 2023
Published: 12 June 2023
Abstract
Inspired by the impressive successes of compressed sensing-based machine learning algorithms, data augmentation-based efficient Gibbs samplers for Bayesian high-dimensional classification models are developed by compressing the design matrix to a much lower dimension. Particular care is exercised in the choice of the projection mechanism, and an adaptive voting rule is employed to reduce sensitivity to the random projection matrix. Focusing on the high-dimensional Probit regression model, we note that the naive implementation of the data augmentation-based Gibbs sampler is not robust to the presence of collinearity in the design matrix – a setup ubiquitous in $n\lt p$ problems. We demonstrate that a simple fix based on joint updates of parameters in the latent space circumvents this issue. With a computationally efficient MCMC scheme in place, we introduce an ensemble classifier by creating R ($\sim 25$–50) projected copies of the design matrix, and subsequently running R classification models on the R projected design matrices in parallel. We combine the output from the R replications via an adaptive voting scheme. Our scheme is inherently parallelizable and capable of taking advantage of modern computing environments often equipped with multiple cores. The empirical success of our methodology is illustrated in elaborate simulations and gene expression data applications. We also extend our methodology to a high-dimensional logistic regression model and carry out numerical studies to showcase its efficacy.
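To make the workflow concrete, the following is a minimal illustrative R sketch of the two core ingredients: an Achlioptas-style sparse random projection that compresses the p-dimensional design matrix to m ≪ p columns, and an Albert & Chib (1993) data-augmentation Gibbs sampler for Probit regression fit on the compressed design. This is not the authors' implementation; the function names rp_matrix() and probit_ac_gibbs(), the prior variance tau2, and the toy data are assumptions made for illustration only.

```r
## Minimal sketch (assumed names rp_matrix(), probit_ac_gibbs(); not the authors' code):
## compress a wide design matrix with a sparse random projection, then fit Bayesian
## Probit regression on the compressed features via Albert & Chib data augmentation
## with a N(0, tau2 * I) prior on beta.
set.seed(1)

# p x m sparse projection with entries +/- sqrt(3/m) (prob. 1/6 each) and 0 (prob. 2/3),
# so that E[Phi %*% t(Phi)] = I_p (Achlioptas, 2003).
rp_matrix <- function(p, m) {
  vals <- sample(c(-1, 0, 1), p * m, replace = TRUE, prob = c(1/6, 2/3, 1/6))
  matrix(sqrt(3 / m) * vals, nrow = p, ncol = m)
}

probit_ac_gibbs <- function(X, y, n_iter = 2000, tau2 = 10) {
  n <- nrow(X); m <- ncol(X)
  V <- solve(crossprod(X) + diag(1 / tau2, m))   # posterior covariance of beta given z
  L <- t(chol(V))                                # for drawing from N(mean, V)
  beta  <- rep(0, m)
  draws <- matrix(NA_real_, n_iter, m)
  for (t in seq_len(n_iter)) {
    # 1. Latent z_i | beta, y_i: normal truncated to (0, Inf) if y_i = 1, to (-Inf, 0] otherwise.
    mu <- drop(X %*% beta)
    p0 <- pnorm(0, mean = mu)
    u  <- ifelse(y == 1, runif(n, p0, 1), runif(n, 0, p0))
    z  <- qnorm(pmin(pmax(u, 1e-12), 1 - 1e-12), mean = mu)
    # 2. beta | z ~ N(V X'z, V).
    beta <- drop(V %*% crossprod(X, z)) + drop(L %*% rnorm(m))
    draws[t, ] <- beta
  }
  draws
}

# Toy n << p example: project p = 2000 features down to m = 20 and fit.
n <- 100; p <- 2000; m_proj <- 20
X <- matrix(rnorm(n * p), n, p)
y <- as.integer(drop(X %*% c(rep(2, 10), rep(0, p - 10))) + rnorm(n) > 0)
Phi   <- rp_matrix(p, m_proj)
Xc    <- X %*% Phi                          # compressed n x m design
draws <- probit_ac_gibbs(Xc, y)
phat  <- pnorm(drop(Xc %*% colMeans(draws[-(1:500), ])))
mean((phat > 0.5) == y)                     # training accuracy of one weak learner
```

Sampling beta jointly given the latent vector z (rather than coordinate-wise) is what keeps the chain well behaved when the compressed design still carries correlated columns; the sketch above draws the full vector in one Gaussian update per iteration.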
Supplementary material
Software implementing the methodologies developed in the article is available at zovialpapai/Bayesian-classification-with-random-projection. A short description of the directories in the repository follows; a sketch of how the pieces fit together is given after this list.
(1) functions: This directory contains utility functions, in two R scripts, that are used in the repeated simulations and real data analyses conducted in the paper. (a) BCC_Functions.R contains functions for compression-matrix generation; Probit regression via the Albert & Chib and Holmes & Held data augmentation schemes; Logit regression via the Polya-Gamma data augmentation scheme; hyper-parameter tuning; and associated helper functions. (b) Probit_HH_cpp.R contains Probit regression via the Holmes & Held data augmentation scheme, written in Rcpp.
(2) repeated simulations: This directory contains three R scripts, named BCC_sims.R, Weakleaners.R, and time_comparison.R. (a) BCC_sims.R can be used to carry out the simulations presented in Section 3 on high-dimensional Probit regression and in Section 5 on high-dimensional Logit regression, along with hyper-parameter tuning. (b) Weakleaners.R can be used to study the effect of the number of replications of the compression matrix (i.e., the number of weak classifiers) on the accuracy of the classifiers AC, AC+, HH, and HH+. The results are presented in Section 3. (c) time_comparison.R can be used to study the comparative computational time of our classifiers. The results are presented in Section 3.
(3) data: The micro-array gene expression cancer data sets used in the article are freely available at data.mendeley.com. Copies of the data sets are provided in the data directory of our repository.
(4) real data analysis: This directory contains an R script named BCC_data.R that can be used to carry out the analysis of the micro-array gene expression cancer data sets (leukemia, lung cancer, prostate cancer) presented in Section 4 of the paper.
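As a complement to the repository description, here is a hedged sketch of the ensemble step: fit R weak Probit classifiers on independently projected copies of the design matrix in parallel, then combine their predictions by a weighted vote. The function name ensemble_probit_rp() and the in-sample accuracy weights are illustrative assumptions standing in for the paper's adaptive voting rule, not the repository's API; the sketch reuses rp_matrix() and probit_ac_gibbs() from the example following the abstract.

```r
## Illustrative ensemble sketch (assumed name ensemble_probit_rp(); accuracy-weighted
## voting below is a stand-in for the adaptive voting rule described in the paper).
## Requires rp_matrix() and probit_ac_gibbs() from the earlier sketch.
library(parallel)

ensemble_probit_rp <- function(X, y, R = 25, m_proj = 20, n_iter = 2000,
                               burn = 500, cores = 1) {   # cores > 1 needs forking (not Windows)
  fits <- mclapply(seq_len(R), function(r) {
    Phi   <- rp_matrix(ncol(X), m_proj)                   # fresh projection per weak learner
    Xc    <- X %*% Phi
    draws <- probit_ac_gibbs(Xc, y, n_iter = n_iter)
    beta  <- colMeans(draws[-(1:burn), , drop = FALSE])
    acc   <- mean((pnorm(drop(Xc %*% beta)) > 0.5) == y)  # in-sample accuracy as the weight
    list(Phi = Phi, beta = beta, weight = acc)
  }, mc.cores = cores)

  # Weighted majority vote across the R weak classifiers.
  predict_ensemble <- function(Xnew) {
    votes <- sapply(fits, function(f)
      as.integer(pnorm(drop(Xnew %*% f$Phi %*% f$beta)) > 0.5))
    w <- vapply(fits, `[[`, numeric(1), "weight")
    as.integer(drop(votes %*% w) / sum(w) > 0.5)
  }
  list(fits = fits, predict = predict_ensemble)
}

# Usage with the toy X, y from the earlier sketch:
# ens  <- ensemble_probit_rp(X, y, R = 25, cores = 2)
# yhat <- ens$predict(X)
```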
References
Achlioptas D (2003). Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4): 671–687. Special Issue on PODS 2001. https://doi.org/10.1016/S0022-0000(03)00025-4
Albert JH, Chib S (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422): 669–679. https://doi.org/10.1080/01621459.1993.10476321
Bhadra A, Datta J, Polson NG, Willard B (2017). The horseshoe+ estimator of ultra-sparse signals. Bayesian Analysis, 12(4): 1105–1131. https://doi.org/10.1214/16-BA1028
Bhattacharya A, Chakraborty A, Mallick BK (2016). Fast sampling with Gaussian scale mixture priors in high-dimensional regression. Biometrika, 103(4): 985–991. https://doi.org/10.1093/biomet/asw042
Bhattacharya A, Pati D, Pillai NS, Dunson DB (2015). Dirichlet–Laplace priors for optimal shrinkage. Journal of the American Statistical Association, 110(512): 1479–1490. PMID: 27019543. https://doi.org/10.1080/01621459.2014.960967
Brown PJ, Griffin JE (2010). Inference with normal-gamma prior distributions in regression problems. Bayesian Analysis, 5(1): 171–188. https://doi.org/10.1214/10-BA507
Candes EJ, Romberg JK, Tao T (2006). Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 59(8): 1207–1223. https://doi.org/10.1002/cpa.20124
Cannings TI, Samworth RJ (2017). Random-projection ensemble classification. Journal of the Royal Statistical Society Series B, 79(4): 959–1035. https://doi.org/10.1111/rssb.12228
Cao J, Durante D, Genton MG (2022). Scalable computation of predictive probabilities in probit models with Gaussian process priors. Journal of Computational and Graphical Statistics, 31(3): 709–720. https://doi.org/10.1080/10618600.2022.2036614
Carvalho CM, Polson NG, Scott JG (2009). Handling sparsity via the horseshoe. In: Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (D van Dyk, M Welling, eds.), volume 5 of Proceedings of Machine Learning Research, 73–80. PMLR, Hilton Clearwater Beach Resort, Clearwater Beach, Florida, USA.
Carvalho CM, Polson NG, Scott JG (2010). The horseshoe estimator for sparse signals. Biometrika, 97(2): 465–480. https://doi.org/10.1093/biomet/asq017
Chipman HA, George EI, McCulloch RE (1998). Bayesian CART model search. Journal of the American Statistical Association, 93(443): 935–948. https://doi.org/10.1080/01621459.1998.10473750
Dasgupta S (2013). Experiments with random projection. arXiv preprint: https://arxiv.org/abs/1301.3849.
Donoho D (2006). Compressed sensing. IEEE Transactions on Information Theory, 52(4): 1289–1306. https://doi.org/10.1109/TIT.2006.871582
Faes C, Ormerod JT, Wand MP (2011). Variational Bayesian inference for parametric and nonparametric regression with missing data. Journal of the American Statistical Association, 106(495): 959–971. https://doi.org/10.1198/jasa.2011.tm10301
George EI, McCulloch RE (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423): 881–889. https://doi.org/10.1080/01621459.1993.10476353
Girolami M, Rogers S (2006). Variational Bayesian multinomial probit regression with Gaussian process priors. Neural Computation, 18(8): 1790–1817. https://doi.org/10.1162/neco.2006.18.8.1790
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, et al. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286: 531–537. https://doi.org/10.1126/science.286.5439.531
Gordon M, Beiser J, Brandt J, et al. (2002). The ocular hypertension treatment study: Baseline factors that predict the onset of primary open-angle glaucoma. Archives of Ophthalmology, 120: 714–34. https://doi.org/10.1001/archopht.120.6.714
Guhaniyogi R, Dunson DB (2015). Bayesian compressed regression. Journal of the American Statistical Association, 110(512): 1500–1514. https://doi.org/10.1080/01621459.2014.969425
Hans C (2009). Bayesian lasso regression. Biometrika, 96(4): 835–845. https://doi.org/10.1093/biomet/asp047
Hoeting JA, Madigan D, Raftery AE, Volinsky CT (1999). Bayesian model averaging: A tutorial. Statistical Science, 14(4): 382–401. https://doi.org/10.1214/ss/1009212519
Holmes CC, Held L (2006). Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Analysis, 1(1): 145–168. https://doi.org/10.1214/06-BA105
Hotelling H (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6): 417–441. https://doi.org/10.1037/h0071325
Johnson WB, Lindenstrauss J (1984). Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26: 189–206. https://doi.org/10.1090/conm/026/737400
Lee HKH, Taddy M, Gray GA (2010). Selection of a representative sample. Journal of Classification, 27: 41–53. https://doi.org/10.1007/s00357-010-9044-x
Loaiza-Maya R, Nibbering D (2022). Fast variational Bayes methods for multinomial probit models. Journal of Business & Economic Statistics, https://doi.org/10.1080/07350015.2022.2139267.
Lorbert A, Blei DM, Schapire RE, Ramadge PJ (2012). A Bayesian boosting model. arXiv preprint: https://arxiv.org/abs/1209.1996.
Madigan D (2004). Likelihood-based data squashing: A modeling approach to instance construction. Data Mining and Knowledge Discovery, 6: 173–190. https://doi.org/10.1023/A:1014095614948
Mitchell TJ, Beauchamp JJ (1988). Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404): 1023–1032. https://doi.org/10.1080/01621459.1988.10478694
Mukherjee S, Sen S (2021). Variational inference in high-dimensional linear regression. arXiv preprint: https://arxiv.org/abs/2104.12232.
Owen A (2003). Data squashing by empirical likelihood. Data Mining and Knowledge Discovery, 7: 101–113. https://doi.org/10.1023/A:1021568920107
Park T, Casella G (2008). The Bayesian lasso. Journal of the American Statistical Association, 103(482): 681–686. https://doi.org/10.1198/016214508000000337
Piironen J, Vehtari A (2017). Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics, 11(2): 5018–5051. https://doi.org/10.1214/17-EJS1337SI
Polson NG, Scott JG, Windle J (2013). Bayesian inference for logistic models using Pólya–Gamma latent variables. Journal of the American Statistical Association, 108(504): 1339–1349. https://doi.org/10.1080/01621459.2013.829001
Roweis ST, Saul LK (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290: 2323–2326. https://doi.org/10.1126/science.290.5500.2323
Tanner MA, Wong WH (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82(398): 528–540. https://doi.org/10.1080/01621459.1987.10478458
Tibshirani R (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, Methodological, 58(1): 267–288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x
Titsias M, Lawrence ND (2010). Bayesian Gaussian process latent variable model. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (YW Teh, M Titterington, eds.), volume 9 of Proceedings of Machine Learning Research, 844–851. PMLR, Chia Laguna Resort, Sardinia, Italy.
Xie H, Huang J (2009). SCAD-penalized regression in high-dimensional partially linear models. The Annals of Statistics, 37(2): 673–696. https://doi.org/10.1214/07-AOS580
Zhang CH (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2): 894–942. https://doi.org/10.1214/09-AOS729
Zou H (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476): 1418–1429. https://doi.org/10.1198/016214506000000735
Zou H, Hastie T (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, Statistical Methodology, 67(2): 301–320. https://doi.org/10.1111/j.1467-9868.2005.00503.x