Eﬃcient Bayesian High-Dimensional Classiﬁcation via Random Projection with Application to Gene Expression Data

Inspired by the impressive successes of compress sensing-based machine learning algorithms, data augmentation-based eﬃcient Gibbs samplers for Bayesian high-dimensional classiﬁcation models are developed by compressing the design matrix to a much lower dimension. Ardent care is exercised in the choice of the projection mechanism, and an adaptive voting rule is employed to reduce sensitivity to the random projection matrix. Focusing on the high-dimensional Probit regression model, we note that the naive implementation of the data augmentation-based Gibbs sampler is not robust to the presence of co-linearity in the design matrix – a setup ubiquitous in n < p problems. We demonstrate that a simple ﬁx based on joint updates of parameters in the latent space circumnavigates this issue. With a computationally eﬃcient MCMC scheme in place, we introduce an ensemble classiﬁer by creating R ( ∼ 25 – 50 ) projected copies of the design matrix, and subsequently running R classiﬁcation models with the R projected design matrix in parallel. We combine the output from the R replications via an adaptive voting scheme. Our scheme is inherently parallelizable and capable of taking advantage of modern computing environments often equipped with multiple cores. The empirical success of our methodology is illustrated in elaborate simulations and gene expression data applications. We also extend our methodology to a high-dimensional logistic regression model and carry out numerical studies to showcase its eﬃcacy.


Introduction
With the advent of modern technologies, it is now commonplace in many disciplines, including but not limited to bioinformatics, ecology, remote sensing etc., to collect data containing massive numbers of predictors, ranging from thousands to millions or more.In such settings, it is commonly of interest to consider classification models (Albert and Chib, 1993;Loaiza-Maya and Nibbering, 2022;Cao et al., 2022) such as where (•) is the cumulative distribution function of N(0, 1), X is an n × p matrix of predictors, p n, y is an n × 1 binary response vector.As traditional techniques such as maximum likelihood cannot be used, a rich variety of alternatives have been proposed mainly in the context of linear regression models, ranging from frequentist penalized optimization methods (Tibshirani, 1996;Zou and Hastie, 2005;Zou, 2006;Zhang, 2010;Xie and Huang, 2009) to Bayesian variable selection or shrinkage priors.Examples include the classical "two-group" discrete mixture priors with a point mass at zero (Mitchell and Beauchamp, 1988;George and McCulloch, 1993;Shin et al., 2015), and continuous shrinkage priors expressed as global-local variance mixtures of Gaussian distribution (Polson and Scott, 2011;Park and Casella, 2008;Hans, 2009;Brown and Griffin, 2010;Carvalho et al., 2009Carvalho et al., , 2010;;Bhadra et al., 2017;Piironen and Vehtari, 2017;Armagan et al., 2013;Bhattacharya et al., 2015).The Bayesian approaches are particularly attractive since they provide probabilistic characterization of uncertainty in the high-dimensional regression coefficients and in the resulting predictions, while penalization methods tend to focus on point estimation.It is well known that computing the posterior under Bayesian variable selection priors is an intractable problem, so that one can at best hope for a rough approximation using Markov chain Monte Carlo (MCMC) sampling unless p is small.However, recent developments in this regard have largely improved the computational aspects of the "two-group" (Biswas et al., 2022) and continuous shrinkage priors (Bhattacharya et al., 2016), but scalability still remains to be an quite open area of enquiry.Alternatively, a commonly used approach is to approximate the posterior with a computationally tractable distribution.This gives rise to variational Bayes approximations (Girolami and Rogers, 2006;Titsias and Lawrence, 2010;Faes et al., 2011;Mukherjee and Sen, 2021).Guhaniyogi and Dunson (2015) proposed a new approach for high-dimensional regression problems based on random projections of the scaled predictor vector prior to analysis which solved several problems simultaneously.In particular, for linear regression models, it completely avoids the computational bottleneck due to the enormous p. Their approach is inspired by the data squashing literature (DuMouchel, 2002;Madigan, 2004;Owen, 2003;Lee et al., 2010) and dramatic success of compressed sensing to facilitate storage and analysis, while retaining the ability to reconstruct the compressed signals with high accuracy under sparsity conditions (Donoho, 2006;Candes et al., 2006).
While such data compression based approaches have largely proved to be extremely successful in various contexts (Guhaniyogi and Dunson, 2015;Cannings and Samworth, 2017), MCMC computation in the context of Bayesian classifications approaches is still illusive.In this article we primarily focus on Bayesian high-dimensional probit regression.We propose a compressed sensing framework equipped with a data augmentation based Gibbs sampler that calculates a conjugate Gaussian-inverse gamma posterior for the regression coefficients corresponding to the compressed predictors in parallel for different random projections in the latent space.Finally, we aggregate the outcomes corresponding to the different compression via a simple but adaptive voting rule.We propose a principled approach to tune the cut-off parameter α of the binary classifier that dramatically improves classification accuracy across extensive simulation and real data examples.Importantly, our proposed methodology is inherently paralizable and continues to enjoy computational tractability even in ultra high-dimensional set up.We also note that, a naive implementation of the data augmentation Gibbs sampler results in poor mixing in the MCMC chains.It turns out that a simple fix based on joint update of the parameters in the latent space alleviates the problem to a large extent, and improves computational efficacy.We end our discussion with an extension to the high-dimensional logistic regression case, where we employ a Polya-Gamma data augmentation with the compressed covariate matrix.
In summary, our primary contributions in this article are three-fold.First, we present a scalable and inherently parallelizable compressed sensing framework equipped with a data augmentation based Gibbs sampler for high-dimensional probit regression.Secondly, for the classification task, we adapt an alternative Gibbs sampling scheme with joint update of the parameters in the latent space that showcases improved mixing.Thirdly, we present an adaptive voting rule involving a data-driven choice a cut-off parameter for our ensemble classifier, that improves the accuracy of our classifiers without significant increase in compute time.
Rest of the paper is organised as follows.Section 2 introduces a data compression strategy, a data augmentation based Gibbs sampler equipped with adaptive cut-off to carry out Bayesian high-dimensional probit regression.Section 3 presents elaborate empirical studies to demonstrate the efficacy of our methodology.Section 4 includes Micro-array gene expression data analyses to showcase practical utility and scalability of our prescription.Section 5 features an extension to our methodology to Bayesian high-dimensional logit regression, as well as an supporting empirical study.In section 6, we conclude.

Notations
For subjects i = 1, ..., n, let y i ∈ Y denote a response and x i = (x i1 , ..., x ip ) ∈ X ∈ R p denote predictors.We consider compressed regression models having the form where is the cumulative distribution function of N(0, 1), is an m × p projection matrix with m < min(n, p), and β = (β 1 , ..., β m ) T are coefficients on the compressed predictors which a priori are from some distribution π(•).To ensure that our inferential procedure is robust to the specific choice of random projection , we consider R different random compression matrices.The sparsity of a projection matrix is controlled by a parameter s, refer to (2.2) for details.Now we are in a position to systematically unfold the pieces of our proposal.

Compression Mechanism
The choice of the projection scheme in (2.1) is a vital component of our methodology, and there is potential merit in attempting to utilize data-driven dimension reduction techniques, i.e, estimate based on the data.To that end, we can turn to the enormous body of literature on linear dimension reduction techniques, including principal component analysis (Hotelling, 1933;Jolliffe and Cadima, 2016), non-negative matrix factorization (Sra and Dhillon, 2005), singular value decomposition (Banerjee and Roy, 2014), sufficient dimension reduction (Adragni and Cook, 2014), semantic mapping (Corrêa and Ludermir, 2007), multi-dimensional scaling (Cox and Cox, 2001), to name a few.Besides, various non-linear dimensionality reduction techniques are available in our arsenal, like kernel-PCA (Mika et al., 1998), locally linear embedding (Roweis and Saul, 2000), stochastic neighbourhood embedding (Hinton andRoweis, 2002), t-SNE (van der Maaten andHinton, 2008), etc.Such techniques are routinely incorporated in predictive models to ensure scalability, but developing data driven projection schemes still convey a huge computational price that is often practically infeasible in ensuing applications.Further, some of the linear dimension reduction techniques, e.g singular value decomposition, principal components analysis, classical multidimensional scaling, etc. suffer from unfavorable local properties (van der Maaten and Hinton, 2008).We make this precise in the next paragraph, focusing on singular value decomposition.
Given the design matrix X = (x 1 , x 2 , . . ., x n ) T , in order to embed the n points in R p into R m via the singular value decomposition, we project them onto the m-dimensional space spanned by the singular vectors corresponding to the m largest singular values of X.This produces an optimal rank m approximation of X under several popular matrix norms.But this optimality implies no guarantees regarding local properties of the resulting embedding.That is, we can easily devise examples where the new distance between a pair of points is arbitrarily smaller than their original distance.In increasing number of modern machine learning applications where dimensionality reduction is desirable, the absence of such local guarantees can make it hard to exploit embeddings algorithmically.
With these limitations in mind, following Guhaniyogi and Dunson (2015), Cannings and Samworth (2017), we do not attempt to estimate based on the data.Instead, we take refuge to random projection schemes, owing to their favourable local properties, and computational simplicity of the resulting algorithms.The seminal paper by Johnson and Lindenstraus (1984) proved the existence of lower dimensional projection mechanisms that satisfy favorable local properties, under certain sufficient conditions.To add on to the impressive theoretical developments in Johnson and Lindenstraus (1984), Achlioptas (2003) provided a concrete construction of such projection mechanisms as follows: (i) Let U ⊂ R p be an arbitrary set of n points collected in a n × p design matrix X.Also, given ε, ν > 0, define m 0 ∝ log n, where the proportionality constant depends on ν and ε. (ii) For any integer m m 0 ; define = ((ψ ij )) to be a p × m random matrix such that ψ ij are independent random variables from the following probability distribution: where, for example, s = 1 or 3. (iii) Let E = (1/ √ m)X , and suppose f : R p → R m projects the i-th row of X to the i-the row of E.Then, for any u, v ∈ U, , with probability at least 1 − n −ν .Random compression matrices with similar local properties can also be obtained via populating a random matrix with elements drawn from the standard normal distribution (Dasgupta, 2013;Li et al., 2006b).But the sparse random projections scheme described in equation (2.2),only processes 1/s-th of the data, and only involves simulation from uniform density.Hence it is typically preferred in many applications.Further, Li et al. (2006a,b) demonstrated that we can even use very sparse random projection with s 3, e.g., s = √ p, or s = p/ log p to significantly speed up the computation.However, for improved robustness, they recommended choosing s less aggressively, e.g, s = √ p.In practice, we observed that fixed s = 5 or 10 works reasonably well in varied empirical settings.Further, to ensure that our methodology is robust to the specific choice of random projection , we consider R different random compression matrices and utilize an ensemble classifier that combines outputs obtained corresponding to each of the random projection.In this context, readers may be aware that the usage of Bayesian ensemble methods is ubiquitous in classification and prediction settings; including Bayesian model averaging (Hoeting et al., 1999), Bayesian additive classification and regression tree (Chipman et al., 1998(Chipman et al., , 2006)), Bayesian version of bagging (Clyde and Lee, 2001), to name a few.Bayesian Model Averaging (Hoeting et al., 1999) describes an umbrella of techniques where we not only quantify model parameter uncertainty, but also the associated model uncertainty.Clyde and Lee (2001) introduced a Bayesian version of bagging based on the Bayesian bootstrap that often results in more efficient estimators.Bayesian CART (Chipman et al., 1998(Chipman et al., , 2006) ) is constructed via an ensemble of binary decision trees built by dividing the predictor space repeatedly into partitions based on splitting rules and it has enjoyed immense empirical success in this context.Other related ideas include Bayesian classifier combination (Kim and Ghahramani, 2012), Bayesian boosting (Lorbert et al., 2012), cascading classifiers (Li et al., 2010), bucket of models, stacking etc.
In this article, we adapt an adaptive voting scheme in 2.5 to combine outputs obtained corresponding to each of the random projections.In our empirical studies, we typically observe that the performance of the classifier improves with the increase in R, but usually plateaus after a while.A choice of R = 25 to 50 provided favorable results across the different numerical studies we considered.Next, we describe MCMC algorithms to learn the parameters in (2.1).

A Naive Blocked Gibbs Sampler
The probit regression model in (2.1) enjoys an equivalent representation via latent variables (Tanner and Wong, 1987;Albert and Chib, 1993), where δ(•) is the Dirac delta measure, y i is simply the deterministic conditional on the sign of the stochastic latent variable z i .In what follows, we denote y = (y 1 , . . ., y n ) T and z 3) is exactly same as in (2.1).The advantage of working with representation in (2.3) is that, for judicious choice of π(β), we easily device efficient blocked Gibbs sampler (Albert and Chib, 1993).In particular, we assume a normal prior on β, i.e, π(β) ≡ N(μ, ), where μ is set to vector of p zeros, and is diagonal matrix of order p.Then, the full conditional distribution of β is remains to be normal: (2.4) The full conditional for each element z i is then truncated normal, where TN a,b (•, •) is the pdf of truncated normal distribution restricted to (a, b).The full conditional distributions in equations (2.4)-(2.5)are extremely straight forward to sample from, and we refer to this algorithm as Algorithm 1: AC from here on.The above latent variable based augmentation method offers a convenient framework to device simple Markov chain Monte Carlo (MCMC) algorithm by iteratively sampling from the full conditional densities.However, a potential problem lurks in that there is strong posterior correlation between regression coefficients β and the latent variables z, clearly indicated in the above model.In the standard Albert-Chib iterative updates, this correlation is likely to cause slow mixing of the MCMC chain.

An Improved Blocked Gibbs Sampler
To combat the issue of auto-correlation in the MCMC chain, based on Held and Holmes (2006), we suggest a simple approach that reduces auto-correlation and dramatically improves the mixing in the Markov chain.Here, we put the factorisation, π(β | z, y) = π(z | y) π(β | z) to use, and update β and z jointly.Note that, in the above display, the distribution π(β | z) is unchanged from earlier (2.4), but now we update z from its marginal distribution obtained via integrating over β.In particular, we assume that the prior for π(β) is a mean zero normal density N(0, ), Algorithm 1 Ensemble AC/ AC+.Input : (a) Data: binary response vector y n×1 , design matrix X n×p ; a query point x ∈ R p .
(c) Hyper-parameters: compression dimension m, number of projections R, sparsity parameter s.
where is a diagonal matrix of order p.Then we have π(z | y) ∼ N(0, I n + XV X T T ) truncated to an appropriate region.Direct sampling from the multivariate truncated normal is known to be difficult, however, it is straightforward to Gibbs sample the distribution, where z −i = (z 1 , . . ., z i−1 , z i+1 , . . ., z n ) T , and Following an update to each z i we recalculate B via where B old and z old i denote the values of B and z i prior to the update of z i , and F i denotes the i-th column of F = V z X T T .The full conditional distributions in equations (2.4) and (2.6) describes our Algorithm 2: HH.It is important to note that, we only need to calculate F , u i and v i need only be performed once before we run the MCMC iterations.Consequently, the algorithm carries little increase in computational burden over the naive Gibbs sampling approach in section 2.3.The simple modification of the sampler based on use of joint updates dramatically improves mixing and sampling efficiency in the Markov chain across all the numerical studies that we have performed.
With a computationally efficient MCMC scheme in place, we next introduce an ensemble classifier via first creating R (∼ 25-50) projected copies of the design matrix, and then running R classification models with the R projected design matrices in parallel.Finally, we combine the output from the R replications via an adaptive voting scheme equipped with a data driven approach, reminiscent of leave one out cross validation, to choose cut-off parameter introduced next.
Algorithm 2 Ensemble HH/ HH+.Input : (a) Data: binary response vector y n×1 , design matrix X n×p ; a query point x ∈ R p .
(c) Hyper-parameters: compression dimension m, number of projections R, sparsity parameter s.

Adaptive Voting Scheme
We calculate an ensemble of predictions corresponding to the R projections, and combine the results via a simple voting scheme following Cannings and Samworth (2017).Suppose {y k (x)} R k=1 ∈ {0, 1} R are the predictions at x corresponding to the R projections, then the combined classifier takes the form: (2.9) where α ∈ (0, 1) is a hyper-parameter, and δ(•) denotes the Dirac delta measure.We emphasise that additional flexibility is afforded by not pre-specifying the voting threshold α to be 0.5.
In order to develop a data-driven approach to determine α, we introduce some notations.Suppose that the pair (X, Y ) takes values in R p × {0, 1} with joint distribution characterised by where π (r) has the cumulative distribution function G n,r (•), r = 0, 1.Note that, the oracle choice of the cut-off parameter α minimises the miss-classification error rate, i.e, Obviously, we cannot calculate α oracle in practice, since (π r , G n,r ), r = 0, 1 are unknown.So, we estimate it via replacing the unknown quantities by their sample counter parts, i.e, αoracle = arg min where and Ĝn,r where δ(•) is the Dirac delta measure.Since empirical distribution functions are piece-wise constant, the objective function in (2.10) does not have a unique minimum, so we choose αoracle to be the average of the smallest and largest minimisers.From here on, we refer to the versions of AC and HH equipped with the adaptive voting scheme in (2.9)-(2.10)as AC+ and HH+, respectively.AC+ and HH+ enjoy superior performance across numerous empirical studies, compared to AC and HH that uses the default choice of α = 0.5.

Simulation Study
In this section we compare the predictive performance of various versions of the proposed high dimensional probit regression methodology equipped with the data augmentation based Gibbs sampler (2.3).We also consider the alternative implementation of data augmentation based Gibbs sampler that potentially enjoys improved computational efficacy (2.4).We take the proposed principled approach to tune the cut-off parameter α of the binary classifier (2.5).Before presenting the empirical results, we propose default set ups for our algorithms.
The algorithms described in Sections 2.3 and 2.4 involve three tuning-parameters: (1) the dimension of the compressed linear subspace m, (2) the sparsity parameter s of the compression matrix, and (3) the number of random projections R. We propose default choices of these hyperparameters for ease of use of the practitioners: 1.Based on the recommendation in Guhaniyogi and Dunson (2015), we also propose to use the dimension of the linear subspace to compress to, m = 40.This choice works reasonably well in practice while preserving the computational convenience.2. We set the sparsity parameter s of the compression matrix to 10, for both sparse and dense examples in Sub-sections 3.1 and 3.2, respectively.It is important to note that the sparsity in the projection matrix increases as s increases, refer to equation (2.2) for details.Consequently, there is potential merit in using smaller s for dense cases in 3.2.We prefer a default choice of s = 10 since it works reasonably well under both the set ups, and provides concrete guideline to the users.Moreover, Li et al. (2006a,b) demonstrated that we can use very sparse random projection with s 3, e.g., s = √ p, or s = p/ log p to significantly speed up the computation.
In particular, when the data are approximately normal, log p of the data usually suffice, i.e. s = p/ log p, because of the exponential tail bounds in normal-like distributions.A less aggressive choice of s = 10 provides a good balance between computational convenience and robustness of the procedure in a wide range of empirical studies.3. We typically observe that the performance of our classifiers improve with the increase in R, but usually plateaus after a while, refer to Figures 1, 3 for details.A choice of R = 25 to 50 provide favorable results across the different numerical studies we considered.For the sake of objectivity, we set the number of random projections R = 50.
In the next two subsections, we carry our repeated simulations and benchmark our proposed classifiers: AC/AC+ and HH/HH+.The performance metric we mainly focus on is the median of the missclassification error rates obtained from the different repetitions of a simulation.For a query point x ∈ R p , a missclassification is observed if y vote (x) calculated via (2.9) is different from the true class level.We also report the between repetition standard deviation of missclassification error to demonstrate the stability of the numerical results.
Figure 1: Choice of R in sparse cases.Miss-classification error rates with varying number of weak classifiers R, for (n, p, ζ ) = (10 2 , 10 4 , 10) and ρ ∈ {0.0, 0.5, 0.7, 0.9} for algorithms AC and AC+.We typically observe that the performance of our classifiers improve with the increase in R, but usually plateaus after a while.This observation holds for all other (n, p, ζ ) combinations presented in Tables 1 and 3 both for AC/AC+ and HH/HH+, and hence the additional plots are not presented to avoid repetitiveness.

Sparse Cases
We generate observations from the high-dimensional Probit Regression model.We consider the following scenarios, and in each of the scenarios we simulate 50 data sets.We keep the sample size fixed at n = 10 2 but vary the number of covariates p = 10 3 , 10 4 and number of non-zero regressions coefficients ζ = 5, 10, in order to assess how sparsity impacts performance.We set first ζ regression coefficient at 1, and remaining p − ζ at 0. Further, we generate the design matrix X such that corr(x i , x j ) = ρ |i−j | and vary ρ = 0.0, 0.5, 0.7, 0.9, in order to assess the sampling efficiency under correlated design.To complete the model-prior specification, we set the prior variance to be the identity matrix.
For MCMC based model implementations, we discard the first 5000 samples as a burnin and draw inference based on the next 5000 samples.In particular, we report median miss classification error rates and corresponding 95% confidence intervals obtained from the repeated simulations.We also report effective sample size (ESS) as an empirical measure of sampling efficiency under the two MCMC schemes, in order to investigate the mixing behavior of our Figure 3: Choice of R in dense cases.Miss-classification error rates with varying number of weak classifiers R, for (n, p, ζ ) = (10 2 , 10 4 , 10) and ρ ∈ {0.0, 0.5, 0.7, 0.9} for algorithms AC and AC+.We typically observe that the performance of our classifiers improve with the increase in R, but usually plateaus after a while.This observation holds for all other (n, p, ζ ) combinations presented in Tables 1 and 3 both for AC/AC+ and HH/HH+, and hence the additional plots are not presented to avoid repetitiveness.
samplers.The effective sample size is a measure of the amount of the auto-correlation in a Markov chain, and essentially amounts to the number of independent samples in the MCMC path.From an algorithmic robustness perspective, it is desirable that the effective sample sizes remain stable across varying sparsity and co-linearity in the design matrix, and this is the aspect we wish to investigate here.
We present the miss-classification error rates of the classifiers averaged over the repetitions and the corresponding standard errors under various simulation scenarios for (n, p) = (10 2 , 10 3 ) in Table 1.While all the versions of our methodology enjoyed similar accuracy, the classifiers equipped with the data-driven choice of the cut-off parameter α compared to the default choice α = 0.5, seem to slightly improve the performance.Next, we present the corresponding effective sample sizes of the classifiers averaged over the repetitions and the corresponding standard errors under various simulation scenarios in Table 2.The alternative implementation of the data augmentation Gibbs sampler seems to be more robust to the presence of co-linearity in the design matrix, compared to the vanilla implementation.In particular, the effective sample size of AC/AC+ drops by 15% as the ρ changes from 0 to 0.9, whereas the drop is only about 1% for HH/HH+.Moreover, the alternative implementation of the data augmentation Gibbs sampler enjoys significantly higher effective sample size.Although, the gain is about 10% for the independent design, we observe a more pronounced improvement of 25% at ρ = 0.9.This indicates Table 1: (Sparse cases) median miss-classification error proportions and between repetition standard error in the subscript for (n, p) = (10 2 , 10 3 ) with varying sparsity ζ .Independent ρ = 0.5 ρ = 0.7 ρ = 0.9  Independent ρ = 0.5 ρ = 0.7 ρ = 0.9 that HH/HH+ will be particularly preferred when the design matrix is highly correlated.
For the case (n, p) = (10 2 , 10 4 ), we present the miss-classification error rates of the classifiers averaged over the repetitions and the corresponding standard errors under various simulation scenarios in Table 3, and the corresponding effective sample sizes of the classifiers averaged over the repetitions and the corresponding standard errors under various simulation scenarios in Table 4. Notably, the effective sample size for AC/AC+ plummeted less compared the case earlier, i.e it drops by 10% as the ρ changes from 0 to 0.9, whereas the is practically no drop for HH/HH+.

Dense Cases
We stick to the same data generation scenarios, except now we set all the regression coefficients to 1.These dense cases correspond to a one dimensional subspace with no sparsity, and are motivated by the practical applications where each of the covariates has small effect on the outcome.We continue to use the same MCMC configurations as before, and focus the same performance metric.
We present the miss-classification error rates of the classifiers and effective sample size  averaged over the repetitions and the corresponding standard errors under various simulation scenarios in Table 1.The classifiers equipped with the data-driven choice of the cut-off parameter α compared to the default choice α = 0.5, still slightly improve the performance and remains to be more robust to the presence of co-linearity in the design matrix.In particular, the effective sample size of AC/AC+ drops by 5% as the ρ changes from 0 to 0.9, whereas the drop is only about 1% for HH/HH+.Moreover, the alternative implementation of the data augmentation Gibbs sampler enjoys significantly higher effective sample size.Although, the gain is only about 1% for the independent design, we observe a more pronounced improvement of more than 10% at ρ = 0.9 that still demonstrates that HH/HH+ may render more efficient when the design matrix is highly correlated.

Leukemia Data
Leukemia data from high-density Affymetrix oligonucleotide arrays were previously analyzed in Golub et al. (1999), and is freely available on the website data.mendeley.com.There are p = 7129 genes and n = 72 samples coming from two classes: 47 in class ALL (acute lymphocytic leukemia) and 25 in class AML (acute mylogenous leukemia).Before classification, we standardize each The three plots from left to right correspond to γ = 0.5, 0.6, 0.7, respectively.
sample to zero mean and unit variance.
To complete the specification of the compression mechanism, we set the dimension m of the linear subspace to compress to at 40, the sparsity parameter of the compression matrix at 5, and the number of random projections R = 25.For MCMC based model implementations, we discard the first 5000 samples as a burn-in and draw inference based on the next 5000 samples.
To evaluate the performance of the classifiers, we randomly split the 72 samples into training and test sets.Specifically, we set approximately 100×γ % of the observations as training samples, and the rest as test samples.The various version of our algorithm are used to the training data, and their performances are evaluated by the test samples.The above procedure is repeated 50 times for γ = 0.4, 0.5, 0.6, respectively, and the distributions of miss-classification errors in Figure 5.The classifier AC+(HH+) equipped with the data-driven choice of the cut-off parameter α, significantly improve the performance compared to the classifier AC (HH) across all the set ups.Further, as γ increases, i.e, we use more and more training samples, the miss-classification error rates decrease.Moreover, the alternative implementation of the data augmentation Gibbs sampler (HH, HH+) enjoys 3-6 times higher effective sample size compared to AC, AC+.

Lung Cancer Data
We evaluate our method by classifying between malignant pleural mesothelioma (MPM) and adenocarcinoma (ADCA) of the lung, freely available on data.mendeley.com.Lung cancer data were analyzed by Gordon et al. (2002).There are 181 tissue samples (31 MPM and 150 ADCA).Each sample is described by 1626 genes.
As in the Leukemia data set, we first standardize the data to zero mean and unit variance, and then apply various classification methods to the standardized data set.We follow the same procedure as that in Leukemia example to randomly split the 181 samples into training and test sets, and utilize the same compression and MCMC specifications.Various classification methods are applied to the training data, and the test errors are calculated using the test data.The procedure is repeated 50 times with γ = 0.4, 0.5, 0.6, respectively, and the distributions of miss-classification errors in Figure 6.The classifier AC+(HH+) equipped with the data-driven choice of the cut-off parameter α, quite dramatically improve the performance compared to the classifier AC (HH) across all the set ups.Moreover, the alternative implementation of the  data augmentation Gibbs sampler (HH, HH+) enjoys 1.5-2 times higher effective sample size compared to AC, AC+.

Prostate Cancer Data
The last example uses the prostate cancer data studied in Singh et al. (2002), also freely available on data.mendeley.com.The training data set contains 102 patient samples, 52 of which (labeled as "tumor") are prostate tumor samples and 50 of which (labeled as "Normal") are prostate samples.There are around 329 genes.
As in the Leukemia data set, we first standardize the data to zero mean and unit variance, and then apply various classification methods to the standardized data set.We follow the same procedure as that in Leukemia example to randomly split the 102 samples into training and test sets, and utilize the same compression and MCMC specifications.Various classification methods are applied to the training data, and the test errors are calculated using the test data.The procedure is repeated 50 times with γ = 0.4, 0.5, 0.6, respectively, and the distributions of miss-classification errors in Figure 7.The classifier AC+(HH+) equipped with the data-driven choice of the cut-off parameter α, tend to improve the performance compared to the classifier AC (HH) across all the set ups.Moreover, the alternative implementation of the data augmentation Gibbs sampler (HH, HH+) enjoys 2.5 − 3 times higher effective sample size compared to AC, AC+.

Extension: Sparse Binary Logistic Regression
Maintaining parity with the section (2.3), we very briefly consider compressed logistic regression models having the form (5.1) where Logit : t → log(t/1−t), 0 < t < 1; is an m×p projection matrix with m < min(n, p); and β = (β 1 , ..., β m ) T are coefficients on the compressed predictors which a priori are from some distribution π(•).The data augmentation mechanism that we adapt here, known as the Polya-Gamma data augmentation scheme (Polson et al., 2013) enjoyed the most empirical success in practice.
To introduce the methodology, we first present the following definitions of Polya-Gamma family of distributions.In particular, the Polya-Gamma distribution PG(b, 0), b > 0 (Polson et al., 2013) is defined as the distribution with the characteristic function: t → (cosh √ t/2) −b , b > 0. The general form PG(b, c), b > 0 of the Polya-Gamma distribution (Polson et al., 2013) has the probability density function: . This family of distributions has been carefully constructed to yield a simple Gibbs sampler for the Bayesian logistic-regression model.We assume π(β) ≡ N(μ, ), and to sample from the posterior distribution using the Polya-Gamma method, simply iterate two steps: where V z = (X T T X + −1 ) −1 , κ = (y 1 −1/2, . . ., y n −1/2) T , and is a diagonal matrix with ii = z i .Noteworthy, the two basic differences in the sampler described in (5.2) from the AC sampler in (2.4)-(2.5)for probit regression are that the full conditional distribution of [β | •] is a scale mixture of Gaussian distributions rather than a location mixture; and the full conditional distribution of [z i | •] are the Polya-Gamma latent distribution instead of the truncated normals.From here on, we refer to the data augmentation based Gibbs sampler in (5.2) by Algorithm 3: PG.As an alternative, we also consider PG equipped with the adaptive choice of cut-off α in (2.5), referred to as PG+.
We consider simulated examples to demonstrate the efficacy of the proposed extension.We generate observations from the high-dimensional Probit Regression model, with specifications described in sections (3.1)-(3.2).Note that, we do not generate data from a high-dimensional logit regression model in order to assess our methodology under mild model misspecification.
To conduct inference, we consider the Gibbs sampler augmented with the Polya-gamma scheme in the latent space.We continue to utilize the compression and MCMC specifications in the simulations presented in sections (3.1)-(3.2).We present the miss-classification error rates of the classifiers averaged over the repetitions and the corresponding standard errors under various simulation scenarios in Table 6.While all the versions of our methodology enjoyed similar accuracy, the classifier equipped with the data-driven choice of the cut-off parameter α compared to the default choice α = 0.5, seems to slightly improve the performance.We also present the effective sample sizes of the classifiers averaged over the repetitions and the corresponding standard errors under various simulation scenarios.The effective sample size of PG/PG+ drops drastically as the ρ changes from 0 to 0.9.In order to maintain the focus of the document on high-dimensional probit regression, and given non-robust simulation results with respect to colinearity in this set up, we leave this as an avenue for future enquiry aimed at designing MCMC schemes robust to presence of co-linearity in the design matrix.

Conclusions
In this article, we presented efficient data augmentation based Gibbs samplers for Bayesian highdimensional Probit and logit models.Focusing on high-dimensional Probit regression model, we demonstrate that the naive implementation of the data augmentation based Gibbs sampler is not robust to the presence of co-linearity in the design matrix-a set up ubiquitous in n < p problems, and considered a simple fix based on joint updates of parameters in the latent space that seems to circumnavigate this issue.With a computationally efficient MCMC scheme in place, we introduced an ensemble classifier via first creating R (25 to 50) projected copies of the design matrix, and then running R classification models with the R projected design matrix in parallel.Finally, we combine the output from the R replications via an adaptive voting scheme reminiscent of leave one out cross validation.Notably, each of the projected design matrix is n × m, compared to the actual design matrix which is n × p.Since m p, each of the projected design matrix induces significantly less storage burden.Moreover, perhaps more importantly, since our scheme is inherently parallelable, it's extremely computationally convenient and is capable of taking advantages of modern multiple cores computing environments.
In principle, ensembles of data augmentation based Gibbs samplers like ours can be developed for high-dimensional multinomial Probit or logit models with ordinal as well as nominal categories.Ensuring the stability of the resulting samplers is an interesting alley for future enquiry.For the sake of brevity of presentation in this article, we do not explore such extensions here.Moreover, a criticism of the compressed regression frameworks is it's inability to carry out variable selection, and this too provides a scope for future exploration.

Figure 5 :
Figure 5: Leukemia data: box plots of miss-classification error rates of AC, AC+, HH, HH+ over 50 random splits of 72 samples, where 100 × γ % of the samples are set as training samples.The three plots from left to right correspond to γ = 0.5, 0.6, 0.7, respectively.

Figure 6 :
Figure 6: Lung cancer data: box plots of miss-classification error rates of AC, AC+, HH, HH+ over 50 random splits of 181 samples, where 100 × γ % of the samples are set as training samples.The three plots from left to right correspond to γ = 0.5, 0.6, 0.7, respectively.

Figure 7 :
Figure 7: Prostate cancer data: box plots of miss-classification error rates of AC, AC+, HH, HH+ over 50 random splits of 102 samples, where 100 × γ % of the samples are set as training samples.The three plots from left to right correspond to γ = 0.5, 0.6, 0.7, respectively.