Imbalanced datasets present a significant challenge for machine learning models, often leading to biased predictions. To address this issue, data augmentation techniques are widely used to generate new samples for the minority class. In this paper, however, we challenge the common assumption that data augmentation is necessary to improve predictions on imbalanced datasets. Instead, we argue that adjusting the classifier's decision cutoffs, without any data augmentation, can produce results similar to those of oversampling techniques. Our study provides theoretical and empirical evidence to support this claim. These findings contribute to a better understanding of the strengths and limitations of different approaches to handling imbalanced data and help researchers and practitioners make informed decisions about which method to use for a given task.
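To make the cutoff-adjustment idea concrete, here is a minimal sketch (not the paper's code; the synthetic dataset, the logistic classifier, and the prior-based choice of cutoff are all illustrative assumptions) comparing a default 0.5 cutoff, a cutoff moved to the minority-class prior, and simple random oversampling:

```python
# Illustrative sketch: decision-threshold adjustment vs. random oversampling
# on a synthetic imbalanced problem. All modeling choices here are assumptions
# for illustration, not the paper's experimental setup.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Baseline: fit once, classify with the default 0.5 cutoff.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
p = clf.predict_proba(X_te)[:, 1]
print("default cutoff F1: ", f1_score(y_te, p >= 0.5))

# Cutoff adjustment: same fitted model, threshold moved to the minority prior.
cutoff = y_tr.mean()
print("adjusted cutoff F1:", f1_score(y_te, p >= cutoff))

# Random oversampling: duplicate minority rows to balance, refit, keep 0.5.
rng = np.random.default_rng(0)
minority = np.where(y_tr == 1)[0]
extra = rng.choice(minority, size=(y_tr == 0).sum() - minority.size, replace=True)
X_os = np.vstack([X_tr, X_tr[extra]])
y_os = np.concatenate([y_tr, y_tr[extra]])
p_os = LogisticRegression(max_iter=1000).fit(X_os, y_os).predict_proba(X_te)[:, 1]
print("oversampled F1:    ", f1_score(y_te, p_os >= 0.5))
```

On runs like this, the adjusted cutoff and the oversampled model typically land close together, which is the comparison the paper formalizes.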
Inspired by the impressive successes of compressed-sensing-based machine learning algorithms, we develop data augmentation-based efficient Gibbs samplers for Bayesian high-dimensional classification models by compressing the design matrix to a much lower dimension. Particular care is exercised in the choice of the projection mechanism, and an adaptive voting rule is employed to reduce sensitivity to the random projection matrix. Focusing on the high-dimensional probit regression model, we note that the naive implementation of the data augmentation-based Gibbs sampler is not robust to the presence of collinearity in the design matrix, a setup ubiquitous in $n \lt p$ problems. We demonstrate that a simple fix based on joint updates of the parameters in the latent space circumvents this issue. With a computationally efficient MCMC scheme in place, we introduce an ensemble classifier by creating R ($\sim 25$–$50$) projected copies of the design matrix and running R classification models on the projected design matrices in parallel. We combine the output of the R replications via an adaptive voting scheme. Our scheme is inherently parallelizable and can take advantage of modern computing environments, which are often equipped with multiple cores. The empirical success of our methodology is illustrated in extensive simulations and in gene expression data applications. We also extend our methodology to the high-dimensional logistic regression model and carry out numerical studies to showcase its efficacy.
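A minimal sketch of the overall pipeline follows, assuming the standard Albert–Chib data-augmentation sampler, Gaussian random projections, and plain majority voting; the paper's joint latent-space update and adaptive voting rule are not reproduced here, and all hyperparameters are illustrative:

```python
# Sketch: probit Gibbs sampling on a compressed design matrix, ensembled
# over R random projections. A simplified stand-in, not the paper's method:
# basic Albert-Chib updates and majority voting replace the paper's joint
# latent-space update and adaptive voting rule.
import numpy as np
from scipy.stats import truncnorm

def probit_gibbs(X, y, n_iter=500, tau2=100.0, rng=None):
    """Albert-Chib Gibbs sampler for probit regression with a N(0, tau2*I) prior."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n, m = X.shape
    V = np.linalg.inv(X.T @ X + np.eye(m) / tau2)  # posterior covariance of beta given z
    L = np.linalg.cholesky(V)
    beta, draws = np.zeros(m), []
    for _ in range(n_iter):
        mu = X @ beta
        # Latent z_i ~ N(mu_i, 1), truncated positive if y_i = 1, negative if y_i = 0.
        a = np.where(y == 1, -mu, -np.inf)
        b = np.where(y == 1, np.inf, -mu)
        z = mu + truncnorm.rvs(a, b, random_state=rng)
        beta = V @ (X.T @ z) + L @ rng.standard_normal(m)
        draws.append(beta.copy())
    return np.asarray(draws)

def ensemble_predict(X, y, X_new, R=25, m=20, rng=None):
    """Fit R probit models on random projections of X; majority-vote the labels."""
    rng = rng if rng is not None else np.random.default_rng(0)
    p = X.shape[1]
    votes = np.zeros(X_new.shape[0])
    for _ in range(R):  # the R fits are independent and could run on separate cores
        Phi = rng.standard_normal((p, m)) / np.sqrt(m)  # Gaussian projection matrix
        draws = probit_gibbs(X @ Phi, y, rng=rng)
        beta_hat = draws[len(draws) // 2:].mean(axis=0)  # posterior mean after burn-in
        votes += (X_new @ Phi @ beta_hat > 0)
    return (votes / R > 0.5).astype(int)
```

The ridge term `np.eye(m) / tau2` from the Gaussian prior keeps the compressed-scale posterior covariance well conditioned, but the one-at-a-time alternation between `z` and `beta` shown here is exactly the naive scheme whose mixing degrades under collinearity, which is what motivates the paper's joint update.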
This paper extends the analysis of the bivariate Seemingly Unrelated Regression (SUR) Tobit model by modeling its nonlinear dependence structure through a copula and assuming non-normal marginal error distributions. For model estimation, the use of copulas enables the (classical) Inference Function for Margins (IFM) method of Joe and Xu (1996), which is computationally more attractive (indeed, feasible) than the full maximum likelihood approach. However, our simulation study shows that the IFM method yields a biased estimate of the copula parameter in the presence of censored observations in both margins. To obtain an unbiased estimate of the copula association parameter, we propose a modified version of the IFM method, which we refer to as the Inference Function for Augmented Margins (IFAM) method. Since the usual asymptotic approach, i.e., the computation of the asymptotic covariance matrix of the parameter estimates, is troublesome here, we propose the use of resampling procedures (bootstrap methods) to obtain confidence intervals for the copula-based SUR Tobit model parameters. Satisfactory results from simulation and empirical studies indicate the adequate performance of our proposed model and methods. We illustrate our procedure using bivariate data on the consumption of salad dressings and lettuce by U.S. individuals.
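The two-step structure of IFM and the bootstrap intervals can be sketched as follows. This is a simplified illustration only: standard normal margins and a Gaussian copula stand in for the paper's censored Tobit margins, and the IFAM augmentation step is not reproduced.

```python
# Sketch: classical IFM (margins first, copula parameter second) with a
# percentile bootstrap CI. Normal margins and a Gaussian copula are
# illustrative assumptions; the paper works with censored Tobit margins.
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar

def ifm_rho(y1, y2):
    """Two-step IFM estimate of a Gaussian-copula correlation parameter."""
    # Step 1: marginal MLEs, then probability-integral transforms of each margin.
    z1 = stats.norm.ppf(stats.norm.cdf(y1, y1.mean(), y1.std()))
    z2 = stats.norm.ppf(stats.norm.cdf(y2, y2.mean(), y2.std()))
    # Step 2: maximize the Gaussian-copula log-likelihood over rho alone,
    # holding the step-1 marginal estimates fixed.
    def nll(rho):
        q = (rho**2 * (z1**2 + z2**2) - 2 * rho * z1 * z2) / (2 * (1 - rho**2))
        return np.sum(0.5 * np.log(1 - rho**2) + q)
    return minimize_scalar(nll, bounds=(-0.99, 0.99), method="bounded").x

def bootstrap_ci(y1, y2, B=500, alpha=0.05, rng=None):
    """Percentile bootstrap interval for the copula association parameter."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n, reps = len(y1), []
    for _ in range(B):
        idx = rng.integers(0, n, n)  # resample (y1_i, y2_i) pairs jointly
        reps.append(ifm_rho(y1[idx], y2[idx]))
    return np.quantile(reps, [alpha / 2, 1 - alpha / 2])
```

The bias that motivates IFAM arises precisely because censored observations distort the step-1 probability-integral transform; the sketch above sidesteps censoring entirely, so it shows only the estimation structure that IFAM modifies.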