Is Augmentation Effective in Improving Prediction in Imbalanced Datasets?
Pub. online: 15 October 2024
Type: Statistical Data Science
Open Access
Received
25 April 2024
Accepted
6 September 2024
Published
15 October 2024
Abstract
Imbalanced datasets present a significant challenge for machine learning models, often leading to biased predictions. To address this issue, data augmentation techniques are widely used to generate new samples for the minority class. However, in this paper, we challenge the common assumption that data augmentation is necessary to improve predictions on imbalanced datasets. Instead, we argue that adjusting the classifier cutoffs without data augmentation can produce results similar to those of oversampling techniques. Our study provides theoretical and empirical evidence to support this claim. Our findings contribute to a better understanding of the strengths and limitations of different approaches to dealing with imbalanced data, and help researchers and practitioners make informed decisions about which methods to use for a given task.
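The cutoff-adjustment idea can be illustrated with a minimal sketch. The scores and labels below are illustrative toy values, not outputs from the paper's experiments: with a 20%-minority sample, a classifier's probability scores for the minority class often sit below the default 0.5 cutoff, so lowering the cutoff can recover minority predictions without any resampling.

```python
# Hypothetical sketch of cutoff adjustment on an imbalanced toy sample.
# All scores and labels are made up for illustration.

def classify(scores, cutoff):
    """Label an example positive when its score meets the cutoff."""
    return [int(s >= cutoff) for s in scores]

def recall(y_true, y_pred):
    """Fraction of true positives recovered among all positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    pos = sum(y_true)
    return tp / pos if pos else 0.0

# Toy probability scores from some classifier; 2 of 10 examples are minority.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.45, 0.30, 0.40, 0.20, 0.15, 0.10, 0.05, 0.25, 0.12, 0.08]

# The default 0.5 cutoff misses both minority examples...
print(recall(y_true, classify(scores, 0.5)))  # 0.0
# ...while a cutoff near the minority prevalence recovers them,
# with no augmentation or oversampling of the training data.
print(recall(y_true, classify(scores, 0.3)))  # 1.0
```

Lowering the cutoff trades some precision (one majority example at 0.40 now crosses the threshold) for recall on the minority class, which is the same trade-off oversampling induces by shifting the fitted scores.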
Supplementary material
The supplementary materials include a zipped file containing the proofs of the theorems and complementary analyses, and a folder containing the code to reproduce our experiments. The code is also available at https://github.com/gabrieloa/augmentation-effective; instructions for running it are in the README.md file.