Journal of Data Science
Is Augmentation Effective in Improving Prediction in Imbalanced Datasets?
Gabriel O. Assunção, Rafael Izbicki, Marcos O. Prates
https://doi.org/10.6339/24-JDS1154
Pub. online: 15 October 2024   Type: Statistical Data Science   Open Access

Received: 25 April 2024
Accepted: 6 September 2024
Published: 15 October 2024

Abstract

Imbalanced datasets present a significant challenge for machine learning models, often leading to biased predictions. To address this issue, data augmentation techniques are widely used to generate new samples for the minority class. However, in this paper, we challenge the common assumption that data augmentation is necessary to improve predictions on imbalanced datasets. Instead, we argue that adjusting the classifier cutoffs without data augmentation can produce similar results to oversampling techniques. Our study provides theoretical and empirical evidence to support this claim. Our findings contribute to a better understanding of the strengths and limitations of different approaches to dealing with imbalanced data, and help researchers and practitioners make informed decisions about which methods to use for a given task.

Supplementary material

The supplementary materials include a zipped file containing the proofs of the theorems and complementary analyses, as well as a folder with the code to reproduce our experiments. The code is also available at https://github.com/gabrieloa/augmentation-effective; instructions for running it are in the README.md file.



Copyright
2024 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.
Open access article under the CC BY license.

Keywords
balanced accuracy; data augmentation; oversampling

Funding
Marcos O. Prates would like to acknowledge CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico) grant 309186/2021-8, FAPEMIG (Fundação de Amparo à Pesquisa do Estado de Minas Gerais) grant APQ-01837-22, and CAPES (Coordenação de Aperfeiçoamento de Pessoal de Nível Superior) for financial support. Rafael Izbicki is grateful for the financial support of CNPq (grants 422705/2021-7 and 305065/2023-8) and FAPESP (grant 2023/07068-1).
