
Journal of Data Science (JDS)

ISSN: 1683-8602, 1680-743X

Publisher: School of Statistics, Renmin University of China

Article ID: JDS1154

DOI: 10.6339/24-JDS1154

Statistical Data Science

Is Augmentation Effective in Improving Prediction in Imbalanced Datasets?

Gabriel O. Assunção¹* (gabrieloliveira1995@gmail.com, https://orcid.org/0000-0003-0379-9690)

Rafael Izbicki (https://orcid.org/0000-0001-8077-4898)

Marcos O. Prates¹

¹Department of Statistics, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

²Department of Statistics, Universidade Federal de São Carlos, São Carlos, Brazil

*Corresponding author. Email: gabrieloliveira1995@gmail.com.

Published: 15 October 2024

Pages: 1–16

Supplementary Material

The supplementary materials include a zipped file containing the proofs of the theorems and complementary analysis, and a folder containing the code to reproduce our experiment. The code is also available at https://github.com/gabrieloa/augmentation-effective; instructions for running it are in the README.md file.

Received: 25 April 2024; Accepted: 6 September 2024

© 2024 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China. Open access article under the CC BY license.

Imbalanced datasets present a significant challenge for machine learning models, often leading to biased predictions. To address this issue, data augmentation techniques are widely used to generate new samples for the minority class. However, in this paper, we challenge the common assumption that data augmentation is necessary to improve predictions on imbalanced datasets. Instead, we argue that adjusting the classifier cutoffs without data augmentation can produce similar results to oversampling techniques. Our study provides theoretical and empirical evidence to support this claim. Our findings contribute to a better understanding of the strengths and limitations of different approaches to dealing with imbalanced data, and help researchers and practitioners make informed decisions about which methods to use for a given task.
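The cutoff-adjustment idea in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the toy data, the helper names `balanced_accuracy` and `best_cutoff`, and the grid of candidate cutoffs are all assumptions made for illustration. The scores stand in for the probability output of any classifier fitted on the original (non-augmented) data.

```python
import numpy as np


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (recall on class 1) and specificity (recall on class 0)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    sensitivity = np.mean(y_pred[y_true == 1] == 1)
    specificity = np.mean(y_pred[y_true == 0] == 0)
    return 0.5 * (sensitivity + specificity)


def best_cutoff(y_true, scores, grid=None):
    """Return (cutoff, balanced accuracy) maximizing balanced accuracy over a grid."""
    scores = np.asarray(scores)
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)  # hypothetical grid of candidate cutoffs
    values = [balanced_accuracy(y_true, (scores >= c).astype(int)) for c in grid]
    best = int(np.argmax(values))
    return float(grid[best]), float(values[best])


# Toy imbalanced data (assumption): about 5% positives, one Gaussian feature.
rng = np.random.default_rng(0)
n = 5000
prior = 0.05
y = (rng.random(n) < prior).astype(int)
x = rng.normal(loc=1.5 * y, scale=1.0)

# Stand-in for a fitted classifier: the exact posterior P(y = 1 | x) of this toy model.
log_odds = np.log(prior / (1 - prior)) + 1.5 * x - 1.5 ** 2 / 2
scores = 1.0 / (1.0 + np.exp(-log_odds))

# Compare the default cutoff 0.5 with a cutoff tuned for balanced accuracy.
bacc_default = balanced_accuracy(y, (scores >= 0.5).astype(int))
cutoff, bacc_tuned = best_cutoff(y, scores)
print(f"cutoff 0.5: {bacc_default:.3f}; tuned cutoff {cutoff:.2f}: {bacc_tuned:.3f}")
```

In this sketch the cutoff is tuned on the same sample it is evaluated on, which keeps the example short; in practice it would be chosen on a held-out validation set.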

Keywords: balanced accuracy; data augmentation; oversampling

Marcos O. Prates acknowledges financial support from CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico) grant 309186/2021-8, FAPEMIG (Fundação de Amparo à Pesquisa do Estado de Minas Gerais) grant APQ-01837-22, and CAPES (Coordenação de Aperfeiçoamento de Pessoal de Nível Superior). Rafael Izbicki is grateful for the financial support of CNPq (grants 422705/2021-7 and 305065/2023-8) and FAPESP (grant 2023/07068-1).
