<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">JDS</journal-id>
<journal-title-group><journal-title>Journal of Data Science</journal-title></journal-title-group>
<issn pub-type="epub">1683-8602</issn><issn pub-type="ppub">1680-743X</issn><issn-l>1680-743X</issn-l>
<publisher>
<publisher-name>School of Statistics, Renmin University of China</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">JDS1154</article-id>
<article-id pub-id-type="doi">10.6339/24-JDS1154</article-id>
<article-categories><subj-group subj-group-type="heading">
<subject>Statistical Data Science</subject></subj-group></article-categories>
<title-group>
<article-title>Is Augmentation Effective in Improving Prediction in Imbalanced Datasets?</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Assunção</surname><given-names>Gabriel O.</given-names></name><email xlink:href="mailto:gabrieloliveira1995@gmail.com">gabrieloliveira1995@gmail.com</email><xref ref-type="aff" rid="j_jds1154_aff_001">1</xref><xref ref-type="corresp" rid="cor1">∗</xref>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0003-0379-9690</contrib-id>
<name><surname>Izbicki</surname><given-names>Rafael</given-names></name><xref ref-type="aff" rid="j_jds1154_aff_002">2</xref>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0001-8077-4898</contrib-id>
<name><surname>Prates</surname><given-names>Marcos O.</given-names></name><xref ref-type="aff" rid="j_jds1154_aff_001">1</xref>
</contrib>
<aff id="j_jds1154_aff_001"><label>1</label>Department of Statistics, <institution>Universidade Federal de Minas Gerais</institution>, Belo Horizonte, <country>Brazil</country></aff>
<aff id="j_jds1154_aff_002"><label>2</label>Department of Statistics, <institution>Universidade Federal de São Carlos</institution>, São Carlos, <country>Brazil</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>∗</label>Corresponding author. Email: <ext-link ext-link-type="uri" xlink:href="mailto:gabrieloliveira1995@gmail.com">gabrieloliveira1995@gmail.com</ext-link>.</corresp>
</author-notes>
<pub-date pub-type="ppub"><year>2024</year></pub-date><pub-date pub-type="epub"><day>15</day><month>10</month><year>2024</year></pub-date><volume content-type="ahead-of-print">0</volume><issue>0</issue><fpage>1</fpage><lpage>16</lpage><supplementary-material id="S1" content-type="document" xlink:href="jds1154_s001.pdf" mimetype="application" mime-subtype="pdf">
<caption>
<title>Supplementary Material</title>
<p>The supplementary materials include a zipped file containing the proofs of the theorems and complementary analysis and a folder containing the code to reproduce our experiment. The code is also available at <uri>https://github.com/gabrieloa/augmentation-effective</uri>; the instructions to run the code are in the <monospace>README.md</monospace> file.</p>
</caption>
</supplementary-material><history><date date-type="received"><day>25</day><month>4</month><year>2024</year></date><date date-type="accepted"><day>6</day><month>9</month><year>2024</year></date></history>
<permissions><copyright-statement>2024 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.</copyright-statement><copyright-year>2024</copyright-year>
<license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>Open access article under the <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">CC BY</ext-link> license.</license-p></license></permissions>
<abstract>
<p>Imbalanced datasets present a significant challenge for machine learning models, often leading to biased predictions. To address this issue, data augmentation techniques are widely used to generate new samples for the minority class. However, in this paper, we challenge the common assumption that data augmentation is necessary to improve predictions on imbalanced datasets. Instead, we argue that adjusting the classifier cutoffs without data augmentation can produce similar results to oversampling techniques. Our study provides theoretical and empirical evidence to support this claim. Our findings contribute to a better understanding of the strengths and limitations of different approaches to dealing with imbalanced data, and help researchers and practitioners make informed decisions about which methods to use for a given task.</p>
</abstract>
<kwd-group>
<label>Keywords</label>
<kwd>balanced accuracy</kwd>
<kwd>data augmentation</kwd>
<kwd>oversampling</kwd>
</kwd-group>
<funding-group><funding-statement>Marcos O. Prates acknowledges financial support from CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico) grant 309186/2021-8, FAPEMIG (Fundação de Amparo à Pesquisa do Estado de Minas Gerais) grant APQ-01837-22, and CAPES (Coordenação de Aperfeiçoamento de Pessoal de Nível Superior). Rafael Izbicki is grateful for the financial support of CNPq (grants 422705/2021-7 and 305065/2023-8) and FAPESP (grant 2023/07068-1).</funding-statement></funding-group>
</article-meta>
</front>
<back>
<ref-list id="j_jds1154_reflist_001">
<title>References</title>
<ref id="j_jds1154_ref_001">
<mixed-citation publication-type="journal"> <string-name><surname>Abdoh</surname> <given-names>SF</given-names></string-name>, <string-name><surname>Rizka</surname> <given-names>MA</given-names></string-name>, <string-name><surname>Maghraby</surname> <given-names>FA</given-names></string-name> (<year>2018</year>). <article-title>Cervical cancer diagnosis using random forest classifier with SMOTE and feature reduction techniques</article-title>. <source><italic>IEEE Access</italic></source>, <volume>6</volume>: <fpage>59475</fpage>–<lpage>59485</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_002">
<mixed-citation publication-type="other"> <string-name><surname>Agarap</surname> <given-names>AF</given-names></string-name> (<year>2018</year>). Statistical analysis on e-commerce reviews, with sentiment classification using bidirectional recurrent neural network (RNN). arXiv preprint: <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1805.03687">https://arxiv.org/abs/1805.03687</ext-link>. Dataset: <ext-link ext-link-type="uri" xlink:href="https://www.kaggle.com/datasets/nicapotato/womens-ecommerce-clothing-reviews">https://www.kaggle.com/datasets/nicapotato/womens-ecommerce-clothing-reviews</ext-link>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_003">
<mixed-citation publication-type="chapter"> <string-name><surname>Akkaradamrongrat</surname> <given-names>S</given-names></string-name>, <string-name><surname>Kachamas</surname> <given-names>P</given-names></string-name>, <string-name><surname>Sinthupinyo</surname> <given-names>S</given-names></string-name> (<year>2019</year>). <chapter-title>Text generation for imbalanced text classification</chapter-title>. In: <source><italic>2019 16th International Joint Conference on Computer Science and Software Engineering (JCSSE)</italic></source>, pages <fpage>181</fpage>–<lpage>186</lpage>. <publisher-name>IEEE</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_004">
<mixed-citation publication-type="chapter"> <string-name><surname>Al Najada</surname> <given-names>H</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>X</given-names></string-name> (<year>2014</year>). <chapter-title>iSRD: Spam review detection with imbalanced data distributions</chapter-title>. In: <string-name><given-names>James</given-names> <surname>Joshi</surname></string-name>, <string-name><given-names>Elisa</given-names> <surname>Bertino</surname></string-name>, <string-name><given-names>Bhavani</given-names> <surname>Thuraisingham</surname></string-name>, <string-name><given-names>Ling</given-names> <surname>Liu</surname></string-name>, editors, <source><italic>Proceedings of the 2014 IEEE 15th International Conference on Information Reuse and Integration (IEEE IRI 2014)</italic></source>, pages <fpage>553</fpage>–<lpage>560</lpage>. <publisher-name>IEEE</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_005">
<mixed-citation publication-type="chapter"> <string-name><surname>Barbieri</surname> <given-names>F</given-names></string-name>, <string-name><surname>Camacho-Collados</surname> <given-names>J</given-names></string-name>, <string-name><surname>Espinosa Anke</surname> <given-names>L</given-names></string-name>, <string-name><surname>Neves</surname> <given-names>L</given-names></string-name> (<year>2020</year>). <chapter-title>TweetEval: Unified benchmark and comparative evaluation for tweet classification</chapter-title>. In: <string-name><given-names>Trevor</given-names> <surname>Cohn</surname></string-name>, <string-name><given-names>Yulan</given-names> <surname>He</surname></string-name>, <string-name><given-names>Yang</given-names> <surname>Liu</surname></string-name>, editors, <source><italic>Findings of the Association for Computational Linguistics: EMNLP 2020</italic></source>, pages <fpage>1644</fpage>–<lpage>1650</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_006">
<mixed-citation publication-type="journal"> <string-name><surname>Breiman</surname> <given-names>L</given-names></string-name> (<year>2001</year>). <article-title>Random forests</article-title>. <source><italic>Machine Learning</italic></source>, <volume>45</volume>(<issue>1</issue>): <fpage>5</fpage>–<lpage>32</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_007">
<mixed-citation publication-type="journal"> <string-name><surname>Brier</surname> <given-names>GW</given-names></string-name> (<year>1950</year>). <article-title>Verification of forecasts expressed in terms of probability</article-title>. <source><italic>Monthly Weather Review</italic></source>, <volume>78</volume>(<issue>1</issue>): <fpage>1</fpage>–<lpage>3</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_008">
<mixed-citation publication-type="journal"> <string-name><surname>Chawla</surname> <given-names>NV</given-names></string-name>, <string-name><surname>Bowyer</surname> <given-names>KW</given-names></string-name>, <string-name><surname>Hall</surname> <given-names>LO</given-names></string-name>, <string-name><surname>Kegelmeyer</surname> <given-names>WP</given-names></string-name> (<year>2002</year>). <article-title>SMOTE: Synthetic minority over-sampling technique</article-title>. <source><italic>Journal of Artificial Intelligence Research</italic></source>, <volume>16</volume>: <fpage>321</fpage>–<lpage>357</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_009">
<mixed-citation publication-type="journal"> <string-name><surname>Chen</surname> <given-names>J</given-names></string-name>, <string-name><surname>Tam</surname> <given-names>D</given-names></string-name>, <string-name><surname>Raffel</surname> <given-names>C</given-names></string-name>, <string-name><surname>Bansal</surname> <given-names>M</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>D</given-names></string-name> (<year>2023</year>a). <article-title>An empirical survey of data augmentation for limited data learning in NLP</article-title>. <source><italic>Transactions of the Association for Computational Linguistics</italic></source>, <volume>11</volume>: <fpage>191</fpage>–<lpage>211</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_010">
<mixed-citation publication-type="other"> <string-name><surname>Chen</surname> <given-names>T</given-names></string-name>, <string-name><surname>He</surname> <given-names>T</given-names></string-name>, <string-name><surname>Benesty</surname> <given-names>M</given-names></string-name>, <string-name><surname>Khotilovich</surname> <given-names>V</given-names></string-name>, <string-name><surname>Tang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Cho</surname> <given-names>H</given-names></string-name>, <etal>et al.</etal> (<year>2023</year>b). <italic>xgboost: Extreme gradient boosting</italic>. R package version 1.7.5.1.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_011">
<mixed-citation publication-type="chapter"> <string-name><surname>Davidson</surname> <given-names>T</given-names></string-name>, <string-name><surname>Warmsley</surname> <given-names>D</given-names></string-name>, <string-name><surname>Macy</surname> <given-names>M</given-names></string-name>, <string-name><surname>Weber</surname> <given-names>I</given-names></string-name> (<year>2017</year>). <chapter-title>Automated hate speech detection and the problem of offensive language</chapter-title>. In: <source><italic>Proceedings of the International AAAI Conference on Web and Social Media</italic></source>, volume <volume>11</volume>, pages <fpage>512</fpage>–<lpage>515</lpage>. Dataset: <uri>https://huggingface.co/datasets/hate_speech_offensive</uri>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_012">
<mixed-citation publication-type="chapter"> <string-name><surname>Feng</surname> <given-names>SY</given-names></string-name>, <string-name><surname>Gangal</surname> <given-names>V</given-names></string-name>, <string-name><surname>Wei</surname> <given-names>J</given-names></string-name>, <string-name><surname>Chandar</surname> <given-names>S</given-names></string-name>, <string-name><surname>Vosoughi</surname> <given-names>S</given-names></string-name>, <string-name><surname>Mitamura</surname> <given-names>T</given-names></string-name>, <etal>et al.</etal> (<year>2021</year>). <chapter-title>A survey of data augmentation approaches for NLP</chapter-title>. In: <string-name><given-names>Chengqing</given-names> <surname>Zong</surname></string-name>, <string-name><given-names>Fei</given-names> <surname>Xia</surname></string-name>, <string-name><given-names>Wenjie</given-names> <surname>Li</surname></string-name>, <string-name><given-names>Roberto</given-names> <surname>Navigli</surname></string-name>, editors, <source><italic>Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021</italic></source>, pages <fpage>968</fpage>–<lpage>988</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_013">
<mixed-citation publication-type="other"> <string-name><surname>Fix</surname> <given-names>E</given-names></string-name>, <string-name><surname>Hodges</surname> <given-names>JL</given-names> <suffix>Jr</suffix></string-name> (<year>1952</year>). Discriminatory analysis-nonparametric discrimination: Small sample performance. Technical report, California Univ Berkeley.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_014">
<mixed-citation publication-type="journal"> <string-name><surname>Gao</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>L-F</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>M-Y</given-names></string-name>, <string-name><surname>Hauptmann</surname> <given-names>A</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Cai</surname> <given-names>A-N</given-names></string-name> (<year>2014</year>). <article-title>Enhanced and hierarchical structure algorithm for data imbalance problem in semantic extraction under massive video dataset</article-title>. <source><italic>Multimedia Tools and Applications</italic></source>, <volume>68</volume>: <fpage>641</fpage>–<lpage>657</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_015">
<mixed-citation publication-type="chapter"> <string-name><surname>Grano</surname> <given-names>G</given-names></string-name>, <string-name><surname>Di Sorbo</surname> <given-names>A</given-names></string-name>, <string-name><surname>Mercaldo</surname> <given-names>F</given-names></string-name>, <string-name><surname>Visaggio</surname> <given-names>CA</given-names></string-name>, <string-name><surname>Canfora</surname> <given-names>G</given-names></string-name>, <string-name><surname>Panichella</surname> <given-names>S</given-names></string-name> (<year>2017</year>). <chapter-title>Android apps and user feedback: A dataset for software evolution and quality improvement</chapter-title>. In: <string-name><given-names>Federica</given-names> <surname>Sarro</surname></string-name>, <string-name><given-names>Emad</given-names> <surname>Shihab</surname></string-name>, <string-name><given-names>Meiyappan</given-names> <surname>Nagappan</surname></string-name>, <string-name><given-names>Marie C.</given-names> <surname>Platenius</surname></string-name>, <string-name><given-names>Daniel</given-names> <surname>Kaimann</surname></string-name>, editors, <source><italic>Proceedings of the 2nd ACM SIGSOFT International Workshop on App Market Analytics</italic></source>, pages <fpage>8</fpage>–<lpage>11</lpage>. Dataset: <uri>https://huggingface.co/datasets/app_reviews</uri>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_016">
<mixed-citation publication-type="chapter"> <string-name><surname>Han</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>W-Y</given-names></string-name>, <string-name><surname>Mao</surname> <given-names>B-H</given-names></string-name> (<year>2005</year>). <chapter-title>Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning</chapter-title>. In: <string-name><given-names>De-Shuang</given-names> <surname>Huang</surname></string-name>, <string-name><given-names>Xiao-Ping</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>Guang-Bin</given-names> <surname>Huang</surname></string-name>, editors, <source><italic>Advances in Intelligent Computing: International Conference on Intelligent Computing, ICIC 2005</italic></source>, <conf-loc>Hefei, China</conf-loc>, <conf-date>August 23–26, 2005</conf-date>, Proceedings, Part I, pages <fpage>878</fpage>–<lpage>887</lpage>. <publisher-name>Springer</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_017">
<mixed-citation publication-type="chapter"> <string-name><surname>He</surname> <given-names>H</given-names></string-name>, <string-name><surname>Bai</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Garcia</surname> <given-names>EA</given-names></string-name>, <string-name><surname>Li</surname> <given-names>S</given-names></string-name> (<year>2008</year>). <chapter-title>ADASYN: Adaptive synthetic sampling approach for imbalanced learning</chapter-title>. In: <source><italic>2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence)</italic></source>, pages <fpage>1322</fpage>–<lpage>1328</lpage>. <publisher-name>IEEE</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_018">
<mixed-citation publication-type="journal"> <string-name><surname>Hu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Tan</surname> <given-names>B</given-names></string-name>, <string-name><surname>Salakhutdinov</surname> <given-names>RR</given-names></string-name>, <string-name><surname>Mitchell</surname> <given-names>TM</given-names></string-name>, <string-name><surname>Xing</surname> <given-names>EP</given-names></string-name> (<year>2019</year>). <article-title>Learning data manipulation for augmentation and weighting</article-title>. <source><italic>Advances in Neural Information Processing Systems</italic></source>, <volume>32</volume>: <fpage>15764</fpage>–<lpage>15775</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_019">
<mixed-citation publication-type="journal"> <string-name><surname>Kaur</surname> <given-names>H</given-names></string-name>, <string-name><surname>Pannu</surname> <given-names>HS</given-names></string-name>, <string-name><surname>Malhi</surname> <given-names>AK</given-names></string-name> (<year>2019</year>). <article-title>A systematic review on imbalanced data challenges in machine learning: Applications and solutions</article-title>. <source><italic>ACM Computing Surveys (CSUR)</italic></source>, <volume>52</volume>(<issue>4</issue>): <fpage>1</fpage>–<lpage>36</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_020">
<mixed-citation publication-type="journal"> <string-name><surname>Kokol</surname> <given-names>P</given-names></string-name>, <string-name><surname>Kokol</surname> <given-names>M</given-names></string-name>, <string-name><surname>Zagoranski</surname> <given-names>S</given-names></string-name> (<year>2022</year>). <article-title>Machine learning on small size samples: A synthetic knowledge synthesis</article-title>. <source><italic>Science Progress</italic></source>, <volume>105</volume>(<issue>1</issue>). <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1177/00368504211029777" xlink:type="simple">https://doi.org/10.1177/00368504211029777</ext-link>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_021">
<mixed-citation publication-type="chapter"> <string-name><surname>Kumar</surname> <given-names>V</given-names></string-name>, <string-name><surname>Choudhary</surname> <given-names>A</given-names></string-name>, <string-name><surname>Cho</surname> <given-names>E</given-names></string-name> (<year>2020</year>). <chapter-title>Data augmentation using pre-trained transformer models</chapter-title>. In: <string-name><given-names>William M.</given-names> <surname>Campbell</surname></string-name>, <string-name><given-names>Alex</given-names> <surname>Waibel</surname></string-name>, <string-name><given-names>Dilek</given-names> <surname>Hakkani-Tur</surname></string-name>, <string-name><given-names>Timothy J.</given-names> <surname>Hazen</surname></string-name>, <string-name><given-names>Kevin</given-names> <surname>Kilgour</surname></string-name>, <string-name><given-names>Eunah</given-names> <surname>Cho</surname></string-name>, <string-name><given-names>Varun</given-names> <surname>Kumar</surname></string-name>, <string-name><given-names>Hadrien</given-names> <surname>Glaude</surname></string-name>, editors, <source><italic>Proceedings of the 2nd Workshop on Life-long Learning for Spoken Language Systems</italic></source>, pages <fpage>18</fpage>–<lpage>26</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_022">
<mixed-citation publication-type="chapter"> <string-name><surname>Li</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>G</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>Y</given-names></string-name> (<year>2010</year>). <chapter-title>Data imbalance problem in text classification</chapter-title>. In: <string-name><given-names>Qingling</given-names> <surname>Li</surname></string-name>, <string-name><given-names>Fei</given-names> <surname>Yu</surname></string-name>, <string-name><given-names>Yun</given-names> <surname>Liu</surname></string-name>, editors, <source><italic>2010 Third International Symposium on Information Processing</italic></source>, pages <fpage>301</fpage>–<lpage>305</lpage>. <publisher-name>IEEE</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_023">
<mixed-citation publication-type="journal"> <string-name><surname>Liaw</surname> <given-names>A</given-names></string-name>, <string-name><surname>Wiener</surname> <given-names>M</given-names></string-name> (<year>2002</year>). <article-title>Classification and regression by randomForest</article-title>. <source><italic>R News</italic></source>, <volume>2</volume>(<issue>3</issue>): <fpage>18</fpage>–<lpage>22</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_024">
<mixed-citation publication-type="journal"> <string-name><surname>Lusted</surname> <given-names>LB</given-names></string-name> (<year>1971</year>). <article-title>Decision-making studies in patient management</article-title>. <source><italic>New England Journal of Medicine</italic></source>, <volume>284</volume>(<issue>8</issue>): <fpage>416</fpage>–<lpage>424</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_025">
<mixed-citation publication-type="chapter"> <string-name><surname>Mohasseb</surname> <given-names>A</given-names></string-name>, <string-name><surname>Bader-El-Den</surname> <given-names>M</given-names></string-name>, <string-name><surname>Cocea</surname> <given-names>M</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>H</given-names></string-name> (<year>2018</year>). <chapter-title>Improving imbalanced question classification using structured SMOTE based approach</chapter-title>. In: <source><italic>2018 International Conference on Machine Learning and Cybernetics (ICMLC)</italic></source>, volume <volume>2</volume>, pages <fpage>593</fpage>–<lpage>597</lpage>. <publisher-name>IEEE</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_026">
<mixed-citation publication-type="other"> <string-name><surname>Newaz</surname> <given-names>A</given-names></string-name>, <string-name><surname>Hassan</surname> <given-names>S</given-names></string-name>, <string-name><surname>Haq</surname> <given-names>FS</given-names></string-name> (<year>2022</year>). An empirical analysis of the efficacy of different sampling techniques for imbalanced classification. arXiv preprint: <uri>https://arxiv.org/abs/2208.11852</uri>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_027">
<mixed-citation publication-type="journal"> <string-name><surname>Padurariu</surname> <given-names>C</given-names></string-name>, <string-name><surname>Breaban</surname> <given-names>ME</given-names></string-name> (<year>2019</year>). <article-title>Dealing with data imbalance in text classification</article-title>. <source><italic>Procedia Computer Science</italic></source>, <volume>159</volume>: <fpage>736</fpage>–<lpage>745</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_028">
<mixed-citation publication-type="journal"> <string-name><surname>Pedregosa</surname> <given-names>F</given-names></string-name>, <string-name><surname>Varoquaux</surname> <given-names>G</given-names></string-name>, <string-name><surname>Gramfort</surname> <given-names>A</given-names></string-name>, <string-name><surname>Michel</surname> <given-names>V</given-names></string-name>, <string-name><surname>Thirion</surname> <given-names>B</given-names></string-name>, <string-name><surname>Grisel</surname> <given-names>O</given-names></string-name>, <etal>et al.</etal> (<year>2011</year>). <article-title>Scikit-learn: Machine learning in Python</article-title>. <source><italic>Journal of Machine Learning Research</italic></source>, <volume>12</volume>: <fpage>2825</fpage>–<lpage>2830</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_029">
<mixed-citation publication-type="book"> <string-name><surname>Quiñonero-Candela</surname> <given-names>J</given-names></string-name>, <string-name><surname>Sugiyama</surname> <given-names>M</given-names></string-name>, <string-name><surname>Schwaighofer</surname> <given-names>A</given-names></string-name>, <string-name><surname>Lawrence</surname> <given-names>ND</given-names></string-name> (<year>2022</year>). <source><italic>Dataset Shift in Machine Learning</italic></source>. <publisher-name>MIT Press</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_030">
<mixed-citation publication-type="book"> <string-name><surname>Ripley</surname> <given-names>BD</given-names></string-name> (<year>2007</year>). <source><italic>Pattern Recognition and Neural Networks</italic></source>. <publisher-name>Cambridge University Press</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_031">
<mixed-citation publication-type="journal"> <string-name><surname>Rupapara</surname> <given-names>V</given-names></string-name>, <string-name><surname>Rustam</surname> <given-names>F</given-names></string-name>, <string-name><surname>Shahzad</surname> <given-names>HF</given-names></string-name>, <string-name><surname>Mehmood</surname> <given-names>A</given-names></string-name>, <string-name><surname>Ashraf</surname> <given-names>I</given-names></string-name>, <string-name><surname>Choi</surname> <given-names>GS</given-names></string-name> (<year>2021</year>). <article-title>Impact of SMOTE on imbalanced text features for toxic comments classification using RVVC model</article-title>. <source><italic>IEEE Access</italic></source>, <volume>9</volume>: <fpage>78621</fpage>–<lpage>78634</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_032">
<mixed-citation publication-type="other"> <string-name><surname>Shleifer</surname> <given-names>S</given-names></string-name> (<year>2019</year>). Low resource text classification with ULMFiT and backtranslation. arXiv preprint: <uri>https://arxiv.org/abs/1903.09244</uri>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_033">
<mixed-citation publication-type="other"> <string-name><surname>Shu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Meng</surname> <given-names>D</given-names></string-name> (<year>2018</year>). Small sample learning in big data era. arXiv preprint: <uri>https://arxiv.org/abs/1808.04572</uri>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_034">
<mixed-citation publication-type="chapter"> <string-name><surname>Stylianou</surname> <given-names>N</given-names></string-name>, <string-name><surname>Chatzakou</surname> <given-names>D</given-names></string-name>, <string-name><surname>Tsikrika</surname> <given-names>T</given-names></string-name>, <string-name><surname>Vrochidis</surname> <given-names>S</given-names></string-name>, <string-name><surname>Kompatsiaris</surname> <given-names>I</given-names></string-name> (<year>2023</year>). <chapter-title>Domain-aligned data augmentation for low-resource and imbalanced text classification</chapter-title>. In: <source><italic>European Conference on Information Retrieval</italic></source>, pages <fpage>172</fpage>–<lpage>187</lpage>. <publisher-name>Springer</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_035">
<mixed-citation publication-type="journal"> <string-name><surname>Sumathi</surname> <given-names>B</given-names></string-name>, <etal>et al.</etal> (<year>2020</year>). <article-title>Grid search tuning of hyperparameters in random forest classifier for customer feedback sentiment prediction</article-title>. <source><italic>International Journal of Advanced Computer Science and Applications</italic></source>, <volume>11</volume>(<issue>9</issue>): <fpage>173</fpage>–<lpage>178</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_036">
<mixed-citation publication-type="journal"> <string-name><surname>Sun</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Genton</surname> <given-names>MG</given-names></string-name> (<year>2011</year>). <article-title>Functional boxplots</article-title>. <source><italic>Journal of Computational and Graphical Statistics</italic></source>, <volume>20</volume>(<issue>2</issue>): <fpage>316</fpage>–<lpage>334</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_037">
<mixed-citation publication-type="journal"> <string-name><surname>Tan</surname> <given-names>X</given-names></string-name>, <string-name><surname>Su</surname> <given-names>S</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Guo</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zuo</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>X</given-names></string-name>, <etal>et al.</etal> (<year>2019</year>). <article-title>Wireless sensor networks intrusion detection based on SMOTE and the random forest algorithm</article-title>. <source><italic>Sensors</italic></source>, <volume>19</volume>(<issue>1</issue>): <elocation-id>203</elocation-id>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_038">
<mixed-citation publication-type="chapter"> <string-name><surname>Tepper</surname> <given-names>N</given-names></string-name>, <string-name><surname>Goldbraich</surname> <given-names>E</given-names></string-name>, <string-name><surname>Zwerdling</surname> <given-names>N</given-names></string-name>, <string-name><surname>Kour</surname> <given-names>G</given-names></string-name>, <string-name><surname>Tavor</surname> <given-names>AA</given-names></string-name>, <string-name><surname>Carmeli</surname> <given-names>B</given-names></string-name> (<year>2020</year>). <chapter-title>Balancing via generation for multi-class text classification improvement</chapter-title>. In: <string-name><given-names>Trevor</given-names> <surname>Cohn</surname></string-name>, <string-name><given-names>Yulan</given-names> <surname>He</surname></string-name>, <string-name><given-names>Yang</given-names> <surname>Liu</surname></string-name>, editors, <source><italic>Findings of the Association for Computational Linguistics: EMNLP 2020</italic></source>, pages <fpage>1440</fpage>–<lpage>1452</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_039">
<mixed-citation publication-type="chapter"> <string-name><surname>Tesfahun</surname> <given-names>A</given-names></string-name>, <string-name><surname>Bhaskari</surname> <given-names>DL</given-names></string-name> (<year>2013</year>). <chapter-title>Intrusion detection using random forests classifier with SMOTE and feature reduction</chapter-title>. In: <string-name><given-names>Vidyasagar</given-names> <surname>Potdar</surname></string-name>, <string-name><given-names>Pritam</given-names> <surname>Shah</surname></string-name>, <string-name><given-names>Rajesh</given-names> <surname>Ingle</surname></string-name>, <string-name><given-names>Fang</given-names> <surname>Liu</surname></string-name>, editors, <source><italic>2013 International Conference on Cloud &amp; Ubiquitous Computing &amp; Emerging Technologies</italic></source>, pages <fpage>127</fpage>–<lpage>132</lpage>. <publisher-name>IEEE</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_040">
<mixed-citation publication-type="journal"> <string-name><surname>van den Goorbergh</surname> <given-names>R</given-names></string-name>, <string-name><surname>van Smeden</surname> <given-names>M</given-names></string-name>, <string-name><surname>Timmerman</surname> <given-names>D</given-names></string-name>, <string-name><surname>Van Calster</surname> <given-names>B</given-names></string-name> (<year>2022</year>). <article-title>The harm of class imbalance corrections for risk prediction models: Illustration and simulation using logistic regression</article-title>. <source><italic>Journal of the American Medical Informatics Association</italic></source>, <volume>29</volume>(<issue>9</issue>): <fpage>1525</fpage>–<lpage>1534</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_041">
<mixed-citation publication-type="journal"> <string-name><surname>Vaz</surname> <given-names>AF</given-names></string-name>, <string-name><surname>Izbicki</surname> <given-names>R</given-names></string-name>, <string-name><surname>Stern</surname> <given-names>RB</given-names></string-name> (<year>2019</year>). <article-title>Quantification under prior probability shift: The ratio estimator and its extensions</article-title>. <source><italic>Journal of Machine Learning Research</italic></source>, <volume>20</volume>(<issue>79</issue>): <fpage>1</fpage>–<lpage>33</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_042">
<mixed-citation publication-type="journal"> <string-name><surname>Wang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Li</surname> <given-names>D</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>L</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>J</given-names></string-name> (<year>2013</year>). <article-title>Sample cutting method for imbalanced text sentiment classification based on BRC</article-title>. <source><italic>Knowledge-Based Systems</italic></source>, <volume>37</volume>: <fpage>451</fpage>–<lpage>461</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_043">
<mixed-citation publication-type="journal"> <string-name><surname>Wu</surname> <given-names>J-L</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>S</given-names></string-name> (<year>2022</year>). <article-title>Application of generative adversarial networks and Shapley algorithm based on easy data augmentation for imbalanced text data</article-title>. <source><italic>Applied Sciences</italic></source>, <volume>12</volume>(<issue>21</issue>): <elocation-id>10964</elocation-id>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_044">
<mixed-citation publication-type="journal"> <string-name><surname>Yeh</surname> <given-names>I-C</given-names></string-name>, <string-name><surname>Lien</surname> <given-names>C-h</given-names></string-name> (<year>2009</year>). <article-title>The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients</article-title>. <source><italic>Expert Systems with Applications</italic></source>, <volume>36</volume>(<issue>2</issue>): <fpage>2473</fpage>–<lpage>2480</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1154_ref_045">
<mixed-citation publication-type="journal"> <string-name><surname>Zhou</surname> <given-names>Z-H</given-names></string-name> (<year>2018</year>). <article-title>A brief introduction to weakly supervised learning</article-title>. <source><italic>National Science Review</italic></source>, <volume>5</volume>(<issue>1</issue>): <fpage>44</fpage>–<lpage>53</lpage>.</mixed-citation>
</ref>
</ref-list>
</back>
</article>
