Generative AI Takes a Statistics Exam: A Comparison of Performance Between ChatGPT3.5, ChatGPT4, and ChatGPT4o-mini
Pub. online: 6 May 2025
Type: Education In Data Science
Open Access
Received
23 August 2024
Accepted
6 March 2025
Published
6 May 2025
Abstract
Many believe that generative AI used as a private tutor has the potential to shrink access and achievement gaps between students and schools with abundant resources and those with fewer. Shrinking the gap is possible only if the paid and free versions of these platforms perform with comparable accuracy. In this experiment, we investigate the performance of GPT versions 3.5, 4, and 4o-mini on the same 16-question statistics exam given to a class of first-year graduate students. While we do not advocate using any generative AI platform to complete an exam, exam questions allow us to explore aspects of ChatGPT's responses to the kinds of questions students typically encounter in a statistics course. Results on accuracy indicate that GPT3.5 would fail the exam, GPT4 would perform well, and GPT4o-mini would fall somewhere in between. While we acknowledge the existence of other generative AI/LLM platforms, our discussion concerns only ChatGPT because it is currently the most widely used platform on college campuses. We further investigate differences among the platforms' answers to each problem using methods developed for text analytics, such as reading-level evaluation and topic modeling. Results indicate that GPT3.5 and GPT4o-mini are more similar to each other than either is to GPT4.
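The reading-level evaluation mentioned above is commonly based on the Flesch reading-ease score (Flesch, 1948), which combines average sentence length and average syllables per word. A minimal, illustrative sketch in Python — the syllable counter here is a naive vowel-group heuristic I am assuming for demonstration, not the dictionary-based counters that production readability tools use:

```python
import re


def count_syllables(word: str) -> int:
    """Naive syllable estimate: count vowel groups, drop a trailing silent 'e'."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)


def flesch_reading_ease(text: str) -> float:
    """Flesch (1948): 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))
```

Higher scores indicate easier text (roughly 90–100 for simple prose, 0–30 for dense academic writing), so applying such a score to each model's answers gives one axis along which GPT3.5, GPT4, and GPT4o-mini responses can be compared.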