Journal of Data Science


Generative AI Takes a Statistics Exam: A Comparison of Performance Between ChatGPT3.5, ChatGPT4, and ChatGPT4o-mini
Monnie McGee, Bivin P. Sadler

https://doi.org/10.6339/25-JDS1174
Pub. online: 6 May 2025      Type: Education in Data Science      Open Access

Received: 23 August 2024
Accepted: 6 March 2025
Published: 6 May 2025

Abstract

Many believe that use of generative AI as a private tutor has the potential to shrink access and achievement gaps between students at schools with abundant resources and those at schools with fewer resources. Shrinking the gap is possible only if paid and free versions of the platforms perform with the same accuracy. In this experiment, we investigate the performance of GPT versions 3.5, 4, and 4o-mini on the same 16-question statistics exam given to a class of first-year graduate students. While we do not advocate using any generative AI platform to complete an exam, the use of exam questions allows us to explore aspects of ChatGPT’s responses to typical questions that students might encounter in a statistics course. Results on accuracy indicate that GPT3.5 would fail the exam, GPT4 would perform well, and GPT4o-mini would perform somewhere in between. While we acknowledge the existence of other generative AI platforms and large language models (LLMs), our discussion concerns only ChatGPT because it is the most widely used platform on college campuses at this time. We further investigate differences among the AI platforms in the answers to each problem using methods developed for text analytics, such as reading level evaluation and topic modeling. Results indicate that the responses of GPT3.5 and GPT4o-mini have characteristics more similar to each other than either has to GPT4.
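To illustrate the reading level evaluation mentioned above, the following is a minimal Python sketch, not the authors' code: it scores a passage with the Flesch Reading Ease formula (Flesch, 1948) using a naive vowel-run syllable heuristic, and the example "model responses" are hypothetical. A full analysis would rely on dedicated readability tooling rather than this simplification.

```python
import re

def count_syllables(word: str) -> int:
    # Approximate syllables as runs of vowels; every word counts at least one.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    # Flesch (1948): 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words).
    # Higher scores indicate text that is easier to read.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    n_syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (n_syllables / n_words)

if __name__ == "__main__":
    # Hypothetical responses used only to show how readability scores compare.
    responses = {
        "Response A": "The p-value is the chance of seeing data at least as "
                      "extreme as the sample if the null hypothesis is true.",
        "Response B": "A p-value quantifies how improbable the observed result "
                      "would be under the null hypothesis; smaller values "
                      "constitute stronger evidence against that hypothesis.",
    }
    for label, text in responses.items():
        print(f"{label}: Flesch Reading Ease = {flesch_reading_ease(text):.1f}")
```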



Copyright
2025 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.
Open access article under the CC BY license.

Keywords
academic integrity; generative AI; inclusive teaching; statistics and data science education; text analytics
