Pub. online: 6 May 2025 · Type: Data Science In Action · Open Access
Journal: Journal of Data Science
Volume 23, Issue 2 (2025): Special Issue: the 2024 Symposium on Data Science and Statistics (SDSS), pp. 416–428
Abstract
The Data Science Consulting Program at North Carolina State University Libraries, in partnership with the Data Science and AI Academy, provides comprehensive support for a wide range of tools and software, including R, Python, MATLAB, ArcGIS, and more, to assist students, faculty, and staff with their data-related needs. This paper explores the integration of generative AI, specifically ChatGPT, into our consultation services, demonstrating how it enhances the efficiency and effectiveness of addressing numerous and diverse requests. ChatGPT has been instrumental in tasks such as data visualization, statistical analysis, and code generation, allowing consultants to quickly resolve complex queries. The paper also discusses the program’s structured approach to consultations, highlighting the iterative process from initial request to resolution. We address challenges like prompt engineering and response variability, offering best practices to maximize the tool’s potential. As AI technology continues to evolve, its role in our data science consultations is expected to expand, improving service quality and the consultant’s ability to handle increasingly complex tasks. The study concludes that ChatGPT is a valuable asset in academic data science, significantly streamlining workflows and broadening the scope of support provided by our program.
Pub. online: 6 May 2025 · Type: Education In Data Science · Open Access
Journal: Journal of Data Science
Volume 23, Issue 2 (2025): Special Issue: the 2024 Symposium on Data Science and Statistics (SDSS), pp. 269–286
Abstract
Many believe that generative AI used as a private tutor has the potential to shrink access and achievement gaps between students and schools with abundant resources and those with fewer resources. Shrinking the gap is possible only if the paid and free versions of these platforms perform with the same accuracy. In this experiment, we investigate the performance of GPT-3.5, GPT-4, and GPT-4o-mini on the same 16-question statistics exam given to a class of first-year graduate students. While we do not advocate using any generative AI platform to complete an exam, exam questions allow us to explore how ChatGPT responds to the typical questions students might encounter in a statistics course. The accuracy results indicate that GPT-3.5 would fail the exam, GPT-4 would perform well, and GPT-4o-mini would fall somewhere in between. While we acknowledge the existence of other generative AI platforms and LLMs, our discussion concerns only ChatGPT because it is currently the most widely used platform on college campuses. We further investigate differences among the platforms' answers to each problem using methods developed for text analytics, such as reading-level evaluation and topic modeling. These results indicate that GPT-3.5 and GPT-4o-mini are more similar to each other than either is to GPT-4.
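The abstract above mentions comparing the platforms' answer texts with topic modeling. As a minimal sketch of that kind of comparison (not the authors' actual pipeline), one could fit latent Dirichlet allocation to the three models' responses and inspect each response's topic weights; the toy answer strings below are hypothetical placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical stand-ins for three platforms' answers to one exam question.
responses = {
    "gpt35": "the mean of the sample is the sum divided by the count",
    "gpt4": "a confidence interval for the mean uses the t distribution and the standard error",
    "gpt4o_mini": "the sample mean estimates the population mean using the sum over the count",
}

# Build a document-term matrix, dropping English stop words.
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(responses.values())

# Fit LDA with two topics; doc_topics rows are per-response topic weights.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(dtm)

for name, weights in zip(responses, doc_topics):
    print(name, weights.round(2))
```

Responses whose topic-weight rows are close to each other (e.g., by cosine similarity) would be judged more similar, which is the spirit of the GPT-3.5 vs. GPT-4o-mini finding.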
Abstract
Technological advances in software development have handled technical details that once made data analysis laborious, easing the work of data analysts but also allowing nonexperts in statistics and computer science to analyze data. As a result, medical research suffers from preventable statistical errors, such as choosing an inappropriate hypothesis test or failing to check model assumptions. Our objective is to create an automated data analysis software package that helps practitioners run non-subjective, fast, accurate, and easily interpretable analyses. We used machine learning to predict the normality of a distribution as an alternative to normality tests and graphical methods, avoiding their downsides. We implemented methods for detecting outliers, imputing missing values, and choosing a threshold at which to cut numerical variables to correct for non-linearity before running a linear regression. We showed that data analysis can be automated. Our normality prediction algorithm outperformed the Shapiro-Wilk test in small samples, with a Matthews correlation coefficient of 0.5 vs. 0.16. The biggest drawback was that we did not find alternatives to the statistical tests for checking linear regression assumptions, which are problematic in large datasets. We also applied our work to a dataset about smoking in teenagers. Because of the open-source nature of our work, these algorithms can be used in future research and projects.
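The evaluation described above scores a normality predictor against the Shapiro-Wilk test using the Matthews correlation coefficient. A minimal sketch of the baseline side of that comparison, under assumed simulation settings (sample size, mixture of normal and exponential data, and a 0.05 cutoff are all choices made here, not taken from the paper), might look like:

```python
import numpy as np
from scipy.stats import shapiro
from sklearn.metrics import matthews_corrcoef

rng = np.random.default_rng(0)
n_samples, size = 500, 10  # 500 small samples of 10 observations each

y_true, y_pred = [], []
for _ in range(n_samples):
    if rng.random() < 0.5:
        x, label = rng.normal(size=size), 1       # truly normal
    else:
        x, label = rng.exponential(size=size), 0  # clearly non-normal
    y_true.append(label)
    # Shapiro-Wilk baseline: failing to reject at p >= 0.05 counts as "normal".
    y_pred.append(int(shapiro(x).pvalue >= 0.05))

mcc = matthews_corrcoef(y_true, y_pred)
print(round(mcc, 2))
```

A learned classifier would replace the `y_pred` line with its own prediction from features of each sample, and the two MCC values could then be compared directly, as in the 0.5 vs. 0.16 result reported above.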