Pub. online: 29 Jan 2026 | Type: Data Science In Action | Open Access
Journal: Journal of Data Science
Volume 24, Issue 2 (2026): Special Issue: The 2025 Symposium on Data Science and Statistics (SDSS 2025), pp. 411–435
Abstract
Advances in AI and automation are reshaping qualitative research workflows, making processes more efficient, accurate, consistent, and scalable. This paper presents innovations developed for the Illinois Needs Assessment project, a statewide initiative led by the Illinois State Board of Education and the American Institutes for Research to conduct comprehensive needs assessments for schools that need intensive or comprehensive support. To address the scale and tight timeline requirements of the project, the team designed three interconnected pipelines that work together to produce a finalized report. The first, an Audio Pipeline, uses Whisper and generative AI to automate transcription, text-based speaker role attribution, thematic coding, and insight generation from focus groups and interviews. The second, a Report Generation Pipeline, integrates Airtable automations with AWS infrastructure to produce customized school reports that merge AI-generated findings with survey data, school performance metrics, and contextual comparisons. Third, the Needs Assessment Summary Report automates the assembly of all quantitative and qualitative inputs into a polished, customizable deliverable that combines efficiency with expert review. Together, these pipelines replace ad hoc manual workflows with reproducible, consistent systems that enhance data quality, reduce error, and broaden access for non-technical users. The integrated design demonstrates how automation and generative AI can reduce manual burdens, shorten delivery timelines, and support timely, data-informed, and human-centered decision-making in education.
Technological advances in software development have automated many technical details, which has made life easier for data analysts but has also allowed nonexperts in statistics and computer science to analyze data. As a result, medical research suffers from statistical errors that could otherwise be prevented, such as errors in choosing a hypothesis test and in checking model assumptions. Our objective is to create an automated data analysis software package that can help practitioners run non-subjective, fast, accurate, and easily interpretable analyses. We used machine learning to predict the normality of a distribution as an alternative to normality tests and graphical methods, avoiding their downsides. We implemented methods for detecting outliers, imputing missing values, and choosing a threshold for cutting numerical variables to correct for non-linearity before running a linear regression. We showed that data analysis can be automated. Our normality prediction algorithm outperformed the Shapiro-Wilk test in small samples, with a Matthews correlation coefficient of 0.5 vs. 0.16. The biggest drawback was that we did not find alternatives to the statistical tests of linear regression assumptions, which are problematic in large datasets. We also applied our work to a dataset about smoking in teenagers. Because our work is open source, these algorithms can be used in future research and projects.
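For readers unfamiliar with the metric used in the comparison above, the Matthews correlation coefficient can be computed from the counts of a binary confusion matrix. The following is a minimal sketch, not the package's own code: the function name and the example labels (1 = "normal", 0 = "not normal") are hypothetical and chosen only for illustration.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = positive, 0 = negative)."""
    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: report 0 when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical labels: whether each simulated sample truly came from a normal
# distribution, and what a classifier or test decided about it.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
mcc = matthews_corrcoef(y_true, y_pred)
print(f"MCC = {mcc:.3f}")
```

The coefficient ranges from -1 (every decision wrong) through 0 (no better than chance) to 1 (every decision right), and it uses all four cells of the confusion matrix, so it stays informative when the two classes are unbalanced.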