Pub. online: 6 May 2025
Type: Education In Data Science
Open Access
Journal: Journal of Data Science
Volume 23, Issue 2 (2025): Special Issue: the 2024 Symposium on Data Science and Statistics (SDSS), pp. 269–286
Abstract
Many believe that the use of generative AI as a private tutor has the potential to shrink access and achievement gaps between students and schools with abundant resources and those with fewer resources. Shrinking the gap is possible only if paid and free versions of the platforms perform with the same accuracy. In this experiment, we investigate the performance of GPT versions 3.5, 4.0, and 4o-mini on the same 16-question statistics exam given to a class of first-year graduate students. While we do not advocate using any generative AI platform to complete an exam, the use of exam questions allows us to explore aspects of ChatGPT’s responses to typical questions that students might encounter in a statistics course. Results on accuracy indicate that GPT 3.5 would fail the exam, GPT4 would perform well, and GPT4o-mini would perform somewhere in between. While we acknowledge the existence of other generative AI platforms and LLMs, our discussion concerns only ChatGPT because it is the most widely used platform on college campuses at this time. We further investigate differences among the AI platforms in the answers for each problem using methods developed for text analytics, such as reading level evaluation and topic modeling. Results indicate that GPT3.5 and 4o-mini are more similar to each other than either is to GPT4.
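As an illustration of what a reading level evaluation can look like, the sketch below computes the Flesch-Kincaid grade level of a text. The abstract does not say which readability measure the authors used, so the choice of Flesch-Kincaid is an assumption, and the syllable counter is a rough vowel-group heuristic rather than a dictionary lookup. The function names are hypothetical.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables as runs of consecutive vowels (heuristic, not exact)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent unless the word ends in 'le'.
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade: 0.39 * words/sentences + 11.8 * syllables/words - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


if __name__ == "__main__":
    simple = "The cat sat. The dog ran."
    dense = "Statistical methodology necessitates considerable deliberation."
    # The denser sentence should receive the higher grade level.
    print(flesch_kincaid_grade(simple) < flesch_kincaid_grade(dense))
```

Applied to each platform's answer to the same exam question, a score of this kind gives one number per answer that can be compared across GPT versions.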
Pub. online: 23 Apr 2025
Type: Statistical Data Science
Open Access
Journal: Journal of Data Science
Volume 23, Issue 2 (2025): Special Issue: the 2024 Symposium on Data Science and Statistics (SDSS), pp. 287–311
Abstract
Support vector data description (SVDD) has drawn significant attention due to its exceptional performance in one-class classification and novelty detection tasks. Nevertheless, standard SVDD assigns the same weight to all slack variables during the modeling process. This can lead to a decline in learning performance if the training data contains erroneous observations or outliers. In this study, an extended SVDD model, Rescale Hinge Loss Support Vector Data Description (RSVDD), is introduced to strengthen the resistance of SVDD to anomalies. This is achieved by redefining the initial optimization problem of SVDD using a rescaled hinge loss function. Because this loss function increases the significance of samples that are more likely to represent the target class while decreasing the impact of samples that are more likely to represent anomalies, RSVDD can be considered a variant of weighted SVDD. To efficiently address the optimization challenge associated with the proposed model, the half-quadratic optimization method was used to derive a dynamic optimization algorithm. Experimental findings on a synthetic data set and a breast cancer data set illustrate the proposed method’s superior performance over existing methods for the settings considered.
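To make the reweighting idea concrete, the sketch below evaluates a rescaled hinge loss on SVDD slack values. The abstract does not state the loss formula, so the form used here, beta * (1 - exp(-eta * slack)) with beta = 1 / (1 - exp(-eta)), is an assumption borrowed from the robust SVM literature, and the weight exp(-eta * slack) is the per-sample weight that half-quadratic optimization typically yields for this loss, up to a constant factor. All function names and the default eta are hypothetical.

```python
import math


def svdd_slack(sq_dist: float, radius_sq: float) -> float:
    """Hinge-type slack: how far a point lies outside the sphere (0 if inside)."""
    return max(0.0, sq_dist - radius_sq)


def rescaled_hinge(slack: float, eta: float = 0.5) -> float:
    """Assumed rescaled hinge loss; bounded above by beta, unlike the plain hinge."""
    if eta <= 0:
        raise ValueError("eta must be positive")
    beta = 1.0 / (1.0 - math.exp(-eta))
    return beta * (1.0 - math.exp(-eta * slack))


def half_quadratic_weight(slack: float, eta: float = 0.5) -> float:
    """Per-sample weight: near 1 for likely targets, near 0 for distant outliers."""
    return math.exp(-eta * slack)


if __name__ == "__main__":
    radius_sq = 1.0
    for sq_dist in (0.5, 1.5, 4.0, 25.0):
        s = svdd_slack(sq_dist, radius_sq)
        print(sq_dist, s, rescaled_hinge(s), half_quadratic_weight(s))
```

Because the loss saturates, a point far outside the sphere contributes at most beta to the objective, which is why gross outliers lose their pull on the fitted boundary.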
Pub. online: 23 Apr 2025
Type: Statistical Data Science
Open Access
Journal: Journal of Data Science
Volume 23, Issue 2 (2025): Special Issue: the 2024 Symposium on Data Science and Statistics (SDSS), pp. 312–331
Abstract
The rapid accumulation and release of data have fueled research across various fields. While numerous methods exist for data collection and storage, data distribution presents challenges, as some datasets are restricted, and certain subsets may compromise privacy if released unaltered. Statistical disclosure control (SDC) aims to maximize data utility while minimizing the disclosure risk, i.e., the risk of individual identification. A key SDC method is data perturbation, with General Additive Data Perturbation (GADP) and Copula General Additive Data Perturbation (CGADP) being two prominent approaches. Both leverage multivariate normal distributions to generate synthetic data while preserving statistical properties of the original dataset. Given the increasing use of machine learning for data modeling, this study compares the performance of various machine learning models on GADP- and CGADP-perturbed data. Using Monte Carlo simulations with three data-generating models and a real dataset, we evaluate the predictive performance and robustness of ten machine learning techniques under data perturbation. Our findings provide insights into the machine learning techniques that perform robustly on GADP- and CGADP-perturbed datasets, extending previous research that primarily focused on simple statistics such as means, variances, and correlations.
Pub. online: 17 Apr 2025
Type: Statistical Data Science
Open Access
Journal: Journal of Data Science
Volume 23, Issue 2 (2025): Special Issue: the 2024 Symposium on Data Science and Statistics (SDSS), pp. 332–352
Abstract
Analysis of nonprobability survey samples has gained much attention in recent years due to their wide availability and the declining response rates within their costly probabilistic counterparts. Still, valid population inference cannot be deduced from nonprobability samples without additional information, which typically takes the form of a smaller survey sample with a shared set of covariates. In this paper, we propose the matched mass imputation (MMI) approach as a means for integrating data from probability and nonprobability samples when common covariates are present in both samples but the variable of interest is available only in the nonprobability sample. The proposed approach borrows strength from the ideas of statistical matching and mass imputation to provide robustness against potential nonignorable bias in the nonprobability sample. Specifically, MMI is a two-step approach: first, a novel application of statistical matching identifies a subset of the nonprobability sample that closely resembles the probability sample; second, mass imputation is performed using these matched units. Our empirical results, from simulations and a real data application, demonstrate the effectiveness of the MMI estimator under nearest-neighbor matching, which almost always outperformed other imputation estimators in the presence of nonignorable bias. We also explore the effectiveness of a bootstrap variance estimation procedure for the proposed MMI estimator.
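The two-step structure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' estimator: it uses a single covariate, absolute distance for nearest-neighbor matching, and a least-squares line as the imputation model, none of which are specified in the abstract. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def nearest_index(x: float, pool: Sequence[float]) -> int:
    """Index of the pool value closest to x (ties go to the first match)."""
    return min(range(len(pool)), key=lambda i: abs(pool[i] - x))


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares intercept and slope for a single covariate."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return mean_y, 0.0
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope


def mmi_mean(prob_x: Sequence[float], prob_w: Sequence[float],
             nonprob_x: Sequence[float], nonprob_y: Sequence[float]) -> float:
    """Weighted mean of y imputed onto the probability sample."""
    # Step 1: keep only nonprobability units matched to some probability unit.
    matched = sorted({nearest_index(x, nonprob_x) for x in prob_x})
    mx: List[float] = [nonprob_x[i] for i in matched]
    my: List[float] = [nonprob_y[i] for i in matched]
    # Step 2: fit the imputation model on matched units, impute, and weight.
    intercept, slope = fit_line(mx, my)
    imputed = [intercept + slope * x for x in prob_x]
    return sum(w * y for w, y in zip(prob_w, imputed)) / sum(prob_w)


if __name__ == "__main__":
    # The last nonprobability unit is far from the probability sample.
    print(mmi_mean([0.1, 1.2, 2.9], [1, 1, 1],
                   [0, 1, 2, 3, 10], [0, 2, 4, 6, 100]))
```

In this toy data the unit at x = 10 is never matched, so it cannot influence the imputation model, which is the sense in which matching guards against units that do not resemble the probability sample.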
Pub. online: 23 Apr 2025
Type: Statistical Data Science
Open Access
Journal: Journal of Data Science
Volume 23, Issue 2 (2025): Special Issue: the 2024 Symposium on Data Science and Statistics (SDSS), pp. 353–369
Abstract
Studying migration patterns driven by extreme environmental events is crucial for building a sustainable society and stable economy. Motivated by a real dataset about human migrations, this paper develops a transformed varying coefficient model for origin and destination (OD) regression to elucidate the complex associations of migration patterns with spatio-temporal dependencies and socioeconomic factors. Existing studies often overlook the dynamic effects of these factors in OD regression. Furthermore, with the increasing ease of collecting OD data, the scale of current OD regression data is typically large, necessitating the development of methods for efficiently fitting large-scale migration data. We address the challenge by proposing a new Bayesian interpretation for the proposed OD models, leveraging sufficient statistics for efficient big data computation. Our method, inspired by migration studies, promises broad applicability across various fields, contributing to refined statistical analysis techniques. Extensive numerical studies are provided, and insights from real data analysis are shared.
Pub. online: 1 Apr 2025
Type: Statistical Data Science
Open Access
Journal: Journal of Data Science
Volume 23, Issue 2 (2025): Special Issue: the 2024 Symposium on Data Science and Statistics (SDSS), pp. 370–388
Abstract
Recent studies have observed a surprising behavior of model test error called the double descent phenomenon, in which increasing model complexity first decreases the test error, after which the error increases and then decreases again. To observe this, we work on a two-layer neural network model with a ReLU activation function designed for binary classification under supervised learning. Our aim is to observe and investigate the mathematical theory behind the double descent behavior of model test error for varying model sizes. We quantify the model size by the ratio of the number of training samples to the dimension of the model. Due to the complexity of the empirical risk minimization procedure, we use the Convex Gaussian Min-Max Theorem to find a suitable candidate for the global training loss.
Pub. online: 12 Dec 2024
Type: Computing In Data Science
Open Access
Journal: Journal of Data Science
Volume 23, Issue 2 (2025): Special Issue: the 2024 Symposium on Data Science and Statistics (SDSS), pp. 389–398
Abstract
The last decade has seen a vast increase in the abundance of data, fuelling the need for data analytic tools that can keep up with data size and complexity. This has changed the way we analyze data: moving away from single data analysts working on their individual computers to large clusters and distributed systems leveraged by dozens of data scientists. Technological advances have been addressing the scalability aspects; however, the resulting complexity means that more people are involved in a data analysis than before. Collaboration and leveraging of others’ work become crucial in the modern, interconnected world of data science. In this article, we propose and describe RCloud, an open-source, web-based, collaborative visualization and data analysis platform. It decouples the user from the location of the data analysis while preserving security, interactivity, and visualization capabilities. Its collaborative features enable data scientists to explore, work together, and share analyses in a seamless fashion. We describe the concepts and design decisions that enabled it to support large data science teams in industry and academia.
Pub. online: 5 May 2025
Type: Computing In Data Science
Open Access
Journal: Journal of Data Science
Volume 23, Issue 2 (2025): Special Issue: the 2024 Symposium on Data Science and Statistics (SDSS), pp. 399–415
Abstract
Recently, the log cumulative probability model (LCPM) and its special case, the proportional probability model (PPM), were developed to relate ordinal outcomes to predictor variables using the log link instead of the logit link. These models permit the estimation of probabilities instead of odds, but the log link requires constrained maximum likelihood estimation (cMLE). An algorithm that efficiently handles cMLE for the LCPM is a valuable resource, as these models are applicable in many settings and their output is easy to interpret. One such implementation is in the R package lcpm. In this era of big data, all statistical models are under pressure to meet new processing demands. This work aimed to improve the algorithm in the R package lcpm to process more input in less time using less memory.
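The need for constrained estimation can be seen in a small sketch. It assumes the PPM takes the form log P(Y <= j | x) = alpha_j + x'beta with a common beta across categories; the abstract does not give the parameterization or sign convention, so this form is an assumption, and the code is not the lcpm package's algorithm. Under a log link the cumulative probability is exp(alpha_j + x'beta), which exceeds 1 whenever the linear predictor is positive, so parameter values must be restricted. The function name is hypothetical.

```python
import math
from typing import List, Sequence


def ppm_category_probs(alphas: Sequence[float], beta: Sequence[float],
                       x: Sequence[float]) -> List[float]:
    """Category probabilities under the assumed log-link cumulative model."""
    shift = sum(b * xi for b, xi in zip(beta, x))
    etas = [a + shift for a in alphas]
    # Log-link constraint: every cumulative probability exp(eta) must be <= 1.
    if any(eta > 0 for eta in etas):
        raise ValueError("infeasible: a cumulative probability would exceed 1")
    # Cumulative probabilities must not decrease across ordered categories.
    if any(later < earlier for earlier, later in zip(etas, etas[1:])):
        raise ValueError("infeasible: cut-points must be nondecreasing")
    cumulative = [math.exp(eta) for eta in etas] + [1.0]
    return [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]


if __name__ == "__main__":
    alphas = [math.log(0.2), math.log(0.5)]  # three ordered categories
    print(ppm_category_probs(alphas, [-0.1], [0.0]))
    print(ppm_category_probs(alphas, [-0.1], [1.0]))
```

Under this assumed form, a one-unit change in x multiplies every cumulative probability by exp(beta), which is the probability-scale interpretation that the logit link does not offer.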
Pub. online: 6 May 2025
Type: Data Science In Action
Open Access
Journal: Journal of Data Science
Volume 23, Issue 2 (2025): Special Issue: the 2024 Symposium on Data Science and Statistics (SDSS), pp. 416–428
Abstract
The Data Science Consulting Program at North Carolina State University Libraries, in partnership with the Data Science and AI Academy, provides comprehensive support for a wide range of tools and software, including R, Python, MATLAB, ArcGIS, and more, to assist students, faculty, and staff with their data-related needs. This paper explores the integration of generative AI, specifically ChatGPT, into our consultation services, demonstrating how it enhances the efficiency and effectiveness of addressing numerous and diverse requests. ChatGPT has been instrumental in tasks such as data visualization, statistical analysis, and code generation, allowing consultants to quickly resolve complex queries. The paper also discusses the program’s structured approach to consultations, highlighting the iterative process from initial request to resolution. We address challenges like prompt engineering and response variability, offering best practices to maximize the tool’s potential. As AI technology continues to evolve, its role in our data science consultations is expected to expand, improving service quality and the consultant’s ability to handle increasingly complex tasks. The study concludes that ChatGPT is a valuable asset in academic data science, significantly streamlining workflows and broadening the scope of support provided by our program.