Spatial data display correlation between observations collected at nearby locations. Generally, machine and deep learning methods either do not account for this correlation or do so only indirectly through correlated features. To account for spatial correlation, we propose preprocessing the data with a spatial decorrelation transform motivated by properties of the multivariate Gaussian distribution and Vecchia approximations. The transformed data can then be fed into any machine or deep learning tool. After the model is fit to the transformed data, its output can be spatially re-correlated via the corresponding inverse transformation. We show that including this spatial adjustment results in higher predictive accuracy on simulated and real spatial datasets.
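As a rough illustration of the decorrelation idea, the following minimal base-R sketch whitens a Gaussian spatial field with an exact Cholesky factor; the paper's Vecchia approach would instead use a sparse approximate factor, and the exponential covariance and its range parameter below are illustrative assumptions:

    # Whitening sketch: for Gaussian data with covariance Sigma = L %*% t(L),
    # z = solve(L, y) has (approximately) uncorrelated entries.
    set.seed(1)
    n     <- 200
    locs  <- cbind(runif(n), runif(n))      # random 2-D locations
    D     <- as.matrix(dist(locs))          # pairwise distances
    Sigma <- exp(-D / 0.2)                  # exponential covariance, range 0.2
    L     <- t(chol(Sigma))                 # lower-triangular factor
    y     <- L %*% rnorm(n)                 # correlated spatial field
    z     <- forwardsolve(L, y)             # decorrelated ("whitened") data
    # z can be fed to a standard ML tool; predictions on the transformed
    # scale are re-correlated via the inverse map: y_hat = L %*% z_hat.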
Competitor rating systems for head-to-head games are typically used to measure playing strength from game outcomes. Ratings computed from these systems are often used to select top competitors for elite events, to pair players of similar strength in online gaming, and to let players track their own strength over time. Most implemented rating systems assume only win/loss outcomes and treat ties as equivalent to half a win and half a loss. However, in games such as chess, the probability of a tie (draw) is demonstrably higher for stronger players than for weaker players, so rating systems that ignore this aspect of game results may produce unreliable strength estimates. We develop a new rating system for head-to-head games based on a model that explicitly acknowledges that the probability of a tie may depend on the strengths of the competitors. The approach uses a Bayesian dynamic modeling framework. Within each time period, posterior updates are computed in closed form using a single Newton-Raphson iteration evaluated at the prior mean. The approach is demonstrated on a large dataset of chess games played in International Correspondence Chess Federation tournaments.
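To illustrate the flavor of such an update, the base-R sketch below performs a single Newton-Raphson posterior step at the prior mean under a Davidson-type win/tie/loss likelihood, a common tie-aware model; the paper's exact likelihood and updating equations may differ, and all numbers (opponent ratings, tie parameter, prior) are illustrative:

    # One-step Newton-Raphson update for one player's rating theta ~ N(m0, v0),
    # playing opponents of known strength under a Davidson-type tie model.
    loglik <- function(theta, opp, res, nu = 0.5) {
      a <- exp(theta); b <- exp(opp)
      den <- a + b + nu * sqrt(a * b)
      p <- cbind(win = a / den, tie = nu * sqrt(a * b) / den, loss = b / den)
      sum(log(p[cbind(seq_along(res), res)]))
    }
    logpost <- function(theta, opp, res, m0, v0)
      loglik(theta, opp, res) - (theta - m0)^2 / (2 * v0)
    newton_update <- function(opp, res, m0, v0, h = 1e-4) {
      f  <- function(t) logpost(t, opp, res, m0, v0)
      g  <- (f(m0 + h) - f(m0 - h)) / (2 * h)           # numerical gradient
      hs <- (f(m0 + h) - 2 * f(m0) + f(m0 - h)) / h^2   # numerical Hessian
      c(mean = m0 - g / hs, var = -1 / hs)              # Gaussian approximation
    }
    opp <- c(0.2, -0.1, 0.5)        # opponent ratings (assumed known)
    res <- c(1L, 2L, 3L)            # 1 = win, 2 = tie, 3 = loss
    newton_update(opp, res, m0 = 0, v0 = 1)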
High- or ultra-high-dimensional data are becoming increasingly common in many fields. They often display diverse characteristics, including heterogeneity, longitudinal responses, and imbalanced measurements. These complexities make it challenging to integrate different modeling options and their combinations so as to fully leverage this rich data source. This paper provides an easy-to-use, stand-alone R package, geeVerse, that can implement any combination of 1) simultaneous variable selection and estimation, 2) quantile or mean regression for heterogeneous data, 3) longitudinal or cross-sectional data analysis, 4) balanced or imbalanced data, and 5) moderate-, high-, or even ultra-high-dimensional data. To accomplish this, we propose computationally efficient implementations of penalized generalized estimating equations (GEE) for quantile and mean regression. We present multiple applications with ultra-high-dimensional data, including analysis of a resampled genetic dataset, quantile and mean regressions, analysis of cross-sectional and longitudinal data, differing correlation structures, and differing numbers of repeated measurements per subject. We also demonstrate our approach on two real data applications.
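As a point of reference for what a GEE update involves, here is a minimal base-R illustration of one unpenalized Fisher-scoring step for mean regression with an exchangeable working correlation; it is not the geeVerse API and omits the penalization, quantile loss, and efficiency refinements the package provides:

    # One GEE scoring step: beta_new = beta + A^{-1} u, where
    # u = sum_i Xi' R^{-1} (yi - Xi beta) and A = sum_i Xi' R^{-1} Xi.
    gee_step <- function(beta, y, X, id, alpha = 0.3) {
      A <- matrix(0, ncol(X), ncol(X)); u <- numeric(ncol(X))
      for (i in unique(id)) {
        idx <- which(id == i); ni <- length(idx)
        R   <- matrix(alpha, ni, ni); diag(R) <- 1   # exchangeable correlation
        Xi  <- X[idx, , drop = FALSE]
        ri  <- y[idx] - Xi %*% beta
        W   <- solve(R)
        A   <- A + t(Xi) %*% W %*% Xi
        u   <- u + t(Xi) %*% W %*% ri
      }
      beta + solve(A, u)
    }
    set.seed(1)
    id <- rep(1:50, each = 4)                        # 50 subjects, 4 visits each
    X  <- cbind(1, rnorm(200)); y <- X %*% c(1, 2) + rnorm(200)
    gee_step(c(0, 0), y, X, id)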
The rapid development of artificial intelligence (AI) tools, particularly generative models, has introduced significant challenges in academic assessment. Students increasingly rely on AI to complete assignments, complicating the evaluation of their true understanding and effort. This paper examines the limitations of AI detection tools, the inadequacies of traditional teaching methods in this context, and the potential for responsibly integrating AI into educational practices. Drawing on insights from educators and recent developments in AI, the paper proposes strategies for adapting assessment methods to ensure academic integrity while embracing technological advancements. The findings underscore the need for a balanced approach that leverages AI’s benefits while mitigating its risks.
Neuroimaging technology has received considerable attention in recent years. One of the key problems in imaging data analysis is heterogeneity among individual subjects. In particular, the relationship between imaging biomarkers and clinical outcomes may vary across individuals. Popular existing statistical methodologies, such as functional linear regression and high-dimensional linear regression, can be inadequate because a homogeneous regression relationship is assumed for all subjects. In this paper, we propose the Subject-Specific Scalar-on-Image Regression (S3IR) model to handle heterogeneous populations. Specifically, we use a binary subject-specific masking image to capture the heterogeneous sparsity among individuals. The proposed S3IR model incorporates the spatial structure of the imaging data and achieves both local smoothness and subject-specific sparsity of the estimated regression coefficients. Furthermore, we design an EM-type adaptive algorithm to estimate the model coefficients. Simulation studies show the superior performance of our proposed method over existing ones in handling heterogeneity. Finally, we apply the S3IR model to analyze data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). The results show that our model can effectively identify interpretable and significant disease-related regions and improve prediction of cognitive scores.
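One plausible formalization of the masking idea, stated in our own notation rather than necessarily the paper's exact specification: $y_i = \sum_{v} X_i(v)\,\beta(v)\,m_i(v) + \varepsilon_i$, where $y_i$ is the clinical outcome, $X_i$ the image, $\beta$ a coefficient image shared across subjects, $m_i(v) \in \{0,1\}$ the subject-specific binary mask, and $v$ ranges over image locations; local smoothness would be encouraged in $\beta$ and sparsity in each $m_i$.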
The extraordinary capabilities of large language models (LLMs) such as ChatGPT and GPT-4 are in part unleashed by aligning them with reward models that are trained on human preferences represented as rankings of responses to prompts. In this paper, we document the phenomenon of reward collapse, an empirical observation in which the prevailing ranking-based approach results in an identical reward distribution for diverse prompts during the terminal phase of training. This outcome is undesirable, as open-ended prompts like “write a short story about your best friend” should yield a continuous range of rewards for their completions, while specific prompts like “what is the capital city of New Zealand” should generate either high or low rewards. Our theoretical investigation reveals that reward collapse is primarily due to the insufficiency of the ranking-based objective function to incorporate prompt-related information during optimization. We then derive closed-form expressions for the reward distribution associated with a set of utility functions in an asymptotic setting. Based on the reward distributions for different utility functions, we introduce a prompt-aware optimization scheme that provably admits a prompt-dependent reward distribution within the interpolating regime. Our experimental results suggest that our proposed prompt-aware utility functions significantly alleviate reward collapse during the training of reward models.
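For concreteness, a standard instance of a ranking-based objective is the Bradley-Terry pairwise loss over ranked completions, sketched below in base R; the paper studies a broader family of utility functions, but the sketch makes the structural point visible: the loss depends only on reward differences and contains no prompt-specific term, consistent with the insufficiency identified above.

    # Pairwise ranking loss over a ranked list of completion rewards
    # (best completion first): sum over pairs of -log sigmoid(r_i - r_j).
    ranking_loss <- function(r) {
      n <- length(r); loss <- 0
      for (i in 1:(n - 1))
        for (j in (i + 1):n)
          loss <- loss - log(plogis(r[i] - r[j]))
      loss
    }
    ranking_loss(c(2, 1, 0, -1))   # four completions ranked best to worst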
When collaborating with students, colleagues, and practitioners, one soon realizes how inefficient it is to send around emails with multiple attachments, especially when changes are made to several types of documents (for example, text, code, PDF) simultaneously by several collaborators. Using a version control system (VCS) can greatly improve joint workflows, from file sharing, including merging changes from different collaborators, to providing access to past versions of the shared work, while allowing each collaborator to work under her/his preferred setup (for example, text editor or file manager). A lot of technical and specialized information about VCSes is available online, but, as is often the case, it can be overwhelming for beginners. Knowing the basics well is more important than getting lost in the vast number of options VCSes offer. Moreover, the basics are sufficient to enjoy using VCSes and to see their value in collaborative work; additional features can be picked up along the way as needed. We focus on such fundamentals for the centralized VCS SVN and the distributed VCS Git, and explain in simple terms how these systems can be set up and used to increase efficiency in collaborative workflows.
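To make the fundamentals concrete, a minimal everyday Git workflow looks like the following (the repository URL and file names are placeholders; the rough SVN analogues are svn checkout, svn update, and svn commit):

    git clone https://example.org/project.git   # obtain a working copy
    cd project
    git pull                                    # fetch and merge collaborators' changes
    # ... edit files ...
    git add report.tex analysis.R               # stage the modified files
    git commit -m "Describe what changed"       # record a new version locally
    git push                                    # share the commit with collaborators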
Precision medicine is an innovative approach that aims to tailor medical treatments and interventions to patients based on their individual characteristics. Several estimation techniques, including Q-learning, have been developed to determine optimal treatment rules. However, the applicability of these methods depends on the availability of precisely measured variables. This study extends the scope of Q-learning to accommodate compound outcomes, deviating from the commonly assumed univariate outcome, and further handles data with mismeasurement in both binary and continuous covariates. Two methods are described to mitigate the impact of mismeasurement. Numerical studies reveal that mismeasurement in covariates leads to notable estimation bias in the parameters indexing the optimal treatment rule, and that the methods addressing measurement error yield improved results.
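For orientation, the following base-R sketch shows naive single-stage Q-learning with a univariate outcome, the baseline setting that the study extends; no measurement-error correction is applied here, and the data-generating model is illustrative:

    # Single-stage Q-learning: fit a posited Q-function by regression, then
    # treat whenever the fitted treatment contrast is positive.
    set.seed(1)
    n <- 500
    x <- rnorm(n)                          # covariate (assumed error-free here)
    a <- rbinom(n, 1, 0.5)                 # randomized binary treatment
    y <- 1 + x + a * (0.5 - x) + rnorm(n)  # true treatment effect is 0.5 - x
    fit <- lm(y ~ x * a)                   # posited linear Q-function
    b   <- coef(fit)
    d_opt <- as.integer(b["a"] + b["x:a"] * x > 0)   # estimated optimal rule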
Time-to-event data without a well-defined time origin commonly arise in observational studies that retrospectively collect survival endpoints. For instance, after enrolling participants who have or have not received a specific treatment, an event status can be observed for all participants; however, the start date of treatment is observable only for the treatment group. The corresponding time origin does not exist for the control group, resulting in missing survival times. Complete-case analysis is often considered the standard approach, but it disregards all information from the control group and does not allow a comparison of the survival distributions. To address this challenge, we propose a novel semiparametric proportional hazards model that treats the missing time origins as nuisance parameters. We approximate the risk sets with cumulative normal distributions to handle these nuisance parameters and develop estimation and inference procedures for the proposed estimator. We study the asymptotic properties of the estimator and conduct simulation studies to validate its finite-sample properties. Analysis of data from a recent SARS-CoV-2 seroprevalence study illustrates the applicability of our methods. The proposed methods are implemented in the R package coxphm.
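The risk-set approximation can be sketched directly: below, the at-risk indicator in the Cox partial likelihood is replaced by a normal CDF, in the spirit of the device described above (a schematic base-R version applied to fully observed times purely for illustration; the bandwidth and the simulated data are arbitrary):

    # Smoothed partial log-likelihood: I(t_j >= t_i) ~ pnorm((t_j - t_i) / h).
    smoothed_pl <- function(beta, time, status, X, h = 0.1) {
      eta <- as.vector(X %*% beta); ll <- 0
      for (i in which(status == 1)) {
        w  <- pnorm((time - time[i]) / h)    # smooth at-risk weights
        ll <- ll + eta[i] - log(sum(w * exp(eta)))
      }
      ll
    }
    set.seed(1); n <- 100
    X <- cbind(rnorm(n))
    time <- rexp(n, exp(0.5 * X)); status <- rbinom(n, 1, 0.8)
    optimize(function(b) -smoothed_pl(b, time, status, X), c(-2, 2))$minimum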
We propose a differentially private Bayesian framework for envelope regression, a technique that improves estimation efficiency by modeling the response as a function of a low-dimensional subspace of the predictors. Our method applies the analytic Gaussian mechanism to privatize sufficient statistics from the data, ensuring formal $(\epsilon,\delta)$-differential privacy. We develop a tailored Gibbs sampling algorithm that performs valid Bayesian inference using only the noisy sufficient statistics. This approach leverages the envelope structure to isolate the variation in the predictors that is relevant to the response, reducing estimation error compared to standard regression under the same privacy constraints. Through simulation studies, we demonstrate improved estimation accuracy and tighter credible intervals relative to a differentially private Bayesian linear regression baseline.
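A minimal base-R sketch of the privatize-the-sufficient-statistics step for plain linear regression follows; it uses the classical rather than the analytic Gaussian mechanism for simplicity, omits the envelope structure and the Gibbs sampler, and the clipping bound and privacy parameters are illustrative:

    # Privatize S = [X y]'[X y] with the (classical) Gaussian mechanism, then
    # do downstream inference using only the noisy statistics.
    set.seed(1)
    n <- 500; p <- 3
    X <- matrix(rnorm(n * p), n, p); y <- X %*% c(1, -1, 0.5) + rnorm(n)
    B <- 3                                       # row L2 clipping bound
    Z <- cbind(X, y)
    Z <- Z * pmin(1, B / sqrt(rowSums(Z^2)))     # clip each record
    S <- crossprod(Z)                            # sufficient statistics
    eps <- 1; delta <- 1e-5
    # each record contributes z z', with Frobenius norm <= B^2, so B^2 bounds
    # the L2 sensitivity of S under add/remove-one adjacency
    sigma <- B^2 * sqrt(2 * log(1.25 / delta)) / eps
    E <- matrix(0, p + 1, p + 1)                 # noise on the upper triangle,
    E[upper.tri(E, diag = TRUE)] <- rnorm((p + 1) * (p + 2) / 2, 0, sigma)
    E <- E + t(E) - diag(diag(E))                # ... mirrored to keep symmetry
    S_priv <- S + E
    # A point estimate from the noisy statistics alone (small ridge for stability):
    XtX <- S_priv[1:p, 1:p]; Xty <- S_priv[1:p, p + 1]
    beta_hat <- solve(XtX + diag(1e-2, p), Xty)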