Identifying treatment effect modifiers (i.e., moderators) plays an essential role in improving treatment efficacy when substantial treatment heterogeneity exists. However, studies are often underpowered for detecting treatment effect modifiers, and exploratory analyses that examine one moderator per statistical model often yield spurious interactions. Therefore, in this work, we focus on creating an intuitive and readily implementable framework to facilitate the discovery of treatment effect modifiers and to make treatment recommendations for time-to-event outcomes. To minimize the impact of a misspecified main effect and avoid complex modeling, we construct the framework by matching the treated with the controls and modeling the conditional average treatment effect via regressing the difference in the observed outcomes of a matched pair on the averaged moderators. Inverse-probability-of-censoring weighting is used to handle censored observations. As matching is the foundation of the proposed methods, we explore different matching metrics and recommend the use of Mahalanobis distance when both continuous and categorical moderators are present. After matching, the proposed framework can be flexibly combined with popular variable selection and prediction methods such as linear regression, least absolute shrinkage and selection operator (Lasso), and random forest to create different combinations of potential moderators. The optimal combination is determined by the out-of-bag prediction error and the area under the receiver operating characteristic curve in making correct treatment recommendations. We compare the performance of various combined moderators through extensive simulations and the analysis of real trial data. Our approach can be easily implemented using existing R packages, resulting in a straightforward optimal combined moderator to make treatment recommendations.
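As a rough illustration of the matching step described above (not the authors' implementation, which also handles censoring via inverse-probability-of-censoring weighting), the sketch below greedily pairs treated units with controls by Mahalanobis distance and then regresses each matched pair's outcome difference on the pair's averaged moderators; all data and names are invented:

```python
import numpy as np

def mahalanobis_match(X_treat, X_ctrl):
    """Greedy 1:1 matching of treated units to controls by Mahalanobis distance."""
    # Pooled covariance over all units defines the distance metric
    VI = np.linalg.inv(np.cov(np.vstack([X_treat, X_ctrl]), rowvar=False))
    pairs, used = [], set()
    for i, x in enumerate(X_treat):
        d = np.einsum('ij,jk,ik->i', X_ctrl - x, VI, X_ctrl - x)
        d[list(used)] = np.inf          # each control is used at most once
        j = int(np.argmin(d))
        used.add(j)
        pairs.append((i, j))
    return pairs

rng = np.random.default_rng(0)
Xt, Xc = rng.normal(size=(20, 3)), rng.normal(size=(60, 3))
Yt, Yc = rng.normal(size=20), rng.normal(size=60)
pairs = mahalanobis_match(Xt, Xc)

# Regress the matched-pair outcome difference on the averaged moderators
diff = np.array([Yt[i] - Yc[j] for i, j in pairs])
Z = np.array([(Xt[i] + Xc[j]) / 2 for i, j in pairs])
beta, *_ = np.linalg.lstsq(np.column_stack([np.ones(len(Z)), Z]), diff, rcond=None)
```

The fitted coefficients on the averaged moderators are what a subsequent variable selection step (e.g., Lasso) would operate on.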
Due to long-standing federal restrictions on cannabis-related research, the implications of cannabis legalization for traffic and occupational safety are understudied. Accordingly, there is a need for objective and validated measures of acute cannabis impairment that may be applied in public safety and occupational settings. Pupillary response to light may offer an avenue for detection that outperforms typical sobriety tests and tetrahydrocannabinol concentrations. We developed a video processing and analysis pipeline that extracts pupil sizes during a light stimulus test administered with goggles using infrared videography. The analysis compared pupil size trajectories in response to a light stimulus, before and after smoking, for participants with occasional, daily, and no cannabis use. Pupils were segmented using a combination of image pre-processing techniques and segmentation algorithms, which were validated against manually segmented data and found to achieve 99% precision and a 94% F-score. Features extracted from the pupil size trajectories captured pupil constriction and rebound dilation and were analyzed using generalized estimating equations. We find that acute cannabis use results in less pupil constriction and slower pupil rebound dilation in the light stimulus test.
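The two trajectory features named above can be illustrated on a synthetic pupil-size curve (the actual paper extracts its features from segmented video; the functional form and thresholds here are invented for demonstration):

```python
import numpy as np

def pupil_features(t, size):
    """Crude features from a pupil-size trajectory around a light stimulus:
    constriction amplitude (baseline minus minimum) and the post-minimum
    rebound-dilation slope from a least-squares line fit."""
    baseline = size[:5].mean()            # pre-stimulus pupil size
    i_min = int(np.argmin(size))
    amplitude = baseline - size[i_min]    # how far the pupil constricted
    slope = np.polyfit(t[i_min:], size[i_min:], 1)[0]  # recovery rate
    return amplitude, slope

# Synthetic trajectory: ~4 mm baseline, constriction near t = 2 s, slow recovery
t = np.linspace(0, 10, 101)
size = 4.0 - 1.5 * np.exp(-((t - 2.0) ** 2)) \
       + 0.5 * (t > 2.0) * (1 - np.exp(-(t - 2.0) / 3))
amp, slope = pupil_features(t, size)
```

Under the paper's finding, acutely impaired participants would show a smaller `amp` and a smaller `slope` than sober ones.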
Obesity rates continue to exhibit an upward trajectory, particularly in the US, and obesity is the underlying cause of several comorbidities, including but not limited to high blood pressure, high cholesterol, diabetes, heart disease, stroke, and cancers. To monitor obesity, body mass index (BMI) and proportion body fat (PBF) are two commonly used measurements. Although BMI and PBF change over an individual's lifespan and their relationship may also change dynamically, existing work has mostly remained cross-sectional or has modeled BMI and PBF separately. A combined longitudinal assessment is expected to be more effective in unravelling their complex interplay. To address this, we consider Bayesian cross-domain latent growth curve models within a structural equation modeling framework, which simultaneously handle issues such as individually varying time metrics, proportion data, and potentially missing-not-at-random data for joint assessment of the longitudinal changes of BMI and PBF. Through simulation studies, we observe that our proposed models and estimation method yield parameter estimates with small bias and mean squared error in general; however, a mis-specified missing data mechanism may cause inaccurate and inefficient parameter estimates. Furthermore, we demonstrate the application of our method to a motivating longitudinal obesity study, controlling for both time-invariant (such as sex) and time-varying (such as diastolic and systolic blood pressure, biceps skinfold, bioelectrical impedance, and waist circumference) covariates in separate models. Under time-invariance, we observe that the initial BMI level and the rate of change in BMI influenced PBF. However, in the presence of time-varying covariates, only the initial BMI level influenced the initial PBF. The added-on selection model estimation indicated that observations with higher PBF values were less likely to be missing.
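The latent growth curve idea underlying the model can be illustrated in its simplest univariate form: each person has a latent intercept (initial level) and slope (rate of change), recoverable from their repeated measurements. This sketch is far simpler than the paper's Bayesian cross-domain SEM (no proportion-data link, no missingness model), and all values are simulated:

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 200, 6
t = np.arange(T)   # common measurement times (the paper allows individually varying ones)

# Latent growth factors: person-specific initial BMI and rate of change
intercepts = rng.normal(27.0, 2.0, n)
slopes = rng.normal(0.3, 0.1, n)
bmi = intercepts[:, None] + slopes[:, None] * t + rng.normal(0, 0.5, (n, T))

# Recover the growth factors by per-person least squares
X = np.column_stack([np.ones(T), t])
coef = np.linalg.lstsq(X, bmi.T, rcond=None)[0]   # shape (2, n)
est_int, est_slope = coef
```

In the cross-domain version, the analogous BMI growth factors would enter as predictors of the PBF growth factors.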
The COVID-19 outbreak of 2020 has required many governments to develop and adopt mathematical-statistical models of the pandemic for policy and planning purposes. To this end, this work provides a tutorial on building a compartmental model that tracks Susceptible, Exposed, Infected, Recovered, Deaths, and Vaccinated (SEIRDV) status through time. The proposed model uses interventions to quantify the impact of various government attempts to slow the spread of the virus. Furthermore, a vaccination parameter is incorporated in the model, which remains inactive until the vaccine is deployed. A Bayesian framework is utilized to perform both parameter estimation and prediction. Predictions are made to determine when the peak in active infections occurs. We provide inferential frameworks for assessing the effects of government interventions on the dynamic progression of the pandemic, including the impact of vaccination. The proposed model also allows for quantification of the number of excess deaths averted over the study period due to vaccination.
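The compartmental structure can be sketched as a discrete-time simulation in which the vaccination rate switches on only after deployment. The rates below are illustrative placeholders, not estimates from the paper (which fits parameters in a Bayesian framework):

```python
def seirdv_step(state, beta, sigma, gamma, mu, nu):
    """One day of a discrete-time SEIRDV model.
    beta: transmission rate, sigma: 1/incubation period, gamma: recovery rate,
    mu: daily infection fatality rate, nu: vaccination rate (0 before rollout)."""
    S, E, I, R, D, V = state
    N = S + E + I + R + V                  # living population
    new_exposed = beta * S * I / N
    new_infectious = sigma * E
    new_recovered = gamma * I
    new_deaths = mu * I
    new_vaccinated = nu * S
    return (S - new_exposed - new_vaccinated,
            E + new_exposed - new_infectious,
            I + new_infectious - new_recovered - new_deaths,
            R + new_recovered,
            D + new_deaths,
            V + new_vaccinated)

state = (990.0, 0.0, 10.0, 0.0, 0.0, 0.0)  # 1000 people, 10 initially infected
history = [state]
for day in range(160):
    nu = 0.01 if day >= 60 else 0.0        # vaccination inactive until day 60
    state = seirdv_step(state, beta=0.4, sigma=0.2, gamma=0.1, mu=0.002, nu=nu)
    history.append(state)
peak_day = max(range(len(history)), key=lambda d: history[d][2])
```

Comparing the cumulative `D` compartment between runs with `nu = 0` throughout and with vaccination switched on gives a simple analogue of the "excess deaths averted" quantity.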
Inspired by the impressive successes of compressed sensing-based machine learning algorithms, we develop data augmentation-based efficient Gibbs samplers for Bayesian high-dimensional classification models by compressing the design matrix to a much lower dimension. Particular care is exercised in the choice of the projection mechanism, and an adaptive voting rule is employed to reduce sensitivity to the random projection matrix. Focusing on the high-dimensional probit regression model, we note that the naive implementation of the data augmentation-based Gibbs sampler is not robust to the presence of co-linearity in the design matrix – a setup ubiquitous in $n\lt p$ problems. We demonstrate that a simple fix based on joint updates of parameters in the latent space circumvents this issue. With a computationally efficient MCMC scheme in place, we introduce an ensemble classifier by creating R ($\sim 25$–50) projected copies of the design matrix and running the R classification models with the R projected design matrices in parallel. We combine the output from the R replications via an adaptive voting scheme. Our scheme is inherently parallelizable and capable of taking advantage of modern computing environments often equipped with multiple cores. The empirical success of our methodology is illustrated in elaborate simulations and gene expression data applications. We also extend our methodology to a high-dimensional logistic regression model and carry out numerical studies to showcase its efficacy.
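The project-then-vote idea can be sketched with a simple majority vote over R random projections. For brevity this stand-in uses least squares on each compressed design matrix rather than the paper's probit Gibbs sampler, and a plain (not adaptive) vote; dimensions and data are invented:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, m, R = 150, 200, 50, 25          # project p features down to m; R ensemble copies

# Sparse truth: only the first 5 coefficients matter
beta = np.zeros(p); beta[:5] = 2.0
X = rng.normal(size=(n, p))
y = (X @ beta + rng.normal(size=n) > 0).astype(int)
X_test = rng.normal(size=(200, p))
y_test = (X_test @ beta > 0).astype(int)

votes = np.zeros((R, len(y_test)))
for r in range(R):
    P = rng.normal(size=(p, m)) / np.sqrt(m)   # random Gaussian projection
    Z = X @ P                                  # compressed design matrix
    # stand-in classifier on the projected features (paper: probit Gibbs sampler)
    w, *_ = np.linalg.lstsq(Z, 2.0 * y - 1.0, rcond=None)
    votes[r] = (X_test @ P @ w > 0).astype(int)

pred = (votes.mean(axis=0) > 0.5).astype(int)  # majority vote across the R copies
accuracy = (pred == y_test).mean()
```

Each projection alone is a weak classifier (a random m-dimensional subspace captures only part of the signal), which is why aggregating the R copies is essential.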
Multi-touch attribution (MTA) estimates the relative contributions of the multiple ads a user may see prior to any observed conversions. Increasingly, advertisers also want to base budget and bidding decisions on these attributions, spending more on ads that drive more conversions. We describe two requirements for an MTA system to be suitable for this application: First, it must be able to handle continuously updated and incomplete data. Second, it must be sufficiently flexible to capture the fact that an ad’s effect changes over time. We describe an MTA system, consisting of a model for user conversion behavior and a credit assignment algorithm, that satisfies these requirements. Our model for user conversion behavior treats conversions as occurrences in an inhomogeneous Poisson process, while our attribution algorithm is based on iteratively removing the last ad in the path.
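The two ingredients named above can be sketched together: a toy inhomogeneous-Poisson conversion model (the decaying-intensity form and its parameters are invented, not the paper's fitted model) and the last-ad-removal credit assignment:

```python
import math

def conversion_prob(ad_times, T=10.0, a=0.3):
    """P(at least one conversion by time T) when each ad shown at time t adds
    an exponentially decaying intensity a*exp(-(s - t)) for s > t, i.e. an
    inhomogeneous Poisson process (a toy stand-in for the paper's model)."""
    lam = sum(a * (1 - math.exp(-(T - t))) for t in ad_times)  # integrated intensity
    return 1 - math.exp(-lam)

def attribute(ad_times, **kw):
    """Assign credit by iteratively removing the last ad in the path: each ad
    receives the drop in conversion probability caused by removing it."""
    credits = {}
    ts = sorted(ad_times)
    while ts:
        credits[ts[-1]] = conversion_prob(ts, **kw) - conversion_prob(ts[:-1], **kw)
        ts = ts[:-1]
    return credits

path = [1.0, 3.0, 5.0]       # ad impression times
credits = attribute(path)
```

Because the removals telescope, the per-ad credits always sum exactly to the full path's conversion probability, a desirable bookkeeping property for budget allocation.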
Research has continued to shed light on the extent and significance of gender disparity in social, cultural and economic spheres. More recently, computational tools from the data science and Natural Language Processing (NLP) communities have been proposed for measuring such disparity at scale using empirically rigorous methodologies. In this article, we contribute to this line of research by studying gender disparity in 2,443 copyright-expired literary texts published in the pre-modern period, defined in this work as the period ranging from the beginning of the nineteenth through the early twentieth century. Using a replicable data science methodology relying on publicly available and established NLP components, we extract three different gendered character prevalence measures within these texts. We use an extensive set of statistical tests to robustly demonstrate a significant disparity between the prevalence of female characters and male characters in pre-modern literature. We also show that the proportion of female characters in literary texts is significantly higher in female-authored texts than in male-authored texts. However, regression-based analysis shows that, over the 120-year period covered by the corpus, female character prevalence does not change significantly over time, and remains below the parity level of 50%, regardless of the gender of the author. Qualitative analyses further show that descriptions associated with female characters across the corpus are markedly different from, and more stereotypical than, the descriptions associated with male characters.
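A character prevalence measure of this kind can be illustrated in its crudest form, counting gendered pronouns as a proxy; the article's three measures rely on richer NLP components (e.g., character extraction), so this is only a minimal sketch:

```python
import re

FEMALE = {"she", "her", "hers", "herself"}
MALE = {"he", "him", "his", "himself"}

def female_pronoun_share(text):
    """Share of gendered pronouns that are female: a crude proxy for female
    character prevalence in a text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    f = sum(t in FEMALE for t in tokens)
    m = sum(t in MALE for t in tokens)
    return f / (f + m) if f + m else None

sample = "She took his hand; he smiled at her, and she laughed."
share = female_pronoun_share(sample)
```

Aggregating such a measure per text, then regressing it on publication year and author gender, mirrors the corpus-level analysis described above.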
Published online: 21 Apr 2023 · Type: Statistical Data Science · Open Access
Journal: Journal of Data Science
Volume 21, Issue 2 (2023): Special Issue: Symposium Data Science and Statistics 2022, pp. 412–427
Abstract
The use of error spending functions and stopping rules has become a powerful tool for conducting interim analyses. The implementation of an interim analysis is broadly desired not only in traditional clinical trials but also in A/B tests. Although many papers have summarized error spending approaches, limited work has been done in the context of large-scale data to assist in finding the “optimal” boundary. In this paper, we summarize fifteen boundaries, consisting of five error spending functions that allow early termination for futility, difference, or both, as well as a fixed sample size design without interim monitoring. The simulation is based on a practical A/B testing problem comparing two independent proportions. We examine sample sizes ranging from 500 to 250,000 per arm to reflect different settings in which A/B testing may be utilized. The choice of optimal boundary is summarized using a proposed loss function that assigns different weights to the expected sample size under a null experiment with no difference between variants, the expected sample size under an experiment with a difference between the variants, and the maximum sample size needed if the A/B test does not stop early at an interim analysis. The results are presented for simulation settings based on adequately powered, under-powered, and over-powered designs, with recommendations for selecting the “optimal” design in each setting.
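Two classic error spending functions of the kind surveyed above are the Lan–DeMets O'Brien–Fleming-type and Pocock-type spending functions; the sketch below evaluates how much of the overall alpha each spends at four equally spaced interim looks (the specific boundaries compared in the paper may differ):

```python
import math
from statistics import NormalDist

N = NormalDist()

def obf_spend(t, alpha=0.05):
    """Lan-DeMets O'Brien-Fleming-type spending: spends very little alpha early.
    t is the information fraction in (0, 1]."""
    z = N.inv_cdf(1 - alpha / 2)
    return 2 * (1 - N.cdf(z / math.sqrt(t)))

def pocock_spend(t, alpha=0.05):
    """Lan-DeMets Pocock-type spending: spends alpha more evenly across looks."""
    return alpha * math.log(1 + (math.e - 1) * t)

looks = [0.25, 0.5, 0.75, 1.0]          # four equally spaced interim analyses
obf = [obf_spend(t) for t in looks]
poc = [pocock_spend(t) for t in looks]
```

Both functions spend the full alpha by the final look; the difference in how early the alpha is spent is exactly what drives the expected-sample-size trade-offs that the proposed loss function weighs.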
Published online: 31 Mar 2023 · Type: Computing In Data Science · Open Access
Journal: Journal of Data Science
Volume 21, Issue 2 (2023): Special Issue: Symposium Data Science and Statistics 2022, pp. 333–353
Abstract
High-Order Markov Chains (HOMC) are conventional models, based on transition probabilities, that are used by the United States Department of Agriculture (USDA) National Agricultural Statistics Service (NASS) to study crop-rotation patterns over time. However, HOMCs routinely suffer from sparsity and identifiability issues because the categorical data are represented as indicator (or dummy) variables. In fact, the dimension of the parameter space increases exponentially with the order of the HOMC required for analysis. While parsimonious representations reduce the number of parameters, as has been shown in the literature, they often result in less accurate predictions. Most parsimonious models are trained on big data structures, which can be compressed and efficiently processed using alternative algorithms. Consequently, a thorough evaluation and comparison of the prediction results obtained using a new HOMC algorithm and different types of Deep Neural Networks (DNN) across a range of agricultural conditions is warranted to determine which model is most appropriate for operational crop-specific land cover prediction for United States (US) agriculture. In this paper, six neural network models are applied to crop rotation data between 2011 and 2021 from six agriculturally intensive counties, which reflect the range of major crops grown and the variety of crop rotation patterns in the Midwest and southern US. The six counties are Renville, North Dakota; Perkins, Nebraska; Hale, Texas; Livingston, Illinois; McLean, Illinois; and Shelby, Ohio. Results show that the DNN models achieve higher overall prediction accuracy for all counties in 2021. The proposed DNN models allow for the ingestion of long time series data and robustly achieve higher accuracy values than a new HOMC algorithm considered for predicting crop-specific land cover in the US.
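A high-order Markov chain of the kind described above can be sketched as a lookup of next-crop counts conditioned on the previous `order` crops; the toy rotation sequences below are invented for illustration (not NASS data), and the sparsity issue is visible in contexts that never occur:

```python
from collections import Counter, defaultdict

def fit_homc(sequences, order=2):
    """Estimate a high-order Markov chain: counts of the next crop given the
    previous `order` crops (the chain's state)."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[tuple(seq[i - order:i])][seq[i]] += 1
    return counts

def predict(counts, history, order=2):
    """Most likely next crop given the last `order` crops; None if the context
    was never observed (the sparsity problem noted in the abstract)."""
    ctx = tuple(history[-order:])
    if ctx not in counts:
        return None
    return counts[ctx].most_common(1)[0][0]

# Toy corn-soybean rotation histories for a handful of fields
fields = [["corn", "soy"] * 5, ["soy", "corn"] * 5, ["corn", "corn", "soy"] * 3]
model = fit_homc(fields)
nxt = predict(model, ["corn", "soy"])
```

The number of possible contexts grows as (number of crops)^order, which is the exponential parameter blow-up the abstract describes; the DNN alternatives sidestep this by learning from long input sequences directly.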