In causal mediation analyses, the direct and indirect pathways from an exposure to an outcome variable are of interest. In observational studies, a large number of baseline characteristics are collected as potential confounders to mitigate selection bias, and their dimension may approach or exceed the sample size. Accordingly, flexible machine learning approaches are promising for selecting a subset of relevant confounders, combined with estimation based on the efficient influence function to avoid overfitting. Among the various confounder selection strategies, two have attracted growing attention: the popular debiased, or double, machine learning (DML), and penalized partial correlation via fitting a Gaussian graphical model between the confounders and the response variable. Nonetheless, for causal mediation analyses with high-dimensional confounders, there is a gap in determining the best strategy for confounder selection. We therefore draw on a motivating study of the human microbiome, in which the dimensions of the mediator and confounders approach or exceed the sample size, to compare possible combinations of confounder selection methods. By deriving multiply robust estimators of the causal direct and indirect effects under various hypotheses, our comprehensive illustrations show how confounder selection affects estimation of the final causal target parameter, while generating causal insights that help demystify the “gut-brain axis”. Our results highlight the practicality and necessity of the methods discussed, which not only guide real-world applications for practitioners but also motivate future advances on this crucial topic in the era of big data.
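To make the cross-fitting idea behind DML concrete, the following is a minimal sketch of a partialling-out DML estimator for a total exposure effect with high-dimensional confounders. It is an illustration of the general technique only, not the paper's multiply robust mediation estimator (which requires additional nuisance models for the mediator); the function name, learners, and simulated data are all assumptions for the example.

```python
# Minimal sketch of double machine learning (DML) with cross-fitting for a
# total exposure effect; mediation-specific, multiply robust estimators add
# further nuisance models but follow the same cross-fitting pattern.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def dml_partialling_out(X, a, y, n_splits=5, seed=0):
    """Estimate the effect of exposure a on outcome y, adjusting for
    high-dimensional confounders X via residual-on-residual regression."""
    res_y = np.zeros_like(y, dtype=float)
    res_a = np.zeros_like(a, dtype=float)
    for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        # Nuisance 1: outcome regression E[Y | X], fit on the training folds
        m_y = RandomForestRegressor(random_state=seed).fit(X[train], y[train])
        # Nuisance 2: exposure (propensity-type) regression E[A | X]
        m_a = RandomForestRegressor(random_state=seed).fit(X[train], a[train])
        res_y[test] = y[test] - m_y.predict(X[test])   # outcome residuals
        res_a[test] = a[test] - m_a.predict(X[test])   # exposure residuals
    # Neyman-orthogonal score: regress outcome residuals on exposure residuals
    return np.sum(res_a * res_y) / np.sum(res_a ** 2)

# Hypothetical simulated data: confounder dimension comparable to sample size
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 50))
a = X[:, 0] + rng.normal(size=300)             # exposure depends on confounders
y = 2.0 * a + X[:, 0] + rng.normal(size=300)   # true effect = 2
print(dml_partialling_out(X, a, y))
```

The cross-fitting (fitting nuisances on one fold, residualizing on another) is what lets flexible learners be plugged in without the overfitting bias the abstract alludes to.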
Cellular deconvolution is a key approach to deciphering the complex cellular makeup of tissues by inferring the composition of cell types from bulk data. Traditionally, deconvolution methods have focused on a single molecular modality, relying either on RNA sequencing (RNA-seq) to capture gene expression or on DNA methylation (DNAm) to reveal epigenetic profiles. While these single-modality approaches have provided important insights, they often lack the depth needed to fully understand the intricacies of cellular compositions, especially in complex tissues. To address these limitations, we introduce EMixed, a versatile framework designed for both single-modality and multi-omics cellular deconvolution. EMixed models raw RNA counts and DNAm counts or frequencies via allocation models that assign RNA transcripts and DNAm reads to cell types, and uses an expectation-maximization (EM) algorithm to estimate parameters. Benchmarking results demonstrate that EMixed significantly outperforms existing methods across both single-modality and multi-modality applications, underscoring the broad utility of this approach in enhancing our understanding of cellular heterogeneity.
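To illustrate the allocation-model idea, here is a minimal single-modality EM sketch: each bulk RNA read from a gene is probabilistically assigned to a cell type, and the cell-type proportions are re-estimated until convergence. This is an assumed toy version with a known reference matrix S, not the EMixed implementation or its API; all names and the simulation are illustrative.

```python
# Minimal sketch of an allocation-model EM for single-modality deconvolution:
# a read from gene g is attributed to cell type k with probability
# proportional to pi_k * S[g, k]; proportions pi are updated iteratively.
import numpy as np

def em_deconvolve(counts, S, n_iter=200, tol=1e-8):
    """counts: (G,) bulk counts; S: (G, K) reference, columns sum to 1."""
    G, K = S.shape
    pi = np.full(K, 1.0 / K)                       # uniform initialization
    for _ in range(n_iter):
        # E-step: responsibility of cell type k for reads of gene g
        weighted = S * pi                          # (G, K)
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: reallocate observed counts and renormalize proportions
        new_pi = (counts[:, None] * resp).sum(axis=0)
        new_pi /= new_pi.sum()
        if np.max(np.abs(new_pi - pi)) < tol:
            return new_pi
        pi = new_pi
    return pi

# Hypothetical benchmark: 500 genes, 3 cell types, known mixing proportions
rng = np.random.default_rng(0)
S = rng.dirichlet(np.ones(500), size=3).T          # (500, 3) reference profiles
true_pi = np.array([0.6, 0.3, 0.1])
counts = rng.multinomial(100_000, S @ true_pi)     # simulated bulk sample
print(em_deconvolve(counts, S))                    # approximately [0.6, 0.3, 0.1]
```

Extending the same allocation scheme to DNAm reads, as the abstract describes, amounts to adding a second likelihood term sharing the cell-type proportions.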
Predicting the timing and occurrence of events is a major focus of data science applications, especially in biomedical research. The performance of models estimating these outcomes, often referred to as time-to-event or survival outcomes, is frequently summarized using measures of discrimination, in particular time-dependent AUC and concordance. Many estimators of these quantities have been proposed, which can be broadly categorized as either semi-parametric or non-parametric. In this paper, we review the mathematical construction of the two classes of estimators and compare their behavior. Importantly, we identify a previously unknown feature of the class of semi-parametric estimators that can result in vastly overoptimistic out-of-sample estimates of discriminative performance in common applied tasks. Although these semi-parametric estimators are popular in practice, the phenomenon we identify suggests that this class may be inappropriate for model assessment and selection based on out-of-sample evaluation criteria, because the semi-parametric estimators are biased in favor of overfit models under out-of-sample prediction criteria (e.g., cross-validation). Non-parametric estimators, which do not exhibit this behavior, are instead highly variable for local measures of discrimination; we propose to address this high variability by smoothing with penalized regression splines. The behavior of the various estimators of time-dependent AUC and concordance is illustrated in a simulation study using two different mechanisms that produce overoptimistic out-of-sample estimates under semi-parametric estimators, and the estimators are further compared in a case study using data from the National Health and Nutrition Examination Survey (NHANES) 2011–2014.
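For intuition about the non-parametric class, the following is a minimal sketch of a cumulative/dynamic time-dependent AUC at a fixed horizon: subjects with an event by time t are cases, subjects still event-free past t are controls, and AUC(t) is the proportion of case-control pairs correctly ranked by the risk score. Censoring weights (e.g., IPCW), which a production estimator needs, are deliberately omitted, and the function name and simulation are assumptions for the example.

```python
# Minimal sketch of a non-parametric cumulative/dynamic AUC(t), ignoring
# censoring weighting (IPCW) for brevity; ties in risk count one half.
import numpy as np

def td_auc_naive(time, event, risk, t):
    """time: observed times; event: 1 event / 0 censored; risk: scores."""
    cases = (time <= t) & (event == 1)     # events by horizon t
    controls = time > t                    # still at risk beyond t
    if not cases.any() or not controls.any():
        return np.nan
    r_case = risk[cases][:, None]          # (n_cases, 1)
    r_ctrl = risk[controls][None, :]       # (1, n_controls)
    # proportion of concordant case-control pairs
    return np.mean((r_case > r_ctrl) + 0.5 * (r_case == r_ctrl))

# Hypothetical data: higher risk score implies earlier event, with censoring
rng = np.random.default_rng(2)
n = 500
risk = rng.normal(size=n)
event_time = rng.exponential(np.exp(-risk))
censor_time = rng.exponential(1.5, size=n)
time = np.minimum(event_time, censor_time)
event = (event_time <= censor_time).astype(int)
print(td_auc_naive(time, event, risk, t=0.5))
```

Because each AUC(t) uses only the pairs straddling the horizon t, estimates at individual time points are noisy, which is the local-variability problem the penalized-spline smoothing is meant to address.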