Detecting illicit transactions in Anti-Money Laundering (AML) systems remains a significant challenge due to class imbalance and the complexity of financial networks. This study introduces the Multiple Aggregations for Graph Isomorphism Network with Custom Edges (MAGIC) convolution, an enhancement of the Graph Isomorphism Network (GIN) designed to improve the detection of illicit transactions in AML systems. MAGIC integrates edge convolution (GINE Conv) and multiple learnable aggregations, allowing for varied embedding sizes and increased generalization capability. Experiments were conducted on synthetic datasets that simulate real-world transactions, following the experimental setup of previous studies to ensure comparability. MAGIC, when combined with XGBoost as a link predictor, outperformed existing models on 16 out of 24 metrics, with notable improvements in F1 score and precision. On the most imbalanced dataset, MAGIC achieved an F1 score of 82.6% and a precision of 90.4% for the illicit class. While MAGIC demonstrated high precision, its recall was lower than or comparable to that of the other models, indicating a potential area for future enhancement. Overall, MAGIC presents a robust approach to AML detection, particularly in scenarios where precision and overall quality are critical. Future research should focus on improving the model's recall, potentially by incorporating additional regularization techniques or advanced sampling methods. Additionally, exploring the integration of foundation models such as GraphAny could further enhance the model's applicability in diverse AML environments.
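The MAGIC layer itself is not reproduced here; the sketch below is a minimal, hypothetical illustration in plain PyTorch of the two ingredients the abstract names: an edge-aware message function in the spirit of GINE-style convolution, and several neighborhood aggregations concatenated before the node update, with the resulting embeddings handed to XGBoost for link prediction. All class and variable names are placeholders, not the authors' implementation.

```python
import numpy as np
import torch
import torch.nn as nn
from xgboost import XGBClassifier

class MultiAggEdgeConv(nn.Module):
    """Toy edge-aware convolution: messages use edge features (GINE-style),
    and sum/mean/max aggregations are concatenated before the node update."""
    def __init__(self, node_dim, edge_dim, out_dim):
        super().__init__()
        self.msg_mlp = nn.Sequential(nn.Linear(node_dim + edge_dim, out_dim), nn.ReLU())
        self.upd_mlp = nn.Sequential(nn.Linear(node_dim + 3 * out_dim, out_dim), nn.ReLU())

    def forward(self, x, edge_index, edge_attr):
        src, dst = edge_index                                   # edges run src -> dst
        m = self.msg_mlp(torch.cat([x[src], edge_attr], dim=-1))
        n, d = x.size(0), m.size(-1)
        idx = dst.unsqueeze(-1).expand(-1, d)
        aggs = [torch.zeros(n, d).scatter_reduce(0, idx, m, reduce=r, include_self=False)
                for r in ("sum", "mean", "amax")]               # multiple aggregations
        return self.upd_mlp(torch.cat([x] + aggs, dim=-1))

# Downstream link prediction (illustrative only): concatenate the embeddings of
# each candidate node pair and fit a gradient-boosted classifier.
# z = conv(x, edge_index, edge_attr).detach().numpy()
# feats = np.hstack([z[pairs[:, 0]], z[pairs[:, 1]]])
# clf = XGBClassifier().fit(feats, labels)
```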
A challenge that data scientists face is building an analytic product that is useful and trustworthy for a given audience. Previously, a set of principles for describing data analyses was defined that can be used to create a data analysis and to characterize the variation between analyses. Here, we introduce the concept of the alignment of a data analysis, a relationship between the data analyst and an audience. We define an aligned data analysis as the matching of principles between the analyst and the audience for whom the analysis is developed. In this paper, we propose a model for evaluating the alignment of a data analysis and describe some of its properties. We argue that, more generally, this framework provides a language for characterizing alignment and can serve as a guide for practicing data scientists to build better data products.
The rapidly expanding field of metabolomics presents an invaluable resource for understanding the associations between metabolites and various diseases. However, the high dimensionality, missing values, and measurement errors associated with metabolomics data pose challenges to developing reliable and reproducible approaches for disease association studies, creating a compelling need for robust statistical analyses that can navigate these complexities. In this paper, we construct algorithms that perform variable selection for noisy data and control the false discovery rate when selecting mutual metabolomic predictors for multiple disease outcomes. We illustrate the versatility and performance of this procedure in a variety of scenarios involving missing data and measurement errors. As a specific application of this novel methodology, we target two of the most prevalent cancers among US women: breast cancer and colorectal cancer. By applying our method to the Women's Health Initiative data, we successfully identify metabolites that are associated with either or both of these cancers, demonstrating the practical utility and potential of our method in identifying consistent risk factors and understanding shared mechanisms between diseases.
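The selection procedure in the paper is tailored to noisy, incomplete metabolomics data, but the false-discovery-rate control it builds on can be illustrated with a plain Benjamini-Hochberg filter. The sketch below is a generic, hypothetical screen for "mutual" predictors in which per-metabolite p-values for the two outcomes are combined by taking the maximum, so a metabolite passes only if it shows evidence for both; this is not the authors' algorithm.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.10):
    """Return a boolean mask of rejections controlling the FDR at level alpha."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    thresh = alpha * np.arange(1, m + 1) / m
    below = ranked <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

# Crude screen for "mutual" predictors of two outcomes (placeholder arrays):
# a metabolite is selected only if it shows evidence for both cancers.
# selected = benjamini_hochberg(np.maximum(p_breast, p_colorectal), alpha=0.10)
```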
After the onset of the COVID-19 pandemic, scientific interest in coronaviruses endemic in animal populations increased dramatically. However, investigating the prevalence of disease in animal populations across the landscape, which requires finding and capturing animals, can be difficult. Spatial random sampling over a grid can be extremely inefficient because animals are hard to locate, so the total number of samples may be small. Alternatively, preferential sampling, which uses existing knowledge to inform sample locations, can guarantee larger numbers of samples, but estimates derived from this sampling scheme may be biased if higher-probability sampling locations are related to disease prevalence. Sample specimens are also commonly grouped and tested in pools, which adds a further challenge when combined with preferential sampling. Here we present a Bayesian method for estimating disease prevalence under preferential sampling of pooled presence-absence data, motivated by estimating factors related to coronavirus infection among Mexican free-tailed bats (Tadarida brasiliensis) in California. We demonstrate the efficacy of our approach in a simulation study: a naive model that does not account for preferential sampling returns biased estimates of parameter values, whereas our model returns unbiased results regardless of the degree of preferential sampling. The model framework is then applied to data from California to estimate factors related to coronavirus prevalence. After accounting for preferential sampling, our model suggests small differences in prevalence between male and female bats.
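The Bayesian preferential-sampling model is the paper's contribution and is not reproduced here; the sketch below only illustrates the pooled-testing likelihood that links individual-level infection probabilities to pool-level presence-absence results, using the fact that a pool tests negative only if every specimen in it is negative. The function name, logistic link, and covariate layout are placeholders.

```python
import numpy as np
from scipy.special import expit

def pooled_neg_log_lik(beta, X, pool_ids, pool_pos):
    """Negative log-likelihood for pooled presence-absence data.

    X: (n_specimens, p) covariates; pool_ids: pool index for each specimen;
    pool_pos: 0/1 test result per pool.  P(pool negative) = prod_i (1 - p_i).
    """
    p = expit(X @ beta)                       # individual infection probabilities
    log_q = np.log1p(-p)                      # log(1 - p_i), numerically stable
    log_pool_neg = np.zeros(pool_pos.size)
    np.add.at(log_pool_neg, pool_ids, log_q)  # sum log(1 - p_i) within each pool
    pool_neg = np.exp(log_pool_neg)
    eps = 1e-12
    ll = np.where(pool_pos == 1, np.log(1 - pool_neg + eps), log_pool_neg)
    return -ll.sum()

# In a non-Bayesian sketch, scipy.optimize.minimize could maximize this
# likelihood over beta; the paper instead embeds it in a Bayesian model
# with an explicit preferential-sampling component.
```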
We propose to explore high-dimensional data with categorical outcomes by generalizing the penalized orthogonal-components regression method (POCRE), a supervised dimension reduction method originally proposed for high-dimensional linear regression. The generalized POCRE, i.e., gPOCRE, sequentially builds orthogonal components by selecting predictors that maximally explain the variation of the response variables. gPOCRE therefore simultaneously selects significant predictors and reduces dimension by constructing linear components from the selected predictors for a high-dimensional generalized linear model. For multiple categorical outcomes, gPOCRE can also construct common components shared by all outcomes to improve the power to select variables relevant to multiple outcomes. Both simulation studies and real data analysis are carried out to illustrate the performance of gPOCRE.
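gPOCRE's sequential construction of penalized orthogonal components is not shown here. As a loosely analogous baseline for the same task, supervised components feeding a model for a categorical outcome, the sketch below uses partial least squares components followed by multinomial logistic regression (a PLS-DA-style approach) with placeholder data shapes. It illustrates the "reduce, then classify" idea, not the gPOCRE algorithm.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LogisticRegression

def pls_then_classify(X, y, n_components=3):
    """Supervised dimension reduction (PLS components) followed by a
    multinomial logistic model on the components, as a rough baseline."""
    Y = np.eye(int(y.max()) + 1)[y]            # one-hot encode the categorical outcome
    pls = PLSRegression(n_components=n_components).fit(X, Y)
    T = pls.transform(X)                       # n x n_components score matrix
    clf = LogisticRegression(max_iter=1000).fit(T, y)
    return pls, clf

# Example with placeholder high-dimensional data:
# X = np.random.randn(200, 1000); y = np.random.randint(0, 3, 200)
# pls, clf = pls_then_classify(X, y)
```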
Forecasting is essential for optimizing resource allocation, particularly during crises such as the unprecedented COVID-19 pandemic. This paper develops an algorithm for generating k-step-ahead interval forecasts for autoregressive time series. Unlike conventional methods that assume a fixed error distribution, our approach uses kernel distribution estimation to accommodate the unknown distribution of prediction errors. This flexibility is crucial for real-world data, where deviations from normality are common and, if neglected, can result in inaccurate predictions and unreliable confidence intervals. We evaluate the performance of our method through simulation studies on various autoregressive time series models. The results show that the proposed approach performs robustly even with sample sizes as small as 50 observations. Moreover, our method outperforms traditional linear-model-based prediction intervals and those derived from the empirical distribution function, particularly when the underlying data distribution is non-normal, highlighting the algorithm's flexibility and accuracy for interval forecasting in non-Gaussian settings. We also apply the method to log-transformed weekly COVID-19 case counts from lower-middle-income countries, covering the period from June 1, 2020, to March 13, 2022.
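A minimal sketch of the general recipe, under the assumption of an AR(p) model fit with statsmodels and a Gaussian-kernel density estimate of the residuals: future error draws from the KDE are propagated through the fitted recursion and the interval is read off the simulated paths. The function name, lag order, and Monte Carlo details are illustrative and do not reproduce the paper's algorithm.

```python
import numpy as np
from scipy.stats import gaussian_kde
from statsmodels.tsa.ar_model import AutoReg

def kde_interval_forecast(y, p=2, k=4, n_sims=2000, level=0.95, seed=0):
    """k-step-ahead interval forecast for an AR(p) series (1-D array y)
    using a kernel density estimate of the in-sample residuals."""
    rng = np.random.default_rng(seed)
    fit = AutoReg(y, lags=p).fit()
    params = np.asarray(fit.params)
    const, phi = params[0], params[1:]
    kde = gaussian_kde(np.asarray(fit.resid))          # nonparametric error distribution
    paths = np.empty((n_sims, k))
    for s in range(n_sims):
        hist = list(y[-p:])
        errs = kde.resample(k, seed=int(rng.integers(2**31 - 1))).ravel()
        for h in range(k):
            lags = hist[::-1][:p]                      # y_{t-1}, ..., y_{t-p}
            y_next = const + float(np.dot(phi, lags)) + errs[h]
            paths[s, h] = y_next
            hist.append(y_next)
    lo, hi = np.percentile(paths, [(1 - level) / 2 * 100, (1 + level) / 2 * 100], axis=0)
    return lo, hi                                      # lower/upper bounds, length k each
```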
Approximately 15% of adults in the United States (U.S.) are afflicted with chronic kidney disease (CKD). For CKD patients, the progressive decline of kidney function is intricately related to hospitalizations due to cardiovascular disease and eventual “terminal” events, such as kidney failure and mortality. To unravel the mechanisms underlying the disease dynamics of these interdependent processes, identify influential risk factors, and tailor decision-making to individual patient needs, we develop a novel Bayesian multivariate joint model for the intercorrelated outcomes of kidney function (as measured by longitudinal estimated glomerular filtration rate), recurrent cardiovascular events, and competing-risk terminal events of kidney failure and death. The proposed joint modeling approach not only facilitates the exploration of risk factors associated with each outcome, but also allows dynamic updates of cumulative incidence probabilities for each competing risk for future subjects based on their baseline characteristics and a combined history of longitudinal measurements and recurrent events. We propose efficient and flexible estimation and prediction procedures within a Bayesian framework employing Markov chain Monte Carlo methods. The predictive performance of our model is assessed through dynamic area under the receiver operating characteristic curve and the expected Brier score. We demonstrate the efficacy of the proposed methodology through extensive simulations, and apply it to data from the Chronic Renal Insufficiency Cohort study, established by the National Institute of Diabetes and Digestive and Kidney Diseases to address the rising epidemic of CKD in the U.S.
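The joint model itself is far beyond a snippet, but the Brier-score component of the predictive assessment can be sketched. The version below is deliberately naive: it simply drops subjects censored before the horizon, whereas a proper expected Brier score for dynamic prediction reweights them (e.g., by inverse probability of censoring). Names and inputs are hypothetical.

```python
import numpy as np

def naive_brier_score(event_time, event_ind, surv_prob_at_t, t):
    """Crude Brier score at horizon t for predicted survival probabilities.

    event_time, event_ind: numpy arrays of observed times and event indicators
    (1 = event); surv_prob_at_t: model-predicted P(T > t) for each subject.
    Subjects censored before t are dropped; an IPCW-adjusted version would
    reweight them instead.
    """
    had_event = (event_time <= t) & (event_ind == 1)
    known = had_event | (event_time > t)          # status at horizon t is known
    y = had_event[known].astype(float)            # 1 if the event occurred by t
    p_event = 1.0 - surv_prob_at_t[known]
    return np.mean((y - p_event) ** 2)
```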
When comparing two survival curves, three tests are widely used: the Cox proportional hazards (CoxPH) test, the logrank test, and the Wilcoxon test. Despite their popularity in survival data analysis, they lack a clear clinical interpretation, especially when the proportional hazards (PH) assumption is not valid. In contrast, the restricted mean survival time (RMST) offers an intuitive and clinically meaningful interpretation. We compare these three tests and the RMST-based test with regard to statistical power under many configurations (e.g., proportional hazards, early benefit, delayed benefit, and crossing survival curves) with data simulated from Weibull distributions. We then use an example from a lung cancer trial to compare their required sample sizes. As expected, the CoxPH test is more powerful than the others when the PH assumption is valid. The Wilcoxon test is often preferable when the event rate decreases over time. The RMST-based test is much more powerful than the others when a new treatment has an early benefit. The recommended test(s) under each configuration are suggested in this article.
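As an illustration of the quantities being compared (not the article's power-study code), the sketch below computes the RMST as the area under a Kaplan-Meier curve up to a truncation time tau using lifelines, alongside lifelines' logrank test; the truncation time and variable names are placeholders.

```python
import numpy as np
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

def rmst(durations, events, tau):
    """Restricted mean survival time: area under the KM step function up to tau."""
    kmf = KaplanMeierFitter().fit(durations, events)
    sf = kmf.survival_function_
    times = np.asarray(sf.index, dtype=float)
    surv = sf.iloc[:, 0].to_numpy()
    keep = times <= tau
    t = np.append(times[keep], tau)        # extend the step function to tau
    s = np.append(surv[keep], surv[keep][-1])
    return np.sum(np.diff(t) * s[:-1])     # left-endpoint step-function integral

# Illustrative comparison between two arms (t_a, e_a, t_b, e_b are placeholders):
# res = logrank_test(t_a, t_b, event_observed_A=e_a, event_observed_B=e_b)
# delta_rmst = rmst(t_a, e_a, tau=24) - rmst(t_b, e_b, tau=24)
```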
An extensive literature exists on the analysis of correlated survival data. Subjects within a cluster share common characteristics, e.g., genetic and environmental factors, so their time-to-event outcomes are correlated. The frailty model under the proportional hazards assumption has been widely applied to the analysis of clustered survival outcomes. However, its prediction performance can be less satisfactory when the risk factors have complicated effects, e.g., nonlinear and interactive effects. To address these issues, we propose a neural network frailty Cox model that replaces the linear risk function with the output of a feed-forward neural network. Estimation is based on a quasi-likelihood using the Laplace approximation. A simulation study suggests that the proposed method outperforms existing methods. The method is applied to clustered time-to-failure prediction within kidney transplantation facilities using national kidney transplant registry data from the U.S. Organ Procurement and Transplantation Network. All computer programs are available at https://github.com/rivenzhou/deep_learning_clustered.
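The authors' full implementation is in the linked repository; as a stripped-down illustration of the core idea, a feed-forward network replacing the linear risk function inside a Cox partial likelihood, the sketch below shows a DeepSurv-style loss in PyTorch. It omits the frailty term and the Laplace-approximated quasi-likelihood that the paper actually uses, and it ignores tied event times.

```python
import torch
import torch.nn as nn

class NeuralCoxRisk(nn.Module):
    """Feed-forward network mapping covariates to a scalar log-risk score."""
    def __init__(self, n_features, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def neg_cox_partial_log_lik(risk, time, event):
    """Negative Cox partial log-likelihood (no ties handling, no frailty)."""
    event = event.float()
    order = torch.argsort(time, descending=True)     # sort so risk sets are cumulative
    risk, event = risk[order], event[order]
    log_cum_risk = torch.logcumsumexp(risk, dim=0)   # log sum_{j: t_j >= t_i} exp(risk_j)
    return -torch.sum((risk - log_cum_risk) * event) / event.sum().clamp(min=1)
```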
In causal mediation analyses, interest lies in the direct and indirect pathways from an exposure to an outcome variable. In observational studies, massive numbers of baseline characteristics are collected as potential confounders to mitigate selection bias, with dimension possibly approaching or exceeding the sample size. Accordingly, flexible machine learning approaches are promising for filtering a subset of relevant confounders, combined with estimation based on the efficient influence function to avoid overfitting. Among the various confounder selection strategies, two have attracted growing attention: the popular debiased, or double, machine learning (DML) approach, and penalized partial correlation via fitting a Gaussian graphical network model between the confounders and the response variable. Nonetheless, for causal mediation analyses with high-dimensional confounders, there is a gap in determining the best strategy for confounder selection. We therefore use a motivating study on the human microbiome, in which the dimensions of the mediators and confounders approach or exceed the sample size, to compare possible combinations of confounder selection methods. By deriving the multiply robust causal direct and indirect effects across various hypotheses, our comprehensive illustrations offer methodological implications for how confounder selection affects estimation of the final causal target parameter while generating causal insights that help demystify the “gut-brain axis”. Our results highlight the practicality and necessity of the discussed methods, which not only guide real-world applications for practitioners but also motivate future advancements on this crucial topic in the era of big data.
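Of the two confounder selection strategies discussed, DML lends itself most easily to a brief sketch. The code below is a generic cross-fitted AIPW estimator of an average treatment effect with lasso-type nuisance models, purely to illustrate the cross-fitting pattern; it is not the paper's multiply robust mediation estimator, and all names and inputs are placeholders.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LogisticRegressionCV
from sklearn.model_selection import KFold

def crossfit_aipw_ate(X, a, y, n_splits=2, seed=0):
    """Cross-fitted AIPW estimate of E[Y(1) - Y(0)] for binary treatment a,
    using lasso-type nuisance models fit on held-out folds."""
    psi = np.zeros(len(y))
    for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        # Propensity score with an l1-penalized logistic model
        ps = LogisticRegressionCV(penalty="l1", solver="saga", max_iter=5000)
        ps.fit(X[train], a[train])
        e = np.clip(ps.predict_proba(X[test])[:, 1], 0.01, 0.99)
        # Outcome regressions in each treatment arm
        mu1 = LassoCV().fit(X[train][a[train] == 1], y[train][a[train] == 1]).predict(X[test])
        mu0 = LassoCV().fit(X[train][a[train] == 0], y[train][a[train] == 0]).predict(X[test])
        # Efficient-influence-function contribution for the held-out fold
        psi[test] = (mu1 - mu0
                     + a[test] * (y[test] - mu1) / e
                     - (1 - a[test]) * (y[test] - mu0) / (1 - e))
    return psi.mean(), psi.std() / np.sqrt(len(psi))   # estimate and naive standard error
```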