Financial news headlines are a rich source of information on financial activity and offer a wealth of text from which insights into human behavior can be drawn. One key analysis that can be conducted on such text is sentiment analysis. Despite extensive research over the years, sentiment analysis still faces challenges, particularly in handling the internet slang, abbreviations, and emoticons common on websites that publish financial news headlines, including Bloomberg, Yahoo Finance, and the Financial Times. This paper compares the performance of two sentiment analyzers, VADER and TextBlob, on financial news headlines from two countries: the USA (a developed economy) and Nepal (an underdeveloped economy). The collected headlines were manually classified into three categories (positive, negative, and neutral) from a financial perspective, then cleaned and passed through the two sentiment analyzers to compare their performance. Performance is evaluated in terms of accuracy, sensitivity, specificity, and neutral specificity. Experimental results show that VADER outperforms TextBlob on both datasets, and that both models perform better on headlines from the USA than on those from Nepal. These findings are further validated through statistical tests.
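As a concrete illustration of how two such analyzers are applied and compared (a minimal sketch, not the paper's pipeline; the example headline and the ±0.05 compound-score cutoffs are assumptions made here), each tool scores a headline and the score is mapped to the same three classes:

    # Minimal sketch: score one headline with VADER and TextBlob and map each
    # score to {positive, negative, neutral}. The +/-0.05 cutoffs are a common
    # convention and an assumption here, not necessarily the paper's choice.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    from textblob import TextBlob

    def label(score, pos_cut=0.05, neg_cut=-0.05):
        if score >= pos_cut:
            return "positive"
        if score <= neg_cut:
            return "negative"
        return "neutral"

    headline = "Shares rally as central bank signals rate cut"  # hypothetical example

    vader_compound = SentimentIntensityAnalyzer().polarity_scores(headline)["compound"]
    textblob_polarity = TextBlob(headline).sentiment.polarity

    print("VADER:", label(vader_compound), "| TextBlob:", label(textblob_polarity))

Manually labeled headlines scored in this way yield the confusion counts from which accuracy, sensitivity, and the two specificity measures can be computed.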
Mediation analysis plays an important role in many research fields, yet estimation and hypothesis testing for high-dimensional mediation effects remain very challenging. We develop a user-friendly $\mathsf{R}$ package, HIMA, for high-dimensional mediation analysis with varying mediator and outcome specifications. The HIMA package is a comprehensive tool that accommodates various types of high-dimensional mediation models. This paper offers an overview of the functions within HIMA and demonstrates its practical utility through simulated datasets. The HIMA package is publicly available from the Comprehensive $\mathsf{R}$ Archive Network at https://CRAN.R-project.org/package=HIMA.
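For orientation, a generic high-dimensional mediation specification of the kind HIMA targets can be written as follows for a continuous outcome (the notation is illustrative and not taken from the package documentation), with exposure $X$, candidate mediators $M_1,\dots,M_p$, covariates $Z$, and outcome $Y$:
\[
M_k = \alpha_k X + \boldsymbol{\eta}_k^{\top} Z + e_k, \quad k = 1,\dots,p; \qquad
Y = \gamma X + \sum_{k=1}^{p} \beta_k M_k + \boldsymbol{\theta}^{\top} Z + \epsilon,
\]
so the effect mediated through $M_k$ is $\alpha_k \beta_k$, and estimation and testing are challenging because the number of candidate mediators $p$ can greatly exceed the sample size.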
We propose a Bayesian Negative Binomial-Bernoulli model to jointly analyze the patterns behind field goal attempts and the factors influencing shot success. We apply nonnegative CANDECOMP/PARAFAC tensor decomposition to study shot patterns and use logistic regression to predict successful shots. To maintain the conditional conjugacy of the model, we employ a double Pólya-Gamma data augmentation scheme and devise an efficient variational inference algorithm for estimation. The model is applied to shot chart data from the National Basketball Association, focusing on the regular seasons from 2015–16 to 2022–23. We consistently identify three latent features in shot patterns across all seasons and verify a popular claim from recent years about the increasing importance of three-point shots. Additionally, we find that the home court advantage in field goal accuracy disappears in the 2020–21 regular season, which was the only full season under strict COVID-19 crowd control, aside from the short bubble period in 2019–20. This finding contributes to the literature on the influence of crowd effects on home advantage in basketball games.
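One plausible reading of such a joint specification, in illustrative notation rather than the authors' own, treats the attempt counts as a Negative Binomial tensor with a nonnegative CP-structured mean and each attempt's outcome as a Bernoulli trial on the logit scale:
\[
y_{ijk} \sim \mathrm{NB}\big(r,\ \lambda_{ijk}\big), \qquad
\lambda_{ijk} = \sum_{f=1}^{F} u_{if}\, v_{jf}\, w_{kf}, \qquad
z_s \sim \mathrm{Bernoulli}\big(\operatorname{logit}^{-1}(\mathbf{x}_s^{\top}\boldsymbol{\beta})\big),
\]
where $y_{ijk}$ counts attempts (for example indexed by player, court location, and game context), the nonnegative factors $u$, $v$, $w$ carry the $F$ latent shot-pattern features (here $F=3$), and $z_s$ indicates whether attempt $s$ with covariates $\mathbf{x}_s$ succeeds; the double Pólya-Gamma augmentation is what restores conditional conjugacy for both the Negative Binomial and the logistic parts.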
Detecting illicit transactions in Anti-Money Laundering (AML) systems remains a significant challenge due to class imbalances and the complexity of financial networks. This study introduces the Multiple Aggregations for Graph Isomorphism Network with Custom Edges (MAGIC) convolution, an enhancement of the Graph Isomorphism Network (GIN) designed to improve the detection of illicit transactions in AML systems. MAGIC integrates edge convolution (GINE Conv) and multiple learnable aggregations, allowing for varied embedding sizes and increased generalization capabilities. Experiments were conducted using synthetic datasets, which simulate real-world transactions, following the experimental setup of previous studies to ensure comparability. MAGIC, when combined with XGBoost as a link predictor, outperformed existing models in 16 out of 24 metrics, with notable improvements in F1 scores and precision. In the most imbalanced dataset, MAGIC achieved an F1 score of 82.6% and a precision of 90.4% for the illicit class. While MAGIC demonstrated high precision, its recall was lower or comparable to the other models, indicating potential areas for future enhancement. Overall, MAGIC presents a robust approach to AML detection, particularly in scenarios where precision and overall quality are critical. Future research should focus on optimizing the model’s recall, potentially by incorporating additional regularization techniques or advanced sampling methods. Additionally, exploring the integration of foundation models like GraphAny could further enhance the model’s applicability in diverse AML environments.
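A hedged sketch of two ingredients named above, an edge-aware GINE-style convolution and XGBoost as the downstream link predictor, is given below using PyTorch Geometric; it is not the MAGIC architecture itself (in particular, it omits the multiple learnable aggregations), and every dimension and label in it is made up:

    # Toy sketch: two GINEConv layers that ingest edge (transaction) features,
    # followed by XGBoost trained on concatenated endpoint embeddings per edge.
    # This illustrates the ingredients only, not the authors' MAGIC model.
    import torch
    from torch import nn
    from torch_geometric.nn import GINEConv
    from xgboost import XGBClassifier

    class TinyGINE(nn.Module):
        def __init__(self, in_dim, edge_dim, hid=64):
            super().__init__()
            self.conv1 = GINEConv(nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU(),
                                                nn.Linear(hid, hid)), edge_dim=edge_dim)
            self.conv2 = GINEConv(nn.Sequential(nn.Linear(hid, hid), nn.ReLU(),
                                                nn.Linear(hid, hid)), edge_dim=edge_dim)

        def forward(self, x, edge_index, edge_attr):
            h = torch.relu(self.conv1(x, edge_index, edge_attr))
            return self.conv2(h, edge_index, edge_attr)

    # Toy graph: 6 accounts, 4 directed transactions with 3 edge features each.
    x = torch.randn(6, 8)
    edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]])
    edge_attr = torch.randn(4, 3)

    emb = TinyGINE(8, 3)(x, edge_index, edge_attr).detach()

    # Link-level features: concatenate the two endpoint embeddings per transaction.
    feats = torch.cat([emb[edge_index[0]], emb[edge_index[1]]], dim=1).numpy()
    labels = [0, 1, 0, 1]  # made-up illicit (1) / licit (0) labels
    XGBClassifier(n_estimators=10).fit(feats, labels)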
A challenge that data scientists face is building an analytic product that is useful and trustworthy for a given audience. Previously, a set of principles for describing data analyses was defined that can be used to create a data analysis and to characterize the variation between analyses. Here, we introduce the concept of the alignment of a data analysis, a relationship between the data analyst and an audience. We define an aligned data analysis as one in which the principles of the analyst match those of the audience for whom the analysis is developed. In this paper, we propose a model for evaluating the alignment of a data analysis and describe some of its properties. We argue that, more generally, this framework provides a language for characterizing alignment and can serve as a guide for practicing data scientists to build better data products.
We propose to explore high-dimensional data with categorical outcomes by generalizing the penalized orthogonal-components regression method (POCRE), a supervised dimension-reduction method originally proposed for high-dimensional linear regression. The generalized POCRE (gPOCRE) sequentially builds orthogonal components by selecting predictors that maximally explain the variation in the response variables. It thereby simultaneously selects significant predictors and reduces dimensions, constructing linear components of the selected predictors for a high-dimensional generalized linear model. For multiple categorical outcomes, gPOCRE can also construct common components shared by all outcomes to improve the power of selecting variables shared across outcomes. Both simulation studies and a real data analysis are carried out to illustrate the performance of gPOCRE.
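In illustrative notation (not the authors' own), gPOCRE fits a generalized linear model whose linear predictor is built from a few sparse, mutually orthogonal components of the predictors $\mathbf{x} \in \mathbb{R}^p$:
\[
g\big(\mathbb{E}[\,y \mid \mathbf{x}\,]\big) = \mu + \sum_{j=1}^{J} \theta_j\, \mathbf{w}_j^{\top}\mathbf{x},
\]
where $g$ is the link function appropriate for the categorical outcome, each loading vector $\mathbf{w}_j$ is sparse because of the penalization (so only selected predictors enter), and the components are constructed sequentially so that they are mutually orthogonal, with $J \ll p$.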
Forecasting is essential for optimizing resource allocation, particularly during crises such as the unprecedented COVID-19 pandemic. This paper focuses on developing an algorithm for generating k-step-ahead interval forecasts for autoregressive time series. Unlike conventional methods that assume a fixed distribution, our approach uses kernel distribution estimation to accommodate the unknown distribution of the prediction errors. This flexibility is crucial for real-world data, where deviations from normality are common and, if neglected, can result in inaccurate predictions and unreliable confidence intervals. We evaluate the performance of our method through simulation studies on various autoregressive time series models. The results show that the proposed approach performs robustly even with sample sizes as small as 50 observations. Moreover, our method outperforms traditional linear model-based prediction intervals and those derived from the empirical distribution function, particularly when the underlying data distribution is non-normal. This highlights the algorithm's flexibility and accuracy for interval forecasting in non-Gaussian settings. We also apply the method to log-transformed weekly COVID-19 case counts from lower-middle-income countries, covering the period from June 1, 2020, to March 13, 2022.
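The following is a minimal one-step-ahead illustration of the idea, assuming an AR(1) fit with statsmodels and a Gaussian-kernel density estimate of the residuals via SciPy; the paper's algorithm targets general k-step-ahead intervals and is not reproduced here:

    # Minimal sketch: point forecast from a fitted AR(1), with the interval built
    # from a kernel density estimate of the in-sample prediction errors rather
    # than from an assumed normal distribution. All settings here are illustrative.
    import numpy as np
    from scipy.stats import gaussian_kde
    from statsmodels.tsa.ar_model import AutoReg

    rng = np.random.default_rng(0)
    n = 250
    eps = rng.standard_t(df=4, size=n)        # heavy-tailed (non-normal) innovations
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = 0.6 * y[t - 1] + eps[t]

    fit = AutoReg(y, lags=1).fit()
    point = fit.forecast(steps=1)[0]          # 1-step-ahead point forecast

    kde = gaussian_kde(fit.resid)             # kernel estimate of the error distribution
    draws = kde.resample(5000).ravel()        # Monte Carlo quantiles of the errors
    lower, upper = point + np.quantile(draws, [0.025, 0.975])
    print(f"95% interval: [{lower:.2f}, {upper:.2f}]")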
Approximately 15% of adults in the United States (U.S.) are afflicted with chronic kidney disease (CKD). For CKD patients, the progressive decline of kidney function is intricately related to hospitalizations due to cardiovascular disease and eventual "terminal" events, such as kidney failure and mortality. To unravel the mechanisms underlying the disease dynamics of these interdependent processes, including identifying influential risk factors and tailoring decision-making to individual patient needs, we develop a novel Bayesian multivariate joint model for the intercorrelated outcomes of kidney function (as measured by longitudinal estimated glomerular filtration rate), recurrent cardiovascular events, and the competing-risk terminal events of kidney failure and death. The proposed joint modeling approach not only facilitates the exploration of risk factors associated with each outcome, but also allows dynamic updates of cumulative incidence probabilities for each competing risk for future subjects based on their basic characteristics and a combined history of longitudinal measurements and recurrent events. We propose efficient and flexible estimation and prediction procedures within a Bayesian framework employing Markov chain Monte Carlo methods. The predictive performance of our model is assessed through dynamic area under the receiver operating characteristic curves and the expected Brier score. We demonstrate the efficacy of the proposed methodology through extensive simulations. The proposed methodology is applied to data from the Chronic Renal Insufficiency Cohort study, established by the National Institute of Diabetes and Digestive and Kidney Diseases to address the rising epidemic of CKD in the U.S.
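In generic shared-random-effects notation (illustrative only, not the authors' exact specification), the three linked submodels can be sketched as
\[
\begin{aligned}
y_i(t) &= \mathbf{x}_i(t)^{\top}\boldsymbol{\beta} + m_i(t) + \varepsilon_i(t) && \text{(longitudinal eGFR)},\\
r_i(t) &= r_0(t)\exp\{\mathbf{z}_i^{\top}\boldsymbol{\gamma} + \alpha_r\, m_i(t) + u_i\} && \text{(recurrent cardiovascular events)},\\
\lambda_{ik}(t) &= \lambda_{0k}(t)\exp\{\mathbf{z}_i^{\top}\boldsymbol{\phi}_k + \alpha_k\, m_i(t) + \nu_k u_i\}, \quad k \in \{\text{kidney failure}, \text{death}\} && \text{(competing terminal events)},
\end{aligned}
\]
where $m_i(t)$ is the subject-specific random trajectory, $u_i$ is a shared frailty, and the association parameters $\alpha_r$, $\alpha_k$, $\nu_k$ tie the processes together; posterior draws of these quantities are what allow the cumulative incidence of each competing risk to be updated dynamically as a new subject accrues longitudinal measurements and recurrent events.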
An extensive literature exists on the analysis of correlated survival data. Subjects within a cluster share common characteristics, e.g., genetic and environmental factors, so their time-to-event outcomes are correlated. The frailty model under the proportional hazards assumption has been widely applied to the analysis of clustered survival outcomes. However, the prediction performance of this method can be unsatisfactory when the risk factors have complicated effects, e.g., nonlinear or interactive ones. To address these issues, we propose a neural network frailty Cox model that replaces the linear risk function with the output of a feed-forward neural network. Estimation is based on quasi-likelihood using a Laplace approximation. A simulation study suggests that the proposed method performs best compared with existing methods. The method is applied to clustered time-to-failure prediction within kidney transplantation facilities using national kidney transplant registry data from the U.S. Organ Procurement and Transplantation Network. All computer programs are available at https://github.com/rivenzhou/deep_learning_clustered.
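In illustrative notation, the modification amounts to replacing the linear risk score in a shared-frailty proportional hazards model with the output of a network:
\[
\lambda_{ij}(t \mid b_i) = \lambda_0(t)\,\exp\{\,g_{\boldsymbol{\theta}}(\mathbf{x}_{ij}) + b_i\,\}, \qquad b_i \sim N(0, \sigma^2),
\]
where $j$ indexes subjects within cluster $i$, $b_i$ is the cluster-level (log-)frailty, and the feed-forward network $g_{\boldsymbol{\theta}}$ takes the place of the usual linear term $\boldsymbol{\beta}^{\top}\mathbf{x}_{ij}$, so nonlinear and interactive covariate effects can be captured; the exact parameterization shown here is a sketch rather than the authors' stated model.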
Cellular deconvolution is a key approach to deciphering the complex cellular makeup of tissues by inferring the composition of cell types from bulk data. Traditionally, deconvolution methods have focused on a single molecular modality, relying either on RNA sequencing (RNA-seq) to capture gene expression or on DNA methylation (DNAm) to reveal epigenetic profiles. While these single-modality approaches have provided important insights, they often lack the depth needed to fully understand the intricacies of cellular compositions, especially in complex tissues. To address these limitations, we introduce EMixed, a versatile framework designed for both single-modality and multi-omics cellular deconvolution. EMixed models raw RNA counts and DNAm counts or frequencies via allocation models that assign RNA transcripts and DNAm reads to cell types, and uses an expectation-maximization (EM) algorithm to estimate parameters. Benchmarking results demonstrate that EMixed significantly outperforms existing methods across both single-modality and multi-modality applications, underscoring the broad utility of this approach in enhancing our understanding of cellular heterogeneity.
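As a toy illustration of the allocation-plus-EM idea for a single RNA-seq modality (a sketch only; EMixed's actual models for raw RNA counts, DNAm reads, and multi-omics integration are richer), one can run EM on a simple multinomial allocation model in a few lines:

    # Toy single-modality allocation EM: each bulk read is latently assigned to a
    # cell type, and the cell-type proportions are re-estimated until convergence.
    # Reference profiles, proportions, and dimensions below are all simulated.
    import numpy as np

    rng = np.random.default_rng(0)
    G, K = 200, 3                                  # genes, cell types
    ref = rng.dirichlet(np.ones(G), size=K).T      # ref[g, k]: P(read from gene g | type k)
    true_pi = np.array([0.5, 0.3, 0.2])            # simulated cell-type proportions
    bulk = rng.multinomial(100_000, ref @ true_pi) # observed bulk counts per gene

    pi = np.full(K, 1.0 / K)
    for _ in range(200):
        resp = ref * pi                            # E-step: posterior allocation of reads
        resp /= resp.sum(axis=1, keepdims=True)
        pi = (bulk[:, None] * resp).sum(axis=0)    # M-step: expected reads per cell type
        pi /= pi.sum()

    print(np.round(pi, 3))                         # should be close to true_pi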