Pub. online: 14 Mar 2023
Type: Statistical Data Science
Open Access
Journal: Journal of Data Science
Volume 21, Issue 2 (2023): Special Issue: Symposium Data Science and Statistics 2022, pp. 354–367
Abstract
We propose a scalable Bayesian network learning algorithm based on sparse Cholesky decomposition. Our approach requires only observational data and a user-specified confidence level as inputs and can estimate networks with thousands of variables. The computational complexity of the proposed method is $O(p^{3})$ for a graph with $p$ vertices. Extensive numerical experiments illustrate the usefulness of our method. In simulations, the initial step of our approach also improves an alternative Bayesian network structure estimation method that takes an undirected graph as input.
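The link between a Cholesky factor and a Bayesian network can be illustrated with a minimal sketch. It is not the algorithm proposed in the paper: it assumes a linear Gaussian model, a known topological ordering of the variables, and an exact (population) covariance matrix. The function name `dag_from_covariance` and the tolerance `tol` are hypothetical, and the tolerance is only a crude stand-in for whatever statistical thresholding a confidence level would drive.

```python
import numpy as np

def dag_from_covariance(sigma, tol=1e-8):
    """Recover edge weights of a linear Gaussian DAG from its covariance.

    Assumes the variables are already sorted in a topological order, so that
    X_j = sum_{k<j} B[j, k] * X_k + e_j with independent noise e_j.
    Returns (B, noise_var), where B is strictly lower triangular.
    """
    p = sigma.shape[0]
    L = np.linalg.cholesky(sigma)       # sigma = L @ L.T, L lower triangular
    d = np.diag(L)                      # square roots of the noise variances
    U = L / d                           # scale column j by 1/d[j]: unit lower triangular
    B = np.eye(p) - np.linalg.inv(U)    # since (I - B)^{-1} = U
    B[np.abs(B) < tol] = 0.0            # hypothetical hard threshold, not a test
    return B, d ** 2

# Illustrative 3-node chain X0 -> X1 -> X2 (values chosen for the example only).
B_true = np.array([[0.0, 0.0, 0.0],
                   [0.8, 0.0, 0.0],
                   [0.0, -0.5, 0.0]])
noise_var = np.array([1.0, 2.0, 0.5])
A = np.linalg.inv(np.eye(3) - B_true)
sigma = A @ np.diag(noise_var) @ A.T    # implied population covariance

B_hat, var_hat = dag_from_covariance(sigma)
print(np.round(B_hat, 6))
```

A nonzero entry `B_hat[j, k]` corresponds to an edge from variable `k` to variable `j`. With sample data the ordering is generally unknown and the zero pattern has to be estimated, which is the harder problem the abstract addresses.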
A graphical tool for choosing the number of nodes for a neural network is introduced. The idea is to first fit the neural network over a range of numbers of nodes, and then generate a jump plot using a transformation of the mean squared errors of the resulting residuals. A theorem is proven showing that the jump plot selects several candidate numbers of nodes, one of which is the true number of nodes. A theoretically justified single-node-only test is then used to rule out erroneous candidates. The method has a sound theoretical basis, yields good results on simulated datasets, and shows wide applicability to datasets from real research.
With multiple components and relations, financial data are often presented as graph data, since a graph can represent both individual features and the complicated relations among them. Due to the complexity and volatility of the financial market, the graph constructed from financial data is often heterogeneous or time-varying, which poses challenges for modeling technology. Among graph modeling technologies, graph neural network (GNN) models are able to handle complex graph structures and achieve strong performance, and thus can be used to solve financial tasks. In this work, we provide a comprehensive review of GNN models in recent financial contexts. We first categorize the commonly used financial graphs and summarize the feature processing step for each node. We then summarize the GNN methodology for each graph type and its applications in each area, and propose some potential research areas.
Machine learning methods are increasingly applied to medical data analysis to reduce human effort and improve our understanding of disease propagation. When the data are complicated and unstructured, shallow learning methods may not be suitable or feasible. Deep learning neural networks, such as the multilayer perceptron (MLP) and the convolutional neural network (CNN), have been incorporated in medical diagnosis and prognosis for better health care practice. For a binary outcome, these learning methods directly output predicted probabilities for a patient's health condition. Investigators still need to choose an appropriate decision threshold to split the predicted probabilities into positive and negative regions. We review methods for selecting the cut-off values, including relatively automatic methods based on optimizing ROC curve criteria and utility-based methods that use a net benefit curve. In particular, decision curve analysis (DCA) is now acknowledged in medical studies as a good complement to ROC analysis for the purpose of decision making. In this paper, we provide R code to illustrate how to perform the statistical learning methods, select a decision threshold to yield the binary prediction, and evaluate the accuracy of the resulting classification. This article will help medical decision makers understand different classification methods and use them in real-world scenarios.
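The two families of threshold selection mentioned above can be sketched in a few lines. This is not the R code that accompanies the paper; it is a minimal Python illustration using one common ROC-based criterion (Youden's index, sensitivity + specificity - 1) and the standard net benefit formula from decision curve analysis, TP/n - (FP/n) * pt/(1 - pt). The function names and the toy data are hypothetical, and the sketch assumes both outcome classes are present.

```python
import numpy as np

def confusion_counts(y_true, prob, threshold):
    """Return (TP, FP, FN, TN) when predicting positive if prob >= threshold."""
    pred = prob >= threshold
    pos = y_true == 1
    tp = int(np.sum(pred & pos))
    fp = int(np.sum(pred & ~pos))
    fn = int(np.sum(~pred & pos))
    tn = int(np.sum(~pred & ~pos))
    return tp, fp, fn, tn

def youden_threshold(y_true, prob):
    """Pick the observed probability that maximizes sensitivity + specificity - 1."""
    best_t, best_j = None, -np.inf
    for t in np.unique(prob):
        tp, fp, fn, tn = confusion_counts(y_true, prob, t)
        j = tp / (tp + fn) + tn / (tn + fp) - 1.0
        if j > best_j:                  # ties keep the smallest threshold
            best_t, best_j = float(t), j
    return best_t, best_j

def net_benefit(y_true, prob, pt):
    """Net benefit of treating patients whose predicted risk is at least pt."""
    tp, fp, _, _ = confusion_counts(y_true, prob, pt)
    n = len(y_true)
    return tp / n - (fp / n) * pt / (1.0 - pt)

def net_benefit_treat_all(y_true, pt):
    """Reference strategy in a decision curve: treat every patient."""
    prevalence = float(np.mean(y_true == 1))
    return prevalence - (1.0 - prevalence) * pt / (1.0 - pt)

# Toy data for illustration only (not from the paper).
y = np.array([0, 0, 1, 1, 0, 1])
p = np.array([0.1, 0.4, 0.6, 0.8, 0.2, 0.9])

t_star, j_star = youden_threshold(y, p)
print("Youden threshold:", t_star, "J =", j_star)
for pt in (0.2, 0.3, 0.5):
    print(pt, net_benefit(y, p, pt), net_benefit_treat_all(y, pt))
```

In a decision curve, the model's net benefit is plotted against the threshold probability `pt` alongside the "treat all" and "treat none" (net benefit 0) strategies; the model is useful at a given `pt` only where its curve lies above both.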
There has been increasing interest in modeling survival data using deep learning methods in medical research. In this paper, we propose a Bayesian hierarchical deep neural network model for modeling and prediction of survival data. Compared with previously studied methods, the new proposal provides not only a point estimate of the survival probability but also a quantification of the corresponding uncertainty, which can be of crucial importance in predictive modeling and subsequent decision making. The favorable statistical properties of the point and uncertainty estimates are demonstrated by simulation studies and real data analysis. Python code implementing the proposed approach is provided.