With multiple components and relations, financial data are often presented as graph data, since it could represent both the individual features and the complicated relations. Due to the complexity and volatility of the financial market, the graph constructed on the financial data is often heterogeneous or time-varying, which imposes challenges on modeling technology. Among the graph modeling technologies, graph neural network (GNN) models are able to handle the complex graph structure and achieve great performance and thus could be used to solve financial tasks. In this work, we provide a comprehensive review of GNN models in recent financial context. We first categorize the commonly-used financial graphs and summarize the feature processing step for each node. Then we summarize the GNN methodology for each graph type, application in each area, and propose some potential research areas.
The complexity of energy infrastructure at large institutions increasingly calls for data-driven monitoring of energy usage. This article presents a hybrid monitoring algorithm for detecting consumption surges using statistical hypothesis testing, leveraging the posterior distribution and its information about uncertainty to introduce randomness in the parameter estimates, while retaining the frequentist testing framework. This hybrid approach is designed to be asymptotically equivalent to the Neyman-Pearson test. We show via extensive simulation studies that the hybrid approach enjoys control over type-1 error rate even with finite sample sizes whereas the naive plug-in method tends to exceed the specified level, resulting in overpowered tests. The proposed method is applied to the natural gas usage data at the University of Connecticut.
Sport climbing, which made its Olympic debut at the 2020 Summer Games, generally consists of three separate disciplines: speed climbing, bouldering, and lead climbing. However, the International Olympic Committee (IOC) only allowed one set of medals each for men and women in sport climbing. As a result, the governing body of sport climbing, rather than choosing only one of the three disciplines to include in the Olympics, decided to create a competition combining all three disciplines. In order to determine a winner, a combined scoring system was created using the product of the ranks across the three disciplines to determine an overall score for each climber. In this work, the rank-product scoring system of sport climbing is evaluated through simulation to investigate its general features, specifically, the advancement probabilities and scores for climbers given certain placements. Additionally, analyses of historical climbing contest results are presented and real examples of violations of the independence of irrelevant alternatives are illustrated. Finally, this work finds evidence that the current competition format is putting speed climbers at a disadvantage.
Popular music genre preferences can be measured by consumer sales, listening habits, and critics’ opinions. We analyze trends in genre preferences from 1974 through 2018 presented in annual Billboard Hot 100 charts and annual Village Voice Pazz & Jop critics’ polls. We model yearly counts of appearances in these lists for eight music genres with two multinomial logit models, using various demographic, social, and industry variables as predictors. Since the counts are correlated over time, we use a partial likelihood approach to fit the models. Our models provide strong fits to the observed genre proportions and illuminate trends in the popularity of genres over the sampled years, such as the rise of country music and the decline of rock music in consumer preferences, and the rise of rap/hip-hop in popularity among both consumers and critics. We forecast the genre proportions (for consumers and critics) for 2019 using fitted multinomial probabilities constructed from forecasts of 2019 predictor values and compare our Hot 100 forecasts to observed 2019 Hot 100 proportions. We model over time the association between consumer and critics’ preferences using Cramér’s measure of association between nominal variables and forecast how this association might trend in the future.
For large observational studies lacking a control group (unlike randomized controlled trials, RCT), propensity scores (PS) are often the method of choice to account for pre-treatment confounding in baseline characteristics, and thereby avoid substantial bias in treatment estimation. A vast majority of PS techniques focus on average treatment effect estimation, without any clear consensus on how to account for confounders, especially in a multiple treatment setting. Furthermore, for time-to event outcomes, the analytical framework is further complicated in presence of high censoring rates (sometimes, due to non-susceptibility of study units to a disease), imbalance between treatment groups, and clustered nature of the data (where, survival outcomes appear in groups). Motivated by a right-censored kidney transplantation dataset derived from the United Network of Organ Sharing (UNOS), we investigate and compare two recent promising PS procedures, (a) the generalized boosted model (GBM), and (b) the covariate-balancing propensity score (CBPS), in an attempt to decouple the causal effects of treatments (here, study subgroups, such as hepatitis C virus (HCV) positive/negative donors, and positive/negative recipients) on time to death of kidney recipients due to kidney failure, post transplantation. For estimation, we employ a 2-step procedure which addresses various complexities observed in the UNOS database within a unified paradigm. First, to adjust for the large number of confounders on the multiple sub-groups, we fit multinomial PS models via procedures (a) and (b). In the next stage, the estimated PS is incorporated into the likelihood of a semi-parametric cure rate Cox proportional hazard frailty model via inverse probability of treatment weighting, adjusted for multi-center clustering and excess censoring, Our data analysis reveals a more informative and superior performance of the full model in terms of treatment effect estimation, over sub-models that relaxes the various features of the event time dataset.
The COVID-19 pandemic has created a sudden need for a wider uptake of home-based telework as means of sustaining the production. Generally, teleworking arrangements impact directly worker’s efficiency and motivation. The direction of this impact, however, depends on the balance between positive effects of teleworking (e.g. increased flexibility and autonomy) and its downsides (e.g. blurring boundaries between private and work life). Moreover, these effects of teleworking can be amplified in case of vulnerable groups of workers, such as women. The first step in understanding the implications of teleworking on women is to have timely information on the extent of teleworking by age and gender. In the absence of timely official statistics, in this paper we propose a method for nowcasting the teleworking trends by age and gender for 20 Italian regions using mobile network operators (MNO) data. The method is developed and validated using MNO data together with the Italian quarterly Labour Force Survey. Our results confirm that the MNO data have the potential to be used as a tool for monitoring gender and age differences in teleworking patterns. This tool becomes even more important today as it could support the adequate gender mainstreaming in the ‘Next Generation EU’ recovery plan and help to manage related social impacts of COVID-19 through policymaking.
As data acquisition technologies advance, longitudinal analysis is facing challenges of exploring complex feature patterns from high-dimensional data and modeling potential temporally lagged effects of features on a response. We propose a tensor-based model to analyze multidimensional data. It simultaneously discovers patterns in features and reveals whether features observed at past time points have impact on current outcomes. The model coefficient, a k-mode tensor, is decomposed into a summation of k tensors of the same dimension. We introduce a so-called latent F-1 norm that can be applied to the coefficient tensor to performed structured selection of features. Specifically, features will be selected along each mode of the tensor. The proposed model takes into account within-subject correlations by employing a tensor-based quadratic inference function. An asymptotic analysis shows that our model can identify true support when the sample size approaches to infinity. To solve the corresponding optimization problem, we develop a linearized block coordinate descent algorithm and prove its convergence for a fixed sample size. Computational results on synthetic datasets and real-life fMRI and EEG datasets demonstrate the superior performance of the proposed approach over existing techniques.
When releasing data to the public, a vital concern is the risk of exposing personal information of the individuals who have contributed to the data set. Many mechanisms have been proposed to protect individual privacy, though less attention has been dedicated to practically conducting valid inferences on the altered privacy-protected data sets. For frequency tables, the privacy-protection-oriented perturbations often lead to negative cell counts. Releasing such tables can undermine users’ confidence in the usefulness of such data sets. This paper focuses on releasing one-way frequency tables. We recommend an optimal mechanism that satisfies ϵ-differential privacy (DP) without suffering from having negative cell counts. The procedure is optimal in the sense that the expected utility is maximized under a given privacy constraint. Valid inference procedures for testing goodness-of-fit are also developed for the DP privacy-protected data. In particular, we propose a de-biased test statistic for the optimal procedure and derive its asymptotic distribution. In addition, we also introduce testing procedures for the commonly used Laplace and Gaussian mechanisms, which provide a good finite sample approximation for the null distributions. Moreover, the decaying rate requirements for the privacy regime are provided for the inference procedures to be valid. We further consider common users’ practices such as merging related or neighboring cells or integrating statistical information obtained across different data sources and derive valid testing procedures when these operations occur. Simulation studies show that our inference results hold well even when the sample size is relatively small. Comparisons with the current field standards, including the Laplace, the Gaussian (both with/without post-processing of replacing negative cell counts with zeros), and the Binomial-Beta McClure-Reiter mechanisms, are carried out. In the end, we apply our method to the National Center for Early Development and Learning’s (NCEDL) multi-state studies data to demonstrate its practical applicability.